The Path to Enlightened Solutions for Biodiversity's Dark Data

P. Bryan Heidorn, University of Arizona and JRS Biodiversity Foundation. 2011 Scripting Life: the science behind ViBRANT, Paris, France, 20-21 January 2011.

TRANSCRIPT

Page 1: The Path to Enlightened Solutions for Biodiversity's Dark Data

P. Bryan Heidorn

University of Arizona and JRS Biodiversity Foundation

2011 Scripting Life: the science behind ViBRANT

Paris, France

20-21 January 2011

The Path to Enlightened Solutions for Biodiversity's Dark Data

Page 2: The Path to Enlightened Solutions for Biodiversity's Dark Data

University of Arizona

Today: 25°C, Sunny

Page 3: The Path to Enlightened Solutions for Biodiversity's Dark Data

Thesis

Large amounts of data remain uncurated

Most of that data is from small data sets and is currently largely invisible – Dark Data

This data should be curated locally but not by scientists alone

Need for long-lived institutions

Page 4: The Path to Enlightened Solutions for Biodiversity's Dark Data

Cyberinfrastructure Vision

“The anticipated growth in both the production and repurposing of digital data raises complex issues not only of scale and heterogeneity, but also of stewardship, curation and long-term access.”

NSF Cyberinfrastructure Vision for 21st Century Discovery, Chapter 3

Page 5: The Path to Enlightened Solutions for Biodiversity's Dark Data

Recognition of need for data curation

“Recommendation 6: The NSF, working in partnership with collection managers and the community at large, should act to develop and mature the career path for data scientists and to ensure that the research enterprise includes a sufficient number of high-quality data scientists.”

Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, Recommendations

Page 6: The Path to Enlightened Solutions for Biodiversity's Dark Data

Recognition of the importance of Information

Recognition of the need for education

New work roles within traditional institutions

Interagency Working Group on Digital Data

Page 7: The Path to Enlightened Solutions for Biodiversity's Dark Data

Why Libraries and Museums

Long history of scholarly data management

Skills overlap, such as development of metadata standards, ontologies, controlled vocabularies, thesauri

Long-lived institutions

Existing overlap with museums and archives

Page 8: The Path to Enlightened Solutions for Biodiversity's Dark Data

The problem

Recognition of the problem

Information is not in accessible format

Computer Science, Information Science and Technology has not addressed the problem

No training or incentive for data generators

Page 9: The Path to Enlightened Solutions for Biodiversity's Dark Data

Dark data is the data that we know is/was there but we can’t see it.

Hubble Space Telescope composite image "ring" of dark matter in the galaxy cluster Cl 0024+17

Page 10: The Path to Enlightened Solutions for Biodiversity's Dark Data

Related Ideas

John Porter: Deep versus Wide databases

Swanson: Undiscovered Public Knowledge

Science Commons: Big versus Small science

Page 11: The Path to Enlightened Solutions for Biodiversity's Dark Data

Power Law of Science Data

$f(x) = a x^{k} + o(x^{k})$

[Figure: power-law curve annotated "$f(x) = a x^{k} + o(x^{k})$ | $x < 0.20$". Y-axis: Data Volume. X-axis: Science Projects and Initiatives.]
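The power-law form on this slide can be explored numerically. The following is a minimal sketch, not the speaker's analysis: it assumes that x is the rank of a project by data volume, drops the o(x^k) remainder term, and picks a = 1 and k = -1 purely for illustration. The function names are hypothetical.

```python
def power_law_volumes(n_projects, a=1.0, k=-1.0):
    """Data volume of projects ranked 1..n under f(x) = a * x**k.

    Assumption: x is the project's rank by volume and k < 0, so that
    volume falls off as rank increases. The o(x**k) term is ignored.
    """
    return [a * rank ** k for rank in range(1, n_projects + 1)]


def head_share(volumes, head_fraction=0.20):
    """Fraction of total volume held by the largest `head_fraction` of projects."""
    ranked = sorted(volumes, reverse=True)
    head_count = max(1, round(len(ranked) * head_fraction))
    return sum(ranked[:head_count]) / sum(ranked)


volumes = power_law_volumes(1000)
share = head_share(volumes)
print(f"Top 20% of projects hold {share:.1%} of the total volume")
print(f"Remaining 80% of projects hold {1 - share:.1%}")
```

Changing k changes how much of the total sits in the long tail; the values here are illustrative and are not fitted to any real data.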

Page 12: The Path to Enlightened Solutions for Biodiversity's Dark Data

Does NSF’s Data Follow the Power Law?

I do not know, but if $1 = X bytes...

[Figure: Awarded Amount 2007. Y-axis: award amount, $0 to $7,000,000. X-axis: grants, numbered 1 to 8776.]

Page 13: The Path to Enlightened Solutions for Biodiversity's Dark Data

20-80 Rule: The small are big!

Total Grants: 9347

Total Dollars: $2,137,636,716

                20%                     80%
Number Grants   1869                    7478
Total Dollars   $1,199,088,125          $938,548,595
Range           $6,892,810-$350,000     $350,000-$831
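The arithmetic behind the 20-80 split can be checked directly from the figures on this slide. This is a small sketch using only the grant counts and dollar totals given above; the variable names are hypothetical, and the derived shares and mean are computed here rather than taken from the talk.

```python
# Figures copied from the slide: the largest 20% of grants vs. the remaining 80%.
head_grants, tail_grants = 1869, 7478
head_dollars, tail_dollars = 1_199_088_125, 938_548_595

total_grants = head_grants + tail_grants
total_dollars = head_dollars + tail_dollars

# Share of grants in the head, and share of money in the tail.
head_grant_share = head_grants / total_grants
tail_dollar_share = tail_dollars / total_dollars

# Mean award among the smaller 80% of grants.
tail_mean_award = tail_dollars / tail_grants

print(f"Head: {head_grant_share:.1%} of grants")
print(f"Tail: {tail_dollar_share:.1%} of dollars")
print(f"Mean tail award: ${tail_mean_award:,.0f}")
```

The two dollar columns sum to within a few dollars of the stated total, so the table is internally consistent. On these numbers the smaller 80% of grants carry well over two fifths of the money, which is the sense in which "the small are big".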

Page 14: The Path to Enlightened Solutions for Biodiversity's Dark Data

Biology 2009

#Grants: 1886

$Total: $744,168,471 ≈ €550,000,000

Distribution: 1266 < $.5 million ≈ €370,000

Mode: $304,691 ≈ €225,000

Myth of the mega-project

Page 15: The Path to Enlightened Solutions for Biodiversity's Dark Data

Because it is high volume

Because it is information rich – high entropy

While the needs of large data are understood, small data and integration are not understood

Heidorn, P. Bryan (2008). Shedding Light on the Dark Data in the Long Tail of Science. Library Trends 57(2), Fall 2008. Institutional Repositories: Current State and Future. Edited by Sarah Shreeves and Melissa Cragin. (http://hdl.handle.net/2142/9127).

Small data is big science

Page 16: The Path to Enlightened Solutions for Biodiversity's Dark Data

Where to find dark data

Scientists’ backpacks and desks

Literature/Biodiversity Heritage Library

Museum Specimens

Field notes

Citizen Observations

Page 17: The Path to Enlightened Solutions for Biodiversity's Dark Data

What is dark data good for?

Ecological Niche Modeling

Climate Change niche change prediction

Taxonomic Name Resolution

Literature Search Support

Taxonomic intelligence

Key-like – character searching

Phenology and Phenology change

Food-web / trophic level

Page 18: The Path to Enlightened Solutions for Biodiversity's Dark Data

Problematic Transition

Personal Information Management vs Knowledge Organization

Pluralistic vs Unified (Hjørland, 2007)

Page 19: The Path to Enlightened Solutions for Biodiversity's Dark Data

Contrast in Styles (White, in press)

Personal Information Management    Knowledge Organization
One-Few users                      Many users
Visual/Spatial                     Language based
Project Oriented                   Long-term orientation

Page 20: The Path to Enlightened Solutions for Biodiversity's Dark Data

New Information Disciplines

Digital Curator: an expert knowledgeable of and with responsibility for the content of a digital collection(s)

Digital Archivist: an expert competent to appraise, acquire, authenticate, preserve, and provide access to records in digital form

Data Scientists: the information and computer scientists, database and software engineers and programmers, disciplinary experts, expert annotators, and others, who are crucial to the successful management of a digital data collection

(Long-Lived Digital Data Collections: Enabling Research and Education in the 21st Century, report of the National Science Board, September 2005)

Page 21: The Path to Enlightened Solutions for Biodiversity's Dark Data

Roles

Page 22: The Path to Enlightened Solutions for Biodiversity's Dark Data

Skills

Page 23: The Path to Enlightened Solutions for Biodiversity's Dark Data

Library Roles

Life Cycle Phases: Plan, Create, Keep, Dispose

Data Management Function: Access, Document, Organize, Protect

Page 24: The Path to Enlightened Solutions for Biodiversity's Dark Data

How to Organize at a higher level?

It is difficult to find what is already known

Clonal specimens may be stored in different museums around the world

DNA analysis may be conducted on one but not the other

Micrographs may be in a database

Taxonomic treatments or revisions may exist

Page 25: The Path to Enlightened Solutions for Biodiversity's Dark Data

Biological Science Collections (BiSciCol) Tracker

[Figure: diagram with nodes S1: KNM (Nairobi National Museum), S2: MNHN (Muséum national d'histoire naturelle), S3: MBG (Living Collection: Missouri Botanical Garden), Agave sisalana, Determination, Gene Sequence, GENBANK, and Parasitism. Several links between nodes are marked "?".]

Page 26: The Path to Enlightened Solutions for Biodiversity's Dark Data

BiSciCol Tracker

Page 27: The Path to Enlightened Solutions for Biodiversity's Dark Data

The Future is all about Data

How do we get it?

How do we analyze it?

How do we disseminate it (maps, charts, tables...)?

How do we keep it?

Provenance, Storage, Weeding

How do we make it sustainable?

Page 28: The Path to Enlightened Solutions for Biodiversity's Dark Data

Digital/Data Curation Programs

University of Illinois: Graduate School of Library and Information Science

University of Arizona: School of Information Resources and Library Science

University of North Carolina: School of Information and Library Science

Page 29: The Path to Enlightened Solutions for Biodiversity's Dark Data

Education Needs

Biological Information Specialist

Concentration in Data Curation (MSLIS)

Certificate of Advanced Study in Data Curation for Libraries and Scientists

Information and professional education in biodiversity informatics

Page 30: The Path to Enlightened Solutions for Biodiversity's Dark Data

MSLIS Data Curation Concentration

Data Curation Educational Program (DCEP)

IMLS – Laura Bush 21st Century Librarian Program, RE-05-06-0036-06 (Heidorn, PI)

Students with the DC concentration will be trained to add value to data and promote sharing across labs and disciplinary specializations

Page 31: The Path to Enlightened Solutions for Biodiversity's Dark Data

Biological Information Specialists

At present:

Biologists at all degree levels self-trained in information technology

Information technologists at all degree levels self-trained in biology

(both with gaps in knowledge for many months, years)

Differing roles of BIS in large and small

Page 32: The Path to Enlightened Solutions for Biodiversity's Dark Data

Master of Science in Biological Informatics

Degree Program began September 2007

Part of campus-wide bioinformatics masters program

NSF/CISE/IIS, Education Research and Curriculum Development, 0534567 (Palmer, PI)

Combines Biology, Bioinformatics, Computer Science core with LIS courses

Page 33: The Path to Enlightened Solutions for Biodiversity's Dark Data

What does a BIS need to know?

Biological training and interest in solving biological research problems

Information skills:

Evaluation and implementation of information systems: user-based assessment and continual quality improvement for the development of tools that work and are used.

Information acquisition, management, and dissemination: development of digital libraries, data archives, institutional repositories, and related tools.

Information organization and integration: ontology development, structuring information for optimal use and sharing, and standards development.

Page 34: The Path to Enlightened Solutions for Biodiversity's Dark Data

UIUC bioinformatics core coursework

Cross-disciplinary course distribution requirement

Bioinformatics: Computing in Molecular Biology; Algorithms in Bioinformatics; Principles of Systematics

Computer Science: Algorithms; Database Systems

Biology: Human Genetics; Introductory Biochemistry; Macromolecular Modeling

Page 35: The Path to Enlightened Solutions for Biodiversity's Dark Data

Sample of existing LIS courses

Information Organization and Knowledge Representation
LIS 551 Interfaces to Information Systems
LIS 590DM Document Modeling
LIS 590RO Representing and Organizing Information Resources
LIS 590ON Ontologies in Natural Science

Information Resources, Uses and Users
LIS 503 Use and Users of Information
LIS 522 Information Sources in the Sciences
LIS 590TR Information Transfer and Collaboration in Science

Information Systems
LIS 456 Information Storage and Retrieval
LIS 509 Building Digital Libraries
LIS 566 Architecture of Network Information Systems
LIS 590EP Electronic Publishing

Disciplinary Focus
LIS 530B Health Sciences Information Services and Resources
LIS 590HI Healthcare Informatics (Healthcare Infrastructure)
LIS 590EI/BDI Ecological Informatics (Biodiversity Informatics)

Page 36: The Path to Enlightened Solutions for Biodiversity's Dark Data

University of Arizona

Graduate Certificate in Digital Records Management

Six Graduate Courses within MLA program

Focus on repositories

Cross over with Knowledge Representation and Metadata

Page 37: The Path to Enlightened Solutions for Biodiversity's Dark Data

Workforce

Data Curation Workforce Summit, Dec 6th at IDCC Chicago

Identify the skill sets needed for government data curation

Department of Energy, US National Science Foundation, Institute of Museum and Library Services, Oak Ridge National Laboratory, USGS National Biological Information Infrastructure, CIESIN

Page 38: The Path to Enlightened Solutions for Biodiversity's Dark Data

The Future is Collaboration and Data Sharing

• Libraries

• Museums

• Government

• Universities

• NGO

• Private Land Holders

• Ranches

• Farms

To bring the best data to the major problems and opportunities of our time and the future

Page 39: The Path to Enlightened Solutions for Biodiversity's Dark Data

Merci