

© Copyright 2009-2013, Cambridge Computer Services, Inc. – All Rights Reserved
www.CambridgeComputer.com – 781-250-3000

End to End Life Cycle Management for Research Data

Capturing Metadata Throughout the Research Pipeline and Facilitating the Handoff to Formal Curation

Jacob Farmer, CTO
Cambridge Computer


A Little Background On Cambridge Computer

Founded in 1991 as a boutique integrator for backup and archive solutions

Approximately 75 employees nationwide

Clients of all shapes and sizes across all industries
• Particularly strong in research and higher ed

Industry-wide reputation for:
• defining best practices for enterprise-class data protection, and
• the early adoption of next-generation storage solutions

A unique business model that allows us to straddle the fence between academia and industry

End to End Life Cycle Management for Research (PASIG 2013)

Seminars and Workshops Through The Usenix Association

Tiered Storage and Archiving:  Best Practices for Data Life Cycle Management and Digital Preservation

Cornell, Dartmouth, Duke, Harvard, Penn

LISA Data Storage Day
• Storage Virtualization
• Application Acceleration with Solid State
• A Crash Course in Object Storage

LISA Conference, Broad Institute, Georgia State, University of Maryland, Davenport, Princeton


Our Product: Starfish


Our Project – Defining Best Practices for File Management

Inspiration for our project comes from SRB/IRODS
• Bring parts of the SRB/IRODS vision to reality
– Define a general purpose feature set
– Intuitive user interface
– Simplified API

Inspiration also comes from numerous home grown solutions in our client base. The paradigm:
• Stat() your file systems
• Make database records for each file and/or directory
• Relate metadata to the file and directory records
• Report and/or take action
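The four-step paradigm above can be sketched in a few lines of Python. This is only an illustration of the general pattern, not Starfish's implementation: SQLite stands in for whatever database a site actually uses, and the table layout and the function names `catalog_tree`, `tag`, and `report_bytes_by_tag` are hypothetical.

```python
import os
import sqlite3


def catalog_tree(root, conn):
    """Steps 1 and 2: stat() every entry under root and store one row per file or directory."""
    # Hypothetical schema, chosen only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "path TEXT PRIMARY KEY, is_dir INTEGER, size INTEGER, mtime REAL, uid INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        "path TEXT REFERENCES entries(path), key TEXT, value TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        # Record the directory itself, then each file directly inside it.
        for path in [dirpath] + [os.path.join(dirpath, name) for name in filenames]:
            st = os.lstat(path)  # lstat: do not follow symlinks
            conn.execute(
                "INSERT OR REPLACE INTO entries VALUES (?, ?, ?, ?, ?)",
                (path, int(os.path.isdir(path)), st.st_size, st.st_mtime, st.st_uid),
            )
    conn.commit()


def tag(conn, path, key, value):
    """Step 3: relate a user-supplied key/value pair to a cataloged path."""
    conn.execute("INSERT INTO metadata VALUES (?, ?, ?)", (path, key, value))
    conn.commit()


def report_bytes_by_tag(conn, key):
    """Step 4: report total file bytes grouped by the value of one metadata key."""
    rows = conn.execute(
        "SELECT m.value, SUM(e.size) FROM metadata m "
        "JOIN entries e ON e.path = m.path "
        "WHERE m.key = ? AND e.is_dir = 0 GROUP BY m.value",
        (key,),
    )
    return dict(rows.fetchall())
```

Because the scan only reads file system attributes and writes to a separate catalog, it stays out of the data path: tagging and reporting never touch the files themselves.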


Starfish - *FS

Virtual Global File System
• It’s not really a file system, but it looks like one and serves as a hierarchical catalog of files

Like a file system
• CIFS and POSIX permissions
• File system attributes and extended attributes

But more
• User specified metadata
• Persistent addresses
• Versioning
• Point in time collections


Basic Starfish Topology


Targeted Use Cases

1) Data life cycle management for unstructured data at very large scale
• Scientific research data
• Media / entertainment workflows
• Engineering data

2) Storage middleware for digital asset management systems at very large scale
• Fixity automation
• Backup restore
• Tiered storage
• Persistent file addresses / links
• Cloud interface
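As an illustration of what "fixity automation" usually involves, here is a minimal sketch that records a checksum per file and later re-checks it. The slides do not say how Starfish does this; the choice of SHA-256 and the function names `sha256_of` and `check_fixity` are assumptions made for the example.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(baseline):
    """Return the paths whose content changed or that can no longer be read.

    baseline maps each path to the hex digest recorded when it was ingested.
    """
    failures = []
    for path, expected in baseline.items():
        try:
            if sha256_of(path) != expected:
                failures.append(path)
        except OSError:
            # Missing or unreadable files count as fixity failures too.
            failures.append(path)
    return failures
```

In practice the baseline would live in the catalog alongside the other file records, and the check would run on a schedule rather than on demand.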


Typical Content Management “Stack”


Inserting File System Middleware


Simple Storage Workflow While Mirroring File Systems to Object Store


Metadata is the Great Enabler

Collaboration
• How else would researchers know what to do with one another’s data?
• How can data be organized to meet different groups’ needs?

Storage management policies
• How does a storage management system know what to do with your files? File system attributes are not descriptive enough.

Preservation / retrieval / provenance
• How do you know what to keep?
• How do you find it again?
• How do you know what it was used for and when?

Reporting / chargeback
• File system permissions are not descriptive enough.


What Would a Metadata System for Research Data Look Like?

Very flexible

Allows scientists to work the way they want to work

Out of the data path
• The system cannot introduce latency to file I/O

Enormous scale
• Billions of files, petabytes of capacity, 1000s of file systems

Device / vendor independence
• Must work with all storage devices, object stores, clouds, etc.

API driven


The Real Trick – Getting the Metadata

The Golden Rule of Data Preservation – “Preserve at the time of creation”
• Translation: Capture metadata throughout the research pipeline

Perhaps capture metadata when storage is provisioned
• This presumes that there is a structured process for provisioning storage

Capture metadata through an API
• This requires a simple API that anyone can use

Programmatically extract metadata from file headers, tags, and content

Capture metadata through a GUI
• Try to create incentives for users to key in metadata
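To make "extract metadata from file headers" concrete, the sketch below reads the fixed-layout IHDR chunk at the start of a PNG file. PNG is chosen only because its header is simple and well documented; the slides do not name any specific format, and the function name `png_header_metadata` is hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_header_metadata(path):
    """Return basic metadata from a PNG's IHDR chunk, or None if the file is not a PNG."""
    with open(path, "rb") as f:
        # 8-byte signature + 4-byte chunk length + 4-byte chunk type
        # + width, height (4 bytes each) + bit depth, color type (1 byte each)
        header = f.read(26)
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    width, height, bit_depth, color_type = struct.unpack(">IIBB", header[16:26])
    return {
        "format": "PNG",
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
    }
```

Only the first 26 bytes of each file are read, so this kind of extraction can run across a large catalog without scanning file contents, and its output can be stored as metadata with no effort from the researcher.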


Getting from Here to There


Problem Statements for Research Data Management

Scientists don’t want to enter metadata

No one wants to pay for long term storage

Data management planning disconnect between grant applicants and their institutions

There are more pressing problems related to storing data
• Collaboration
• Cost control: Chargeback, Showback, Tiering
• Backup

Organizational gridlock
• Conflicting priorities
• Unspecific mandates


Yes, We Too Have a Triangle!


Where it Starts: Scalable and Flexible Backup/Archive

[Diagram labels: Backup Clients, NAS, NAS or File Server, Disk-Based Object Storage, Tape Archive, Cloud Service]


How To Play


Looking for Collaborators

The ideal collaborator:
• Has an immediate need that is within our current feature set and scale
– This tells us that you can/will invest time with us
• Has additional needs that will put us to the test
• Is an existing client of Cambridge Computer, or
– Is willing to become one, or
– Is able to contribute some funds
– Is able to make a meaningful investment in time

If not now, maybe next year!
• Email me: [email protected]