Getting Started with Unstructured Data

November 17, 2011 Getting Started with Unstructured Data Christine Connors & Kevin Lynch TriviumRLG LLC Thursday, November 17, 2011

Upload: dataversity

Post on 17-May-2015

TRANSCRIPT

Page 1: Getting Started with Unstructured Data

November 17, 2011

Getting Started with Unstructured Data

Christine Connors & Kevin Lynch

TriviumRLG LLC

Thursday, November 17, 2011

Page 2: Getting Started with Unstructured Data

Meta

✤ Presenter: Christine Connors

✤ @cjmconnors

✤ Presenter: Kevin Lynch

✤ @kevinjohnlynch

✤ Principals at www.triviumrlg.com

✤ Partnering with Dataversity

Page 3: Getting Started with Unstructured Data

Agenda

✤ What is unstructured data?

✤ Where do we find it?

✤ How important is it?

✤ How do we visualize it?

✤ Machine processing for actionable data

✤ Tools

Page 4: Getting Started with Unstructured Data

What is unstructured data?

✤ Data that:

✤ Is not in a database

✤ Does not adhere to a formal data model

✤ Content

Page 5: Getting Started with Unstructured Data

Isn’t that a misnomer?

✤ Problematic term

✤ The presence of object metadata or aesthetic markup does not alone give ‘structure’ in this sense of the word

✤ Object metadata = machine or applied properties

✤ Aesthetic markup = stylesheets; rendering information

✤ Semi-structured data is typically treated as unstructured for the purposes of machine processing and analysis

Page 6: Getting Started with Unstructured Data

Types of ‘un’structured data

✤ Text-based documents

✤ Word processing, presentations, email, blogs, wikis, tweets, web pages, web components (read/write web)

✤ Audio/video files

Page 7: Getting Started with Unstructured Data

Where do we find it?

✤ Office productivity suites

✤ Content management systems

✤ Digital asset management systems

✤ Web content management systems

✤ Wikis, blogs, comment & discussion threads

✤ Social networking tools

✤ Twitter, Yammer, instant messengers

Page 8: Getting Started with Unstructured Data

Is it really that important?

[Pie chart: Unstructured 85%, Structured 15%]

Page 9: Getting Started with Unstructured Data

What’s in that 80-85%?

✤ Progress reports - created in a word processor

Page 10: Getting Started with Unstructured Data

What’s in that 80-85%?

✤ Dashboards - created in presentation software

Page 11: Getting Started with Unstructured Data

What’s in that 80-85%?

✤ Progress reports - color coded text in a spreadsheet

Page 12: Getting Started with Unstructured Data

What’s in that 80-85%?

✤ Brainstorming - in messaging systems

✤ Decision making - in email

Page 13: Getting Started with Unstructured Data

What’s in that 80-85%?

✤ Business intelligence - on the web and more

Page 14: Getting Started with Unstructured Data

How can we make the data more actionable?

✤ Identify it

✤ Convert to a format you can work with

✤ Add structure, meaning:

✤ information extraction

✤ annotation

✤ content analytics
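The "convert to a format you can work with" step can be sketched for one common case, a web page reduced to plain text. This is a minimal illustration using only Python's standard library; the class name `TextExtractor`, the helper `html_to_text`, and the sample markup are assumptions made for this example, not part of the presentation.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    SKIPPED = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")  # keep adjacent blocks from fusing

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML document, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Status</h1><p>Project is <b>on track</b>.</p></body></html>"
)
print(html_to_text(sample))  # Status Project is on track.
```

The stylesheet is dropped because it is aesthetic markup, while the heading and paragraph text survive as plain content ready for extraction or annotation.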

Page 15: Getting Started with Unstructured Data

What about enterprise search?

✤ First line of defense

✤ Points you at the most relevant data, ranked via pattern matching and statistical analysis

✤ Does not assist in other visualizations or transformations without further machine processing
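As a rough illustration of relevance ranking by pattern matching and statistical analysis, the sketch below scores documents with a simple TF-IDF weighting. It is not how any particular enterprise search product works; the function names and the three sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "report": "Quarterly progress report for the search project",
    "email": "Decision on the search vendor: search pilot approved",
    "wiki": "Brainstorming notes on visualization ideas",
}
results = rank("search vendor", docs)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

The email ranks first because it matches both query terms, and the wiki page is not returned at all. The ranking points at documents but says nothing about the people, dates, or decisions inside them, which is where further machine processing comes in.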

Page 16: Getting Started with Unstructured Data

Information Extraction

✤ Token identification - “tokenization”

✤ Part-of-speech tagging - “POS” tagging (noun, verb, adverb, adjective, etc.)

✤ Phrase identification - e.g. noun phrases

✤ Entity extraction - people, places, events, dates, organizations
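Two of these steps, tokenization and entity extraction, can be sketched with regular expressions alone. This is a deliberately naive illustration: real toolkits use trained models and gazetteers, and the capitalized-word heuristic below is an assumption that will misfire on ordinary sentence-initial words. The patterns and sample sentence are made up for the example.

```python
import re

# Words (with an optional apostrophe part) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}\b"
)
# Two or more consecutive capitalized words, e.g. "Christine Connors".
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def extract_entities(text):
    """Return (label, surface text) pairs for dates and candidate names."""
    entities = [("DATE", m.group()) for m in DATE_RE.finditer(text)]
    entities += [("NAME", m.group()) for m in NAME_RE.finditer(text)]
    return entities


text = "Christine Connors and Kevin Lynch presented on November 17, 2011."
print(tokenize(text))
print(extract_entities(text))
```

Part-of-speech tagging and noun-phrase identification are left out because they need a lexicon or a trained tagger, which is exactly what frameworks such as GATE and UIMA package up.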

Page 17: Getting Started with Unstructured Data

Information Extraction

✤ Cluster analysis - group related information, where the relationship may not be known in advance

✤ Classification - mapping to specific categories

✤ Dependency identification / Rule generation

✤ Relationship detection - e.g. “Joe” “is CEO” at “IBM”

✤ Summarization - key concepts or key sentences
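The relationship detection example (“Joe” “is CEO” at “IBM”) can be sketched as a single hand-written pattern. This is only an illustration of the idea: production systems typically build relations on top of previously extracted entities rather than raw text, and the role list and the second sample sentence here are hypothetical.

```python
import re

# Pattern for "<Person> is [the] <Role> at|of <Organization>".
RELATION_RE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) is (?:the )?"
    r"(?P<role>CEO|CTO|CFO) (?:at|of) "
    r"(?P<org>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)


def detect_relations(text):
    """Return (person, role, organization) triples found in the text."""
    return [
        (m.group("person"), m.group("role"), m.group("org"))
        for m in RELATION_RE.finditer(text)
    ]


text = "Joe is CEO at IBM. Ann Lee is the CTO of Acme Corp."
relations = detect_relations(text)
print(relations)
# [('Joe', 'CEO', 'IBM'), ('Ann Lee', 'CTO', 'Acme Corp')]
```

Each triple is a small piece of structure recovered from free text, which can then be stored, queried, or visualized like any other structured record.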

Page 18: Getting Started with Unstructured Data

Open Tools

✤ GATE – General Architecture for Text Engineering, from the University of Sheffield, with many users and excellent documentation.

✤ GATE has customizable document and corpus processing pipelines. GATE is an architecture, a framework, and a development environment, with a clean separation of algorithms, data, and visualization.

Page 19: Getting Started with Unstructured Data

Open Tools

✤ UIMA – Unstructured Information Management Architecture (IBM’s Watson uses this), originated at IBM, now an Apache project.

✤ Component software architecture with a document processing pipeline similar to GATE. Focus on performance and scalability, with distributed processing (web services).

Page 20: Getting Started with Unstructured Data

UIMA

[Chart by IBM: the Common Analysis Structure (CAS). The artifact (e.g., a document) is the sentence “Fred Center is the CEO of Center Micros”. The analysis results (i.e., artifact metadata) are layered over it: a Parser layer (NP, VP, PP), a Named Entity layer (Person, Organization), and a Relationship layer (CeoOf, with Arg1: Person and Arg2: Org). The UIMA CAS representation is now aligned with the XMI standard.]

UIMA’s basic building blocks are Annotators. They iterate over an artifact to discover new types based on existing ones and update the Common Analysis Structure (CAS) for upstream processing.
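The annotator idea can be sketched in a few lines of plain Python. This is a toy model, not the UIMA API: the `Cas` and `Annotation` classes and both annotator functions are hypothetical stand-ins, and the sample sentence is modeled on the one in IBM's chart. It shows later annotators deriving a new type (CeoOf) from types that earlier annotators added (Person, Organization).

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Toy stand-in for a CAS: the artifact text plus its annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def gazetteer_annotator(type_name, names):
    """Build an annotator that marks every occurrence of the given names."""
    def annotate(cas):
        for name in names:
            start = cas.text.find(name)
            while start != -1:
                cas.annotations.append(
                    Annotation(type_name, start, start + len(name)))
                start = cas.text.find(name, start + 1)
    return annotate


def ceo_of_annotator(cas):
    """Derive CeoOf relations from existing Person/Organization annotations."""
    for person in cas.select("Person"):
        for org in cas.select("Organization"):
            between = cas.text[person.end:org.begin]
            if person.end <= org.begin and "CEO of" in between:
                cas.annotations.append(Annotation(
                    "CeoOf", person.begin, org.end,
                    {"arg1": person, "arg2": org}))


cas = Cas("Fred Center is the CEO of Center Micros")
pipeline = [
    gazetteer_annotator("Person", ["Fred Center"]),
    gazetteer_annotator("Organization", ["Center Micros"]),
    ceo_of_annotator,
]
for annotator in pipeline:
    annotator(cas)

relation = cas.select("CeoOf")[0]
print(cas.covered_text(relation.features["arg1"]), "is CEO of",
      cas.covered_text(relation.features["arg2"]))
```

Every annotator reads from and writes to the same shared structure, so the order of the pipeline matters: the relation annotator finds nothing unless the entity annotators have run first.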

Page 21: Getting Started with Unstructured Data

UIMA

[Image by IBM]

Page 22: Getting Started with Unstructured Data

Commercial Tools

✤ Oracle Data Mining (Text Mining)

✤ IBM SPSS

✤ SAS Text Miner

✤ Smartlogic

✤ Lots of acquisitions going on in the “big data” space

✤ HP acquired Autonomy

✤ Oracle acquired Endeca

Page 23: Getting Started with Unstructured Data

A Note on Tools

✤ UIMA and GATE – comprehensive suites of capabilities, each with a learning curve.

✤ Commercial tools range from unstructured capabilities inside DBMSs like Oracle, to business intelligence tools from Business Objects (which acquired Inxight, a Xerox PARC spin-off).

✤ Your mileage will vary. The biggest differentiator is your knowledge of your data.

Page 24: Getting Started with Unstructured Data

What can unstructured data look like post-processing?

Page 25: Getting Started with Unstructured Data

Machine Processing

[Diagram: Unstructured Data; a Machine Processing Platform comprising Natural Language Processing, Statistical Analysis, Rules-based Classification, and Semantic Analysis; an Index and an API; and Visualizations, Federated Search, and Data Stores.]
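Two of the boxes in the diagram, rules-based classification and the index, can be illustrated together. The sketch below assigns documents to categories with keyword rules and then builds a category index that a search or visualization layer could query. The rules, category names, and documents are all hypothetical, and real platforms use far richer rule languages.

```python
from collections import defaultdict

# Hypothetical rules: a category applies when any of its keywords occurs.
RULES = {
    "status": {"progress", "milestone", "dashboard"},
    "decision": {"approved", "decision", "rejected"},
}


def classify(text):
    """Return the sorted categories whose keywords appear in the text."""
    words = set(text.lower().split())
    return sorted(cat for cat, keywords in RULES.items() if words & keywords)


def build_index(documents):
    """Map each category to the sorted ids of the documents assigned to it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        for category in classify(documents[doc_id]):
            index[category].append(doc_id)
    return dict(index)


documents = {
    "memo-1": "Budget decision approved by the board",
    "report-7": "Progress against the Q3 milestone",
    "note-3": "Lunch menu for Friday",
}
index = build_index(documents)
print(index)
# {'decision': ['memo-1'], 'status': ['report-7']}
```

Documents that match no rule, like the lunch note, simply stay out of the index, which is one reason knowledge of your own data matters when writing the rules.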

Page 26: Getting Started with Unstructured Data

Questions?

Page 27: Getting Started with Unstructured Data

Thank you

Christine Connors

Kevin Lynch

www.triviumrlg.com
