data discovery and reuse

Data discovery and reuse TTF-IP Workshop, 18.3. Monday, March 21, 2011

Upload: jonathan-gray

Post on 06-May-2015


DESCRIPTION

Slides for a workshop led by Friedrich Lindenberg and Jonathan Gray at the "Use of Information and Data for Enhanced Communication and Advocacy" workshop in Budapest, 17th March 2011.

TRANSCRIPT

Page 1: Data discovery and reuse

Data discovery and reuse

TTF-IP Workshop, 18.3.

Monday, March 21, 2011

Page 2: Data discovery and reuse

Data processes 1

•Need: machine-readable, openly licensed.

•Re-publish derived data


Page 3: Data discovery and reuse

Data processes 2

•Goal: reproducible results, ecosystems:

•Tools to regularly extract data, share, transform and load it.

•Catalogues, documentation.

•“Data-the-process”, not “data-the-file”

•no “Excel Afternoons”
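The "data-the-process" idea can be sketched as a small script that is re-run whenever the source changes, instead of a one-off manual spreadsheet session. This is only a minimal sketch using the Python standard library; the sample CSV, its column names and the function names `extract`, `transform`, `load` and `run_pipeline` are illustrative assumptions, not part of the workshop material.

```python
import csv
import io

# Hypothetical raw input; in practice this would be fetched from the publisher.
RAW_CSV = """country,year,amount
 Hungary ,2010,1200
Austria,2010,950
Hungary,2011,
"""


def extract(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Trim whitespace, convert types and drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows where the source left the value empty
        cleaned.append({
            "country": row["country"].strip(),
            "year": int(row["year"]),
            "amount": int(amount),
        })
    return cleaned


def load(rows):
    """Serialise the cleaned rows back to CSV so they can be re-published."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["country", "year", "amount"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def run_pipeline(text):
    """Run extract, transform and load in sequence."""
    return load(transform(extract(text)))
```

Calling `run_pipeline(RAW_CSV)` again after the source is updated repeats exactly the same steps, which is what makes the result reproducible.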


Page 4: Data discovery and reuse

Data Acquisition

                      Voluntary release    Involuntary release

Active acquisition    FoI                  Scraping

Passive acquisition   PSI/Open Data        Leaks


Page 5: Data discovery and reuse

Basic tools

•Language “convention”: Python

•Screen scraping: ScraperWiki

•Semi-structured storage: MongoDB

•Keeping an overview: CKAN
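As a rough illustration of the screen-scraping step, the sketch below turns an HTML table into a list of records using only Python's built-in `html.parser`. The sample markup, the class name `TableScraper` and the function `scrape_table` are assumptions made for this example; it does not use or imitate the ScraperWiki API.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this first.
SAMPLE_HTML = """
<table>
  <tr><th>Agency</th><th>Budget</th></tr>
  <tr><td>Health</td><td>120</td></tr>
  <tr><td>Transport</td><td>80</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table as a list of dicts keyed by the header row."""
    parser = TableScraper()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, cells)) for cells in body]
```

Records shaped like the dictionaries returned by `scrape_table(SAMPLE_HTML)` are the kind of semi-structured data that a document store can hold without a fixed schema.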


Page 7: Data discovery and reuse

Textual data

•De-PDF (Acrobat Pro, pdf2text)

•Index & Search (Apache Solr)

•Basic NLP: word counts/frequencies, named-entity extraction (NEE), etc.: NLTK

•Publish: Co-ment, AnnotateIt, Scribd

•Soon: DocumentClouds for all!
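The word-count step mentioned above can be sketched without any extra libraries. This is a simplified stand-in for what a toolkit such as NLTK would do more carefully; the regular-expression tokenizer, the tiny `STOPWORDS` set, the sample text and the function name `word_frequencies` are all assumptions for illustration.

```python
import re
from collections import Counter

# Deliberately tiny stop-word list; a real project would use a fuller one.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "is", "for"}

# Hypothetical text, standing in for the output of a PDF-to-text conversion.
SAMPLE_TEXT = (
    "The budget of the ministry grew. The ministry said the budget "
    "is for health and the budget is public."
)


def word_frequencies(text, top=3):
    """Return the `top` most frequent non-stop-words with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(top)
```

For the sample text, `word_frequencies(SAMPLE_TEXT, top=2)` ranks "budget" first and "ministry" second.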


Page 13: Data discovery and reuse

Numeric data I

•De-PDF: ABBYY FineReader

•Munge & Massage: Google Refine

•Share & Extend: Google Spreadsheets

•R/Stata/SPSS: more suited to internal processes.

•BI/Analytics/Aggregation: custom?
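Where aggregation ends up being custom code, the munging and summing steps might look roughly like the sketch below. The sample rows, the field names and the helpers `parse_amount` and `total_by` are hypothetical; the parser also assumes non-negative amounts that use "," as a thousands separator and "." as the decimal point, which will not hold for every source.

```python
from collections import defaultdict

# Hypothetical rows as they might come out of an OCR'd or exported table.
ROWS = [
    {"sector": "Health", "amount": "1,200"},
    {"sector": "Health", "amount": " 300 EUR"},
    {"sector": "Roads", "amount": "n/a"},
    {"sector": "Roads", "amount": "80.5"},
]


def parse_amount(raw):
    """Turn strings such as '1,200' or ' 300 EUR' into a float, else None."""
    digits = "".join(ch for ch in raw if ch.isdigit() or ch == ".")
    try:
        return float(digits)
    except ValueError:
        return None  # nothing numeric could be recovered


def total_by(rows, key, value):
    """Sum the parsed `value` field per distinct `key`, skipping bad rows."""
    totals = defaultdict(float)
    for row in rows:
        amount = parse_amount(row[value])
        if amount is not None:
            totals[row[key]] += amount
    return dict(totals)
```

`total_by(ROWS, "sector", "amount")` gives one total per sector and silently ignores the "n/a" row; whether unparseable rows should be dropped or reported is a design choice worth making explicit in a real pipeline.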


Page 15: Data discovery and reuse

Numeric data II

•Visualization, first go: Google Vis Toolkit, IBM ManyEyes

•Visualization, interactive: Protovis, Raphael

•Flash considered harmful :-(


Page 18: Data discovery and reuse

Network data

•Can be derived from other types

•Think about structure: nodes, edges, weights, directions

•Analysis: find central actors, mediators, ...: MCI, NetworkX

•Visualization: Gephi, GraphViz
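To make the points above concrete, the sketch below derives a weighted, undirected network from tabular records and ranks actors by degree centrality. It is written in plain Python so it runs without dependencies; the meeting data and the function names `build_edges` and `degree_centrality` are illustrative assumptions. The normalisation (degree divided by the number of other nodes) is the same one NetworkX applies in its own `degree_centrality` function.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: each list names the organisations at one meeting.
MEETINGS = [
    ["Ministry", "NGO A", "Donor"],
    ["Ministry", "NGO B"],
    ["Ministry", "Donor"],
]


def build_edges(groups):
    """Undirected weighted edges; weight = number of shared meetings."""
    edges = Counter()
    for group in groups:
        # Sorting makes (a, b) and (b, a) map to the same edge key.
        for a, b in combinations(sorted(set(group)), 2):
            edges[(a, b)] += 1
    return edges


def degree_centrality(edges):
    """Fraction of the other nodes each node is directly connected to."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}
```

With the sample data, "Ministry" is connected to every other node and therefore has the highest centrality, while the edge between "Donor" and "Ministry" carries weight 2 because they share two meetings.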


Page 20: Data discovery and reuse

FTS “Afghanistan”, Ronny Patz

Page 21: Data discovery and reuse

Geo-data

•There is more than Google Maps markers :-)

•Talk to your neighborhood OSM crowd.
