Data discovery and reuse
TTF-IP Workshop, 18.3.
Monday, March 21, 2011
Data processes 1
•Need: machine-readable, openly licensed.
•Re-publish derived data
Data processes 2
•Goal: reproducible results, ecosystems:
•Tools to regularly extract data, share, transform and load it.
•Catalogues, documentation.
•“Data-the-process”, not “data-the-file”
•no “Excel Afternoons”
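The "data-the-process" idea above can be sketched as a tiny extract, transform, load script that can be re-run whenever the source changes. This is only an illustration using the Python standard library: the sample CSV, its column names, and the function names are made up for the example and are not from the workshop.

```python
import csv
import io
import json

# Made-up sample input (assumption). A real pipeline would fetch this
# from the source on a regular schedule instead of hard-coding it.
RAW_CSV = """country,year,amount
Exampleland,2010,1200.50
Exampleland,2011,
Otherland,2010,300
"""


def extract(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with an empty amount and convert fields to proper types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip incomplete records instead of guessing a value
        cleaned.append({
            "country": row["country"],
            "year": int(row["year"]),
            "amount": float(row["amount"]),
        })
    return cleaned


def load(rows):
    """Serialise to JSON, a machine-readable form that can be re-published."""
    return json.dumps(rows, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(load(transform(extract(RAW_CSV))))
```

Because every step is a function rather than a manual spreadsheet edit, the same result can be reproduced by anyone who has the script and the source data.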
Data Acquisition
                     Voluntary release    Involuntary release
Active acquisition   FoI                  Scraping
Passive acquisition  PSI/Open Data        Leaks
Basic tools
•Language “convention”: Python
•Screen scraping: ScraperWiki
•Semi-structured storage: MongoDB
•Keeping an overview: CKAN
Textual data
•De-PDF (Acrobat Pro, pdf2text)
•Index & Search (Apache Solr)
•Basic NLP: Word counts/freqs, NEE etc.: nltk
•Publish: Co-ment, AnnotateIt, Scribd
•Soon: DocumentClouds for all!
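As a rough illustration of the "word counts/freqs" step, the sketch below counts word frequencies with the Python standard library only. It is a stand-in for what nltk offers, not nltk itself; the sample sentence, the simple regular-expression tokenizer, and the function name are assumptions made for the example.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=3):
    """Return the top_n most frequent lower-cased words with their counts."""
    # Very naive tokenizer (assumption): runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


# Made-up sample text; in practice this would come from the de-PDF step.
SAMPLE = "Open data needs open licences. Data without a licence is not open data."

if __name__ == "__main__":
    for word, count in word_frequencies(SAMPLE):
        print(word, count)
```

A real corpus would also need stop-word removal and better tokenization, which is where a dedicated toolkit becomes useful.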
Numeric data I
•De-PDF: ABBYY FineReader
•Munge & Massage: Google Refine
•Share & Extend: Google Spreadsheets
•R/Stata/SPSS: more suited to internal processes.
•BI/Analytics/Aggregation: custom?
Numeric data II
•Visualization, first go: Google Vis Toolkit, IBM ManyEyes
•Visualization, interactive: Protovis, Raphael
•Flash considered harmful :-(
Network data
•Can be derived from other types
•Think about structure: nodes, edges, weights, directions
•Analysis: find central actors, mediators, ...: MCI, NetworkX
•Visualization: Gephi, GraphViz
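To make "nodes, edges, weights, directions" and "find central actors" concrete, here is a minimal sketch in plain Python. It computes degree centrality (the share of other nodes each node is linked to) and a weighted in-strength per node. The edge list and node names are invented for illustration; NetworkX provides a `degree_centrality` function that would normally be used instead of the hand-written version.

```python
from collections import defaultdict

# Hypothetical directed, weighted edges: (source, target, weight).
EDGES = [
    ("donor_a", "agency_x", 5.0),
    ("donor_b", "agency_x", 2.0),
    ("agency_x", "ngo_1", 4.0),
    ("donor_a", "ngo_1", 1.0),
]


def degree_centrality(edges):
    """Fraction of other nodes each node touches, ignoring direction and weight."""
    neighbours = defaultdict(set)
    for source, target, _weight in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


def in_strength(edges):
    """Sum of incoming edge weights per target node (uses direction and weight)."""
    totals = defaultdict(float)
    for _source, target, weight in edges:
        totals[target] += weight
    return dict(totals)


if __name__ == "__main__":
    centrality = degree_centrality(EDGES)
    print(max(centrality, key=centrality.get))  # most connected node
    print(in_strength(EDGES))
```

Which measure counts as "central" depends on the question: degree centrality finds well-connected actors, while mediators are usually looked for with betweenness-style measures.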
FTS “Afghanistan”, Ronny Patz
Geo-data
•There is more than Google Maps markers :-)
•Talk to your neighborhood OSM crowd.