data discovery and reuse

Post on 06-May-2015

DESCRIPTION

Slides for workshop led by Friedrich Lindenberg and Jonathan Gray at "Use of Information and Data for Enhanced Communication and Advocacy" workshop in Budapest, 17th March 2011.

TRANSCRIPT

Data discovery and reuse

TTF-IP Workshop, 18.3.

Monday, March 21, 2011

Data processes 1

•Need: machine-readable, openly licensed.

•Re-publish derived data

Data processes 2

•Goal: reproducible results, ecosystems:

•Tools to regularly extract data, share, transform and load it.

•Catalogues, documentation.

•“Data-the-process”, not “data-the-file”

•no “Excel Afternoons”
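A minimal sketch of "data-the-process": a small extract/transform/load script that can be re-run whenever the source changes, instead of a one-off manual cleanup. The column names (`recipient`, `amount`) and the sample rows are hypothetical, and only the Python standard library is used.

```python
import csv
import io


def extract(source):
    """Read rows from a CSV file-like object as dictionaries."""
    return list(csv.DictReader(source))


def transform(rows):
    """Tidy recipient names and convert amounts to numbers."""
    cleaned = []
    for row in rows:
        # Strip thousands separators; a missing value becomes "".
        amount = (row.get("amount") or "").replace(",", "").strip()
        if not amount:
            continue  # skip rows that carry no amount
        cleaned.append({
            "recipient": row["recipient"].strip().title(),
            "amount": float(amount),
        })
    return cleaned


def load(rows, target):
    """Write the derived data back out as CSV so it can be re-published."""
    writer = csv.DictWriter(target, fieldnames=["recipient", "amount"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical input; in practice this would be a downloaded file.
raw = io.StringIO('recipient,amount\n  acme ltd ,"1,200"\nNo Amount Org,\n')
out = io.StringIO()
load(transform(extract(raw)), out)
print(out.getvalue())
```

Because each step is a function, the same script can be scheduled, versioned and shared alongside the derived data.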

Data Acquisition

•Active acquisition: FoI (voluntary release), scraping (involuntary release)

•Passive acquisition: PSI/Open Data (voluntary release), leaks (involuntary release)

Basic tools

•Language “convention”: Python

•Screen scraping: ScraperWiki

•Semi-structured storage: MongoDB

•Keeping an overview: CKAN
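As an illustration of what screen scraping involves, here is a rough sketch that pulls the rows of an HTML table into Python lists. It uses only the standard library's `html.parser` rather than ScraperWiki itself, and the `TableScraper` class and the sample page are made up for this example.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


# Hypothetical page content; a real scraper would fetch this over HTTP.
page = (
    "<table>"
    "<tr><th>Year</th><th>Budget</th></tr>"
    "<tr><td>2010</td><td> 1.5 </td></tr>"
    "</table>"
)
scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)
```

The resulting rows are plain lists, so they can be written to CSV or stored as semi-structured documents.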

Textual data

•De-PDF (Acrobat Pro, pdf2text)

•Index & Search (Apache Solr)

•Basic NLP (word counts/frequencies, named-entity extraction, etc.): nltk

•Publish: Co-ment, AnnotateIt, Scribd

•Soon: DocumentClouds for all!
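The word-count step can be sketched in a few lines once the text is out of the PDF. This version deliberately uses only the standard library (`re` and `collections.Counter`) instead of nltk; the tokenising rule, the stopword list and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count lower-cased words, ignoring anything listed in stopwords."""
    # Crude tokeniser: runs of letters and apostrophes count as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stopwords)


# Hypothetical sample text standing in for an extracted document.
sample = "Open data needs open licences. Data without a licence is not open data."
freqs = word_frequencies(sample, stopwords={"a", "is", "not", "without"})
print(freqs.most_common(3))
```

For real documents a proper tokeniser and stemmer (which nltk provides) would, for instance, treat "licence" and "licences" as the same term.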

Numeric data I

•De-PDF: ABBYY FineReader

•Munge & Massage: Google Refine

•Share & Extend: Google Spreadsheets

•R/Stata/SPSS: more suited to internal processes.

•BI/Analytics/Aggregation: custom?

Numeric data II

•Visualization, first go: Google Vis Toolkit, IBM ManyEyes

•Visualization, interactive: Protovis, Raphael

•Flash considered harmful :-(

Network data

•Can be derived from other types

•Think about structure: nodes, edges, weights, directions

•Analysis: find central actors, mediators, ...: MCI, NetworkX

•Visualization: Gephi, GraphViz
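To make "derived from other types" concrete, the sketch below turns tabular funding records into a weighted, directed edge list and then ranks nodes by the total weight of their edges. The records, field names and helper functions are hypothetical, and the measure used here is plain weighted degree computed with the standard library, a simple stand-in for the richer centrality measures NetworkX offers.

```python
from collections import defaultdict

# Hypothetical tabular records: who funds whom, and how much.
records = [
    {"donor": "Donor A", "recipient": "Agency X", "amount": 5.0},
    {"donor": "Donor A", "recipient": "Agency Y", "amount": 2.0},
    {"donor": "Donor B", "recipient": "Agency X", "amount": 3.0},
    {"donor": "Donor A", "recipient": "Agency X", "amount": 1.0},
]


def build_edges(rows):
    """Collapse rows into directed edges (donor, recipient) -> total amount."""
    edges = defaultdict(float)
    for row in rows:
        edges[(row["donor"], row["recipient"])] += row["amount"]
    return dict(edges)


def weighted_degree(edges):
    """Sum the weights of all edges touching each node, in or out."""
    totals = defaultdict(float)
    for (source, target), weight in edges.items():
        totals[source] += weight
        totals[target] += weight
    return dict(totals)


edges = build_edges(records)
strength = weighted_degree(edges)
most_central = max(strength, key=strength.get)
print(most_central, strength[most_central])
```

The `edges` dictionary already holds the four structural choices from the list above: nodes, edges, weights and direction. It can be exported for drawing in Gephi or GraphViz.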

FTS “Afghanistan”, Ronny Patz

Geo-data

•There is more than Google Maps markers :-)

•Talk to your neighborhood OSM crowd.
