Organization is Sharing: From eScience to Personal Information Management. Rodrigo Dias Arruda Senra. Advisor: Profa. Dra. Claudia Bauzer Medeiros. PhD Thesis Defense in Computer Science, Universidade Estadual de Campinas, Instituto de Computação. Campinas, 2012-12-10.

Page 1: Tese phd

Organization is Sharing: From eScience to Personal Information Management

Rodrigo Dias Arruda Senra

Advisor: Profa. Dra. Claudia Bauzer Medeiros

PhD Thesis Defense in Computer Science

Universidade Estadual de Campinas, Instituto de Computação

Campinas, 2012-12-10

Page 2: Tese phd

Outline

• Motivation
• Objectives
• Contributions
  • SciFrame
  • Database Descriptors
  • Organographs
• Results

Page 3: Tese phd

Motivation

Page 4: Tese phd

4

Study the relation Heterogeneity ↔ Organization ↔ Sharing

Page 5: Tese phd

NDVI Profile Generation in WebMAPS (architecture diagram)

• Sources: Geometries (IBGE), Spectral Images (NASA), Crops (Min.Agr)
• Acquisition protocols: HTTP, FTP
• Storage: PostGIS, Filesystem, Postgres
• Published over HTTP as: HTML, Microformats, 2D Plots

Page 12: Tese phd
Page 13: Tese phd

Objectives

Page 14: Tese phd

8

Page 15: Tese phd

• describe and compare eScience systems

• match application needs with DBMS capabilities

• manage digital content hierarchies

8

Page 16: Tese phd

• Motivation
• Objectives
• Contributions
  • SciFrame
  • Database Descriptors
  • Organographs
• Results

Page 17: Tese phd

SciFrame

Page 18: Tese phd

11

SciFrame

The Scientific Digital Data Processing Framework is a conceptual framework that describes systems or processes involving digital data manipulation.

Page 19: Tese phd

SciFrame

Interfacing
  Acquisition (discovery - extraction - transference)
    Discovery
    Extraction
    Transference
  Publication
Data Management
  Storage
  Manipulation: Create, Retrieve, Update, Delete, Index
Information Management
  Description
  Transformation
    Fusing
    Filtering

Page 27: Tese phd

WebMaps

Interfacing
  Acquisition
    Discovery: Geometries (IBGE), Raster (NASA), Crops (Min.Agr)
    Extraction: ad hoc extractor scripts (paparazzi)
    Transference: FTP and HTTP
  Publication: HTML, Microformats, 2D Plots
Data Management
  Storage: Geometries (PostGIS), Raster (Files), Crops (Postgres)
  Manipulation: Geometries (CRDI), Raster (CRD), Crops (CRUDI)
Information Management
  Description: Geometries (SHP, WKT), Raster (HDF, GeoTIFF)
  Transformation
    Fusing: NDVI Time Series
    Filtering: Cloud and noise removal (HANTS)
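A SciFrame description such as the WebMaps one can also be written down as data, which makes it easy to check whether every facet has been filled in. The sketch below is only an illustration: the nested-dictionary layout and the helper name `missing_facets` are assumptions, not part of SciFrame, and the values are copied from the WebMaps slide.

```python
# Hypothetical encoding of a SciFrame description as nested dictionaries.
# Layer and facet names follow the slide; the layout itself is an assumption.
SCIFRAME_FACETS = {
    "Interfacing": ["Discovery", "Extraction", "Transference", "Publication"],
    "Data Management": ["Storage", "Manipulation"],
    "Information Management": ["Description", "Fusing", "Filtering"],
}

webmaps = {
    "Interfacing": {
        "Discovery": "Geometries (IBGE), Raster (NASA), Crops (Min.Agr)",
        "Extraction": "ad hoc extractor scripts (paparazzi)",
        "Transference": "FTP and HTTP",
        "Publication": "HTML, Microformats, 2D Plots",
    },
    "Data Management": {
        "Storage": "Geometries (PostGIS), Raster (Files), Crops (Postgres)",
        "Manipulation": "Geometries (CRDI), Raster (CRD), Crops (CRUDI)",
    },
    "Information Management": {
        "Description": "Geometries (SHP, WKT), Raster (HDF, GeoTIFF)",
        "Fusing": "NDVI Time Series",
        "Filtering": "Cloud and noise removal (HANTS)",
    },
}


def missing_facets(description):
    """Return the (layer, facet) pairs that a description leaves unfilled."""
    return [
        (layer, facet)
        for layer, facets in SCIFRAME_FACETS.items()
        for facet in facets
        if not description.get(layer, {}).get(facet)
    ]


print(missing_facets(webmaps))
```

Two systems described this way can be compared facet by facet, which is the kind of side-by-side reading the framework is meant to support.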

Page 28: Tese phd

Research Problems

Interfacing
  Acquisition
    Discovery: data scattered, many providers, search engines?
    Extraction: feasibility, preserve provenance, lack of semantics
    Transference: availability, voluminous data, bandwidth, protocol
  Publication: lack of intention, access control, traceability
Data Management
  Storage: scalability, distribution, consistency, preservation
  Manipulation: multimedia, impedance mismatch
Information Management
  Description: implicit x explicit, semantic web, social, trust, privacy
  Transformation: information lost (conceptual > logical > physical); multi-modality; handle uncertain and incomplete data

Page 29: Tese phd

Technologies

Interfacing
  Acquisition
    Discovery: DAS Registry, BIOCatalogue, SciScope
    Extraction: Scrapers, Wrappers, PiggyBank, Operator
    Transference: Streaming, P2P, OpenDAP
  Publication: SOA x ROA, Microformats x RDFa
Data Management
  Storage: Scientific Datasets, XML, Cloud Computing
  Manipulation: SQL extensions, ORMs, LINQ
Information Management
  Description: In Loco Semantics
  Transformation: Array Algebra (RASDAMAN); Topological Operators (GIS); Proximity Search and Report Language (ISIS)

Page 30: Tese phd

Interfacing

Acquisition

Publication

(discovery - extraction - transference )

Information Management Data Management

Page 31: Tese phd

Data Management

Page 33: Tese phd

Data Management

✓ enforce loose coupling between Apps and DBMS

✓ DBMS product/vendor independence

✓ seamless cross-database migration

✓ capability verification, validation and negotiation

✓ support Apps and DBMS in the cloud!

Page 34: Tese phd

Database Descriptors

Page 35: Tese phd

Descriptors

• Feature descriptor: specifies what a DBMS provides
• Desiderata descriptor: specifies what a client application needs

Diagram labels: App, DBMS

Page 37: Tese phd

Architecture (diagram)

Diagram labels: Web; DMS X, DMS Y, DMS Z; Descriptor Registry (shown several times); descriptor X, descriptor Y; App; Negotiator; binding

Page 44: Tese phd

DBD Structure

13 * http://dublincore.org/documents/dces/

App DBMS

Page 45: Tese phd

@prefix :     <http://www.lis.ic.unicamp.br/purl/DBD/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

:Cmbm a foaf:Person ;
    foaf:name "Claudia Bauzer Medeiros" .

:DBD1 dc:identifier "DBD1" ;
    dc:type "Feature DBD" ;
    dc:format "text/turtle" ;
    dc:title "Sample Feature Descriptor" ;
    dc:description "Hypothetical Feature DBD in RDF/Turtle" ;
    dc:creator :Cmbm ;
    dc:date "2009-12-18" ;
    dc:language "EN" ;
    :isolation :READ_COMMITED ;
    :versioning "unsupported" ;
    :storage "RDF Triples" ;
    :DML [ a rdf:Bag ;
           rdf:_1 "RDQL" ;
           rdf:_2 "SPARQL" ] .

Feature Descriptor

Page 46: Tese phd

@prefix :     <http://www.lis.ic.unicamp.br/purl/DBD/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

:Rodsenra a foaf:Person ;
    foaf:name "Rodrigo Dias Arruda Senra" .

:DBD2 dc:identifier "DBD2" ;
    dc:type "Desiderata DBD" ;
    dc:format "text/turtle" ;
    dc:title "Sample Desiderata Descriptor" ;
    dc:description "Desiderata DBD for hypothetical App" ;
    dc:creator :Rodsenra ;
    dc:date "2010-01-05" ;
    dc:language "EN" ;
    :isolation :READ_COMMITED ;
    :concurrency "Two phase lock" ;
    :storage "RDF Triples" ;
    :DML "SPARQL" .

Desiderata Descriptor
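To make the role of the two descriptor kinds concrete, here is a minimal sketch of how a negotiator might compare the sample Desiderata DBD with the sample Feature DBD. It assumes the descriptors have already been flattened into plain dictionaries, and the matching rule (every desired dimension must be offered, and a multi-valued feature matches if it contains the desired value) is an assumption made for illustration, not the negotiation procedure defined in the thesis. The function name `unmet_requirements` is hypothetical.

```python
# Dimensions and values copied from the two sample descriptors.
# The flattened dict representation is an assumption for this sketch.
feature_dbd = {
    "isolation": "READ_COMMITED",
    "versioning": "unsupported",
    "storage": "RDF Triples",
    "DML": {"RDQL", "SPARQL"},
}

desiderata_dbd = {
    "isolation": "READ_COMMITED",
    "concurrency": "Two phase lock",
    "storage": "RDF Triples",
    "DML": "SPARQL",
}


def unmet_requirements(desiderata, features):
    """List the desired dimensions that the feature descriptor does not satisfy."""
    unmet = []
    for dimension, wanted in desiderata.items():
        offered = features.get(dimension)
        if isinstance(offered, (set, frozenset, list, tuple)):
            satisfied = wanted in offered      # multi-valued feature
        else:
            satisfied = offered == wanted      # single-valued (or missing) feature
        if not satisfied:
            unmet.append(dimension)
    return unmet


print(unmet_requirements(desiderata_dbd, feature_dbd))
```

With these two samples the only unmet dimension is `concurrency`, because the Feature DBD does not declare it; a real negotiator could then reject the binding or look for another DBMS in the registry.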

Page 47: Tese phd

Understanding Hierarchies...

SciFrame DBDs

Page 48: Tese phd

Organographs

Page 49: Tese phd

27

Page 50: Tese phd

28

Which of the following sets better accommodates the object above?

Page 51: Tese phd

29

Red? Triangles? Metric Related?

Page 52: Tese phd

Problems

30

1. Single Category versus Multi-faceted Content

2. Manually-defined categories

3. Criteria are not explicit

4. Static Membership Relation

5. Organization is not reusable

Page 53: Tese phd

31

Page 54: Tese phd

31

Organograph

... artifact to make explicit how to organize information in the context of a particular task.

Page 55: Tese phd

Organograph

Hout = forg(Hin)

Hin and Hout are hierarchies H(V,E), each addressed by a URL.

Diagram labels: vagg (twice), vcnt, eagg, ecnt

forg:
• navigation (crawler/iterator)
• feature extraction
• FHil(vagg,vagg): hierarchical structuring
• FCat(vagg,vcnt): categorization
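The relation Hout = forg(Hin) can be sketched in code. The reading used here is an assumption drawn from the slide labels: vagg are aggregator vertices (for example folders), vcnt are content vertices (for example documents), eagg edges link aggregators to aggregators (the role of FHil) and ecnt edges link aggregators to content (the role of FCat). The class `Hierarchy` and the toy transformation `forg_by_extension` are hypothetical names, and the transformation is deliberately trivial.

```python
from dataclasses import dataclass, field


@dataclass
class Hierarchy:
    """H(V, E) with two vertex kinds and two edge kinds (assumed reading)."""
    vagg: set = field(default_factory=set)  # aggregator vertices
    vcnt: set = field(default_factory=set)  # content vertices
    eagg: set = field(default_factory=set)  # (parent aggregator, child aggregator)
    ecnt: set = field(default_factory=set)  # (aggregator, content)


def forg_by_extension(h_in):
    """Toy forg: regroup the content of h_in under one aggregator per file extension."""
    h_out = Hierarchy(vcnt=set(h_in.vcnt))
    h_out.vagg.add("root")
    for doc in h_in.vcnt:
        ext = doc.rsplit(".", 1)[-1] if "." in doc else "other"
        h_out.vagg.add(ext)
        h_out.eagg.add(("root", ext))  # hierarchical structuring (FHil role)
        h_out.ecnt.add((ext, doc))     # categorization (FCat role)
    return h_out


h_in = Hierarchy(
    vagg={"inbox"},
    vcnt={"a.pdf", "b.pdf", "notes.txt"},
    ecnt={("inbox", "a.pdf"), ("inbox", "b.pdf"), ("inbox", "notes.txt")},
)
h_out = forg_by_extension(h_in)
print(sorted(h_out.ecnt))
```

The content vertices are untouched; only the aggregators and the two edge sets change, which is what makes the organization itself something that can be stored and shared.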

Page 57: Tese phd

Organograph Composition (diagram)

Task!

Diagram labels: forg, Expert Roles, Author, Content, Domain, NLP, ML, Algorithms, Information Extraction, Classifiers, Similarity, Ontologies, Iterators, Data Container, Visualization Strategies, UX

Options shown for individual components, one per build step:

• Information Extraction: patterns, dictionaries, rules, probabilities, templates/wrappers
• Similarity: matching, dice, jaccard, overlap, cosine
• Ontologies: FOAF, Dbpedia, Schema.org, Freebase, MusicBrainz, Geonames
• Classifiers: Naive Bayes, SVM, Nearest Neighbors, LDA, LSI
• Data Container (DBDs!): Filesystem, Gmail, Evernote, Delicious, Dropbox
• Visualization Strategies: Fuse, Dokan, Infoviz, D3
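The similarity options named on the Organograph Composition slide (matching, dice, jaccard, overlap, cosine) have standard set-based definitions. The sketch below applies them to two bags of keywords; the keyword sets are made up for illustration, and the functions assume both sets are non-empty.

```python
import math


def matching(a, b):
    """Matching coefficient: number of shared elements."""
    return len(a & b)


def dice(a, b):
    """Dice coefficient: 2|A∩B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard coefficient: |A∩B| / |A∪B|."""
    return len(a & b) / len(a | b)


def overlap(a, b):
    """Overlap coefficient: |A∩B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))


def cosine(a, b):
    """Set-based cosine: |A∩B| / sqrt(|A| * |B|)."""
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical keyword sets extracted from two documents.
doc_a = {"data", "sharing", "escience"}
doc_b = {"data", "sharing", "hierarchy", "pim"}

for measure in (matching, dice, jaccard, overlap, cosine):
    print(measure.__name__, round(measure(doc_a, doc_b), 3))
```

All five share the same numerator and differ only in how they normalize by set size, so the choice mostly affects how documents of very different lengths compare.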

Page 64: Tese phd

Methodology

Steps, in the order they appear on the slide:

• collection
• organize
• evaluate
• reorganize
• share

Page 69: Tese phd

Evaluating Hierarchies

• too much content
• too many aggregators
• duplicated or misplaced
• too deep
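The four warning signs listed under Evaluating Hierarchies can be turned into simple counts. The sketch below is one possible reading, not the evaluation procedure of the thesis: a hierarchy is modelled as a nested dictionary in which an aggregator maps to another dictionary and a content item maps to `None`, and the function name `hierarchy_metrics` and the sample tree are hypothetical. No thresholds are suggested, since what counts as "too much" depends on the task.

```python
from collections import Counter


def hierarchy_metrics(tree):
    """Walk a nested dict (aggregator -> dict, content -> None) and collect indicators."""
    stats = {"aggregators": 0, "max_depth": 0, "max_content": 0}
    seen = Counter()

    def walk(node, depth):
        stats["aggregators"] += 1                      # the root counts as an aggregator
        stats["max_depth"] = max(stats["max_depth"], depth)
        contents = [name for name, child in node.items() if child is None]
        stats["max_content"] = max(stats["max_content"], len(contents))
        seen.update(contents)
        for child in node.values():
            if child is not None:
                walk(child, depth + 1)

    walk(tree, 0)
    # Content names that occur under more than one aggregator.
    stats["duplicated"] = sorted(name for name, n in seen.items() if n > 1)
    return stats


# Hypothetical document collection.
docs = {
    "papers": {
        "2011": {"p1.pdf": None, "p2.pdf": None},
        "drafts": {"p1.pdf": None},
    },
    "todo.txt": None,
}

print(hierarchy_metrics(docs))
```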

Page 74: Tese phd

Reorganizing Hierarchies

Diagram: the same three papers (paper 1, paper 2, paper 3) organized in two ways:

• Author, then Publication Date: Alice (2011, 2008), Bob (2011)
• Publication Date, then Author: 2011 (Alice, Bob), 2008 (Alice)

Task is important!
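Switching between an Author-first and a Publication-Date-first hierarchy amounts to regrouping the same items by the same facets in a different order. The sketch below shows that idea with nested dictionaries; the assignment of papers to authors and years is hypothetical (the slide does not state it), and the function name `organize` is an assumption.

```python
from collections import defaultdict

# Hypothetical facet values for the three papers on the slide.
papers = [
    {"title": "paper 1", "author": "Alice", "year": 2011},
    {"title": "paper 2", "author": "Alice", "year": 2008},
    {"title": "paper 3", "author": "Bob", "year": 2011},
]


def organize(items, facets):
    """Group items into nested dicts, one level per facet, in the given order."""
    if not facets:
        return [item["title"] for item in items]
    groups = defaultdict(list)
    for item in items:
        groups[item[facets[0]]].append(item)
    return {key: organize(group, facets[1:]) for key, group in groups.items()}


by_author = organize(papers, ["author", "year"])
by_year = organize(papers, ["year", "author"])

print(by_author)
print(by_year)
```

Neither ordering is better in general; which one helps depends on whether the task at hand starts from a person or from a period.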

Page 77: Tese phd

Reuse Organization

Diagram labels: Hacm, Vcnt mine

Page 80: Tese phd

Organograph Execution (pipeline diagram)

Diagram labels: Hin, Pre-processing, Feature Extraction, Internal Indexes, Transformation Workflow (FCat(), FHil()), Hout, Visualization

Page 84: Tese phd
Page 86: Tese phd

@organograph
def forg_ccs98(self, input):
    self.id = new_uuid()  # 'ff7d8e21-4226-11e2-b2f1-109add6b426c'
    self.description = 'docs by ACM CCS98'
    ccs98 = acm_extract('http://www.acm.org/about/class/1998/ccs98.xml')
    trainset = []
    for category, words in nlp_clean_titles(ccs98.Vcnt.paths):
        for w in words:
            trainset.append((make_feature(w), category))

    classifier = NaiveBayes(trainset)
    self.Ecnt = classifier.classify(input)  # FCat
    self.Eagg = ccs98.Eagg.Level[:1]        # FHil

input = collection('file:///some/local/dir/docs')
output = forg_ccs98(input)
publish(output, 'rodsenra@dropbox:/output')
organicer.render(output, organicer.views.HYPERBOLIC_TREE)
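The slide relies on a `NaiveBayes(trainset)` helper trained on (feature, category) pairs and does not show its internals. As a rough idea of what such a classifier does, here is a self-contained multinomial Naive Bayes with add-one smoothing. It is a stand-in written for illustration, not the Organicer implementation: the class name `TinyNaiveBayes` is hypothetical, each feature is simply a word, the prior is estimated from the number of training pairs per category, and the tiny training set only borrows two ACM CCS98 category codes as labels.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over (word, category) pairs, with add-one smoothing."""

    def __init__(self, trainset):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.category_counts = Counter()         # category -> number of training pairs
        self.vocabulary = set()
        for word, category in trainset:
            self.word_counts[category][word] += 1
            self.category_counts[category] += 1
            self.vocabulary.add(word)

    def classify(self, words):
        """Return the category with the highest log-posterior for a bag of words."""
        total = sum(self.category_counts.values())
        best, best_score = None, -math.inf
        for category, n in self.category_counts.items():
            score = math.log(n / total)          # log prior
            denom = n + len(self.vocabulary)     # smoothed word total for the category
            for word in words:
                score += math.log((self.word_counts[category][word] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best


# Illustrative training pairs; the words are made up, the labels are CCS98 codes.
trainset = [
    ("database", "H.2"), ("query", "H.2"), ("transaction", "H.2"),
    ("retrieval", "H.3"), ("indexing", "H.3"), ("search", "H.3"),
]
classifier = TinyNaiveBayes(trainset)
print(classifier.classify(["query", "transaction"]))
print(classifier.classify(["search"]))
```

In the organograph above, the predicted category plays the part of FCat: it decides under which aggregator of the ACM hierarchy a document is placed.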

Page 87: Tese phd

forg_ccs_98

Interfacing
  Acquisition
    Discovery: ACM CCS98, Hin
    Extraction: pdf2txt, pdfbox, pypdf; NLTK (tokenizer)
    Transference: HTTP, WebDAV, NFS, SMB
  Publication: Hout: HTML+CSS, JS (Infoviz, D3); Dropbox
Data Management
  Storage: NoSQL DB (Mongo, Neo4J)
  Manipulation: Indexes (CRDI)
Information Management
  Description: SKOS, GraphML, JSON
  Transformation
    Mining: NaiveBayes
    Filtering: Vcnt (unconverted pdfs); Vagg (empty or ambiguous)

Page 88: Tese phd

Related Work

Page 89: Tese phd

Related Work (SciFrame)

• CLRC scientific metadata model: B. Matthews and S. Sufi. The CLRC Scientific Metadata Model, version 1. DL TR 02001, CLRC, 2001.

• myGrid Information Model: Sharman, Nick, et al. "The myGrid information model." UK e-Science programme All Hands Conference. 2004.

Page 90: Tese phd

Related Work (DBDs)

Madnick and Wang. Evolution Towards Strategic Applications of Databases Through Composite Information Systems. Journal of Management Information Systems 5(2):5-22, 1988.

“In order to separate data from the application processing, it is necessary to employ a process descriptor and a database descriptor.

The process descriptor describes the name, the input/output data requirement, and other resource requirements of the processing components.

The database descriptor contains information about the data (e.g., data model, schema, access rights) in the database, similar to data dictionaries.

These two descriptors can be used by the execution environment to coordinate the interaction between the processing component and the database.”

Page 91: Tese phd

Related Work (Organographs)

• Topic Modeling: LSA, LDA, Hierarchical Bayesian

Blei 201; Blei, Ng, & Jordan, 2003; Griffiths & Steyvers, 2002; 2003; 2004; Hofmann, 1999; 2001

• Personal Information Management: CALO, UMEA, X-COSIM, Haystack, UpLib, Iris

Zimmermann 2005; Arndt 2007; Lansdale 1988; Kaptelinin 2003; Janssen & Popat 2003; Karger et al 2003

• Semantic Desktop: Nepomuk, SEMSOC

Giannakidou et al 2008; Groza et al 2007

• Personal Digital Libraries: Zotero, Mendeley, Papers

Page 92: Tese phd

Results

Page 93: Tese phd

Contributions

• SciFrame

• Database Descriptors (DBDs)

• Organographs

• Software tools & algorithms: WebMAPS, Paparazzi & Organicer

46

Page 94: Tese phd

Publications

Submitted to JODS: Evaluating, Reorganizing and Sharing Digital Information Hierarchies. Rodrigo D. A. Senra, Claudia B. Medeiros. Journal on Data Semantics (submitted on 2012-10-25).

2011: Organographs - Multi-faceted Hierarchical Categorization of Web Documents. Rodrigo D. A. Senra, Claudia B. Medeiros. Proceedings of the 7th International Conference on Web Information Systems and Technologies - WEBIST: 583-588.

2010: Database Descriptors: Laying the Path to Commodity Web Data Services. Rodrigo D. A. Senra, Claudia B. Medeiros. Proceedings of Engineering of Computer-Based Systems (ECBS): 386-392.

2009: SciFrame: a conceptual framework to describe data sharing in eScience. Rodrigo D. A. Senra, Claudia B. Medeiros. Proceedings of the III Brazilian eScience workshop (XXIV SBBD).

2009: A standards-based framework to foster geospatial data and process interoperability. Gilberto Z. Pastorello Jr., Rodrigo D. A. Senra, Claudia B. Medeiros. Journal of the Brazilian Computer Society 15(1): 13-25.

2008: Bridging the gap between geospatial resource providers and model developers. Gilberto Z. Pastorello Jr., Rodrigo D. A. Senra, Claudia B. Medeiros. Proceedings of the 16th International Conference on Advances in Geographic Information Systems - ACM SIGSPATIAL.

2007: O projeto WebMAPS: desafios e resultados. Carla G. N. Macário, Claudia B. Medeiros, Rodrigo D. A. Senra. Proceedings of 9th Brazilian Symposium on Geoinformatics - GeoInfo: 239-250.

Topic labels shown beside the list: SciFrame, WebMaps, DBDs, Organographs

Page 96: Tese phd

Extensions

SciFrame
  Theoretical: formalize design pattern; enhance the operations vocabulary
  Practical: online catalog of eScience systems; describe as ontology (RDF)

Database Descriptors
  Theoretical: analyse negotiation frameworks; expand DBDs expressivity; explore ranking algorithms
  Practical: catalog of concrete DBDs; adapt Organicer to use DBDs; experiment with dynamic negotiation

Organographs
  Theoretical: model with Category Theory; explore DSLs to describe forg
  Practical: support non-textual media (e.g., images); expand component palette

48

Page 97: Tese phd

Acknowledgments

• Laboratório de Sistemas de Informação (IC-Unicamp): http://www.lis.ic.unicamp.br

• Brazilian Institute for Web Science Research: http://webscience.org.br

• Fapesp - CNPQ - CAPES

49

Page 99: Tese phd

Rodrigo Dias Arruda Senra

http://rodrigo.senra.nom.br

[email protected]

Thank you for your attention.

Page 100: Tese phd

Support Material

Page 101: Tese phd

Organicer pipeline stages and the libraries shown next to each (diagram):

• Source Hierarchy
• Hierarchy Navigation (Iterator): os.walk, pydelicious, evernote
• Pre-processing: BeautifulSoup, pyPdf
• Extraction: NLTK
• Facet Index: pymongo
• Transformation Workflow: networkx, gensim, numpy, scikit-learn
• Resulting Hierarchy
• Visualization: matplotlib, ObsPy, InfoViz.js, D3.js

Page 111: Tese phd

Hin Hout

Internal Indexes

Pre-processing

Feature Extraction

Transformation Workflow

FCat() FHil()

Visualization

Page 112: Tese phd

Organograph composition (diagram, repeated)

Task!

Diagram labels: forg, Expert Roles, Author, Content, Domain, NLP, ML, Algorithms, Information Extraction, Classifiers, Similarity, Ontologies, Iterators, Data Container, Visualization Strategies, UX

Page 113: Tese phd

55

forg:
• navigation (crawler/iterator)
• feature extraction
• FHil(vagg,vagg): hierarchical structuring
• FCat(vagg,vcnt): categorization

Hin: URL

Hout: URL

Page 114: Tese phd

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dbd="http://www.lis.ic.unicamp.br/purl/DBD">
  <rdf:Description rdf:about="http://www.lis.ic.unicamp.br/purl/DBD/DBD1">
    <!-- metadata -->
    <dc:creator>Claudia Bauzer Medeiros</dc:creator>
    <dc:description>Hypothetical DBD for an RDF DBMS</dc:description>
    <dc:identifier>DBD1</dc:identifier>
    <dc:format>application/rdf+xml</dc:format>
    <dc:type>
      <rdf:Description>
        <dbd:Type>Feature DBD</dbd:Type>
      </rdf:Description>
    </dc:type>
    <dc:title>Descriptor of an RDF DBMS</dc:title>
    <dc:date>2009-12-18</dc:date>
    <dc:language>EN</dc:language>
    <!-- dimensions and values -->
    <dbd:concurrency>Two phase lock</dbd:concurrency>
    <dbd:versioning>unsupported</dbd:versioning>
    <dbd:storage>RDF triples</dbd:storage>
    <dbd:DML>
      <rdf:Bag>
        <rdf:li>RDQL</rdf:li>
        <rdf:li>SPARQL</rdf:li>
      </rdf:Bag>
    </dbd:DML>
  </rdf:Description>
</rdf:RDF>

Page 115: Tese phd

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dbd="http://www.lis.ic.unicamp.br/purl/DBD">
  <rdf:Description rdf:about="http://www.lis.ic.unicamp.br/purl/DBD/DBD2">
    <!-- metadata -->
    <dc:creator>Rodrigo Dias Arruda Senra</dc:creator>
    <dc:description>Desiderata DBD for a hypothetical application</dc:description>
    <dc:identifier>DBD2</dc:identifier>
    <dc:format>application/rdf+xml</dc:format>
    <dc:type>
      <rdf:Description>
        <dbd:Type>Desiderata DBD</dbd:Type>
      </rdf:Description>
    </dc:type>
    <dc:title>Desiderata descriptor of a hypothetical application</dc:title>
    <dc:date>2010-01-05</dc:date>
    <dc:language>EN</dc:language>
    <!-- dimensions and values -->
    <dbd:concurrency>Two phase lock</dbd:concurrency>
    <dbd:storage>RDF triple store</dbd:storage>
    <dbd:DML>RDQL</dbd:DML>
  </rdf:Description>
</rdf:RDF>

Page 116: Tese phd

58

NDVI Profiles

Page 117: Tese phd

Data Management

Manipulation

Create Retrieve Update Delete Index

Storage

Page 118: Tese phd

Information Management

Transformations

‣ Browsing
‣ Iterating
‣ Searching
‣ Augmenting
‣ Mining
‣ Description
‣ Annotation
‣ Schematization
‣ Summarizing
‣ Structuring
‣ Sorting
‣ Merging
‣ Decreasing
‣ Filtering
‣ Fusing

Page 119: Tese phd

Example

61

Page 120: Tese phd

Example

62

Input Collection

Task: info extraction

Task: transformation

Task: visualization

Page 121: Tese phd

63

WebMAPS: DataFlow

Mail

FTP

MODIS Reprojection Tool

Images

Region clipping

Geometry (IBGE)

Page 122: Tese phd

64

NDVI

Page 123: Tese phd

Related Work

Approaches to App-to-DMS binding
• embedded
• n-tier client/server (including web services)
• mediators

Descriptors are orthogonal to all of these!

Information Integration [1]
• Process: Understanding, Standardization, Specification, Execution
• Mechanism: Materialization, Federation, Indexing

[1] Laura Haas. Beauty and the Beast: The Theory and Practice of Information Integration.

Page 125: Tese phd

66

Sensor Data Extraction

from osgeo import gdal
from osgeo.gdalconst import GA_ReadOnly

dataset = gdal.Open(raster_file, GA_ReadOnly)

# Get the coefficients of the affine functions that map coordinates
gt = dataset.GetGeoTransform()

# Get the data band of interest
band = dataset.GetRasterBand(1)

# Identify how the data are encoded.
# For the TIF file the data are unsigned bytes ('Byte')
data_type = gdal.GetDataTypeName(band.DataType)

# Get the image dimensions
width, height = band.XSize, band.YSize

# Convert the MBR from the lat/long coordinate system to row/column
# Xgeo = GT(0) + Xpixel*GT(1) + Yline*GT(2)
# Ygeo = GT(3) + Xpixel*GT(4) + Yline*GT(5)
ul_pixel, lr_pixel = g2p(gt, *ul_geo), g2p(gt, *lr_geo)
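The slide calls a helper `g2p` without showing it. A plausible version, offered here as an assumption rather than the original WebMAPS code, inverts the two affine equations quoted in the comments to go from geographic coordinates back to a (row, column) pair. The return order (row first) is chosen to match how the pixel tuples are indexed later in `raster2array`; the geotransform values in the example are made up.

```python
import math


def g2p(gt, x_geo, y_geo):
    """Invert the GDAL affine geotransform: geographic (x, y) -> (row, col).

    Solves  Xgeo = gt[0] + col*gt[1] + row*gt[2]
            Ygeo = gt[3] + col*gt[4] + row*gt[5]
    for col and row.
    """
    dx, dy = x_geo - gt[0], y_geo - gt[3]
    det = gt[1] * gt[5] - gt[2] * gt[4]
    if det == 0:
        raise ValueError("geotransform is not invertible")
    col = (gt[5] * dx - gt[2] * dy) / det
    row = (gt[1] * dy - gt[4] * dx) / det
    return int(math.floor(row)), int(math.floor(col))


# Hypothetical north-up geotransform: origin at (-48.0, -22.0), 0.25-degree pixels.
gt_example = (-48.0, 0.25, 0.0, -22.0, 0.0, -0.25)

print(g2p(gt_example, -48.0, -22.0))  # upper-left corner of the raster
print(g2p(gt_example, -47.0, -23.0))  # one degree east and one degree south
```

For the common north-up case (gt[2] and gt[4] equal to zero) this reduces to dividing the offsets by the pixel width and height.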

Page 126: Tese phd

67

WebMAPS

Page 127: Tese phd

Case Study: WebMaps

Page 129: Tese phd

69

Data Extraction

import struct
import numpy

def raster2array(ul_pixel, lr_pixel, dtype='B'):
    """Using ul_pixel and lr_pixel, generate a numpy array with the
    region of interest extracted from the raster file."""
    col_size = lr_pixel[1] - ul_pixel[1] + 1
    row_size = lr_pixel[0] - ul_pixel[0] + 1
    scanline = band.ReadRaster(ul_pixel[1], ul_pixel[0], col_size, row_size)
    num_pixels = col_size * row_size
    roi = numpy.array(struct.unpack(dtype * num_pixels, scanline))
    roi.shape = (row_size, col_size)
    return roi

# Read data from the raster file into a numpy array,
# defining a region-of-interest matrix
roi = raster2array(ul_pixel, lr_pixel)

Page 130: Tese phd

70

Geometry Extraction

from osgeo import ogr

shp = ogr.Open(filepath)

# Layer for the State of São Paulo
layer = shp.GetLayerByName('35mu500gc')

# Feature for the municipality of Campinas
feature = layer.GetFeature(501)

# Extract the control points of the perimeter
geometry = feature.GetGeometryRef()
poly = geometry.GetGeometryRef(0)
centroid = geometry.Centroid()
centroid_geo = centroid.GetX(), centroid.GetY()

# Define the Minimum Bounding Rectangle (MBR)
lg_left, lg_right, lt_bot, lt_up = poly.GetEnvelope()
ul_geo, lr_geo = (lg_left, lt_up), (lg_right, lt_bot)

Page 131: Tese phd

71

Spatial Operations

Page 132: Tese phd

Organicer

72
