ieee iri 2016 lucene geo gazetteer



Page 1: IEEE IRI 2016 lucene geo gazetteer

July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA

Madhav Sharan

Dr. Chris Mattmann

An Automatic Approach for Discovering and Geocoding Locations in Domain-Specific Web Data

Information Retrieval and Data Science

Page 2: IEEE IRI 2016 lucene geo gazetteer

OUTLINE

• Introduction
• Overview of ecosystem
• Data Flow
  • GeoTopic Identification
  • Identifying location names
  • Lucene Geo Gazetteer
    • Geonames Features
    • Evolution of Ranking Algorithm
• Evaluation and results
• Challenges
• Conclusion and Future Work

Page 3: IEEE IRI 2016 lucene geo gazetteer

Introduction

• Aim: discover geospatial locations in WWW data culled from diverse domains.

• Locations can be present in the text or the metadata of a file.

• We present the overall approach and then explain in detail the Lucene Geo Gazetteer, which is derived from the Geonames.org dataset.

Page 4: IEEE IRI 2016 lucene geo gazetteer

Overview of ecosystem

• The WWW is home to a rich diversity and large volume of data across many domains and disciplines.

• In MEMEX we have collected web pages and textual data from many domains, including Materials Research and Autonomous Robots.

• There is no single field or textual pattern for discerning location information in all domains.

Page 5: IEEE IRI 2016 lucene geo gazetteer

Example dataset

Source: AMD

Science Topic: Agriculture, Atmosphere, Biological Classification, Biosphere, Climate Indicators, Cryosphere, Human Dimensions, Land Surface, Oceans, Paleoclimate, Solid Earth, Terrestrial Hydrosphere, Spectral Engineering, Sun-Earth Interactions

Location-related field:
1. Location Keywords field in Metadata
2. Sometimes mentioned in the title and Summary fields
3. Spatial coordinates provided

Notes:
1. Continent level and Geographic Region
2. Island, sometimes specific
3. Multi-values
4. Not missing in Location Keywords

Source: ACADIS

Science Topic: Agriculture, Atmosphere, Biological Classification, Biosphere, Climate Indicators, Cryosphere, Human Dimensions, Land Surface, Oceans, Paleoclimate, Solid Earth, Terrestrial Hydrosphere

Location-related field:
1. Location(s) field in Metadata
2. Location mentioned in Description field, often in the first few sentences or last few sentences. Sometimes missing.
3. Longitude and Latitude provided in the Metadata.
4. Sometimes mentioned in the title

Notes:
1. Province/State level or Geographic Region
2. Multi-values
3. Not missing in Location

Locations mentioned in the NSF TREC-DD-Polar dataset

Page 6: IEEE IRI 2016 lucene geo gazetteer

Data Flow

Crawled website data

Extract text and metadata

Discover location entities

Find location in gazetteer

Lucene Geo Gazetteer
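The flow above can be sketched as a chain of small functions. This is only an illustration under loud assumptions: `extract_text`, `discover_locations`, and `geocode` are hypothetical stand-ins invented here, the regex-based stages are toys, and the in-memory dictionary with approximate coordinates merely plays the role of the gazetteer lookup.

```python
import re
from typing import Iterable

# Hypothetical stand-ins for the stages on the slide; none of these names
# come from the presentation itself.

def extract_text(raw_html: str) -> str:
    """Crudely strip tags; a real pipeline would use a content-extraction toolkit."""
    return re.sub(r"<[^>]+>", " ", raw_html)

def discover_locations(text: str, known_names: Iterable[str]) -> list[str]:
    """Toy entity discovery: report known place names that occur as whole words."""
    return [name for name in known_names
            if re.search(rf"\b{re.escape(name)}\b", text)]

def geocode(raw_html: str, gazetteer: dict[str, tuple[float, float]]) -> dict:
    """Crawled data -> text -> location entities -> gazetteer lookup."""
    text = extract_text(raw_html)
    return {name: gazetteer[name] for name in discover_locations(text, gazetteer)}

# Approximate, illustrative coordinates only.
toy_gazetteer = {"Pasadena": (34.15, -118.14), "Pittsburgh": (40.44, -80.0)}
print(geocode("<p>Workshop held in <b>Pittsburgh</b>, PA.</p>", toy_gazetteer))
```

Keeping each stage as a separate function mirrors the boxes on the slide, so any one stage can be swapped out without touching the others.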

Page 7: IEEE IRI 2016 lucene geo gazetteer

Lucene Geo Gazetteer

Uses the indexed Geonames.org dataset to match location names to latitude and longitude.

Indexes the Geonames.org dataset into a Lucene-based index.

Ranks matched results using Geonames features such as feature class, feature code and population.

Page 8: IEEE IRI 2016 lucene geo gazetteer

Geonames Features

• Name - Legal name of a location.

• Alternate names - All other names by which a location is known. Alternate names are stored as a comma-separated list of all pronunciation variants and synonyms for a location.

• Feature class - A high-level bucket for similar locations. This bucket distinguishes continents and countries from cities and villages.

• Feature code - A more granular bucket that identifies a location as, for example, a country, state, capital or city.

• Population, Latitude, Longitude, Country code, Admin1 code, Admin2 code
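As a rough illustration of how these fields can be read, the sketch below parses one row of the tab-separated Geonames main dump, which has 19 columns per row. The `GeoRecord` class and `parse_geonames_line` function are hypothetical names chosen for this example, and the sample row uses made-up values rather than a real entry from the dump.

```python
from dataclasses import dataclass

@dataclass
class GeoRecord:
    name: str
    alternate_names: list[str]
    latitude: float
    longitude: float
    feature_class: str
    feature_code: str
    country_code: str
    admin1_code: str
    admin2_code: str
    population: int

def parse_geonames_line(line: str) -> GeoRecord:
    """Pick the fields listed on the slide out of one tab-separated dump row."""
    cols = line.rstrip("\n").split("\t")
    return GeoRecord(
        name=cols[1],
        # Alternate names arrive as one comma-separated string.
        alternate_names=[a for a in cols[3].split(",") if a],
        latitude=float(cols[4]),
        longitude=float(cols[5]),
        feature_class=cols[6],
        feature_code=cols[7],
        country_code=cols[8],
        admin1_code=cols[10],
        admin2_code=cols[11],
        population=int(cols[14] or 0),
    )

# Illustrative row with made-up values (not copied from the real dump).
SAMPLE_LINE = "\t".join([
    "0", "Pasadena", "Pasadena", "Pasadina,PAS", "34.15", "-118.14",
    "P", "PPL", "US", "", "CA", "037", "", "", "140000", "", "",
    "America/Los_Angeles", "2016-01-01",
])
print(parse_geonames_line(SAMPLE_LINE))
```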

Page 9: IEEE IRI 2016 lucene geo gazetteer

Evolution of Ranking Algorithm

The naive version:

1. Search for the input phrase in the indexed name and alternate name fields and get the top N results.
2. Calculate the edit distance between the input phrase and the name and alternate names.
3. Rank results in order of lowest edit distance between the input phrase and the name and alternate name fields.
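A minimal sketch of steps 2 and 3, assuming that "edit distance" means Levenshtein distance (the slides do not say which variant was used). The `naive_rank` function and the dictionary layout of a candidate are hypothetical, and the `hits` list stands in for the top N results that step 1 would fetch from the index.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def naive_rank(phrase: str, candidates: list[dict]) -> list[dict]:
    """Order candidates by their smallest edit distance over name and alternate names."""
    def best_distance(candidate: dict) -> int:
        names = [candidate["name"], *candidate.get("alternate_names", [])]
        return min(levenshtein(phrase, n) for n in names)
    return sorted(candidates, key=best_distance)

# Stand-in for the top N hits returned by the index search in step 1.
hits = [
    {"name": "Chino", "alternate_names": []},
    {"name": "People's Republic of China",
     "alternate_names": ["Cayina", "Chaina", "China", "Chine"]},
]
print([h["name"] for h in naive_rank("China", hits)])
```

Because the distance is taken over the name and every alternate name, a record with many alternate names has more chances to land a close match.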

Page 10: IEEE IRI 2016 lucene geo gazetteer

Evolution of Ranking Algorithm

The naive version, results and observations:

• Poor results: only 1 in 50 locations gave the correct result.

• There are many locations with similar names. The string “Pasadena” has 250 matches and “Portland” has 897 matches in the dataset.

• Popular places have a greater number of alternate names.

Page 11: IEEE IRI 2016 lucene geo gazetteer

Evolution of Ranking Algorithm

First improvement:

1. Add multiple sorting criteria to the Lucene query, in the order “feature class”, “feature code” and “population”.
2. Fetch the top N results under these criteria.
3. Add the edit distance between the input phrase and the name and all alternate names.
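Steps 1 and 2 can be approximated in memory with a composite sort key. This is a sketch under assumptions: in the tool the sort is part of the Lucene query, whereas here it is a plain Python sort, and the two priority tables are hypothetical because the slides name the sort fields but not how their values are ordered. Step 3 (edit distance) is left out.

```python
# Hypothetical priority tables: lower number sorts first.
FEATURE_CLASS_PRIORITY = {"A": 0, "P": 1}  # A = administrative, P = populated place
FEATURE_CODE_PRIORITY = {"PCLI": 0, "ADM1": 1, "ADM2": 2, "PPLC": 3, "PPLA": 4, "PPL": 5}
LOWEST = 99  # anything not listed sorts last

def sort_key(record: dict) -> tuple:
    """Feature class first, then feature code, then larger population first."""
    return (
        FEATURE_CLASS_PRIORITY.get(record["feature_class"], LOWEST),
        FEATURE_CODE_PRIORITY.get(record["feature_code"], LOWEST),
        -record["population"],
    )

def top_n(records: list[dict], n: int) -> list[dict]:
    return sorted(records, key=sort_key)[:n]

# Made-up records for illustration.
records = [
    {"id": "small city", "feature_class": "P", "feature_code": "PPL", "population": 1_000},
    {"id": "large city", "feature_class": "P", "feature_code": "PPL", "population": 90_000},
    {"id": "county", "feature_class": "A", "feature_code": "ADM2", "population": 50_000},
]
print([r["id"] for r in top_n(records, 2)])
```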

Page 12: IEEE IRI 2016 lucene geo gazetteer

Evolution of Ranking Algorithm

Observations:

• More weight to the name field, as it is the legal name of a location, e.g. China is stored with the name “People’s Republic of China” and its alternate names include “... Cayina,Chaina,China,Chine, ..”

• We cannot rely on edit distance alone:
  • Edit distance between “Los Cabos” and “Los Angeles” == 6
  • Edit distance between “Los Angeles County” and “Los Angeles” == 7
  The unrelated “Los Cabos” is therefore ranked closer to “Los Angeles” than “Los Angeles County”, which contains the whole phrase.

• Results sorted only by population are very relevant, and query time is improved.

Page 13: IEEE IRI 2016 lucene geo gazetteer

Evolution of Ranking Algorithm

One more try:

Pass 1
1. Get the top N * 3 results for the input phrase from the Lucene index, sorted by population.
2. Sort the results by feature code.
3. Pass on the top N results.

Pass 2
4. Assign a score if the input phrase is exactly a token in the name, e.g. “Los Angeles” in “Los Angeles County”.
5. Assign a lesser score if the input phrase is only part of a token in the name, e.g. “India” in “Indiana”.
6. Calculate the edit distance of the input phrase to all alternate names.
7. Assign a weight as a function of the count of alternate names.
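The sketch below covers Pass 1 and steps 4, 5 and 7 of Pass 2; step 6 (edit distance against alternate names) is omitted. Everything numeric here is an assumption: the slides give no score values, no feature-code ordering, and no formula for combining the parts, so the constants, the `log1p` weight and the simple sum are hypothetical choices, as are the function names.

```python
import math
import re

EXACT_TOKEN_SCORE = 2.0    # hypothetical: phrase matches whole token(s) of the name
PARTIAL_TOKEN_SCORE = 1.0  # hypothetical: phrase is only part of a token
FEATURE_CODE_PRIORITY = {"PCLI": 0, "ADM1": 1, "ADM2": 2, "PPLC": 3, "PPLA": 4, "PPL": 5}

def pass_one(hits: list[dict], n: int) -> list[dict]:
    """Steps 1-3: top N*3 by population, re-sort by feature code, keep top N."""
    by_population = sorted(hits, key=lambda h: h["population"], reverse=True)[: n * 3]
    # sorted() is stable, so population order survives within one feature code.
    by_code = sorted(by_population,
                     key=lambda h: FEATURE_CODE_PRIORITY.get(h["feature_code"], 99))
    return by_code[:n]

def name_score(phrase: str, name: str) -> float:
    """Steps 4-5: full score for a whole-token match, lesser score for a partial one."""
    if re.search(rf"\b{re.escape(phrase)}\b", name, flags=re.IGNORECASE):
        return EXACT_TOKEN_SCORE
    if phrase.lower() in name.lower():
        return PARTIAL_TOKEN_SCORE
    return 0.0

def pass_two(phrase: str, hits: list[dict]) -> list[dict]:
    """Step 7 folded in: add a weight that grows with the number of alternate names."""
    def score(hit: dict) -> float:
        return name_score(phrase, hit["name"]) + math.log1p(len(hit["alternate_names"]))
    return sorted(hits, key=score, reverse=True)

print(name_score("Los Angeles", "Los Angeles County"))  # whole-token match
print(name_score("India", "Indiana"))                    # partial-token match
```

Over-fetching N * 3 results in Pass 1 leaves room for the feature-code sort to promote a less populous but more significant place into the final top N.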

Page 14: IEEE IRI 2016 lucene geo gazetteer

Evaluation and Results

We tested our improved algorithm on a diverse set of 50,000+ locations, including countries, states, districts, cities, towns and villages, from the DARPA MEMEX and NSF Polar domains.

Comparison against our ground truth data gave an overall accuracy of 94%.

Page 15: IEEE IRI 2016 lucene geo gazetteer

Challenges

• Limitations of the open-source Geonames.org dataset. More features would help:
  • Popular name, e.g. “Argentina” for “Argentine Republic”
  • Area of a location

• No one best result:
  • Pasadena is in both California and Texas

• The dataset is huge and diverse, with entries ranging from parks to continents. Performance improves if the dataset is reduced to, e.g., continents, countries, states, counties, capitals and cities.
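One way to reduce the dataset as described in the last bullet is to filter the dump by feature code before indexing. The sketch below is an assumption-laden illustration: `reduce_dump` is a hypothetical helper, and the mapping from the categories on the slide to Geonames feature codes (CONT, PCLI, ADM1, ADM2, PPLC, PPL) is a guess at a reasonable subset, not the authors' actual selection.

```python
from typing import Iterable, Iterator

# Assumed mapping of the slide's categories to Geonames feature codes:
# continent, country, state, county, capital, city.
KEEP_FEATURE_CODES = {"CONT", "PCLI", "ADM1", "ADM2", "PPLC", "PPL"}

FEATURE_CODE_COLUMN = 7  # position of "feature code" in the tab-separated dump

def reduce_dump(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the dump rows whose feature code is in the keep-list."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) > FEATURE_CODE_COLUMN and cols[FEATURE_CODE_COLUMN] in KEEP_FEATURE_CODES:
            yield line
```

Since the function accepts any iterable of lines, it can be fed an open file object and its output written straight to a smaller file, without loading the full dump into memory.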

Page 16: IEEE IRI 2016 lucene geo gazetteer

Conclusion and Future Work

• Early results from this effort are promising but a number of avenues of future work remain unexplored.
  • Automatically augment Geonames.org with crowd-sourced popular place names.
  • Geospatial coordinate data beyond points, e.g. bounding boxes and other shapes, to allow for more meaningful and spatially directed queries.

Page 17: IEEE IRI 2016 lucene geo gazetteer

ACKNOWLEDGMENTS

• This work was supported by the DARPA XDATA/Memex program.

• NSF Polar Cyberinfrastructure award numbers PLR-1348450 and PLR-144562 funded a portion of the work.

• Effort supported in part by JPL, managed by the California Institute of Technology on behalf of NASA.

Page 18: IEEE IRI 2016 lucene geo gazetteer

• Lucene Geo Gazetteer: https://github.com/chrismattmann/lucene-geo-gazetteer
• Geonames dataset: https://www.geonames.org
• Apache Tika: https://tika.apache.org/
• Apache OpenNLP: https://opennlp.apache.org/
• Apache Lucene: http://lucene.apache.org
• Memex: http://memex.jpl.nasa.gov/

More about us on GitHub:

Madhav Sharan: smadha

Dr. Chris Mattmann: chrismattmann

THANK YOU