
VIStology, Inc.
Brian Ulicny, Jakub Moskal, Mieczyslaw M. (Mitch) Kokar (Northeastern)

Semantic Technologies for Intelligence, Defense and Security (STIDS 2013)

George Mason University

November 13, 2013

Situational Awareness from Social Media

About VIStology, Inc.

Past and present R&D contracts: ONR, Army RDECOM, AFOSR, AFRL, DARPA, MDA

Professional collaborations: Northeastern University, W3C, Lockheed Martin, OMG, Referentia Systems, BBN, Raytheon, Vulcan, Inc.

Products & Services:
- BaseVISor: highly efficient inference engine
- ConsVISor: consistency checker for ontologies
- PolVISor: policy-compliant information exchange
- HADRian: next-generation smart COP for HA/DR operations
- Ontology engineering and systems design

Areas of expertise: Level 2+ Information Fusion, Situation Awareness, Formal Reasoning Systems, Artificial Intelligence, Ontology Engineering, Object-Oriented Design


VIStology HADRian

HADRian is a next-generation COP for HA/DR operations. HADRian enables an Incident Commander to find, filter, geocode, and display the information that he or she needs to make the best decisions about the situation, on the basis of semantic annotation of information repositories.

Repositories may contain text reports, photos, videos, tracks (KML), chemical plumes, 3D building models (SketchUp), etc.
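As an illustration of what ontology-based repository annotation might look like, here is a minimal sketch in Python using rdflib. The namespace, class, and property names are hypothetical stand-ins, not HADRian's actual ontology.

```python
# Minimal sketch of ontology-based repository annotation; the hadr:
# namespace and all class/property names below are illustrative only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

HADR = Namespace("http://example.org/hadrian#")  # hypothetical namespace

g = Graph()
g.bind("hadr", HADR)

repo = URIRef("http://example.org/repos/boston-tweets")
g.add((repo, RDF.type, HADR.Repository))
g.add((repo, HADR.format, Literal("CSV")))
g.add((repo, HADR.contentType, HADR.TweetReport))
g.add((repo, HADR.coversEvent, Literal("Boston Marathon bombing")))
g.add((repo, HADR.collectedOn, Literal("2013-04-15", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```

With metadata of this sort in place, a reasoner can match a repository against a high-level query without touching the repository's contents.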


HADRian Concept of Operations: Find, Filter, Map, and Display Disaster Information

- The COP Operator annotates repositories using the ontology. Information from new repositories can be integrated by annotating their metadata.
- The COP Operator formulates a High Level Query that describes the information needs of the current operation.
- The system infers which repositories may contain relevant information by reasoning over the metadata with which each repository has been annotated. Information remains in place until needed (no ETL); users upload data wherever they want, and it is ingested as needed.
- The system issues appropriate low-level queries to the repositories, which contain disparate data in disparate formats.
- The system filters out irrelevant data.
- The system aggregates and displays the data in a Google Earth COP, and users interact with the data in the COP.
- The COP Operator can send information from the COP to a responder's phone.


Dataflow for HADRian Situation Awareness from Social Media


[Dataflow diagram] The Commander identifies information needs and conveys them to the COP Operator. The COP Operator performs ontology-based annotation of repository metadata and formulates high-level queries based on the Commander's needs; here, "Where are unexploded or additional bombs reported after the Boston Marathon?"

HADRian identifies relevant repositories based on their metadata and the high-level queries (here, only one repository is relevant) and processes the content of the relevant tweets:

1. Identifies tweets that report an additional or unexploded bomb.
2. Identifies where the bomb is reported.
3. Geolocates the reported location.
4. Anonymizes the reporter's identity.
5. Presents the information in the HADRian COP.

Twitter status updates flow in from civilians, media, and others. Field Operators and the Commander use the COP to make decisions. (The current state of the art, by contrast, is fixed feeds that can only be switched on or off.)


High Level Query: "Show me locations mentioned in tweets about the Boston Marathon and unexploded bombs"
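A high-level query of this kind might be rendered against the OWL-formatted tweets roughly as follows. This is a hypothetical sketch using rdflib's SPARQL support; the class and property names are made up for illustration and are not VIStology's actual ontology.

```python
# Hypothetical rendering of the high-level query as SPARQL over the
# OWL-formatted tweets; ontology names are illustrative, not VIStology's.
from rdflib import Graph

g = Graph()
g.parse("boston_tweets.owl")  # assumed OWL file of filtered tweets

HLQ = """
PREFIX hadr: <http://example.org/hadrian#>
SELECT ?phrase (COUNT(?tweet) AS ?mentions)
WHERE {
  ?tweet a hadr:Tweet ;
         hadr:reportsUnexplodedBomb true ;
         hadr:mentionsLocationPhrase ?phrase .
}
GROUP BY ?phrase
ORDER BY DESC(?mentions)
"""

for row in g.query(HLQ):
    print(row.phrase, row.mentions)
```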

Repository Annotation

Filter Processing Chain

The bulk of the 0.5 million tweets was collected on April 15, 2013, in the 3 hours after the Boston Marathon bombing, and stored in a single large CSV file. After this file is determined to be relevant:

1) Using a metadata description of the schema, together with rules, the 0.5M tweets were filtered to only those that mentioned unexploded or undetonated bombs, and converted into an OWL representation. This file was identified as being relevant to the HLQ "find locations mentioned in tweets about Boston Marathon and unexploded bombs at Boston Marathon on April 15, 2013". (A rough sketch of this filtering step appears after this list.)

2) The OWL-formatted tweets were loaded into BaseVISor along with rules. These rules extract information about locations mentioned in the tweets; we call these "location phrases", for instance: "jfk", "jfk library", "kennedy library", "#boston", etc.
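As a rough illustration of the filtering step in 1) above, a pattern-matching pass over the CSV might look like the following. The file name, column names, and regular expression are assumptions for illustration; the actual system drove the filtering from metadata-described schemas and rules rather than a hard-coded regex.

```python
# Minimal sketch of the pattern-matching filter over the tweet CSV.
# File name, column layout, and patterns are illustrative assumptions.
import csv
import re

BOMB_PATTERN = re.compile(
    r"\b(unexploded|undetonated|additional|another|more)\s+(bombs?|devices?)\b",
    re.IGNORECASE,
)

def filter_tweets(path):
    """Yield (tweet_id, text) for rows that mention additional/unexploded bombs."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            text = row.get("text", "")
            if BOMB_PATTERN.search(text):
                yield row.get("id", ""), text

for tweet_id, text in filter_tweets("boston_marathon_tweets.csv"):
    print(tweet_id, text)
```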

Location Phrases

INFO Initializing BaseVISor...
INFO There are 26679 asserted facts in the knowledge base.
INFO Initialization complete. Running inference...
INFO There is a total of 44414 facts in the knowledge base after running the inference.
INFO Done.

INFO 'mandarin hotel' -> 3
INFO 'bostonmarathon' -> 798
INFO 'bpd commissioner ed davis' -> 1
INFO 'back bay' -> 1
…
INFO 'jfk library' -> 311
INFO 'jfk' -> 73
INFO 'jfklibrary' -> 2
INFO 'jkf library' -> 1
…
INFO 'copley place' -> 8
…
INFO 'st. ignatius catholic church' -> 47
…

These are locations referenced in the tweet, NOT the location of the user, although the user's location can help disambiguate.

Mapping Location Phrases to Placemarks

For each extracted location phrase, a rule with a special procedural attachment is fired. The attachment takes the location phrase and tries to locate the place on the map using the following algorithm (sketched in code after the list):

1. (Places API, exact) Query the Google Maps Places API (https://developers.google.com/places) and check whether it returns an exact match for the phrase. If it does, use its lat/lon to assign the tweet to that location; otherwise, move to the next step.

2. (Geocoding API) Query the Google Geocoding API (https://developers.google.com/maps/documentation/geocoding) and check whether it returns an exact match for the phrase. If it does, use its lat/lon to assign the tweet to that location; otherwise, move to the next step.

3. (Places API, first on the list) Pick the first result returned by the Places API lookup from step 1. If there were no results, ignore the location phrase.
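The three-step fallback might be sketched as follows, using the Places Text Search and Geocoding REST endpoints of that era. The API key, the exact-match tests, and all names here are placeholder assumptions, not VIStology's implementation.

```python
# Sketch of the three-step geocoding fallback; the API key and the
# exact-match heuristics are placeholders for illustration.
import requests

API_KEY = "YOUR_GOOGLE_API_KEY"  # placeholder

def places_search(phrase):
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/place/textsearch/json",
        params={"query": phrase, "key": API_KEY},
    )
    return resp.json().get("results", [])

def geocode(phrase):
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/geocode/json",
        params={"address": phrase, "key": API_KEY},
    )
    return resp.json().get("results", [])

def locate(phrase):
    """Return (lat, lon, source) for a location phrase, or None."""
    places = places_search(phrase)
    # Step 1: exact name match among the Places results (illustrative test).
    for p in places:
        if p.get("name", "").lower() == phrase.lower():
            loc = p["geometry"]["location"]
            return loc["lat"], loc["lng"], "Places API, exact"
    # Step 2: non-partial match from the Geocoding API.
    for g in geocode(phrase):
        if not g.get("partial_match", False):
            loc = g["geometry"]["location"]
            return loc["lat"], loc["lng"], "Geocoding API"
    # Step 3: fall back to the first Places result, if any.
    if places:
        loc = places[0]["geometry"]["location"]
        return loc["lat"], loc["lng"], "Places API, first result"
    return None  # no known location for this phrase
```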

3 tweets mention Mandarin Hotel, 2 Copley Place, 1 Back Bay Station,…

Information in Placemark

We indicate the source of the location inside the placemark. (Here, Places API first result)

Each placemark contains:
- The corresponding location phrases
- A photo
- The number of tweets (1158)
- The place type (here, "library, museum, establishment")
- Sample tweets (not shown)

Placemark Size and Color

The number next to the placemark's name indicates the number of tweets that used one of the location phrases mapping to this location. The higher the number, the more tweets were talking about the same spot. We emphasize this fact by rendering polygons underneath the placemarks: the taller the polygon and the darker its color, the more frequently the location was mentioned. (A sketch of generating such a placemark appears below.)
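One way to produce such count-scaled placemarks for a Google Earth COP is with the simplekml library, as in this sketch. The coordinates, scaling factors, and color ramp are illustrative assumptions, not HADRian's actual rendering rules.

```python
# Sketch of rendering a count-scaled placemark for the Google Earth COP;
# coordinates, scaling, and colors are illustrative assumptions.
import simplekml

def add_placemark(kml, name, lat, lon, count, max_count):
    """Add a placemark plus an extruded polygon scaled/shaded by tweet count."""
    kml.newpoint(name=f"{name} ({count})", coords=[(lon, lat)])

    # Square footprint around the point; height and opacity grow with count.
    d = 0.001                                   # ~100 m half-width (arbitrary)
    height = 50 + 1000 * count / max_count      # assumed height scaling
    alpha = int(100 + 155 * count / max_count)  # darker when mentioned more
    pol = kml.newpolygon(
        outerboundaryis=[
            (lon - d, lat - d, height), (lon + d, lat - d, height),
            (lon + d, lat + d, height), (lon - d, lat + d, height),
            (lon - d, lat - d, height),
        ]
    )
    pol.extrude = 1
    pol.altitudemode = simplekml.AltitudeMode.relativetoground
    pol.style.polystyle.color = simplekml.Color.changealphaint(
        alpha, simplekml.Color.red
    )

kml = simplekml.Kml()
add_placemark(kml, "JFK Library", 42.3167, -71.0341, 1158, 1158)
add_placemark(kml, "Mandarin Hotel", 42.3480, -71.0857, 3, 1158)
kml.save("bomb_report_placemarks.kml")
```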

47 tweets mention St Ignatius Church at Boston College.

Relevant Tweet Retrieval (Finding) Evaluation

Corpus: ~500K tweets.

We identified 7,748 tweets that were about additional or unexploded bombs, with a precision of 94.5%, based on a random sample of 200 tweets identified as such. That is, only about 1.5% of the original corpus was identified as referring to additional bombs using our pattern matching.

Based on a random sample of 236 tweets from the original corpus, our recall (identification of tweets that discussed additional bombs) was determined to be 50%. That is, there were many more ways to refer to additional bombs than our rules considered.

Thus, our F1 measure for accurately identifying tweets about additional bombs was 65%. Nevertheless, because of the volume of tweets, this did not affect the results appreciably. (The F1 arithmetic is checked below.)
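For reference, F1 is the harmonic mean of precision and recall. The reported figures, both here and in the location-phrase evaluation that follows, check out:

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1(0.945, 0.50))   # ~0.654 -> the 65% reported for tweet retrieval
print(f1(0.95, 0.792))   # ~0.864 -> the 86.3% reported for location phrases
```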

Location Phrase (Filtering) Evaluation

Location phrases were identified purely by means of generic pattern matching. We did not use any list of known places, nor did we include any scenario-specific patterns.

The precision with which we identified location phrases was 95%. That is, in 95% of cases, a phrase we identified as a location phrase actually did refer to a location in that context. Mistakes included temporal references and references to online sites.

Our recall was only 51.3% if we counted the uses of #BostonMarathon that were locative. (We mishandled hashtags with camel case.) If we ignore this hashtag, then our recall was 79.2%. That is, of all the locations mentioned in tweets about additional bombs at the Boston Marathon, we identified 79.2% of the locations that were mentioned.

Using the more lenient standard, our F1 measure for identifying location phrases in the text was 86.3%.


Geocoding Evaluation

Our precision in associating tweets with known places via the Google APIs was 97.2%.

Our precision in assigning unique location phrases to known places via the Google APIs was 50%.

That is, many location phrases that were repeated several times were assigned correctly to a known place, but half of the unique phrase names that we extracted were not assigned correctly.

Ten location phrases that were extracted corresponded to no known locations identified via the Google APIs. These included location phrases such as "#jfklibrary" and "BPD Commissioner Ed Davis". The former is a phrase we would like to geolocate, but lowercase hashtags that concatenate several words are challenging. The latter is the sort of phrase that we would expect to be rejected as non-geolocatable.


Top 20 Locations by Frequency

Known Place                    # Tweets
JFK Library                    1158
Boston                         629
Boston Marathon                325
St Ignatius Catholic Church    47
PD                             29
Boylston                       8
CNN                            5
Copley Sq                      4
Huntington Ave                 4
Iraq                           3
Mandarin Hotel                 3
Dorchester                     3
Marathon                       3
US Intelligence                3
Copley Place                   2
Boston PD                      2
BBC                            2
Cambridge                      2
John                           2
St James Street #Boston        2


Comparing Locations Mentioned in Media Blogs


Location [Source]: # of tweets identified with that location

Boylston Street [Globe, CNN]: 8

Commonwealth Ave near Centre Street, Newton [Globe]: 0

Commonwealth Ave (Boston) [Globe]: 0

Copley Square [NYT]: 4

Harvard MBTA station [Globe]: 0

JFK Library [CNN, Globe, NYT]: 1158

Mass. General Hospital [Globe, NYT]: 0

(glass footbridge over) Huntington Ave near Copley place [Globe]: 4

Tufts New England Medical Center [NYT]: 0

Washington Square, Brookline [NYT]: 0 

For three of these sites (Mass. General Hospital, Tufts Medical Center, and Washington Square, Brookline), reports of unexploded bombs or suspicious packages occurred after the end of the tweet collection period, at 7:06 PM.

Otherwise, the recall of our system was good, missing only the report of unexploded bombs at the Harvard MBTA station; that report was in our corpus but was missed due to capitalization.

The media failed to report other locations that were prominent for us (e.g., St Ignatius Catholic Church).

Qualitative Evaluation

On average, tweets mentioning the same locations as the media blogs were produced 11 minutes before those locations were reported on the blog sites.

Thus, the tweet processing was more timely and more comprehensive (it included more locations) than relying on a handful of news sites alone for situational awareness.


Conclusion

We described a system for integrating disparate information sources into a COP for Humanitarian Assistance/Disaster Relief operations by means of semantic annotations and queries, using a common ontology.

We applied our technology to a repository of tweets collected in the immediate aftermath of the Boston Marathon bombings in April 2013, and demonstrated that a ranked set of places reported as the site of an additional or unexploded bomb could be incorporated into the COP, with the prominence of each site indicated by tweet volume.

We evaluated the results formally and compared them with the situational awareness that could be gleaned from the mainstream media blogs that were being updated at the same time.

On average, the automatic processing would have had access to locations from tweets eleven minutes before these sites were mentioned on the mainstream media blogs.

Additionally, sites that were prominent on Twitter (e.g. St Ignatius Church at Boston College or the Mandarin Oriental Hotel in Boston) were not mentioned on the news blog sites at all.

We believe that these results show that this approach is a promising one for deriving situational awareness from social media going forward.
