Capabilities Brief: Analytics
Data Tactics Corporation Proprietary and Confidential Material
Data Engineering
• Data Architecture Design and Development
• Large-Scale Enterprise Architecture and Design
• Migrate, Extract, Transform, and Load Data
• Spatial, Multi-Domain, and Cloud-Based Data Services

Analytics – Quantitative
• Data Transformation and Ingestion
• Dissemination and Reporting Tools
• Data Mining, Exploitation, and Correlation Tools
• Spatial Data Mining and Geographic Knowledge Discovery
DT Core Analytical Competencies
The Team:
Graduates of top-tier universities including Stanford, Caltech, and MIT, with ties to these and local universities.
Degrees include Mathematics, Computer Science, Aeronautical Engineering, Astrophysics, Electrical Engineering, Mechanical Engineering, Statistics, and Social Sciences.
Competencies include data mining, machine learning, statistics, spatial statistics, Bayesian statistics, econometrics, computational geometry, spatial econometrics, applied mathematics, theoretical robotics, dynamic systems, and control theory.
Foci include unsupervised cross-modal clustering algorithms, principal component analysis, independent component analysis, regression, spatial regression, geographically weighted regression, zeroth-order processing, nonlinear optimization, autoregressive models, time-series analysis, spatial regime models, and HAC models.
Technical competencies include Python, MATLAB, R, and C/C++.
Data Tactics Analytics Cell
Analytics Competencies
• Time Series Analytics (i)
  • Applying the ARIMA model in a parallelized environment to provide anomaly detection
• Correlation Analytics (ii)
  • Brute-force pairwise Pearson's correlation over vectors in a cloud-backed engine
• Aggregation Analytics (iii)
  • Aggregate micro-pathing: repurposing data to analyze and display movement patterns
  • Dwell time calculations: an analytic to discover areas of interest based on movement activity
• Graph Analytics (iv)
  • Discovering social interaction models and paradigms within network data
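The dwell-time analytic can be illustrated with a small sketch. Everything below is illustrative, not the production analytic: the grid size, record schema, and function name are assumptions. Points are snapped to a lat/lon grid, and the time between consecutive same-cell observations of a user accumulates as dwell for that cell.

```python
from collections import defaultdict

def dwell_times(points, cell_deg=0.001):
    """Accumulate per-cell dwell time (seconds) from timestamped point data.

    points: list of (user_id, unix_ts, lat, lon), assumed sorted by
    (user_id, unix_ts).  cell_deg is a hypothetical grid cell size in degrees.
    """
    totals = defaultdict(float)
    prev = {}  # user_id -> (last timestamp, last cell)
    for user, ts, lat, lon in points:
        cell = (round(lat / cell_deg), round(lon / cell_deg))
        if user in prev:
            p_ts, p_cell = prev[user]
            if p_cell == cell:            # still inside the same cell:
                totals[cell] += ts - p_ts  # count the elapsed time as dwell
        prev[user] = (ts, cell)
    return dict(totals)
```

Cells with large accumulated dwell are the candidate areas of interest.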
[Figures (i)-(iv): example visualizations for each analytic]
• Directional Spatio-Temporal Analytics (i)
  • Compare distributions over time and space, focusing on changes in the morphology of the distribution and the mobility of individual observations within it over the same period (Wy)
• Local Classification (ii)
  • Non-self-similarities and self-similarities; within- and between-group correlations
• Ecological Analytics
  • Regression modeling
  • Spatial regression
  • Spatial regime models
  • HAC models
[Figures (i), (ii): example visualizations]
Data Tactics Data Repository
• Proxy problem definition – Different problems lead to different questions, which lead to different data sets. The acceptability of a data source is judged against the definition of the proxy problems.
• Key dimensions of variability – Key dimensions such as time, space, and identifier were targeted for collection. However, different proxy problems require different key dimensions.
• Capturing scope – The following was explicitly captured:
  • Data structure (e.g., graph relationship data vs. graph transaction data vs. dimensional data)
  • Data timespan (if time is a dimension)
  • Data geospatial footprint (if geospatial is a dimension)
  • Data volume (both in total GB and in total number of rows)
  • Dataset overlap
• Capturing opinions – Current star ratings are based on:
  • Data consistency, volume, and persistence
  • Data coverage (time and space)
  • Data precision (time and space)
  • Data "genuineness" (synthesized data is penalized)
  • Data distribution (i.e., we may have extremely precise geospatial data, but if there are only 40 unique geospatial points in the data, the geospatial aspects aren't that interesting)
  • Data dimensionality (higher dimensionality with reasonable distributions on each dimension is preferred)
Quantitative Data Competencies
Attributes captured for each data source:
• Name of the data source
• Initial reviewer opinion of data quality
• Description and notes on the data source, as well as collection information
• Collection start / end dates – if known
• Geospatial coverage
• Source where the data was acquired
• Data handling requirements
• Date that statistics were last collected on the data
• Size of data (storage space and rows)
• Location of data on the FTP site
• Data format
Quantitative Data Holdings
Armed Conflict Location and Events Dataset (ACLED), AIS Ship Data, Atmospherics Reports, BrightKite Data, Classified Ads, CNN, Digital Terrain Elevation Data (DTED), Enron Data, Epinions Data, EU Email, Facebook, Flickr Data, Flight Information Data, Four Square Data, Friend Feed Data, Geolife Data, Gowalla Data, International Conference on Weblogs and Social Media (ICWSM) Data, Identica Data, IMDB Data, Knowledge Discovery and Data Mining (KDD) Tools Competition, KDD 2003 Data, KDD 2005 Data, Kiva Data, Landscan Data, LiveJournal Data, Meme Tracker, Meme Twitter TS, NFL Plays, Night Lights Data, Open Data Air Traffic Accidents, Open Street Maps, Panoramio Data, Patent Citations Data, Photobucket Data, Picasa Web Albums Data, Processed Employment Data, Scamper Data, ISVG, Twitter, UNDP, Weather Data, Webgraphs, YouTube
Panoramio / Flickr – Metadata on uploaded public photos provides excellent geospatial and temporal resolution, along with user information. Estimated 250 million rows of photo metadata, with over 150 million already gathered.
AIS – Ship tracking data that provides ship 'pings' as they move. Precise time and geospatial information is provided. 50 million records and counting.
OpenStreetMaps – Over 2 billion geospatial points of mapping enthusiasts' tracks across the world. Time and user ID information is also included.
Gowalla / Brightkite – About 11 million FourSquare-style check-ins with user, location, and time information.
Example proxy problems:
• Discovering "holes" in the data where photos are no longer taken, to detect avoided areas
• Discovering relationships and links based on co-occurrence between users in time / space
• Tracking and analyzing movement patterns on a local and global scale
• Analyzing image data for changes in the same locations
• Detecting differences in photo activity in an area over time
• Detecting events based on abnormal photo activity behavior
• Mapping user IDs across data sources to create a unified analytic picture
• Detecting the home range for each user
• Defining patterns of life by routine activities and movement
• Tracking language usage in areas to determine abnormal language presence in an area
• Local vs. tourist movement analysis and extraction
• Trending of location popularity
UNCLASSIFIED
Twitter – Sampled ongoing collection of social media tweets with user ID and time. Some even have precise location data, but this is not the norm. Collection pulls roughly 1-2 million tweets / day.
Example proxy problems:
• Discovery of crowd-sourced phenomena (e.g., people posting to beware of a certain neighborhood)
• Discovery of correlated trends (e.g., finding that people posting about a certain topic in an area correlates with higher crime in that area)
• Tracking sentiment on certain topics and issues
• Tracking language usage in areas to determine abnormal language presence in an area
• How can we infer movement patterns from vast amounts of what appears to be just point data, collected in time and associated with an identifier (i.e., user ID / bank account / etc.)?
• The technique is applicable to Twitter, FourSquare, and MANY other sources.
[Figure: volume plot of photos binned by area on a log scale; Paris as seen from Flickr over all time]
[Figure: three example micro-paths (Person A, Person B, Person C), with roughly 3-10 seconds between successive points]
1. Goal: to catch active movement between locations a small distance apart
2. Typically two to around a dozen points chained together
3. Located in a small area, but with a definite path through the area
4. Sampled in rapid succession (less than X seconds between points)
5. Thousands or millions of micro-paths make a full path to view
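These rules lend themselves to a single segmentation pass over one identifier's points. The sketch below is illustrative only, not the production implementation: the brief leaves the exact time threshold as X, so the 60-second timeout here merely echoes the Paris example elsewhere in the brief, and planar coordinates in feet are assumed to keep the distance math simple.

```python
import math

def micro_paths(points, max_gap_s=60, max_speed_mph=80, max_len=12):
    """Chain timestamped points into micro-paths.

    points: list of (unix_ts, x_ft, y_ft) for one identifier, sorted by time.
    A segment is dropped when the time gap exceeds max_gap_s or the implied
    speed exceeds max_speed_mph; a chain longer than max_len is split.
    """
    FT_PER_S_TO_MPH = 3600.0 / 5280.0
    paths, current = [], []
    for pt in points:
        if current:
            t0, x0, y0 = current[-1]
            t1, x1, y1 = pt
            dt = t1 - t0
            dist = math.hypot(x1 - x0, y1 - y0)
            speed = (dist / dt) * FT_PER_S_TO_MPH if dt > 0 else float("inf")
            if dt > max_gap_s or speed > max_speed_mph or len(current) >= max_len:
                if len(current) >= 2:   # keep only real chains (2+ points)
                    paths.append(current)
                current = []
        current.append(pt)
    if len(current) >= 2:
        paths.append(current)
    return paths
```

Overlaying the returned chains (rather than raw points) is what produces the aggregate movement picture.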
A micro-path example
Photo timestamps, segment distances, and implied speeds along one chain (all photos taken 2012-08-15):
• 12:34:59 → 12:35:11, 20 ft: 1.13 MPH
• 12:35:11 → 12:35:25, 25 ft: 1.22 MPH
• 12:35:25 → 12:37:25, 33 ft: 0.18 MPH – segment ignored: 120 seconds between points
• 12:37:25 → 12:37:35, 23 ft: 1.57 MPH
• 12:37:35 → 12:37:46, 2500 ft: 154 MPH – segment ignored: velocity too fast
Overlay thousands / millions of these tiny micro-paths together and you get a common path pattern forming.
View of Paris using a 60-second segment timeout and 80 km/hour cutoff on Flickr data. Annotations:
• Arc de Triomphe – apparent typical approach pathway to the Arc
• Place de la Concorde – typically approached from a southern direction
• Eiffel Tower – a red strip appears to be the line of sight to the tower
• Notre Dame – harder to see, but the typical approach / exit pathways from Notre Dame are visible
• Louvre
Aggregate micro-pathing on a world of photo metadata with no speed, time, or distance restrictions
AIS ship tracking micro-path blanket with no time / space filters. Visible features include the coast of Taiwan, China's coast with high levels of activity, and Japan's south coast.
Flickr Paris 2004 changes vs 2005
Flickr Paris 2011 changes vs 2010
Hh [HIGH, high] – an increase between Xt1 -> Xt2 relative to the respective (Xt1, Xt2) reference distribution, where t1, t2 belong to T. HIGH reflects a strong increase in one's own value (dxi) at location i between t1 and t2 relative to the change in neighboring values (dy). high reflects a modest increase in dy relative to the values of dx.
lL [low, LOW] – a decrease between Xt1 -> Xt2 relative to the respective (Xt1, Xt2) reference distribution, where t1, t2 belong to T. low reflects a modest decrease in one's own value (dxi) at location i between t1 and t2 relative to the change in neighboring values (dy). LOW reflects a strong decrease in the neighboring values of dx.
In both cases, neighbors are defined with the spatially lagged variable Wy, as the eight nearest observations.
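A minimal sketch of this scheme, under stated simplifications: dx is the change at each location, dy its spatial lag (the mean change over the k nearest neighbors, eight per the definition), and each location gets a single HIGH/high/low/LOW label by whether its own change is strong or modest relative to the lag. This collapses the paired Hh / lL labels to the own-value half and is an illustrative reading, not the production analytic.

```python
import math

def classify_changes(coords, x_t1, x_t2, k=8):
    """Label each location's change relative to its spatial lag.

    coords: list of (x, y) positions; x_t1 / x_t2: values at times t1 and t2.
    The spatial lag dy_i (Wy) is the mean change over the k nearest neighbors.
    Returns 'HIGH'/'high' for increases that are strong/modest relative to
    the lag, and 'low'/'LOW' for modest/strong decreases.
    """
    dx = [b - a for a, b in zip(x_t1, x_t2)]
    labels = []
    for i, (xi, yi) in enumerate(coords):
        neigh = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.hypot(coords[j][0] - xi, coords[j][1] - yi),
        )[:k]
        dy = sum(dx[j] for j in neigh) / len(neigh)  # spatially lagged change
        if dx[i] >= 0:
            labels.append("HIGH" if dx[i] > dy else "high")
        else:
            labels.append("LOW" if dx[i] < dy else "low")
    return labels
```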
[Figure: number of distinct photographers in Paris by day in year. Recurrent red strips show the recurring weekend; Bastille Day stands out, and New Year provides lots of photos.]
[Figure: number of distinct photographers in Caracas by day in year. The 5-day Carnival celebration stands out, along with some interesting dates for low-volume activity. Image from www.flickr.com/photos/globovision/6911554143]
Airline Flight Data Anomaly Detection
Plot of the count of points where the difference between the expected number of flights leaving an airport (based on the model) and the actual observed number of flights was statistically significant.
During an unusual event, such as the winter storm shown below, the ARIMA model still follows the pattern but doesn't match as well. The areas where the red and black lines don't match are where unusual events have occurred.
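The detector compares model forecasts to observed counts. As a self-contained stand-in for the ARIMA fit (which the brief does not reproduce), the sketch below forecasts each point as a trailing-window mean and flags residuals beyond a z-score threshold; the window and threshold values are illustrative assumptions.

```python
import statistics

def flag_anomalies(series, window=7, z_thresh=3.0):
    """Flag points that deviate strongly from a trailing forecast.

    Each point is forecast as the mean of the previous `window`
    observations; a point is flagged when its residual exceeds
    z_thresh trailing standard deviations.  Returns flagged indices.
    """
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        forecast = statistics.fmean(hist)   # trailing-mean forecast
        sd = statistics.pstdev(hist)
        resid = series[i] - forecast
        if sd > 0 and abs(resid) / sd > z_thresh:
            flagged.append(i)
    return flagged
```

A proper ARIMA fit would replace the trailing mean with the model's one-step-ahead forecast; the flagging logic stays the same.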
Raw data file – each line is a comma-separated record:
key1, timestamp, value
key2, timestamp, value
…

A cloud-backed transformation produces the vector file – each line has a key and a comma-separated list of values:
key1  2.4, 3.4, 0.99, …
key2  3.4, 4.3, 1.0, 0.6, …
…
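In memory, the transformation can be sketched as follows. This is a toy stand-in for the cloud-backed job (key names follow the example lines; the helper name is an assumption): records are bucketed by key and their values ordered by timestamp, one vector per key.

```python
from collections import defaultdict

def to_vectors(lines):
    """Group raw `key,timestamp,value` records into per-key value vectors."""
    buckets = defaultdict(list)
    for line in lines:
        key, ts, value = line.strip().split(",")
        buckets[key].append((float(ts), float(value)))
    return {
        key: [v for _, v in sorted(pairs)]  # order values by timestamp
        for key, pairs in buckets.items()
    }
```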
       key1   key2   key3   key4
key1   -      0.93   0.43   0.001
key2   -      -      -0.5   -0.03
key3   -      -      -      0.32
key4   -      -      -      -
Correlation analytic – for each vector, calculate the correlation to each other vector. We use a Pearson correlation.
Implemented in:
• Python (RAM)
• Hive
• Mahout
• Spark
• Giraph
• Cascalog
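The brute-force variant (the in-RAM Python flavor) can be sketched as below; function and key names are illustrative, and only the upper triangle of the symmetric matrix is computed, matching the table above.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def pairwise_correlations(vectors):
    """Brute-force upper-triangle Pearson matrix over a dict of vectors."""
    return {
        (k1, k2): pearson(vectors[k1], vectors[k2])
        for k1, k2 in combinations(sorted(vectors), 2)
    }
```

The cloud implementations distribute exactly this O(n²) pair loop across workers.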
[Diagram: training and test engines running on Spark]
Approximation engine for the O(n²) correlation matrix problem.
The approximation provides an orders-of-magnitude speedup compared to equivalent brute-force methods. The technique works best for highly correlated items and uses a series of data projections, unsupervised learning, and vector quantization to provide dimensionality reduction for incoming complex vectors.
Technique based on Google Correlate
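The brief does not spell out the algorithm, so the following is one plausible sketch in the same spirit: random-hyperplane projections bucket standardized vectors so that only colliding pairs are sent to an exact Pearson pass, pruning the O(n²) pair space. All names and parameters are illustrative assumptions; this is not a reconstruction of the production engine or of Google Correlate's actual method.

```python
import math
import random

def approx_correlated_pairs(vectors, n_bits=8, seed=0):
    """Candidate-pair generation for the O(n^2) correlation problem.

    Each standardized vector is hashed with random-hyperplane projections
    (SimHash style); only vectors sharing a hash bucket are returned as
    candidates for an exact correlation pass.  Highly correlated vectors
    tend to land in the same bucket.
    """
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def standardize(v):
        mu = sum(v) / len(v)
        sd = math.sqrt(sum((x - mu) ** 2 for x in v)) or 1.0
        return [(x - mu) / sd for x in v]

    buckets = {}
    for key, vec in vectors.items():
        z = standardize(vec)
        sig = tuple(sum(p * x for p, x in zip(plane, z)) >= 0 for plane in planes)
        buckets.setdefault(sig, []).append(key)

    # Emit only colliding pairs: the candidates for an exact Pearson pass.
    return [
        (a, b)
        for keys in buckets.values()
        for i, a in enumerate(keys)
        for b in keys[i + 1:]
    ]
```

Vector quantization (as named in the slide) would play a similar role to the hash buckets here: collapsing many similar vectors onto shared representatives before any exact comparison.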