Capabilities Brief: Analytics
Data Tactics Corporation Proprietary and Confidential Material
Data Engineering
• Data Architecture Design and Development
• Large-Scale Enterprise Architecture and Design
• Migrate, Extract, Transform, and Load Data
• Spatial, Multi-Domain, and Cloud-Based Data Services

Analytics – Quantitative
• Data Transformation and Ingestion
• Dissemination and Reporting Tools
• Data Mining, Exploitation, and Correlation Tools
• Spatial Data Mining and Geographic Knowledge Discovery
DT Core Analytical Competencies
The Team:
Graduates of top-tier universities including Stanford, Caltech, and MIT, with ties to these and local universities.
Degrees include Mathematics, Computer Science, Aeronautical Engineering, Astrophysics, Electrical Engineering, Mechanical Engineering, Statistics, and Social Sciences.
Competencies include data mining, machine learning, statistics, spatial statistics, Bayesian statistics, econometrics, computational geometry, spatial econometrics, applied mathematics, theoretical robotics, dynamic systems, and control theory.
Foci include unsupervised cross-modal clustering algorithms, principal component analysis, independent component analysis, regression, spatial regression, geographically weighted regression, zeroth-order processing, nonlinear optimization, autoregressive models, time-series analysis, spatial regime models, and HAC models.
Technical competencies include Python, MATLAB, R, and C/C++.
Data Tactics Analytics Cell
Analytics Competencies
• Time Series Analytics (i)
  • Applying the ARIMA model in a parallelized environment to provide anomaly detection
• Correlation Analytics (ii)
  • Brute-force pairwise Pearson's correlation over vectors in a cloud-backed engine
• Aggregation Analytics (iii)
  • Aggregate micro-pathing: repurposing data to analyze and display movement patterns
  • Dwell time calculations: an analytic to discover areas of interest based on movement activity
• Graph Analytics (iv)
  • Discovering social interaction models and paradigms within network data
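The dwell-time analytic can be illustrated with a small sketch. Everything below is illustrative, not the production analytic: the grid size, record schema, and function name are assumptions. Points are snapped to a lat/lon grid, and the time between consecutive same-cell observations of a user accumulates as dwell for that cell.

```python
from collections import defaultdict

def dwell_times(points, cell_deg=0.001):
    """Accumulate per-cell dwell time (seconds) from timestamped point data.

    points: list of (user_id, unix_ts, lat, lon), assumed sorted by
    (user_id, unix_ts).  cell_deg is a hypothetical grid cell size in degrees.
    """
    totals = defaultdict(float)
    prev = {}  # user_id -> (last timestamp, last cell)
    for user, ts, lat, lon in points:
        cell = (round(lat / cell_deg), round(lon / cell_deg))
        if user in prev:
            p_ts, p_cell = prev[user]
            if p_cell == cell:            # still inside the same cell:
                totals[cell] += ts - p_ts  # count the elapsed time as dwell
        prev[user] = (ts, cell)
    return dict(totals)
```

Cells with large accumulated dwell are the candidate areas of interest.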
[Figures (i)-(iv): example visualizations for each analytic]
• Directional Spatio-Temporal Analytics (i)
  • Compare distributions over time and space, focusing on changes in the morphology of the distribution and the mobility of individual observations within it over the same period (Wy)
• Local Classification (ii)
  • Non-self-similarities and self-similarities; within- and between-group correlations
• Ecological Analytics
  • Regression modeling
  • Spatial regression
  • Spatial regime models
  • HAC models
[Figures (i), (ii): example visualizations]
Data Tactics Data Repository
• Proxy problem definition – Different problems lead to different questions, which lead to different data sets. The acceptability of a data source is judged against the definition of the proxy problems.
• Key dimensions of variability – Key dimensions such as time, space, and identifier were targeted for collection. However, different proxy problems require different key dimensions.
• Capturing scope – The following was explicitly captured:
  • Data structure (e.g., graph relationship data vs. graph transaction data vs. dimensional data)
  • Data timespan (if time is a dimension)
  • Data geospatial footprint (if geospatial is a dimension)
  • Data volume (both in total GB and in total number of rows)
  • Dataset overlap
• Capturing opinions – Current star ratings are based on:
  • Data consistency, volume, and persistence
  • Data coverage (time and space)
  • Data precision (time and space)
  • Data "genuineness" (synthesized data is penalized)
  • Data distribution (i.e., we may have extremely precise geospatial data, but if there are only 40 unique geospatial points in the data, the geospatial aspects aren't that interesting)
  • Data dimensionality (higher dimensionality with reasonable distributions on each dimension is preferred)
Quantitative Data Competencies
Attributes captured for each data source:
• Name of the data source
• Initial reviewer opinion of data quality
• Description and notes on the data source, as well as collection information
• Collection start / end dates – if known
• Geospatial coverage
• Source where the data was acquired
• Data handling requirements
• Date that statistics were last collected on the data
• Size of data (storage space and rows)
• Location of data on the FTP site
• Data format
Quantitative Data Holdings
Armed Conflict Location and Events Dataset (ACLED), AIS Ship Data, Atmospherics Reports, BrightKite Data, Classified Ads, CNN, Digital Terrain Elevation Data (DTED), Enron Data, Epinions Data, EU Email, Facebook, Flickr Data, Flight Information Data, Four Square Data, Friend Feed Data, Geolife Data, Gowalla Data, International Conference on Weblogs and Social Media (ICWSM) Data, Identica Data, IMDB Data, Knowledge Discovery and Data Mining (KDD) Tools Competition, KDD 2003 Data, KDD 2005 Data, Kiva Data, Landscan Data, LiveJournal Data, Meme Tracker, Meme Twitter TS, NFL Plays, Night Lights Data, Open Data Air Traffic Accidents, Open Street Maps, Panoramio Data, Patent Citations Data, Photobucket Data, Picasa Web Albums Data, Processed Employment Data, Scamper Data, ISVG, Twitter, UNDP, Weather Data, Webgraphs, YouTube
Panoramio / Flickr – Metadata on uploaded public photos provides excellent geospatial and temporal resolution, along with user information. Estimated 250 million rows of photo metadata, with over 150 million already gathered.
AIS – Ship tracking data that provides ship 'pings' as they move. Precise time and geospatial information is provided. 50 million records and counting.
OpenStreetMaps – Over 2 billion geospatial points of mapping enthusiasts' tracks across the world. Time and user ID information is also included.
Gowalla / Brightkite – About 11 million FourSquare-style check-ins with user, location, and time information.
Example proxy problems:
• Discovering "holes" in the data where photos are no longer taken, to detect avoided areas
• Discovering relationships and links based on co-occurrence between users in time / space
• Tracking and analyzing movement patterns on a local and global scale
• Analyzing image data for changes in the same locations
• Detecting differences in photo activity in an area over time
• Detecting events based on abnormal photo activity behavior
• Mapping user IDs across data sources to create a unified analytic picture
• Detecting the home range for each user
• Defining patterns of life by routine activities and movement
• Tracking language usage in areas to determine abnormal language presence in an area
• Local vs. tourist movement analysis and extraction
• Trending of location popularity
UNCLASSIFIED
Twitter – Sampled ongoing collection of social media tweets with user ID and time. Some even have precise location data, but this is not the norm. Collection pulls roughly 1-2 million tweets / day.
Example proxy problems:
• Discovery of crowd-sourced phenomena (e.g., people posting to beware of a certain neighborhood)
• Discovery of correlated trends (e.g., finding that people posting about a certain topic in an area correlates with higher crime in that area)
• Tracking sentiment on certain topics and issues
• Tracking language usage in areas to determine abnormal language presence in an area
• How can we infer movement patterns from vast amounts of what appears to be just point data, collected in time and associated with an identifier (i.e., user ID / bank account / etc.)?
• The technique is applicable to Twitter, FourSquare, and MANY other sources.
[Figure: volume plot of photos binned by area on a log scale; Paris as seen from Flickr over all time]
[Figure: three example micro-paths (Person A, Person B, Person C), with roughly 3-10 seconds between successive points]
1. Goal: to catch active movement between locations a small distance apart
2. Typically two to around a dozen points chained together
3. Located in a small area, but with a definite path through the area
4. Sampled in rapid succession (less than X seconds between points)
5. Thousands or millions of micro-paths make a full path to view
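These rules lend themselves to a single segmentation pass over one identifier's points. The sketch below is illustrative only, not the production implementation: the brief leaves the exact time threshold as X, so the 60-second timeout here merely echoes the Paris example elsewhere in the brief, and planar coordinates in feet are assumed to keep the distance math simple.

```python
import math

def micro_paths(points, max_gap_s=60, max_speed_mph=80, max_len=12):
    """Chain timestamped points into micro-paths.

    points: list of (unix_ts, x_ft, y_ft) for one identifier, sorted by time.
    A segment is dropped when the time gap exceeds max_gap_s or the implied
    speed exceeds max_speed_mph; a chain longer than max_len is split.
    """
    FT_PER_S_TO_MPH = 3600.0 / 5280.0
    paths, current = [], []
    for pt in points:
        if current:
            t0, x0, y0 = current[-1]
            t1, x1, y1 = pt
            dt = t1 - t0
            dist = math.hypot(x1 - x0, y1 - y0)
            speed = (dist / dt) * FT_PER_S_TO_MPH if dt > 0 else float("inf")
            if dt > max_gap_s or speed > max_speed_mph or len(current) >= max_len:
                if len(current) >= 2:   # keep only real chains (2+ points)
                    paths.append(current)
                current = []
        current.append(pt)
    if len(current) >= 2:
        paths.append(current)
    return paths
```

Overlaying the returned chains (rather than raw points) is what produces the aggregate movement picture.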
A micro-path example
Photo timestamps, segment distances, and implied speeds along one chain (all photos taken 2012-08-15):
• 12:34:59 → 12:35:11, 20 ft: 1.13 MPH
• 12:35:11 → 12:35:25, 25 ft: 1.22 MPH
• 12:35:25 → 12:37:25, 33 ft: 0.18 MPH – segment ignored: 120 seconds between points
• 12:37:25 → 12:37:35, 23 ft: 1.57 MPH
• 12:37:35 → 12:37:46, 2500 ft: 154 MPH – segment ignored: velocity too fast
Overlay thousands / millions of these tiny micro-paths together and you get a common path pattern forming.
View of Paris using a 60-second segment timeout and 80 km/hour cutoff on Flickr data. Annotations:
• Arc de Triomphe – apparent typical approach pathway to the Arc
• Place de la Concorde – typically approached from a southern direction
• Eiffel Tower – a red strip appears to be the line of sight to the tower
• Notre Dame – harder to see, but the typical approach / exit pathways from Notre Dame are visible
• Louvre
Aggregate micro-pathing on a world of photo metadata with no speed, time, or distance restrictions
AIS ship tracking micro-path blanket with no time / space filters. Visible features include the coast of Taiwan, China's coast with high levels of activity, and Japan's south coast.
Flickr Paris 2004 changes vs 2005
Flickr Paris 2011 changes vs 2010
Hh [HIGH, high] – an increase between Xt1 -> Xt2 relative to the respective (Xt1, Xt2) reference distribution, where t1, t2 belong to T. HIGH reflects a strong increase in one's own value (dxi) at location i between t1 and t2 relative to the change in neighboring values (dy). high reflects a modest increase in dy relative to the values of dx.
lL [low, LOW] – a decrease between Xt1 -> Xt2 relative to the respective (Xt1, Xt2) reference distribution, where t1, t2 belong to T. low reflects a modest decrease in one's own value (dxi) at location i between t1 and t2 relative to the change in neighboring values (dy). LOW reflects a strong decrease in the neighboring values of dx.
In both cases, neighbors are defined with the spatially lagged variable Wy, as the eight nearest observations.
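A minimal sketch of this scheme, under stated simplifications: dx is the change at each location, dy its spatial lag (the mean change over the k nearest neighbors, eight per the definition), and each location gets a single HIGH/high/low/LOW label by whether its own change is strong or modest relative to the lag. This collapses the paired Hh / lL labels to the own-value half and is an illustrative reading, not the production analytic.

```python
import math

def classify_changes(coords, x_t1, x_t2, k=8):
    """Label each location's change relative to its spatial lag.

    coords: list of (x, y) positions; x_t1 / x_t2: values at times t1 and t2.
    The spatial lag dy_i (Wy) is the mean change over the k nearest neighbors.
    Returns 'HIGH'/'high' for increases that are strong/modest relative to
    the lag, and 'low'/'LOW' for modest/strong decreases.
    """
    dx = [b - a for a, b in zip(x_t1, x_t2)]
    labels = []
    for i, (xi, yi) in enumerate(coords):
        neigh = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.hypot(coords[j][0] - xi, coords[j][1] - yi),
        )[:k]
        dy = sum(dx[j] for j in neigh) / len(neigh)  # spatially lagged change
        if dx[i] >= 0:
            labels.append("HIGH" if dx[i] > dy else "high")
        else:
            labels.append("LOW" if dx[i] < dy else "low")
    return labels
```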
[Figure: number of distinct photographers in Paris by day in year. Recurrent red strips show the recurring weekend; Bastille Day stands out, and New Year provides lots of photos.]
[Figure: number of distinct photographers in Caracas by day in year. The 5-day Carnival celebration stands out, along with some interesting dates for low-volume activity. Image from www.flickr.com/photos/globovision/6911554143]
Airline Flight Data Anomaly Detection
Plot of the count of points where the difference between the expected number of flights leaving an airport (based on the model) and the actual observed number of flights was statistically significant.
During an unusual event, such as the winter storm shown below, the ARIMA model still follows the pattern but doesn't match as well. The areas where the red and black lines don't match are where unusual events have occurred.
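The detector compares model forecasts to observed counts. As a self-contained stand-in for the ARIMA fit (which the brief does not reproduce), the sketch below forecasts each point as a trailing-window mean and flags residuals beyond a z-score threshold; the window and threshold values are illustrative assumptions.

```python
import statistics

def flag_anomalies(series, window=7, z_thresh=3.0):
    """Flag points that deviate strongly from a trailing forecast.

    Each point is forecast as the mean of the previous `window`
    observations; a point is flagged when its residual exceeds
    z_thresh trailing standard deviations.  Returns flagged indices.
    """
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        forecast = statistics.fmean(hist)   # trailing-mean forecast
        sd = statistics.pstdev(hist)
        resid = series[i] - forecast
        if sd > 0 and abs(resid) / sd > z_thresh:
            flagged.append(i)
    return flagged
```

A proper ARIMA fit would replace the trailing mean with the model's one-step-ahead forecast; the flagging logic stays the same.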
Raw data file – each line is a comma-separated record:
key1, timestamp, value
key2, timestamp, value
…

A cloud-backed transformation produces the vector file – each line has a key and a comma-separated list of values:
key1  2.4, 3.4, 0.99, …
key2  3.4, 4.3, 1.0, 0.6, …
…
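In memory, the transformation can be sketched as follows. This is a toy stand-in for the cloud-backed job (key names follow the example lines; the helper name is an assumption): records are bucketed by key and their values ordered by timestamp, one vector per key.

```python
from collections import defaultdict

def to_vectors(lines):
    """Group raw `key,timestamp,value` records into per-key value vectors."""
    buckets = defaultdict(list)
    for line in lines:
        key, ts, value = line.strip().split(",")
        buckets[key].append((float(ts), float(value)))
    return {
        key: [v for _, v in sorted(pairs)]  # order values by timestamp
        for key, pairs in buckets.items()
    }
```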
       key1   key2   key3   key4
key1   -      0.93   0.43   0.001
key2   -      -      -0.5   -0.03
key3   -      -      -      0.32
key4   -      -      -      -
Correlation analytic – for each vector, calculate the correlation to each other vector. We use a Pearson correlation.
Implemented in:
• Python (RAM)
• Hive
• Mahout
• Spark
• Giraph
• Cascalog
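The brute-force variant (the in-RAM Python flavor) can be sketched as below; function and key names are illustrative, and only the upper triangle of the symmetric matrix is computed, matching the table above.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def pairwise_correlations(vectors):
    """Brute-force upper-triangle Pearson matrix over a dict of vectors."""
    return {
        (k1, k2): pearson(vectors[k1], vectors[k2])
        for k1, k2 in combinations(sorted(vectors), 2)
    }
```

The cloud implementations distribute exactly this O(n²) pair loop across workers.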
[Diagram: training and test engines running on Spark]
Approximation engine for the O(n²) correlation matrix problem.
The approximation provides an orders-of-magnitude speedup compared to equivalent brute-force methods. The technique works best for highly correlated items and uses a series of data projections, unsupervised learning, and vector quantization to provide dimensionality reduction for incoming complex vectors.
Technique based on Google Correlate
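The brief does not spell out the algorithm, so the following is one plausible sketch in the same spirit: random-hyperplane projections bucket standardized vectors so that only colliding pairs are sent to an exact Pearson pass, pruning the O(n²) pair space. All names and parameters are illustrative assumptions; this is not a reconstruction of the production engine or of Google Correlate's actual method.

```python
import math
import random

def approx_correlated_pairs(vectors, n_bits=8, seed=0):
    """Candidate-pair generation for the O(n^2) correlation problem.

    Each standardized vector is hashed with random-hyperplane projections
    (SimHash style); only vectors sharing a hash bucket are returned as
    candidates for an exact correlation pass.  Highly correlated vectors
    tend to land in the same bucket.
    """
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def standardize(v):
        mu = sum(v) / len(v)
        sd = math.sqrt(sum((x - mu) ** 2 for x in v)) or 1.0
        return [(x - mu) / sd for x in v]

    buckets = {}
    for key, vec in vectors.items():
        z = standardize(vec)
        sig = tuple(sum(p * x for p, x in zip(plane, z)) >= 0 for plane in planes)
        buckets.setdefault(sig, []).append(key)

    # Emit only colliding pairs: the candidates for an exact Pearson pass.
    return [
        (a, b)
        for keys in buckets.values()
        for i, a in enumerate(keys)
        for b in keys[i + 1:]
    ]
```

Vector quantization (as named in the slide) would play a similar role to the hash buckets here: collapsing many similar vectors onto shared representatives before any exact comparison.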