Geosocial Big Data Analysis Using Python and FOSS4G with the Case Study of Korean Data
Ilyoung Hong
Namseoul University
Department of GIS Engineering
Geosocial data
• Social media (Twitter, Facebook) is the killer app for smartphones
• Smartphones with GPS generate large amounts of geotagged social data
• Geotagged social data is called geosocial data
• Examples: GeoTweets (geotagged tweets), Foursquare (4sq) venues
Geosocial Data Research
• Fujita, Hideyuki. "Geo-tagged Twitter collection and visualization system." Cartography and Geographic Information Science 40.3 (2013): 183-191.
=> Computational method, data collection
• Jung, Jin‐Kyu. "Code clouds: Qualitative geovisualization of geotweets." The Canadian Geographer/Le Géographe canadien 59.1 (2015): 52-68.
=> Qualitative approach, with content analysis
• Li, Linna, Michael F. Goodchild, and Bo Xu. "Spatial, temporal, and socioeconomic patterns in the use of Twitter and Flickr." Cartography and Geographic Information Science 40.2 (2013): 61-77.
=> Spatial statistical analysis with geodemographic data
• Mitchell, Lewis, et al. "The geography of happiness: Connecting twitter sentiment and expression, demographics, and objective characteristics of place." (2013): e64417.
=> Sentiment analysis, computational linguistics approach
[Diagram labels: Data Collection; Data Analysis; Data Visualization; Quantitative Analysis; Data Management; Qualitative Analysis; Web Programming]
Multidisciplinary Aspects of Geosocial Data Analysis
[Diagram labels: GeoSocial Data; Database Management; Geography, Cartography, GIS; Statistics, Linguistics, Text Mining; Sociology, Journalism, Media]
Challenges of Geosocial Research
• Different data sources and formats
  • Tweets, Foursquare, Facebook
• Different analysis environments and software
  • Java, PHP, Python, C, R, ArcGIS, web programming, database programming, statistics, geovisualization
• Different domain knowledge and multidisciplinary research methods
  • Computation, geography, sociology, psychology, statistics, linguistics, media, journalism
Interdisciplinary cooperation is needed. Is there a way to integrate these methods?
Why Python/FOSS4G for Geosocial Big Data?
• Integrated analysis environment of software and libraries
  • Python is free and open
  • Object-oriented programming (OOP) in Python
  • WinPython, Anaconda (SciPy, IPython), Enthought Canopy for Python 2.7
• Large number of libraries supporting different domain knowledge
  • PyPI, the Python Package Index, currently has 66086 packages
• Simple coding environment
  • Quick to learn and to code
  • Readability: the syntax of Python is readable and clear
Research Purpose
• Introduce an integrated platform for analyzing geosocial data using Python & FOSS4G
  • Data collection and management
  • Data analysis with qualitative and quantitative methods
  • Sentiment analysis
  • Geovisualization
• Present a case study with Korean geosocial data
  • GeoTweet distribution
  • Spatial patterns of Foursquare venues
  • Sentiment analysis of Korean GeoTweets
Architecture, at the Beginning
[Diagram labels: Social Media; Twitter/Foursquare; API; JSON; Excel, CSV; Shapefile, ArcGIS]
Data Collection
• Twitter Streaming API from Python, using tweepy
• Limited rates for one user
  • There is a restriction on data collection from Twitter: method calls to the Twitter API are limited to 350 calls per hour for one authorized developer account
  • Switch to another user ID when the limit is reached
• Filtering of unnecessary data
  • Geotweets are just 1% of all tweets
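The filtering step can be sketched as follows. This is a minimal sketch, not the author's collector: it assumes the tweets arrive as one JSON object per line in the Twitter API v1.1 layout, where a geotagged tweet carries a GeoJSON point under `coordinates`. The function name `iter_geotweets`, the bounding box `KOREA_BBOX`, and the sample records are hypothetical.

```python
import json

# Hypothetical bounding box roughly covering South Korea (WGS84).
KOREA_BBOX = (124.0, 33.0, 132.0, 39.0)  # min_lon, min_lat, max_lon, max_lat


def iter_geotweets(lines, bbox=KOREA_BBOX):
    """Yield (lon, lat, tweet) for tweets whose point lies inside bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    for line in lines:
        line = line.strip()
        if not line:
            continue  # a stream may contain blank keep-alive lines
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip truncated or malformed records
        if not isinstance(tweet, dict):
            continue
        point = tweet.get("coordinates")
        if not point or point.get("type") != "Point":
            continue  # most tweets carry no exact coordinates
        lon, lat = point["coordinates"]  # GeoJSON order: longitude first
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            yield lon, lat, tweet


# Made-up sample records for illustration only.
sample = [
    '{"id_str": "1", "text": "hello", "coordinates": null}',
    '{"id_str": "2", "text": "in Seoul", "coordinates": '
    '{"type": "Point", "coordinates": [126.978, 37.566]}}',
    '{"id_str": "3", "text": "in Tokyo", "coordinates": '
    '{"type": "Point", "coordinates": [139.69, 35.68]}}',
    "not json",
]
geotweets = list(iter_geotweets(sample))
print(len(geotweets), "of", len(sample), "records kept")
```

Because the function is a generator, it can be applied to a file or a live stream without holding the non-geotagged majority in memory.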
Columns from a Tweet
● Tweet text => qualitative approach, text mining, keyword filter, sentiment analysis
● Tweet ID; User ID; Destination user ID (only for tweets with "@user ID"); User profile (including location name input by user) => behavioral features, heavy-user features, social network
● Location coordinates (only for tweets tagged with the location coordinates) => geovisualization, spatial analysis using GIS
● Date and time => temporal analysis
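The columns listed above can be pulled out of a raw tweet roughly as follows. This is a sketch under the assumption that the tweet is a dictionary in the Twitter API v1.1 layout; the function name `tweet_to_row`, the output column names, and the sample tweet are hypothetical.

```python
from datetime import datetime

# Timestamp layout used by v1.1 tweets, e.g. "Sat Nov 01 05:00:00 +0000 2014".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweet_to_row(tweet):
    """Flatten one tweet dict into a flat record of the columns of interest."""
    user = tweet.get("user") or {}
    point = tweet.get("coordinates") or {}
    lon, lat = point.get("coordinates") or (None, None)
    return {
        "tweet_id": tweet.get("id_str"),
        "user_id": user.get("id_str"),
        # Destination user: set only when the tweet replies to someone.
        "reply_to_user_id": tweet.get("in_reply_to_user_id_str"),
        # Free-text location typed into the user profile.
        "profile_location": user.get("location"),
        "lon": lon,
        "lat": lat,
        "created_at": datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT),
        "text": tweet.get("text", ""),
    }


# Made-up sample tweet for illustration only.
sample_tweet = {
    "id_str": "2",
    "text": "in Seoul",
    "created_at": "Sat Nov 01 05:00:00 +0000 2014",
    "in_reply_to_user_id_str": None,
    "user": {"id_str": "42", "location": "Seoul"},
    "coordinates": {"type": "Point", "coordinates": [126.978, 37.566]},
}
row = tweet_to_row(sample_tweet)
print(row["tweet_id"], row["lon"], row["lat"], row["created_at"].isoformat())
```

A list of such rows can be written to CSV or loaded into a data frame for the later analysis steps.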
Two Studies So Far
• Spatial Analysis of Location-Based Social Networks in Seoul, Korea, Journal of Geographic Information System, 2015, 7, 259-265
• Spatial Distribution of Korean Geotweets, Journal of the Korean Cartographic Association, 2015, 15(2), 93-101
Spatial Analysis of Location-Based Social Networks in Seoul, Korea
• The purpose of this study is to analyze the spatial patterns of location-based social network (LBSN)
data in Seoul using the spatial analysis techniques of geographic information system (GIS). The
study explores the applications of LBSN data by analyzing the association between Seoul’s
Foursquare venues data created based on user participation and the city’s characteristics. The data
regarding Foursquare venues were compiled with a program we created based on Foursquare’s
Python API. The compiled information was converted into GIS data, which in turn was depicted as a
heat map. Cluster analysis was then performed based on hotspots and the correlation with census
variables was analyzed for each administrative unit using geographically weighted regression
(GWR). Based on analytical results, we were able to identify venue clusters around city centers, as
well as differences in hotspots for various venue categories and correlations with census variables.
About 230,000 venue records were collected for analysis between March 15 and 21, 2015.
Spatial Distribution of Korean Geotweets
In this study, we analyzed the distribution of Korean geotweets. The geotweets were collected in November 2014 through the Twitter Streaming API. The collected data were analyzed and converted to GIS data using Python. Twitter use is concentrated in Seoul and the metropolitan areas, and a few heavy users created a large number of tweets. Time series analysis showed that tweet volume peaks on the weekend and, within a day, at 14:00. In addition, text analysis identified a high percentage of retweets and regional differences in content.
Key Words: Twitter API, Geotweet, Spatial distribution
Spatial Distribution of Geotweets
Distribution of geotweets, Nov 2014
Daily distribution of geotweets, Nov 2014
• In November 2014, over 2 million tweets were collected.
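A daily and hourly distribution of this kind can be tallied with a few lines of standard-library Python. The sketch below is illustrative only: it assumes timezone-aware timestamps and converts them to Korea Standard Time (UTC+9) before counting, which is an assumption rather than something the slides state. The function name `temporal_counts` and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

KST = timezone(timedelta(hours=9))  # Korea Standard Time, UTC+9
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def temporal_counts(timestamps):
    """Count tweets per local weekday and per local hour of day."""
    by_weekday = Counter()
    by_hour = Counter()
    for ts in timestamps:
        local = ts.astimezone(KST)
        by_weekday[WEEKDAYS[local.weekday()]] += 1
        by_hour[local.hour] += 1
    return by_weekday, by_hour


# Made-up timestamps for illustration only (stored in UTC).
stamps = [
    datetime(2014, 11, 1, 5, 0, tzinfo=timezone.utc),    # Saturday 14:00 KST
    datetime(2014, 11, 1, 5, 30, tzinfo=timezone.utc),   # Saturday 14:30 KST
    datetime(2014, 11, 2, 14, 59, tzinfo=timezone.utc),  # Sunday 23:59 KST
    datetime(2014, 11, 3, 15, 0, tzinfo=timezone.utc),   # Tuesday 00:00 KST
]
by_weekday, by_hour = temporal_counts(stamps)
print(by_weekday.most_common(1), by_hour.most_common(1))
```

Converting to local time before counting matters: the last sample falls on Monday in UTC but on Tuesday in Korea.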
Text analysis
• High percentage of retweets
• Some keywords represent regional features
• PyTag, Word_cloud
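The two text measures above, retweet share and keyword frequency, can be approximated with the standard library. This is a rough sketch, not the tooling used in the study: it treats a text starting with "RT @" as a retweet (a common heuristic), splits words on non-word characters (crude for Korean text, which would normally need a morphological analyzer), and uses a tiny hypothetical stopword list. The resulting counts are the kind of input a word cloud library takes.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "in", "at", "of", "and", "rt"}


def is_retweet(text):
    """Heuristic: classic retweets start with 'RT @user'."""
    return text.startswith("RT @")


def keyword_counts(texts):
    """Count words after removing URLs, @mentions, and stopwords."""
    counts = Counter()
    for text in texts:
        cleaned = re.sub(r"https?://\S+|@\w+", " ", text.lower())
        for word in re.findall(r"\w+", cleaned):
            if len(word) > 1 and word not in STOPWORDS:
                counts[word] += 1
    return counts


# Made-up tweets for illustration only.
texts = [
    "RT @user: Seoul tower at night https://t.co/x",
    "Busan beach day",
    "RT @user2: Seoul food",
    "Seoul cafe",
]
retweet_share = sum(is_retweet(t) for t in texts) / len(texts)
counts = keyword_counts(texts)
print(retweet_share, counts.most_common(3))
```

Running `keyword_counts` separately on the tweets of each region gives per-region keyword lists that can be compared for regional features.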
Problems
• Exploratory statistical analysis involves repeated work, but the process is not automated
• It takes time and introduces data errors
• As time goes by, the data becomes too big to handle
  • It needs to be managed in a database, not as text files
• Data and software should be compatible within the same environment for automated analysis
Python & FOSS4G
• Integrated analysis environment
• Large number of libraries supporting different domain knowledge
• Automated scripts for analysis
[Architecture diagram labels: Social Media Server; GIS Data Server; Analysis Client; Visualize Client; Twitter API (Tweepy); SpatiaLite; pyspatialite; Data Collection; Data Parsing; Data Conversion; Shape/Text; pandas for Data Analysis; Sentiment Analysis (Python NLTK); Statistical Analysis (PySAL); Word Cloud (pytagcloud); Geovisualization (Quantum GIS)]
Analysis Process
[Flowchart labels: Social Media Data; Geo-Tagged?; Data Type?; Analysis Method?; Visualizing Method?; GIS Database; Text Mining; Spatial Analysis; Quantitative; Qualitative; Sentiment Analysis; Statistical Analysis; Word Clouds; Heat Map; Thematic Mapping; Hotspot; GWR]
SpatiaLite Database: Why?
- Standalone, file-based database: easy to handle
- Compatibility and interoperability: Python, QGIS, ArcGIS, export/import to any format
- Easy to use, with a GUI
pyspatialite
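Since SpatiaLite is an extension of SQLite, the file-based storage idea can be illustrated with Python's built-in `sqlite3` module. The sketch below is a stand-in, not SpatiaLite or pyspatialite code: it stores longitude and latitude as plain numeric columns and selects by bounding box, whereas a SpatiaLite database would use a geometry column and spatial functions. The table name `geotweet`, the column names, the sample rows, and the approximate Seoul bounding box are all hypothetical.

```python
import sqlite3

# ":memory:" keeps the example self-contained; a file path such as
# "geotweets.sqlite" would give a standalone database file instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE geotweet (
           tweet_id   TEXT PRIMARY KEY,
           user_id    TEXT,
           created_at TEXT,
           lon        REAL,
           lat        REAL,
           text       TEXT)"""
)

# Made-up rows for illustration only; the third repeats tweet_id "1".
rows = [
    ("1", "u1", "2014-11-01T14:00:00+09:00", 126.978, 37.566, "in Seoul"),
    ("2", "u2", "2014-11-01T14:05:00+09:00", 129.075, 35.180, "in Busan"),
    ("1", "u1", "2014-11-01T14:00:00+09:00", 126.978, 37.566, "in Seoul"),
]
# INSERT OR IGNORE skips rows whose primary key is already stored.
conn.executemany("INSERT OR IGNORE INTO geotweet VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM geotweet").fetchone()[0]

# Hypothetical bounding box roughly around Seoul.
min_lon, min_lat, max_lon, max_lat = 126.76, 37.41, 127.18, 37.70
in_seoul = [
    r[0]
    for r in conn.execute(
        "SELECT tweet_id FROM geotweet "
        "WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ? "
        "ORDER BY tweet_id",
        (min_lon, max_lon, min_lat, max_lat),
    )
]
print(total, in_seoul)
```

The primary key on `tweet_id` lets repeated collection runs be loaded into the same file without duplicating tweets.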
Sentiment Analysis with Python NLTK Text Classification
• Sentiment analysis using NLTK
• Tweet Text => POS, NEU, NEG values
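To make the shape of that output concrete, here is a toy scorer that maps a tweet text to POS, NEU, and NEG values. It is deliberately not NLTK and not the classifier used in the study: it is a simple word-list lookup, and the word lists `POSITIVE` and `NEGATIVE` and the function name `sentiment_scores` are hypothetical. A trained NLTK text classifier would take the place of this function in a real pipeline.

```python
import re

# Tiny hypothetical word lists; a real lexicon or trained model is far larger.
POSITIVE = {"happy", "best", "love", "beautiful", "good"}
NEGATIVE = {"sad", "bad", "worst", "hate", "terrible"}


def sentiment_scores(text):
    """Return the share of positive, neutral, and negative words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {"pos": 0.0, "neu": 1.0, "neg": 0.0}
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    n = len(words)
    # The three shares always add up to 1.
    return {"pos": pos / n, "neu": (n - pos - neg) / n, "neg": neg / n}


scores = sentiment_scores("Happy Sunday :)")
print(scores)
```

Keeping three values per tweet, rather than a single label, makes it possible to rank tweets and places by how positive they are.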
Heatmap using Quantum GIS: geotweets, July 2015
Hot, Best Positive Places
Jongro
HongDae
Youngsan
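The heat map itself was produced in Quantum GIS. As a rough stand-in for the idea of locating the most positive places, the sketch below averages a positive score over square grid cells. It is a coarse grid aggregate, not a kernel density heat map; the function name `grid_mean`, the 0.01 degree cell size, and the sample points are hypothetical.

```python
import math
from collections import defaultdict


def grid_mean(points, cell=0.01):
    """Aggregate (lon, lat, score) points into square cells of `cell` degrees.

    Returns {(col, row): (count, mean_score)}.
    """
    sums = defaultdict(lambda: [0, 0.0])
    for lon, lat, score in points:
        key = (math.floor(lon / cell), math.floor(lat / cell))
        sums[key][0] += 1
        sums[key][1] += score
    return {key: (n, total / n) for key, (n, total) in sums.items()}


# Made-up points (lon, lat, positive score) for illustration only.
points = [
    (126.978, 37.566, 0.5),
    (126.979, 37.567, 1.0),
    (126.924, 37.556, 0.25),
]
cells = grid_mean(points)
# Cell with the highest mean positive score.
best_cell = max(cells, key=lambda key: cells[key][1])
print(best_cell, cells[best_cell])
```

In practice a minimum tweet count per cell would be worth requiring, so that a single enthusiastic tweet does not define a hotspot.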
Word Cloud
Jongro
HongDae
Youngsan
Best Positive Tweets
Happy Pride from Kat! #seoul #gaypride #kqcf2015 #korea #hugagaytoday @ Seoul City Hall Korea https://t.co/81TiNdqCMH
#seoulgayprideparade HAPPY PRIDE DAY KOREA!!!! #rainbow #lgbt #love #happy #seoul #korea @ Seoul Plaza https://t.co/FUCkHxmIsc
Good times and more Korean BBQ with the Samsung team #MobLabs #GangnamStyle @ Gangnam, Seoul, Korea https://t.co/NyIa440NZ3
Happy Sunday :) @ Myeongdong Cathedral https://t.co/TezVZTVtDH
We go by the zoo via the "Elephant Train" to the museum @ Seoul Grand Park Zoo https://t.co/imXCgPrcBG
Korean food is the best food #korea #food #nofiilter @ Seoul ,Korea https://t.co/MqVDHqqoEy
Have a beautiful and fruitful week IG fam! #MondayLook #mamichoux @ Hongdae Seoul https://t.co/lVM5NdLJyp
Happy the 4th of July to all my American friends! (@ Thursday Party in Seoul) https://t.co/CG27beaCQl
And with Elizaveta from Russia :) @ Trickeye Museum https://t.co/7NCrGUYOF1
Quick tour of a Korean apartment @ Hongdae Seoul South Korea https://t.co/yTy8mAVCZk
..
Conclusion and Future Work
• Analysis of geosocial data is a complex, multidisciplinary process
• This research presents an integrated architecture using Python & FOSS4G
• Future work
  • Automated processing with Python scripts
  • More work is needed on QGIS and PySAL for more advanced analysis and visualization