Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski
DESCRIPTION
Nowadays we produce a huge volume of information, but unfortunately at most 12% of it is analyzed. That is why we should dive into our data lake and pull out the Holy Grail: knowledge. But big data means big problems. So, challenge accepted! The perfect solution for achieving this goal is Hadoop, a "data operating system" that lets us process large volumes of any data in a distributed way. Together, we will take a phenomenal journey around the Hadoop world. First stop: operations basics. Second stop: a short tour of the Hadoop ecosystem. At the end of our travels, we will walk through several examples that show the real power of Hadoop as your data platform.
Arkadiusz Osinski works in Allegro Group as a system administrator. From the beginning he has been involved in building and maintaining the Hadoop infrastructure within Allegro Group. Previously he was responsible for maintaining large-scale database systems. He is passionate about new technologies and cycling.
Robert Mroczkowski graduated with a master's degree in Computer Science from Nicolaus Copernicus University in 2006 and with a bachelor's degree in Applied Informatics from the same university in 2007. From 2006 to 2011 he was a PhD student in Computer Science; his research field was computer science applied to bioinformatics. In 2012 he started working as a Unix system administrator in Allegro Group. He gained experience in the Hadoop world by building and maintaining a cluster for GA. Every day he works with modern high-performance and high-availability technologies, centrally managed in a cloud environment.
TRANSCRIPT
![Page 1: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/1.jpg)
Hadoop: challenge accepted!
Arkadiusz Osiński
Robert Mroczkowski
![Page 2: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/2.jpg)
ToC
- Hadoop basics
- Gather data
- Process your data
- Learn from your data
- Visualize your data
![Page 3: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/3.jpg)
BigData
- Petabytes of (un)structured data
- 12% of data is analyzed
- A lot of data is not gathered
- How to gain knowledge?
![Page 7: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/7.jpg)
Power Big Data
Data Lake
Scalability
Petabytes
Mapreduce Commodity
![Page 8: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/8.jpg)
HDFS
- Storage layer
- Distributed file system
- Commodity hardware
- Scalability
- JBOD
- Access control
- No SPOF
![Page 15: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/15.jpg)
YARN
- Distributed computing layer
- Operations run where the data is stored
- MapReduce…
- and other applications
- Resource management
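To make "operations run where the data is stored" concrete, here is a minimal single-process sketch in Python. The node names and records are hypothetical, and nothing here touches a real cluster: it only illustrates the idea that each node counts words in its own slice of the data, so that only small partial results have to travel to the reduce step.

```python
from collections import Counter

# Hypothetical cluster: each node holds its own slice of the data set
partitions = {
    "node1": ["big data big problem", "challenge accepted"],
    "node2": ["big data lake"],
}

def map_on_node(records):
    """Runs where the data lives; returns only small per-node word counts."""
    local = Counter()
    for record in records:
        local.update(record.split())
    return local

def reduce_all(partials):
    """Merges the per-node partial counts into one global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

partials = [map_on_node(records) for records in partitions.values()]
word_counts = reduce_all(partials)
print(word_counts.most_common(2))
```

In a real cluster the scheduler, not a list comprehension, decides where each map task runs; the sketch only shows why moving the computation is cheaper than moving the records.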
![Page 20: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/20.jpg)
Let’s squeeze our data to get some juice!
![Page 21: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/21.jpg)
Gather data
flume-twitter.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
flume-twitter.sources.Twitter.channels = MemChannel
flume-twitter.sources.Twitter.consumerKey = (…)
flume-twitter.sources.Twitter.consumerSecret = (…)
flume-twitter.sources.Twitter.accessToken = (…)
flume-twitter.sources.Twitter.accessTokenSecret = (…)
flume-twitter.sources.Twitter.keywords = hadoop, big data, nosql
![Page 22: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/22.jpg)
Process your data
- Hadoop Streaming!
- No need to write code in Java
- You can use Python, Perl or Awk
![Page 25: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/25.jpg)
Process your data
#!/usr/bin/python
import sys
import json
import datetime as dt

keyword = 'hadoop'
for line in sys.stdin:
    data = json.loads(line.strip())
    if keyword in data['text'].lower():
        # Use a separate name so the dt module alias is not overwritten
        day = dt.datetime.strptime(data['created_at'], '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%d')
        print('{0}\t1'.format(day))
![Page 26: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/26.jpg)
Process your data
#!/usr/bin/python
import sys

(counter, datekey) = (0, '')
for line in sys.stdin:
    line = line.strip().split("\t")
    if datekey != line[0]:
        # Key changed: emit the finished group before starting a new one
        if datekey:
            print("{0}\t{1}".format(datekey, counter))
        datekey = line[0]
        counter = 1
    else:
        counter += 1
# Emit the last group, unless the input was empty
if datekey:
    print("{0}\t{1}".format(datekey, counter))
![Page 27: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/27.jpg)
Process your data
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files ./map.py,./reduce.py \
-mapper ./map.py \
-reducer ./reduce.py \
-input /tweets/2014/04/*/*/* \
-input /tweets/2014/05/*/*/* \
-output /tweet_keyword
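Before submitting a streaming job it can help to rehearse the map, sort, reduce contract locally. The sketch below is one way to do that in plain Python, under the assumption that tweets arrive as one JSON object per line; the sample tweets are made up, and the function names `map_tweets` and `reduce_counts` are illustrative stand-ins for the two scripts, not part of Hadoop.

```python
import json
from datetime import datetime
from itertools import groupby

TWITTER_TIME = '%a %b %d %H:%M:%S +0000 %Y'

def map_tweets(lines, keyword='hadoop'):
    """Mimics the mapper: emit 'YYYY-MM-DD<TAB>1' per tweet mentioning keyword."""
    for line in lines:
        tweet = json.loads(line)
        if keyword in tweet['text'].lower():
            day = datetime.strptime(tweet['created_at'], TWITTER_TIME).strftime('%Y-%m-%d')
            yield '{0}\t1'.format(day)

def reduce_counts(sorted_lines):
    """Mimics the reducer: sum the values of consecutive lines sharing a key."""
    pairs = (line.split('\t') for line in sorted_lines)
    for day, group in groupby(pairs, key=lambda kv: kv[0]):
        yield '{0}\t{1}'.format(day, sum(int(value) for _, value in group))

# Hypothetical sample input, one JSON tweet per line
sample = [
    json.dumps({'text': 'Hadoop rocks', 'created_at': 'Thu Apr 24 10:00:00 +0000 2014'}),
    json.dumps({'text': 'I like cycling', 'created_at': 'Thu Apr 24 11:00:00 +0000 2014'}),
    json.dumps({'text': 'Learning #hadoop', 'created_at': 'Fri Apr 25 09:30:00 +0000 2014'}),
    json.dumps({'text': 'hadoop again', 'created_at': 'Thu Apr 24 23:59:59 +0000 2014'}),
]

# sorted() plays the role of the shuffle/sort phase between map and reduce
result = list(reduce_counts(sorted(map_tweets(sample))))
for row in result:
    print(row)
```

The reducer logic depends on its input being sorted by key, which is exactly what the framework guarantees between the two phases.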
![Page 28: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/28.jpg)
Process your data
(…)
2014-04-24 864
2014-04-25 1121
2014-04-26 593
2014-04-27 649
2014-04-28 1084
2014-04-29 1575
2014-04-30 1170
2014-05-01 1164
2014-05-02 1175
2014-05-03 779
2014-05-04 471
(…)
![Page 29: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/29.jpg)
Process your data
![Page 30: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/30.jpg)
Recommendations
Which product will a client want?
We have a history of user interactions with items.
![Page 31: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/31.jpg)
![Page 32: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/32.jpg)
Simple Example
Let’s just use Mahout: it’s easy!
> apt-get install mahout
> cat simple_example.csv
1,101
1,102
1,103
2,101
> hdfs dfs -put simple_example.csv
> mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -b \
-Dmapred.input.dir=/mahout/input/wikilinks/simple_example.csv \
-Dmapred.output.dir=/mahout/output/wikilinks/simple_example \
-Dmapred.job.queue.name=atmosphere_prod
![Page 33: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/33.jpg)
Simple Example
Tadadam!
> hdfs dfs -text /mahout/output/wikilinks/simple_example/part-r-00000.snappy
1 [105:1.0,104:1.0]
2 [106:1.0,105:1.0]
3 [103:1.0,102:1.0]
4 [105:1.0,102:1.0]
5 [107:1.0,106:1.0]
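For intuition about what a log-likelihood item similarity measures, the following is a small pure-Python sketch of item-based recommendation over boolean (user, item) pairs. It is not Mahout's implementation: the `recommend` helper, the scoring by summed similarity, and the sample pairs are assumptions made for illustration. Only the log-likelihood ratio of a 2x2 co-occurrence table and the `1 - 1/(1 + LLR)` mapping follow the commonly described form of this measure.

```python
import math
from collections import defaultdict

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio of a 2x2 co-occurrence table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

def item_similarity(users_a, users_b, n_users):
    both = len(users_a & users_b)
    if both == 0:
        return 0.0  # simplification: only co-occurring items count as similar
    only_a = len(users_a) - both
    only_b = len(users_b) - both
    neither = n_users - both - only_a - only_b
    return 1.0 - 1.0 / (1.0 + llr(both, only_a, only_b, neither))

def recommend(pairs, user, top_n=2):
    """Score unseen items by their summed similarity to the user's items."""
    users_by_item = defaultdict(set)
    for u, item in pairs:
        users_by_item[item].add(u)
    n_users = len({u for u, _ in pairs})
    seen = {item for u, item in pairs if u == user}
    scores = {}
    for candidate, users in users_by_item.items():
        if candidate in seen:
            continue
        score = sum(item_similarity(users, users_by_item[s], n_users) for s in seen)
        if score > 0:
            scores[candidate] = score
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

# Hypothetical (user, item) pairs in the same shape as simple_example.csv
pairs = [(1, 101), (1, 102), (1, 103), (2, 101), (2, 102), (3, 101), (4, 104)]
print(recommend(pairs, user=2))
```

Here user 2 shares items 101 and 102 with user 1, so item 103 is the only candidate with a positive score; item 104 never co-occurs with anything user 2 has seen.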
![Page 34: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/34.jpg)
Wiki Case
We have the links between Wikipedia articles and want to propose new links between articles.
„Wikipedia (/ˌwɪkɨˈpiːdiə/ or /ˌwɪkiˈpiːdiə/ WIK-i-PEE-dee-ə) is a collaboratively edited, multilingual, free Internet encyclopedia that is supported by the non-profit Wikimedia Foundation. Volunteers worldwide collaboratively write Wikipedia's 30 million articles in 287 languages, including over 4.5 million in the English Wikipedia. Anyone who can access”
![Page 35: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/35.jpg)
Wiki Case
![Page 36: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/36.jpg)
Wiki Case
http://users.on.net/%7Ehenry/pagerank/links-simple-sorted.zip
#!/usr/bin/awk -f
BEGIN { OFS=","; }
{
  gsub(":","",$1);
  for (i=2;i<=NF;i++) {
    print $1,$i
  }
}
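The awk mapper turns each adjacency line into one "source,target" pair per outgoing link, which is the two-column shape the recommender input takes. A Python sketch of the same transformation, assuming input lines of the form `source: target1 target2 ...` (the sample line and the function name `explode_links` are hypothetical):

```python
def explode_links(line):
    """Turn 'src: dst1 dst2' into ['src,dst1', 'src,dst2'], like the awk mapper."""
    fields = line.split()
    if not fields:
        return []
    source = fields[0].replace(":", "")
    return ["{0},{1}".format(source, target) for target in fields[1:]]

# Hypothetical input line
pairs = explode_links("1: 2 3 4")
for pair in pairs:
    print(pair)
```

A line with a source but no targets produces no output, matching the awk loop that starts at field 2.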
![Page 37: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/37.jpg)
Wiki Case
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-Dmapreduce.job.max.split.locations=24 \
-Dmapreduce.job.queuename=hadoop_prod \
-Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-Dmapred.text.key.comparator.options=-n \
-Dmapred.output.compress=false \
-files ./mahout/mapper.awk \
-mapper ./mapper.awk \
-input /mahout/input/wikilinks/links-simple-sorted.txt \
-output /mahout/output/wikilinks/fixedinput
![Page 38: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/38.jpg)
Wiki Case
The Mahout library computed the similarity matrix and produced recommendations for 824 articles.
Importantly, we gathered no prior knowledge and simply ran the algorithms out of the box.
![Page 39: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/39.jpg)
Wiki Case
Acadèmia_Valenciana_de_la_Llengua
FIFA: Valencia
October_1: Calendar
Prehistoric_Iberia: Link appeared recently
Ceuta: Spanish city on the north coast of Africa
Roussillon: Part of France by the border with Spain
Sweden: :)
Turís: Municipality in the Valencian Community
Vulgar_Latin: Language article
Western_Italo-Western_languages: Language article
Àngel_Guimerà: Spanish writer
![Page 40: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/40.jpg)
Wiki Case
![Page 41: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/41.jpg)
Tweets
Let’s find groups of:
• tags
• users
![Page 42: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/42.jpg)
Tweets
• Our data is not random
• We’ve picked specific keywords
• We’ll do the analysis in two orthogonal directions
![Page 43: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/43.jpg)
Tweets
{
"filter_level":"medium",
"contributors":null,
"text":"PROMOCIÓN MES DE MAYO. con ...",
"geo":null,
"retweeted":false,
"lang":"es",
"entities":{
"urls":[
{ "expanded_url":"http://www.agmuriel.com",
"indices":[ 69, 91 ],
"display_url":"agmuriel.com/#!-/c1gz",
"url":"http://t.co/APpPjRRTXn" } ]
}
(…)
![Page 44: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/44.jpg)
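Tweets
Mapper (twitter_map.py):
#!/usr/bin/python
import json, sys

for line in sys.stdin:
    line = line.strip()
    # Cheap pre-filter: parse only tweets marked as English
    if '"lang":"en"' in line:
        tweet = json.loads(line)
        try:
            # Collapse whitespace so tabs or newlines cannot break the key/value format
            text = ' '.join(tweet['text'].lower().split())
            if text:
                for tag in tweet['entities']['hashtags']:
                    print(tag['text'] + "\t" + text)
        except KeyError:
            continue

Reducer (twitter_reduce.py):
#!/usr/bin/python
import sys

(lastKey, texts) = (None, [])
for line in sys.stdin:
    (key, value) = line.rstrip("\n").split("\t", 1)
    if lastKey is not None and lastKey != key:
        print(lastKey + "\t" + " ".join(texts))
        texts = []
    lastKey = key
    texts.append(value)
# Emit the final group, which the loop itself never flushes
if lastKey is not None:
    print(lastKey + "\t" + " ".join(texts))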
![Page 45: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/45.jpg)
Tweets
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-Dmapreduce.job.queuename=atmosphere_time \
-Dmapred.output.compress=false \
-Dmapreduce.job.max.split.locations=24 \
-Dmapred.reduce.tasks=20 \
-files ~/mahout/twitter_map.py,~/mahout/twitter_reduce.py \
-mapper ./twitter_map.py \
-reducer ./twitter_reduce.py \
-input /project/atmosphere/tweets/2014/04/*/* \
-output /project/atmosphere/tweets/output \
-outputformat org.apache.hadoop.mapred.SequenceFileOutputFormat
Get SequenceFile with proper mapping
![Page 46: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/46.jpg)
Tweets
mahout seq2sparse \
-i /project/atmosphere/tweets/output \
-o /project/atmosphere/tweets/vectorized -ow \
-chunk 200 -wt tfidf -s 5 -md 5 -x 90 -ng 2 -ml 50 -seq -n 2
Calculate vector representation for text
{10:0.6292275202550768,14:0.7772211575566166}
{10:0.6292275202550768,14:0.7772211575566166}
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
{17:1.0}
{3:0.37796447439954967,14:0.37796447439954967,19:0.654653676423271,22:0.534522474858859}
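The sparse vectors above map term indices to weights, and each has unit length (the squared weights sum to 1), which is what 2-norm normalization produces. The sketch below shows the idea in plain Python using the textbook weighting `tf * log(N / df)`. That formula, the whitespace tokenizer, and the sample documents are assumptions for illustration; Mahout's own weighting and n-gram handling may differ in detail.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term_index: weight} vector per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = {term: index for index, term in enumerate(sorted(df))}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {vocab[t]: count * math.log(float(n_docs) / df[t])
                   for t, count in tf.items()}
        # Terms present in every document get weight 0 and are dropped
        weights = {i: w for i, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({i: w / norm for i, w in weights.items()} if norm else {})
    return vectors

# Hypothetical per-hashtag texts
docs = ["linux ubuntu linux", "linux opensuse", "linux fitness zumba"]
vectors = tfidf_vectors(docs)
for vector in vectors:
    print(vector)
```

Because "linux" appears in every sample document, it carries no discriminating weight and disappears from all three vectors.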
![Page 47: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/47.jpg)
Tweets
It’s time to start clustering
Let’s find 100 clusters
mahout kmeans \
-i /tweets_5/vectorized/tfidf-vectors \
-c /tweets_5/kmeans/initial-clusters \
-o /tweets_5/kmeans/output-clusters \
-cd 1.0 -k 100 -x 10 -cl -ow \
-dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
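As a rough picture of what a k-means run does with these parameters, here is a minimal Lloyd-style sketch in Python. It assumes dense points and caller-supplied initial centers, and it treats `delta` as a threshold on how far centers move between iterations, which is only an approximation of the role the `-cd` option plays. The sample points are hypothetical, and this is not Mahout's distributed implementation.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda i: squared_euclidean(p, centers[i]))
            for p in points]

def kmeans(points, centers, max_iter=10, delta=1.0):
    centers = [tuple(float(x) for x in c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New center is the coordinate-wise mean of the cluster members
                new_centers.append(tuple(sum(col) / float(len(members))
                                         for col in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        shift = max(squared_euclidean(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= delta:
            break  # centers barely moved: treat as converged
    return centers, assign(points, centers)

# Hypothetical 2-D points forming two obvious groups
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers, labels)
```

With TF-IDF vectors the points would be sparse and high-dimensional, but the assign-then-recompute loop is the same.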
![Page 48: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/48.jpg)
Tweets
Glance at results
Cluster 1: BURN, FAT, WEIGHTLOSS, FITNESS, ZUMBA
Cluster 2: OPEN, SOFTWARE, LINUX, UBUNTU, OPENSUSE, PATCHING
Cluster 3: LEATHER, WALLET, MAN
![Page 49: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/49.jpg)
Tweets
It was easy because tags are strongly dependent on one another (co-occurrence).
![Page 50: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/50.jpg)
Tweets Bigger challenge – user clustering
LINUX UBUNTU WINDOWS OS PATCH MAC HACKED MICROSOFT
FREE CSRRACING WON RACEYOURFRIENDS ANDROID CSRCLASSIC
![Page 51: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/51.jpg)
Tweets Bigger challenge – user clustering
• Results show that the dataset is strongly skewed toward mobile and games
• The dataset wasn’t random: we subscribed to specific keywords
• OS result is great!
![Page 52: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/52.jpg)
Tweets HADOOP WORLD
run predictive machine learning algorithms on hadoop without even knowing mapreduce.: data scientists are very... http://t.co/gdmqm5g1ar
rt @mapr: google cloud storage connector for #hadoop: quick start guide now avail http://t.co/17hxtvdlir #bigdata
![Page 53: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/53.jpg)
Tweets HADOOP WORLD
Cloudera wants to do big data in real time.
Hortonworks wants to replace Cloudera through research.
![Page 54: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/54.jpg)
Visualize data
add jar hive-serdes-1.0-SNAPSHOT.jar;
create table tw_data_201404
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\012'
STORED AS TEXTFILE
LOCATION '/tweets/tw_data_201404'
AS SELECT v_date, LOWER(hashtags.text) AS tag, lang, COUNT(*) AS total_count
FROM logs.tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
WHERE v_date like '2014-04-%'
GROUP BY v_date, LOWER(hashtags.text), lang;
![Page 55: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/55.jpg)
Visualize data
add jar elasticsearch-hadoop-hive-2.0.0.RC1.jar;
CREATE EXTERNAL TABLE es_export (
  v_date string,
  tag string,
  lang string,
  total_count int,
  info string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'trends/log',
  'es.index.auto.create' = 'true'
);
![Page 56: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/56.jpg)
Visualize data
INSERT overwrite TABLE es_export
SELECT distinct may.v_date, may.tag, may.lang, may.total_count, 'nt'
FROM tw_data_201405 may
LEFT outer JOIN tw_data_201404 april ON april.tag = may.tag
WHERE april.tag is null AND may.total_count > 1;
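The query is an anti-join: it keeps May rows whose tag never appeared in April and that were counted more than once, marking each with the literal 'nt'. The same logic on in-memory rows might look like the Python sketch below; the row values and the `new_tags` helper are hypothetical and stand in for the two Hive tables.

```python
# Hypothetical rows: (v_date, tag, lang, total_count)
april = [
    ("2014-04-10", "hadoop", "en", 40),
    ("2014-04-11", "bigdata", "en", 25),
]
may = [
    ("2014-05-10", "hadoop", "en", 31),
    ("2014-05-10", "eurovisiontve", "es", 12),
    ("2014-05-11", "onceonly", "en", 1),
]

def new_tags(current, previous):
    """Rows of `current` whose tag is absent from `previous` and counted more than once."""
    seen = {row[1] for row in previous}
    # The set comprehension mirrors SELECT DISTINCT; 'nt' is the info column
    fresh = {row + ("nt",) for row in current if row[1] not in seen and row[3] > 1}
    return sorted(fresh)

export_rows = new_tags(may, april)
for row in export_rows:
    print(row)
```

The tag that already existed in April and the tag seen only once are both filtered out, leaving only the newly trending one.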
![Page 57: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/57.jpg)
Visualize data
![Page 58: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/58.jpg)
Visualize data Tag: eurovisiontve
![Page 59: Atmosphere 2014: Hadoop: Challenge accepted! - Arkadiusz Osinski, Robert Mroczkowski](https://reader036.vdocuments.pub/reader036/viewer/2022081400/554dd480b4c905c70e8b4a51/html5/thumbnails/59.jpg)
Thank you!
Questions?