How to Build a Keyword Wizard
Agenda
● What Is a Keyword?
● Why Do We Need It?
● Word Relation & Word Representation
● How to Build This Wizard
● Live Demo
What Is a Keyword?
● Wikipedia: Keyword (computer programming), a word or identifier that has a particular meaning to the programming language
Word Relation Model
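Related keyword groups around 沖繩 (Okinawa):
● 琉球潛水 (Ryukyu diving), 沖繩潛水 (Okinawa diving)
● 沖繩機場 (Okinawa airport), 那霸機場 (Naha airport), 琉球機場 (Ryukyu airport)
● 琉球浮潛 (Ryukyu snorkeling), 沖繩浮潛 (Okinawa snorkeling)
● 沖繩水族館 (Okinawa aquarium), 琉球水族館 (Ryukyu aquarium), OKinawa 水族館 (Okinawa aquarium)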
Word Representation - One Hot Representation
Word     One-Hot Index
Apple    00000001
How      00000010
Are      00000100
You      00001000
I        00010000
Am       00100000
Fine     01000000
Book     10000000
Sentence: How Are You ? I Am Fine . Thank You
TF - Term Frequency: 01111110
You (00001000) AND I (00010000) = 00000000
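The one-hot table and the AND example above can be reproduced with a few lines of Python. This is a minimal sketch: the helper names (`one_hot`, `to_bits`, `sentence_vector`) are illustrative and not from the talk, and the sentence vector is built by OR-ing the one-hot codes, which is an assumption about how the slide's `01111110` was obtained.

```python
# Vocabulary in the slide's bit order: "Apple" is the lowest bit, "Book" the highest.
VOCAB = ["apple", "how", "are", "you", "i", "am", "fine", "book"]


def one_hot(word):
    """Return the one-hot code of a word as an int with exactly one bit set."""
    return 1 << VOCAB.index(word.lower())


def to_bits(code):
    """Format a code as an 8-character bit string, as on the slide."""
    return format(code, "08b")


def sentence_vector(tokens):
    """OR together the one-hot codes of every in-vocabulary token."""
    vec = 0
    for token in tokens:
        if token.lower() in VOCAB:
            vec |= one_hot(token)
    return vec


tokens = "How Are You ? I am Fine . Thank You".split()
print(to_bits(sentence_vector(tokens)))        # 01111110
print(to_bits(one_hot("You") & one_hot("I")))  # 00000000
```

The AND of any two different one-hot codes is always zero, so this representation carries no notion of similarity between words. The OR-ed vector also records only presence: "You" occurs twice in the sentence but sets its bit once.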
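Word Representation - Context Vector P(Wi|Context)
Contexts: 餐廳 (restaurant), 浮潛 (snorkeling), 美食 (fine food), 旅遊 (travel), 出國 (going abroad)
Word                      餐廳   浮潛    美食   旅遊   出國
沖繩 (Okinawa)            0.1    0.7     0.5    0.9    0.5
好吃 (delicious)          0.6    0.01    0.7    0.01   0.02
Okinawa                   0.2    0.5     0.2    0.8    0.7
喔伊西 (oishii)           0.3    0.002   0.8    0.02   0.03
Similar: 沖繩 and Okinawa
Similar: 好吃 and 喔伊西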
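One way to check the "Similar" pairs in the table is to compare the context vectors numerically. The sketch below uses cosine similarity, which is an assumption: the slide does not name a similarity measure. The vectors are copied from the table; `cosine` and `most_similar` are illustrative helper names.

```python
import math

# Context vectors copied from the slide.
# Column order: 餐廳, 浮潛, 美食, 旅遊, 出國
VECTORS = {
    "沖繩": [0.1, 0.7, 0.5, 0.9, 0.5],
    "好吃": [0.6, 0.01, 0.7, 0.01, 0.02],
    "Okinawa": [0.2, 0.5, 0.2, 0.8, 0.7],
    "喔伊西": [0.3, 0.002, 0.8, 0.02, 0.03],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Return the other word whose vector is closest to `word`."""
    others = [w for w in VECTORS if w != word]
    return max(others, key=lambda w: cosine(VECTORS[word], VECTORS[w]))


print(most_similar("沖繩"))  # Okinawa
print(most_similar("好吃"))  # 喔伊西
```

Unlike the one-hot codes, these vectors place 沖繩 next to Okinawa and 好吃 next to 喔伊西 even though the strings share no characters.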
Word2Vec
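A neural network is trained on the following task: given the preceding words of a short phrase, predict the word most likely to come next.
The projection layer produced as a by-product serves as the word vectors.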
https://www.tensorflow.org/versions/r0.8/tutorials/word2vec/index.html
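Example: 我想要去沖繩潛水 (I want to go diving in Okinawa). Given the preceding words 我想要去沖繩, the target word is 潛水 (diving).
Candidate next words: 打球 (play ball), 潛水 (diving), 睡覺 (sleep), 洗臉 (wash face), ...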
Word2Vec
● Google 2013 Release
● Open Source Project
● Two Layer Neural Network
● Another Toolkit : Gensim
● pip install --upgrade gensim
Major Process Flow
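Stages: Search Log, Word Collection, Article Selection, Content Extraction, Word Cutting, Build Model, Slack Integration
Word Cutting example: 花笠麵很好吃 → 花笠麵△很△好吃 (△ marks a word boundary; the sentence means roughly "Hanagasa noodles are delicious")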
Article Selection
4.4 Billion → Spam Classifier → Ranking → High Quality 500K Articles (2015 Q3 to Q4)
● pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
● pip install -U scikit-learn
● http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
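The slide does not say which model the spam classifier uses. As a rough illustration only, the sketch below trains a bag-of-words Naive Bayes filter with scikit-learn; the choice of model, the training texts, and the `keep_article` helper are all assumptions, not the talk's implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, hand-made training examples (1 = spam, 0 = normal article).
texts = [
    "buy cheap pills now",
    "cheap loans click now",
    "okinawa aquarium travel guide",
    "okinawa snorkeling travel diary",
]
labels = [1, 1, 0, 0]

# Word counts feed a multinomial Naive Bayes classifier.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(texts, labels)


def keep_article(text):
    """Return True when the classifier labels the text as a normal article."""
    return bool(classifier.predict([text])[0] == 0)


print(keep_article("okinawa travel"))   # True
print(keep_article("cheap pills now"))  # False
```

Chinese articles would need to be cut into space-separated tokens first, because `CountVectorizer` splits on whitespace and punctuation by default.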
Content Extraction
Page layout regions: Top, Side, Content Body, Side, Bottom
<div><p>沖繩哪裡好玩</p><p>美ら海水族館</p></div>
→ 沖繩哪裡好玩美ら海水族館 ("Where to have fun in Okinawa" followed by "Churaumi Aquarium")
● pip install beautifulsoup4
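The HTML-to-text step above can be sketched with Beautiful Soup. The `extract_text` helper is illustrative, and dropping `script` and `style` tags is an added assumption about what a content extractor would normally discard.

```python
from bs4 import BeautifulSoup

html = "<div><p>沖繩哪裡好玩</p><p>美ら海水族館</p></div>"


def extract_text(markup):
    """Strip tags and return the concatenated text of the fragment."""
    soup = BeautifulSoup(markup, "html.parser")
    # Assumption: scripts and styles never belong to the article body.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text()


print(extract_text(html))  # 沖繩哪裡好玩美ら海水族館
```

Selecting only the content body region of a page, as opposed to the top, side, and bottom regions, would need site-specific selectors that the slides do not show.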
Article Raw Data Preparation
Raw articles:
A1A2A3A4A5A6A7A8A9
B1B2B3B4B4B6B7B8B9
Z1Z2Z3Z4Z5Z6Z7Z8Z9
…..
Prepared data (one article per line, tokens separated by spaces):
A1 A2 A4 A5 A6 A7 A8 A9
B1 B2 B3 B4 B6 B7 B8 B9
Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 Z9
…..
Build Model - Co-occurrence
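The slide gives only the name of this model. One plausible reading, sketched below under that assumption, counts how often two terms appear in the same article and ranks related terms by that count. The toy articles and the function names `build_cooccurrence` and `related_terms` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical articles, already cut into tokens.
articles = [
    ["沖繩", "潛水", "水族館"],
    ["琉球", "潛水", "水族館"],
    ["沖繩", "機場"],
]


def build_cooccurrence(token_lists):
    """Count, for every unordered pair of terms, the articles containing both."""
    counts = Counter()
    for tokens in token_lists:
        # sorted(set(...)) gives each pair one canonical key per article.
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return counts


def related_terms(counts, term, topn=3):
    """Rank the terms that co-occur with `term`, most frequent first."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return scores.most_common(topn)


counts = build_cooccurrence(articles)
print(related_terms(counts, "潛水", topn=1))  # [('水族館', 2)]
```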
Term Database
● Search Log
● Major e-commerce sites (e.g. Alibaba)
○ Link1
○ Link2
● http://baseterm.com/
● Input method (IME) dictionaries
○ Dictionary / cracking
Search Log
Keyword                    URL                  Date      Click
好吃 (delicious)           http://xxx.xxx       20160520  33
好吃 (delicious)           http://zzz.zzz       20160520  22
日本旅遊 (Japan travel)    http://xxx.xxx       20160521  15
http://xxxx.xx.xxx         http://xxxx.xx.xxx   20160522  12121
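The last row of the table shows a URL typed as a keyword. A term collector would likely want to skip such rows; the sketch below does so while summing clicks per keyword. The filtering rule and the `collect_keywords` name are assumptions; the rows are the ones from the slide.

```python
from collections import Counter

# Rows copied from the slide: (keyword, url, date, click).
rows = [
    ("好吃", "http://xxx.xxx", "20160520", 33),
    ("好吃", "http://zzz.zzz", "20160520", 22),
    ("日本旅遊", "http://xxx.xxx", "20160521", 15),
    ("http://xxxx.xx.xxx", "http://xxxx.xx.xxx", "20160522", 12121),
]


def collect_keywords(log_rows):
    """Sum clicks per keyword, skipping keywords that are really URLs."""
    clicks = Counter()
    for keyword, _url, _date, click in log_rows:
        if keyword.lower().startswith(("http://", "https://")):
            continue  # assumption: URL-shaped queries are noise, not terms
        clicks[keyword] += click
    return clicks


print(collect_keywords(rows).most_common())  # [('好吃', 55), ('日本旅遊', 15)]
```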
Term Database
● Major e-commerce sites (e.g. Alibaba)
○ Link1
○ Link2
Word Cutting
● Word Cut Tool
○ Jieba : https://github.com/fxsjy/jieba
○ https://github.com/yanyiwu/cppjieba-serve
● C++ Jieba Server: ↑ 30x or more
● pip install jieba
Slack Integration
● Library
○ pip install slackbot
○ pip install slacker
● Get Bot Token
○ https://my.slack.com/services/new/bot
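A hedged sketch of how the bot could answer keyword queries with the `slackbot` package follows. The command pattern `similar <word>`, the `format_reply` helper, and the `lookup` placeholder are hypothetical. `slackbot` reads the bot token from an `API_TOKEN` entry in a `slackbot_settings.py` file next to the script.

```python
def format_reply(word, neighbors):
    """Format (term, score) pairs as a one-line chat reply."""
    if not neighbors:
        return "No related keywords found for {}".format(word)
    parts = ["{} ({:.2f})".format(term, score) for term, score in neighbors]
    return "{}: {}".format(word, ", ".join(parts))


def lookup(word):
    """Hypothetical placeholder: a real bot would query the trained model here."""
    return []


def run_bot():
    """Start the bot; needs slackbot installed and API_TOKEN configured."""
    # Imported lazily so the helpers above work without slackbot installed.
    from slackbot.bot import Bot, respond_to

    @respond_to(r"similar (.*)")
    def similar(message, word):
        # Reply in the channel where the bot was mentioned.
        message.reply(format_reply(word, lookup(word)))

    Bot().run()
```

Calling `run_bot()` starts the event loop. Keeping `format_reply` free of Slack imports makes the reply text easy to test on its own.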
Technology Software Stack
Components: NAS, Redshift, BigQuery, Article DB, Spark, Worker × 3, Jieba Server, Gensim Word2Vec, Flask, Jupyter, ScikitLearn, TensorFlow, Slack Bot