How to Build a Keyword Wizard (如何建置關鍵字精靈)



Post on 21-Apr-2017


Category:

Internet



TRANSCRIPT

How to Build a Keyword Wizard

Agenda
● What Is a Keyword?
● Why Do We Need It?
● Word Relation & Word Representation
● How to Build This Wizard
● Live Demo

What Is a Keyword?

● Wikipedia : Keyword (computer programming), word or identifier that has a particular meaning to the programming language

Why Do We Need It?

Advertisement Tags

Look at Me!

Relation Article Summary

Word Relation Model

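琉球潛水 (Ryukyu diving), 沖繩潛水 (Okinawa diving)

沖繩機場 (Okinawa airport), 那霸機場 (Naha airport), 琉球機場 (Ryukyu airport)

琉球浮潛 (Ryukyu snorkeling), 沖繩浮潛 (Okinawa snorkeling)

沖繩水族館 (Okinawa aquarium), 琉球水族館 (Ryukyu aquarium), Okinawa 水族館 (Okinawa aquarium)

沖繩 (Okinawa)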


Word Representation - Vector Space Model

One-Hot vs. Continuous Values

It is better for analysis

Very High Dimension

Word Representation - One Hot Representation

Word    One-Hot Index
Apple   00000001
How     00000010
Are     00000100
You     00001000
I       00010000
Am      00100000
Fine    01000000
Book    10000000

How are you? I am fine. Thank you.

TF - Term Frequency: 01111110

You (00001000) AND I (00010000) = 00000000
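The one-hot slide can be sketched in a few lines of Python. This is only an illustration, not the speaker's code: the vocabulary order is taken from the table, while the helper names `one_hot`, `encode`, and `bits` are made up here. It reproduces the sentence vector 01111110 and shows that two different one-hot words always AND to zero, so the representation carries no notion of similarity.

```python
# Vocabulary in the slide's order: "Apple" is the lowest bit, "Book" the highest.
VOCAB = ["Apple", "How", "Are", "You", "I", "Am", "Fine", "Book"]
INDEX = {word.lower(): i for i, word in enumerate(VOCAB)}


def one_hot(word):
    """Return the one-hot code of a word as an integer bit mask (0 if unknown)."""
    i = INDEX.get(word.lower())
    return 0 if i is None else 1 << i


def encode(tokens):
    """OR together the one-hot codes of all known tokens in a sentence."""
    mask = 0
    for token in tokens:
        mask |= one_hot(token)
    return mask


def bits(mask):
    """Format a mask as an 8-digit binary string, like the slide."""
    return format(mask, "08b")


tokens = ["How", "Are", "You", "I", "Am", "Fine", "Thank", "You"]
print(bits(encode(tokens)))                 # "Thank" is not in the vocabulary
print(bits(one_hot("You") & one_hot("I")))  # distinct words never overlap
```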

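Word Representation - Context Vector

P(Wi | Context)

Context columns: 餐廳 (restaurant), 浮潛 (snorkeling), 美食 (fine food), 旅遊 (travel), 出國 (going abroad)

Word      餐廳   浮潛    美食   旅遊   出國
沖繩      0.1    0.7     0.5    0.9    0.5
好吃      0.6    0.01    0.7    0.01   0.02
Okinawa   0.2    0.5     0.2    0.8    0.7
喔伊西    0.3    0.002   0.8    0.02   0.03

Row words: 沖繩 (Okinawa), 好吃 (delicious), 喔伊西 (phonetic spelling of the Japanese "oishii", delicious)

Similar: 沖繩 and Okinawa
Similar: 好吃 and 喔伊西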
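One common way to make "similar" concrete is cosine similarity between context vectors. The slide does not say which measure was used, so cosine is an assumption here, and the function name `cosine` is illustrative. The numbers are the ones from the table above.

```python
import math

# Context vectors from the slide; columns are 餐廳, 浮潛, 美食, 旅遊, 出國.
VECTORS = {
    "沖繩": [0.1, 0.7, 0.5, 0.9, 0.5],
    "好吃": [0.6, 0.01, 0.7, 0.01, 0.02],
    "Okinawa": [0.2, 0.5, 0.2, 0.8, 0.7],
    "喔伊西": [0.3, 0.002, 0.8, 0.02, 0.03],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


for left, right in [("沖繩", "Okinawa"), ("好吃", "喔伊西"), ("沖繩", "好吃")]:
    print(left, right, round(cosine(VECTORS[left], VECTORS[right]), 3))
```

The first two pairs score much higher than the third, which matches the "Similar" labels on the slide.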

Word Context Vector

Co-occurrence Matrix: Sparse & Large

n ~= 500K

Space ~= n*n
Time ~= n*n

GG!!
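A quick back-of-the-envelope check shows why n*n is a problem at n ~= 500K. The 4 bytes per cell (a 32-bit float in a dense matrix) is an assumption for illustration; the slide only gives n and the n*n growth.

```python
n = 500_000                # vocabulary size from the slide
cells = n * n              # one cell per (word, context word) pair
bytes_per_cell = 4         # assumption: dense float32 storage
total_bytes = cells * bytes_per_cell

print(f"cells: {cells:,}")
print(f"dense size: {total_bytes / 1e12:.1f} TB")
```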

Word2Vec

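Use a neural network to build the following model: given the preceding words of a short phrase, predict the word most likely to come next.

As a by-product, the resulting projection layer is the word vector.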

https://www.tensorflow.org/versions/r0.8/tutorials/word2vec/index.html

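我想要去沖繩潛水 (I want to go diving in Okinawa)

Context: 我想要去沖繩 (I want to go to Okinawa ...)
Candidate next words: 打球 (play ball), 潛水 (diving), 睡覺 (sleep), 洗臉 (wash face), ...
Predicted: 潛水 (diving)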
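The "predict the next word from the preceding words" setup can be sketched as a list of (context, target) training pairs. This is only a sketch of how such pairs might be built; it is not how Word2Vec is implemented internally. The segmentation of the example sentence and the helper name `next_word_pairs` are assumptions made here.

```python
def next_word_pairs(tokens, context_size):
    """Build (preceding words, next word) pairs from one tokenized sentence."""
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


# Assumed segmentation of 我想要去沖繩潛水.
tokens = ["我", "想要", "去", "沖繩", "潛水"]
pairs = next_word_pairs(tokens, context_size=4)

for context, target in pairs:
    print("".join(context), "->", target)
```

The last pair is the one on the slide: the context 我想要去沖繩 with the target 潛水.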

Word2Vec
● Released by Google in 2013
● Open Source Project
● Two-Layer Neural Network
● Another Toolkit: Gensim
● pip install --upgrade gensim


How to Get Hands-On?

Major Process Flow

Article Selection, Content Extraction, Word Cutting, Build Model

Word Collection: Search Log

Slack Integration

Word Cutting example: 花笠麵很好吃 (Hanagasa noodles are delicious) becomes 花笠麵△很△好吃

Article Selection

500K high-quality articles from 2015 Q3-Q4

4.4 Billion

Spam Classifier

Ranking

● pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl

● pip install -U scikit-learn
● http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Content Extraction

Page layout: Top, Side, Content Body, Side, Bottom

<div><p>沖繩哪裡好玩</p><p>美ら海水族館</p></div>

沖繩哪裡好玩美ら海水族館

(沖繩哪裡好玩: "Where to have fun in Okinawa"; 美ら海水族館: Churaumi Aquarium)

● pip install beautifulsoup4
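The HTML example above can be reproduced with Beautiful Soup roughly as follows. This is a minimal sketch that only pulls the text out of `<p>` tags. How the real pipeline separates the content body from the top, side, and bottom regions is not described on the slides, so that step is left out.

```python
from bs4 import BeautifulSoup

# The snippet from the slide, with the closing </div> in place.
html = "<div><p>沖繩哪裡好玩</p><p>美ら海水族館</p></div>"

soup = BeautifulSoup(html, "html.parser")

# Keep only the text of each paragraph and drop the markup.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
text = "".join(paragraphs)

print(text)
```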


Article Raw Data Preparation

A1A2A3A4A5A6A7A8A9
B1B2B3B4B4B6B7B8B9
…..
Z1Z2Z3Z4Z5Z6Z7Z8Z9

A1 A2 A4 A5 A6 A7 A8 A9
B1 B2 B3 B4 B6 B7 B8 B9
…..
Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 Z9

Build Model - Word2Vec


Term Database - Search Log

Search History

Filter & Counting

Term Collection

Search Log

Keyword              URL                  Date      Click
好吃                 http://xxx.xxx       20160520  33
好吃                 http://zzz.zzz       20160520  22
日本旅遊             http://xxx.xxx       20160521  15
http://xxxx.xx.xxx   http://xxxx.xx.xxx   20160522  12121

(好吃: delicious; 日本旅遊: Japan travel)

Term Database - Search Log by Count

Term Database - Search Log by Count/Len
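The "Filter & Counting" step and the two rankings can be sketched with the rows from the search log table. Two assumptions are made here: the filter simply drops keywords that are URLs, and "Count" means the summed Click column. The slides do not spell out either rule.

```python
from collections import Counter

# (keyword, url, date, clicks) rows from the search log slide.
ROWS = [
    ("好吃", "http://xxx.xxx", "20160520", 33),
    ("好吃", "http://zzz.zzz", "20160520", 22),
    ("日本旅遊", "http://xxx.xxx", "20160521", 15),
    ("http://xxxx.xx.xxx", "http://xxxx.xx.xxx", "20160522", 12121),
]


def is_valid_keyword(keyword):
    """Assumed filter: a pasted URL is not a useful keyword."""
    return not keyword.lower().startswith(("http://", "https://"))


counts = Counter()
for keyword, _url, _date, clicks in ROWS:
    if is_valid_keyword(keyword):
        counts[keyword] += clicks

# Ranking by total count, and by count divided by keyword length.
by_count = counts.most_common()
by_count_per_len = sorted(
    ((keyword, total / len(keyword)) for keyword, total in counts.items()),
    key=lambda item: item[1],
    reverse=True,
)

print(by_count)
print(by_count_per_len)
```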

Word Cutting
● Word Cut Tool
○ Jieba : https://github.com/fxsjy/jieba
○ https://github.com/yanyiwu/cppjieba-serve
● C++ Jieba Server: 30x or more speedup
● pip install jieba

Slack Integration
● Library
○ pip install slackbot
○ pip install slacker
● Get Bot Token
○ https://my.slack.com/services/new/bot

NAS

Technology Software Stack

Redshift BigQuery Article DB

Spark

Worker Worker Worker

Jieba Server

Gensim Word2Vec

Flask

Jupyter

ScikitLearn

TensorFlow

Slack Bot

LIVE Demo

Q&A

2016 PIXNET HACKATHON

8/13