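The 1st China Big Data Technology Innovation and Entrepreneurship Competition: Keyword Industry Classification
Team ThuFit: Zhou Xinyu (周昕宇), Wu Yuxin (吴育昕), Ren Jie (任杰), Wang Yuqi (王禺淇), Luo Hongyin (罗鸿胤). Advisors: Fang Zhanpeng (方展鹏), Tang Jie (唐杰)
Tsinghua University, Future Internet Interest Team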
Given:
  Partially labeled keywords
  The first 10 search results for each keyword
  Keyword-buyer relationships
Goal: predict the classes of the unlabeled keywords
Task
keyword_class.txt
  10,787,584 keywords
  1,143,928 labeled (10.6%)
  9,963,062 unique keywords
  33 classes
keyword_users.txt
  23,942,643 entries
  Each entry is a keyword-buyer pair
keyword_titles.txt
  21,575,166 entries, but only 10,787,583 are non-empty
  Each entry consists of a keyword and its first 10 Baidu search results
Data summary
[Pie chart: keyword distribution, 11% labeled vs. 89% unlabeled]
Preprocessing: keyword segmentation
Feature extraction:
  Keyword segments
  Keyword-buyer relation
  Keyword-segment relation
  Search result utilization
Model: liblinear
Approach
Keyword segment
  A substring of a keyword
  A semantic unit
Segmentation
  Break a keyword into a set of segments
  Two ways:
    Exact segmentation: 清华大学 => 清华 / 大学
    Full segmentation: 清华大学 => 清华 / 大学 / 华大 / 清华大学
Jieba Chinese word segmentation: https://github.com/fxsjy/jieba
Keyword segmentation
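The difference between the two segmentation modes can be sketched without any external library. This is a toy stand-in, not jieba: the four-word dictionary `VOCAB`, the function names, and the `max_len=2` cap (chosen so the toy reproduces the split shown on the slide) are all assumptions made for illustration.

```python
def exact_segment(keyword, vocab, max_len=2):
    """Greedy forward matching: split the keyword into non-overlapping pieces."""
    segments, i = [], 0
    while i < len(keyword):
        # Try the longest allowed piece first; fall back to a single character.
        for size in range(min(max_len, len(keyword) - i), 0, -1):
            piece = keyword[i:i + size]
            if piece in vocab or size == 1:
                segments.append(piece)
                i += size
                break
    return segments


def full_segment(keyword, vocab):
    """Return every substring of the keyword that is a dictionary word."""
    found = []
    for start in range(len(keyword)):
        for end in range(start + 1, len(keyword) + 1):
            if keyword[start:end] in vocab:
                found.append(keyword[start:end])
    return found


# Hypothetical toy dictionary; a real segmenter ships a much larger one.
VOCAB = {"清华", "大学", "华大", "清华大学"}
exact = exact_segment("清华大学", VOCAB)
full = full_segment("清华大学", VOCAB)
```

Exact segmentation yields a partition of the keyword, while full segmentation yields overlapping candidates, so it produces more (and noisier) segments per keyword.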
Sparse representation of segments:
  Smoothed TF-IDF-based feature
  N-gram
  “End-gram”
Feature Extraction - segment
On this slide only: segment = term
Definition of will be given later
Feature Extraction - TFIDF
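The exact smoothing formula from this slide did not survive in the transcript, so the sketch below uses one common smoothed variant, idf = ln((1 + N) / (1 + df)) + 1, purely as an assumption. The function name and the toy documents are hypothetical; each "document" is the segment list of one keyword.

```python
import math
from collections import Counter


def tfidf_features(docs):
    """docs: one segment list per keyword. Returns one sparse dict per keyword."""
    n_docs = len(docs)
    # Document frequency: in how many keywords each segment occurs.
    df = Counter()
    for segments in docs:
        df.update(set(segments))
    features = []
    for segments in docs:
        tf = Counter(segments)
        vec = {}
        for seg, count in tf.items():
            # Assumed smoothing: add one to numerator and denominator.
            idf = math.log((1 + n_docs) / (1 + df[seg])) + 1.0
            vec[seg] = (count / len(segments)) * idf
        features.append(vec)
    return features


docs = [["清华", "大学"], ["北京", "大学"], ["轴承"]]
vectors = tfidf_features(docs)
```

Segments that occur in fewer keywords receive a larger weight, so in the first toy keyword 清华 outweighs 大学.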
N-gram
  Captures some structural information
Recall
  There are two ways of representing a segmented keyword: as a set, or as an ordered list (we adopt the ordered list)
2-gram
Limitations
  A large character set produces a large keyword set
  Noise
Reduced 2-gram
Feature Extraction - N-gram
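A minimal sketch of 2-grams over the ordered segment list follows. The function name and the sample segmentation are assumptions for illustration, and the "reduced 2-gram" variant is not covered because its definition is not preserved in the transcript.

```python
def segment_bigrams(segments):
    """Pair each segment with its successor, keeping the original order."""
    return list(zip(segments, segments[1:]))


# Hypothetical segmentation of one keyword.
segments = ["hj", "系列", "双锥", "混合机"]
bigrams = segment_bigrams(segments)
```

Because every adjacent pair becomes its own feature, the feature space grows much faster than with single segments, which matches the limitation noted on the slide.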
The end-gram (last segment) is more likely to carry discriminative information
Emphasize the last segment: append a character that does not appear elsewhere, e.g. “漢”
Example:
  rnu209e.tvp2 轴承
  “hj 系列双锥混合机市场调查报告”
Similarly we can define
Feature Extraction - End-gram
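The end-gram idea can be sketched as follows, assuming the marker character “漢” never occurs in any real segment. The function names are hypothetical, and treating the marked segment as one extra feature alongside the ordinary segments is an assumption about how the feature is consumed.

```python
END_MARK = "漢"  # marker from the slide; assumed absent from all segments


def end_gram(segments):
    """Return the last segment tagged with the end marker, or None if empty."""
    if not segments:
        return None
    return segments[-1] + END_MARK


def with_end_gram(segments):
    """Ordinary segments plus the end-gram as one additional feature."""
    marked = end_gram(segments)
    return list(segments) if marked is None else list(segments) + [marked]


features = with_end_gram(["rnu209e.tvp2", "轴承"])
```

Since the marked segment is a distinct string, a linear model can learn a separate weight for "轴承 at the end" versus "轴承 anywhere".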
Where is ? Experiments showed that adding it slightly degrades performance.
Feature Extraction
Keyword-buyer/segment relation
[Figure: layered graph with node columns B0-B3 (buyers), K0-K3 (keywords), S0-S3 (segments), K0-K3 (keywords), C0-C3 (classes)]
Keyword-buyer/segment relation
[Figure: the same graph, annotated with the class labels reaching each node]
S0: C2; S1: C3; S2: (none); S3: C2, C3
K0: C2; K1: (none); K2: (none); K3: C3
B0: C2, C3; B1: (none); B2: (none); B3: (none)
Keyword-buyer/segment relation
[Figure: the same graph, annotated with the class labels reaching each node]
S0: C2; S1: C3, C3; S2: C0; S3: C2, C3, C0, C3
K0: C2, C0; K1: (none); K2: C3; K3: C3
B0: C2, C3; B1: C0, C3; B2: (none); B3: (none)
Assumption: a buyer tends to buy keywords of similar classes
From the labeled data, obtain the class distribution of the keywords each buyer buys.
Each buyer has a 33-dimensional feature vector.
For each keyword, its feature vector is the average of the feature vectors of the buyers that buy this keyword.
Using only this feature we get an accuracy of 0.82.
Keyword-buyer relation
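The buyer-based feature described above can be sketched in a few lines. The toy data uses 4 classes instead of 33, and the function names, the pair list, and the all-zero fallback for keywords without any known buyer are assumptions made for illustration.

```python
from collections import defaultdict


def buyer_distributions(pairs, labels, n_classes):
    """Class distribution of the labeled keywords each buyer bought."""
    counts = defaultdict(lambda: [0] * n_classes)
    for keyword, buyer in pairs:
        if keyword in labels:
            counts[buyer][labels[keyword]] += 1
    return {b: [c / sum(row) for c in row] for b, row in counts.items()}


def keyword_feature(keyword, pairs, dists, n_classes):
    """Average the distributions of all buyers of `keyword`."""
    vecs = [dists[b] for k, b in pairs if k == keyword and b in dists]
    if not vecs:
        return [0.0] * n_classes  # assumed fallback: no labeled evidence
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Hypothetical keyword-buyer pairs and class labels (class indices 0..3).
pairs = [("K0", "B0"), ("K3", "B0"), ("K1", "B0"), ("K1", "B1"), ("K2", "B1")]
labels = {"K0": 2, "K3": 3, "K2": 3}
dists = buyer_distributions(pairs, labels, n_classes=4)
k1_feature = keyword_feature("K1", pairs, dists, n_classes=4)
```

Here the unlabeled keyword K1 inherits a distribution from its two buyers, B0 and B1, which is the low-dimensional relation feature the slide refers to.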
Keyword-buyer relation
[Figure: layered graph with node columns B0-B3 (buyers), K0-K3 (keywords), S0-S3 (segments), K0-K3 (keywords), C0-C3 (classes)]
We tried modeling buyers by the segments of the keywords they bought, and modeling keyword-keyword relationships through the segments they share.
Buyer -> Keyword -> Segment => Buyer -> Segment
We further introduced higher-order relations between buyers and keywords, but the improvements were small.
Keyword-buyer relation
Reverse the link between segments and keywords:
Keyword -> Segment => Segment -> Keyword
Keyword-segment relation
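Reversing the keyword-to-segment link amounts to building an inverted index, after which each segment can collect the classes of the labeled keywords that contain it. The edges and labels below are hypothetical, and the function names are assumptions; the slides do not spell out how the collected classes are turned into features.

```python
from collections import defaultdict


def invert(keyword_segments):
    """Turn keyword -> segments into segment -> keywords."""
    segment_keywords = defaultdict(list)
    for keyword, segments in keyword_segments.items():
        for segment in dict.fromkeys(segments):  # de-duplicate, keep order
            segment_keywords[segment].append(keyword)
    return segment_keywords


def segment_classes(segment_keywords, labels):
    """Classes of the labeled keywords that contain each segment."""
    return {
        seg: [labels[k] for k in keywords if k in labels]
        for seg, keywords in segment_keywords.items()
    }


# Hypothetical edges and labels, loosely modeled on the figure.
keyword_segments = {"K0": ["S0", "S3"], "K1": ["S3"], "K3": ["S1", "S3"]}
labels = {"K0": "C2", "K3": "C3"}
seg_kw = invert(keyword_segments)
seg_cls = segment_classes(seg_kw, labels)
```

An unlabeled keyword such as K1 can then look up the classes attached to its segments, in the same spirit as the buyer-based feature.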
Keyword-segment relation
[Figure: layered graph with node columns B0-B3 (buyers), K0-K3 (keywords), S0-S3 (segments), K0-K3 (keywords), C0-C3 (classes)]
Some weird keywords appear, matching /^[0-9a-zA-Z\-_]{1,}$/, shown here with their segments:
  1-1828169-5: 1 1828169 5
  1-1838143-0: 1 1838143 0
Their search results, e.g. for 1-1838143-0:
  1-1838143-0 全国供货商【 IC37 旗下站】 1-1838143-0 价格 |PDF ... IC 芯片 1-1838143-0 品牌、价格、 PDF 参数 - 电子产品资料 - 买卖 IC 网 PIC16C57-XT/SP145的 IC 、二极管、三极管查询 , 采购 PIC16C57-XT/SP... 原装进口连接器 TYCO 1-1838143-0 2000pcs 1005+ 现货 泰科Tyco431829-1 集成电路、连接器、接插件 AMP 欧式背板连接器崧晔达 _ 达价格 _ 优质崧晔达批发 / 采购 - 阿里巴巴 供应聚氯乙烯 _连接器 _ 供应聚 崧晔达价格 _ 优质崧晔达批发 / 采购 - 阿里巴巴 供应聚氯乙烯 _ 连接器 _ 供应聚氯乙烯批发 _ 供应聚氯乙烯供应 _ 阿里巴巴 上海金庆电子技术有限公司 限位开关 12 福州福铭仪器
Search Result Utilization
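The pattern above can be applied directly with Python's `re` module to flag code-like keywords. The helper names are hypothetical, and splitting on `-` and `_` is an assumption that happens to reproduce the segments listed on the slide.

```python
import re

# Pattern from the slide: only ASCII letters, digits, hyphens, and underscores.
CODE_LIKE = re.compile(r"^[0-9a-zA-Z\-_]{1,}$")


def is_code_like(keyword):
    """True for keywords that look like serial or part numbers."""
    return CODE_LIKE.match(keyword) is not None


def split_code(keyword):
    """Split a code-like keyword on hyphens and underscores."""
    return [part for part in re.split(r"[-_]", keyword) if part]


flagged = [k for k in ["1-1828169-5", "清华大学", "rnu209e.tvp2 轴承"] if is_code_like(k)]
parts = split_code("1-1838143-0")
```

Keywords flagged this way carry little meaning on their own, which is why their search results are the more useful signal.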
For normal keywords, the keyword itself carries semantic meaning.
Keywords with less semantic information are usually product serial numbers or domain-specific terminology, e.g. chemical element names.
The search results provide supplementary information that yields more accurate results on “weird” keywords.
However, such keywords did not seem to be included in the online test.
Search Result Utilization
Recall:
If we add one more term:
where is the search result of
Performance decreased because of the noise introduced.
Example:
  Keyword: “hj 系列双锥混合机市场调查报告”
  Search result: “ 混合设备 HJ 系列双锥混合机 - 常州市华欧干燥制粒设备有限公司 - ... 混合机 - 供应 HJ 系列双锥混合机 - 混合机尽在阿里巴巴 - 常州欧朋干燥 ... HJ 系列双锥混合机厂家 _ 价格 - 食品机械行业网 HJ 系列双锥混合机供应信息 , 常州市步群干燥设备有限公司 HJ 系列双锥混合机 _ 百度百科 HJ 系列双锥混合机 - 常州普耐尔干燥设备有限公司 HJ 系列双锥混合机价格 ( 江苏 常州 )- 盖德化工网 ...”
Search Result Utilization
Dimensionality: 200,000
Lower dimensionality brings better generalization ability.
Feature Statistics
Life is short, you need Python
Implementation
Liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
A Library for Large Linear Classification
L2-loss logistic regression
33 one-vs-all classifiers, one per class
Model
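The one-vs-all scheme can be illustrated with a tiny pure-Python logistic regression. This is not liblinear and not the team's code: it assumes the slide refers to L2-regularized logistic regression, trains with plain stochastic gradient descent, omits a bias term, and uses hypothetical hyperparameters and toy data with 3 classes instead of 33.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    """Dot product of a sparse weight dict and a sparse feature dict."""
    return sum(w.get(i, 0.0) * v for i, v in x.items())


def train_binary(samples, targets, epochs=200, lr=0.5, l2=1e-3):
    """SGD on logistic loss; the L2 penalty is applied lazily to active features."""
    w = {}
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            err = sigmoid(dot(w, x)) - y
            for i, v in x.items():
                wi = w.get(i, 0.0)
                w[i] = wi - lr * (err * v + l2 * wi)
    return w


def train_one_vs_all(samples, labels):
    """One binary classifier per class: that class versus all the others."""
    return {
        c: train_binary(samples, [1.0 if y == c else 0.0 for y in labels])
        for c in sorted(set(labels))
    }


def predict(models, x):
    """Pick the class whose classifier gives the highest score."""
    return max(models, key=lambda c: dot(models[c], x))


# Toy sparse samples: {feature_index: value}.
samples = [{0: 1.0}, {0: 1.0, 3: 1.0}, {1: 1.0}, {1: 1.0, 3: 1.0}, {2: 1.0}, {2: 1.0, 3: 1.0}]
labels = [0, 0, 1, 1, 2, 2]
models = train_one_vs_all(samples, labels)
```

Sparse dictionaries keep the sketch close to how a 200,000-dimensional feature vector would be stored, since only the non-zero features of each keyword are touched.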
We split the labeled data into a training set and a validation set.
All of the following results are local results; online test results are higher because more training data was used.
Because of the complexity of migrating our code to the Hadoop platform (mainly because we used third-party non-Java libraries), not all of the features above were employed in our final submission.
Experiments and Results
Experiments and Results
Feature vector constituents        Accuracy
Keyword-buyer relation             0.8194
Keyword-segment relation           0.9019
Keyword-buyer + ( + TFIDF)         0.9537
+ TFIDF                            0.9656
+ TFIDF                            0.9635
+ TFIDF                            0.9725
+ TFIDF                            0.9713
Analysis
Two types of features
Relation features:
  Use prior knowledge of class label information
  Low dimension
  May be biased toward the training data
TFIDF features:
  No class label information used
  High dimension
  Robust, good generalization ability
But a simple combination of the two does not work well
Ensemble methods may work around this problem.
Limitations
Thanks!