한국어와 nltk, gensim의 만남

69
Ǵ)ľ œ NLTK, Gensim Ū ¾L ( ôƃ : ōȅ µø³ ƿǫnj ūǺǵ ę Ųd ȃģť« ǨȁǺč ĒǔÉǒ öĎDz7 ) PyCon Korea 2015 2015-08-29 Lucy Park ( ÞŦƂ ) 1 / 69

Upload: eunjeong-lucy-park

Post on 14-Jan-2017

8.121 views

Category:

Data & Analytics


4 download

TRANSCRIPT

Page 1: 한국어와 NLTK, Gensim의 만남

Ǵ)ľœ NLTK, GensimŪ ¾L(ôƃ: ōȅ µø³ ƿǫnj� ūǺǵ ę Ųd ȃģť« ǨȁǺč ĒǔÉǒ öĎDz7)

PyCon Korea 2015

2015-08-29Lucy Park (ÞŦƂ)

1 / 69

Page 2: 한국어와 NLTK, Gensim의 만남

�Ř KoNLPy³ ĮÍ ūƃ ǙūĬť«y ȃNJĕ öĎŧ Dz�J Š�DŽ�ś�³ 0¸ ę Ųmvw, ū� À� � ǵ ę Ųd� ŃJŘ? Ųğhm!

ǙūĬŪ ţÎǴ ŴŊľ ƪµ ǛdžƖ NLTK³ ȈŚDzÍ Öč ĵŪ QŚŧ ÿ±� ljĊDz�J Řļǵ ę Ų�, ǐǯ Ðx»ŧ ƖšDzd ǛdžƖ Gensimŧ ĄŚDzÍ ň  ÖŹūJ ÖčŇ Qź�ľ Ųd .ƹ, �d ǐǯ�ŧ ƧijR ę Ųm.

ū àǨŇčd Çż bag-of-words, document embedding � ƿǫnj� Öč³ Ŷ öĎǵ ę Ų� Dzd mĽǴ ǎĝǒŪ Ǩȁ áçŇsǺ Ćǡï Ȍ, KoNLPy, NLTK, Gensim �Ū �ūù µ³ ĥƃǴ)ľ Öč�Ň ŽŚǺïm.

KoNLPy: http://konlpy.orgNLTK: http://nltk.orgGensim: http://radimrehurek.com/gensim

2 / 69

Page 3: 한국어와 NLTK, Gensim의 만남

ÞŦƂ(a.k.a. lucypark,echojuliett, e9t)

�àDzd wūnj öĎ�.čŝsdz' wūnj¼ūl Ēnj ÞĄ"ƂsǴÚ) ƂƸŪ Ð� � ¾�d Ǘǥǧ ËäCAVEAT: KoNLPy ÈŭǍūUJust another yak shaver...

11:49 <@sanxiyn> 또다시 yak shaving의 신비한 세계11:51 <@sanxiyn> yak shaving이 뭔지 다 아시죠?11:51 <디토군> 방금 찾아보고 왔음11:51 <@mana> (조용히 설명을 기대중)11:51 <@sanxiyn> 나무를 베려고 하는데11:52 <@sanxiyn> 도끼질을 하다가11:52 <@sanxiyn> 도끼가 더 잘 들면 나무를 쉽게 벨텐데 해서11:52 <@sanxiyn> 도끼 날을 세우다가11:52 <@sanxiyn> 도끼 가는 돌이 더 좋으면 도끼 날을 더 빨리 세울텐데 해서11:52 <@sanxiyn> 좋은 숫돌이 있는 곳을 수소문해 보니11:52 <@mana> …11:52 <&홍민희> 그거 전형적인 제 행동이네요11:52 <@sanxiyn> 저 멀이 어디에 세계 최고의 숫돌이 난다고11:52 <@sanxiyn> 거기까지 야크를 타고 가려다가11:52 <@mana> 항상하던 짓이라서 타이핑을 할 수 없었습니다11:52 <@sanxiyn> 야크 털을 깎아서…11:52 <@sanxiyn> etc.

3 / 69

Page 4: 한국어와 NLTK, Gensim의 만남

àǨŇ$Dzň

ūū àǨ�àǨ� m®dm®d QŚQŚ

nľ/Öč³ ƿǫnj� ūǺǵ ę Ų� Dzd Ǩȁ áçKoNLPy, NLTK, GensimŪ nǴ(?) ĄŚçū àǨ³ �� JÍ Ý« ĒǔÉǒ öĎŧ ǵ ę Ųğhm!

ūū àǨ�àǨ� m®Ɩm®Ɩ ĶdĶd QŚQŚ

ǐǯ Ðx»GensimŇ ǥǶ�ľ ŲƖ¾, ŏeŦ Ů$� ȏ´ŧ ŢǺm®Ɩ Ķğhm.

ijĜƖ¾, ķ�µƔŇ sǴ đđǴ QŚy âƃsĤ ƍĎť« o· Ʀ�Öǿ�ŧ òƍđŘ.

ǛdžƖ�Ū đôĄǹy m®Ɩ Ķğhm.

àǨ³àǨ³ �ťÍč�ťÍč DzÍDzÍ ƊŦƊŦ ČČ

ľ�Ň īÆŧ ę Ųŧ;?

7LJĄǹ7LJĄǹ

ǁ�d ǙūĬ 3 7Ə

4 / 69

Page 5: 한국어와 NLTK, Gensim의 만남

1997[Ɩ5ť«ônj ļ 20[ ž...

5 / 69

Page 6: 한국어와 NLTK, Gensim의 만남

IBM �ú®�ú®� ƭĝ đ� ƩǮŐ �µ�µ ƻĝǙ«ǬƻĝǙ«Ǭ³ ū7m.

6 / 69

Page 7: 한국어와 NLTK, Gensim의 만남

ƭĝd Ï �Ɩ Ƃȃȅ� .ƹť«y ƵöȒ ƂŪǵ ę Ų7 �ÖŇ 0�ŧ ƿǫnjŇ� �±Ƹd �Ŧ ƃç Ĝm. DzƖ¾, Ą�ū ×ƭ³ ŭģDz� ÀDzd �ŧ Ƃȃȅ�.ƹť« ƂŪDz7d ĜƖ Ķ7 �ÖŇ ƿǫnjŇ� 0¡ g¨ŧ ƍd �y ĜƖ Ķm.

ƿǫnj� t ��ǺƖ7 ŢǺčdƂȃȅ�Ƃȃȅ� .ƹū.ƹū ŃdŃd ƖģŧƖģŧ ƿǫnjŇ�ƿǫnjŇ� ľ��ľ�� žožoǵǵ �ŭƖ�ŭƖ ķijĻ Ǵm.-- Bengio et al., Deep Learning 1Ź ïÖ Ūʼn, Book in preparation for MIT Press, 2015.

7 / 69

Page 8: 한국어와 NLTK, Gensim의 만남

ǎĝǒ³ "Ǩȁ"Dzd mĽǴ áçRepresentations of text

8 / 69

Page 9: 한국어와 NLTK, Gensim의 만남

ľ�� DzÍ (ƅ� Ńd wūnjŭ ǎĝǒŪǎĝǒŪ ŪÙŪÙ³ƿǫnj� Ŷ ūǺǵ ę Ųŧ;?

ƋƷŦ Ą�ū ůd ǎĝǒ, śƷŦ ƿǫnj� ůd ÝūUµ.DzƖ¾ ÝūUµŇd "ŪÙ Ƃí"� p�ŲƖ Ķm.

9 / 69

Page 10: 한국어와 NLTK, Gensim의 만남

ǎĝǒŪ ŪÙ³ ūǺǴm ţĄǴ ŪÙ³ �Ƙ ǎĝǒIµ Õŧ ę Ųm

{nľ, Öč} ǎĝǒ

≈⊂

10 / 69

Page 11: 한국어와 NLTK, Gensim의 만남

Çż, nľnľ³ ǨȁDzd �Ź ěŜ áçŦ,ūƘ(binary) Ǩȁç

Ŧ å nľŇ sǴ énj Ǩȁ énjŪ ǂ7d Ð� nľŪ setŭ žƭ ĄžŪ ǂ7 œ �mc¢Y Ơ dzŴ�Ŧ ū¥� ǨȁǴ énj³ íǑ one-hot vector�� Ǵmľ� Ą��Ŧ ū¡ áģŪ ŭǁ�ŧ 1-of-V ǁ�ū��y Ǵm

= , = , = , . . . =w�

⎢⎢⎢⎢⎢⎢

100⋮0

⎥⎥⎥⎥⎥⎥w��(��)

⎢⎢⎢⎢⎢⎢

010⋮0

⎥⎥⎥⎥⎥⎥w����

⎢⎢⎢⎢⎢⎢

001⋮0

⎥⎥⎥⎥⎥⎥w�

⎢⎢⎢⎢⎢⎢

000⋮1

⎥⎥⎥⎥⎥⎥

wi i|V |

11 / 69

Page 12: 한국어와 NLTK, Gensim의 만남

DzƖ¾ ūƘ Ǩȁçŧ ĄŚDzÍ nľnľ ţĄyţĄy³ ƂŪǵ ę Ńm.

ex: "ȄǏ"" "ÐǏ", "ȄǏ"" "�Ľū" ĄūŪ �µ� žô 0ť« �m.

( = ( = 0w �)⊤w�� w �)⊤w��

* BOWœ �ū nľ³ ūąŽŭ(discrete) nŢ« íd �ś³ ƖʼnƖʼn ǨȁǨȁ(local representation)ū�� Ǵm. �Ňč sû�d �\ŭ ööąą ǨȁǨȁ(distributed representation)y �Źǵ ŎƂ 12 / 69

Page 13: 한국어와 NLTK, Gensim의 만남

�Ŧ Ƭdzť« ÖčÖč³ ǨȁǴ �ū,bag of words(BOW)

nľ� ÖčŇ ƇźǴm/ĵǴm (ū àǨŇčd "term existance"�� ô´)

nľ� ÖčŇ nå ƇźǴm (a.k.a. term frequency (TF))

nľŇ ƑŘy³ ǵrDz� ÖčŇ �ŹǴ nľ ƑŘyŪ �ƑǷŧ (Ǵmex: term frequency–inverse document frequency (TF-IDF)

=d1binary

⎢⎢⎢⎢⎢⎢

110⋮0

⎥⎥⎥⎥⎥⎥

=d1tf

⎢⎢⎢⎢⎢⎢

310⋮0

⎥⎥⎥⎥⎥⎥

13 / 69

Page 14: 한국어와 NLTK, Gensim의 만남

0¡w !Ū Ƣšū UÔ ƾč ÖčÖč ţĄyţĄy ƖǨŪ ȋ"� �ľƘm.*

!Ū Ƣšū ƾƙę¬ wūnjŪ sôöŦ !Ū AƖŇ öǥDz� �m.Öč �µŪ ǢƢ� Ãś ŽľƖ� �.Öč ţĄy� "�ū�ū".

* ŴđǴ QŚŦ ū ú«0 ǥĝǒ Ʀ�: Vincent Spruyt, The Curse of Dimensionality in classification, 2014. 14 / 69

Page 15: 한국어와 NLTK, Gensim의 만남

"0£ Š�Y �Ŧ ǎĕ]Ù³ ȈŚDzd� ľ�Ř?"Š�Y(WordNet)Ŧ X is-a Yœ �Ŧ ĈŢľ(hypernym) $� �ŧ ƂŪDzd áǽĐ ûĚȇ 0�Ǭ(DAG; directed acyclic graph)űhm. (Ŏ: Ą"d "Ůūm.)

from nltk.corpus import wordnet as wn

wn.synsets('computer') # "computer"Ū ţĄľ³ ßȇ#=> [Synset('computer.n.01'), Synset('calculator.n.01')]

Ʀ�: Ǵ)ľ ľȎŪÙÁ(Š�Y)Ŧ ôąs �Ň ŪǺ �à/!��

15 / 69

Page 16: 한국어와 NLTK, Gensim의 만남

DzƖ¾ Š�YŦ mŨŪ nƀū Ųğhm:1. Ð� Śľ³ ǥǶDz§�d DzƖ¾, žÖ yÈŭ Śľ�Ŧ ¿ū ÿƄŲm.

from nltk.corpus import wordnet as wnwn.synsets('nginx')#=> []

2. Ɖ ĤƅľŇ sǴ Ƃí³ ȆíDz7 ŢǺčd DƏȒ ţƖíę³ ǺƒĻ Ǵm.

wn.synsets('yolo') # �: you only live once#=> []

3. �m� NLTK«d Ǵ)ľ Ɩšū ĵ�... (ƑŘ!)

wn.synsets('ƿǫnj', lang='kor')#=> WordNetError: Language is not supported.

16 / 69

Page 17: 한국어와 NLTK, Gensim의 만남

ij..ĵ��m. m² áç ŃJ?

17 / 69

Page 18: 한국어와 NLTK, Gensim의 만남

ƺ(³ íÍ 0 Ą�ŧ ĵm.ȔȔȓȕ

18 / 69

Page 19: 한국어와 NLTK, Gensim의 만남

nľŪ ƍêŧ íÍ 0 nľ³ ĵm.You shall know a word by the company it keeps.

-- ŀľdzŴ J.R. Firth (1957)

19 / 69

Page 20: 한국어와 NLTK, Gensim의 만남

ŎĢ 1: ü Ƽ ĵŇ Âd À?ĉ«Ŝ �ŴŭŪ ďū __ Dzm.Ļ, ôņ ƈ __ Ȓ DzŴ.>ÞDz� ƨĈ Ţ³ __ Dz� ƸśƖ Ķĸm.

4ťÍ Ƃq: [@G]

20 / 69

Page 21: 한국어와 NLTK, Gensim의 만남

ŎĢ 2: "ÃJh"�d nľ³ ijĢJŘ?��?

21 / 69

Page 22: 한국어와 NLTK, Gensim의 만남

0 ZĎ ÔĞ âƟŭƖ ÃJhÃJh« œč Ůŧ Dz�m� Ǵm.ćū�y ŲľĻ �ŧ ǙƖ ÃJhÃJh«Ļ ľ�� Dz�J?SO¾ ŲťÍ ǁ�Ŧ ÃJhÃJh«y ǵ ę Ųm� Ǵm.

22 / 69

Page 23: 한국어와 NLTK, Gensim의 만남

ĥƇDzd nľ;;

23 / 69

Page 24: 한국어와 NLTK, Gensim의 만남

mĢ ÀǺ, nľŪ ŪÙd Ǻr nľŪ ÖÄÖÄ(context)ū p� Ųm.

ÖÄ(context) := ƂǺƘ ((window) �d ÖŹ/Öč QŪ nľ�íǑ ōľŪ (Ŧ Ƌś 1-8� nľ, ư 3-17� nľŪ ÖÄŧ ïm*

� nľŪ ÖÄū ûĠDzÍ "ŪÙŽ"ť« ţĄǴ nľex: "PyCon"Ŧ "R"ím "Python"Ň �?m?

* Jurafsky, Martin, Speech and Language Processing 3rd Ed. 19.1Ź, Book in preparation, 2015. 24 / 69

Page 25: 한국어와 NLTK, Gensim의 만남

"Co-occurrence"� nľ� ƂǺƘ ( QŇč }ĢŇ �ŹǶ

25 / 69

Page 26: 한국어와 NLTK, Gensim의 만남

Co-occurrence³ ƂŪDzd � �Ɩ áçdocuments = [["Jd", "ǙūĬ", "ū", "Ɗm"], ["Jd", "R", "ū", "Ɗm"]]

1. Term-document matrix ( ): Ǵ ÖčŇ �ū �ŹDzÍ ûĠǴ nľComputationū ÖčŪ �ę Ň ûª(!)

x_td = [[1, 1], # Jd [1, 0], # ǙūĬ [0, 1], # R [1, 1], # ū [1, 1]] # Ɗm

1. Term-term matrix ( ): nľ� ÖÄ QŇ �ū ƇźDzÍ ûĠǴ nľÖÄŪ 9ū� ƞŧę¬ syntactic, 9ę¬ semantic Ƃí³ ǥǶĹ�« nľ³ � �ı íd �ś (ÖÄŪ 9ū==5):

x_tt = [[0, 1, 1, 2, 0], # Jd [1, 0, 0, 1, 1], # ǙūĬ [1, 0, 0, 1, 1], # R [2, 1, 1, 0, 2], # ū [0, 1, 1, 2, 0]] # Ɗm

∈Xtd ℝ|V |×M

M

∈Xtt ℝ|V |×|V |

26 / 69

Page 27: 한국어와 NLTK, Gensim의 만남

Co-occurrence matrix³ Ųd 0s« ūŚǺy nľ ţĄy³ (ǵ ęd Ųm.

import math

def dot_product(v1, v2): return sum(map(lambda x: x[0] * x[1], zip(v1, v2)))

def cosine_measure(v1, v2): prod = dot_product(v1, v2) len1 = math.sqrt(dot_product(v1, v1)) len2 = math.sqrt(dot_product(v2, v2)) return prod / (len1 * len2)

DzƖ¾: �ū UÔ skewed �ľ Ų� (Ɠ, üy `Ŧ nľœ PŦ nľŪ �Ƣ� Dž)ƂíĐ PŦ nľ �ÖŇ discriminativeDzƖ ĶŨ (ex: list('Ŧdū�'))

ţĄy Ƃí� p8 nľ énj³ (Dzd t JŦ áçŦ?

27 / 69

Page 28: 한국어와 NLTK, Gensim의 만남

Neural embeddingsijū�ľ: ÖÄŇ Ųd nľ³ ŎƷDz�!

ŀľ Ðx(language model) ȈŚex: ["Jd", "ǙūĬ", "ū", "Ɗm"] mŨŇ Ø� Jő;? (ex: "!", "." "?")

0µ�, dzğǵ � c¢Yŧ ĮŴ.

Neural network language model (NNLM)*

* Bengio, Ducharme, Vincent, A Neural Probabilistic Language Model, {NIPS, 2000; JMLR, 2003}. 28 / 69

Page 29: 한국어와 NLTK, Gensim의 만남

word2vec *ŘƔ ǸǴ T. MikolovŪ word2vecy neural embeddingƉ �ą ǒ¶ŧ ŽŚǺč ƿǫǍūĔū Ā�ƚ! (nŮ -> mĢ)

* Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and theirCompositionality. In Proceedings of NIPS, 2013. 29 / 69

Page 30: 한국어와 NLTK, Gensim의 만남

ŎĢ: õǴ Ĥ[ĄŇ word2vecŧ ŽŚ *

* ÞƉȑ, ÞŦƂ, ƅ}Ə, õǴ Ĥ[Ą(1946-2015)Ň sǴ Ŵ}ȅ� ǎĝǒ öĎ, Ǵ)ƂƸdzȉí ƃ49ƛ ƃ2Ȅ, 2015. 30 / 69

Page 31: 한국어와 NLTK, Gensim의 만남

0£ ÖčÖčŇ sǴ neural embeddingŦ?

31 / 69

Page 32: 한국어와 NLTK, Gensim의 만남

doc2vec *ijū�ľ: Öč(�d Ön) énj³ ¼Ƹ nľŭ Ľ dzğĢdžŴ!Źƀ

Ƣšū Ň ûǺ ȍIJ ŽľƘm.nľ énjœ �Ŧ !Ň Öč³ žĄǵ ę Ųm.

|V |

* Le, Mikolov, Distributed Representations of Sentences and Documents, ICML, 2014. (šÖŇčd doc2vec sĤ paragraph vector�d ū´ŧ ĄŚǶ) 32 / 69

Page 33: 한국어와 NLTK, Gensim의 만남

ŎĢ: Term frequency -> Öč énjŪ ǂ7� nľŪ ę œ �Ũ. šĕ�Ŧ ĽŪ Ƃę.

documents = [ ['ŕŴ', '�', '!ƍ', '³', 'ƊijǴm'], ['!ƍ', '�', 'ŕŴ', '³', 'ƊijǴm'], ['ĢZ', '�', 'ŕŴ', '³', 'ĦľǴm'], ['!ƍ', '�', 'ĢZ', '³', 'ĦľǴm'], ['ĢZ', '�', 'ŕŴ', '³', 'zĆǴm'], ['ĢZ', '�', '!ƍ', '³', 'zĆǴm']]words = set(sum(documents, []))print(words)# => {'�', '!ƍ', 'zĆǴm', '³', 'ĢZ', 'ĦľǴm', 'ŕŴ', 'ƊijǴm'}

def term_frequency(document): # do something return vector

vectors = [term_frequency(doc) for doc in documents]print(vectors)# => {# 'S1': [1, 1, 0, 1, 0, 0, 1, 1],# 'S2': [1, 1, 0, 1, 0, 0, 1, 1],# 'S3': [1, 0, 0, 1, 1, 1, 1, 0],# 'S4': [1, 1, 0, 1, 1, 1, 0, 0],# 'S5': [1, 0, 1, 1, 1, 0, 1, 0],# 'S6': [1, 1, 1, 1, 1, 0, 0, 0]# }

|V |

33 / 69

Page 34: 한국어와 NLTK, Gensim의 만남

ŎĢ: doc2vec -> Öč énjŪ ǂ7� nľŪ ę ím ŵŨ. šĕ�Ŧ ĥę.

documents = [ ['ŕŴ', '�', '!ƍ', '³', 'ƊijǴm'], ['!ƍ', '�', 'ŕŴ', '³', 'ƊijǴm'], ['ĢZ', '�', 'ŕŴ', '³', 'ĦľǴm'], ['!ƍ', '�', 'ĢZ', '³', 'ĦľǴm'], ['ĢZ', '�', 'ŕŴ', '³', 'zĆǴm'], ['ĢZ', '�', '!ƍ', '³', 'zĆǴm']]words = set(sum(documents, []))print(words)# => {'�', '!ƍ', 'zĆǴm', '³', 'ĢZ', 'ĦľǴm', 'ŕŴ', 'ƊijǴm'}

def doc2vec(document): # do something else (�Ňč Gensimū ǺƐ �!) return vector

vectors = [doc2vec(doc) for doc in documents]print(vectors)# => {# 'S1': [-0.01599941, 0.02135301, 0.0927715],# 'S2': [0.02333261, 0.00211833, 0.00254255],# 'S3': [-0.00117272, -0.02043504, 0.00593186],# 'S4': [0.0089237, -0.00397806, -0.13199195],# 'S5': [0.02555059, 0.01561624, 0.03603476],# 'S6': [-0.02114407, -0.01552016, 0.01289922]# }

|V |

34 / 69

Page 35: 한국어와 NLTK, Gensim의 만남

ŘļnľŪ ÖÄŧ ūŚǺč nľ³ ǨȁǴ ę Ųm.ǎĝǒŪ ŪÙ³ énj« ǨȁDzd Ï �Ɩ áç:

Sparse, long vectors Dense, short vectorsnľ 1-hot-vectors word2vec

Öč bag-of-words(term existance, TF, TF-IDF �) doc2vec

ūƃ ū­ŧ ŬȂťh ĥžť« �íŴ.

35 / 69

Page 36: 한국어와 NLTK, Gensim의 만남

Ǵ)ľ ōȅ µø ĒǔÉǒ öĎSentiment analysis for Korean movie reviews

36 / 69

Page 37: 한국어와 NLTK, Gensim의 만남

ŷ<, wūnjd ľ�č (DzƖ?

37 / 69

Page 38: 한국어와 NLTK, Gensim의 만남

Ǵ)ľ ōȅ µø ǐū wūnj !�ǐū wūnj� ¿ijĻ dzÖū àžǴm!

38 / 69

Page 39: 한국어와 NLTK, Gensim의 만남

Naver sentiment movie corpus v1.0Maas et al., 2011 wūnjē Ʀ�wūnj ƴƪ: Xūäōȅ r 100�Ū 140Ŵǣ(ūDz 'µø')ŧ Ư"DzƖ ĶŨư 20¾ � µø (ęƛ� 64¾� Ƒ ċǭ»)

ratings_train.txt: 15¾, ratings_test.txt: 5¾6Ƃ/ôƂ µøŪ ûŤŧ }ŮDz� ċǭ» (i.e., random guess yields 50%accuracy)Ƒº µød ǥǶDzƖ ĶŨ

ǂ7: 19MBURL: http://github.com/e9t/nsmc/

39 / 69

Page 40: 한국어와 NLTK, Gensim의 만남

READMEd coming soon

40 / 69

Page 41: 한국어와 NLTK, Gensim의 만남

$ cat ratings_train.txt | head -n 25id document label9976970 ij tþ.. ƘƝ ƝƕJXŘ Ñĕµ 03819312 Ȑ...ǥĝnjí� Ư�ōȅƐ....ŏäŊ7ƅƢ �ìƖ Ķ(J 110265843 UÔźȖŅm0�číd�ŧƲƫǴm 09045019 'yĕ ūĻ7(Ç ..ĘƗȒ źÙd Ńm..ǣƀ ƅƂ 06483659 ĄūÒǟ0Ū ŬĆĝ¡ Ŋ7� {íŌv ōȅ!ĝǙūtÅŇč fľíū7¾ ǻv ƾĝǕ vĝǒ� UÔJy ūāíŌm 5403919 ½ �Ũ¼ � 3đônj Ư�dz' 1dz[Čŭ 8ĆŚōȅ.���...ëß�y ij;Ş. 07797314 šŵŪ 8Ź�ŧ ƃs« ƧQƖÓǻm. 09443947 ë ß�y ij?m řJŐm ūũ� 9Śś Ŋ7ČȈūÏ[ŭƖ..ƂÀ à«Ǻy 0�ín N�m MƸ.�5¾ßîßî..ū��¼d �ƆyŃm Ŋ7ÓDzdĄ�¾ÐŋX 7156791 ĺĔū Ńdwy źÙ Ųd Ïĵ�d ōȅ 15912145 Ŗǀ ǣƀū PŦ�w? C ð¾Ǵw.. Ȁµś�ģ ȅ§ǶŇ¾ UÔ 9�ňƄ ŲJ? 9008700 �ŭǮhǒ�Ɵūm.ƘƝƟūm♥ 110217543 ð�¼m a×Jč Ǝ�m90[sŪ ǽęŴ1!!ǾƘȄd �ĐſƃÊ«Ū oŭūm~ 5957425 ŝÍč ė�� Ȋníy �V� �ƮJőĂ ūæę Ŋ7 �¤�ÓǺ 08628627 pãDz� =FǺč Ɗm. ĤÖ7Ą«¾ ím íÍ ŴD ųľä·m. 0�y Ą�ūŅmd �ŧ. 9864035 ƶǽŦ ƇƑǴmƖ¾ ƘƝ QČŇ 1ŹŇč ï ōȅƑ �Ź ]Ż ]�}Ű ĝǐµy ľ�Ɩ� �}y ľ�Ɩ 6852435 �T Ãå 8Ź�� źÛŨ�� 19143163 Ʀ Ą�� ş8� Ýĝǁ� ū7Í �ĝǁ�� ;�Ýû� ū7Í ijū|ū�� <m.0T ;�ĩľč ĵoK�ƪ£ íŭm 4891476 ,Ýū ¦i Ǩſŭ�Ŧ ūǺDzdw Ŗ �« �ę¬ źÙŃľƖS 07465483 ū� ƂÀ @ķ ƽĝǘ" ƙǞDzƖĶŦ ą�Ǵ QŚ(Đū Ŷ äÔ Ƙ @ķŮ�!!♥ 3989148 ļLjŴ³ ŢǴ êÎ, ū�. ż_�Ŧ ƣǴ_� ſs iji�Ř. 14581211 J´ ħŏǴ �y Ųd �. 0T dzČū ďČ" ^ijJd ōȅd ſs ijk 12718894 íÍč şƖ Ķd � ÷�gDzm 19705777 źÙŃm Ɩ®Dz�. �Ŧ Ũģ ōȅŭwy ÝèǒŪ ¾ƤDz� W ƢūL....ÝèǒŪ ¾ƤŦ ūĻ7y Ų� Ũģ ídźÙy Ųdw ; ū� ð�Ńm Ũģy ë« ĵJŏ�, ǰ�� Ǫ�ū�y (�ǵ�dw 0�y ë« ĵJŒ �� 471131 ſs ǣæǴ ōȅ� iji ęŵū�d� Àİ�ºhm. 1

41 / 69

Page 42: 한국어와 NLTK, Gensim의 만남

Ŵ, 0£ ï�Žť« ĢŵǺð;Ř?1. KoNLPy« wūnj žƪµ2. NLTK« wūnj ljĊ3. Bag-of-words(term existence)« Öč ǨȁDz� classify4. Gensimť« doc2vecť« Öč ǨȁDz� classify

42 / 69

Page 43: 한국어와 NLTK, Gensim의 만남

1. Data preprocessing (feat. KoNLPy)wūnj ů7

def read_data(filename): with open(filename, 'r') as f: data = [line.split('\t') for line in f.read().splitlines()] data = data[1:] # header ƃŗ return data

train_data = read_data('ratings_train.txt')test_data = read_data('ratings_test.txt')

NOTE: pandas.read_csv()³ ĄŚǺč wūnj³ ůŧ �śŇd quoting parameter³csv.QUOTE_NONE (3)« ÎĢǺƍƖ ĶťÍ ǙĨū ƃs« ĵ � (ex: µø ĹôöŇĪ�ŒǨ� Ųd �ś)

43 / 69

Page 44: 한국어와 NLTK, Gensim의 만남

1. Data preprocessing (feat. KoNLPy)wūnj� ƃs« ůȂdƖ Ȇŭ

# row, columnŪ ę� ƃs« ůȂdƖ Ȇŭprint(len(train_data)) # nrows: 150000print(len(train_data[0])) # ncols: 3

print(len(test_data)) # nrows: 50000print(len(test_data[0])) # ncols: 3

44 / 69

Page 45: 한국어와 NLTK, Gensim의 만남

1. Data preprocessing (feat. KoNLPy)ȃNJĕ« ǐǂJūƜ

from konlpy.tag import Twitterpos_tagger = Twitter()

def tokenize(doc): # norm, stemŦ optional return ['/'.join(t) for t in pos_tagger.pos(doc, norm=True, stem=True)]

train_docs = [(tokenize(row[1]), row[2]) for row in train_data]test_docs = [(tokenize(row[1]), row[2]) for row in test_data]

# Ŷ �ľ�dƖ Ȇŭfrom pprint import pprintpprint(train_docs[0])# => [(['ij/Exclamation',# 'tþ/Noun',# '../Punctuation',# 'ƘƝ/Noun',# 'Ɲƕ/Noun',# 'Jm/Verb',# 'Ñĕµ/Noun'],# '0')]

NOTE: ĝǂºǒ³ ĥǼǵ �¼m ȃNJĕ öĎŧ DzƖ Ķijy �y¬ öĎ �"³ǙŮ« żŹDz� �d �y Ɗğhm. 45 / 69

Page 46: 한국어와 NLTK, Gensim의 만남

1. Data preprocessing (feat. KoNLPy)"0¡w ȃNJĕ« A JbĻDzJŘ?"

wūnj� ƂÀ ƵöDzmÍ, ľſ nŢ«y öĎ �gDz�Ɩ¾ śµd Ǻr xǴǢ Ʊ2 Ȓǒ³ ƺ Ģŭc¢Ŧ Ũſ nŢŪ öĎ *ǐǃŪ nŢd ďNjŪ Öƃ

"A ǩĄ(POS) NJ0³ ôƣǺĻDzJŘ?"0�y ďNjŪ ÖƃDzƖ¾ ǩĄ³ NJ:Ǻ�Í }ŨūŪľ³ (öǵ ę Ųmd Źƀū ŲŨ

ex: 'Ŧ/Josa' vs 'Ŧ/Noun'

* https://github.com/karpathy/char-rnn 46 / 69

Page 47: 한국어와 NLTK, Gensim의 만남

2. Data exploration (feat. NLTK)śµŪ ǁǝĝ� ľ� ǓƜ�ŧ �Ɩ� ŲdƖ ĆǡñĢm!Training dataŪ token Ðť7

tokens = [t for d in train_docs for t in d[0]]print(len(tokens))# => 2194536

47 / 69

Page 48: 한국어와 NLTK, Gensim의 만남

2. Data exploration (feat. NLTK)nltk.Text()³ īíŴ! (Explorationū ǢǶ)

š� nŮ document³ öĎǵ �d ȍIJ ǪôǴ 7g�ŧ ĄŚǵ ę ŲƖ¾,ň7čd 0 Ƒ Ï �Ɩ¾ ĆǡñĢm.

import nltktext = nltk.Text(tokens, name='NMSC')print(text)# => <Text: NMSC>

48 / 69

Page 49: 한국어와 NLTK, Gensim의 만남

2. Data exploration (feat. NLTK)Count tokens

print(len(text.tokens)) # returns number of tokens# => 2194536print(len(set(text.tokens))) # returns number of unique tokens# => 48765pprint(text.vocab().most_common(10)) # returns frequency distribution# => [('./Punctuation', 68630),# ('ōȅ/Noun', 51365),# ('Dzm/Verb', 50281),# ('ū/Josa', 39123),# ('ím/Verb', 34764),# ('Ū/Josa', 30480),# ('../Punctuation', 29055),# ('Ň/Josa', 27108),# ('�/Josa', 26696),# ('ŧ/Josa', 23481)]

49 / 69

Page 50: 한국어와 NLTK, Gensim의 만남

2. Data exploration (feat. NLTK)Plot tokens by frequency

text.plot(50) # Plot sorted frequency of top 50 tokens

ľ� 3Ŵ @ƚ;;

50 / 69

Page 51: 한국어와 NLTK, Gensim의 만남

2. Data exploration (feat. NLTK)ū� Ǧǒ Öƃ!

Troubleshooting: For those who see rectangles instead of letters in the saved plot file,include the following configurations before drawing the plot:

Some example fonts:Mac OS: /Library/Fonts/AppleGothic.ttf

from matplotlib import font_manager, rcfont_fname = 'c:/windows/fonts/gulim.ttc' # A font of your choicefont_name = font_manager.FontProperties(fname=font_fname).get_name()rc('font', family=font_name)

51 / 69

Page 52: 한국어와 NLTK, Gensim의 만남

2. Data exploration (feat. NLTK)

Ĺ ŹǨ ǁ� ĥǼ Ȍ. ,.Ůn 0¹ŧ 0µ� íh ōȅŇ sǴ ǁǝĝ�d� ȆĥȒ ķ�m.

52 / 69

Page 53: 한국어와 NLTK, Gensim의 만남

2. Data exploration (feat. NLTK)Collocations (Ŋľ): ŭƁDz� üåDz� �ŹDzd nľ (ex: "text" + "mining")

text.collocations() # Ģū ƈ ŏ� �¸ęy# => ū/Determiner �/Noun; Ž/Suffix ŭ/Josa; ū/Determiner �/Noun; ĵ/Noun# �m/Verb; �/Noun Ŧ/Josa; 10/Number ƀ/Noun; âś/Noun �/Suffix; ę/Noun# Ųm/Adjective; ū/Noun �/Josa; Q/Noun �/Josa; Ʊ�/Noun Ū/Josa; X/Suffix# Ř/Josa; ū/Noun ōȅ/Noun; H/Noun ;Ɩ/Josa; �/Suffix ū/Josa; ò/Noun# y/Josa; �Ö/Noun Ň/Josa; Ž/Suffix ť«/Josa; Ą�/Noun �/Suffix; ōȅ/Noun# ³/Josa

NOTE: nltk.Text()Ň sǴ t ŴđǴ Ƃíd, source codeJ APIŇč ð ę Ųm. ōÖ" Ǵ3Ū }Ģ ŽŚ Ŏd ū »ǂ Ȇŭ.

53 / 69

Page 54: 한국어와 NLTK, Gensim의 만남

3. Sentiment classification with term-existance*ň7čd nDz� termū ÖčŇ ƇźDzdƖŪ ţÔŇ �� ö¯³ ǺíŴ.

# ň7čd Ʊüy nľ 2000�³ ǮƮ« ĄŚ# WARNING: ěŜ ūǺ³ ŢǴ ǁ�ūÌ time/memory efficientDzƖ Ķğhm

selected_words = [f[0] for f in text.vocab().most_common(2000)]

def term_exists(doc): return {'exists({})'.format(word): (word in set(doc)) for word in selected_words}

# Ģ nƳŧ ŢǴ Bę« training corpusŪ Ůô¾ ĄŚǵ ę ŲŨtrain_docs = train_docs[:10000]

train_xy = [(term_exists(d), c) for d, c in train_docs]test_xy = [(term_exists(d), c) for d, c in test_docs]

Try this:

Ʊüy nľ�Ŧ ŪÙ� Ńŧ ę Ųm. ƱĈŢ 50�³ ƃ�Dz� öĎDzÍ?TF, TF-IDFŪ ĐgŦ?ǁ� ƱŽȅ(sparse matrix) or scikit-learnŪ TfidfVectorizer()³ ūŚ!

* t ŴđǴ QŚŦ Natural Language Processing with PythonŪ Chapter 6. Learning to classify text Ʀ� 54 / 69

Page 55: 한국어와 NLTK, Gensim의 만남

3. Sentiment classification with term-existanceūƃ ¼Ɩ½ n�. ö¯!mĽǴ ö¯7³ ĄŚǺíÍ źÜm.NLTKŇč ƃ!Dzd �¾y ň  �:

Naive Bayes ClassifiersDecision Tree ClassifiersMaximum Entropy Classifiers��.

scikit-learn � m² ǛdžƖŪ ö¯7y Ł¼�Ɩ ĄŚ �gǶ.ň7čd NLTKŪ Naive Bayes Classifier DzJ¾ ĄŚǺíŴ!

55 / 69

Page 56: 한국어와 NLTK, Gensim의 만남

3. Sentiment classification with term-existanceNaive Bayes classifier ŽŚ

# ƯnǶclassifier = nltk.NaiveBayesClassifier.train(train_xy)print(nltk.classify.accuracy(classifier, test_xy))# => 0.80418classifier.show_most_informative_features(10)# => Most Informative Features# exists(ęŵ/Noun) = True 1 : 0 = 38.0 : 1.0# exists(ƱĴ/Noun) = True 0 : 1 = 30.1 : 1.0# exists(♥/Foreign) = True 1 : 0 = 24.5 : 1.0# exists(]Ż/Noun) = True 0 : 1 = 22.1 : 1.0# exists(Oû/Noun) = True 0 : 1 = 19.5 : 1.0# exists(Į¦7/Noun) = True 0 : 1 = 19.4 : 1.0# exists(ňŜ/Noun) = True 1 : 0 = 18.9 : 1.0# exists(àŊ7/Noun) = True 0 : 1 = 16.9 : 1.0# exists(,/Noun) = True 1 : 0 = 16.9 : 1.0# exists(Ʊ�m/Noun) = True 1 : 0 = 15.9 : 1.0

6Ƃ/ôƂ Ƒ ijÔ�J ơľy Âƴ Ȇ°Ŧ ßßūh; baseline accuracyd 0.500Ň ûǺ śµd µø 1¾�¾ į �śŇy accuracy�� 0.80ū JŔm. J´ Đ!Ž?

56 / 69

Page 57: 한국어와 NLTK, Gensim의 만남

4. Sentiment classification with doc2vec (feat. Gensim)ūƃ Ĺč Ćǡï doc2vecť« µø³ 6Ƃ/ôƂť« ö¯ǺñĢm.Gensimŧ ĄŚǵ�ŇŘ. *

¾�ū: Radim Rehureck

* ň7čd Gensim 0.12.0ŧ ĄŚDzŌğhm. Ʀ�« łwūǒ� ȈàǴ �ūù µ� äžłū Ā�Ř! 57 / 69

Page 58: 한국어와 NLTK, Gensim의 만남

4. Sentiment classification with doc2vec (feat. Gensim)

ijh ÈŭǍūU� �ƃyŇ!!!GensimŇ sǺ -5Ǵ� ŲťÍ 0 ť«...

58 / 69

Page 59: 한국어와 NLTK, Gensim의 만남

4. Sentiment classification with doc2vec (feat. Gensim)doc2vecd ļ t îŸǷhm.doc2vecū Ř(Dzd ȃNJ« wūnj³ ÝEƒĻDz7 �Ö

from gensim.models.doc2vec import TaggedDocument³ ūŚǺy OK

from collections import namedtuple

TaggedDocument = namedtuple('TaggedDocument', 'words tags')

# ň7čd 15¾� training documents žô ĄŚǶtagged_train_docs = [TaggedDocument(d, [c]) for d, c in train_docs]tagged_test_docs = [TaggedDocument(d, [c]) for d, c in test_docs]

59 / 69

Page 60: 한국어와 NLTK, Gensim의 만남

4. Sentiment classification with doc2vec (feat. Gensim)from gensim.models import doc2vec

# Ąž (Ƴdoc_vectorizer = doc2vec.Doc2Vec(size=300, alpha=0.025, min_alpha=0.025, seed=1234)doc_vectorizer.build_vocab(tagged_train_docs)

# Train document vectors!for epoch in range(10): doc_vectorizer.train(tagged_train_docs) doc_vectorizer.alpha -= 0.002 # decrease the learning rate doc_vectorizer.min_alpha = doc_vectorizer.alpha # fix the learning rate, no decay

# To save# doc_vectorizer.save('doc2vec.model')

Try this:

Vector size, epoch ę � Ɖ Ǚ�Ùnj ÝEí7

60 / 69

Page 61: 한국어와 NLTK, Gensim의 만남

4. Sentiment classification with doc2vec (feat. Gensim)ƃs« dzğ ~dƖ ȆŭǺíŴ! (1/3)

pprint(doc_vectorizer.most_similar('!ǥ/Noun'))# => [('čĝǠĝ/Noun', 0.5669919848442078),# ('Ùĝnjµ/Noun', 0.5522832274436951),# ('ĝ¸ /Noun', 0.5021427869796753),# ('Ź±/Noun', 0.5000861287117004),# ('ǚLJƖ/Noun', 0.4368450343608856),# ('Ô�/Noun', 0.42848479747772217),# ('Ȅ /Noun', 0.42714330554008484),# ('ȇLJƖ/Noun', 0.41590073704719543),# ('Ê«/Noun', 0.41056352853775024),# ('!ǥōȅ/Noun', 0.4052993059158325)]

ŏ... &ƥm.

61 / 69

Page 62: 한국어와 NLTK, Gensim의 만남

4. Sentiment classification with doc2vec (feat. Gensim)ƃs« dzğ ~dƖ ȆŭǺíŴ! (2/3)

pprint(doc_vectorizer.most_similar('��/KoreanParticle'))# => [('��/KoreanParticle', 0.5768033862113953),# ('�/KoreanParticle', 0.4822016954421997),# ('!!!!/Punctuation', 0.4395076632499695),# ('!!!/Punctuation', 0.4077949523925781),# ('!!/Punctuation', 0.4026390314102173),# ('~/Punctuation', 0.40038347244262695),# ('~~/Punctuation', 0.39946430921554565),# ('!/Punctuation', 0.3899948000907898),# ('^^/Punctuation', 0.3852730989456177),# ('~^^/Punctuation', 0.3797937035560608)]

,! �"� źÜXŘ.

62 / 69

Page 63: 한국어와 NLTK, Gensim의 만남

4. Sentiment classification with doc2vec (feat. Gensim)ń, 0 �íh 0 ţÎǴ v(KING) – v(MAN) + v(WOMAN) = v(QUEEN) Ū $�y Đºǵ;?

63 / 69

Page 64: 한국어와 NLTK, Gensim의 만남

4. Sentiment classification with doc2vec (feat. Gensim)ƃs« dzğ ~dƖ ȆŭǺíŴ! (3/3)

pprint(doc_vectorizer.most_similar(positive=['ňŴ/Noun', 'ŕ/Noun'], negative=['LŴ/Noun']))# => [('Ĵr/Noun', 0.32974398136138916),# ('#ƖÚ/Noun', 0.305545836687088),# ('ħō/Noun', 0.2899821400642395),# ('ŏÿ/Noun', 0.2856029272079468),# ('žŵ/Noun', 0.2840743064880371),# ('aĭ/Noun', 0.28247544169425964),# ('%ǜ/Noun', 0.2795347571372986),# ('Ɩg/Noun', 0.2794691324234009),# ('Þíō/Noun', 0.27567577362060547),# ('�Ŏš/Noun', 0.2734225392341614)]

Ũ...? Ŗƌ?

64 / 69

Page 65: 한국어와 NLTK, Gensim의 만남

4. Sentiment classification with doc2vec (feat. Gensim)

ij, 0�*Ř;;Ąĥ, ň7č � źÜd Öƃ� àČǷhm. ľ�� mŪľ³ (öǵ ę Ųd�? *

text.concordance('ŕ/Noun', lines=10)# => Displaying 10 of 145 matches:# Josa «Åĝ/Noun S/Josa ,,/Punctuation ŕ/Noun Ɲƕ/Noun ...../Punctuation ijƍ/Noun ž# /Noun Ƕ/Noun ../Punctuation �À/Noun ŕ/Noun ĥÁ/Noun Ű/Noun žŵ/Noun Ň/Josa û/Nou# nction Ł+/Noun ¾/Josa Ŏăm/Adjective ŕ/Noun �m/Verb Âm/Verb ��¼/Noun �y/Josa yu# /Noun ĝ¸ /Noun Ű/Noun ?/Punctuation ŕ/Noun ĥÁ/Noun ./Punctuation Ŋ7/Noun sï/No# b 5/Noun Ąŭá/Noun ��/KoreanParticle ŕ/Noun Ż/Noun Ńm/Adjective ./Punctuation Ƃ# osa čţ7/Noun ím/Josa ȑ1/Noun Ɩ/Josa ŕ/Noun ū/Josa t/Noun Ʊ�/Noun �/Josa Č/Nou# Ɓ/Noun Ǵ/Josa �ŵ/Noun ./Punctuation ŕ/Noun ,/Punctuation UÔ/Noun �}/Noun Ž/Suf# Josa Ő/Noun �/Noun ƪ£/Noun ƃJ�/Noun ŕ/Noun "/Josa *Ą/Noun �/Suffix ŧ/Josa Ėūm/# m/Verb ./Punctuation 7sDzm/Adjective ŕ/Noun Ɩ®/Noun .../Punctuation ƃhǝ/Noun ǖµ# tive Ş/Noun Ɲƕ/Noun .../Punctuation ŕ/Noun Ɲƕ/Noun ../Punctuation Ą�/Noun ¼m/J

* Huang, Socher, Improving Word Representations via Global Context and Multiple Word Prototypes, ACL, 2012. 65 / 69

Page 66: 한국어와 NLTK, Gensim의 만남

4. Sentiment classification with doc2vec (feat. Gensim)ūƃ ¼Ɩ½ ö¯ n�. ö¯³ ŢǴ ǮƮ�ŧ ¼©DzŴ.

train_x = [doc_vectorizer.infer_vector(doc.words) for doc in tagged_train_docs]train_y = [doc.tags[0] for doc in tagged_train_docs]len(train_x) # Ąĥ ū �ÖŇ ĹŪ term existanceœd !ǣǴ û'd ijj ę Ųm# => 150000len(train_x[0])# => 300test_x = [doc_vectorizer.infer_vector(doc.words) for doc in tagged_test_docs]test_y = [doc.tags[0] for doc in tagged_test_docs]len(test_x)# => 50000len(test_x[0])# => 300

66 / 69

Page 67: 한국어와 NLTK, Gensim의 만남

4. Sentiment classification with doc2vec (feat. Gensim)ūåŇd scikit-learnŇč LogisticRegression() DŽ�ĝ³ ý§ŏŴ.

from sklearn.linear_model import LogisticRegressionclassifier = LogisticRegression(random_state=1234)classifier.fit(train_x, train_y)classifier.score(test_x, test_y)# => 0.78246000000000004

Ȑ! Accuracy�� 0.78ū*Ř.Term existance ž�Ň ûǺ accuracy� 2% Ƃy �ĕǻƖ¾, JăƖ ĶXŘ.7ĿǺĻǵ �Ŧ, term existancey, doc2vecy ĢyǺð¾Ǵ mĽǴ êȃ�ū ¿md �!

67 / 69

Page 68: 한국어와 NLTK, Gensim의 만남

¼ÔµǎĝǒŪ mĽǴ ǨȁçŇ sǺ Ćǡíĸm.KoNLPy« wūnj³ žƪµDz�, NLTK« wūnj³ ljĊDz�, Gensimť« Öčénj³ (Ǻóm.Term existanceœ document vector« ĒǔÉǒ³ ö¯Ǻóm.ň7čd ōȅ µøŇ sǴ ĒǔÉǒ öĎŧ ǻƖ¾ Ł¼�Ɩ m² taskŇy applyǵ ę ŲŨ!

)ȉ Ūĵ Ǒ"/Ǥ7 ŎƷ *ƍ� Ĉġ/Dz� ŎƷ

Jd ľ� NJĝǂŇ ŽŚǺð ę ŲŧƖ ČǺíŴ!

* ÞŦƂ, �DZĐ, ƅĐƏ, ľ� Ūĵū ç°ū �d�: wūnj ÷/ȃŧ �§Ǵ Ūĵ �" ŎƷ, SNUDM Technical Reports, Apr 14,2015. 68 / 69

Page 69: 한국어와 NLTK, Gensim의 만남

�ĄǷhm :D

http://lucypark.kr@echojuliett

69 / 69