Korean Text Analysis in R — 김준혁, Center for Dental Education Research, Yonsei University College of Dentistry (연세치대 치의학교육연구센터)


Page 1

Korean Text Analysis in R

김준혁 — Center for Dental Education Research, Yonsei University College of Dentistry

Page 2

Self-introduction

Page 3

Table of Contents

Page 4

You do text analysis in R?

Page 5

Why R?

Page 6

R vs. Python

Willems, K. (May 12, 2015). Choosing R or Python for Data Analysis? An Infographic. DataCamp. <https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis>

Page 7

Then, still R?

Page 8

Tidy Principle

Wikipedia contributors. Tidy data. Wikipedia, The Free Encyclopedia. Oct 23, 2018. <https://en.wikipedia.org/wiki/Tidy_data>
Wickham H., Grolemund G. (2017). "2. Introduction". R for Data Science. O'Reilly.

Page 9

• Each variable forms a column.

• Each observation forms a row.

• Each type of observational unit forms a table.

• Any other arrangement of the data is messy.

Tidy Principle

Wikipedia contributors. Tidy Data. Wikipedia, The Free Encyclopedia. Oct 23, 2018. <https://en.wikipedia.org/wiki/Tidy_data>
Wickham, H. (2014). Tidy Data. Journal of Statistical Software. 59(10).
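As a concrete illustration (not from the slides; the table and column names are invented), tidyr can reshape an untidy wide table so that each variable forms a column:

library(tidyr)

# Untidy: the "year" variable is spread across the column headers
untidy <- data.frame(
  country = c("KR", "JP"),
  `2018` = c(100, 90),
  `2019` = c(110, 95),
  check.names = FALSE
)

# Tidy: country, year, and value each form a column; each row is one observation
pivot_longer(untidy, cols = c(`2018`, `2019`),
             names_to = "year", values_to = "value")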

Page 10

Packages for Korean NLP in R

Page 11

tidytext?

[1] Text Mining with R. <https://www.tidytextmining.com/>
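For orientation, the core tidytext idiom is unnest_tokens(), which returns one token per row so the usual dplyr verbs apply (a minimal sketch; the sample sentence is invented):

library(dplyr)
library(tidytext)

tibble::tibble(line = 1, text = "one token per row is the tidy text format") %>%
  unnest_tokens(word, text) %>%   # tokenize into a one-token-per-row 'word' column
  count(word, sort = TRUE)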

Page 12

Introducing RcppMeCab

Page 13

NLP Research

Page 14

Addressing Korean Texts

박명석, 권영진, 이상용. (2018). 댓글이 음원 판매량에 미치는 차별적 영향에 관한 텍스트마이닝 분석. 지식경영연구. 19(2). 91-108쪽.

Page 15

Zipf's Law and Korean

Based on Maj M. (Feb 7, 2019). “Investigating words distribution with R—Zipf’s law”. Appsilon Data Science. Korean word frequency is from 김한샘. (2005). 현대 국어 사용 빈도 조사 2. 국립국어원

“The curve slopes more gently than that of English text or of French text. In other words, the Mandelbrot’s ‘information temperature’, 1/B, should be larger than English or French.”

Choi, S. W. (2000). Some Statistical Properties and Zipf's Law in Korean Text Corpus. Journal of Quantitative Linguistics. 7(1). 19-30.
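One rough way to check such a curve is to regress log frequency on log rank; the negative slope estimates Zipf's exponent B. A minimal sketch (the stand-in tokens are invented; in practice the frequencies would come from a corpus such as 김한샘 (2005)):

# Zipf's law: frequency ~ 1 / rank^B, so log(freq) is linear in log(rank)
words <- c("무진", "나는", "무진", "안개", "나는", "무진")   # stand-in corpus tokens
freq  <- sort(table(words), decreasing = TRUE)
rank  <- seq_along(freq)
fit   <- lm(log10(as.numeric(freq)) ~ log10(rank))
-coef(fit)[2]   # estimate of B; Choi (2000) reports a gentler slope for Korean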

Page 16

Hangul Tokens

Page 17

Morpheme Analysis

자연어 처리. (2018년 9월 12일). 위키백과. <https://ko.wikipedia.org/wiki/%EC%9E%90%EC%97%B0%EC%96%B4_%EC%B2%98%EB%A6%AC>
한국어 품사 분류와 분포(distribution). (Apr 21, 2017). ratsgo's blog. <https://ratsgo.github.io/korean%20linguistics/2017/04/21/wordclass/>
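The practical upshot for Korean: an eojeol such as 아내는 bundles a noun stem with a particle, so whitespace tokens like 아내는, 아내가, and 아내를 all count separately. A morpheme analyzer splits them apart; a minimal sketch with RcppMeCab (assuming MeCab and mecab-ko-dic are installed; the sentence is invented):

library(RcppMeCab)

# pos() returns morpheme/POS-tag pairs, e.g. 아내/NNG + 는/JX for the eojeol 아내는
pos("아내는 상자를 열었다")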

Page 18

KoNLP? KoNLPy? Khaiii?

Page 19

MeCab

[1] MeCab. Wikipedia, the free encyclopedia. <https://en.wikipedia.org/wiki/MeCab>
[2] Taku Kudou. MeCab. <http://taku910.github.io/mecab/>
[3] Wikipedia contributors. (Feb 1, 2019). Viterbi algorithm. Wikipedia, The Free Encyclopedia.
[4] Wikipedia contributors. (Jan 30, 2019). Conditional random field. Wikipedia, The Free Encyclopedia.

Viterbi algorithm [3]: a dynamic programming algorithm for finding the most likely sequence of hidden states.
Conditional Random Fields (CRFs) [4]: an undirected graphical model whose nodes can be divided into exactly two disjoint sets X and Y, modeling the conditional distribution p(Y|X).
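To make the Viterbi idea concrete, here is a minimal, self-contained sketch in R on a toy HMM (the matrices are invented for illustration; MeCab's real lattice and CRF feature weights are far richer):

# Toy HMM: 2 hidden states, 3 observation symbols
states <- c("A", "B")
trans  <- matrix(c(0.7, 0.3,
                   0.4, 0.6), nrow = 2, byrow = TRUE)   # P(state_t | state_{t-1})
emit   <- matrix(c(0.5, 0.4, 0.1,
                   0.1, 0.3, 0.6), nrow = 2, byrow = TRUE) # P(obs | state)
init   <- c(0.6, 0.4)
obs    <- c(1, 3, 2)   # observed symbol indices

n <- length(obs); k <- length(states)
logv <- matrix(-Inf, k, n)   # best log-probability of any path ending in each state
back <- matrix(0L, k, n)     # backpointers to the best previous state
logv[, 1] <- log(init) + log(emit[, obs[1]])
for (t in 2:n) {
  for (s in 1:k) {
    cand <- logv[, t - 1] + log(trans[, s])
    back[s, t] <- which.max(cand)
    logv[s, t] <- max(cand) + log(emit[s, obs[t]])
  }
}

# Trace back the most likely hidden-state sequence
path <- integer(n)
path[n] <- which.max(logv[, n])
for (t in (n - 1):1) path[t] <- back[path[t + 1], t + 1]
states[path]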

Page 20

The 은전한닢 (Eunjeon) Project

[1] 은전한닢 프로젝트를 소개합니다. 2013년 2월 12일. <http://eunjeon.blogspot.com/2013/02/blog-post.html>
[2] mecab-ko-dic 소개. <https://bitbucket.org/eunjeon/mecab-ko-dic>

Page 21

Advantages of MeCab, as I personally saw them

[1] KoSpacing. <https://github.com/haven-jeon/KoSpacing>

Page 22

However…

Page 23

If so, then…

[1] RMeCab. <http://rmecab.jp/wiki/index.php?RMeCab>

Page 24

RmecabKo

Page 25

RcppMeCab
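A minimal usage sketch (the sentences are invented, and anything beyond the input vector should be checked against the package documentation):

library(RcppMeCab)

sentences <- c("무진에 명산물이 없지요", "아내는 상자를 열었다")
pos(sentences[1])        # analyze a single sentence into morpheme/POS pairs
posParallel(sentences)   # analyze a character vector of sentences in parallel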

Page 26

RcppMeCab release

Page 27

Future Plans

Page 28

Case Study: Classifying Korean Text

Page 29

Example: Text Classification with Tidy Data Principles [1]

[1] Silge J. Text Classification with Tidy Data Principles. <https://juliasilge.com/blog/tidy-text-classification/>

library(tidyverse)
library(tidytext)
library(RcppMeCab)
library(rsample)
library(glmnet)
library(doParallel) # parallelization on Windows
library(broom)
library(yardstick)

Page 30

# Compare 무진기행 (김승옥) and 아내의 상자 (은희경)

con <- file("김승옥_무진기행.txt")
mujin <- readLines(con)
close(con)
mujin <- iconv(mujin, "CP949", "UTF-8")  # convert from the Windows Korean codepage to UTF-8

con <- file("아내의상자 - 은희경.txt", encoding = "UTF-8")
box <- readLines(con)
close(con)

books <- tibble(text = mujin, title = enc2utf8("무진기행")) %>%
  bind_rows(tibble(text = box, title = enc2utf8("아내의상자"))) %>%
  mutate(document = row_number())

books
  text                                                      title     document
1 무진기행                                                  무진기행         1
2 김승옥                                                    무진기행         2
3                                                           무진기행         3
4 버스가 산모퉁이를 돌아갈 때 나는 …                       무진기행         4
5 "앞으로 십 킬로 남았군요."                                무진기행         5
6 "예, 한 삼십 분 후에 도착할 겁니다."                      무진기행         6
7 그들은 농사 관계의 시찰원들인 듯했다. 아니 그렇지 않은 …. 무진기행        7
8 "무진엔 명산물이…… 뭐 별로 없지요?"                      무진기행         8

Page 31

# Whitespace-based tokenization

tidy_books <- books %>%
  unnest_tokens(word, text) %>%
  group_by(word) %>%
  filter(n() > 10) %>%
  ungroup()

tidy_books %>%
  count(title, word, sort = TRUE) %>%
  anti_join(get_stopwords()) %>%  # English stop words only; little effect on Korean tokens
  group_by(title) %>%
  top_n(20) %>%
  ungroup() %>%
  ggplot(aes(reorder_within(word, n, title), n, fill = title)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ title, scales = "free") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(x = NULL, y = "Word count",
       title = "Most frequent words after removing stop words",
       subtitle = "Word frequencies are hard to understand; '여자는' vs. '아내는' is the main difference.")

Page 32: Korean Text Analysis in R - GitHub Pages · •1 2 2 2. . 1 2,4 1 4.,2 4 1.2 , 2 1 2,4 1 4.,2 .2 1 1 4., ,4 1!32 ."
Page 33: Korean Text Analysis in R - GitHub Pages · •1 2 2 2. . 1 2,4 1 4.,2 4 1.2 , 2 1 2,4 1 4.,2 .2 1 1 4., ,4 1!32 ."

# Training/test set split for glmnet training

books_split <- books %>% select(document) %>% initial_split()

train_data <- training(books_split)
test_data <- testing(books_split)

sparse_words <- tidy_books %>%
  count(document, word) %>%
  inner_join(train_data) %>%   # keep only training documents
  cast_sparse(document, word, n)

class(sparse_words) # [1] "dgCMatrix"
dim(sparse_words)   # [1] 342 147

word_rownames <- as.integer(rownames(sparse_words))

books_joined <- tibble(document = word_rownames) %>%
  left_join(books %>% select(document, title))

Page 34: Korean Text Analysis in R - GitHub Pages · •1 2 2 2. . 1 2,4 1 4.,2 4 1.2 , 2 1 2,4 1 4.,2 .2 1 1 4., ,4 1!32 ."

# glmnet training

registerDoParallel(4)  # four parallel workers for cross-validation

is_box <- books_joined$title == "아내의상자"
model <- cv.glmnet(sparse_words, is_box, family = "binomial", parallel = TRUE, keep = TRUE)

plot(model)
plot(model$glmnet.fit)

"Glmnet is a package that fits a generalized linear model via penalized maximum likelihood."
Lambda is selected automatically by cross-validated mean-squared error (Gaussian) or binomial deviance (binomial family).

Hastie T., Qian J. (Jun 26, 2014). Glmnet Vignette. <https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html>
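For example, the two standard lambda choices from cv.glmnet() can be inspected directly on the model object above; lambda.1se trades a little deviance for a sparser, more conservative model:

model$lambda.min               # lambda minimizing cross-validated binomial deviance
model$lambda.1se               # largest lambda within one standard error of that minimum
coef(model, s = "lambda.1se")  # sparse coefficient vector at the conservative choice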

Page 35: [plots: cross-validation curve and coefficient paths]
Page 36

# word probabilities

coefs <- model$glmnet.fit %>%
  tidy() %>%
  filter(lambda == model$lambda.1se)

coefs %>%
  group_by(estimate > 0) %>%
  top_n(10, abs(estimate)) %>%
  ungroup() %>%
  ggplot(aes(fct_reorder(term, estimate), estimate, fill = estimate > 0)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  labs(x = NULL,
       title = "Coefficients that increase/decrease probability the most",
       subtitle = "A document mentioning 말했다 is unlikely to be written by 은희경")

Page 37: [plot: largest positive and negative coefficients]
Page 38

# ROC curve

intercept <- coefs %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

classifications <- tidy_books %>%
  inner_join(test_data) %>%
  inner_join(coefs, by = c("word" = "term")) %>%
  group_by(document) %>%
  summarize(score = sum(estimate)) %>%
  mutate(probability = plogis(intercept + score))  # inverse logit of the linear score

comment_classes <- classifications %>%
  left_join(books %>% select(title, document), by = "document") %>%
  mutate(title = as.factor(title))

comment_classes %>%
  roc_curve(title, probability) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_line(color = "midnightblue", size = 1.5) +
  geom_abline(lty = 2, alpha = 0.5, color = "gray50", size = 1.2) +
  labs(title = "ROC curve for text classification using regularized regression",
       subtitle = "Predicting whether text was written by 김승옥 or 은희경")

Page 39: [plot: ROC curve]
Page 40

# AUC & confusion matrix

comment_classes %>% roc_auc(title, probability)

comment_classes %>%
  mutate(prediction = case_when(probability > 0.5 ~ "아내의상자",
                                TRUE ~ "무진기행"),
         prediction = as.factor(prediction)) %>%
  conf_mat(title, prediction)

yardstick::roc_auc(data, truth, estimate)
yardstick::conf_mat(data, truth, estimate)
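Both functions take a data frame, a factor column of true classes, and an estimate column (class probabilities for roc_auc(), predicted classes for conf_mat()); a toy illustration with invented data:

library(yardstick)

toy <- tibble::tibble(
  truth = factor(c("a", "a", "b", "b")),
  prob  = c(0.9, 0.6, 0.4, 0.2),          # estimated P(truth == "a")
  pred  = factor(c("a", "b", "b", "b"))
)
roc_auc(toy, truth, prob)   # area under the ROC curve
conf_mat(toy, truth, pred)  # cross-tabulation of truth vs. prediction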

Page 41

# Morpheme-based tokenization

tidy_books <- books %>%
  unnest_tokens(word, text, token = pos, to_lower = FALSE) %>%  # RcppMeCab's pos() as the tokenizer
  group_by(word) %>%
  filter(n() > 10) %>%
  ungroup()

tidy_books %>%
  count(title, word, sort = TRUE) %>%
  anti_join(get_stopwords()) %>%
  group_by(title) %>%
  top_n(20) %>%
  ungroup() %>%
  ggplot(aes(reorder_within(word, n, title), n, fill = title)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~ title, scales = "free") +
  scale_y_continuous(expand = c(0, 0)) +
  labs(x = NULL, y = "Word count",
       title = "Most frequent words after removing stop words",
       subtitle = "Word frequencies are hard to understand; only '아내/NNG' for 아내의상자 is a noticeable difference.")

Page 42: [plot: most frequent morpheme tokens by title]
Pages 43–50 repeat the training/test split, glmnet training, word-probability, ROC-curve, and AUC/confusion-matrix code of Pages 33–40 verbatim, now run on the morpheme-based tokens (with the corresponding plots on the intervening pages).

Page 51

# Morpheme-based tokenization, extracting nouns only

(Same tokenization and frequency-plot code as Page 41.)

Page 52: [plot: most frequent noun tokens by title]
Pages 53–60 repeat the Pages 33–40 training and evaluation pipeline verbatim once more, on the noun-only tokens.

Page 61

# Morpheme-based tokenization with NN/VV/VA (noun/verb/adjective) filtering
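The deck does not show the filter itself; a minimal sketch of how it might look on the joined token/TAG strings (the regular expression and pipeline placement are assumptions):

# Keep only morphemes tagged NN* (nouns), VV (verbs), or VA (adjectives),
# assuming tokens look like "아내/NNG" after morpheme-based tokenization
tidy_books <- tidy_books %>%
  filter(stringr::str_detect(word, "/(NN|VV|VA)"))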

Page 62

# Adding VV/VA (verbs, adjectives) does not improve predictive power

Page 63: [plots: classification results with NN/VV/VA tokens]
Page 64

Gist

Page 65

Q&A