Disclaimer

저작자표시-비영리-변경금지 2.0 대한민국 (Attribution-NonCommercial-NoDerivs 2.0 Korea)

You are free to copy, distribute, transmit, display, perform, and broadcast this work, provided you follow these conditions:

Attribution: you must credit the original author.
NonCommercial: you may not use this work for commercial purposes.
NoDerivs: you may not alter, transform, or build upon this work.

When reusing or distributing this work, you must make clear the license terms that apply to it. These conditions may be waived with permission from the copyright holder. Your rights under copyright law are not affected by the above. This is an easy-to-understand summary of the Legal Code.
A Thesis for the Degree of Master of Arts
Detecting Language from Depressed Users
with Korean Twitter Data
한국어 트위터 데이터를 활용한 우울증 표현 인식
August 2018
Graduate School, Seoul National University
Department of Linguistics, Linguistics Major
Julius Jacobson
Abstract
Detecting Language from Depressed Users
with Korean Twitter Data
Julius Jacobson
Department of Linguistics
The Graduate School
Seoul National University
Although South Korea leads the OECD in suicides, both the diagnosis and treatment of mental health conditions such as depression remain taboo there. With research utilizing English social media text to find signals of mental health conditions becoming ever more abundant, and with South Korea's Ministry of Education releasing its own social media text scanning app in order to identify minors at risk, exploration into effective methods of classifying Korean social media text on the basis of underlying mental health conditions is perhaps more relevant than ever.
Most studies to date leveraging social media to detect signals tied to mental health conditions have utilized pre-generated dictionaries such as LIWC or survey data. While there has been some research into automatic detection methods requiring little or no domain knowledge and no survey data, such studies are rare outside of English and, to our knowledge, no such study has yet been done in Korean. Given the unique relevance of depression and suicide as public health concerns in South Korea, this thesis hopes to be a start to filling this void.
This paper employs various machine learning classifiers to predict whether a tweet was posted by a depressed user. We first searched for users whose tweets stated that they had been diagnosed with depression, and Korean native speakers then determined whether each such statement was a genuine claim of a diagnosis. Up to 3,200 tweets were scraped for each verified user. A set of tweets from an equal number of random Twitter users who had posted over the same time period was then collected as a control group. Using two different tokenizers and an array of machine learning classifiers, the average precision and F1 scores over 10-fold cross-validation were recorded for all combinations of tokenizer and classifier. All combinations were found to be able to detect whether a tweet came from a depressed user with an accuracy well above chance. This study, therefore, suggests that detection of mental health issues using social media data may be a viable approach for further study and treatment of mental illness, on par with or better than previous methods relying upon pre-generated dictionaries such as LIWC or upon expensive and time-consuming survey data.
Keywords: Machine Learning, Mental Health, Social Media, Depression,
Student Number: 2015-22104
Acknowledgements
First and foremost I would like to thank my advisor, Professor Shin Hy-
opil, for his patience and guidance over the course of the long and challenging
process of completing my degree. I would also like to thank Professor Nam
Seung Ho and Dr. Kim Munhyoung for their advice over the course of the
writing of this thesis.
Secondly, I would like to thank my mother for her tireless encouragement
and support while I was writing this paper on the other side of the world.
And lastly, Derek Hommel and Timour Igamberdiev were constant com-
panions and comrades on this research journey, without whom whiling away
the hours necessary to complete this project would have been, if not impossi-
ble, at least far less enjoyable. Thank you, my friends.
Contents
1 Introduction
1.1 Text Analysis in Psychology
1.2 Mental Health
1.3 Social Media and Mental Health
1.4 Research Goals
1.5 Research Outline

2 Literature Review
2.1 Types of Depression
2.2 Diagnostic Methods
2.3 Establishing the Language of Depression
2.4 Studies Utilizing Non-Korean Social Media Data
2.5 Studies Utilizing Korean Social Media Data
2.6 Qntfy Studies

3 Corpus Data
3.1 The Twitter API and User Selection
3.2 Tweet Extraction
3.3 Tokenization
3.4 Caveats

4 Classification Methods
4.1 Definitions
4.2 Training, Testing, and Cross Validation
4.3 Naive Bayes
4.4 Logistic Regression
4.5 Linear Support Vector Machines
4.6 Random Forest
4.7 Feedforward Neural Network

5 Experiment
5.1 Methodology
5.2 Explanation of Metrics
5.2.1 Accuracy, Precision and Recall
5.2.2 The F1 Score
5.2.3 ROC Curves
5.3 Classifier Results with Space Tokenization
5.4 Classifier Results with Mecab Tokenization
5.5 Linear SVM Top Features
5.6 Precision-Recall Graphs
5.6.1 Precision-Recall Curves and Hard to Classify Tweets
5.7 Discussion

6 Conclusion
List of Figures
3.1 Tweet Object
3.2 Twitterscraper Tweet Object
5.1 ROC for Space Tokenization
5.2 ROC for Mecab Tokenization
5.3 Linear SVM Top Features with Space Tokenization
5.4 Linear SVM Top Features with Mecab Tokenization
5.5 Precision-Recall for Space Tokenization
5.6 Precision-Recall for Mecab Tokenization
List of Tables
3.1 Diagnostic Tweets
3.2 Dataset Composition
3.3 Tokenized Tweets
5.1 F1 Scores
1 Introduction
1.1 Text Analysis in Psychology
In 2010, Yla R. Tausczik and James W. Pennebaker published The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. The abstract of the paper claimed that "We are in the midst of a technological revolution whereby, for the first time, researchers can link daily word use to a broad array of real-world behaviors." Nearly a decade later, the revolution that Tausczik and Pennebaker identified is still in full swing and more relevant than ever. In fact, one could argue that, with the rise of machine learning and the widespread adoption of social media platforms, the full potential of this revolution is only now beginning to be understood.
But the history of text analysis in the field of psychology significantly predates not only social media and machine learning, but the internet and computers entirely. This fact is evident within the lexicon of the English language: the common term "Freudian slip," denoting a linguistic error that unintentionally reveals a hidden motive on the part of the speaker or writer, has its origin in Freud's 1901 book The Psychopathology of Everyday Life. Decades after its publication, in the 1950s, researchers developed the Gottschalk method, which consisted of tracking Freudian themes in texts through content analysis (Gottschalk et al., 1970).
It was not until the 1960s, however, that the first general-purpose computerized text analysis program in psychology, The General Inquirer, was produced (Stone and Hunt, 1963). It operated according to a series of algorithms developed by its authors. While it proved useful in detecting mental disorders and personality dimensions, it relied on weighted variables that were not observable to the user (Stone and Hunt, 1963).
In the 1980s, Walter Weintraub discovered that the usage of first-person singular pronouns could be linked to depression, a simple but profound insight that foreshadowed the kind of impact that future psychological software could have on the detection of mental health conditions (Weintraub, 1989). In fact, this finding is utilized by psychological text analysis software to this day, most notably by the Linguistic Inquiry and Word Count (LIWC) program, developed by Martha Francis and James W. Pennebaker in the mid-1990s (Tausczik and Pennebaker, 2010). The goal of LIWC is simple: to count words in psychologically relevant categories over multiple text files.
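To make the word-counting idea concrete, below is a minimal sketch of a LIWC-style category count. The category word lists are invented for illustration and are tiny compared to LIWC's real, proprietary dictionaries; this shows the general technique, not LIWC itself.

```python
# A minimal sketch of dictionary-based category counting in the style of LIWC.
# The word lists below are invented for illustration only.
from collections import Counter
import re

CATEGORIES = {
    "first_person_singular": {"i", "me", "my", "mine", "myself"},
    "negative_emotion": {"sad", "hate", "hurt", "alone", "tired"},
}

def category_percentages(text):
    """Return the share of tokens in `text` falling into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in CATEGORIES.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # LIWC reports category counts as percentages
    return {cat: 100.0 * counts[cat] / total for cat in CATEGORIES}

print(category_percentages("I hate how tired and alone I feel."))
```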
In recent years, there have been attempts to utilize machine learning methods in place of explicitly programmed software. As machine learning uses statistical methods to allow a model to "learn" from data without being told beforehand which linguistic features are relevant, it does not necessarily require the input of domain experts, as software such as LIWC does. With the rise of social media, these computational techniques can be experimented with on large datasets. Shortly, we will discuss some of these experiments before exploring the results of an approach that has previously been untested in the Korean language.
1.2 Mental Health
South Korea leads the OECD in suicides while also containing a very large population of social media users (Kemp, 2016; OECD, 2016). South Korea's Ministry of Education has already released an app that alerts the parents of students whose instant messages or web searches indicate suicidal ideation (학생 스마트폰 'SNS 자살징후' 부모에게 알린다, "Students' smartphones: parents to be alerted to 'SNS suicide signs'"), and the link between depression and suicide has long been studied and is well understood (Werth, 2004). An application that could detect depressed social media users may therefore prove effective in creating greater opportunities for early detection and treatment of mental health conditions and the prevention of tragic outcomes such as suicide.
Furthermore, according to a report put out by the Korean National Evidence-Based Health Care Collaborating Agency, 5.6 percent of Koreans, approximately 2 million individuals, have suffered from depression at least once (Kim et al., 2013). Yet, despite the fact that over 90 percent of suicide victims in Korea suffer from a diagnosable psychiatric illness such as depression, very few visit psychiatric clinics, leaving a potentially life-saving diagnosis out of reach (Na et al., 2015). An automatic classification tool that could, in the privacy of one's own or of a loved one's electronic device, provide some indication of potential mental health issues would likely be a helpful first step in preventing negative outcomes in a culture that struggles to deal with mental health issues amidst the taboos surrounding diagnosis. If used in tandem with alert systems and accompanied by information about means of treatment, those suffering from depression or other similarly debilitating mental health conditions may stand to garner a significant benefit.
1.3 Social Media and Mental Health
As discussed earlier, text analysis has long played a role in the detection
of psychological phenomena such as personality dimensions and mental health
conditions. It is reasonable to assume, then, that social media would play a
prominent role in modern applications of computational text analysis in the
field of psychology. After all, social media provides a gigantic corpus of data
from which to draw insights. In addition, such data represents different re-
gions, economic climates, and even languages. Indeed, social media has been
utilized to gain insights in fields ranging from political science to public health
(Boydstun et al., 2013; Aramaki, Maskawa, and Morita, 2011). There are,
however, unique challenges in evaluating psychological phenomena. Perhaps
most significantly, diagnosis relies upon a patient’s self-reported experiences.
This means obtaining objective metrics for factors correlating with or causing
depression may be confounded by subjective intentions, such as a desire to
obtain anti-depressant medication, or alternatively to mask one’s condition
from one’s self or from others.
Furthermore, according to the World Health Organization, most depressed
individuals do not receive a diagnosis for their condition as a result of not seek-
ing treatment (Marcus et al., 2012). This suggests that, despite the challenges
posed by mental health conditions, a convenient and practical form of detec-
tion may prove to be of great benefit to those suffering from various forms of
mental illness. This is particularly true in the case of depression, where the
rate of successful treatment is high. According to one study, nearly 8 out of 10
patients showed a significant improvement in symptoms of major depressive disorder within 4 to 6 weeks (Mann et al., 2005). In addition, according to a study funded by the National Institute of Mental Health, patients reported an average 65 percent reduction in the symptoms of major depressive disorder despite not responding to an initial antidepressant prescription. This indicates that there are various effective methods for the treatment of depression that may prove helpful even when medication proves ineffective in a given case (Fava et al., 2003). This, combined with the low treatment rate, suggests
that an efficient and practical way to detect depression in particular stands to
benefit both many suffering individuals and society as a whole.
1.4 Research Goals
The goals of our investigation are:
(1) To demonstrate the effectiveness, for the Korean language, of the data collection method utilized in numerous studies by the U.S.-based startup Qntfy.
(2) To demonstrate the ability of this method to capture signal-containing
data even when datasets are relatively small.
(3) To gauge the effectiveness of an array of popular machine learning algo-
rithms when used in conjunction with this method.
(4) To briefly explore features distinguishing depressed users from non-depressed
users.
In accomplishing the above, our research aims to demonstrate the ef-
fectiveness of machine learning classifiers to distinguish depressed users from
non-depressed users. In addition, it hopes to indicate the potential for a large-
scale analysis of mental conditions such as depression using Korean social
media text. Furthermore, it suggests a new avenue of exploration for a more
effective and timely method of diagnosis that preserves anonymity, which may
prove invaluable if brought to fruition. The costs of depression are high, on the
level of both society and the individual, and methods that can reach a large
number of users with a minimal investment of resources are worth noting and
exploring.
By demonstrating that automatic data collection and machine learning
classification methods can achieve results surpassing those achieved through
curated lexical data, we indicate that this combination may be more effective
than methods utilizing domain knowledge or curated data. Such a finding is likely to have a significant impact on how resources are allocated in future efforts to tackle mental health epidemics such as suicide in Korea.
Lastly, as was discussed in Section 1.2, many residents of Korea are ret-
icent when it comes to seeking help for mental health challenges. By demon-
strating the effectiveness of methods that are anonymous, impersonal, and scalable, it is our hope that a foundation is laid for future investigations into mental health diagnosis that address what appears to be the fundamental obstacle to obtaining a variety of effective treatments: getting an individual diagnosed in the first place.
1.5 Research Outline
The thesis is structured as follows: Chapter 2 provides a literature review
of sentiment analysis and automated text analysis methods that leverage social
media data, as well as of previous studies that utilized computational methods
in conjunction with data that was acquired through crowdsourcing. Chapter 3
discusses the method used to acquire the data used in the experiment, as well
as an overview of the data itself. Chapter 4 provides an overview of the var-
ious machine learning classifiers used in our classification task. The first half
of Chapter 5 reports the findings of a preliminary study that utilized a naive
tokenization method and little optimization of the neural network model. The
second half of Chapter 5 discusses the results obtained after utilizing a more
refined tokenization method and the most significant features distinguishing
depressed users from non-depressed users. Chapter 6 concludes the thesis by discussing the caveats and limitations of our research and the avenues left open to be pursued in future work.
2 Literature Review
This chapter discusses the relevant literature covering past research that
utilized social media data, machine learning, or other methods of computa-
tional data analysis to evaluate the mental health of an individual or group
of individuals. In addition, certain relevant studies from the field of Korean
sentiment analysis are included. Studies done in both Korean and English are
discussed.
2.1 Types of Depression
While it is common parlance to use the term "depression" or to describe someone as "depressed" without an additional modifier, in reality there are multiple types of depression that are diagnosed in a clinical environment. In this section, we will briefly discuss the varieties of depression so that the diagnostic terms used in the studies discussed in later sections are clearly understood.
Major Depressive Disorder (MDD) is characterized by a depressed mood,
or lack of interest in activities that were once found pleasurable, for most
days of the week over a period of two weeks or more. Other symptoms in-
clude fluctuations in weight, disruptions in sleep patterns, consistent feelings
of sluggishness or agitation, feelings of guilt, fatigue, trouble concentrating,
and suicidal thoughts (Belmaker and Agam, 2008).
Persistent Depressive Disorder is diagnosed when a patient suffers from
depression for a period of 2 years or more. As suggested by its name, symp-
toms are less intense than those characterizing MDD but chronic over time
(Klein and Black, 2013).
Bipolar disorder, also called manic depression, is diagnosed when a patient suffers from consistent cycles of mood episodes that consist of extreme "highs" and "lows". When in the "low" portion of a mood cycle, a patient may experience symptoms characteristic of MDD (Goodwin and Jamison, 2007).
Seasonal Affective Disorder (SAD) is a disorder that arises in certain individuals during periods of the year when the days are shorter and the hours of available sunlight are decreased, i.e., the fall and winter seasons (Saeed and Bruce, 1998).
Individuals suffering from Psychotic Depression generally experience the same symptoms as those diagnosed with MDD; however, as the name of the condition implies, they suffer from symptoms of psychosis as well (Nelson and Davis, 1997). These can include hallucinations, delusions, or paranoia.
While the above list is by no means comprehensive, it is sufficient for our
discussion below.
2.2 Diagnostic Methods
While doctors may administer a variety of physical examinations to rule
out other potential diagnoses, depression has no reliable medical (i.e. non-
psychological) means of detection. In 2012, however, Pajer et al. found biological markers of early-onset Major Depressive Disorder (Pajer et al., 2012). A panel of 11 blood markers was found to be sufficient to differentiate participants with early-onset MDD from those with no diagnosis in a small study consisting of 28 participants between the ages of 15 and 19. Considerations of such studies and of the utility of using biological markers to rule out other diagnoses aside, however, questionnaires remain the primary means with which clinicians diagnose depression. In this section we will briefly discuss some of these diagnostic tools.
The five questionnaires most commonly cited in the studies used for this
thesis are:
(1) The Patient Health Questionnaire (PHQ-9)
(2) Beck Depression Inventory (BDI)
(3) Zung Self-Rating Depression Scale (SDS)
(4) Center for Epidemiologic Studies Depression Scale (CES-D)
(5) Hamilton Rating Scale for Depression (HRSD)
The Patient Health Questionnaire is based on the diagnostic criteria of the
Diagnostic and Statistical Manual of Mental Disorders (DSM), a manual pub-
lished by the American Psychiatric Association, and consists of 9 questions.
Developed in 1999 at Columbia University, it ranks patients’ levels of depres-
sion according to five categories: none or minimal, mild, moderate, moderately
severe, and severe. According to the 5th edition of the DSM, if 5 or more of the 9 symptoms indicated by the questions of the PHQ-9 have persisted for two weeks or more, depression is a likely diagnosis, provided the symptoms are not better explained by substance abuse or another medical condition (Association, 2013). The questions consist of inquiries into a patient's interest in activities, energy levels, mood, sleeping and eating habits, ability to concentrate, ability to function, and whether or not a patient has entertained suicidal thoughts (Kroenke and Spitzer, 2002).
The Beck Depression Inventory (BDI), developed by American psychiatrist Aaron T. Beck at the University of Pennsylvania, consists of 21 questions; it was first published in 1961 and then revised in 1978, with the BDI-II following in 1996. Each response is assigned a point value between 0 and 3, and the response scores are summed to obtain a total score that indicates the severity of the patient's depression. For the 1996 BDI-II, all but three of the items were reworded to reflect the updated diagnostic criteria of the fourth edition of the DSM. A score of 0-13 indicates minimal or no depression, 14-19 indicates mild depression, 20-28 indicates moderate depression, and a score of 29 or above indicates severe depression (Beck, Steer, and Brown, 1996).
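The BDI-II severity bands above translate directly into a scoring function. The sketch below assumes the instrument's 0-63 total range (21 items, each scored 0 to 3) and simply encodes the cutoffs just described.

```python
# Encodes the BDI-II severity bands described above (Beck, Steer, and Brown, 1996).
def bdi_severity(total_score: int) -> str:
    if not 0 <= total_score <= 63:  # 21 items, each scored 0-3
        raise ValueError("BDI-II totals range from 0 to 63")
    if total_score <= 13:
        return "minimal or no depression"
    if total_score <= 19:
        return "mild depression"
    if total_score <= 28:
        return "moderate depression"
    return "severe depression"

print(bdi_severity(22))  # -> "moderate depression"
```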
The Zung Self-Rating Depression Scale (SDS) was published in 1965 by
Dr. William W.K. Zung, a psychiatrist at Duke University. It consists of 20
items that ask the respondent to rate the symptoms of depression on a scale
of frequency from 1 to 4, with 1 meaning "none or a little of the time" and 4 meaning "most or all of the time". Scores on the test range from 20 to 80, with higher scores
indicating more severe levels of depression. (Zung, 1965)
The Hamilton Rating Scale for Depression (HRSD) was published in 1960 by Max Hamilton while he was a senior lecturer of psychiatry at the University of Leeds. The patient is rated on a 3- to 5-point scale on anywhere from 17 to 29 items. There are multiple versions of the test, as it was revised in 1966, 1967, 1969, and 1980. A score of 0-7 is considered to be normal, while a score over 20 indicates a case of at least moderate depression (Hamilton, 1986).
Lastly, the Center for Epidemiologic Studies Depression Scale was developed in 1977 and consists of 20 questions. Each question asks the respondent to indicate how frequently they have experienced a given symptom over the past week. Scores range from 0 to 60, with scores closer to 0 indicating no or minimal depression, and scores nearer to 60 indicating more severe cases of depression (Eaton et al., 2004).
2.3 Establishing the Language of Depression
In this section we will briefly discuss relevant research that has indicated
or suggested a connection between depression and language.
In 1969, The Measurement of Psychological States through the Content Analysis of Verbal Behavior was published. In it, the authors showed how the lexical content features of recordable speech behavior could provide a probabilistic account of a variety of psychological states (Gottschalk and Gleser, 1969). In their research, they asked participants to speak for five minutes about any interesting personal life experiences in a stream-of-consciousness fashion. While the Gottschalk method, as it came to be known, was later used for the diagnosis of various cognitive impairments and mental disorders, it has proven difficult to adapt to a computer program (Gottschalk and Bechtel, 1993).
In 2001, Stirman and Pennebaker examined the word usage of both suicidal and non-suicidal poets. 300 poems were selected from the bodies of work of 18 poets, 9 classified as suicidal and 9 as non-suicidal; the suicidal poets were labeled as such because they did, in fact, commit suicide. Using the LIWC program, the authors found that the groups did not differ in their usage of words correlated with positive and negative emotion, but did differ when it came to pronoun usage. The suicidal group used more first-person singular words, and fewer words suggesting identification with a group or collective (Stirman and Pennebaker, 2001).
In 2004, Rude, Gortner and Pennebaker sought to replicate the findings of this and other studies suggesting that self-focus, along with its linguistic indicators such as personal pronouns, was a significant aspect of a depressed psychological state. They asked a sample of undergraduates to write for 20 minutes about "their deepest thoughts and feelings about coming to college". The sample was comprised of 31 depressed participants, 26 formerly depressed participants, and 67 never-depressed participants. Participants in the study were classified as depressed or not with the BDI diagnostic questionnaire. Essays were evaluated with the LIWC software, which compared files on a word-by-word basis to a dictionary consisting of 2,290 words and word stems organized into various linguistic and psychological categories. The study found that depressed respondents used more first-person singular words (such as I, me, my) than did never-depressed respondents. In addition, depressed participants were found to use a greater proportion of negative emotion words, and marginally fewer positive emotion words (Rude, Gortner, and Pennebaker, 2004).
Resnik et al. used Latent Dirichlet Allocation and features derived from the LIWC software to develop a linear regression model based on a collection of 6,459 stream-of-consciousness essays collected from college students between 1997 and 2008. Each essay consisted of approximately 780 words and was a response to a prompt to write about one's thoughts and feelings in the present moment. Each essay writer also provided data regarding their personality traits and state of mind. The experiment found that topic modeling using Latent Dirichlet Allocation added value to the predictions of clinical assessments of depression and neuroticism (Resnik, Garron, and Resnik, 2013).
A recent study further corroborated the findings of previous studies suggesting a relationship between singular personal pronoun usage and depression after conducting a text analysis of 63 internet forums comprising over 6,800 active members. In addition, it found that absolutist words such as "always", "entirely", or "totally" tracked the severity of an affective disorder across internet forums more reliably than negative emotion words did. In other words, while anxiety, depression, and suicidal ideation forums all exhibited greater usage of absolutist words, suicidal ideation forums exhibited greater usage of absolutist words than anxiety or depression forums, thus correlating the usage of absolutist terms with the severity of the condition under discussion (Al-Mosaiwi and Johnstone, 2018).
This finding that depressed individuals use more first-person singular and negative emotion words has been posited to be either an indication of greater self-focus as a response to pain, or alternatively a thinking pattern that is itself a causal factor in the emergence of depression (Tausczik and Pennebaker, 2010).
In summary, both words categorized by the LIWC software program as
negative emotion words and singular personal pronouns have been linked to
depression in previous text analysis studies, with the connection to singular
personal pronoun usage having been found to be particularly robust over mul-
tiple studies.
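As a concrete illustration of these two lexical signals, the sketch below computes per-token rates of first-person singular pronouns and absolutist words over a pre-tokenized text. The word lists are abbreviated stand-ins, not the sets used in the cited studies.

```python
# Per-token rates of two depression-linked word classes; word lists are
# illustrative stand-ins for those used in the cited studies.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
ABSOLUTIST = {"always", "never", "entirely", "totally", "completely"}

def lexical_signals(tokens):
    """Return the per-token rate of each word class in a token list."""
    total = len(tokens) or 1
    return {
        "first_person_rate": sum(t in FIRST_PERSON_SINGULAR for t in tokens) / total,
        "absolutist_rate": sum(t in ABSOLUTIST for t in tokens) / total,
    }

print(lexical_signals("i always feel like i am entirely alone".split()))
```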
2.4 Studies Utilizing Non-Korean Social Media Data
In this section we will discuss relevant studies that utilized the social
media data of users outside of Korea and whose main contribution was not
based upon pre-defined dictionaries or domain knowledge (such as studies whose findings were based on LIWC). Taken as a whole, these studies
demonstrate and exploit the potential of social media data as a source of pre-
dictive insight into depression. Unlike the studies that we will discuss in a
later section, however, they all rely on surveys and crowdsourcing to establish
a ground truth as a starting point for their analyses.
A study by Moreno et al. selected public Facebook profiles from second- and third-year undergraduates and evaluated their status updates. Disclosures of depression were modeled in association with demographic Facebook usage data using a binomial regression analysis. Two hundred profiles were evaluated, with 25 percent of profiles containing status updates that referenced depressive symptoms. The study concluded that those who received positive support from their friends were more likely to discuss their depressive symptoms publicly on Facebook (Moreno et al., 2011). While not utilizing linguistic data, this study demonstrated similar goals and intuitions regarding the potential of social media to combat stigma surrounding mental health concerns and more effectively identify those suffering from mental health challenges.
In perhaps one of the most important studies under consideration in this section, De Choudhury et al. found that social media contained valuable signals for detecting individuals likely to be suffering from depression (De Choudhury et al., 2013). Crowdsourcing was used to first create a list of Twitter users who reported being clinically diagnosed as depressed, verified through the CES-D diagnostic measure. The tweets posted by each of the 476 users for approximately a year before the onset of depression were collected. User metadata was also leveraged. Through a combination of egonetwork, lexical, and pattern-of-behavior data, features were constructed to train a support vector machine classifier, which achieved an accuracy of approximately 70 percent. In addition, the study found that individuals with depression showed a decrease in social activity, an increase in negative emotion, a greater focus on self, an increase in medical concerns, an increase in social concerns, and an increase in reports of religious activity. While many relevant studies were inspired by this research, one important difference between De Choudhury et al.'s research and later lines of work, including the experiment put forth in this paper, is the former study's reliance on crowdsourcing and surveys. The methods employed in this paper require little to no domain knowledge, no surveys or crowdsourcing, and focus only on linguistic data, which De Choudhury's study showed to be the most effective in distinguishing depressed users from non-depressed users.
Moving on to studies done using data in a non-English language, there
have been at least two published studies using Japanese language Twitter
data to detect depression. In 2013, Tsugawa et al. constructed a multiple regression model in order to determine the probability of a user suffering from depression based on the frequencies of the words they used. A survey was conducted of 50 Japanese participants using Zung's SDS. Following the survey, the tweets posted by respondents over the week prior to the administering of the survey were obtained through the Twitter API. As an aside, it is relevant to note that this one-week limitation is the result of utilizing the Twitter API to extract tweets, a limitation avoided in the present thesis through the use of alternative extraction methods. 14,757 words were obtained after excluding particles, auxiliary verbs, adnominal adjectives, and symbols through the use of the morphological analyzer MeCab. Furthermore, the frequencies of words used by a participant were normalized by the number of occurrences of all words in the participant's total tweet corpus. After running a multiple regression analysis, it was found that words with negative mood were positively correlated with higher scores on Zung's SDS. The correlation coefficient between the estimated and actual Zung SDS scores was found to be 0.45. The study concluded that word frequencies in tweet posts were useful in predicting to what degree a user was suffering from depression (Tsugawa et al., 2013).
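The normalization step described above amounts to dividing each word's count by the user's total token count. A minimal sketch, with MeCab's morphological analysis abstracted away as a pre-tokenized list:

```python
# Normalize a user's word counts by their total token count, as in the
# Tsugawa et al. study; tokenization (MeCab in the original) is assumed done.
from collections import Counter

def normalized_frequencies(user_tokens):
    counts = Counter(user_tokens)
    total = sum(counts.values()) or 1
    return {word: n / total for word, n in counts.items()}

tokens = ["피곤", "하다", "피곤", "우울", "하다", "하다"]
print(normalized_frequencies(tokens))  # {'피곤': 0.33..., '하다': 0.5, '우울': 0.16...}
```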
In another study, Tsugawa et al. built upon the research described above by using other features in addition to word frequencies
for their analysis, as well as by including a larger number of participants: 209
Japanese respondents compared to the previous study’s 50. In addition, this
study used the CES-D and BDI diagnostic tools as opposed to Zung’s SDS.
Features used apart from word frequencies were as follows: topics generated
by LDA, ratios of positive and negative affect words, hourly posting frequency,
daily posting frequency, average number of words per tweet, overall retweet
rate, overall mention rate, ratio of tweets containing a URL, number of users
following the user, and the number of users followed. Differences in these fea-
tures correlated with the presence or absence of depression were explored.
The study found that the most common posting times were consistent between depressed and non-depressed users, and that no significant difference could be found between depressed and non-depressed users' posting times, which contradicted the findings of De Choudhury et al. (De Choudhury et al., 2013). Significant differences were found, however, when analyzing the ratios of tweets containing positive and negative words, tweet URLs, post frequencies, and retweet rates. Again contrasting with De Choudhury et al., the rate at which a user was mentioned by other users and the numbers of both followers and followed users did not demonstrate a statistically significant association with depression. Tsugawa et al. posited that the differing findings of the two studies may be due to cultural differences between Japanese- and English-language users of Twitter, though it was granted that more research was needed to clarify this matter. An SVM classifier trained using the statistically significant non-bag-of-words features achieved an accuracy rating of 66 percent.
2.5 Studies Utilizing Korean Social Media Data
In this section we will discuss studies that leveraged Korean social media
data to either predict or generate insights into suicide or depression.
In 2013, there were three such studies of significance published. Won et al. compared the potential of suicide- and dysphoria-related weblog entries to predict the national number of suicides against that of social, economic, and meteorological variables over the period from 2008 to 2010. After applying a set of filtering operations to remove noise such as advertisements, all remaining weblogs posted to Naver Blogs between January 1st, 2008 and December 31st, 2010 were used to tabulate a "suicide weblog count" and a "dysphoria weblog count". The suicide weblog count was defined as the number of weblogs that included the Korean word for suicide, jasal, at least once. The dysphoria weblog count was defined as the number of weblogs containing the Korean word for stressed or fatigued, himdeulda, at least once. Significant variables were detected through individual univariate regression tests. Following this, a multivariate regression model was generated using the previously identified significant variables. The study concluded that the dysphoria weblog count was the stronger and more stable predictor of national suicide numbers. The suicide weblog count was observed to be more variable and susceptible to sudden and drastic spikes, especially in response to celebrity suicide events (Won et al., 2013).
Another study published in 2013 identified the potential of non-linguistic Facebook data for identifying depressed users of the platform. Park et al. attempted to use Korean Facebook data to identify symptoms in users suffering from depression. This study was unique amongst the studies discussed thus far in that it employed its own app, EmotionDiary, in order to engineer its own features based on behavior exhibited on the social media platform. 55 Facebook users were recruited through advertising on the campus of a large Korean university, and each user was administered the CES-D. Participants were classified as depressed if they obtained a score of 25 or higher. The study concluded that there was a statistically significant positive correlation between reading tips and facts about depression in the Facebook app and a score of 25 or higher on the CES-D. Furthermore, the number of friends and location tags associated with a subject of the study was negatively correlated with a score of 25 or higher. While the sample size was small and linguistic data was not used, this study was another significant attempt at leveraging social media data to identify possible symptoms of depression in users of a social media platform (Park et al., 2013).
Contrasting with other studies in our overview, Park et al. relied not only upon the social media data of users, but also upon survey results that indicated users' views regarding their usage of a social media platform. 14 volunteers, 7 depressed and 7 controls, were interviewed about their views regarding their usage of the Twitter social media platform. Interviewers collected the age, gender, education level, job title, Twitter ID, and depression diagnostic history of each interviewee. Each participant was also required to take the CES-D, with scores of 22 or above used to classify participants as depressed. While 253 volunteers completed this initial screening, after screening for participants that were both active on Twitter and willing to share their tweets, only 69 participants remained: 23 depressed and 46 not depressed. Of these, only 24 agreed to do a more in-depth interview. One of the depressed users failed to show up for the scheduled interview, leaving 14 interviews, with an even split between depressed and non-depressed participants. This drastic reduction in sample size serves to demonstrate the difficulty of relying upon volunteered survey data, which is one motivating factor for the pursuit of the automated methods found in the studies of Coppersmith et al. as well as in this work (Coppersmith, Dredze, and Harman, 2014; Coppersmith et al., 2015; Coppersmith et al., 2017). The interviews consisted of questions regarding matters related to experiences of depression and usage of the Twitter social media platform.
In addition to the interviews, 1,523,377 tweets were collected from 1,363 friends of individuals in the depressed group, and 1,649,761 tweets were collected from 1,756 friends of individuals in the control group. A random sample of 10,000 tweets was generated from this dataset, with 5,000 tweets being randomly selected from each class. After using LIWC to analyze the randomly sampled tweets, it was found that the analysis of the tweets consumed by the two groups corroborated the differing accounts provided by the two groups for the type of content they consumed on Twitter. Depressed users reported consuming more emotional content, which was reflected by higher average scores in multiple LIWC affective lexical categories. The average affectiveness of tweets from the depressed users' friend group was 4.67, while for the control friend group it was 3.42 (Park, McDonald, and Cha, 2013).
A study conducted by Park et al. in 2015 leveraged non-linguistic social
media data to gain insight into the online behaviors of depressed Facebook
users. 234 students with Facebook accounts that were at least a year old were
recruited at a large university. Of the 234 students, 212 completed an online
version of the CES-D that was administered through a Facebook app and
consented to providing access to their Facebook wall activities. Of these, 120
participants obtained a score of 20 or less, indicating a low probability of being
depressed. The remaining participants obtained a score between 21 and 60, indicating a strong probability of being depressed. An attempt to validate the efficacy of this means of diagnosing participants was made via comparisons to the results of alternative measures. The first measure was the BDI, and the second was administered through a face-to-face interview, which utilized
another diagnostic tool, the Hamilton Depression Rating Scale (HAM-D). It
is worth noting that despite sending out personal invitations and offering free
counseling to users identified as depressed through the CES-D, very few of the
participants responded and ended up being interviewed: only 6 participants
out of the 42 that obtained a score over 20 on the CES-D. Despite this, user
metadata was collected, and the correlation of various aspects of user inter-
action with the Facebook platform to CES-D scores was measured through
the Pearson correlation coefficient and a linear regression analysis. The study
found that depressed users used fewer geolocation tags, indicating that they
posted from a less varied group of locations, and that they had fewer Face-
book friends. In addition, they made an equal number of wall posts, liked more posts, and viewed more tips about depression through the same proprietary app through which the CES-D was administered. Depressed users received significantly fewer likes on their wall posts, and likewise significantly fewer comments, than non-depressed users. The rate of comment posting was approx-
imately the same between the two groups. Some limitations of the study that
were identified by the authors included a lack of analysis of linguistic data and
a largely homogeneous sample. In a round of follow-up surveys, a majority of
participants indicated that they found the app administered in the study ben-
eficial and educational. Furthermore, the authors concluded that the similar
level of prevalence of depression between the studied demographics and the
nation-wide sample gathered by various offline studies indicated that online
screening could be a viable means of connecting with depressed individuals,
at least in the demographics that more commonly use social media. (Park
et al., 2015). Also utilizing the proprietary Facebook app EmotionDiary, Lee et al. found that an expressive writing activity administered through the app was effective at treating symptoms of depression (Lee et al., 2016). This finding, combined with the potential of online diagnostic methods, suggests that the recurring theme, present in many of the aforementioned studies, of participants being unwilling or unlikely to seek a diagnosis of depression, even when symptomatic, may no longer prove to be the obstacle to treatment it has been, provided that methods such as those suggested in this paper and in the various studies discussed in this section continue to demonstrate their value.
2.6 Qntfy Studies
Qntfy is a U.S.-based company that describes itself as "a technology solutions provider bridging data science and human behavior" (Qntfy). They
specialize in leveraging psychological and behavioral data so as to support the
operations of for-profit, not-for-profit, and government organizations. In addi-
tion to its custom data analysis work, Qntfy also publishes original research
in peer-reviewed publications. Many of these studies are the basis for the ap-
proach taken in this thesis, and so we will provide an overview of this work
here.
In 2014, Coppersmith et al. found that linguistic signals related to various mental health disorders were present in Twitter data and could be leveraged through a simple unigram or character 5-gram language model (Coppersmith, Dredze, and Harman, 2014). First, Twitter users were identified as suffering from a mental health condition by searching for users who tweeted a statement similar to "I have been diagnosed with PTSD", where PTSD could also be depression, bipolar disorder, or seasonal affective disorder. Up to 3,200 tweets were
collected for each user that had posted such tweets, and then a corpus of
control tweets was generated by scraping the tweets of a random sampling of
users that had posted over the same time period. Various so-called "pattern of life" data were also measured, including measurements of how often a user posted, the proportion of tweets including mentions of other users, and the proportion of self-mentions. An analysis of the text data was also done using
LIWC. While LIWC was valuable in reproducing previous findings concerning
the language of mental health, neither the pattern of life analytics nor the data
obtained through the LIWC analysis were as effective at differentiating users
as the unigram and character 5-gram language models. The authors concluded
that their results indicated that a variety of signals relevant to mental health
were observable in Twitter data, and in particular in its lexical data.
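To illustrate the kind of character 5-gram model described above, the sketch below uses scikit-learn with stand-in training data; it shows the general feature construction, not the authors' actual implementation or classifier.

```python
# A sketch of character 5-gram features feeding a simple classifier;
# the two training tweets and labels are stand-ins for real scraped data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["i was just diagnosed with depression",
          "great run this morning with friends"]
labels = [1, 0]  # 1 = diagnosed user, 0 = control

# Character n-grams sidestep tokenization, which helps on noisy tweet text.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(5, 5)),
    LogisticRegression(),
)
model.fit(tweets, labels)
print(model.predict(["feeling depressed again today"]))
```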
In 2015, Coppersmith et al. published a summary of the Computational Linguistics and Clinical Psychology (CLPsych) shared and unshared tasks (Coppersmith et al., 2015). The task used data from Twitter users who stated
that they had been diagnosed with depression or PTSD and demographically-
matched community controls, with the goal of comparing various methods
of modeling language from social media related to mental health. Data was collected in the fashion described in (Coppersmith et al., 2015). Three binary classification experiments comprised the shared task: 1) depression versus control, 2) PTSD versus control, and 3) depression versus PTSD. Classifier performance
was measured primarily by average precision. Twitter users were divided into
train and test sets, with the train partition consisting of 327 depressed users,
246 PTSD users, and an age-and-gender-matched control user for each, for
a total of 1,146 users. The test data contained 150 depressed users and 150
PTSD users, which, combined with the matched controls, amounted to a total
of 600 users. Participants in the shared task consisted of four teams: The Uni-
versity of Maryland (UMD), The World Well-Being Project (WWBP), The
University of Minnesota at Duluth (Duluth), and a team comprised of mem-
bers employed at Microsoft, IHMC, and Qntfy (MIQ). The authors concluded
that the results of the shared task demonstrated the relative superiority of
topic-modeling over simple linguistic features for the shared tasks, though
such features provided some classification ability, even without the utilization
of complex machine learning techniques.
Using a data collection method similar to that used in the study con-
ducted in 2014, Coppersmith et al. used Twitter data scraped from users that had been identified as having made a public declaration of a suicide attempt to perform an exploratory analysis of the tweets posted prior to a user's suicide attempt (Coppersmith et al., 2016). 554 users were identified as having made a public declaration of a suicide attempt; of these, however, only 312 gave an indication of when their latest attempt was. 163 users provided an exact date, and of these, 125 had data available that was posted prior to their respective suicide attempts. In a similar fashion to what was accomplished in previous studies, Coppersmith et al. found that they were able to distinguish those who had attempted suicide from controls using n-gram language models with logistic regression. It was also found that users that had attempted suicide posted a greater volume of tweets than users in the control group. An emotional state classifier was also developed using hashtags as labels. This emotion classifier was used in order to explore the emotional makeup of users' tweets prior to a suicide attempt. Based on the labels generated by this automatic classifier, it was concluded that users that had attempted suicide posted a greater proportion
of tweets that could be categorized as angry or sad than controls did. These
proportions fall to levels similar to that of controls in the weeks following a
suicide attempt, however. Tweets labeled as fearful or disgusting were similar
between the control group and the suicide group in the weeks preceding a
suicide, but the suicide group showed a decrease in these categories to levels
below that of the control group in the weeks following a suicide attempt. In-
terestingly, and perhaps counterintuitively, the suicide group showed a lower
proportion of tweets labeled as indicating loneliness compared to the control
group. Furthermore, this difference tended to widen in the weeks following a
suicide attempt.
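The hashtag-labeling step is a form of distant supervision: hashtags act as noisy emotion labels and are stripped from the text so that a downstream classifier must learn from the remaining words. A sketch of the harvesting step, with an invented hashtag-to-emotion map:

```python
# Distant supervision via hashtags: turn tweets bearing a known emotion
# hashtag into (text, label) pairs. The hashtag map is invented for illustration.
import re

HASHTAG_EMOTIONS = {"#angry": "anger", "#sad": "sadness", "#lonely": "loneliness"}

def harvest_labeled_examples(tweets):
    examples = []
    for tweet in tweets:
        for tag, emotion in HASHTAG_EMOTIONS.items():
            if tag in tweet.lower():
                # Remove the label hashtag so the model cannot simply memorize it.
                text = re.sub(re.escape(tag), "", tweet, flags=re.IGNORECASE)
                examples.append((text.strip(), emotion))
    return examples

print(harvest_labeled_examples(["Everything went wrong today #sad",
                                "nobody ever calls #Lonely"]))
```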
In contrast to the previous studies discussed, which generally tracked
users’ activity over years, months, or weeks, or alternatively did not include
time as a variable at all, Loveys et al. sought to explore micropatterns occurring in messages over much shorter periods of time (Loveys et al., 2017).
Data was collected similarly to the method described in the previous studies
in this section. Tweets were collected for users that stated they were diag-
nosed with generalized anxiety disorder, an eating disorder, panic disorder, and
schizophrenia. Users stating that they had attempted suicide were included in
the study as well. These conditions were chosen as they were considered to
have symptoms that are the most sensitive to timing. Sentiment was analyzed using VADER, or the Valence Aware Dictionary and Sentiment Reasoner, a lexicon- and rule-based sentiment analysis tool specifically attuned to sentiments expressed in English on social media. The authors examined the
emotional content of three tweets following an initial tweet when the following
tweets were posted no more than three hours later. Tweets could be counted
in more than one overlapping micropattern if more than three tweets were
posted by a user within three hours. Continuing the line of inquiry followed in the previous studies discussed in this section, the authors compared the relative performance of the micropatterns, the underlying sentiment labels, and a combination of the two on a binary classification task. Micropatterns were shown to provide information beyond that provided by the sentiment labels alone for all mental health categories.
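For reference, scoring a single message with VADER takes only a few lines, assuming the vaderSentiment Python package; the cited study's surrounding micropattern pipeline is not reproduced here.

```python
# Minimal VADER usage via the vaderSentiment package.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I can't take this anymore :(")
print(scores)  # dict with 'neg', 'neu', 'pos', and 'compound' scores
```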
Also in 2017, a team working at Qntfy developed an annotation scheme for classifying depressed tweets according to a number of categories to generate the Depressive Symptom and Psychosocial Stressors Acquired Depression (SAD) corpus (Mowery et al., 2017). Using the DSM-5, elements of the DSM-IV, and
other descriptions of depressive symptoms documented in the psychiatric lit-
erature in conjunction with additional depression related categories observed
in data, such as weather and media, the authors developed an annotation
scheme. Both a psychiatrist and a counseling psychologist provided feedback
on the annotation categories prior to its finalization. Data for the corpus was
collected by searching the Twitter API using depression-related terms from the
LIWC corpus. In addition, data collected for the CLPsych 2015 shared task
described previously in this section was sampled. To validate the annotation
scheme, two psychology graduate researchers and a postdoctoral biomedical
informatics researcher annotated the 1200 tweets comprising the SAD corpus.
While interannotator agreement was high for tweets indicating no evidence of clinical depression, agreement was much lower for depressive symptoms and psychosocial stressors. Keywords were found to have much more predictive value for tweets in the CLPsych data than in the SAD corpus. The au-
thors theorized that this was due to the depression-related vocabulary being
grounded by users’ statements that they had been diagnosed, whereas in the
SAD corpus such terms could appear without such contextual grounding, lead-
ing to difficulty in classifying tweets accurately according to pre-defined lexical
categories. While the authors hoped future investigations into machine learn-
ing based postprocessing techniques could mitigate these limitations, overall
this study highlighted the present research difficulties in improving upon com-
putational methods with a qualitative analysis.
Lastly, in 2017 Coppersmith et al. addressed how linguistic signals of depression could be used by health care professionals as a supplement to data that is already collected by the health care system (Coppersmith et al., 2017). Using
both a VADER sentiment classifier and the hashtag-derived emotion classifier
developed by Coppersmith et al. (2016), the authors generated probability
distributions for each of the possible sentiment and emotion labels that could
be assigned to the internal chat and communications within a company. In
order to estimate the variety and proportions of emotions and sentiments ex-
pressed by a company on a given day, the authors aggregated the messages and
summed up the probabilities associated with each label, ignoring communica-
tions labeled as 'neutral' or 'no sentiment'. The rolling means of various sentiments and emotions over a one-week window were calculated over a 36-day period. The data analysis revealed that the company appeared to have increases in average negative sentiment in the weeks leading up to a big deliverable. In contrast, peaks of joy were observed in the periods preceding holidays and the completion of the first deliverable of a project. The authors suggested that these
findings were illustrative of the population-level analysis that is now possible
with computational analysis and classification tools. While the classifiers used
in this study were for emotion and sentiment, the authors indicated that the
mental health classification tools they had utilized in the studies discussed
earlier in this section were equally conducive to this sort of population-level
analysis.
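The rolling aggregation described above can be sketched as follows; pandas is an assumed tooling choice here, and the daily values are random placeholders for the summed per-day label probabilities.

```python
# Smooth daily aggregate sentiment with a one-week rolling mean over a
# 36-day period, as described above. Data values are placeholders.
import numpy as np
import pandas as pd

days = pd.date_range("2017-01-01", periods=36, freq="D")
daily_negative = pd.Series(np.random.rand(36), index=days)

weekly_trend = daily_negative.rolling("7D").mean()
print(weekly_trend.tail())
```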
3 Corpus Data
Twitter is an online social media service launched in July of 2006. It has
330 million active users as of October 2017 and was ranked the 13th most
visited site on the internet as of May 16, 2018 (Alexa Top 500 Global Sites).
Messages posted by users, referred to as "tweets", are limited to 140 Japanese, Korean, or Chinese characters, and to 280 characters in other languages. Photos and other media, as well as URLs and screennames, do not count towards the character limit. Users' account pages are publicly viewable, and their tweets are publicly viewable by default; this setting can be modified by users so that tweets can only be seen by registered Twitter users that have subscribed
to the user in question. Subscribing to another user puts that user’s tweets
on the "timeline" of the subscribing user, and this act of subscribing is called "following". A user's timeline is made up of the tweets of the Twitter users they are following. Users can mention other users through the use of the "@" character before another user's screenname; users have the ability to isolate and see all tweets containing such mentions of their screenname through the platform. In addition to being able to reply to other users' tweets, tweets can be "retweeted" by other users, allowing them to be shared on the timelines
of others with attribution to the original tweeter. Twitter users also have the
option of messaging each other privately and of blocking other users so that
future tweets or direct messages sent to them by the blocked user will not be
31
viewable or trigger a notification. Tweets can be posted by users through the
Twitter website, approved external applications such as smartphone apps, and
through SMS. Users can also click a button to ”like” a tweet. How many likes
a tweet has will be visible to all users that are able to view the tweet.
3.1 The Twitter API and User Selection
Twitter's developer platform offers several application programming interfaces (APIs), or sets of methods and properties that can be used to interact with data on the Twitter platform. These APIs allow developers to interact with username data, media, text data, and other metadata for usage in other apps. For example, through the Twitter developer platform and the use of its APIs, a smartphone developer could create an app that allows users to access their Twitter feed within another app, while a web developer could use an API to embed relevant tweets on a website. An API can be thought of as a list of rules and directions for accessing data and features. The Twitter API platform includes access to numerous endpoints, where an endpoint is a unique URL address pointing to an object. Twitter objects are generally represented as JSON files and consist of tweet objects, user objects, Twitter entities, Twitter
extended entities, and geospatial objects. Twitter API usage is rate limited, which means that the number of tweets that can be extracted is limited over a 15-minute window. At the time of this writing, the rate limit is 450 calls per 15-minute window for past data and 15 calls per 15-minute window for live data. Real-time data uses a separate API called the Streaming API, so-called because it allows developers to interact with tweets just as they are uploaded. Another limitation of the Twitter API is that it does not provide tweets older than 7 days. Due to these limitations, a supplemental tool for tweet location and extraction was necessary.

Figure 3.1: Tweet object retrieved through the Twitter API.
The Twitterscraper Python script developed by Ahmet Taspinar bypasses the Twitter API and instead uses the Twitter website's advanced search function and the BeautifulSoup library to extract tweets (taspinar/twitterscraper: Scrape Twitter for Tweets). This allows us to retrieve tweets that are older than 7 days. As with objects retrieved through the Twitter API, each tweet is retrieved as a JSON object. For each tweet retrieved, twitterscraper retrieves the username of the user that posted the tweet, the tweet id, the tweet url, the tweet text, the tweet html, the tweet timestamp, the number of likes the tweet has received, the number of replies to the tweet that have been posted, and the number of times the tweet has been retweeted by other users. Because the script utilizes the Twitter website's advanced search functionality, various arguments can be given to queries. Searches for tweets can be restricted by timespan, region, and language.
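A minimal sketch of such a query is given below; the parameter values are illustrative only, and the library's interface may differ between versions.

import datetime as dt
from twitterscraper import query_tweets

# Illustrative parameter values; this is a sketch, not the exact
# extraction script used for this study.
tweets = query_tweets(
    "우울증 진단",                    # the search query
    begindate=dt.date(2010, 1, 1),   # restrict the timespan
    enddate=dt.date(2017, 12, 31),
    lang="ko",                       # restrict to Korean-language tweets
)

for tweet in tweets[:5]:
    # Each result carries the fields described above.
    print(tweet.user, tweet.timestamp, tweet.likes, tweet.text)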
3.2 Tweet Extraction
A search query was run using the Twitterscraper script for the term "우울증 진단" (uuljeung jindan), or "depression diagnosis" in Korean. The search was restricted to posts made in Korea and made in the Korean language over an approximately 8-year period from 2010 to 2017. 441 users were initially retrieved for the depressed class. After using human annotators to identify tweets that indicate a claim of a diagnosis of depression (see Table 3.1 for examples of such tweets), the number of users in the depressed category was reduced to 139. Then, with a limit of 3,200 tweets per user, the past tweets of the users we identified as belonging to the depressed class were extracted. A control group of 4,000 randomly selected users posting over the same time period was constructed by searching for an empty string and having their tweet data extracted as well, again with a cap of 3,200 tweets. In order to make finding meaningful signals easier, usernames and URLs were removed from the data. The final numbers of users and tweets can be seen in Table 3.2.

Figure 3.2: An example of a tweet object obtained through the Twitterscraper Python script.

Genuine Statements of Diagnosis
“병원가보세여 저도 불면증와서 갓다가 우울증진단받구옴 꼭꼭 병원가보세욧ㅠㅜ”
“5년 내내 우울증에 시달리다가 올해 처음 정신과 가서 불안장애랑 우울증 진단받고 대략 5개월째 치로중ㅇ이야”
“아 병원은...일상 유지가 힘들어서 정신과 찾아갔는데 만성 우울증이랑 PTSD 진단받고 상담이랑 약물치료 병행하고있어요”

Disingenuous or Less Certain Statements of Diagnosis
“아 우울증이란 진단을 안 받아서 글치 직장인들 태반이 우울증일게야 아마”
“우울증 자가 진단... 60점 나왔다 – ;; 아∼!!!!!!!!! 심심해서 그래 ㅋㅋㅋ”
“네이트 검색어에 직장인 회사 우울증 나오길래, 클릭 했는데....점수가.....고도 판정.... 나 전문가 진단 받으래..ㅠㅠ 자가 진단에서 이런 결과나 받고.”

Table 3.1: Tweets used by native speakers to detect users belonging to the positive class.

            Users    Tweets
Depressed     139     17140
Control      4000    594281

Table 3.2: Number of users and tweets broken down by category
3.3 Tokenization
Tokenization is the process of taking a text string and breaking it into smaller portions, called tokens, that can be counted and used as the features through which a classifier will "interpret" a document.
For the first round of classifier experimentation, simple space tokenization
was used. This means that tweets were simply divided according to the white
spaces appearing in the tweet, with each token being a substring between two
whitespaces within a tweet. As indicated above, URLs and screen names were removed from all tweets prior to tokenization. For a second round of trials, the open source text segmentation library MeCab was used. While originally developed for Japanese by the Nara Institute of Science and Technology, a Korean fork of the project, Mecab-ko, has been developed by the Eunjeon Project (은전한닢 프로젝트: 은전한닢 프로젝트를 소개합니다.). Unlike a simple space
tokenizer, Mecab is able to identify parts of speech and select out sequences of
syllables with semantic relevance. It can also recognize misspellings in many
cases, as well as utilize information encoded by punctuation in a text se-
quence. Examples of tweets tokenized by both the space tokenization method
and MeCab-ko can be seen in Table 3.3.
Space Tokenization
’ㅠㅠ’ ’필요할거에요’ ’동의가’ ’부모’ ’미성년자는’
’댕댕이’
’아님’ ’고치겠다는’ ’뜯어’ ’때까지’ ’조곤조곤’ ’말은’ ’아니다’ ’몰아치기식’
’자두되나’
’시간이’ ’유익한’ ’상담으로’ ’친절한’ ’교수님의’ ’있었고’ ’의미’ ’더욱’ ’있어’ ’오를’ ’정상에’ ’빠짐없이’’분도’ ’다녀왔습니다’ ’산행을’ ’심학산으로’ ’경기도’ ’함께’ ’환우분들과’ ’다지고자’ ’희망을’ ’완치에의’ ’토요일에는’’19’ ’지난
Mecab
’ㅠㅠ’ ’..’ ’.’ ’에요’ ’거’ ’할’ ’필요’ ’가’ ’동의’ ’부모’ ’는’ ’성년자’ ’미’ ’@’
’댕댕이’
’?’ ’아님’ ’것’ ’다는’ ’겠’ ’고치’ ’어’ ’뜯’ ’까지’ ’때’ ’될’ ’조곤조곤’ ’은’ ’말’ ’다’ ’아니’’식’ ’기’ ’몰아치’ ’”’ ’는’
’나’ ’되’ ’자두’ ’..’ ’.’
’*’ ’이’ ’시간’ ’유익’ ’상담’ ’친절’ ’님’ ’교수’ ’고’ ’었’ ’의미’ ’더욱’ ’있’ ’수’ ’오를’ ’정상’’빠짐없이’ ’도’ ’한’ ’습니다’ ’다녀왔’ ’산행’ ’으로’ ’산’ ’심학’ ’경기도’ ’함께’ ’과’ ’들’ ’분’ ’환우’’고자’ ’다지’ ’희망’ ’완치’ ’에’ ’토요일’ ’19’ ’9’ ’지난’ ’을’ ’의’ ’어’ ’..’ ’.’ ’는’
Table 3.3: Examples of Tokenized Tweets
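A minimal sketch contrasting the two strategies is given below, assuming the konlpy wrapper around Mecab-ko is installed; the example tweet is hypothetical, not an item from the study corpus.

from konlpy.tag import Mecab

tweet = "우울증 진단받고 치료중이야"   # hypothetical example tweet

# Space tokenization: each token is a substring between whitespaces.
space_tokens = tweet.split()

# Morphological tokenization: Mecab-ko segments morphemes, separating
# stems from particles and endings.
mecab = Mecab()
mecab_tokens = mecab.morphs(tweet)

print(space_tokens)   # e.g. ['우울증', '진단받고', '치료중이야']
print(mecab_tokens)   # e.g. ['우울증', '진단', '받', '고', '치료', '중', '이', '야']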
3.4 Caveats
As in previous studies utilizing this method of dataset creation, there
are certain caveats to keep in mind in regard to its efficacy. As indicated by
Coppersmith et al. in their paper that introduced the data collection method
used for this experiment, significant among these are the following: 1) Individuals willing to speak publicly on such a taboo subject may only represent a unique subpopulation, and not depressed individuals as a group. 2) The claim
that a given user has been diagnosed with depression is not verified, but sim-
ply taken at face value. 3) There is the possibility that some percentage of
the control group users are also depressed. 4) Twitter users themselves are
possibly not representative of the greater population of depressed individuals.
(Coppersmith, Dredze, and Harman, 2014)
4 Classification Methods
In the previous chapter, we discussed our dataset and how we would
organize it so that various machine learning classifiers could learn how to tell
if a tweet came from a depressed user or a nominally non-depressed user. In this
chapter, we will provide a brief discussion of the tools employed to accomplish
that task, i.e., the machine learning classifiers themselves. We begin with a very brief introduction to machine learning and its applications.
4.1 Definitions
Machine learning is a subfield of computer science that employs statistical
knowledge through algorithms to enable models of the world to be ”learned”
by computers. These models can then be used to make predictions about the
world. American computer scientist Tom Michael Mitchell formally defined
machine learning as follows:
”A computer program is said to learn from experience E with re-
spect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with expe-
rience E.”
(Mitchell, 1997)
For our purposes, task T would be determining whether or not a tweet
was posted by a user diagnosed with depression. The performance measures P
utilized for determining how well a given ”computer program”, or algorithm,
accomplishes this task are discussed in the following chapter. That leaves us
with the task of defining E, or the ”experience” that the classifiers will need
in order to improve at their assigned task.
4.2 Training, Testing, and Cross Validation
The task of the experiment is a supervised learning task, due to the nature
of the training, or experience as defined above, that is employed to improve
the performance of the model generated by the learning algorithm. In essence,
some of the tweets in the dataset are set aside for a given classifier so that the
classifier can hopefully start to detect patterns between the text and the labels
that have been given to the texts, i.e., depressed or not-depressed. Then, it
makes predictions on a held-out set of labeled tweets based on its experience. The classifier's performance is evaluated on this test set by comparing its predictions against the true labels. If there were not
a pre-determined correct or incorrect answer for the algorithm’s predictions,
and if the classifier did not train with data explicitly labeled and divided into
groups according to intentions and knowledge held outside of its own findings,
i.e., the understanding of what constitutes a depressed Twitter user versus a
control user, as was outlined in the previous chapter, then the task would
become an unsupervised learning task. In an unsupervised learning task, the
goal is often for the algorithm to find interesting patterns in a dataset either
as an end in itself or as a prelude to improving performance on a supervised
learning task.
When training and testing for a classification task, however, there are
certain pitfalls that have to be avoided. As indicated above, the data is sepa-
rated into at least two subsets, a training set and a test set. This is because
what is being sought in the improvement of a classifier’s performance is in-
formation that is generalizable to new, unseen data, not to see how well it
memorized the data it has already seen. That would be an extreme example
of overfitting, or a case where a classifier does very well on a very specific and
limited set of data, but does not perform well when given new or novel data
to classify. But even after splitting the dataset into two subsets, what if all of
the data in our training set belonged to depressed users, while all of the data
in the test set belonged to the control group? One would expect the classifier to perform very poorly on the test set. In this case, the classifier would
be underfitting, or not detecting any signals that would have allowed it to in-
crease its performance on the task. It could be said that the classifier is highly
biased towards the training data, and was not able to learn more generalizable
patterns because it was so drastically limited by the data it could train on.
When training and testing a classifier, it is desirable to avoid both overfitting and underfitting the data; often, an experiment design choice that decreases the probability of one will increase the probability of the other. This is a tradeoff between bias and variance, where a
biased classifier is one that underfits, and a classifier exhibiting a large degree
of variance is one that overfits.
One solution to this problem is cross-validation. Cross-validation is a process of training and testing a classifier that consists of partitioning the dataset into multiple subsets and then alternating which subset serves as the test set with each "fold", or iteration, of the process. After each subset has been used as the test set once, the performance of the classifier on each test set is averaged as the measure of its overall performance. There are many ways to split the data and many variations of cross-validation, but our experiment relies upon the widely-used convention of 10 folds for our cross-validation process, meaning that our classifiers will train on 90 percent of the data and test on the remaining 10 percent ten times, with the data comprising the test set being unique for each iteration. This reduces bias by allowing our classifiers to learn from all of the signals available in the data, while guarding against overfitting by repeatedly alternating the data used for the test set.
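A minimal sketch of this procedure with scikit-learn follows; the placeholder texts and labels are hypothetical stand-ins for the tokenized tweets, not the exact experimental pipeline.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical placeholder data; illustrative only.
texts = ["우울증 진단 받았다"] * 50 + ["오늘 날씨 좋다"] * 50
labels = [1] * 50 + [0] * 50          # 1 = depressed class, 0 = control

X = CountVectorizer().fit_transform(texts)   # bag-of-words features

# Each of the 10 folds trains on 90 percent of the data and tests on the
# remaining 10 percent; the mean F1 over folds summarizes performance.
scores = cross_val_score(LogisticRegression(), X, labels, cv=10, scoring="f1")
print(scores.mean(), scores.std())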
Having provided a brief indication of the nature of how a machine learning
classifier operates, we will conclude this chapter by providing a basic overview
of the classification algorithms that were employed in our experiment. Naive
Bayes was chosen as it is the simplest algorithm and often used for text clas-
sification tasks, even if only as a baseline to compare other approaches with,
and because it is very fast and easy to train. Because the feature space is quite
large and we assume the classes to be linearly separable, we also employ logis-
tic regression and linear SVM classifiers. A random forest classifier was also
implemented as an alternative approach to the two previous linear methods,
and also because it tends to work well in high-dimensional spaces.
4.3 Naive Bayes
When tasked with document classification, Bayesian classifiers use Bayes' theorem, along with an underlying assumption of feature independence, to determine the probability of a document belonging to one class or another. For example, if the classifier knows how often the term "depressed" appears in the depressed category of documents relative to the control category, it will use the tabulations from the data it trained on to determine how likely it is that a given document came from either the depressed or control class given the presence of the term "depressed" in the document; in other words, the relative frequency of "depressed" in each class. These per-feature probabilities are multiplied (equivalently, their logarithms are summed) across each feature in a document, giving the document a probability for each class. (Shimodaira, 2014)
P(\theta \mid D) = \frac{P(\theta)\,P(D \mid \theta)}{P(D)} \qquad (4.1)

4.1: The equation for Bayes' Theorem
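A minimal sketch of this classifier applied to text, using scikit-learn and a hypothetical two-tweet corpus, follows.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A hypothetical two-tweet corpus; illustrative only.
train_texts = ["우울증 진단 받았다", "오늘 점심 맛있었다"]
train_labels = [1, 0]                  # 1 = depressed class, 0 = control

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)   # token count tabulations

clf = MultinomialNB().fit(X_train, train_labels)

# Probability of each class given the tokens present in an unseen tweet.
X_new = vectorizer.transform(["우울증 진단"])
print(clf.predict_proba(X_new))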
4.4 Logistic Regression
Like linear regression, logistic regression makes linear associations be-
tween features and observations. In the case of document classification, fea-
tures are often word frequencies or ratios. Unlike linear regression, however,
logistic regression applies a logistic transformation to a linear function in order
to output a probabilistic class prediction between 0 and 1. This is equivalent
to saying that the logarithm of the odds of an observation belonging to a given
class can be represented by a linear function, or for the purposes of this study,
a linear function of token counts. This logarithm of the odds is also called the
logit of the probability. A decision threshold is applied to round the proba-
bility value generated by the logistic function to a discrete categorical value.
(Peng, Lee, and Ingersoll, 2002)
\pi = \frac{e^{X\beta}}{1 + e^{X\beta}} \qquad (4.2)

4.2: The logistic regression function
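As a worked illustration of this transformation, the snippet below uses hypothetical coefficients rather than values fitted in this study.

import numpy as np

# Hypothetical weights and token counts; not fitted values.
beta = np.array([0.8, -0.5])           # one weight per token feature
x = np.array([3, 1])                   # token counts for a single tweet

log_odds = x @ beta                    # the logit: linear in the counts
prob = 1 / (1 + np.exp(-log_odds))     # logistic transformation to (0, 1)

prediction = int(prob >= 0.5)          # decision threshold rounds to a class
print(log_odds, prob, prediction)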
4.5 Linear Support Vector Machines
This algorithm treats each observation as a vector of features. Each observation can therefore be conceptualized as a point in a high-dimensional space. For our purposes, an observation is a tweet and a feature is the frequency of a token within that tweet. After the documents have
been converted to vectors, a decision boundary function is used in order to
maximize the margins between the observations. This is equivalent to saying
that instead of simply finding a line or hyperplane that separates observations
belonging to one class or another, the algorithm finds the line or hyperplane
that maximizes the distance between the most similar observations belonging
to different classes, which are known as the support vectors. When evaluating
an unlabeled observation, the algorithm attempts to determine where in the
space it is located relative to the hyperplane it has constructed to divide the
two classes. (Joachims, 1998)
\vec{w}^{\,T}\vec{x} + b = 0 \qquad (4.3)

4.3: The equation for the decision boundary of the LSVM algorithm
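A minimal sketch of a linear SVM over bag-of-words vectors with scikit-learn follows; the two-tweet corpus is hypothetical and purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = ["우울증 진단 받았다", "주말 여행 다녀왔다"]   # hypothetical corpus
labels = [1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)    # each tweet becomes a feature vector

svm = LinearSVC().fit(X, labels)       # fits the maximum-margin hyperplane

# The signed value of w^T x + b locates an unseen tweet relative to the
# decision boundary dividing the two classes.
print(svm.decision_function(vectorizer.transform(["우울증 진단"])))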
4.6 Random Forest
The random forest classifier is a classifier that combines bootstrap aggregating (bagging) and decision tree learning (Liaw and Wiener, 2002). A decision tree can be conceptualized as a series of binary questions that, when answered in a hierarchical sequence, allow for the identification of an entity or observation. For example, if we were to attempt to predict whether or not an individual survived a natural disaster, we may start by inquiring whether the individual is male or female, if we know that all survivors were female. Because we can so accurately separate the classes with this one feature, gender, it would make up the root of the decision tree so long as there is no other feature of the survivors that so clearly distinguishes them from those that did not survive. If the answer is male and no males survived the disaster, we know that this individual did
not survive. On the other hand, if the individual was female, we could then
ask another question from the next level in the decision hierarchy. This next
decision can be conceptualized as a branch of the decision tree, and it will ask
about a feature relevant to distinguishing between the two classes, but that
was not as exclusive to one class versus the other as the feature at the previous
branch or at the root of three would be. At this branch, we might ask if she
was under 170 centimeters tall, and if no one that died in the incident was
both female and under 170 centimeters tall, we would know that she survived.
This final determination that this individual is a survivor is called the leaf,
or the decision, of the tree. For our classification task, each token can be con-
ceptualized as a feature that the tree could ask about. For each feature, the
tree must decide whether to split, i.e., form another binary branch, based on
that feature. This is determined by which feature costs us the least in terms of
predictive power to split on. This process is repeated in a recursive fashion on smaller and smaller subgroups until some terminal condition is met. One way of determining such a terminal condition is to set a minimum number of observations for each leaf or decision. In other words, going back to our survivor example, if we set the minimum number of examples belonging to a decision as eight, we would ignore any potential leaf that describes fewer than
eight observations belonging to a particular class. Another way is to set the
maximum depth, which limits how many branches a tree can have between its
root and its leaf.
Decision trees have many advantages, but one disadvantage is that they
are prone to becoming overly complex and overfitting data, which a random forest minimizes through bootstrap aggregating, or bagging. Bagging consists of creating a number of random samples with replacement from the larger dataset, and then, for each of these subsamples, training a tree. The resulting aggregation of trees is the random forest. To test the random forest in the case of a classification task, we take the majority vote of the trees in the forest as its prediction. One last important note is that random forests use a tree learning algorithm that learns on a random subset of features at each potential split. This is to prevent too many trees from becoming correlated with each other due to a few dominant signals (Ho, 2002).
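A minimal sketch with scikit-learn follows, where the parameters mirror the terminal conditions described above; the toy data is illustrative only.

from sklearn.ensemble import RandomForestClassifier

# Hypothetical token-count vectors; illustrative only.
X = [[3, 0], [2, 1], [0, 4], [1, 3]]
y = [1, 1, 0, 0]                       # 1 = depressed class, 0 = control

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees trained on bagged subsamples
    min_samples_leaf=1,   # minimum observations per leaf (a terminal condition)
    max_depth=None,       # or cap the branches between root and leaf
)
forest.fit(X, y)

# The forest's prediction is the majority vote of its trees.
print(forest.predict([[2, 0]]))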
4.7 Feedforward Neural Network
A feedforward neural network is a type of artificial neural network com-
posed of a collection of linked computational units, often referred to as nodes
or neurons, that are arranged in multiple layers, where information in the
resulting network of units is propagated forward from one layer to another,
from an initial input layer, to one or more hidden layers, and then finally to an
output layer, in a non-cyclic fashion. In this study, the initial layer consisted of the inputs, represented as a "bag of words", or a vector of zeros and ones, with each value in the vector representing the presence or absence of a unique token. These inputs are then multiplied by a vector of weights, which are the parameters of the network that are trained through a learning process. The
output produced by a hidden layer, or a layer between the input and output
layers, is the result of the application of an activation function to the values
from the previous layer. In a fully connected network such as the one utilized in
our research, each unit multiplies the vector of values from the previous layer
by weights associated with that unit, which can be conceptualized as a con-
nection or synapse between neurons. The values are then summed before the activation function is applied to the sum. In this study, a sigmoid function was used,
meaning that each unit in a hidden layer forward propagated a value between
0 and 1 to either a subsequent hidden layer or the output layer. This value can
be conceptualized as representing the confidence of the network in the weights
associated with that unit that are applied to its inputs. The weights used
to parameterize the network are learned through backpropagation, a process
where the error of the network’s prediction is distributed through the preced-
ing layers and units of the network, and the weights associated with each pair of units are adjusted accordingly (Bebis and Georgiopoulos, 1994). The feedfor-
ward neural network used in this study consisted of two hidden layers of 100
units each.
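A minimal sketch of this architecture follows, using scikit-learn's MLPClassifier as a stand-in implementation; the toy data and the max_iter value are assumptions for illustration.

from sklearn.neural_network import MLPClassifier

# Hypothetical bag-of-words presence vectors; illustrative only.
X = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y = [1, 0, 1, 0]                       # 1 = depressed class, 0 = control

net = MLPClassifier(
    hidden_layer_sizes=(100, 100),     # two hidden layers of 100 units each
    activation="logistic",             # sigmoid activations in the hidden layers
    max_iter=500,                      # iterations of backpropagation
)
net.fit(X, y)
print(net.predict([[1, 0, 0]]))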
5 Experiment
Inspired by the Qntfy studies discussed in Section 2.6 and the lack of research in this direction in languages other than English, especially Korean, an experi-
ment was conducted with data extracted according to the methods discussed
in Section 3.2. A comparison of performance in a binary classification task be-
tween depressed users and controls was conducted using the machine learning
classification methods discussed in Chapter 4 and two separate methods of
tokenization. To the best of our knowledge, this is the first study done using
Korean data attempting to distinguish the tweets of users suffering from a
mental health condition from those of a control group.
5.1 Methodology
After extracting the data, for each classification experiment a random sample of control-group tweets equal in number to the number of tweets in the depressed user class was set aside for training and testing in cross-validation. This resulted in a corpus of 34,280 tweets for training and testing. It was decided to balance the classes instead of utilizing a class ratio that more accurately represents the prevalence of depression in the population in order to ease classifier creation and interpretation, keeping in mind that results must be interpreted in light of this bias. This is a far smaller dataset than those
used in the Qntfy studies, but as stated in Section 1.4, one goal of our project
is to gauge the efficacy of this method of data extraction and classification
on a smaller dataset. The tweets were shuffled in order to help ensure that
each batch used during the training process was representative of signals in
the entire dataset and not an idiosyncratic cluster.
5.2 Explanation of Metrics
To measure the performance of the classifiers on the binary classification
task, we utilize F1 scores and receiver operating characteristic (ROC) graphs.
We explain these metrics and also the rationale for using them in this study.
5.2.1 Accuracy, Precision and Recall
The accuracy rating of a classifier, or the rate of correct predictions rel-
ative to the total number of predictions made, is usually not a satisfactory
measure when evaluating the performance of a classifier on a task related to diagnosing individuals with medical conditions. While a sim-
ple accuracy rating may suffice if the underlying task has to do with predicting
the result of a game given a certain state, or if predicting an event with a priori
equally likely outcomes, this does not extend to cases such as cancer detection
where a random patient undergoing screening is more likely to be cancer-free
than not, and in which the costs of a false negative prediction are far greater
than that of a false positive prediction. In cases such as these, metrics that
indicate how effective the classifier is at identifying positive cases are used, as the performance we are interested in is detection of the positive class. The rate of correct positive predictions relative to the number of positive cases observed is the recall of a classifier.
On the other hand, if a classifier simply predicted all cases as positive, that would not be very helpful, either. Consider again the cancer detection example. A classifier could obtain a one-hundred percent recall rating simply by diagnosing every patient with cancer. While false positives are less costly
than false negatives in this task, cancer treatment is time consuming and, in
most cases, life altering. And medical resources are limited; even if treatment
were painless, it simply is not practical to treat everyone as if they had cancer.
And of course, there is also the unnecessary trauma experienced by the patient
when receiving false diagnosis. For all of these reasons, we must also consider
the precision, or the rate of correct positive predictions to total positive pre-
dictions made by a classifier. Ideally, the classifier in our example will be able
to attain a high level of recall while also maintaining a high level of precision;
this means that it is able to identify the patients that need treatment while minimizing the number of incorrectly diagnosed patients, thereby avoiding the unnecessary stress inflicted upon patients and the significant costs incurred by both patients and the health care system.
Of course, there are a wide variety of possible classification tasks, and the
relative importance of recall versus precision can vary a great deal depending
on the task. That being said, for the purposes of our experiment, the three
measures discussed below are used in place of accuracy because they better
represent the greater importance of recall and precision over simple accuracy
in detecting depressed users.
\text{Precision} = \frac{\text{True Positive Predictions}}{\text{Total Positive Predictions}} \qquad (5.1)

5.1: The equation for determining the precision of a classifier

\text{Recall} = \frac{\text{True Positive Predictions}}{\text{Positive Observations}} \qquad (5.2)

5.2: The equation for determining the recall of a classifier
5.2.2 The F1 Score
The F1 score is a measure that indicates how well a classifier does in terms
of both its precision and its recall, given a probability threshold at which it
determines whether an observation belongs to the positive class. This is both
its strength and its weakness as a metric. With one number, we can gain
insight into how the classifier performs in a generalized way, but the F1 score
does not specifically tell us how well the classifier performs in terms of either
precision or recall; simply put, it is a representation of the balance between
the two. Given our balanced classes, the threshold used in our experiments was .5. It is calculated as follows:

F_1 = \frac{2}{\frac{1}{\text{recall}} + \frac{1}{\text{precision}}}
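A minimal sketch computing these quantities with scikit-learn over hypothetical label vectors follows.

from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: actual classes and predictions at a .5 threshold.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

print(precision_score(y_true, y_pred))  # true positives / total positive predictions
print(recall_score(y_true, y_pred))     # true positives / positive observations
print(f1_score(y_true, y_pred))         # the balance of the two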
5.2.3 ROC Curves
An ROC curve is a graph generated by plotting the true positive rate
(TPR) along the y-axis and the false positive rate (FPR) along the x-axis
at various threshold settings, with the threshold being the probability level
at which the classifier determines that an observation belongs to the positive
class. For example, if the threshold is set to .5, then the classifier will predict
that an observation is positive if the classifier determines that it has a .5 or
greater probability of belonging to the positive class. The best possible classi-
fier would be represented on an ROC graph by a point at the (0,1) coordinate, maximizing the area under the curve. This point indicates that the classifier is able to achieve a one-hundred percent true positive rate while also achieving a one-hundred percent true negative rate. In contrast, a classifier that made random guesses would be represented by a diagonal line beginning at the (0,0) coordinate that divides the graph in half. Points above
this line indicate that a classifier performs better than random chance, while
points below the line indicate the reverse. Lastly, the ROC curve graphs in
this section indicate the AUC, or area under curve, for each fold of a 10-fold
cross validation for each classifier. The AUC indicates the probability that the
classifier will identify a random observation from the positive class as more
likely to be of the positive class than a randomly selected observation from
the negative class. An AUC of .5 indicates performance no better than that
of a random guess, while an AUC of 1 would represent perfect classification
accuracy on a given test set.
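A minimal sketch of how such a curve and its AUC are computed with scikit-learn follows; the scores are hypothetical classifier probabilities.

from sklearn.metrics import auc, roc_curve

# Hypothetical probabilities for the positive class; illustrative only.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

# True and false positive rates at every threshold implied by the scores.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))   # 1.0 = perfect separation, .5 = chance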
5.3 Classifier Results with Space Tokenization
As discussed in Section 3.3, our first trial of experiments was conducted using simple space tokenization. This means that tokens were generated simply by splitting strings into chunks based on where white spaces appeared in a string. As mentioned previously, this is done after removing usernames and URLs from strings. We utilize this minimalist approach to tokenization in order to determine how much predictive insight a classifier can obtain through relatively unfiltered Twitter data. Given that a goal of this thesis is to minimize the need for domain knowledge or expertise, this is a reasonable baseline to explore. A 10-fold cross validation was used for the training and testing of all classifiers. As we can see in Table 5.1, the standard deviations of each classifier's F1 score over 10-fold cross validation are low, suggesting that the classifiers' performance is consistent over each subset of data. Logistic regression achieves the highest F1 score, with a score of .75. The performance disparity between the classifiers is not great, however. That all of the classifiers perform this task with a success rate well above chance guessing is visualized clearly in Figure 5.1. Lastly, as Table 5.1 indicates, each classifier obtains similar levels of performance to the linear classification method utilized on a similar dataset in (Coppersmith, Dredze, and Harman, 2014).
Figure 5.1: ROC graphs representing performance with space tokenization. Panels: (a) Logistic Regression, (b) Naive Bayes, (c) Linear SVM, (d) Random Forest.
                                  Space Tokenization    Mecab
Logistic Regression                      .75             .84
Multinomial Naive Bayes                  .72             .84
Linear Support Vector Machine            .72             .81
Random Forest                            .71             .83
Feedforward Neural Net                   .73             .83

Table 5.1: F1 Scores
5.4 Classifier Results with Mecab Tokenization
For our second round of trials, we utilize the Mecab-ko tokenizer discussed in Section 3.3. By contrasting simple space tokenization with the results obtained with Mecab, we can obtain an intuition for the degree to which morphological analysis aids classifiers in finding relevant signals that distinguish the two classes of users. Looking at Table 5.1, it is clear that the morphological analysis provided by Mecab-ko provides a significant boost to performance. The relative performance of each classifier remains approximately the same, with multinomial naive bayes seeing the most significant boost. It now performs as well as logistic regression, with an F1 score of .84.

Figure 5.2: ROC graphs representing performance with tokenization performed by Mecab-ko. Panels: (a) Logistic Regression, (b) Multinomial Naive Bayes, (c) Linear SVM, (d) Random Forest.
5.5 Linear SVM Top Features
While feature exploration was not the focus of our study, we have also provided graphs of the top features distinguishing the two classes using both space tokenization and Mecab. While the use of the Mecab tokenizer led to a significant increase in the performance of the linear SVM classifier, with the F1 score jumping from .72 to .81, the top features for the depressed class when using Mecab are quite different from those generated by simple space tokenization. The top features for the depressed class in the case of space tokenization contain Korean emoticons for sad faces and words we might intuitively associate with negative emotion, such as the Korean word for envy, bureopda (부럽다). The top features generated in the case of Mecab tokenization, however, do not contain these items. They do include the words 'pogi' (포기), which loosely translated means to give up or renounce; 'yushil' (유실), which is to be swept away or lost; 'dangyeobyeong' (당뇨병), diabetes; and, perhaps most appropriately, 'uuljeung' (우울증), depression.
Figure 5.3: Linear SVM Top Features with Space Tokenization.
Figure 5.4: Linear SVM Top Features with Mecab Tokenization.
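A minimal sketch of how such rankings can be read off a fitted linear SVM follows, assuming a recent scikit-learn CountVectorizer/LinearSVC pair; this is an illustration, not the exact analysis script behind Figures 5.3 and 5.4.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = ["우울증 진단 받았다 포기", "주말 등산 다녀왔다 희망"]  # hypothetical corpus
labels = [1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
svm = LinearSVC().fit(X, labels)

# Large positive weights push a tweet toward the depressed class,
# large negative weights toward the control class.
tokens = np.array(vectorizer.get_feature_names_out())
order = np.argsort(svm.coef_[0])
print("top control features:", tokens[order[:3]])
print("top depressed features:", tokens[order[-3:]])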
5.6 Precision-Recall Graphs
While organizing the data from our experiment, precision-recall graphs
were also generated. They have not been included or discussed thus far because
they do not serve our purposes outlined in sections 1.4 and 5.2. We will briefly
discuss them here, however, in order to address an aspect of our findings.
5.6.1 Precision-Recall Curves and Hard to Classify Tweets
As was explained earlier, precision is the rate of true positive predictions relative to total positive predictions, while recall is the ratio of true positive predictions to positive observations. Precision-recall curves plot recall on the x-axis and precision on the y-axis. Because precision is the probability of a positive prediction being correct, it is highly sensitive to the base probabilities
of the respective classes. It is for this reason that precision-recall curves are
often used when there is a severe class imbalance. Because this study and the
Qntfy studies used balanced classes, however, ROC curves were used in place
of precision-recall curves. However, if we look at the precision-recall curves generated by all classifiers used in our experiments when using space tokenization, we observe an interesting phenomenon: there is a consistent and dramatic drop in precision at a 70 percent recall rate for all classifiers. This means that there are some depressed-class tweets that the classifiers have great difficulty in identifying and that are only identified when the decision threshold is low; as a result, the precision drops precipitously. We observe a similar but less steep drop with Mecab tokenization and logistic regression at the threshold that generates an approximately 90 percent recall rate. For multinomial naive bayes, we again observe a similar but less pronounced drop in precision. Interestingly, random forest, while obtaining a lower average precision across all thresholds compared to logistic regression, does not produce this sudden drop in precision before reaching a threshold that generates an almost perfect recall rate.¹ It can be theorized that these problematic tweets contain signals that are shared between depressed and non-depressed users. Some of this overlap may be due to idiosyncrasies of tweet texts that are reduced or eliminated by the morphological analysis conducted by Mecab, leading to fewer problematic tweets that require very low thresholds to identify as belonging to the positive class.
¹ Average precision is a measure that summarizes the precision of a classifier over a set of thresholds, where thresholds are the probability levels at which a classifier makes a determination as to whether or not an observation belongs to the positive class. The precision at every threshold is weighted by the increase in recall from the previous threshold.
Figure 5.5: Precision-recall graphs representing performance with space tokenization. Panels: (a) Logistic Regression, (b) Naive Bayes, (c) Linear SVM, (d) Random Forest.
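A minimal sketch of how these curves and the average precision defined in the footnote are computed with scikit-learn follows; the scores are hypothetical classifier probabilities.

from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical probabilities for the positive class; illustrative only.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.7, 0.2, 0.6, 0.3, 0.1]

# Precision/recall pairs at every threshold; a sudden precision drop at high
# recall flags positive-class tweets that are hard to identify.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(average_precision_score(y_true, y_score))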
5.7 Discussion
As we can see, this method of data collection provides great potential
for the discovery of mental health signals when using various commonly used
machine learning classifiers. Logistic regression proved to be the best perform-
ing of all classifiers, regardless of the tokenization method used. Multinomial naive bayes proved to have an F1 score as high as logistic regression when using the Mecab tokenizer, .84, but the MNB classifier's average precision was still somewhat lower, .81 versus .86 for logistic regression. Results when using simple space tokenization were worse, with logistic regression outperforming the MNB classifier both in terms of its F1 score as well as in its average precision, with scores of .75 to .72 and .82 to .81, respectively. This suggests that logistic regression would be the preferred choice for the classification task under both a recall-prioritizing approach and a precision-prioritizing approach.

Figure 5.6: Precision-recall graphs representing performance with Mecab tokenization. Panels: (a) Logistic Regression, (b) Naive Bayes, (c) Linear SVM, (d) Random Forest.
The feedforward neural network did not perform better than the shallow learning classifiers. This may be due to the very large feature space, in conjunction with memory limitations on our experiment, leading to an inability to train a network complex enough to take advantage of the potential of a feedforward architecture. Alternatively, there simply may not be enough data for the network to train on so as to allow it to take into account more abstract nuances differentiating the two classes.
Given the significant increase in performance when using the Mecab to-
kenizer, we can infer that morphological analysis plays a key role in how ef-
fective any classification method may turn out to be. While some tweets in
the positive class are still difficult to identify, as was seen in Section 5.6, the morphological analysis of Mecab reduced the number of positive-class tweets that were difficult to identify, as evidenced by the lower thresholds at which classifiers saw a decrease in precision when using Mecab tokenization versus space tokenization.
The results of the experiment demonstrate that linguistic signals of de-
pression are as available and useful in a classification task in Korean using
social media data and machine learning as they have been suggested to be by
previous studies done in English. The true positive rate and false positive rate numbers cannot, however, be taken as close approximations of the values that would be produced on an actual population of randomly selected users, due to the imbalance between depressed and non-depressed individuals in an actual population, as well as the fact that there may be contamination of depressed users in the randomly selected control group. Nevertheless, the relative efficacy of the classifiers, and the comparable performance with past studies that also utilized datasets with balanced or nearly-balanced classes, show that these methods are at least as effective in Korean as they have proven to be in En-
glish.
Furthermore, in our brief examination of the top features identified by
the linear SVM classifier when using both Mecab and space tokenization, we can see that some linguistic signals that we might expect to see in depressed users' tweets emerge, such as sad face emoticons and words associated with negative emotion.
While this study only dealt with Korean data, we believe it indicates the
potential of these methods for the automatic detection of linguistic signals
of mental health conditions in a variety of languages. Availability of a good tokenizer for the language may be key in achieving optimal results, though it was demonstrated that significant differentiation between the classes could be achieved even with simple space tokenization.
6 Conclusion
Using social media to gain insight into signals tied to mental health
issues, both linguistic and otherwise, is an enterprise that has, in recent years,
grown at a rapid rate in terms of both research interest as well as in its viabil-
ity in providing results that can truly distinguish a target group from controls
in a way that is reliable, scalable, and feasible. However, most of this research
has been done in English. Our study demonstrates that an inexpensive data
collection technique first introduced in research done on English social me-
dia data, in conjunction with commonly used machine learning classifiers, is sufficient for distinguishing depressed social media users from a randomly se-
lected control group. Furthermore, it demonstrates, to our knowledge for the
first time, that this process is at least as effective with Korean data as it has
proven to be in past studies with English data.
In accomplishing this goal, we searched for users on the social media plat-
form Twitter that claimed to have been clinically diagnosed with depression
using the Korean language. Korean native speakers were then employed to
ensure that these claims were sincere and not sarcastic, made in jest, or oth-
erwise not an indication of an actual diagnosis of depression. Using a Python
script that allowed us to bypass the limitations of the Twitter API while still
leveraging publicly available data, we scraped the posting history of users
identified as diagnosed with depression according to the aforementioned stan-
dard, and then matched the resulting dataset with an equal number of tweets from randomly selected controls posting over the same time period. We then set various machine learning classifiers to work on a binary classification task using two kinds of tokenization methods. We did not, however, attempt to op-
timize these classifiers or utilize more sophisticated deep learning approaches
in a way that could maximize their potential in this task, and so this is left
for future research to explore.
Based on the ability of all classifiers to perform well above chance with
both tokenization methods in distinguishing tweets from depressed users from
those of a control group over the course of a 10-fold cross validation, it is
our belief that such findings may prove useful in future considerations of how
to leverage widely used social media platforms in identifying individuals that
may be at risk for suffering from debilitating mental health conditions such
as depression. Thus far, most studies in Korean dealing with the detection
of depression, even those that have leveraged social media data, have relied
on expensive and time-consuming surveys and interviews that often have a
low follow-through rate. Studies such as ours are a significant indicator that
relatively simple and cost-effective measures may, with more research, prove
to be a source of great contribution to the diagnosis and treatment of a large
group of individuals who suffer without seeking help from mental health pro-
fessionals.
Furthermore, in exploring our dataset, we find certain features that we
might intuitively expect to find from depressed users, such as words or emoti-
cons likely to indicate negative emotional states. In addition, we find a uniform
inability of the employed classifiers to reliably classify a portion of tweets with-
out a significant drop in precision. The number of tweets belonging to this
problematic group is dramatically reduced by morphological analysis, how-
ever. We theorize that this drop in precision may be due to language use idiosyncratic to Twitter, which is greatly reduced by
the morphological analysis provided by a tokenizer such as Mecab. That there
are still some tweets that are difficult to classify is not entirely unexpected,
as it is not clear intuitively or otherwise that we should expect the language
employed by depressed users on a platform such as Twitter to be exclusively
distinct from users not suffering from depression, thus leaving room for posts
that may be indistinguishable from those posted by a control group.
Our employment of a simple two-layer feedforward network fails to outperform our best performing shallow learning classifier, and we posit that this may be due either to there being no more complexity in the data to be mined, or to our simple network not being complex or optimized enough to make use
of such potential. As indicated above, we leave this avenue open for future
research to explore.
Lastly, we acknowledge that there are caveats to our study. Depressed
Twitter users, and in particular those willing to go public with a diagnosis,
may not be a good representation of depressed individuals as a whole. In addi-
tion, using a balanced dataset is not an accurate representation of the occurrence of depression in the general population, and compounding this fact is the potential for undiagnosed users, or alternatively diagnosed users who have not publicly disclosed their diagnosis, to contaminate the control group. Based on this study, however, we see good reason to believe that the methods employed in this thesis are as viable for Korean social media data
as they are increasingly being demonstrated to be in a growing body of work
done with English data. We also remain optimistic that with larger datasets, a
more optimized machine learning approach, and a more controlled experimen-
tal environment backed by personal data from willing volunteers, the results
and applications of approaches such as those employed in this experiment can
grow exponentially.
Bibliography
Ahn, J. (2012). “Depression, suicide, and Korean society”. In: Journal of the
Korean Medical Association 55.4, pp. 320–321.
Al-Mosaiwi, M. and T. Johnstone (2018). “In an Absolute State: Elevated Use
of Absolutist Words Is a Marker Specific to Anxiety, Depression, and Sui-
cidal Ideation”. In: Clinical Psychological Science 0.0, p. 2167702617747074.
doi: 10.1177/2167702617747074. eprint: https://doi.org/10.1177/
2167702617747074. url: https://doi.org/10.1177/2167702617747074.
Alexa Top 500 Global Sites. https://www.alexa.com/topsites. (Accessed
on 06/09/2018).
Aramaki, E., S. Maskawa, and M. Morita (2011). “Twitter Catches the Flu:
Detecting Influenza Epidemics Using Twitter”. In: Proceedings of the
Conference on Empirical Methods in Natural Language Processing. EMNLP
’11. Edinburgh, United Kingdom: Association for Computational Linguis-
tics, pp. 1568–1576. isbn: 978-1-937284-11-4. url: http://dl.acm.org/
citation.cfm?id=2145432.2145600.
Association, A. P. et al. (2013). Diagnostic and statistical manual of mental
disorders (DSM-5®). American Psychiatric Pub.
Bagroy, S., P. Kumaraguru, and M. De Choudhury (2017). “A social media
based index of mental well-being in college campuses”. In: Proceedings
of the 2017 CHI Conference on Human Factors in Computing Systems.
ACM, pp. 1634–1646.
Bebis, G. and M. Georgiopoulos (1994). “Feed-forward neural networks”. In:
IEEE Potentials 13.4, pp. 27–31.
Beck, A. T., R. A. Steer, and G. K. Brown (1996). “Beck depression inventory-
II”. In: San Antonio 78.2, pp. 490–8.
Belmaker, R. and G. Agam (2008). “Major depressive disorder”. In: New Eng-
land Journal of Medicine 358.1, pp. 55–68.
Boydstun, A. et al. (2013). “Examining debate effects in real time: A report of
the 2012 React Labs: Educate study”. In: The Political Communication
Report 23.1.
Coppersmith, G., M. Dredze, and C. Harman (2014). “Quantifying mental
health signals in twitter”. In: Proceedings of the Workshop on Compu-
tational Linguistics and Clinical Psychology: From Linguistic Signal to
Clinical Reality, pp. 51–60.
Coppersmith, G. et al. (2015). “CLPsych 2015 shared task: Depression and
PTSD on Twitter”. In: Proceedings of the 2nd Workshop on Compu-
tational Linguistics and Clinical Psychology: From Linguistic Signal to
Clinical Reality, pp. 31–39.
Coppersmith, G. et al. (2016). “Exploratory analysis of social media prior to
a suicide attempt”. In: Proceedings of the Third Workshop on Computa-
tional Linguistics and Clinical Psychology, pp. 106–117.
Coppersmith, G. et al. (2017). “Scalable mental health analysis in the clinical
whitespace via natural language processing”. In: Biomedical & Health In-
formatics (BHI), 2017 IEEE EMBS International Conference on. IEEE,
pp. 393–396.
De Choudhury, M. et al. (2013). “Predicting depression via social media.” In:
ICWSM 13, pp. 1–10.
Eaton, W. W. et al. (2004). “Center for Epidemiologic Studies Depression
Scale: review and revision (CESD and CESD-R).” In:
Fava, M. et al. (2003). “Background and rationale for the sequenced treat-
ment alternatives to relieve depression (STAR D) study”. In: Psychiatric
Clinics of North America 26.2, pp. 457–494.
Freud, S. (1901). The Psychopathology of Everyday Life. Digireads.com. isbn:
1420924915. url: http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/1420924915.
Goodwin, F. K. and K. R. Jamison (2007). Manic-depressive illness: bipolar
disorders and recurrent depression. Vol. 1. Oxford University Press.
Gottschalk, L. A. and R. Bechtel (1993). “Computerized content analysis of
natural language or verbal texts”. In: Palo Alto.
Gottschalk, L. A. et al. (1970). “Prediction of changes in severity of the
schizophrenic syndrome with discontinuation and administration of phe-
nothiazines in chronic schizophrenic patients: Language as a predictor and
measure of change in schizophrenia”. In: Comprehensive Psychiatry 11.2,
pp. 123 –140. issn: 0010-440X. doi: https://doi.org/10.1016/0010-
440X(70)90154-9. url: http://www.sciencedirect.com/science/
article/pii/0010440X70901549.
Gottschalk, L. A. and G. C. Gleser (1969). The measurement of psychological
states through the content analysis of verbal behavior. Univ of California
Press.
Guntuku, S. C. et al. (2017). “Detecting depression and mental illness on social
media: an integrative review”. In: Current Opinion in Behavioral Sciences
18, pp. 43–49.
Hamilton, M. (1986). “The Hamilton rating scale for depression”. In: Assess-
ment of depression. Springer, pp. 143–152.
Ho, T. K. (2002). “A data complexity analysis of comparative advantages of
decision forest constructors”. In: Pattern Analysis & Applications 5.2,
pp. 102–112.
Joachims, T. (1998). “Text categorization with support vector machines: Learn-
ing with many relevant features”. In: European conference on machine
learning. Springer, pp. 137–142.
Kahn, J. H. et al. (2007). “Measuring Emotional Expression with the Linguistic
Inquiry and Word Count”. In: The American Journal of Psychology 120.2,
pp. 263–286. issn: 00029556. url: http://www.jstor.org/stable/
20445398.
Kemp, S. (2016). Digital in 2016 - We Are Social UK. url: https://wearesocial.
com/uk/special-reports/digital-in-2016.
Kim, G. et al. (2013). “National Evidence-based Collaborating Agency (NECA)
Round-table Conference Consensus Statement: multidisciplinary responses
to suicide, the first ranked cause of death in adolescents.” In: Journal of
the Korean Medical Association, Taehan Uisa Hyophoe Chi 56.2.
Klein, D. N. and S. R. Black (2013). “Persistent depressive disorder”. In:
Psychopathology: History, Diagnosis, and Empirical Foundations 334.
Kroenke, K. and R. L. Spitzer (2002). “The PHQ-9: a new depression diag-
nostic and severity measure”. In: Psychiatric annals 32.9, pp. 509–515.
Lee, S. W. et al. (2016). “Insights from an expressive writing intervention
on Facebook to help alleviate depressive symptoms”. In: Computers in
Human Behavior 62, pp. 613–619.
Liaw, A., M. Wiener, et al. (2002). “Classification and regression by random-
Forest”. In: R news 2.3, pp. 18–22.
Loveys, K. et al. (2017). “Small but Mighty: Affective Micropatterns for Quan-
tifying Mental Health from Social Media Language”. In: Proceedings of the
Fourth Workshop on Computational Linguistics and Clinical Psychology—
From Linguistic Signal to Clinical Reality, pp. 85–95.
Mann, J. J. et al. (2005). “Suicide prevention strategies: a systematic review”.
In: Jama 294.16, pp. 2064–2074.
Marcus, M. et al. (2012). “Depression: A global public health concern”. In:
Mitchell, T. (1997). Machine Learning. McGraw-Hill International Editions.
McGraw-Hill. isbn: 9780071154673. url: https://books.google.co.
kr/books?id=EoYBngEACAAJ.
Moreno, M. A. et al. (2011). “Feeling bad on Facebook: Depression disclo-
sures by college students on a social networking site”. In: Depression and
anxiety 28.6, pp. 447–455.
Mowery, D. et al. (2017). “Understanding depressive symptoms and psychoso-
cial stressors on Twitter: a corpus-based study”. In: Journal of medical
Internet research 19.2.
Na, K.-S. et al. (2015). “Psychological autopsy: review and considerations
for future directions in Korea”. In: Journal of Korean Neuropsychiatric
Association 54.1, pp. 40–48.
Nadeem, M. (2016). “Identifying Depression on Twitter”. In: CoRR abs/1607.07384.
arXiv: 1607.07384. url: http://arxiv.org/abs/1607.07384.
Nelson, J. C. and J. M. Davis (1997). “DST studies in psychotic depression:
a meta-analysis”. In: American Journal of Psychiatry 154.11, pp. 1497–
1503.
Noh, J.-H. 학생 스마트폰 ’SNS 자살징후’ 부모에게 알린다. Ed. by Y. News.
url: http://www.yonhapnews.co.kr/bulletin/2015/03/12/0200000000AKR20150312185600004.
HTML.
O’Dea, B. et al. (2015). “Detecting suicidality on Twitter”. In: Internet Inter-
ventions 2.2, pp. 183–188.
OECD (2016). OECD Factbook 2015-2016, p. 228. doi: https://doi.org/
http://dx.doi.org/10.1787/factbook-2015-en. url: https://www.
oecd-ilibrary.org/content/publication/factbook-2015-en.
Pajer, K. et al. (2012). “Discovery of blood transcriptomic markers for depres-
sion in animal models and pilot validation in subjects with early-onset
major depression”. In: Translational psychiatry 2.4, e101.
Park, J. et al. (2011). “Ceo’s apology in twitter: A case study of the fake
beef labeling incident by e-mart”. In: International Conference on Social
Informatics. Springer, pp. 300–303.
Park, M., D. W. McDonald, and M. Cha (2013). “Perception Differences be-
tween the Depressed and Non-Depressed Users in Twitter.” In: ICWSM
9, pp. 217–226.
Park, S. et al. (2013). “Activities on Facebook reveal the depressive state of
users”. In: Journal of medical Internet research 15.10.
Park, S. et al. (2015). “Manifestation of depression and loneliness on social
networks: a case study of young adults on Facebook”. In: Proceedings
of the 18th ACM conference on computer supported cooperative work &
social computing. ACM, pp. 557–570.
Pedersen, T. (2015). “Screening twitter users for depression and ptsd with lex-
ical decision lists”. In: Proceedings of the 2nd workshop on computational
linguistics and clinical psychology: from linguistic signal to clinical reality,
pp. 46–53.
Peng, C.-Y. J., K. L. Lee, and G. M. Ingersoll (2002). “An introduction to
logistic regression analysis and reporting”. In: The journal of educational
research 96.1, pp. 3–14.
Qntfy. https://www.qntfy.com/. (Accessed on 06/01/2018).
Resnik, P., A. Garron, and R. Resnik (2013). “Using topic modeling to im-
prove prediction of neuroticism and depression in college students”. In:
Proceedings of the 2013 conference on empirical methods in natural lan-
guage processing, pp. 1348–1353.
Rude, S., E.-M. Gortner, and J. Pennebaker (2004). “Language use of de-
pressed and depression-vulnerable college students”. In: Cognition and
Emotion 18.8, pp. 1121–1133. doi: 10.1080/02699930441000030. eprint:
https://doi.org/10.1080/02699930441000030. url: https://doi.
org/10.1080/02699930441000030.
Saeed, S. A. and T. J. Bruce (1998). “Seasonal affective disorders.” In: Amer-
ican family physician 57.6, pp. 1340–6.
Shimodaira, H. (2014). “Text classification using naive bayes”. In: Learning
and Data Note 7, pp. 1–9.
Stirman, S. W. and J. W. Pennebaker (2001). “Word use in the poetry of sui-
cidal and nonsuicidal poets”. In: Psychosomatic medicine 63.4, pp. 517–
522.
Stone, P. J. and E. B. Hunt (1963). “A Computer Approach to Content
Analysis: Studies Using the General Inquirer System”. In: Proceedings
of the May 21-23, 1963, Spring Joint Computer Conference. AFIPS ’63
(Spring). Detroit, Michigan: ACM, pp. 241–256. doi: 10.1145/1461551.
1461583. url: http://doi.acm.org/10.1145/1461551.1461583.
taspinar/twitterscraper: Scrape Twitter for Tweets. https://github.com/
taspinar/twitterscraper. (Accessed on 06/10/2018).
Tausczik, Y. R. and J. W. Pennebaker (2010). “The Psychological Meaning
of Words: LIWC and Computerized Text Analysis Methods”. In: Jour-
nal of Language and Social Psychology 29.1, pp. 24–54. doi: 10.1177/
0261927X09351676. eprint: https://doi.org/10.1177/0261927X09351676.
url: https://doi.org/10.1177/0261927X09351676.
Tsugawa, S. et al. (2013). “On estimating depressive tendencies of Twitter
users utilizing their tweet data”. In: Virtual Reality (VR), 2013 IEEE.
IEEE, pp. 1–4.
Tsugawa, S. et al. (2015). “Recognizing depression from twitter activity”. In:
Proceedings of the 33rd Annual ACM Conference on Human Factors in
Computing Systems. ACM, pp. 3187–3196.
Weintraub, W. (1989). Verbal Behavior in Everyday Life. Springer Publishing
Company, Incorporated. isbn: 9780826157904. url: https://books.
google.co.kr/books?id=E1F9AAAAMAAJ.
Werth, J. L. (2004). “The relationships among clinical depression, suicide, and
other actions that may hasten death”. In: Behavioral sciences & the law
22.5, pp. 627–649.
Won, H.-H. et al. (2013). “Predicting national suicide numbers with social
media data”. In: PloS one 8.4, e61809.
Woo, H. et al. (2015). “Public Trauma after the Sewol Ferry Disaster: the
role of social media in understanding the public mood”. In: International
journal of environmental research and public health 12.9, pp. 10974–10983.
Zung, W. W. (1965). “A self-rating depression scale”. In: Archives of general
psychiatry 12.1, pp. 63–70.
은전한닢 프로젝트: 은전한닢 프로젝트를 소개합니다. http://eunjeon.blogspot.com/2013/02/blog-post.html. (Accessed on 06/1/2018).
초록

한국어 트위터 데이터를 활용한 우울증 표현 인식

근래 자살률에 있어 OECD 국가들 중 최상위권에 있으면서도, 한국에서 우울증과 같은 정신건강에 대한 진단과 치료는 과거와 마찬가지로 여전히 금기시되는 경향성이 있다. 영어권 국가들에서는 소셜 미디어 텍스트를 이용해 정신건강의 이상 징후를 찾는 연구가 크게 증가하고 있고, 최근에는 한국 교육부도 자체적으로 소셜 미디어 텍스트 검사 앱을 미성년자 대상으로 발표했다. 따라서 한국어 소셜 미디어 텍스트로부터 정신건강 이상 징후를 효과적으로 분류하는 연구는 현재 매우 시의적절한 상황이다. 현재까지 소셜 미디어 데이터를 활용한 다수의 기존 연구들은 심리학적 텍스트 분석 프로그램(LIWC) 또는 설문지와 같이 사전 구축된 어휘자료를 사용해왔고, 특정 분야의 지식과 설문조사를 요구하지 않는 자동 감지 방법에 대한 연구는 상대적으로 적었다. 더욱이 영어 이외의 언어를 대상으로 한 연구는 매우 드물고 한국어에 대해서는 연구가 전무한 상황이다. 본 연구는 한국의 우울증과 자살이 공중 보건 문제에 대해 갖는 중요성을 감안해 이와 같은 부족함을 채우고자 이루어졌다. 본 연구는 어떤 게시된 트윗으로부터 그것을 작성한 사용자가 우울증을 앓고 있는지를 예측하고자 다양한 기계 학습 분류기를 사용하였다. 이를 위해 먼저 우울증을 진단받았다고 주장하는 트윗을 올린 사용자들을 찾은 후에, 한국어 모국어 화자들이 직접 그 트윗 게시물을 토대로 우울증 진단 여부를 판단하였다. 그리고 우울증을 앓고 있는 것으로 판단된 사용자로부터 최대 3,200개까지의 트윗을 수집했으며, 같은 활동 시기의 정상적 사용자들 중 같은 수의 사용자들을 임의로 선택하여 그 트윗들을 통제집단으로 수집하였다. 두 개의 다른 토크나이저와 다수의 기계 학습 분류기를 사용했고, 토크나이저와 분류기의 각 조합에 따라 10-폴드 교차 검증법을 이용하여 평균 정밀도와 F1 스코어를 기록했다. 그 결과, 모든 조합에서 우연보다 훨씬 높은 정확도로 우울증 경향성을 보이는 사용자들을 감지하였다. 그러므로 본 연구는 소셜 미디어 자료를 사용하여 정신건강 문제를 자동 탐지하는 방법이, 기존의 심리학적 텍스트 분석 프로그램(LIWC)이나 비용과 시간이 드는 설문조사에 비해 최소한 그 성능이 같거나 더 낫다는 점을 확인하였다는 의미를 갖는다.

주요어: 기계학습, 정신 건강, 소셜 미디어, 트위터, 우울증
학번: 2015-22104