Disclaimer

저작자표시-비영리-변경금지 2.0 대한민국 (Attribution-NonCommercial-NoDerivs 2.0 Korea)

You are free to copy, distribute, transmit, display, perform, and broadcast this work, provided you follow these conditions:

Attribution: you must credit the original author.
NonCommercial: you may not use this work for commercial purposes.
NoDerivs: you may not alter, transform, or build upon this work.

When reusing or distributing this work, you must make clear the license terms that apply to it. These conditions may be waived with permission from the copyright holder. Your rights under copyright law are not affected by the above. This is an easy-to-understand summary of the Legal Code.
A Thesis for the Degree of Master of Arts
Detecting Language from Depressed Users
with Korean Twitter Data
한국어 트위터 데이터를 활용한 우울증 표현 인식
August 2018
Graduate School, Seoul National University
Department of Linguistics, Linguistics Major
Julius Jacobson
Abstract
Detecting Language from Depressed Users
with Korean Twitter Data
Julius Jacobson
Department of Linguistics
The Graduate School
Seoul National University
Although South Korea leads the OECD in suicides, both the diagnosis and treatment of mental health conditions such as depression remain taboo there. With research utilizing English social media text to find signals of mental health conditions becoming ever more abundant, and with South Korea's Ministry of Education releasing its own social media text scanning app in order to identify minors at risk, exploration into effective methods of classifying Korean social media text on the basis of underlying mental health conditions is perhaps more relevant than ever.
Most studies to date leveraging social media to detect signals tied to mental health conditions have utilized pre-generated dictionaries such as LIWC or survey data. While there has been some research into automatic detection methods requiring little or no domain knowledge and no survey data, such studies are rare outside of English and, to our knowledge, no such study has yet been done in Korean. Given the unique relevance of depression and suicide as public health concerns in South Korea, this thesis hopes to be a start to filling this void.
This paper employs various machine learning classifiers to predict whether a tweet was posted by a depressed user. We first searched for users whose tweets stated that they had been diagnosed with depression, and Korean native speakers then determined whether each such statement was a genuine claim of a diagnosis. Up to 3,200 tweets were scraped for each verified user. A set of tweets from an equal number of random Twitter users who had posted over the same time period was then collected as a control group. Using two different tokenizers and an array of machine learning classifiers, the average precision and F1 scores over 10-fold cross-validation were recorded for all combinations of tokenizer and classifier. All combinations were found to be able to detect whether a tweet came from a depressed user with an accuracy well above chance. This study, therefore, suggests that detection of mental health issues using social media data may be a viable approach for further study and treatment of mental illness, on par with or better than previous methods relying upon pre-generated dictionaries such as LIWC or upon expensive and time-consuming survey data.
Keywords: Machine Learning, Mental Health, Social Media, Depression,
Student Number: 2015-22104
Acknowledgements
First and foremost I would like to thank my advisor, Professor Shin Hy-
opil, for his patience and guidance over the course of the long and challenging
process of completing my degree. I would also like to thank Professor Nam
Seung Ho and Dr. Kim Munhyoung for their advice over the course of the
writing of this thesis.
Secondly, I would like to thank my mother for her tireless encouragement
and support while I was writing this paper on the other side of the world.
And lastly, Derek Hommel and Timour Igamberdiev were constant com-
panions and comrades on this research journey, without whom whiling away
the hours necessary to complete this project would have been, if not impossi-
ble, at least far less enjoyable. Thank you, my friends.
Contents
1 Introduction
1.1 Text Analysis in Psychology
1.2 Mental Health
1.3 Social Media and Mental Health
1.4 Research Goals
1.5 Research Outline

2 Literature Review
2.1 Types of Depression
2.2 Diagnostic Methods
2.3 Establishing the Language of Depression
2.4 Studies Utilizing Non-Korean Social Media Data
2.5 Studies Utilizing Korean Social Media Data
2.6 Qntfy Studies

3 Corpus Data
3.1 The Twitter API and User Selection
3.2 Tweet Extraction
3.3 Tokenization
3.4 Caveats

4 Classification Methods
4.1 Definitions
4.2 Training, Testing, and Cross Validation
4.3 Naive Bayes
4.4 Logistic Regression
4.5 Linear Support Vector Machines
4.6 Random Forest
4.7 Feedforward Neural Network

5 Experiment
5.1 Methodology
5.2 Explanation of Metrics
5.2.1 Accuracy, Precision and Recall
5.2.2 The F1 Score
5.2.3 ROC Curves
5.3 Classifier Results with Space Tokenization
5.4 Classifier Results with Mecab Tokenization
5.5 Linear SVM Top Features
5.6 Precision-Recall Graphs
5.6.1 Precision-Recall Curves and Hard to Classify Tweets
5.7 Discussion

6 Conclusion
List of Figures
3.1 Tweet Object
3.2 Twitterscraper Tweet Object
5.1 ROC for Space Tokenization
5.2 ROC for Mecab Tokenization
5.3 Linear SVM Top Features with Space Tokenization
5.4 Linear SVM Top Features with Mecab Tokenization
5.5 Precision-Recall for Space Tokenization
5.6 Precision-Recall for Mecab Tokenization
List of Tables
3.1 Diagnostic Tweets
3.2 Dataset Composition
3.3 Tokenized Tweets
5.1 F1 Scores
1 Introduction
1.1 Text Analysis in Psychology
In 2010, Yla R. Tausczik and James W. Pennebaker published The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. The abstract of the paper claimed that "We are in the midst of a technological revolution whereby, for the first time, researchers can link daily word use to a broad array of real-world behaviors." Nearly a decade later, the revolution that Tausczik and Pennebaker identified is still in full swing and more relevant than ever. In fact, one could argue that, with the rise of machine learning and the widespread adoption of social media platforms, the full potential of this revolution is only now beginning to be understood.
But the history of text analysis in the field of psychology significantly predates not only social media and machine learning, but the internet and computers entirely. This fact is evident within the lexicon of the English language: the common term "Freudian slip," denoting a linguistic error that unintentionally reveals a hidden motive on the part of the speaker or writer, has its origin in Freud's 1901 book The Psychopathology of Everyday Life. Decades after its publication, in the 1950s, researchers developed the Gottschalk method, which consisted of tracking Freudian themes in texts through content analysis (Gottschalk et al., 1970).
It was not until the 1960s, however, that the first general-purpose computerized text analysis program in psychology, The General Inquirer, was produced (Stone and Hunt, 1963). It operated according to a series of algorithms developed by its authors. While it proved useful in detecting mental disorders and personality dimensions, it relied on weighted variables that were not observable to the user (Stone and Hunt, 1963).
In the 1980s, Walter Weintraub discovered that the usage of first-person singular pronouns could be linked to depression, a simple but profound insight that foreshadowed the kind of impact that future psychological software could have on the detection of mental health conditions (Weintraub, 1989). In fact, this finding is utilized by psychological text analysis software to this day, most notably by the Linguistic Inquiry and Word Count (LIWC) program, developed by Martha Francis and James W. Pennebaker in the mid-1990s (Tausczik and Pennebaker, 2010). The goal of LIWC is simple: to count words in psychologically relevant categories over multiple text files.
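To make the word-counting idea concrete, below is a minimal sketch of a LIWC-style category count. The category word lists are invented for illustration and are tiny compared to LIWC's real, proprietary dictionaries; this shows the general technique, not LIWC itself.

```python
# A minimal sketch of dictionary-based category counting in the style of LIWC.
# The word lists below are invented for illustration only.
from collections import Counter
import re

CATEGORIES = {
    "first_person_singular": {"i", "me", "my", "mine", "myself"},
    "negative_emotion": {"sad", "hate", "hurt", "alone", "tired"},
}

def category_percentages(text):
    """Return the share of tokens in `text` falling into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in CATEGORIES.items():
            if token in words:
                counts[category] += 1
    total = len(tokens) or 1  # LIWC reports category counts as percentages
    return {cat: 100.0 * counts[cat] / total for cat in CATEGORIES}

print(category_percentages("I hate how tired and alone I feel."))
```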
In recent years, there have been attempts to utilize machine learning methods in place of explicitly programmed software. As machine learning uses statistical methods to allow a model to "learn" from data without being told beforehand which linguistic features are relevant, it does not necessarily require the input of domain experts, as software such as LIWC does. With the rise of social media, these computational techniques can be experimented with on large datasets. Shortly, we will discuss some of these experiments before exploring the results of an approach that has previously been untested in the Korean language.
1.2 Mental Health
South Korea leads the OECD in suicides while also containing a very large population of social media users (Kemp, 2016; OECD, 2016). South Korea's Ministry of Education has already released an app that alerts the parents of students whose instant messages or web searches indicate suicidal ideation (학생 스마트폰 'SNS 자살징후' 부모에게 알린다, "Students' smartphones: parents to be alerted to 'SNS suicide signs'"), and the link between depression and suicide has long been studied and is well understood (Werth, 2004). An application that could detect depressed social media users may therefore prove effective in creating greater opportunities for early detection and treatment of mental health conditions and the prevention of tragic outcomes such as suicide.
Furthermore, according to a report put out by the Korean National Evidence-Based Health Care Collaborating Agency, 5.6 percent of Koreans, approximately 2 million individuals, have suffered from depression at least once (Kim et al., 2013). Yet, despite the fact that over 90 percent of suicide victims in Korea suffer from a diagnosable psychiatric illness such as depression, very few visit psychiatric clinics, leaving a potentially life-saving diagnosis out of reach (Na et al., 2015). An automatic classification tool that could, in the privacy of one's own or of a loved one's electronic device, provide some indication of potential mental health issues would likely be a helpful first step in preventing negative outcomes in a culture that struggles to deal with mental health issues amidst the taboos surrounding diagnosis. If used in tandem with alert systems and accompanied by information about means of treatment, those suffering from depression or other similarly debilitating mental health conditions may stand to garner a significant benefit.
1.3 Social Media and Mental Health
As discussed earlier, text analysis has long played a role in the detection
of psychological phenomena such as personality dimensions and mental health
conditions. It is reasonable to assume, then, that social media would play a
prominent role in modern applications of computational text analysis in the
field of psychology. After all, social media provides a gigantic corpus of data
from which to draw insights. In addition, such data represents different re-
gions, economic climates, and even languages. Indeed, social media has been
utilized to gain insights in fields ranging from political science to public health
(Boydstun et al., 2013; Aramaki, Maskawa, and Morita, 2011). There are,
however, unique challenges in evaluating psychological phenomena. Perhaps
most significantly, diagnosis relies upon a patient’s self-reported experiences.
This means obtaining objective metrics for factors correlating with or causing
depression may be confounded by subjective intentions, such as a desire to
obtain anti-depressant medication, or alternatively to mask one’s condition
from one’s self or from others.
Furthermore, according to the World Health Organization, most depressed
individuals do not receive a diagnosis for their condition as a result of not seek-
ing treatment (Marcus et al., 2012). This suggests that, despite the challenges
posed by mental health conditions, a convenient and practical form of detec-
tion may prove to be of great benefit to those suffering from various forms of
mental illness. This is particularly true in the case of depression, where the
rate of successful treatment is high. According to one study, nearly 8 out of 10
patients showed a significant improvement in symptoms of major depressive disorder within 4 to 6 weeks (Mann et al., 2005). In addition, according to a study funded by the National Institute of Mental Health, patients reported an average 65 percent reduction in the symptoms of major depressive disorder despite not responding to an initial antidepressant prescription. This indicates that there are various effective methods for the treatment of depression that may prove helpful even when medication proves ineffective in a given case (Fava et al., 2003). This, combined with the low treatment rate, suggests
that an efficient and practical way to detect depression in particular stands to
benefit both many suffering individuals and society as a whole.
1.4 Research Goals
The goals of our investigation are:
(1) To demonstrate the effectiveness, for the Korean language, of the data collection method utilized in numerous studies by the U.S.-based startup Qntfy.
(2) To demonstrate the ability of this method to capture signal-containing
data even when datasets are relatively small.
(3) To gauge the effectiveness of an array of popular machine learning algo-
rithms when used in conjunction with this method.
(4) To briefly explore features distinguishing depressed users from non-depressed
users.
In accomplishing the above, our research aims to demonstrate the ef-
fectiveness of machine learning classifiers to distinguish depressed users from
non-depressed users. In addition, it hopes to indicate the potential for a large-
scale analysis of mental conditions such as depression using Korean social
media text. Furthermore, it suggests a new avenue of exploration for a more
effective and timely method of diagnosis that preserves anonymity, which may
prove invaluable if brought to fruition. The costs of depression are high, on the
level of both society and the individual, and methods that can reach a large
number of users with a minimal investment of resources are worth noting and
exploring.
By demonstrating that automatic data collection and machine learning
classification methods can achieve results surpassing those achieved through
curated lexical data, we indicate that this combination may be more effective
than methods utilizing domain knowledge or curated data. Such a finding is likely to have a significant impact on how resources are allocated in future efforts to tackle mental health epidemics such as suicide in Korea.
Lastly, as was discussed in Section 1.2, many residents of Korea are ret-
icent when it comes to seeking help for mental health challenges. By demon-
strating the effectiveness of methods that are anonymous, impersonal, and scalable, it is our hope that a foundation is laid for future investigations into mental health diagnosis that address what appears to be the fundamental obstacle to obtaining a variety of effective treatments: getting an individual diagnosed in the first place.
1.5 Research Outline
The thesis is structured as follows: Chapter 2 provides a literature review
of sentiment analysis and automated text analysis methods that leverage social
media data, as well as of previous studies that utilized computational methods
in conjunction with data that was acquired through crowdsourcing. Chapter 3
discusses the method used to acquire the data used in the experiment, as well
as an overview of the data itself. Chapter 4 provides an overview of the var-
ious machine learning classifiers used in our classification task. The first half
of Chapter 5 reports the findings of a preliminary study that utilized a naive
tokenization method and little optimization of the neural network model. The
second half of Chapter 5 discusses the results obtained after utilizing a more
refined tokenization method and the most significant features distinguishing
depressed users from non-depressed users. Chapter 6 concludes the thesis by discussing the caveats and limitations of our research and the avenues left open to be pursued in future work.
2 Literature Review
This chapter discusses the relevant literature covering past research that
utilized social media data, machine learning, or other methods of computa-
tional data analysis to evaluate the mental health of an individual or group
of individuals. In addition, certain relevant studies from the field of Korean
sentiment analysis are included. Studies done in both Korean and English are
discussed.
2.1 Types of Depression
While it is common parlance to use the term "depression" or to describe someone as "depressed" without an additional modifier, in reality there are multiple types of depression that are diagnosed in a clinical environment. In this section, we will briefly discuss the varieties of depression so that the diagnostic terms used in the studies discussed in later sections are clearly understood.
Major Depressive Disorder (MDD) is characterized by a depressed mood,
or lack of interest in activities that were once found pleasurable, for most
days of the week over a period of two weeks or more. Other symptoms in-
clude fluctuations in weight, disruptions in sleep patterns, consistent feelings
of sluggishness or agitation, feelings of guilt, fatigue, trouble concentrating,
and suicidal thoughts (Belmaker and Agam, 2008).
Persistent Depressive Disorder is diagnosed when a patient suffers from
depression for a period of 2 years or more. As suggested by its name, symp-
toms are less intense than those characterizing MDD but chronic over time
(Klein and Black, 2013).
Bipolar disorder, also called manic depression, is diagnosed when a patient suffers from consistent cycles of mood episodes that consist of extreme "highs" and "lows". When in the "low" portion of a mood cycle, a patient may experience symptoms characteristic of MDD (Goodwin and Jamison, 2007).
Seasonal Affective Disorder (SAD) is a disorder that arises in certain individuals during periods of the year when the days are shorter and the hours of available sunlight are decreased, i.e., the fall and winter seasons (Saeed and Bruce, 1998).
Individuals suffering from Psychotic Depression generally experience the same symptoms as those diagnosed with MDD; however, as the name of the condition implies, they suffer from symptoms of psychosis as well (Nelson and Davis, 1997). These can include hallucinations, delusions, or paranoia.
While the above list is by no means comprehensive, it is sufficient for our
discussion below.
2.2 Diagnostic Methods
While doctors may administer a variety of physical examinations to rule
out other potential diagnoses, depression has no reliable medical (i.e. non-
psychological) means of detection. In 2012, however, Pajer et al. found biological markers of early-onset Major Depressive Disorder (Pajer et al., 2012). A panel of 11 blood markers was found to be sufficient to differentiate participants with early-onset MDD from those with no diagnosis in a small study consisting of 28 participants between the ages of 15 and 19. Considerations of such studies and of the utility of using biological markers to rule out other diagnoses aside, however, questionnaires remain the primary means with which clinicians diagnose depression. In this section we will briefly discuss some of these diagnostic tools.
The five questionnaires most commonly cited in the studies used for this
thesis are:
(1) The Patient Health Questionnaire (PHQ-9)
(2) Beck Depression Inventory (BDI)
(3) Zung Self-Rating Depression Scale (SDS)
(4) Center for Epidemiologic Studies Depression Scale (CES-D)
(5) Hamilton Rating Scale for Depression (HRSD)
The Patient Health Questionnaire is based on the diagnostic criteria of the
Diagnostic and Statistical Manual of Mental Disorders (DSM), a manual pub-
lished by the American Psychiatric Association, and consists of 9 questions.
Developed in 1999 at Columbia University, it ranks patients’ levels of depres-
sion according to five categories: none or minimal, mild, moderate, moderately
severe, and severe. According to the 5th edition of the DSM, if 5 or more of the 9 symptoms indicated by the questions of the PHQ-9 have persisted for two weeks or more, depression is a likely diagnosis, provided the symptoms are not better explained by substance abuse or another medical condition (Association, 2013). The questions consist of inquiries into a patient's interest in activities, energy levels, mood, sleeping and eating habits, ability to concentrate, ability to function, and whether or not a patient has entertained suicidal thoughts (Kroenke and Spitzer, 2002).
The Beck Depression Inventory (BDI), developed by American psychiatrist Aaron T. Beck at the University of Pennsylvania, consists of 21 questions; it was first published in 1961 and then revised in 1978, with the BDI-II following in 1996. Each response is assigned a point value between 0 and 3, and the response scores are summed to obtain a total score that indicates the severity of the patient's depression. For the 1996 BDI-II, all but three of the items were reworded to reflect the updated diagnostic criteria of the fourth edition of the DSM. A score of 0-13 indicates minimal or no depression, 14-19 indicates mild depression, 20-28 indicates moderate depression, and a score of 29 or above indicates severe depression (Beck, Steer, and Brown, 1996).
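The BDI-II severity bands above translate directly into a scoring function. The sketch below assumes the instrument's 0-63 total range (21 items, each scored 0 to 3) and simply encodes the cutoffs just described.

```python
# Encodes the BDI-II severity bands described above (Beck, Steer, and Brown, 1996).
def bdi_severity(total_score: int) -> str:
    if not 0 <= total_score <= 63:  # 21 items, each scored 0-3
        raise ValueError("BDI-II totals range from 0 to 63")
    if total_score <= 13:
        return "minimal or no depression"
    if total_score <= 19:
        return "mild depression"
    if total_score <= 28:
        return "moderate depression"
    return "severe depression"

print(bdi_severity(22))  # -> "moderate depression"
```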
The Zung Self-Rating Depression Scale (SDS) was published in 1965 by
Dr. William W.K. Zung, a psychiatrist at Duke University. It consists of 20
items that ask the respondent to rate the symptoms of depression on a scale
of frequency from 1 to 4, with 1 meaning "none or a little of the time" and 4 meaning "most or all of the time". Scores on the test range from 20 to 80, with higher scores
indicating more severe levels of depression. (Zung, 1965)
The Hamilton Rating Scale for Depression (HRSD) was published in 1960 by Max Hamilton while he was a senior lecturer of psychiatry at the University of Leeds. The patient is rated on a 3- to 5-point scale on anywhere from 17 to 29 items. There are multiple versions of the test, as it was revised in 1966, 1967, 1969, and 1980. A score of 0-7 is considered to be normal, while a score over 20 indicates a case of at least moderate depression (Hamilton, 1986).
Lastly, the Center for Epidemiologic Studies Depression Scale was developed in 1977 and consists of 20 questions. Each question asks the respondent to indicate how frequently they have experienced a given symptom over the past week. Scores range from 0 to 60, with scores closer to 0 indicating no or minimal depression, and scores nearer to 60 indicating more severe cases of depression (Eaton et al., 2004).
2.3 Establishing the Language of Depression
In this section we will briefly discuss relevant research that has indicated
or suggested a connection between depression and language.
In 1969, The Measurement of Psychological States through the Content Analysis of Verbal Behavior was published. In it, the authors showed how the lexical content features of recordable speech behavior could provide a probabilistic account of a variety of psychological states (Gottschalk and Gleser, 1969). In their research, they asked participants to speak for five minutes about any interesting personal life experiences in a stream-of-consciousness fashion. While the Gottschalk method, as it came to be known, was later used for the diagnosis of various cognitive impairments and mental disorders, it has proven difficult to adapt to a computer program (Gottschalk and Bechtel, 1993).
In 2001, Stirman and Pennebaker examined the word usage of both suicidal and non-suicidal poets. 300 poems were selected from the bodies of work of 18 poets, 9 classified as suicidal and 9 as non-suicidal; the suicidal poets were labeled as such because they did, in fact, commit suicide. Using the LIWC program, the authors found that the groups did not differ in their usage of words correlated with positive and negative emotion, but did differ when it came to pronoun usage. The suicidal group used more first-person singular words, and fewer words suggesting identification with a group or collective (Stirman and Pennebaker, 2001).
In 2004, Rude, Gortner and Pennebaker sought to replicate the findings of this and other studies suggesting that self-focus, along with its linguistic indicators such as personal pronouns, was a significant aspect of a depressed psychological state. They asked a sample of undergraduates to write for 20 minutes about "their deepest thoughts and feelings about coming to college". The sample was comprised of 31 depressed participants, 26 formerly depressed participants, and 67 never-depressed participants. Participants in the study were classified as depressed or not with the BDI diagnostic questionnaire. Essays were evaluated with the LIWC software, which compared files on a word-by-word basis to a dictionary consisting of 2,290 words and word stems organized into various linguistic and psychological categories. The study found that depressed respondents used more first-person singular words (such as I, me, my) than did never-depressed respondents. In addition, depressed participants were found to use a greater proportion of negative emotion words, and marginally fewer positive emotion words (Rude, Gortner, and Pennebaker, 2004).
Resnik et al. used Latent Dirichlet Allocation and features derived from the LIWC software to develop a linear regression model based on a collection of 6,459 stream-of-consciousness essays collected from college students between 1997 and 2008. Each essay consisted of approximately 780 words and was a response to a prompt to write about one's thoughts and feelings in the present moment. Each essay writer also provided data regarding their personality traits and state of mind. The experiment found that topic modeling using Latent Dirichlet Allocation added value to the predictions of clinical assessments of depression and neuroticism (Resnik, Garron, and Resnik, 2013).
A recent study further corroborated the findings of previous studies suggesting a relationship between singular personal pronoun usage and depression after conducting a text analysis of 63 internet forums comprising over 6,800 active members. In addition, it found that absolutist words such as "always", "entirely", or "totally" tracked the severity of an affective disorder across internet forums more reliably than negative emotion words did. In other words, while anxiety, depression, and suicidal ideation forums all exhibited greater usage of absolutist words, suicidal ideation forums exhibited greater usage of absolutist words than anxiety or depression forums, thus correlating the usage of absolutist terms with the severity of the condition under discussion (Al-Mosaiwi and Johnstone, 2018).
This finding that depressed individuals use more first-person singular and negative emotion words has been posited to be either an indication of greater self-focus as a response to pain, or alternatively a thinking pattern that is itself a causal factor in the emergence of depression (Tausczik and Pennebaker, 2010).
In summary, both words categorized by the LIWC software program as
negative emotion words and singular personal pronouns have been linked to
depression in previous text analysis studies, with the connection to singular
personal pronoun usage having been found to be particularly robust over mul-
tiple studies.
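As a concrete illustration of these two lexical signals, the sketch below computes per-token rates of first-person singular pronouns and absolutist words over a pre-tokenized text. The word lists are abbreviated stand-ins, not the sets used in the cited studies.

```python
# Per-token rates of two depression-linked word classes; word lists are
# illustrative stand-ins for those used in the cited studies.
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}
ABSOLUTIST = {"always", "never", "entirely", "totally", "completely"}

def lexical_signals(tokens):
    """Return the per-token rate of each word class in a token list."""
    total = len(tokens) or 1
    return {
        "first_person_rate": sum(t in FIRST_PERSON_SINGULAR for t in tokens) / total,
        "absolutist_rate": sum(t in ABSOLUTIST for t in tokens) / total,
    }

print(lexical_signals("i always feel like i am entirely alone".split()))
```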
2.4 Studies Utilizing Non-Korean Social Media Data
In this section we will discuss relevant studies that utilized the social
media data of users outside of Korea and whose main contribution was not
based upon pre-defined dictionaries or domain knowledge (such as studies whose findings were based on LIWC). Taken as a whole, these studies
demonstrate and exploit the potential of social media data as a source of pre-
dictive insight into depression. Unlike the studies that we will discuss in a
later section, however, they all rely on surveys and crowdsourcing to establish
a ground truth as a starting point for their analyses.
A study by Moreno et al. selected public Facebook profiles from second- and third-year undergraduates and evaluated their status updates. Disclosures of depression were modeled in association with demographic Facebook usage data using a binomial regression analysis. Two hundred profiles were evaluated, with 25 percent of profiles containing status updates that referenced depressive symptoms. The study concluded that those who received positive support from their friends were more likely to discuss their depressive symptoms publicly on Facebook (Moreno et al., 2011). While not utilizing linguistic data, this study demonstrated similar goals and intuitions regarding the potential of social media to combat stigma surrounding mental health concerns and more effectively identify those suffering from mental health challenges.
In perhaps one of the most important studies under consideration in this section, De Choudhury et al. found that social media contained valuable signals for detecting individuals likely to be suffering from depression (De Choudhury et al., 2013). Crowdsourcing was used to first create a list of Twitter users who reported being clinically diagnosed as depressed, verified through the CES-D diagnostic measure. The tweets posted by each of the 476 users for approximately a year before the onset of depression were collected. User metadata was also leveraged. Through a combination of egonetwork, lexical, and pattern-of-behavior data, features were constructed to train a support vector machine classifier, which achieved an accuracy of approximately 70 percent. In addition, the study found that individuals with depression showed a decrease in social activity, an increase in negative emotion, a greater focus on self, an increase in medical concerns, an increase in social concerns, and an increase in reports of religious activity. While many relevant studies were inspired by this research, one important difference between De Choudhury et al.'s research and later lines of work, including the experiment put forth in this paper, is the former study's reliance on crowdsourcing and surveys. The methods employed in this paper require little to no domain knowledge, no surveys or crowdsourcing, and focus only on linguistic data, which De Choudhury's study showed to be the most effective in distinguishing depressed users from non-depressed users.
Moving on to studies done using data in a non-English language, there
have been at least two published studies using Japanese language Twitter
data to detect depression. In 2013, Tsugawa et al. constructed a multiple regression model in order to determine the probability of a user suffering from depression based on the frequencies of the words they used. A survey was conducted of 50 Japanese participants using Zung's SDS. Following the survey, the tweets posted by respondents over the week prior to the administering of the survey were obtained through the Twitter API. As an aside, it is relevant to note that this one-week limitation is the result of utilizing the Twitter API to extract tweets, a limitation avoided in the present thesis through the use of alternative extraction methods. 14,757 words were obtained after excluding particles, auxiliary verbs, adnominal adjectives, and symbols through the use of the morphological analyzer MeCab. Furthermore, the frequencies of words used by a participant were normalized by the number of occurrences of all words in the participant's total tweet corpus. After running a multiple regression analysis, it was found that words with negative mood were positively correlated with higher scores on Zung's SDS. The correlation coefficient between the estimated and actual Zung SDS scores was found to be 0.45. The study concluded that word frequencies in tweet posts were useful in predicting to what degree a user was suffering from depression (Tsugawa et al., 2013).
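The normalization step described above amounts to dividing each word's count by the user's total token count. A minimal sketch, with MeCab's morphological analysis abstracted away as a pre-tokenized list:

```python
# Normalize a user's word counts by their total token count, as in the
# Tsugawa et al. study; tokenization (MeCab in the original) is assumed done.
from collections import Counter

def normalized_frequencies(user_tokens):
    counts = Counter(user_tokens)
    total = sum(counts.values()) or 1
    return {word: n / total for word, n in counts.items()}

tokens = ["피곤", "하다", "피곤", "우울", "하다", "하다"]
print(normalized_frequencies(tokens))  # {'피곤': 0.33..., '하다': 0.5, '우울': 0.16...}
```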
In another study, Tsugawa et al. built upon the research described above by using other features in addition to word frequencies
for their analysis, as well as by including a larger number of participants: 209
Japanese respondents compared to the previous study’s 50. In addition, this
study used the CES-D and BDI diagnostic tools as opposed to Zung’s SDS.
Features used apart from word frequencies were as follows: topics generated
by LDA, ratios of positive and negative affect words, hourly posting frequency,
daily posting frequency, average number of words per tweet, overall retweet
rate, overall mention rate, ratio of tweets containing a URL, number of users
following the user, and the number of users followed. Differences in these fea-
tures correlated with the presence or absence of depression were explored.
The study found that the most common posting times were consistent between depressed and non-depressed users, and that no significant difference could be found between depressed and non-depressed users' posting times, which contradicted the findings of De Choudhury et al. (De Choudhury et al., 2013). Significant differences were found, however, when analyzing the ratios of tweets containing positive and negative words, tweet URLs, post frequencies, and retweet rates. Again contrasting with De Choudhury et al., the rate at which a user was mentioned by other users and the numbers of both followers and followed users did not demonstrate a statistically significant association with depression. Tsugawa et al. posited that the differing findings of the two studies may be due to cultural differences between Japanese- and English-language users of Twitter, though it was granted that more research was needed to clarify this matter. An SVM classifier trained using the statistically significant non-bag-of-words features achieved an accuracy rating of 66 percent.
2.5 Studies Utilizing Korean Social Media Data
In this section we will discuss studies that leveraged Korean social media
data to either predict or generate insights into suicide or depression.
In 2013, there were three such studies of significance published. Won et al. compared the potential of suicide- and dysphoria-related weblog entries to predict the national number of suicides against that of social, economic, and meteorological variables over the period from 2008 to 2010. After applying a set of filtering operations to remove noise such as advertisements, all remaining weblogs posted to Naver Blogs between January 1st, 2008 and December 31st, 2010 were used to tabulate a "suicide weblog count" and a "dysphoria weblog count". The suicide weblog count was defined as the number of weblogs that included the Korean word for suicide, jasal, at least once. The dysphoria weblog count was defined as the number of weblogs containing the Korean word for stressed or fatigued, himdeulda, at least once. Significant variables were detected through individual univariate regression tests. Following this, a multivariate regression model was generated using the previously identified significant variables. The study concluded that the dysphoria weblog count was the stronger and more stable predictor of national suicide numbers. The suicide weblog count was observed to be more variable and susceptible to sudden and drastic spikes, especially in response to celebrity suicide events (Won et al., 2013).
Another study published in 2013 identified the potential of non-linguistic Facebook data for identifying depressed users of the platform. Park et al. attempted to use Korean Facebook data to identify symptoms in users suffering from depression. This study was unique amongst the studies discussed thus far in that it employed its own app, EmotionDiary, in order to engineer its own features based on behavior exhibited on the social media platform. 55 Facebook users were recruited through advertising on the campus of a large Korean university, and each user was administered the CES-D. Participants were classified as depressed if they obtained a score of 25 or higher. The study concluded that there was a statistically significant positive correlation between reading tips and facts about depression in the Facebook app and a score of 25 or higher on the CES-D. Furthermore, the number of friends and location tags associated with a subject of the study was negatively correlated with a score of 25 or higher. While the sample size was small and linguistic data was not used, this study was another significant attempt at leveraging social media data to identify possible symptoms of depression in users of a social media platform (Park et al., 2013).
Contrasting with other studies in our overview, Park et al. relied not only upon the social media data of users, but also upon survey results that indicated users' views regarding their usage of a social media platform. 14 volunteers, 7 depressed and 7 controls, were interviewed about their views regarding their usage of the Twitter social media platform. Interviewers collected the age, gender, education level, job title, Twitter ID, and depression diagnostic history of each interviewee. Each participant was also required to take the CES-D, with scores of 22 or above used to classify participants as depressed. While 253 volunteers completed this initial screening, after screening for participants that were both active on Twitter and willing to share their tweets, only 69 participants remained: 23 depressed and 46 not depressed. Of these, only 24 agreed to do a more in-depth interview. One of the depressed users failed to show up for the scheduled interview, leaving 14 interviews, with an even split between depressed and non-depressed participants. This drastic reduction in sample size serves to demonstrate the difficulty of relying upon volunteered survey data, which is one motivating factor for the pursuit of the automated methods found in the studies of Coppersmith et al. as well as in this work (Coppersmith, Dredze, and Harman, 2014; Coppersmith et al., 2015; Coppersmith et al., 2017). The interviews consisted of questions regarding matters related to experiences of depression and usage of the Twitter social media platform.
In addition to the interviews, 1,523,377 tweets were collected from 1,363 friends of individuals in the depressed group, and 1,649,761 tweets were collected from 1,756 friends of individuals in the control group. A random sample of 10,000 tweets was generated from this dataset, with 5,000 tweets being randomly selected from each class. After using LIWC to analyze the randomly sampled tweets, it was found that the analysis of the tweets consumed by the two groups corroborated the differing accounts provided by the two groups for the type of content they consumed on Twitter. Depressed users reported consuming more emotional content, which was reflected by higher average scores in multiple LIWC affective lexical categories. The average affectiveness of tweets from the depressed users' friend group was 4.67, while for the control friend group it was 3.42 (Park, McDonald, and Cha, 2013).
A study conducted by Park et al. in 2015 leveraged non-linguistic social
media data to gain insight into the online behaviors of depressed Facebook
users. 234 students with Facebook accounts that were at least a year old were
recruited at a large university. Of the 234 students, 212 completed an online
version of the CES-D that was administered through a Facebook app and
consented to providing access to their Facebook wall activities. Of these, 120
participants obtained a score of 20 or less, indicating a low probability of being
depressed. The remaining participants obtained a score between 21 and 60, indicating a strong probability of being depressed. An attempt to validate the efficacy of this means of diagnosing participants was made via comparisons to the results of alternative measures. The first measure was the BDI, and the second was administered through a face-to-face interview, which utilized
another diagnostic tool, the Hamilton Depression Rating Scale (HAM-D). It
is worth noting that despite sending out personal invitations and offering free
counseling to users identified as depressed through the CES-D, very few of the
participants responded and ended up being interviewed: only 6 participants
out of the 42 that obtained a score over 20 on the CES-D. Despite this, user
metadata was collected, and the correlation of various aspects of user inter-
action with the Facebook platform to CES-D scores was measured through
the Pearson correlation coefficient and a linear regression analysis. The study
found that depressed users used fewer geolocation tags, indicating that they
posted from a less varied group of locations, and that they had fewer Face-
book friends. In addition, they made an equal number of wall posts, liked more posts, and viewed more tips about depression through the same proprietary app through which the CES-D was administered. Depressed users received significantly fewer likes on their wall posts, and likewise significantly fewer comments, than non-depressed users. The rate of comment posting was approx-
imately the same between the two groups. Some limitations of the study that
were identified by the authors included a lack of analysis of linguistic data and
a largely homogeneous sample. In a round of follow-up surveys, a majority of
participants indicated that they found the app administered in the study ben-
eficial and educational. Furthermore, the authors concluded that the similar
level of prevalence of depression between the studied demographics and the
nation-wide sample gathered by various offline studies indicated that online
screening could be a viable means of connecting with depressed individuals,
at least in the demographics that more commonly use social media. (Park
et al., 2015). Also utilizing the proprietary Facebook app EmotionDiary, Lee et al. found that an expressive writing activity administered through the app was effective at treating symptoms of depression (Lee et al., 2016). This finding, combined with the potential of online diagnostic methods, suggests that the recurring theme, present in many of the aforementioned studies, of participants being unwilling or unlikely to seek a diagnosis of depression, even when symptomatic, may no longer prove to be the obstacle to treatment it has been, provided that methods such as those suggested in this paper and in the various studies discussed in this section continue to demonstrate their value.
2.6 Qntfy Studies
Qntfy is a U.S.-based company that describes itself as "a technology solutions provider bridging data science and human behavior" (Qntfy). They
specialize in leveraging psychological and behavioral data so as to support the
operations of for-profit, not-for-profit, and government organizations. In addi-
tion to its custom data analysis work, Qntfy also publishes original research
in peer-reviewed publications. Many of these studies are the basis for the ap-
proach taken in this thesis, and so we will provide an overview of this work
here.
In 2014, Coppersmith et al. found that linguistic signals related to various mental health disorders were present in Twitter data and could be leveraged through a simple unigram or character 5-gram language model (Coppersmith, Dredze, and Harman, 2014). First, Twitter users were identified as suffering from a mental health condition by searching for users who tweeted a statement similar to "I have been diagnosed with PTSD", where PTSD could also be depression, bipolar disorder, or seasonal affective disorder. Up to 3,200 tweets were
collected for each user that had posted such tweets, and then a corpus of
control tweets was generated by scraping the tweets of a random sampling of
users that had posted over the same time period. Various so-called "pattern of life" data were also measured, including measurements of how often a user posted, the proportion of tweets including mentions of other users, and the proportion of self-mentions. An analysis of the text data was also done using
LIWC. While LIWC was valuable in reproducing previous findings concerning
the language of mental health, neither the pattern of life analytics nor the data
obtained through the LIWC analysis were as effective at differentiating users
as the unigram and character 5-gram language models. The authors concluded
that their results indicated that a variety of signals relevant to mental health
were observable in Twitter data, and in particular in its lexical data.
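To illustrate the kind of character 5-gram model described above, the sketch below uses scikit-learn with stand-in training data; it shows the general feature construction, not the authors' actual implementation or classifier.

```python
# A sketch of character 5-gram features feeding a simple classifier;
# the two training tweets and labels are stand-ins for real scraped data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["i was just diagnosed with depression",
          "great run this morning with friends"]
labels = [1, 0]  # 1 = diagnosed user, 0 = control

# Character n-grams sidestep tokenization, which helps on noisy tweet text.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(5, 5)),
    LogisticRegression(),
)
model.fit(tweets, labels)
print(model.predict(["feeling depressed again today"]))
```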
In 2015, Coppersmith et al. published a summary of the Computational Linguistics and Clinical Psychology (CLPsych) shared and unshared tasks (Coppersmith et al., 2015). The task used data from Twitter users who stated
that they had been diagnosed with depression or PTSD and demographically-
matched community controls, with the goal of comparing various methods
of modeling language from social media related to mental health. Data was collected in the fashion described in (Coppersmith et al., 2015). Three binary classification experiments comprised the shared task: 1) depression versus control, 2) PTSD versus control, and 3) depression versus PTSD. Classifier performance
was measured primarily by average precision. Twitter users were divided into
train and test sets, with the train partition consisting of 327 depressed users,
246 PTSD users, and an age-and-gender-matched control user for each, for
a total of 1,146 users. The test data contained 150 depressed users and 150
PTSD users, which, combined with the matched controls, amounted to a total
of 600 users. Participants in the shared task consisted of four teams: The Uni-
versity of Maryland (UMD), The World Well-Being Project (WWBP), The
University of Minnesota at Duluth (Duluth), and a team comprised of mem-
bers employed at Microsoft, IHMC, and Qntfy (MIQ). The authors concluded
that the results of the shared task demonstrated the relative superiority of
topic-modeling over simple linguistic features for the shared tasks, though
such features provided some classification ability, even without the utilization
of complex machine learning techniques.
Using a data collection method similar to that used in the study con-
ducted in 2014, Coppersmith et al. used Twitter data scraped from users that had been identified as having made a public declaration of a suicide attempt to perform an exploratory analysis of the tweets posted prior to a user's suicide attempt (Coppersmith et al., 2016). 554 users were identified as having made a public declaration of a suicide attempt; of these, however, only 312 gave an indication of when their latest attempt was. 163 users provided an exact date, and of these, 125 had data available that was posted prior to their respective suicide attempts. In a similar fashion to what was accomplished in previous studies, Coppersmith et al. found that they were able to distinguish those who had attempted suicide from controls using n-gram language models with logistic regression. It was also found that users that had attempted suicide posted a greater volume of tweets than users in the control group. An emotional state classifier was also developed using hashtags as labels. This emotion classifier was used in order to explore the emotional makeup of users' tweets prior to a suicide attempt. Based on the labels generated by this automatic classifier, it was concluded that users that had attempted suicide posted a greater proportion
of tweets that could be categorized as angry or sad than controls did. These
proportions fall to levels similar to that of controls in the weeks following a
suicide attempt, however. Tweets labeled as fearful or disgusting were similar
between the control group and the suicide group in the weeks preceding a
suicide, but the suicide group showed a decrease in these categories to levels
below that of the control group in the weeks following a suicide attempt. In-
terestingly, and perhaps counterintuitively, the suicide group showed a lower
proportion of tweets labeled as indicating loneliness compared to the control
group. Furthermore, this difference tended to widen in the weeks following a
suicide attempt.
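The hashtag-labeling step is a form of distant supervision: hashtags act as noisy emotion labels and are stripped from the text so that a downstream classifier must learn from the remaining words. A sketch of the harvesting step, with an invented hashtag-to-emotion map:

```python
# Distant supervision via hashtags: turn tweets bearing a known emotion
# hashtag into (text, label) pairs. The hashtag map is invented for illustration.
import re

HASHTAG_EMOTIONS = {"#angry": "anger", "#sad": "sadness", "#lonely": "loneliness"}

def harvest_labeled_examples(tweets):
    examples = []
    for tweet in tweets:
        for tag, emotion in HASHTAG_EMOTIONS.items():
            if tag in tweet.lower():
                # Remove the label hashtag so the model cannot simply memorize it.
                text = re.sub(re.escape(tag), "", tweet, flags=re.IGNORECASE)
                examples.append((text.strip(), emotion))
    return examples

print(harvest_labeled_examples(["Everything went wrong today #sad",
                                "nobody ever calls #Lonely"]))
```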
In contrast to the previous studies discussed, which generally tracked
users’ activity over years, months, or weeks, or alternatively did not include
time as a variable at all, Loveys et al. sought to explore micropatterns occurring in messages over much shorter periods of time (Loveys et al., 2017).
Data was collected similarly to the method described in the previous studies
in this section. Tweets were collected for users that stated they were diag-
nosed with generalized anxiety disorder, an eating disorder, panic disorder, and
schizophrenia. Users stating that they had attempted suicide were included in
the study as well. These conditions were chosen as they were considered to
have symptoms that are the most sensitive to timing. Sentiment was analyzed using VADER, or the Valence Aware Dictionary and Sentiment Reasoner, a lexicon- and rule-based sentiment analysis tool specifically attuned to sentiments expressed in English on social media. The authors examined the
emotional content of three tweets following an initial tweet when the following
tweets were posted no more than three hours later. Tweets could be counted
in more than one overlapping micropattern if more than three tweets were
posted by a user within three hours. Continuing the line of inquiry followed in the previous studies discussed in this section, the authors compared the relative performance of the micropatterns, the underlying sentiment labels, and a combination of the two on a binary classification task. Micropatterns were shown to provide information beyond that provided by the sentiment labels alone for all mental health categories.
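For reference, scoring a single message with VADER takes only a few lines, assuming the vaderSentiment Python package; the cited study's surrounding micropattern pipeline is not reproduced here.

```python
# Minimal VADER usage via the vaderSentiment package.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I can't take this anymore :(")
print(scores)  # dict with 'neg', 'neu', 'pos', and 'compound' scores
```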
Also in 2017, a team working at Qntfy developed an annotation scheme for classifying depressed tweets according to a number of categories to generate the Depressive Symptom and Psychosocial Stressors Acquired Depression (SAD) corpus (Mowery et al., 2017). Using the DSM-5, elements of the DSM-IV, and
other descriptions of depressive symptoms documented in the psychiatric lit-
erature in conjunction with additional depression related categories observed
in data, such as weather and media, the authors developed an annotation
scheme. Both a psychiatrist and a counseling psychologist provided feedback
on the annotation categories prior to its finalization. Data for the corpus was
collected by searching the Twitter API using depression-related terms from the
LIWC corpus. In addition, data collected for the CLPsych 2015 shared task
described previously in this section was sampled. To validate the annotation
scheme, two psychology graduate researchers and a postdoctoral biomedical
informatics researcher annotated the 1200 tweets comprising the SAD corpus.
While interannotator agreement was high for tweets indicating no evidence of clinical depression, agreement was much lower for depressive symptoms and psychosocial stressors. Keywords were found to have much more predictive value for tweets in the CLPsych data than in the SAD corpus. The au-
thors theorized that this was due to the depression-related vocabulary being
grounded by users’ statements that they had been diagnosed, whereas in the
SAD corpus such terms could appear without such contextual grounding, lead-
ing to difficulty in classifying tweets accurately according to pre-defined lexical
categories. While the authors hoped future investigations into machine learn-
ing based postprocessing techniques could mitigate these limitations, overall
this study highlighted the present research difficulties in improving upon com-
putational methods with a qualitative analysis.
Lastly, in 2017 Coppersmith et al. addressed how linguistic signals of depression could be used by health care professionals as a supplement to data that is already collected by the health care system (Coppersmith et al., 2017). Using
both a VADER sentiment classifier and the hashtag-derived emotion classifier
developed by Coppersmith et al. (2016), the authors generated probability
distributions for each of the possible sentiment and emotion labels that could
be assigned to the internal chat and communications within a company. In
order to estimate the variety and proportions of emotions and sentiments ex-
pressed by a company on a given day, the authors aggregated the messages and
summed up the probabilities associated with each label, ignoring communica-
tions labeled as 'neutral' or 'no sentiment'. The rolling means of various sentiments and emotions over a one-week window were calculated over a 36-day period. The data analysis revealed that the company appeared to have increases in average negative sentiment in the weeks leading up to a big deliverable. In contrast, peaks of joy were observed in the periods preceding holidays and the completion of the first deliverable of a project. The authors suggested that these
findings were illustrative of the population-level analysis that is now possible
with computational analysis and classification tools. While the classifiers used
in this study were for emotion and sentiment, the authors indicated that the
mental health classification tools they had utilized in the studies discussed
earlier in this section were equally conducive to this sort of population-level
analysis.
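The rolling aggregation described above can be sketched as follows; pandas is an assumed tooling choice here, and the daily values are random placeholders for the summed per-day label probabilities.

```python
# Smooth daily aggregate sentiment with a one-week rolling mean over a
# 36-day period, as described above. Data values are placeholders.
import numpy as np
import pandas as pd

days = pd.date_range("2017-01-01", periods=36, freq="D")
daily_negative = pd.Series(np.random.rand(36), index=days)

weekly_trend = daily_negative.rolling("7D").mean()
print(weekly_trend.tail())
```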
3 Corpus Data
Twitter is an online social media service launched in July of 2006. It has
330 million active users as of October 2017 and was ranked the 13th most
visited site on the internet as of May 16, 2018 (Alexa Top 500 Global Sites).
Messages posted by users, referred to as "tweets", are limited to 140 Japanese, Korean, or Chinese characters, and to 280 characters in other languages. Photos and other media, as well as URLs and screennames, do not count towards the character limit. Users' account pages are publicly viewable, and their tweets are publicly viewable by default; this setting can be modified by users so that tweets can only be seen by registered Twitter users that have subscribed
to the user in question. Subscribing to another user puts that user’s tweets
on the "timeline" of the subscribing user, and this act of subscribing is called "following". A user's timeline is made up of the tweets of the Twitter users they are following. Users can mention other users through the use of the "@" character before another user's screenname; users have the ability to isolate and see all tweets containing such mentions of their screenname through the platform. In addition to being able to reply to other users' tweets, tweets can be "retweeted" by other users, allowing them to be shared on the timelines
of others with attribution to the original tweeter. Twitter users also have the
option of messaging each other privately and of blocking other users so that
future tweets or direct messages sent to them by the blocked user will not be
31
viewable or trigger a notification. Tweets can be posted by users through the
Twitter website, approved external applications such as smartphone apps, and
through SMS. Users can also click a button to ”like” a tweet. How many likes
a tweet has will be visible to all users that are able to view the tweet.
3.1 The Twitter API and User Selection
Twitter's developer platform offers several application programming interfaces (APIs), or sets of methods and properties that can be used to interact with data on the Twitter platform. These APIs allow developers to interact with username data, media, text data, and other metadata for usage in other apps. For example, through the Twitter developer platform and the use of its APIs, a smartphone developer could create an app that allows users to access their Twitter feed within another app, while a web developer could use an API to embed relevant tweets on a website. An API can be thought of as a list of rules and directions for accessing data and features. The Twitter API platform includes access to numerous endpoints, where an endpoint is a unique URL address pointing to an object. Twitter objects are generally represented as JSON files and consist of tweet objects, user objects, Twitter entities, Twitter
extended entities, and geospatial objects. Twitter API usage is rate limited, which means that the number of tweets that can be extracted is limited over a 15-minute window. At the time of this writing, the rate limit is 450 calls per 15-minute window for past data and 15 calls per 15-minute window for live data. Real-time data uses a separate API called the Streaming API, so-called because it allows developers to interact with tweets just as they are uploaded. Another limitation of the Twitter API is that it does not provide tweets older than 7 days. Due to these limitations, a supplemental tool for tweet location and extraction was necessary.

Figure 3.1: Tweet object retrieved through the Twitter API.
The Twitterscraper Python script developed by Ahmet Taspinar bypasses the Twitter API and instead uses the Twitter website's advanced search function and the BeautifulSoup library to extract tweets (taspinar/twitterscraper: Scrape Twitter for Tweets). This allows us to retrieve tweets that are older than 7 days. As with objects retrieved through the Twitter API, each tweet is retrieved as a JSON object. For each tweet retrieved, twitterscraper retrieves the username of the user that posted the tweet, the tweet id, the tweet url, the tweet text, the tweet html, the tweet timestamp, the number of likes the tweet has received, the number of replies to the tweet that have been posted, and the number of times the tweet has been retweeted by other users. Because the script utilizes the Twitter website's advanced search functionality, various arguments can be given to queries. Searches for tweets can be restricted by timespan, region, and language.
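A minimal sketch of such a query is given below; the parameter values are illustrative only, and the library's interface may differ between versions.

import datetime as dt
from twitterscraper import query_tweets

# Illustrative parameter values; this is a sketch, not the exact
# extraction script used for this study.
tweets = query_tweets(
    "우울증 진단",                    # the search query
    begindate=dt.date(2010, 1, 1),   # restrict the timespan
    enddate=dt.date(2017, 12, 31),
    lang="ko",                       # restrict to Korean-language tweets
)

for tweet in tweets[:5]:
    # Each result carries the fields described above.
    print(tweet.user, tweet.timestamp, tweet.likes, tweet.text)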
3.2 Tweet Extraction
A search query was run using the Twitterscraper script for the term "우울증 진단" (uuljeung jindan), or "depression diagnosis" in Korean. The search was restricted to posts made in Korea and made in the Korean language over an approximately 8-year period from 2010 to 2017. 441 users were initially retrieved for the depressed class. After using human annotators to identify tweets that indicate a claim of a diagnosis of depression (see Table 3.1 for examples of such tweets), the number of users in the depressed category was reduced to 139. Then, with a limit of 3,200 tweets per user, the past tweets of the users we identified as belonging to the depressed class were extracted. A control group of 4,000 randomly selected users posting over the same time period was constructed by searching for an empty string and having their tweet data extracted as well, again with a cap of 3,200 tweets. In order to make finding meaningful signals easier, usernames and URLs were removed from the data. The final numbers of users and tweets can be seen in Table 3.2.

Figure 3.2: An example of a tweet object obtained through the Twitterscraper Python script.

Genuine Statements of Diagnosis
“병원가보세여 저도 불면증와서 갓다가 우울증진단받구옴 꼭꼭 병원가보세욧ㅠㅜ”
“5년 내내 우울증에 시달리다가 올해 처음 정신과 가서 불안장애랑 우울증 진단받고 대략 5개월째 치로중ㅇ이야”
“아 병원은...일상 유지가 힘들어서 정신과 찾아갔는데 만성 우울증이랑 PTSD 진단받고 상담이랑 약물치료 병행하고있어요”

Disingenuous or Less Certain Statements of Diagnosis
“아 우울증이란 진단을 안 받아서 글치 직장인들 태반이 우울증일게야 아마”
“우울증 자가 진단... 60점 나왔다 – ;; 아∼!!!!!!!!! 심심해서 그래 ㅋㅋㅋ”
“네이트 검색어에 직장인 회사 우울증 나오길래, 클릭 했는데....점수가.....고도 판정.... 나 전문가 진단 받으래..ㅠㅠ 자가 진단에서 이런 결과나 받고.”

Table 3.1: Tweets used by native speakers to detect users belonging to the positive class.

            Users    Tweets
Depressed     139     17140
Control      4000    594281

Table 3.2: Number of users and tweets broken down by category
3.3 Tokenization
Tokenization is the process of taking a text string and breaking it into smaller portions, called tokens, that can be counted and used as the features through which a classifier will "interpret" a document.
For the first round of classifier experimentation, simple space tokenization
was used. This means that tweets were simply divided according to the white
spaces appearing in the tweet, with each token being a substring between two
whitespaces within a tweet. As indicated above, URLs and screen names were removed from all tweets prior to tokenization. For a second round of trials, the open source text segmentation library MeCab was used. While originally developed for Japanese by the Nara Institute of Science and Technology, a Korean fork of the project, Mecab-ko, has been developed by the Eunjeon Project (은전한닢 프로젝트: 은전한닢 프로젝트를 소개합니다.). Unlike a simple space
tokenizer, Mecab is able to identify parts of speech and select out sequences of
syllables with semantic relevance. It can also recognize misspellings in many
cases, as well as utilize information encoded by punctuation in a text se-
quence. Examples of tweets tokenized by both the space tokenization method
and MeCab-ko can be seen in Table 3.3.
Space Tokenization
’ㅠㅠ’ ’필요할거에요’ ’동의가’ ’부모’ ’미성년자는’
’댕댕이’
’아님’ ’고치겠다는’ ’뜯어’ ’때까지’ ’조곤조곤’ ’말은’ ’아니다’ ’몰아치기식’
’자두되나’
’시간이’ ’유익한’ ’상담으로’ ’친절한’ ’교수님의’ ’있었고’ ’의미’ ’더욱’ ’있어’ ’오를’ ’정상에’ ’빠짐없이’’분도’ ’다녀왔습니다’ ’산행을’ ’심학산으로’ ’경기도’ ’함께’ ’환우분들과’ ’다지고자’ ’희망을’ ’완치에의’ ’토요일에는’’19’ ’지난
Mecab
’ㅠㅠ’ ’..’ ’.’ ’에요’ ’거’ ’할’ ’필요’ ’가’ ’동의’ ’부모’ ’는’ ’성년자’ ’미’ ’@’
’댕댕이’
’?’ ’아님’ ’것’ ’다는’ ’겠’ ’고치’ ’어’ ’뜯’ ’까지’ ’때’ ’될’ ’조곤조곤’ ’은’ ’말’ ’다’ ’아니’’식’ ’기’ ’몰아치’ ’”’ ’는’
’나’ ’되’ ’자두’ ’..’ ’.’
’*’ ’이’ ’시간’ ’유익’ ’상담’ ’친절’ ’님’ ’교수’ ’고’ ’었’ ’의미’ ’더욱’ ’있’ ’수’ ’오를’ ’정상’’빠짐없이’ ’도’ ’한’ ’습니다’ ’다녀왔’ ’산행’ ’으로’ ’산’ ’심학’ ’경기도’ ’함께’ ’과’ ’들’ ’분’ ’환우’’고자’ ’다지’ ’희망’ ’완치’ ’에’ ’토요일’ ’19’ ’9’ ’지난’ ’을’ ’의’ ’어’ ’..’ ’.’ ’는’
Table 3.3: Examples of Tokenized Tweets
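A minimal sketch contrasting the two strategies is given below, assuming the konlpy wrapper around Mecab-ko is installed; the example tweet is hypothetical, not an item from the study corpus.

from konlpy.tag import Mecab

tweet = "우울증 진단받고 치료중이야"   # hypothetical example tweet

# Space tokenization: each token is a substring between whitespaces.
space_tokens = tweet.split()

# Morphological tokenization: Mecab-ko segments morphemes, separating
# stems from particles and endings.
mecab = Mecab()
mecab_tokens = mecab.morphs(tweet)

print(space_tokens)   # e.g. ['우울증', '진단받고', '치료중이야']
print(mecab_tokens)   # e.g. ['우울증', '진단', '받', '고', '치료', '중', '이', '야']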
3.4 Caveats
As in previous studies utilizing this method of dataset creation, there
are certain caveats to keep in mind in regard to its efficacy. As indicated by
Coppersmith et al. in their paper that introduced the data collection method
used for this experiment, significant among these are the following: 1) Individuals willing to speak publicly on such a taboo subject may only represent a unique subpopulation, and not depressed individuals as a group. 2) The claim
that a given user has been diagnosed with depression is not verified, but sim-
ply taken at face value. 3) There is the possibility that some percentage of
the control group users are also depressed. 4) Twitter users themselves are
possibly not representative of the greater population of depressed individuals.
(Coppersmith, Dredze, and Harman, 2014)
4 Classification Methods
In the previous chapter, we discussed our dataset and how we would
organize it so that various machine learning classifiers could learn how to tell
if a tweet came from a depressed user or a nominally non-depressed user. In this
chapter, we will provide a brief discussion of the tools employed to accomplish
that task, i.e., the machine learning classifiers themselves. We begin with a very brief introduction to machine learning and its applications.
4.1 Definitions
Machine learning is a subfield of computer science that employs statistical
knowledge through algorithms to enable models of the world to be ”learned”
by computers. These models can then be used to make predictions about the
world. American computer scientist Tom Michael Mitchell formally defined
machine learning as follows:
”A computer program is said to learn from experience E with re-
spect to some class of tasks T and performance measure P if its
performance at tasks in T, as measured by P, improves with expe-
rience E.”
(Mitchell, 1997)
For our purposes, task T would be determining whether or not a tweet
was posted by a user diagnosed with depression. The performance measures P
utilized for determining how well a given ”computer program”, or algorithm,
accomplishes this task are discussed in the following chapter. That leaves us
with the task of defining E, or the ”experience” that the classifiers will need
in order to improve at their assigned task.
4.2 Training, Testing, and Cross Validation
The task of the experiment is a supervised learning task, due to the nature
of the training, or experience as defined above, that is employed to improve
the performance of the model generated by the learning algorithm. In essence,
some of the tweets in the dataset are set aside for a given classifier so that the
classifier can hopefully start to detect patterns between the text and the labels
that have been given to the texts, i.e., depressed or not-depressed. Then, it
makes predictions on a held-out set of labeled tweets based on its experience. The classifier's performance is evaluated on this test set by comparing its predictions against the true labels. If there were not
a pre-determined correct or incorrect answer for the algorithm’s predictions,
and if the classifier did not train with data explicitly labeled and divided into
groups according to intentions and knowledge held outside of its own findings,
i.e., the understanding of what constitutes a depressed Twitter user versus a
control user, as was outlined in the previous chapter, then the task would
become an unsupervised learning task. In an unsupervised learning task, the
goal is often for the algorithm to find interesting patterns in a dataset either
as an end in itself or as a prelude to improving performance on a supervised
learning task.
When training and testing for a classification task, however, there are
certain pitfalls that have to be avoided. As indicated above, the data is sepa-
rated into at least two subsets, a training set and a test set. This is because
what is being sought in the improvement of a classifier’s performance is in-
formation that is generalizable to new, unseen data, not to see how well it
memorized the data it has already seen. That would be an extreme example
of overfitting, or a case where a classifier does very well on a very specific and
limited set of data, but does not perform well when given new or novel data
to classify. But even after splitting the dataset into two subsets, what if all of
the data in our training set belonged to depressed users, while all of the data
in the test set belonged to the control group? One would expect the classifier to perform very poorly on the test set. In this case, the classifier would
be underfitting, or not detecting any signals that would have allowed it to in-
crease its performance on the task. It could be said that the classifier is highly
biased towards the training data, and was not able to learn more generalizable
patterns because it was so drastically limited by the data it could train on.
When training and testing a classifier, it is desirable to avoid both overfitting and underfitting the data; often, an experiment design choice that decreases the probability of one will increase the probability of the other. This is a tradeoff between bias and variance, where a
biased classifier is one that underfits, and a classifier exhibiting a large degree
of variance is one that overfits.
One solution to this problem is cross-validation. Cross-validation is a process of training and testing a classifier that consists of partitioning the dataset into multiple subsets and then alternating which subset serves as the test set with each "fold", or iteration, of the process. After each subset has been used as the test set once, the performance of the classifier on each test set is averaged as the measure of its overall performance. There are many ways to split the data and many variations of cross-validation, but our experiment relies upon the widely-used convention of 10 folds for our cross-validation process, meaning that our classifiers will train on 90 percent of the data and test on the remaining 10 percent ten times, with the data comprising the test set being unique for each iteration. This reduces bias by allowing our classifiers to learn from all of the signals available in the data, while guarding against overfitting by repeatedly alternating the data used for the test set.
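A minimal sketch of this procedure with scikit-learn follows; the placeholder texts and labels are hypothetical stand-ins for the tokenized tweets, not the exact experimental pipeline.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical placeholder data; illustrative only.
texts = ["우울증 진단 받았다"] * 50 + ["오늘 날씨 좋다"] * 50
labels = [1] * 50 + [0] * 50          # 1 = depressed class, 0 = control

X = CountVectorizer().fit_transform(texts)   # bag-of-words features

# Each of the 10 folds trains on 90 percent of the data and tests on the
# remaining 10 percent; the mean F1 over folds summarizes performance.
scores = cross_val_score(LogisticRegression(), X, labels, cv=10, scoring="f1")
print(scores.mean(), scores.std())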
Having provided a brief indication of the nature of how a machine learning
classifier operates, we will conclude this chapter by providing a basic overview
of the classification algorithms that were employed in our experiment. Naive
Bayes was chosen as it is the simplest algorithm and often used for text clas-
sification tasks, even if only as a baseline to compare other approaches with,
and because it is very fast and easy to train. Because the feature space is quite
large and we assume the classes to be linearly separable, we also employ logis-
tic regression and linear SVM classifiers. A random forest classifier was also
implemented as an alternative approach to the two previous linear methods,
and also because it tends to work well in high-dimensional spaces.
4.3 Naive Bayes
When tasked with document classification, Bayesian classifiers use Bayes' theorem, along with an underlying assumption of feature independence, to determine the probability of a document belonging to one class or another. For example, if the classifier knows how often the term "depressed" appears in the depressed category of documents relative to the control category, it will use the tabulations from the data it trained on to determine how likely it is that a given document came from either the depressed or control class given the presence of the term "depressed" in the document; in other words, the relative frequency of "depressed" in each class. These per-feature probabilities are multiplied (equivalently, their logarithms are summed) across each feature in a document, giving the document a probability for each class. (Shimodaira, 2014)
P(\theta \mid D) = \frac{P(\theta)\,P(D \mid \theta)}{P(D)} \qquad (4.1)

4.1: The equation for Bayes' Theorem
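A minimal sketch of this classifier applied to text, using scikit-learn and a hypothetical two-tweet corpus, follows.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A hypothetical two-tweet corpus; illustrative only.
train_texts = ["우울증 진단 받았다", "오늘 점심 맛있었다"]
train_labels = [1, 0]                  # 1 = depressed class, 0 = control

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)   # token count tabulations

clf = MultinomialNB().fit(X_train, train_labels)

# Probability of each class given the tokens present in an unseen tweet.
X_new = vectorizer.transform(["우울증 진단"])
print(clf.predict_proba(X_new))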
4.4 Logistic Regression
Like linear regression, logistic regression makes linear associations be-
tween features and observations. In the case of document classification, fea-
tures are often word frequencies or ratios. Unlike linear regression, however,
logistic regression applies a logistic transformation to a linear function in order
to output a probabilistic class prediction between 0 and 1. This is equivalent
to saying that the logarithm of the odds of an observation belonging to a given
class can be represented by a linear function, or for the purposes of this study,
a linear function of token counts. This logarithm of the odds is also called the
logit of the probability. A decision threshold is applied to round the proba-
bility value generated by the logistic function to a discrete categorical value.
(Peng, Lee, and Ingersoll, 2002)
\pi = \frac{e^{X\beta}}{1 + e^{X\beta}} \qquad (4.2)

4.2: The logistic regression function
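As a worked illustration of this transformation, the snippet below uses hypothetical coefficients rather than values fitted in this study.

import numpy as np

# Hypothetical weights and token counts; not fitted values.
beta = np.array([0.8, -0.5])           # one weight per token feature
x = np.array([3, 1])                   # token counts for a single tweet

log_odds = x @ beta                    # the logit: linear in the counts
prob = 1 / (1 + np.exp(-log_odds))     # logistic transformation to (0, 1)

prediction = int(prob >= 0.5)          # decision threshold rounds to a class
print(log_odds, prob, prediction)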
4.5 Linear Support Vector Machines
This algorithm treats each observation as a vector of features. Each observation can therefore be conceptualized as a point in a high-dimensional space. For our purposes, an observation is a tweet and a feature is the frequency of a token within that tweet. After the documents have
been converted to vectors, a decision boundary function is used in order to
maximize the margins between the observations. This is equivalent to saying
that instead of simply finding a line or hyperplane that separates observations
belonging to one class or another, the algorithm finds the line or hyperplane
that maximizes the distance between the most similar observations belonging
to different classes, which are known as the support vectors. When evaluating
an unlabeled observation, the algorithm attempts to determine where in the
space it is located relative to the hyperplane it has constructed to divide the
two classes. (Joachims, 1998)
\vec{w}^{\,T}\vec{x} + b = 0 \qquad (4.3)

4.3: The equation for the decision boundary of the LSVM algorithm
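A minimal sketch of a linear SVM over bag-of-words vectors with scikit-learn follows; the two-tweet corpus is hypothetical and purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = ["우울증 진단 받았다", "주말 여행 다녀왔다"]   # hypothetical corpus
labels = [1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)    # each tweet becomes a feature vector

svm = LinearSVC().fit(X, labels)       # fits the maximum-margin hyperplane

# The signed value of w^T x + b locates an unseen tweet relative to the
# decision boundary dividing the two classes.
print(svm.decision_function(vectorizer.transform(["우울증 진단"])))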
4.6 Random Forest
The random forest classifier is a classifier that combines bootstrap aggregating (bagging) and decision tree learning (Liaw and Wiener, 2002). A decision tree can be conceptualized as a series of binary questions that, when answered in a hierarchical sequence, allow for the identification of an entity or observation. For example, if we were to attempt to predict whether or not an individual survived a natural disaster, we may start by inquiring whether the individual is male or female, if we know that all survivors were female. Because we can so accurately separate the classes with this one feature, gender, it would make up the root of the decision tree so long as there is no other feature of the survivors that so clearly distinguishes them from those that did not survive. If the answer is male and no males survived the disaster, we know that this individual did
not survive. On the other hand, if the individual was female, we could then
ask another question from the next level in the decision hierarchy. This next
decision can be conceptualized as a branch of the decision tree, and it will ask
about a feature relevant to distinguishing between the two classes, but that
was not as exclusive to one class versus the other as the feature at the previous
branch or at the root of three would be. At this branch, we might ask if she
was under 170 centimeters tall, and if no one that died in the incident was
both female and under 170 centimeters tall, we would know that she survived.
This final determination that this individual is a survivor is called the leaf,
or the decision, of the tree. For our classification task, each token can be con-
ceptualized as a feature that the tree could ask about. For each feature, the
tree must decide whether to split, i.e., form another binary branch, based on
that feature. This is determined by which feature costs us the least in terms of
predictive power to split on. This process is repeated in a recursive fashion on smaller and smaller subgroups until some terminal condition is met. One way of determining such a terminal condition is to set a minimum number of observations for each leaf or decision. In other words, going back to our survivor example, if we set the minimum number of examples belonging to a decision as eight, we would ignore any potential leaf that describes fewer than
eight observations belonging to a particular class. Another way is to set the
maximum depth, which limits how many branches a tree can have between its
root and its leaf.
Decision trees have many advantages, but one disadvantage is that they
are prone to becoming overly complex and overfitting data, which a random forest minimizes through bootstrap aggregating, or bagging. Bagging consists of creating a number of random samples with replacement from the larger dataset, and then, for each of these subsamples, training a tree. The resulting aggregation of trees is the random forest. To test the random forest in the case of a classification task, we take the majority vote of the trees in the forest as its prediction. One last important note is that random forests use a tree learning algorithm that learns on a random subset of features at each potential split. This is to prevent too many trees from becoming correlated with each other due to a few dominant signals (Ho, 2002).
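A minimal sketch with scikit-learn follows, where the parameters mirror the terminal conditions described above; the toy data is illustrative only.

from sklearn.ensemble import RandomForestClassifier

# Hypothetical token-count vectors; illustrative only.
X = [[3, 0], [2, 1], [0, 4], [1, 3]]
y = [1, 1, 0, 0]                       # 1 = depressed class, 0 = control

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees trained on bagged subsamples
    min_samples_leaf=1,   # minimum observations per leaf (a terminal condition)
    max_depth=None,       # or cap the branches between root and leaf
)
forest.fit(X, y)

# The forest's prediction is the majority vote of its trees.
print(forest.predict([[2, 0]]))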
4.7 Feedforward Neural Network
A feedforward neural network is a type of artificial neural network com-
posed of a collection of linked computational units, often referred to as nodes
or neurons, that are arranged in multiple layers, where information in the
resulting network of units is propagated forward from one layer to another,
from an initial input layer, to one or more hidden layers, and then finally to an
output layer, in a non-cyclic fashion. In this study, the initial layer consisted of the inputs, represented as a "bag of words", or a vector of zeros and ones, with each value in the vector representing the presence or absence of a unique token. These inputs are then multiplied by a vector of weights, which are the parameters of the network that are trained through a learning process. The
output produced by a hidden layer, or a layer between the input and output
layers, is the result of the application of an activation function to the values
from the previous layer. In a fully connected network such as the one utilized in
our research, each unit multiplies the vector of values from the previous layer
by weights associated with that unit, which can be conceptualized as a con-
nection or synapse between neurons. The values are then summed before the activation function is applied to the sum. In this study, a sigmoid function was used,
meaning that each unit in a hidden layer forward propagated a value between
0 and 1 to either a subsequent hidden layer or the output layer. This value can
be conceptualized as representing the confidence of the network in the weights
associated with that unit that are applied to its inputs. The weights used
to parameterize the network are learned through backpropagation, a process
where the error of the network’s prediction is distributed through the preced-
ing layers and units of the network, and the weights associated with each pair of units are adjusted accordingly (Bebis and Georgiopoulos, 1994). The feedfor-
ward neural network used in this study consisted of two hidden layers of 100
units each.
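A minimal sketch of this architecture follows, using scikit-learn's MLPClassifier as a stand-in implementation; the toy data and the max_iter value are assumptions for illustration.

from sklearn.neural_network import MLPClassifier

# Hypothetical bag-of-words presence vectors; illustrative only.
X = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y = [1, 0, 1, 0]                       # 1 = depressed class, 0 = control

net = MLPClassifier(
    hidden_layer_sizes=(100, 100),     # two hidden layers of 100 units each
    activation="logistic",             # sigmoid activations in the hidden layers
    max_iter=500,                      # iterations of backpropagation
)
net.fit(X, y)
print(net.predict([[1, 0, 0]]))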
5 Experiment
Inspired by the Qntfy studies discussed in Section 2.6 and the lack of research in this direction in languages other than English, especially Korean, an experi-
ment was conducted with data extracted according to the methods discussed
in Section 3.2. A comparison of performance in a binary classification task be-
tween depressed users and controls was conducted using the machine learning
classification methods discussed in Chapter 4 and two separate methods of
tokenization. To the best of our knowledge, this is the first study done using
Korean data attempting to distinguish the tweets of users suffering from a
mental health condition from those of a control group.
5.1 Methodology
After extracting the data, for each classification experiment a random sample of control-group tweets equal in number to the number of tweets in the depressed user class was set aside for training and testing in cross-validation. This resulted in a corpus of 34,280 tweets for training and testing. It was decided to balance the classes instead of utilizing a class ratio that more accurately represents the prevalence of depression in the population in order to ease classifier creation and interpretation, keeping in mind that results must be interpreted in light of this bias. This is a far smaller dataset than those
used in the Qntfy studies, but as stated in Section 1.4, one goal of our project
is to gauge the efficacy of this method of data extraction and classification
on a smaller dataset. The tweets were shuffled in order to help ensure that
each batch used during the training process was representative of signals in
the entire dataset and not an idiosyncratic cluster.
5.2 Explanation of Metrics
To measure the performance of the classifiers on the binary classification
task, we utilize F1 scores and receiver operating characteristic (ROC) graphs.
We explain these metrics and also the rationale for using them in this study.
5.2.1 Accuracy, Precision and Recall
The accuracy rating of a classifier, or the rate of correct predictions rel-
ative to the total number of predictions made, is usually not a satisfactory
measure when evaluating the performance of a classifier on a task related to diagnosing individuals with medical conditions. While a sim-
ple accuracy rating may suffice if the underlying task has to do with predicting
the result of a game given a certain state, or if predicting an event with a priori
equally likely outcomes, this does not extend to cases such as cancer detection
where a random patient undergoing screening is more likely to be cancer-free
than not, and in which the costs of a false negative prediction are far greater
than that of a false positive prediction. In cases such as these, metrics that
indicate how effective the classifier is at identifying positive cases are used, as the performance we are interested in is detection of the positive class. The rate of correct positive predictions relative to the number of positive cases observed is the recall of a classifier.
On the other hand, if a classifier simply predicted all cases as positive, that would not be very helpful, either. Consider again the cancer detection example. A classifier could obtain a one-hundred percent recall rating simply by diagnosing every patient with cancer. While false positives are less costly
than false negatives in this task, cancer treatment is time consuming and, in
most cases, life altering. And medical resources are limited; even if treatment
were painless, it simply is not practical to treat everyone as if they had cancer.
And of course, there is also the unnecessary trauma experienced by the patient
when receiving false diagnosis. For all of these reasons, we must also consider
the precision, or the rate of correct positive predictions to total positive pre-
dictions made by a classifier. Ideally, the classifier in our example will be able
to attain a high level of recall while also maintaining a high level of precision;
this means that it is able to identify the patients that need treatment while minimizing the number of incorrectly diagnosed patients, thereby avoiding the unnecessary stress inflicted upon patients and the significant costs incurred by both patients and the health care system.
Of course, there are a wide variety of possible classification tasks, and the
relative importance of recall versus precision can vary a great deal depending
on the task. That being said, for the purposes of our experiment, the three
measures discussed below are used in place of accuracy because they better
represent the greater importance of recall and precision over simple accuracy
in detecting depressed users.
\text{Precision} = \frac{\text{True Positive Predictions}}{\text{Total Positive Predictions}} \qquad (5.1)

5.1: The equation for determining the precision of a classifier

\text{Recall} = \frac{\text{True Positive Predictions}}{\text{Positive Observations}} \qquad (5.2)

5.2: The equation for determining the recall of a classifier
5.2.2 The F1 Score
The F1 score is a measure that indicates how well a classifier does in terms
of both its precision and its recall, given a probability threshold at which it
determines whether an observation belongs to the positive class. This is both
its strength and its weakness as a metric. With one number, we can gain
insight into how the classifier performs in a generalized way, but the F1 score
does not specifically tell us how well the classifier performs in terms of either
precision or recall; simply put, it is a representation of the balance between
the two. Given our balanced classes, the threshold used in our experiments was .5. It is calculated as follows:

F_1 = \frac{2}{\frac{1}{\text{recall}} + \frac{1}{\text{precision}}}
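A minimal sketch computing these quantities with scikit-learn over hypothetical label vectors follows.

from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: actual classes and predictions at a .5 threshold.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

print(precision_score(y_true, y_pred))  # true positives / total positive predictions
print(recall_score(y_true, y_pred))     # true positives / positive observations
print(f1_score(y_true, y_pred))         # the balance of the two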
5.2.3 ROC Curves
An ROC curve is a graph generated by plotting the true positive rate
(TPR) along the y-axis and the false positive rate (FPR) along the x-axis
at various threshold settings, with the threshold being the probability level
at which the classifier determines that an observation belongs to the positive
class. For example, if the threshold is set to .5, then the classifier will predict
that an observation is positive if the classifier determines that it has a .5 or
greater probability of belonging to the positive class. The best possible classi-
fier would be represented on an ROC graph by a point at the (0,1) coordinate, maximizing the area under the curve. This point indicates that the classifier is able to achieve a one-hundred percent true positive rate while also achieving a one-hundred percent true negative rate. In contrast, a classifier that made random guesses would be represented by a diagonal line beginning at the (0,0) coordinate that divides the graph in half. Points above
this line indicate that a classifier performs better than random chance, while
points below the line indicate the reverse. Lastly, the ROC curve graphs in
this section indicate the AUC, or area under curve, for each fold of a 10-fold
cross validation for each classifier. The AUC indicates the probability that the
classifier will identify a random observation from the positive class as more
likely to be of the positive class than a randomly selected observation from
the negative class. An AUC of .5 indicates performance no better than that
of a random guess, while an AUC of 1 would represent perfect classification
accuracy on a given test set.
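A minimal sketch of how such a curve and its AUC are computed with scikit-learn follows; the scores are hypothetical classifier probabilities.

from sklearn.metrics import auc, roc_curve

# Hypothetical probabilities for the positive class; illustrative only.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

# True and false positive rates at every threshold implied by the scores.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))   # 1.0 = perfect separation, .5 = chance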
5.3 Classifier Results with Space Tokenization
As discussed in Section 3.3, our first trial of experiments was conducted using simple space tokenization. This means that tokens were generated simply by splitting strings into chunks based on where white spaces appeared in a string. As mentioned previously, this is done after removing usernames and URLs from strings. We utilize this minimalist approach to tokenization in order to determine how much predictive insight a classifier can obtain through relatively unfiltered Twitter data. Given that a goal of this thesis is to minimize the need for domain knowledge or expertise, this is a reasonable baseline to explore. A 10-fold cross validation was used for the training and testing of all classifiers. As we can see in Table 5.1, the standard deviations of each classifier's F1 score over 10-fold cross validation are low, suggesting that the classifiers' performance is consistent over each subset of data. Logistic regression achieves the highest F1 score, with a score of .75. The performance disparity between the classifiers is not great, however. That all of the classifiers perform this task with a success rate well above chance guessing is visualized clearly in Figure 5.1. Lastly, as Table 5.1 indicates, each classifier obtains similar levels of performance to the linear classification method utilized on a similar dataset in (Coppersmith, Dredze, and Harman, 2014).
Figure 5.1: ROC graphs representing performance with space tokenization. Panels: (a) Logistic Regression, (b) Naive Bayes, (c) Linear SVM, (d) Random Forest.
                                  Space Tokenization    Mecab
Logistic Regression                      .75             .84
Multinomial Naive Bayes                  .72             .84
Linear Support Vector Machine            .72             .81
Random Forest                            .71             .83
Feedforward Neural Net                   .73             .83

Table 5.1: F1 Scores
5.4 Classifier Results with Mecab Tokenization
For our second round of trials, we utilize the Mecab-ko tokenizer discussed in Section 3.3. By contrasting simple space tokenization with the results obtained with Mecab, we can obtain an intuition for the degree to which morphological analysis aids classifiers in finding relevant signals that distinguish the two classes of users. Looking at Table 5.1, it is clear that the morphological analysis provided by Mecab-ko provides a significant boost to performance. The relative performance of each classifier remains approximately the same, with multinomial naive bayes seeing the most significant boost. It now performs as well as logistic regression, with an F1 score of .84.

Figure 5.2: ROC graphs representing performance with tokenization performed by Mecab-ko. Panels: (a) Logistic Regression, (b) Multinomial Naive Bayes, (c) Linear SVM, (d) Random Forest.
5.5 Linear SVM Top Features
While feature exploration was not the focus of our study, we have also provided graphs of the top features distinguishing the two classes using both space tokenization and Mecab. While the use of the Mecab tokenizer led to a significant increase in the performance of the linear SVM classifier, with the F1 score jumping from .72 to .81, the top features for the depressed class when using Mecab are quite different from those generated by simple space tokenization. The top features for the depressed class in the case of space tokenization contain Korean emoticons for sad faces and words we might intuitively associate with negative emotion, such as the Korean word for envy, bureopda (부럽다). The top features generated in the case of Mecab tokenization, however, do not contain these items. They do include the words 'pogi' (포기), which loosely translated means to give up or renounce; 'yushil' (유실), which is to be swept away or lost; 'dangyeobyeong' (당뇨병), diabetes; and, perhaps most appropriately, 'uuljeung' (우울증), depression.
Figure 5.3: Linear SVM Top Features with Space Tokenization.
Figure 5.4: Linear SVM Top Features with Mecab Tokenization.
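A minimal sketch of how such rankings can be read off a fitted linear SVM follows, assuming a recent scikit-learn CountVectorizer/LinearSVC pair; this is an illustration, not the exact analysis script behind Figures 5.3 and 5.4.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = ["우울증 진단 받았다 포기", "주말 등산 다녀왔다 희망"]  # hypothetical corpus
labels = [1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
svm = LinearSVC().fit(X, labels)

# Large positive weights push a tweet toward the depressed class,
# large negative weights toward the control class.
tokens = np.array(vectorizer.get_feature_names_out())
order = np.argsort(svm.coef_[0])
print("top control features:", tokens[order[:3]])
print("top depressed features:", tokens[order[-3:]])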
5.6 Precision-Recall Graphs
While organizing the data from our experiment, precision-recall graphs
were also generated. They have not been included or discussed thus far because
they do not serve our purposes outlined in sections 1.4 and 5.2. We will briefly
discuss them here, however, in order to address an aspect of our findings.
5.6.1 Precision-Recall Curves and Hard to Classify Tweets
As was explained earlier, precision is the rate of true positive predictions relative to total positive predictions, while recall is the ratio of true positive predictions to positive observations. Precision-recall curves plot recall on the x-axis and precision on the y-axis. Because precision is the probability of a positive prediction being correct, it is highly sensitive to the base probabilities
of the respective classes. It is for this reason that precision-recall curves are
often used when there is a severe class imbalance. Because this study and the
Qntfy studies used balanced classes, however, ROC curves were used in place
of precision-recall curves. However, if we look at the precision-recall curves generated by all classifiers used in our experiments when using space tokenization, we observe an interesting phenomenon: there is a consistent and dramatic drop in precision at a 70 percent recall rate for all classifiers. This means that there are some depressed-class tweets that the classifiers have great difficulty in identifying and that are only identified when the decision threshold is low; as a result, the precision drops precipitously. We observe a similar but less steep drop with Mecab tokenization and logistic regression at the threshold that generates an approximately 90 percent recall rate. For multinomial naive bayes, we again observe a similar but less pronounced drop in precision. Interestingly, random forest, while obtaining a lower average precision across all thresholds compared to logistic regression, does not produce this sudden drop in precision before reaching a threshold that generates an almost perfect recall rate.¹ It can be theorized that these problematic tweets contain signals that are shared between depressed and non-depressed users. Some of this overlap may be due to idiosyncrasies of tweet texts that are reduced or eliminated by the morphological analysis conducted by Mecab, leading to fewer problematic tweets that require very low thresholds to identify as belonging to the positive class.
¹ Average precision is a measure that summarizes the precision of a classifier over a set of thresholds, where thresholds are the probability levels at which a classifier makes a determination as to whether or not an observation belongs to the positive class. The precision at every threshold is weighted by the increase in recall from the previous threshold.
Figure 5.5: Precision-recall graphs representing performance with space tokenization. Panels: (a) Logistic Regression, (b) Naive Bayes, (c) Linear SVM, (d) Random Forest.
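A minimal sketch of how these curves and the average precision defined in the footnote are computed with scikit-learn follows; the scores are hypothetical classifier probabilities.

from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical probabilities for the positive class; illustrative only.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.7, 0.2, 0.6, 0.3, 0.1]

# Precision/recall pairs at every threshold; a sudden precision drop at high
# recall flags positive-class tweets that are hard to identify.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(average_precision_score(y_true, y_score))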
5.7 Discussion
As we can see, this method of data collection provides great potential
for the discovery of mental health signals when using various commonly used
machine learning classifiers. Logistic regression proved to be the best perform-
ing of all classifiers, regardless of the tokenization method used. Multinomial naive bayes proved to have an F1 score as high as logistic regression when using the Mecab tokenizer, .84, but the MNB classifier's average precision was still somewhat lower, .81 versus .86 for logistic regression. Results when using simple space tokenization were worse, with logistic regression outperforming the MNB classifier both in terms of its F1 score as well as in its average precision, with scores of .75 to .72 and .82 to .81, respectively. This suggests that logistic regression would be the preferred choice for the classification task under both a recall-prioritizing approach and a precision-prioritizing approach.

Figure 5.6: Precision-recall graphs representing performance with Mecab tokenization. Panels: (a) Logistic Regression, (b) Naive Bayes, (c) Linear SVM, (d) Random Forest.
The feedforward neural network did not perform better than the shallow learning classifiers. This may be due to the very large feature space, in conjunction with memory limitations on our experiment, leading to an inability to train a network complex enough to take advantage of the potential of a feedforward architecture. Alternatively, there simply may not be enough data for the network to train on so as to allow it to take into account more abstract nuances differentiating the two classes.
Given the significant increase in performance when using the Mecab to-
kenizer, we can infer that morphological analysis plays a key role in how ef-
fective any classification method may turn out to be. While some tweets in
the positive class are still difficult to identify, as was seen in Section 5.6, the morphological analysis of Mecab reduced the number of positive-class tweets that were difficult to identify, as evidenced by the lower thresholds at which classifiers saw a decrease in precision when using Mecab tokenization versus space tokenization.
The results of the experiment demonstrate that linguistic signals of de-
pression are as available and useful in a classification task in Korean using
social media data and machine learning as they have been suggested to be by
previous studies done in English. The true positive rate and false positive rate numbers cannot, however, be taken as close approximations of the values that would be produced on an actual population of randomly selected users, due to the imbalance between depressed and non-depressed individuals in an actual population, as well as the fact that there may be contamination of depressed users in the randomly selected control group. Nevertheless, the relative efficacy of the classifiers, and the comparable performance with past studies that also utilized datasets with balanced or nearly-balanced classes, show that these methods are at least as effective in Korean as they have proven to be in En-
glish.
Furthermore, in our brief examination of the top features identified by
the linear SVM classifier when using both Mecab and space tokenization, we can see that some linguistic signals that we might expect to see in depressed users' tweets emerge, such as sad face emoticons and words associated with negative emotion.
While this study only dealt with Korean data, we believe it indicates the
potential of these methods for the automatic detection of linguistic signals
of mental health conditions in a variety of languages. Availability of a good tokenizer for the language may be key in achieving optimal results, though it was demonstrated that significant differentiation between the classes could be achieved even with simple space tokenization.
6 Conclusion
Using social media to gain insight into signals tied to mental health
issues, both linguistic and otherwise, is an enterprise that has, in recent years,
grown at a rapid rate in terms of both research interest as well as in its viabil-
ity in providing results that can truly distinguish a target group from controls
in a way that is reliable, scalable, and feasible. However, most of this research
has been done in English. Our study demonstrates that an inexpensive data
collection technique first introduced in research done on English social me-
dia data, in conjunction with commonly used machine learning classifiers, is sufficient for distinguishing depressed social media users from a randomly se-
lected control group. Furthermore, it demonstrates, to our knowledge for the
first time, that this process is at least as effective with Korean data as it has
proven to be in past studies with English data.
In accomplishing this goal, we searched for users on the social media plat-
form Twitter that claimed to have been clinically diagnosed with depression
using the Korean language. Korean native speakers were then employed to
ensure that these claims were sincere and not sarcastic, made in jest, or oth-
erwise not an indication of an actual diagnosis of depression. Using a Python
script that allowed us to bypass the limitations of the Twitter API while still
leveraging publicly available data, we scraped the posting history of users
identified as diagnosed with depression according to the aforementioned stan-
dard, and then matched the resulting dataset with an equal number of tweets from randomly selected controls posting over the same time period. We then set various machine learning classifiers to work on a binary classification task using two kinds of tokenization methods. We did not, however, attempt to op-
timize these classifiers or utilize more sophisticated deep learning approaches
in a way that could maximize their potential in this task, and so this is left
for future research to explore.
Based on the ability of all classifiers to perform well above chance with
both tokenization methods in distinguishing tweets from depressed users from
those of a control group over the course of a 10-fold cross validation, it is
our belief that such findings may prove useful in future considerations of how
to leverage widely used social media platforms in identifying individuals that
may be at risk for suffering from debilitating mental health conditions such
as depression. Thus far, most studies in Korean dealing with the detection
of depression, even those that have leveraged social media data, have relied
on expensive and time-consuming surveys and interviews that often have a
low follow-through rate. Studies such as ours are a significant indicator that
relatively simple and cost-effective measures may, with more research, prove
to be a source of great contribution to the diagnosis and treatment of a large
group of individuals who suffer without seeking help from mental health pro-
fessionals.
Furthermore, in exploring our dataset, we find certain features that we
might intuitively expect to find from depressed users, such as words or emoti-
cons likely to indicate negative emotional states. In addition, we find a uniform
inability of the employed classifiers to reliably classify a portion of tweets with-
out a significant drop in precision. The number of tweets belonging to this
problematic group is dramatically reduced by morphological analysis, how-
ever. We theorize that this drop in precision may be due to language use idiosyncratic to Twitter, which is greatly reduced by
the morphological analysis provided by a tokenizer such as Mecab. That there
are still some tweets that are difficult to classify is not entirely unexpected,
as it is not clear intuitively or otherwise that we should expect the language
employed by depressed users on a platform such as Twitter to be exclusively
distinct from users not suffering from depression, thus leaving room for posts
that may be indistinguishable from those posted by a control group.
Our employment of a simple two-layer feedforward network fails to outperform our best performing shallow learning classifier, and we posit that this may be due either to there being no more complexity in the data to be mined, or to our simple network not being complex or optimized enough to make use
of such potential. As indicated above, we leave this avenue open for future
research to explore.
Lastly, we acknowledge that there are caveats to our study. Depressed
Twitter users, and in particular those willing to go public with a diagnosis,
may not be a good representation of depressed individuals as a whole. In addi-
tion, using a balanced dataset is not an accurate representation of the occurrence of depression in the general population, and compounding this fact is the potential for undiagnosed users, or alternatively diagnosed users who have not publicly disclosed their diagnosis, to contaminate the control group. Based on this study, however, we see good reason to believe that the methods employed in this thesis are as viable for Korean social media data
as they are increasingly being demonstrated to be in a growing body of work
done with English data. We also remain optimistic that with larger datasets, a
more optimized machine learning approach, and a more controlled experimen-
tal environment backed by personal data from willing volunteers, the results
and applications of approaches such as those employed in this experiment can
grow exponentially.
Bibliography
Ahn, J. (2012). “Depression, suicide, and Korean society”. In: Journal of the
Korean Medical Association 55.4, pp. 320–321.
Al-Mosaiwi, M. and T. Johnstone (2018). “In an Absolute State: Elevated Use
of Absolutist Words Is a Marker Specific to Anxiety, Depression, and Sui-
cidal Ideation”. In: Clinical Psychological Science 0.0, p. 2167702617747074.
doi: 10.1177/2167702617747074. eprint: https://doi.org/10.1177/
2167702617747074. url: https://doi.org/10.1177/2167702617747074.
Alexa Top 500 Global Sites. https://www.alexa.com/topsites. (Accessed
on 06/09/2018).
Aramaki, E., S. Maskawa, and M. Morita (2011). “Twitter Catches the Flu:
Detecting Influenza Epidemics Using Twitter”. In: Proceedings of the
Conference on Empirical Methods in Natural Language Processing. EMNLP
’11. Edinburgh, United Kingdom: Association for Computational Linguis-
tics, pp. 1568–1576. isbn: 978-1-937284-11-4. url: http://dl.acm.org/
citation.cfm?id=2145432.2145600.
Association, A. P. et al. (2013). Diagnostic and statistical manual of mental
disorders (DSM-5®). American Psychiatric Pub.
Bagroy, S., P. Kumaraguru, and M. De Choudhury (2017). “A social media
based index of mental well-being in college campuses”. In: Proceedings
of the 2017 CHI Conference on Human Factors in Computing Systems.
ACM, pp. 1634–1646.
Bebis, G. and M. Georgiopoulos (1994). “Feed-forward neural networks”. In:
IEEE Potentials 13.4, pp. 27–31.
Beck, A. T., R. A. Steer, and G. K. Brown (1996). “Beck depression inventory-
II”. In: San Antonio 78.2, pp. 490–8.
Belmaker, R. and G. Agam (2008). “Major depressive disorder”. In: New Eng-
land Journal of Medicine 358.1, pp. 55–68.
Boydstun, A. et al. (2013). “Examining debate effects in real time: A report of
the 2012 React Labs: Educate study”. In: The Political Communication
Report 23.1.
Coppersmith, G., M. Dredze, and C. Harman (2014). “Quantifying mental
health signals in twitter”. In: Proceedings of the Workshop on Compu-
tational Linguistics and Clinical Psychology: From Linguistic Signal to
Clinical Reality, pp. 51–60.
Coppersmith, G. et al. (2015). “CLPsych 2015 shared task: Depression and
PTSD on Twitter”. In: Proceedings of the 2nd Workshop on Compu-
tational Linguistics and Clinical Psychology: From Linguistic Signal to
Clinical Reality, pp. 31–39.
Coppersmith, G. et al. (2016). “Exploratory analysis of social media prior to
a suicide attempt”. In: Proceedings of the Third Workshop on Computa-
tional Linguistics and Clinical Psychology, pp. 106–117.
Coppersmith, G. et al. (2017). “Scalable mental health analysis in the clinical
whitespace via natural language processing”. In: Biomedical & Health In-
formatics (BHI), 2017 IEEE EMBS International Conference on. IEEE,
pp. 393–396.
De Choudhury, M. et al. (2013). “Predicting depression via social media.” In:
ICWSM 13, pp. 1–10.
Eaton, W. W. et al. (2004). “Center for Epidemiologic Studies Depression
Scale: review and revision (CESD and CESD-R).” In:
Fava, M. et al. (2003). “Background and rationale for the sequenced treat-
ment alternatives to relieve depression (STAR D) study”. In: Psychiatric
Clinics of North America 26.2, pp. 457–494.
Freud, S. (1901). The Psychopathology of Everyday Life. Digireads.com. isbn:
1420924915. url: http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/1420924915.
Goodwin, F. K. and K. R. Jamison (2007). Manic-depressive illness: bipolar
disorders and recurrent depression. Vol. 1. Oxford University Press.
Gottschalk, L. A. and R. Bechtel (1993). “Computerized content analysis of
natural language or verbal texts”. In: Palo Alto.
Gottschalk, L. A. et al. (1970). “Prediction of changes in severity of the
schizophrenic syndrome with discontinuation and administration of phe-
nothiazines in chronic schizophrenic patients: Language as a predictor and
measure of change in schizophrenia”. In: Comprehensive Psychiatry 11.2,
pp. 123 –140. issn: 0010-440X. doi: https://doi.org/10.1016/0010-
440X(70)90154-9. url: http://www.sciencedirect.com/science/
article/pii/0010440X70901549.
Gottschalk, L. A. and G. C. Gleser (1969). The measurement of psychological
states through the content analysis of verbal behavior. Univ of California
Press.
Guntuku, S. C. et al. (2017). “Detecting depression and mental illness on social
media: an integrative review”. In: Current Opinion in Behavioral Sciences
18, pp. 43–49.
Hamilton, M. (1986). “The Hamilton rating scale for depression”. In: Assess-
ment of depression. Springer, pp. 143–152.
Ho, T. K. (2002). “A data complexity analysis of comparative advantages of
decision forest constructors”. In: Pattern Analysis & Applications 5.2,
pp. 102–112.
Joachims, T. (1998). “Text categorization with support vector machines: Learn-
ing with many relevant features”. In: European conference on machine
learning. Springer, pp. 137–142.
Kahn, J. H. et al. (2007). “Measuring Emotional Expression with the Linguistic
Inquiry and Word Count”. In: The American Journal of Psychology 120.2,
pp. 263–286. issn: 00029556. url: http://www.jstor.org/stable/
20445398.
Kemp, S. (2016). Digital in 2016 - We Are Social UK. url: https://wearesocial.
com/uk/special-reports/digital-in-2016.
Kim, G. et al. (2013). “National Evidence-based Collaborating Agency (NECA)
Round-table Conference Consensus Statement: multidisciplinary responses
to suicide, the first ranked cause of death in adolescents.” In: Journal of
the Korean Medical Association, Taehan Uisa Hyophoe Chi 56.2.
Klein, D. N. and S. R. Black (2013). “Persistent depressive disorder”. In:
Psychopathology: History, Diagnosis, and Empirical Foundations 334.
Kroenke, K. and R. L. Spitzer (2002). “The PHQ-9: a new depression diag-
nostic and severity measure”. In: Psychiatric annals 32.9, pp. 509–515.
Lee, S. W. et al. (2016). “Insights from an expressive writing intervention
on Facebook to help alleviate depressive symptoms”. In: Computers in
Human Behavior 62, pp. 613–619.
Liaw, A., M. Wiener, et al. (2002). “Classification and regression by random-
Forest”. In: R news 2.3, pp. 18–22.
Loveys, K. et al. (2017). “Small but Mighty: Affective Micropatterns for Quan-
tifying Mental Health from Social Media Language”. In: Proceedings of the
Fourth Workshop on Computational Linguistics and Clinical Psychology—
From Linguistic Signal to Clinical Reality, pp. 85–95.
Mann, J. J. et al. (2005). “Suicide prevention strategies: a systematic review”.
In: Jama 294.16, pp. 2064–2074.
Marcus, M. et al. (2012). “Depression: A global public health concern”. In:
Mitchell, T. (1997). Machine Learning. McGraw-Hill International Editions.
McGraw-Hill. isbn: 9780071154673. url: https://books.google.co.
kr/books?id=EoYBngEACAAJ.
Moreno, M. A. et al. (2011). “Feeling bad on Facebook: Depression disclo-
sures by college students on a social networking site”. In: Depression and
anxiety 28.6, pp. 447–455.
Mowery, D. et al. (2017). “Understanding depressive symptoms and psychoso-
cial stressors on Twitter: a corpus-based study”. In: Journal of medical
Internet research 19.2.
Na, K.-S. et al. (2015). “Psychological autopsy: review and considerations
for future directions in Korea”. In: Journal of Korean Neuropsychiatric
Association 54.1, pp. 40–48.
Nadeem, M. (2016). “Identifying Depression on Twitter”. In: CoRR abs/1607.07384.
arXiv: 1607.07384. url: http://arxiv.org/abs/1607.07384.
Nelson, J. C. and J. M. Davis (1997). “DST studies in psychotic depression:
a meta-analysis”. In: American Journal of Psychiatry 154.11, pp. 1497–
1503.
Noh, J.-H. 학생 스마트폰 ’SNS 자살징후’ 부모에게 알린다. Ed. by Y. News.
url: http://www.yonhapnews.co.kr/bulletin/2015/03/12/0200000000AKR20150312185600004.
HTML.
O’Dea, B. et al. (2015). “Detecting suicidality on Twitter”. In: Internet Inter-
ventions 2.2, pp. 183–188.
OECD (2016). OECD Factbook 2015-2016, p. 228. doi: https://doi.org/
http://dx.doi.org/10.1787/factbook-2015-en. url: https://www.
oecd-ilibrary.org/content/publication/factbook-2015-en.
Pajer, K. et al. (2012). “Discovery of blood transcriptomic markers for depres-
sion in animal models and pilot validation in subjects with early-onset
major depression”. In: Translational psychiatry 2.4, e101.
Park, J. et al. (2011). “Ceo’s apology in twitter: A case study of the fake
beef labeling incident by e-mart”. In: International Conference on Social
Informatics. Springer, pp. 300–303.
Park, M., D. W. McDonald, and M. Cha (2013). “Perception Differences be-
tween the Depressed and Non-Depressed Users in Twitter.” In: ICWSM
9, pp. 217–226.
Park, S. et al. (2013). “Activities on Facebook reveal the depressive state of
users”. In: Journal of medical Internet research 15.10.
Park, S. et al. (2015). “Manifestation of depression and loneliness on social
networks: a case study of young adults on Facebook”. In: Proceedings
of the 18th ACM conference on computer supported cooperative work &
social computing. ACM, pp. 557–570.
Pedersen, T. (2015). “Screening twitter users for depression and ptsd with lex-
ical decision lists”. In: Proceedings of the 2nd workshop on computational
linguistics and clinical psychology: from linguistic signal to clinical reality,
pp. 46–53.
Peng, C.-Y. J., K. L. Lee, and G. M. Ingersoll (2002). “An introduction to
logistic regression analysis and reporting”. In: The journal of educational
research 96.1, pp. 3–14.
Qntfy. https://www.qntfy.com/. (Accessed on 06/01/2018).
Resnik, P., A. Garron, and R. Resnik (2013). “Using topic modeling to im-
prove prediction of neuroticism and depression in college students”. In:
Proceedings of the 2013 conference on empirical methods in natural lan-
guage processing, pp. 1348–1353.
Rude, S., E.-M. Gortner, and J. Pennebaker (2004). “Language use of de-
pressed and depression-vulnerable college students”. In: Cognition and
Emotion 18.8, pp. 1121–1133. doi: 10.1080/02699930441000030. eprint:
https://doi.org/10.1080/02699930441000030. url: https://doi.
org/10.1080/02699930441000030.
Saeed, S. A. and T. J. Bruce (1998). “Seasonal affective disorders.” In: Amer-
ican family physician 57.6, pp. 1340–6.
Shimodaira, H. (2014). “Text classification using naive bayes”. In: Learning
and Data Note 7, pp. 1–9.
Stirman, S. W. and J. W. Pennebaker (2001). “Word use in the poetry of sui-
cidal and nonsuicidal poets”. In: Psychosomatic medicine 63.4, pp. 517–
522.
Stone, P. J. and E. B. Hunt (1963). “A Computer Approach to Content
Analysis: Studies Using the General Inquirer System”. In: Proceedings
of the May 21-23, 1963, Spring Joint Computer Conference. AFIPS ’63
(Spring). Detroit, Michigan: ACM, pp. 241–256. doi: 10.1145/1461551.
1461583. url: http://doi.acm.org/10.1145/1461551.1461583.
taspinar/twitterscraper: Scrape Twitter for Tweets. https://github.com/
taspinar/twitterscraper. (Accessed on 06/10/2018).
Tausczik, Y. R. and J. W. Pennebaker (2010). “The Psychological Meaning
of Words: LIWC and Computerized Text Analysis Methods”. In: Jour-
nal of Language and Social Psychology 29.1, pp. 24–54. doi: 10.1177/
0261927X09351676. eprint: https://doi.org/10.1177/0261927X09351676.
url: https://doi.org/10.1177/0261927X09351676.
Tsugawa, S. et al. (2013). “On estimating depressive tendencies of Twitter
users utilizing their tweet data”. In: Virtual Reality (VR), 2013 IEEE.
IEEE, pp. 1–4.
Tsugawa, S. et al. (2015). “Recognizing depression from twitter activity”. In:
Proceedings of the 33rd Annual ACM Conference on Human Factors in
Computing Systems. ACM, pp. 3187–3196.
Weintraub, W. (1989). Verbal Behavior in Everyday Life. Springer Publishing
Company, Incorporated. isbn: 9780826157904. url: https://books.
google.co.kr/books?id=E1F9AAAAMAAJ.
Werth, J. L. (2004). “The relationships among clinical depression, suicide, and
other actions that may hasten death”. In: Behavioral sciences & the law
22.5, pp. 627–649.
Won, H.-H. et al. (2013). “Predicting national suicide numbers with social
media data”. In: PloS one 8.4, e61809.
Woo, H. et al. (2015). “Public Trauma after the Sewol Ferry Disaster: the
role of social media in understanding the public mood”. In: International
journal of environmental research and public health 12.9, pp. 10974–10983.
Zung, W. W. (1965). “A self-rating depression scale”. In: Archives of general
psychiatry 12.1, pp. 63–70.
은전한닢 프로젝트: 은전한닢 프로젝트를 소개합니다. http://eunjeon.blogspot.com/2013/02/blog-post.html. (Accessed on 06/1/2018).
초록

한국어 트위터 데이터를 활용한 우울증 표현 인식

근래 자살률에 있어 OECD 국가들 중 최상위권에 있으면서도, 한국에서 우울증과 같은 정신건강에 대한 진단과 치료는 과거와 마찬가지로 여전히 금기시되는 경향성이 있다. 영어권 국가들에서는 소셜 미디어 텍스트를 이용해 정신건강의 이상 징후를 찾는 연구가 크게 증가하고 있고, 최근에는 한국 교육부도 자체적으로 소셜 미디어 텍스트 검사 앱을 미성년자 대상으로 발표했다. 따라서 한국어 소셜 미디어 텍스트로부터 정신건강 이상 징후를 효과적으로 분류하는 연구는 현재 매우 시의적절한 상황이다. 현재까지 소셜 미디어 데이터를 활용한 다수의 기존 연구들은 심리학적 텍스트 분석 프로그램(LIWC) 또는 설문지와 같이 사전 구축된 어휘자료를 사용해왔고, 특정 분야의 지식과 설문조사를 요구하지 않는 자동 감지 방법에 대한 연구는 상대적으로 적었다. 더욱이 영어 이외의 언어를 대상으로 한 연구는 매우 드물고 한국어에 대해서는 연구가 전무한 상황이다. 본 연구는 한국의 우울증과 자살이 공중 보건 문제에 대해 갖는 중요성을 감안해 이와 같은 부족함을 채우고자 이루어졌다. 본 연구는 어떤 게시된 트윗으로부터 그것을 작성한 사용자가 우울증을 앓고 있는지를 예측하고자 다양한 기계 학습 분류기를 사용하였다. 이를 위해 먼저 우울증을 진단받았다고 주장하는 트윗을 올린 사용자들을 찾은 후에, 한국어 모국어 화자들이 직접 그 트윗 게시물을 토대로 우울증 진단 여부를 판단하였다. 그리고 우울증을 앓고 있는 것으로 판단된 사용자로부터 최대 3,200개까지의 트윗을 수집했으며, 같은 활동 시기의 정상적 사용자들 중 같은 수의 사용자들을 임의로 선택하여 그 트윗들을 통제집단으로 수집하였다. 두 개의 다른 토크나이저와 다수의 기계 학습 분류기를 사용했고, 토크나이저와 분류기의 각 조합에 따라 10-폴드 교차 검증법을 이용하여 평균 정밀도와 F1 스코어를 기록했다. 그 결과, 모든 조합에서 우연보다 훨씬 높은 정확도로 우울증 경향성을 보이는 사용자들을 감지하였다. 그러므로 본 연구는 소셜 미디어 자료를 사용하여 정신건강 문제를 자동 탐지하는 방법이, 기존의 심리학적 텍스트 분석 프로그램(LIWC)이나 비용과 시간이 드는 설문조사에 비해 최소한 그 성능이 같거나 더 낫다는 점을 확인하였다는 의미를 갖는다.

주요어: 기계학습, 정신 건강, 소셜 미디어, 트위터, 우울증
학번: 2015-22104