alz hack ii
TRANSCRIPT
![Page 1: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/1.jpg)
AlzHack
Data Driven Diagnosis of Alzheimer's Disease
Frank Kelly
![Page 2: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/2.jpg)
Goal definition
![Page 3: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/3.jpg)
Our goal:
Diagnose Alzheimer’s disease as early as possible
Benefit to millions of people (potentially)
![Page 4: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/4.jpg)
Why is Alzheimer’s disease diagnosis important?
Chronic neurodegenerative disease
60-70% of dementia cases = Alzheimer's
48 million people affected worldwide (2015)
Wrecks people’s lives (+ their families’)
800,000 people (in the UK) formally diagnosed
Only 43% of those with the condition get a diagnosis
Figures: Wikipedia & http://www.bbc.co.uk/science/0/21878238
![Page 5: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/5.jpg)
A global, mounting problem
Demographic changes mean it will be more widespread
By 2050 the number of dementia sufferers is expected to triple
Chart credit: economist.com
![Page 6: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/6.jpg)
How is Alzheimer’s disease diagnosed today?
Medical history
Mental status tests
Physical and neurological examination
Blood tests and brain imaging
Example test sheet: http://www.ftdrg.org/wp-content/uploads/4a-CCT_revised-Picture-stimulus.pdf
![Page 7: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/7.jpg)
A gradual decline
Timeline: -20 years, -15 years, -10 years, -5 years, Death
Stages: Earliest Alzheimer’s, Mild to moderate, Severe
Common diagnosis period
![Page 8: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/8.jpg)
Who are we?
Full bios: https://alzhack.wordpress.com
What is our approach? We’re doing citizen science
● No lab, or lab coats
● Readily available data
● Other people’s research
![Page 9: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/9.jpg)
Diagnose Alzheimer’s disease as early as possible
Why?
Participate in clinical drug trials
Benefit from treatment
More time to plan
Take own decisions
Better carer relationship
Reduce anxieties about unknowns
Sketch: http://www.businessfinancenews.com/28526-will-astrazeneca-plc-and-eli-lilly-give-breakthrough-in-alzheimers/
![Page 10: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/10.jpg)
Design of Study & Data Collection
![Page 11: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/11.jpg)
How the disease manifests itself
Protein plaques and tangles accumulate in the brain:
● Disrupting communication between nerve cells
● Killing nerve cells
● Loss of brain tissue
Facts: https://www.alzheimers.org.uk/site/scripts/documents_info.php?documentID=100
Imagery: www.alz.org
![Page 12: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/12.jpg)
How the disease manifests itself (1)
Starts in the hippocampus
Harder to form new memories
Difficult to recall events from days or hours ago
Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA
![Page 13: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/13.jpg)
How the disease manifests itself (2)
...then takes root in other areas
2. Language processing
3. Logical thought
4. Emotions
5. Senses
6. Older memories
7. Balance and coordination
Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA
![Page 14: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/14.jpg)
Relevant symptoms
● Confusion with time/place
● Spatial memory
● Problems with words
● Misplacing items
● Decreased / poor judgment
● Withdrawal from work
● Mood change
● Difficulty with familiar tasks
● Challenges in planning
● Speech
● Short term memory loss
Timeline: -20 years, -15 years, -10 years, -5 years, Death
Stages: Earliest Alzheimer’s, Mild to moderate, Severe
![Page 15: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/15.jpg)
Previously...
![Page 16: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/16.jpg)
Previously: Analysis of a single user’s emails
● An Alzheimer’s disease sufferer’s emails over 4 years
● Conversion of email text to vectors
● Counts, lengths and other metrics
Features
Memory, language and sentiment related metrics extracted
![Page 17: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/17.jpg)
Results
Some “explainable” trends
Challenges
Single user: lack of data and likely bias
Scaling up: security concerns & deletion
![Page 18: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/18.jpg)
How did we get more data?
![Page 19: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/19.jpg)
Forum post scraping
First lxml, then BeautifulSoup
● Two sub-forums
● ~3,600 threads
● ~78,000 posts
○ Post content
○ Post metadata
○ User metadata
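A minimal sketch of the scraping step, under stated assumptions. The slides name lxml and BeautifulSoup; to stay dependency-free this sketch swaps in the standard library's `html.parser` instead. The forum's real markup is not shown in the slides, so the `post` class and the `data-user` / `data-time` attributes below are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup: the real forum's HTML structure is not given in the slides.
SAMPLE_HTML = """
<div class="post" data-user="alice" data-time="2015-03-01T10:00">
  <p>First post in the thread.</p>
</div>
<div class="post" data-user="bob" data-time="2015-03-01T11:30">
  <p>A reply, with <b>some</b> markup.</p>
</div>
"""


class PostParser(HTMLParser):
    """Collects one record (user, time, text) per <div class="post"> element."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._depth = 0       # nesting depth of <div> inside the current post
        self._current = None  # record being built, or None when outside a post

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            if tag == "div" and attrs.get("class") == "post":
                self._current = {
                    "user": attrs.get("data-user"),
                    "time": attrs.get("data-time"),
                    "text": [],
                }
                self._depth = 1
        elif tag == "div":
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                # Join the text fragments and normalise whitespace.
                text = " ".join(" ".join(self._current["text"]).split())
                self._current["text"] = text
                self.posts.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)


def parse_posts(html):
    """Return a list of post records found in an HTML page."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Each record carries the post content plus the post and user metadata it could find; a real scraper would also need to follow thread and pagination links.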
![Page 20: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/20.jpg)
Data preparation
Content punctuation sanitised by regexp substitutions.
Sub-forum post data (x2)
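The slides do not list the actual substitutions, so the rules below are assumptions chosen only to illustrate what regexp-based punctuation sanitising could look like.

```python
import re


def sanitise(text):
    """Normalise punctuation in a post (illustrative rules, not the authors')."""
    # Replace curly quotes with their ASCII equivalents.
    text = text.replace("\u2018", "'").replace("\u2019", "'")
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    # Collapse runs of the same sentence punctuation: "!!!" -> "!", "..." -> "."
    text = re.sub(r"([.!?])\1+", r"\1", text)
    # Insert a missing space after punctuation that is directly followed by a letter.
    # Caveat: this also splits abbreviations such as "e.g." and breaks URLs.
    text = re.sub(r"([.!?,;:])(?=[A-Za-z])", r"\1 ", text)
    # Collapse any run of whitespace to a single space.
    return re.sub(r"\s+", " ", text).strip()
```

Cleaning punctuation before sentence splitting matters here because several later features (sentence counts, readability) depend on where sentences end.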
![Page 21: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/21.jpg)
User labelling
![Page 22: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/22.jpg)
How do we label a user?
● Users frequently post in both sub-forums
● To differentiate:
○ Assume that OPs (thread starters) in a sub-forum are of that category
○ Otherwise look at ratio of posts (replies) between the two sub forums*
FP = First Post in thread SP = Subsequent Post in thread
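One way the labelling rules could be expressed, as a sketch only. The ratio threshold, the handling of users who start threads in both sub-forums, and the label names are all assumptions; the slides do not give them.

```python
def label_user(fp_dementia, fp_partner, sp_dementia, sp_partner, ratio=2.0):
    """Label a user from first-post (FP) and subsequent-post (SP) counts.

    fp_* : threads started in the dementia / partner sub-forum
    sp_* : replies posted in the dementia / partner sub-forum
    ratio: hypothetical threshold for how lopsided replies must be
    """
    # Rule 1: a thread starter takes the category of the sub-forum they start in.
    if fp_dementia > 0 and fp_partner == 0:
        return "dementia"
    if fp_partner > 0 and fp_dementia == 0:
        return "partner"
    if fp_dementia > 0 and fp_partner > 0:
        return "discard"  # assumption: conflicting evidence is dropped

    # Rule 2: no threads started, so fall back on the ratio of replies.
    if sp_dementia == 0 and sp_partner == 0:
        return "unknown"
    if sp_dementia >= ratio * sp_partner:
        return "dementia"
    if sp_partner >= ratio * sp_dementia:
        return "partner"
    return "unknown"
```

For example, `label_user(0, 0, 2, 3)` returns `"unknown"` with the default threshold because neither sub-forum dominates the user's replies.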
![Page 23: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/23.jpg)
How do we label a user?
[Flowchart: Thread and Reply decisions leading to the labels Dementia, Partner, Discard, Unknown]
![Page 24: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/24.jpg)
Features and EDA
![Page 25: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/25.jpg)
‘Mood change’ as a feature
● Sentiment “polarity” (out-of-the-box via NLTK & TextBlob)
● Alternatively can train your own text classifier:
http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
![Page 26: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/26.jpg)
● Average of sentence sentiments per post
● Slightly higher sentiment for dementia sufferers’ posts
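A sketch of the per-post averaging, assuming a polarity function that maps a sentence to a score in [-1, 1]. The slides use NLTK and TextBlob for that score; the tiny lexicon scorer here is a toy stand-in so the example runs without extra packages, and its word list is invented for illustration.

```python
import re

# Toy lexicon standing in for a real polarity scorer (hypothetical values).
TOY_LEXICON = {"good": 1.0, "happy": 1.0, "bad": -1.0, "sad": -1.0}


def toy_polarity(sentence):
    """Mean lexicon score of the known words in a sentence, 0.0 if none."""
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def post_sentiment(post, polarity=toy_polarity):
    """Average the sentence-level polarity over all sentences in a post."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", post.strip()) if s]
    if not sentences:
        return 0.0
    return sum(polarity(s) for s in sentences) / len(sentences)
```

Any scorer with the same signature can be passed as `polarity`, so a library-backed function can replace the toy one without changing the averaging.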
![Page 27: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/27.jpg)
Language-oriented features
Functions: Lexical functions; Comprehension functions
Feature groups: Empty phrases; Paraphasias and neologisms; Vocabulary-related; Readability
Descriptions and metrics: “Go ahead” phrases; unintended or invented words; difficult words count; Dale-Chall readability; Flesch Kincaid; Flesch Reading Ease; counts of “ummm...errr”; words that are not in common usage
![Page 28: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/28.jpg)
Simple language features
● Sentence count
● Word count
● Words per sentence
● Unique word count
● Unique words to total ratio
● “Go Ahead” words (Empty phrases)
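The features listed above could be computed along these lines. This is a sketch: the tokenisation is deliberately naive, and the `GO_AHEAD_WORDS` set is hypothetical because the slides do not enumerate the empty phrases that were counted.

```python
import re

# Hypothetical list: the slides do not say which "go ahead" words were used.
GO_AHEAD_WORDS = {"thing", "stuff", "something", "anything"}


def simple_features(post):
    """Return the simple language features for one post as a dict."""
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    words = re.findall(r"[A-Za-z']+", post.lower())
    unique = set(words)
    n_sent, n_words = len(sentences), len(words)
    return {
        "sentence_count": n_sent,
        "word_count": n_words,
        "words_per_sentence": n_words / n_sent if n_sent else 0.0,
        "unique_word_count": len(unique),
        "unique_ratio": len(unique) / n_words if n_words else 0.0,
        "go_ahead_count": sum(1 for w in words if w in GO_AHEAD_WORDS),
    }
```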
![Page 29: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/29.jpg)
Readability (package readability-lxml)
● Avg syllables per word
● Avg letters per word
● Flesch reading ease
● Flesch Kincaid grade
● Polysyllable count
● Automated readability index
● Number of “difficult” words
● Dale-Chall readability score
● Gunning fog
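For reference, two of the scores above follow the standard Flesch formulas, which depend only on words per sentence and syllables per word. The sketch below implements them directly rather than calling the package named on the slide; the syllable counter is a rough heuristic and an assumption of this example, so values will differ slightly from library output.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, ignoring a trailing silent 'e'."""
    word = word.lower()
    if word.endswith("e") and not word.endswith("le") and len(word) > 2:
        word = word[:-1]
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def readability(text):
    """Return Flesch scores for a text, or None if it has no words or sentences."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)  # words per sentence
    spw = syllables / len(words)       # syllables per word
    return {
        "avg_syllables_per_word": spw,
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
    }
```

Higher reading-ease scores mean simpler text, while the Kincaid grade rises with sentence length and word complexity.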
![Page 30: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/30.jpg)
Vocabulary & word counts
![Page 31: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/31.jpg)
Memory-oriented features
● Sort posts by username and timestamp, add a shifted column
![Page 32: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/32.jpg)
Apply comparison function between post and previous post:
○ NLTK edit_distance (fuzzy match)
○ Cosine similarity between TF-IDF vectors
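A sketch of the "compare each post with the same user's previous post" idea, using cosine similarity over TF-IDF vectors. The slides describe this with a shifted pandas column; this version uses plain Python lists instead, and the smoothed IDF weighting and record field names (`user`, `time`, `text`) are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict term -> weight) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    # Smoothed IDF so terms present in every post keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def similarity_to_previous(posts):
    """Sort posts by (user, time) and add 'prev_sim' for each post.

    'prev_sim' is None for a user's first post, since there is nothing to compare.
    """
    ordered = sorted(posts, key=lambda p: (p["user"], p["time"]))
    vectors = tfidf_vectors([p["text"] for p in ordered])
    result = []
    for i, post in enumerate(ordered):
        same_user = i > 0 and ordered[i - 1]["user"] == post["user"]
        sim = cosine(vectors[i], vectors[i - 1]) if same_user else None
        result.append({**post, "prev_sim": sim})
    return result
```

A high `prev_sim` means the user largely repeated their previous post, which is the kind of signal a memory-oriented feature is meant to capture.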
![Page 33: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/33.jpg)
Part of speech (POS) features
● Tag words and tally up frequencies
● Calculate “rates”
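A sketch of the tally-and-rate step. It assumes the post has already been tagged into `(word, tag)` pairs with Penn Treebank tags, the format a tagger such as NLTK's `pos_tag` returns; the coarse grouping of tags into five classes is an assumption of this example.

```python
from collections import Counter

# Coarse groups keyed by Penn Treebank tag prefix (grouping is illustrative).
GROUPS = {
    "noun": "NN",
    "verb": "VB",
    "adjective": "JJ",
    "adverb": "RB",
    "pronoun": "PRP",
}


def pos_rates(tagged):
    """Return each group's share of all tagged tokens in a post."""
    total = len(tagged)
    counts = Counter()
    for _, tag in tagged:
        for name, prefix in GROUPS.items():
            if tag.startswith(prefix):
                counts[name] += 1
                break
    return {
        name + "_rate": (counts[name] / total if total else 0.0)
        for name in GROUPS
    }
```

Note that punctuation tokens, if present in the tagged input, count towards the total and so lower every rate.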
![Page 34: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/34.jpg)
Models & results
![Page 35: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/35.jpg)
Explanatory or predictive modelling?
● Actually both.
● First, ‘interpret’ a classifier (explanatory)
● Second, we need a ‘real-time’ detection system (predictive)
![Page 36: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/36.jpg)
Data modelling strategy (used for initial ML runs)
Aggregation of posts
● pandas: groupby, agg by username
Balancing out the dataset
● Many more partner users than sufferers
● Subsample larger (partner) dataset to even things up
Validate using random train and test sets
● Randomly select 80% of users for training, 20% test
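The three steps above (aggregate per user, balance the classes, split 80/20 by user) might look like this in outline. The slides use pandas `groupby`; this sketch uses only the standard library, handles a single feature for brevity, and the label names and seed are assumptions.

```python
import random
from collections import defaultdict
from statistics import median


def aggregate_by_user(rows):
    """rows: (username, label, feature_value) per post.

    Returns {username: (label, median feature value over the user's posts)}.
    """
    values = defaultdict(list)
    labels = {}
    for user, label, value in rows:
        values[user].append(value)
        labels[user] = label
    return {u: (labels[u], median(v)) for u, v in values.items()}


def balance_and_split(users, train_frac=0.8, seed=0):
    """Subsample every class to the smallest class size, then split by user."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for user, (label, _) in users.items():
        by_label[label].append(user)
    n = min(len(names) for names in by_label.values())
    balanced = []
    for names in by_label.values():
        balanced.extend(rng.sample(sorted(names), n))
    rng.shuffle(balanced)
    cut = int(len(balanced) * train_frac)
    return balanced[:cut], balanced[cut:]
```

Splitting by user rather than by post keeps all of one person's writing on the same side of the train/test boundary.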
![Page 37: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/37.jpg)
Model Results for Misc. Features
● Median values (aggregated over all posts per user)
Best: SVM Radial basis function classifier (with grid search)
User classification accuracy: 57%
![Page 38: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/38.jpg)
Model Results for Memory Features
● Median values (aggregated over all posts per user)
Best: K-nearest neighbours Classifier
User classification accuracy: 63%
![Page 39: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/39.jpg)
Model Results for Readability Features
● Median values (aggregated over all posts per user)
Best: K-nearest neighbours Classifier
User classification accuracy: 59%
![Page 40: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/40.jpg)
Model Results for Part-Of-Speech Features
● Median values (aggregated over all posts per user)
Best: SVM Radial basis function classifier (with grid search)
User classification accuracy: 61%
![Page 41: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/41.jpg)
Model Results for All Features
● Median values (aggregated over all posts per user)
Best: Naïve Bayes Classifier
User classification accuracy: 63%
![Page 42: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/42.jpg)
Re-think: Classify posts, not users
● Currently group by userID
● Some users post more than others
● Posts would utilise full “richness” of the dataset
● Double round of sampling required on post set:
○ 3 - 4 times more “partners” than dementia sufferers
○ Partners post approx. 3 times more posts than sufferers do
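One possible reading of the "double round of sampling", offered as an assumption since the slides do not spell out the procedure: first subsample users so both classes contribute the same number of people, then subsample posts so both classes contribute the same number of posts.

```python
import random
from collections import defaultdict


def double_sample(posts, seed=0):
    """posts: list of (username, label, text). Returns a class-balanced subset.

    Assumes each username carries exactly one label.
    """
    rng = random.Random(seed)

    # Round 1: keep the same number of users in every class.
    users_by_label = defaultdict(set)
    for user, label, _ in posts:
        users_by_label[label].add(user)
    n_users = min(len(users) for users in users_by_label.values())
    kept_users = set()
    for users in users_by_label.values():
        kept_users.update(rng.sample(sorted(users), n_users))

    # Round 2: keep the same number of posts in every class.
    posts_by_label = defaultdict(list)
    for post in posts:
        if post[0] in kept_users:
            posts_by_label[post[1]].append(post)
    n_posts = min(len(group) for group in posts_by_label.values())
    sample = []
    for group in posts_by_label.values():
        sample.extend(rng.sample(group, n_posts))
    return sample
```

Balancing users first prevents a handful of prolific partners from dominating the partner class once posts are sampled.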
![Page 43: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/43.jpg)
Model Results for All Features (by post)
● Filtered set of posts
Best: Random Forest Classifier
Post classification accuracy: 68%
![Page 44: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/44.jpg)
Wrap up
![Page 45: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/45.jpg)
Results in summary
● Best performing feature group so far on aggregated set by user:
○ Memory-based features
● Best performing individual feature on aggregated set by user:
○ Verb rate = ratio of verbs to word count in post
● Best performing individual feature on individual post:
○ Cosine similarity to previous post
● Aligns with symptoms expected in early stage to mild dementia
![Page 46: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/46.jpg)
Future avenues
● Data
○ Further data gathering (more blogs including non-Alzheimer's topic blogs)
○ Better user identification (e.g. active learning)
● Features
○ More and better
○ Distinguish individual types of dementia
○ More memory-related features (e.g. LSI)
● Clustering of posts into ‘topics’ or users into ‘types’
○ gensim / LDA topic modelling
○ Early stage / medium condition / advanced condition posters
● Classification and modelling
○ Time series analysis
○ New sampling techniques, input validation and models
![Page 47: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/47.jpg)
Future: Time series analysis
● Noisy datasets
○ Apply numerical Bayesian inference
● Are we looking for a steady change in the mean?
○ Ramp detection
● Or a sudden change in variance?
○ Step change detection
Chart legend: Dementia sufferer, Partner
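To make the two questions concrete, here is a deliberately crude sketch: a least-squares slope as a ramp indicator, and a search for the split point with the largest change in variance as a step indicator. It is not the Bayesian inference the slide proposes, and the minimum segment length and scoring rule are assumptions.

```python
import math
from statistics import mean, pvariance


def ramp_slope(values):
    """Least-squares slope of the values against their index (needs 2+ points)."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def variance_step(values, min_len=3, eps=1e-12):
    """Find the split with the largest change in variance.

    Returns (index, log of variance-after over variance-before), or
    (None, 0.0) when the series is too short or shows no change.
    """
    best = (None, 0.0)
    for k in range(min_len, len(values) - min_len + 1):
        before = pvariance(values[:k]) + eps  # eps avoids division by zero
        after = pvariance(values[k:]) + eps
        score = math.log(after / before)
        if abs(score) > abs(best[1]):
            best = (k, score)
    return best
```

A slope near zero with a large variance score would point to a step change in variability rather than a gradual ramp in the feature's mean.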
![Page 48: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/48.jpg)
Conclusions
● Introduction to Alzheimer’s and its impact
● Explanation of our technical approach and surrounding challenges
● Initial observations and predictions
● Tough problem and a worthwhile cause for data science
● Please contact us if you would like to help, or have ideas:
[email protected]
https://alzhack.wordpress.com/contribute-2/
Thank you!
![Page 49: Alz Hack II](https://reader031.vdocuments.pub/reader031/viewer/2022021921/58f12c3f1a28ab78218b45c1/html5/thumbnails/49.jpg)