alz hack ii

AlzHack Data Driven Diagnosis of Alzheimer's Disease Frank Kelly

Upload: frank-kelly

Post on 15-Apr-2017


TRANSCRIPT

Page 1: Alz Hack II

AlzHack

Data Driven Diagnosis of Alzheimer's Disease

Frank Kelly

Page 2: Alz Hack II

Goal definition

Page 3: Alz Hack II

Our goal:

Diagnose Alzheimer’s disease as early as possible

Benefit to millions of people (potentially)

Page 4: Alz Hack II

Why is Alzheimer’s disease diagnosis important?

Chronic neurodegenerative disease

60-70% of dementia cases = Alzheimer's

48 million people affected worldwide (2015)

Wrecks people’s lives (+ their families’)

800,000 people (in the UK) formally diagnosed

Only 43% of those with the condition get a diagnosis

Figures: wikipedia & http://www.bbc.co.uk/science/0/21878238

Page 5: Alz Hack II

A global, mounting problem

Demographic changes mean it will be more widespread

By 2050 the number of dementia sufferers is expected to triple

Chart credit: economist.com

Page 6: Alz Hack II

How is Alzheimer’s disease diagnosed today?

Medical history

Mental status tests

Physical and neurological examination

Blood tests and brain imaging

Example test sheet: http://www.ftdrg.org/wp-content/uploads/4a-CCT_revised-Picture-stimulus.pdf

Page 7: Alz Hack II

A gradual decline

Timeline: -20 years, -15 years, -10 years, -5 years, Death

Stages: Earliest Alzheimer’s, Mild to moderate, Severe

Common diagnosis period

Page 8: Alz Hack II

Who are we?

Full bios: https://alzhack.wordpress.com

What is our approach? We’re doing citizen science

● No lab, or lab coats

● Readily available data

● Other people’s research

Page 9: Alz Hack II

Diagnose Alzheimer’s disease as early as possible

Why?

Participate in clinical drug trials

Benefit from treatment

More time to plan

Take own decisions

Better carer relationship

Reduce anxieties about unknowns

Sketch: http://www.businessfinancenews.com/28526-will-astrazeneca-plc-and-eli-lilly-give-breakthrough-in-alzheimers/

Page 10: Alz Hack II

Design of Study & Data Collection

Page 11: Alz Hack II

How the disease manifests itself

Protein plaques and tangles accumulate in the brain:

Disrupting communication between nerve cells

Kills nerve cells

Loss of brain tissue

Facts: https://www.alzheimers.org.uk/site/scripts/documents_info.php?documentID=100

Imagery: www.alz.org

Page 12: Alz Hack II

How the disease manifests itself (1)

Starts in the hippocampus

Harder to form new memories

Difficult to recollect events from days or hours ago

Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA

Page 13: Alz Hack II

How the disease manifests itself (2)

...then takes root in other areas

2. Language processing

3. Logical thought

4. Emotions

5. Senses

6. Older memories

7. Balance and coordination

Video: https://www.youtube.com/watch?v=Eq_Er-tqPsA

Page 14: Alz Hack II

Relevant symptoms

Confusion with time/place

Spatial memory

Problems with words

Misplacing items

Decreased / poor judgment

Withdrawal from work

Mood change

Difficulty with familiar tasks

Challenges in planning

Speech

Short-term memory loss

Timeline: -20 years, -15 years, -10 years, -5 years, Death

Stages: Earliest Alzheimer’s, Mild to moderate, Severe

Page 15: Alz Hack II

Previously...

Page 16: Alz Hack II

Previously: Analysis of a single user’s emails

● An Alzheimer’s disease sufferer’s emails over 4 years

● Conversion of email text to vectors

● Counts, lengths and other metrics

Features

Memory, language and sentiment related metrics extracted

Page 17: Alz Hack II

Results

Some “explainable” trends

Challenges

Single user: lack of data and likely bias

Scaling up: security concerns & deletion

Page 18: Alz Hack II

How did we get more data?

Page 19: Alz Hack II

Forum post scraping

First lxml, then BeautifulSoup

● Two sub-forums

● ~3,600 threads

● ~78,000 posts

○ Post content

○ Post metadata

○ User metadata
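A minimal sketch of how post content, post metadata and user metadata might be pulled out of a thread page with BeautifulSoup. The HTML layout, the class names and the `parse_posts` helper are all hypothetical; the real forum markup is not shown in the slides.

```python
from bs4 import BeautifulSoup

# Hypothetical thread markup, invented for illustration only.
SAMPLE_THREAD_HTML = """
<div class="post" data-post-id="101">
  <span class="author">user_a</span>
  <span class="posted">2016-03-01 09:15</span>
  <div class="content">First post in the thread.</div>
</div>
<div class="post" data-post-id="102">
  <span class="author">user_b</span>
  <span class="posted">2016-03-01 10:02</span>
  <div class="content">A reply to the first post.</div>
</div>
"""


def parse_posts(html):
    """Return one dict per post with its content and metadata."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for node in soup.select("div.post"):
        posts.append({
            "post_id": node["data-post-id"],
            "username": node.select_one(".author").get_text(strip=True),
            "timestamp": node.select_one(".posted").get_text(strip=True),
            "content": node.select_one(".content").get_text(" ", strip=True),
        })
    return posts


posts = parse_posts(SAMPLE_THREAD_HTML)
```

The first post of each thread is worth flagging at this stage, since the labelling step later distinguishes thread starters from repliers.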

Page 20: Alz Hack II

Data preparation

Content punctuation sanitised by regexp substitutions.

Sub-forum post data (x2)
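The slides do not list the actual substitutions, so the following is only a sketch of what regexp-based punctuation sanitising could look like. The three rules and the `sanitise` name are assumptions.

```python
import re


def sanitise(text):
    """Normalise punctuation in a post (illustrative rules, not the originals)."""
    text = re.sub(r"[\u2018\u2019]", "'", text)   # curly single quotes -> straight
    text = re.sub(r"[\u201c\u201d]", '"', text)   # curly double quotes -> straight
    text = re.sub(r"([!?])\1+", r"\1", text)      # collapse runs like "!!!" or "???"
    text = re.sub(r"\s+", " ", text)              # collapse whitespace
    return text.strip()


cleaned = sanitise("Hello!!!   I\u2019m  fine")
```

Ellipses are deliberately left alone here, because hesitation markers such as “ummm...errr” are counted as a feature later on.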

Page 21: Alz Hack II

User labelling

Page 22: Alz Hack II

How do we label a user?

● Users frequently post in both sub-forums

● To differentiate:

○ Assume that OPs (thread starters) in a sub-forum are of that category

○ Otherwise look at ratio of posts (replies) between the two sub forums*

FP = First Post in thread

SP = Subsequent Post in thread
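The two rules above can be sketched as a small function over per-user FP and SP counts. The `min_ratio` threshold, the handling of users who start threads in both sub-forums, and the function name are assumptions; the slides do not give these details.

```python
def label_user(fp_dementia, fp_partner, sp_dementia, sp_partner, min_ratio=2.0):
    """Label a user from first-post (FP) and subsequent-post (SP) counts.

    min_ratio is a hypothetical threshold for the reply-ratio rule.
    """
    # Rule 1: thread starters take the category of the sub-forum they start in.
    if fp_dementia > 0 and fp_partner == 0:
        return "dementia"
    if fp_partner > 0 and fp_dementia == 0:
        return "partner"
    if fp_dementia > 0 and fp_partner > 0:
        return "discard"  # assumption: conflicting thread starts are dropped

    # Rule 2: no thread starts, so fall back to the ratio of replies.
    if sp_dementia == 0 and sp_partner == 0:
        return "unknown"
    if sp_dementia >= min_ratio * sp_partner:
        return "dementia"
    if sp_partner >= min_ratio * sp_dementia:
        return "partner"
    return "unknown"
```

For example, a user with no thread starts, one reply in the dementia sub-forum and six in the partner sub-forum comes out as a partner under these assumed settings.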

Page 23: Alz Hack II

How do we label a user?

Flow diagram: Thread / Reply counts lead to one of four labels: Dementia, Partner, Discard, Unknown

Page 24: Alz Hack II

Features and EDA

Page 26: Alz Hack II

● Average of sentence sentiments per post

● Slightly higher sentiment for dementia sufferers’ posts
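The sentiment library used is not named in the slides. As a stand-in, this sketch scores each sentence with a tiny hand-made word list and averages the sentence scores per post; the word lists and function names are purely illustrative.

```python
import re

# Toy lexicon, invented for illustration; a real run would use a proper
# sentiment model or lexicon.
POSITIVE = {"good", "happy", "love", "better", "hope"}
NEGATIVE = {"bad", "sad", "worse", "afraid", "lost"}


def sentence_sentiment(sentence):
    """Score one sentence in [-1, 1] as (positive - negative) / word count."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    score = sum((w in POSITIVE) - (w in NEGATIVE) for w in words)
    return score / len(words)


def post_sentiment(post):
    """Average the sentence-level sentiment over all sentences in a post."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", post.strip()) if s]
    if not sentences:
        return 0.0
    return sum(sentence_sentiment(s) for s in sentences) / len(sentences)
```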

Page 27: Alz Hack II

Language-oriented features

Lexical functions

Comprehension functions

Empty phrases

Paraphasias and neologisms

Vocabulary-related

Readability

“Go ahead” phrases

Unintended or invented words

Difficult words count

Dale-Chall readability

Flesch-Kincaid

Flesch Reading Ease

Counts of “ummm...errr”

Words that are not in common usage

Page 28: Alz Hack II

Simple language features

● Sentence count

● Word count

● Words per sentence

● Unique word count

● Unique words to total ratio

● “Go Ahead” words (Empty phrases)
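A rough sketch of these counts using only regular expressions. The tokenisation rules and the `EMPTY_PHRASES` list are assumptions; the slides do not say which phrases were counted as “Go Ahead” words.

```python
import re

# Hypothetical list of empty phrases; the original list is not given.
EMPTY_PHRASES = ("you know", "sort of", "and so on")


def simple_features(post):
    """Compute the simple language features for one post."""
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    words = re.findall(r"[a-z']+", post.lower())
    unique = set(words)
    lowered = post.lower()
    return {
        "sentence_count": len(sentences),
        "word_count": len(words),
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "unique_word_count": len(unique),
        "unique_ratio": len(unique) / len(words) if words else 0.0,
        "empty_phrase_count": sum(lowered.count(p) for p in EMPTY_PHRASES),
    }


features = simple_features("The cat sat. The cat ran, you know.")
```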

Page 29: Alz Hack II

Readability (package readability-lxml)

● Avg syllables per word

● Avg letters per word

● Flesch reading ease

● Flesch-Kincaid grade

● Polysyllable count

● Automated readability index

● Number of “difficult” words

● Dale-Chall readability score

● Gunning fog
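For reference, two of the scores listed above follow standard published formulas: Flesch reading ease is 206.835 - 1.015 (words / sentences) - 84.6 (syllables / words), and the Flesch-Kincaid grade is 0.39 (words / sentences) + 11.8 (syllables / words) - 15.59. The sketch below applies them with a deliberately naive syllable counter (vowel groups), which is an assumption and will differ from what a dedicated package computes.

```python
import re


def count_syllables(word):
    """Naive syllable estimate: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_scores(words, sentences, syllables):
    """Return (Flesch reading ease, Flesch-Kincaid grade) from raw counts."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade


def readability(text):
    """Count words, sentences and syllables in text, then score it."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_scores(len(words), len(sentences), syllables)
```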

Page 30: Alz Hack II

Vocabulary & word counts

Page 31: Alz Hack II

Memory-oriented features

● Sort posts by username and timestamp, add a shifted column

Page 32: Alz Hack II

● Apply comparison function between post and previous post:

○ NLTK edit_distance (fuzzy match)

○ Cosine similarity between TF-IDF vectors
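The sort, shift and compare steps can be sketched with pandas and scikit-learn as follows. The column names and sample posts are invented, and fitting one TF-IDF vocabulary over all posts is an assumption about how the vectors were built.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented sample data; column names are hypothetical.
posts = pd.DataFrame({
    "username": ["ann", "bob", "ann", "ann"],
    "timestamp": pd.to_datetime(
        ["2017-01-02", "2017-01-01", "2017-01-01", "2017-01-03"]),
    "content": [
        "where did I put my keys",
        "my wife was diagnosed last year",
        "where did I put my keys",
        "the garden looks lovely today",
    ],
})

# Sort by user and time, then add each user's previous post as a shifted column.
posts = posts.sort_values(["username", "timestamp"]).reset_index(drop=True)
posts["prev_content"] = posts.groupby("username")["content"].shift(1)

vectorizer = TfidfVectorizer().fit(posts["content"])


def similarity_to_previous(row):
    """Cosine similarity between a post and the same user's previous post."""
    if pd.isna(row["prev_content"]):
        return float("nan")  # first post by this user has nothing to compare to
    vectors = vectorizer.transform([row["content"], row["prev_content"]])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])


posts["cos_sim_prev"] = posts.apply(similarity_to_previous, axis=1)
```

The same shifted column can be passed to `nltk.edit_distance(a, b)` to obtain the fuzzy-match feature; a high similarity to the previous post is read here as a possible sign of repetition.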

Page 33: Alz Hack II

Part of speech (POS) features

● Tag words and tally up frequencies

● Calculate “rates”
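A sketch of the tally-and-rate step, assuming the words have already been tagged with Penn Treebank tags (for example by `nltk.pos_tag`, which needs a tagger model downloaded first). The coarse grouping of tags and the `pos_rates` name are assumptions.

```python
from collections import Counter


def pos_rates(tagged_words):
    """Turn (word, tag) pairs into per-category rates over the word count."""
    total = len(tagged_words)
    if total == 0:
        return {}
    counts = Counter()
    for _word, tag in tagged_words:
        if tag.startswith("VB"):
            counts["verb"] += 1
        elif tag.startswith("NN"):
            counts["noun"] += 1
        elif tag.startswith("PRP"):
            counts["pronoun"] += 1
        elif tag.startswith("JJ"):
            counts["adjective"] += 1
        else:
            counts["other"] += 1
    return {f"{name}_rate": count / total for name, count in counts.items()}


# Hand-tagged example, standing in for tagger output.
rates = pos_rates([("I", "PRP"), ("forgot", "VBD"), ("my", "PRP$"), ("keys", "NNS")])
```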

Page 34: Alz Hack II

Models & results

Page 35: Alz Hack II

Explanatory or predictive modelling?

● Actually both.

● First ‘interpret’ a classifier (explanatory)

● Secondly need a ‘real-time’ detection system (predictive)

Page 36: Alz Hack II

Data modelling strategy (used for initial ML runs)

Aggregation of posts

● pandas: groupby, agg by username

Balancing out the dataset

● Many more partner users than sufferers

● Subsample larger (partner) dataset to even things up

Validate using random train and test sets

● Randomly select 80% of users for training, 20% test
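The three steps above, sketched in pandas on synthetic data. The feature columns, the label values and the random seeds are assumptions chosen only to make the example run.

```python
import pandas as pd

# Synthetic per-post data: 20 users with 3 posts each, 5 of them sufferers.
posts = pd.DataFrame({
    "username": [f"u{i % 20}" for i in range(60)],
    "label": ["dementia" if i % 20 < 5 else "partner" for i in range(60)],
    "word_count": [10 + i for i in range(60)],
    "verb_rate": [0.10 + 0.01 * (i % 7) for i in range(60)],
})

# 1. Aggregate posts per user, taking the median of each feature.
users = posts.groupby("username").agg(
    label=("label", "first"),
    word_count=("word_count", "median"),
    verb_rate=("verb_rate", "median"),
)

# 2. Balance the classes by subsampling the larger (partner) group.
sufferers = users[users["label"] == "dementia"]
partners = users[users["label"] == "partner"].sample(
    n=len(sufferers), random_state=0)
balanced = pd.concat([sufferers, partners])

# 3. Randomly assign 80% of users to training and the rest to test.
train = balanced.sample(frac=0.8, random_state=0)
test = balanced.drop(train.index)
```

Splitting by user rather than by post keeps all of one person's posts on the same side of the train/test boundary.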

Page 37: Alz Hack II

Model Results for Misc. Features

● Median values (aggregated over all posts per user)

Best: SVM Radial basis function classifier (with grid search)

User classification accuracy: 57%
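A sketch of an RBF-kernel SVM tuned by grid search with scikit-learn. The data here is synthetic, and the parameter grid, scaling step and fold count are assumptions, so the resulting accuracy has no relation to the 57% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the per-user feature table.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale features, then fit an RBF SVM; the grid values are illustrative.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

accuracy = search.score(X_test, y_test)  # accuracy on the held-out 20%
```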

Page 38: Alz Hack II

Model Results for Memory Features

● Median values (aggregated over all posts per user)

Best: K-nearest neighbours Classifier

User classification accuracy: 63%

Page 39: Alz Hack II

Model Results for Readability Features

● Median values (aggregated over all posts per user)

Best: K-nearest neighbours Classifier

User classification accuracy: 59%

Page 40: Alz Hack II

Model Results for Part-Of-Speech Features

● Median values (aggregated over all posts per user)

Best: SVM Radial basis function classifier (with grid search)

User classification accuracy: 61%

Page 41: Alz Hack II

Model Results for All Features

● Median values (aggregated over all posts per user)

Best: Naïve Bayes Classifier

User classification accuracy: 63%

Page 42: Alz Hack II

Re-think: Classify posts, not users

● Currently group by userID

● Some users post more than others

● Posts would utilise full “richness” of the dataset

● Double round of sampling required on post set:

○ 3 - 4 times more “partners” than dementia sufferers

○ Partners post approx. 3 times more posts than sufferers do
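One possible reading of the double round of sampling, sketched in pandas: first subsample partner users to match the number of sufferers, then subsample the remaining partner posts to match the number of sufferer posts. The slides do not spell out the procedure, so both rounds as written here are assumptions.

```python
import pandas as pd

# Synthetic data: 2 sufferers with 2 posts each, 6 partners with 6 posts each.
rows = [("d1", "dementia")] * 2 + [("d2", "dementia")] * 2
rows += [(f"p{k}", "partner") for k in range(1, 7) for _ in range(6)]
posts = pd.DataFrame(rows, columns=["username", "label"])

# Round 1: keep as many partner users as there are sufferers.
users = posts.drop_duplicates("username")
n_sufferers = int((users["label"] == "dementia").sum())
kept_partners = users[users["label"] == "partner"].sample(
    n=n_sufferers, random_state=0)["username"]

# Round 2: keep as many partner posts as there are sufferer posts.
dementia_posts = posts[posts["label"] == "dementia"]
partner_posts = posts[posts["username"].isin(kept_partners)].sample(
    n=len(dementia_posts), random_state=0)

balanced_posts = pd.concat([dementia_posts, partner_posts])
```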

Page 43: Alz Hack II

Model Results for All Features (by post)

● Filtered set of posts

Best: Random Forest Classifier

Post classification accuracy: 68%

Page 44: Alz Hack II

Wrap up

Page 45: Alz Hack II

Results in summary

● Best performing feature group so far on aggregated set by user:

○ Memory-based features

● Best performing individual feature on aggregated set by user:

○ Verb rate = ratio of verbs to word count in post

● Best performing individual feature on individual post:

○ Cosine similarity to previous post

● Aligns with symptoms expected in early stage to mild dementia

Page 46: Alz Hack II

Future avenues

● Data

○ Further data gathering (more blogs, including blogs on non-Alzheimer's topics)

○ Better user identification (e.g. active learning)

● Features

○ More and better

○ Distinguish between individual types of dementia

○ More memory-related features (e.g. LSI)

● Clustering of posts into ‘topics’ or users into ‘types’

○ gensim / LDA topic modelling

○ Early stage / medium condition / advanced condition posters

● Classification and modelling

○ Time series analysis

○ New sampling techniques, input validation and models

Page 47: Alz Hack II

Future: Time series analysis

● Noisy datasets

○ Apply numerical Bayesian inference

● Are we looking for a steady change in the mean?

○ Ramp detection

● Or a sudden change in variance?

○ Step change detection
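To make the two questions concrete, here is a small NumPy sketch. It is not the Bayesian inference mentioned above: the ramp is estimated with an ordinary least-squares slope, and the variance change with a plain maximum-likelihood scan over split points. Function names and the `min_size` parameter are assumptions.

```python
import numpy as np


def ramp_slope(values):
    """Least-squares slope of a series against its index (steady mean change)."""
    t = np.arange(len(values))
    slope, _intercept = np.polyfit(t, values, 1)
    return float(slope)


def variance_change_point(values, min_size=5):
    """Index that best splits the series into two segments of differing variance.

    Scores each split by the Gaussian log-likelihood (up to constants) of
    modelling the two segments with separate variances.
    """
    values = np.asarray(values, dtype=float)
    n = len(values)
    best_index, best_score = None, -np.inf
    for k in range(min_size, n - min_size + 1):
        left_var = values[:k].var() + 1e-12   # small constant avoids log(0)
        right_var = values[k:].var() + 1e-12
        score = -0.5 * (k * np.log(left_var) + (n - k) * np.log(right_var))
        if score > best_score:
            best_index, best_score = k, score
    return best_index


# Synthetic series: quiet for 50 points, then much noisier for 50 points.
rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0, 0.1, 50), rng.normal(0, 2.0, 50)])
change_at = variance_change_point(series)
```

On a per-user feature series such as cosine similarity to the previous post, a clearly non-zero slope would point to a ramp, while a well-separated split point would point to a step change.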

Dementia sufferer

Partner

Page 48: Alz Hack II

Conclusions

● Introduction to Alzheimer’s and its impact

● Explanation of our technical approach and surrounding challenges

● Initial observations and predictions

● Tough problem and a worthwhile cause for data science

● Please contact us if you would like to help, or have ideas:

[email protected]

https://alzhack.wordpress.com/contribute-2/

Thank you!

Page 49: Alz Hack II