Post on 19-Dec-2015
paper #63144
Oracle Data Mining and Epidemiological Analysis
Scott A. Rappoport, OCP
MTS Technologies
OracleWorld 2003
San Francisco, CA

Presentation Goals
Short intros
Vocabulary
  Present Basic Medical Terms
  Describe Data Mining Models and Terms
Synthesize
  What questions are we asking?
  Applying DM to Epidemiological issues
Demonstrate the DM4J components
The future:
  Challenges
  10g features

The DM Dimension
Data Mining capability that is readily accessible to end users opens a whole new dimension of what can be done in medicine.
New questions are being generated based on the availability of these new techniques.
This is a cutting edge (bleeding edge) advanced technique.

A few disclaimers…
Medical data is highly sensitive information… Thus:
  No personally identifiable info is presented
  No specific aggregated information on disease types, locations, or time is provided
  Scaled back list of attributes in demos
However, demos will give an indicative application of the technology.

About you
What percentage of the audience:
  Has a medical background?
    Physician
    Epidemiology/research/academic
  Has an IT background?
    DBA
    Developer
  Knows a lot about Data Mining? Statistics?
  Has at least two of the above? Three?

About me
Oracle Certified DBA and Developer
ASQ Certified Quality Engineer
Principal Architect, supporting the Naval Health Research Center in San Diego, CA
Instructor of Oracle, Data Warehouse, and Web Services courses at UCSD-Extension
Papers on Java, Data Warehousing – IOUG/ODTUG
Biochemistry degree / worked in a diagnostics firm
Son of a clinical pathologist

Medical Lexicon
Epidemiology: Study of the relationships of various factors determining the frequency and outbreak of disease.
Nosocomial: Outbreaks originating within a hospital.
Nosology: Study of the classification of diseases.
ICD9/10: International Classification of Diseases: v9 or 10. Classification of disease by major category, represented by a three-digit code, followed by a specific type, represented by a two-digit code.
DNBI: Disease Non-Battle Injury. Military classification of disease types.

Nosology/ICD9 Disease Classification
Over 12,000 separate diseases
Classified into 13 areas
Further sub-classed
Set off by 3 digit code, then additional 2 digit descriptor for better granularity
DNBI – military designations
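The two-part code structure described here can be sketched in a few lines of Python. This is only an illustration: it assumes the conventional dotted notation (three digits, a period, then up to two digits), the function names are hypothetical, and the codes used are made-up placeholders rather than real diagnoses.

```python
from collections import Counter

def split_icd9(code):
    """Split a numeric ICD-9 style code into (category, subtype).

    Assumption: the code is written as 'NNN' or 'NNN.N' / 'NNN.NN'.
    """
    category, _, subtype = code.strip().partition(".")
    if not (category.isdigit() and len(category) == 3):
        raise ValueError(f"expected a three-digit category in {code!r}")
    if subtype and not (subtype.isdigit() and len(subtype) <= 2):
        raise ValueError(f"expected at most two subtype digits in {code!r}")
    return category, subtype

def count_by_category(codes):
    """Roll detailed codes up to their three-digit major category."""
    return Counter(split_icd9(code)[0] for code in codes)

# Placeholder codes, for illustration only.
example_codes = ["123.45", "123.4", "678"]
category_counts = count_by_category(example_codes)
```

Rolling up to the three-digit category is the kind of coarse grouping a mining model might use when the detailed codes are too sparse.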

Epidemiological/Medical Practice Questions
What factors affect the onset of disease within a population?
What is the likelihood that a patient will require follow-up treatment, hospitalization, or that the case will worsen?
Are there particular clusters of patients that are more likely to develop a certain disease?
How often is a case mis-diagnosed?
Is a particular treatment likely to cure the ailment?

Summarizing the Concerns
Predictive concerns
Classification of risks and subjects
Attribute ranking concerns
Multi-factor relevance
Dealing with large numbers of attributes
Clustering questions
Unknown associations

Epidemiological techniques
Statistical packages
Chi-square
ANOVA / ANCOVA / MANOVA
Multi-variate Analysis (Attribute Scoring):
  Multiple Logistic Regression (binomial/dichotomous)
  Multiple Linear Regression (multiple/category)
Covariance 2x2 matrix
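As a reminder of what the classical approach computes, here is a minimal sketch of the Pearson chi-square statistic and the odds ratio for a 2x2 exposure/outcome table. The function names and the counts are made up for illustration, and no continuity correction is applied.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table

                  disease   no disease
    exposed          a          b
    unexposed        c          d
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one case")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

def odds_ratio(a, b, c, d):
    """Odds of disease among the exposed relative to the unexposed."""
    return (a * d) / (b * c)

# Hypothetical counts: 20 of 100 exposed and 10 of 100 unexposed fall ill.
statistic = chi_square_2x2(20, 80, 10, 90)
ratio = odds_ratio(20, 80, 10, 90)
```

With one degree of freedom, the statistic is compared against 3.841 for significance at the 0.05 level.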

Risk factors/classification
Environmental: exposure, location, job risks, diet
Genetic: Genetic markers present?
Clinical: Blood/other diagnostics data
Familial: Other family members? Who, what?
History: Past illnesses? What? When? How often?
Socio-economic: Job, married, education, age, gender
Lifestyle: Exercise, smoker, alcohol
Ethnic/National/Geographical

Patient Data Universe
[Figure: components of the Total Patient Description: Patient history, Diagnostic data, Family History, Treatment Fac/Personnel, Geographic factors, Lifestyle factors, Drug interactions, Genomic, Physician's note, Ethnic/race/national]
A vast amount of data can potentially be collected and mined in the patient data universe!

Reporting techniques/hierarchies
[Figure: hierarchy of reporting techniques built on Data; axis: User Sophistication]
Operational Reporting: What (specific events) happened yesterday?
Ad Hoc Queries: Why did that happen yesterday?
On-Line Analytical Processing (OLAP): What is likely to happen tomorrow (based on past trends/aggregations)?
Data Mining: What hidden associations or clusters of attributes may exist?

Reporting Examples
Query Technique / Reporting needs / Example
Operational reporting
  Reporting needs: Basic information on an event
  Example: Find the diagnosis of patient #A1234 on this date.
Ad-hoc
  Reporting needs: User-defined queries to help understand an event
  Example: Does the specific patient have a past history of such a diagnosis?
OLAP
  Reporting needs: Summarized data of events across many dimensions
  Example: What is the incidence rate of this disease among this patient type? For this area, season, hospital, etc.? Is this becoming more prevalent?
Data Mining
  Reporting needs: Attribute associations, predictive modeling, clustering of populations by attribute sets, across many attributes and records
  Example: What are the risk factors for this disease? What is the likelihood a treatment will succeed for a patient? What specific populations are at risk?

Data Mining Techniques
Classification: Seeks to find the attributes that best predict a dependent variable
Clustering: Seeks groupings of attributes in populations
Association: What is the likelihood that event A will lead to or occur with event B, C, or D…
Attribute Importance: Ranking of attributes based on their effects on a given dependent variable
Lift Model: Measures how well a model can identify a given target
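To make the lift idea concrete, here is a small sketch that ranks cases by model score and compares the target rate in the top slice with the rate in the whole population. The function name, scores, and labels are hypothetical; this is not the ODM lift computation itself.

```python
def lift_at(scores, labels, fraction):
    """Lift in the top `fraction` of cases when ranked by descending score.

    labels are 1 for the target (e.g. disease present) and 0 otherwise.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("no targets in the data, lift is undefined")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:k]) / k
    return top_rate / base_rate

# Made-up scores for ten cases, three of which are true targets.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
top_20_percent_lift = lift_at(scores, labels, 0.2)
```

A lift of 1 means the model does no better than random selection; values above 1 mean the top-ranked slice is enriched with targets.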

Data Mining Terms
Confusion Matrix: Tests model accuracy. Actual values are compared to predicted values and scored by incidence of false-positives / false-negatives.
False-negative: disease present, but the result does not show it
False-positive: disease not present, but the result shows it
Supervised learning: target value is specified. Classification / regression
Unsupervised learning: Relations/target attributes not known. Clusters/Assoc
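A minimal sketch of how a two-class confusion matrix is tallied from actual and predicted values. The function name and the sample values are made up for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a two-class target."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1  # FN: disease present, missed
        else:
            counts["FP" if p == positive else "TN"] += 1  # FP: disease absent, flagged
    return {key: counts[key] for key in ("TP", "FN", "FP", "TN")}

# Hypothetical test set: 1 = disease present, 0 = disease absent.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
matrix = confusion_matrix(actual, predicted)
accuracy = (matrix["TP"] + matrix["TN"]) / len(actual)
```

In a medical setting the two error cells are rarely equally costly, so it is worth reading FN and FP separately rather than relying on accuracy alone.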

Data Mining Terms (cont’d)
Support: The measure of how often the collection of items in an association occurs together, as a percentage of all the transactions.
Confidence: Confidence of rule "B given A" is the fraction of the transactions containing A that also contain B, i.e. how likely it is that B occurs when A has occurred.
ROC: Receiver Operating Characteristic. Used in Lift models to determine how well the model identifies targets as opposed to random selection.
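Support and confidence are simple ratios, sketched below in plain Python. The function names and the five transactions are made up for illustration; items are just labels.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(wanted <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent ("B given A")."""
    support_a = support(transactions, antecedent)
    if support_a == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support_a

# Hypothetical transactions.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"C"},
    {"A", "C"},
    {"A"},
]
support_ab = support(transactions, {"A", "B"})          # A and B together
confidence_a_b = confidence(transactions, {"A"}, {"B"})  # B given A
```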

Supervised/Unsupervised
Supervised
  Prediction odds of success
  Classification
    Model
    Test (obtain false-positives/negatives)
    Apply
    Lift
  Attribute Importance
    Determine attributes with the most effect on result
    Want to split on this attribute

Supervised/Unsupervised
Unsupervised
  No a priori knowledge
  Find hidden relations/associations/groupings
  Clustering
    What groups of subjects share values of attributes that are closely related?
  Associations
    Find events that are related; i.e., if A (and/or B) happens, what are the odds that C will happen?

Classification Modeling
Used to find a predictive model of independent attributes on the outcome of a dependent attribute
Algorithms: Naïve Bayes, Adaptive Bayes Network
[Figure: model diagram; surviving labels: Attributes, Branches, Number of Levels, Pruned]

Classification Model (cont’d)
Replaces:
  Multi-variate Analysis
    Multiple Logistic Regression (binomial/dichotomous)
    Multiple Linear Regression (multiple/category)
Questions:
  Given a set of factors, what is the likelihood that a disease will be expressed?
  What is the likelihood the disease will lead to a more severe ailment?
  What category (multi-option) of health based on inputs?
Classification Model: To Do’s
1. Create a model: Classification Build
2. Refine: Run an Attribute Importance Model to help define best attributes to “split”
3. Test the model: Classification Test
4. Predict results: Classification Apply
5. Targeting: Classification Lift
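The Build / Test / Apply steps can be illustrated outside the DM4J wizards with a tiny categorical Naïve Bayes in plain Python. This is a sketch only: it is not the ODM API, the function names are hypothetical, add-one (Laplace) smoothing is an assumption of this example, and the patient records are made up.

```python
import math
from collections import Counter, defaultdict

def classification_build(rows, target):
    """Build: count class frequencies and per-class attribute values."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    distinct = defaultdict(set)          # attribute -> values seen in training
    for row in rows:
        for attr, value in row.items():
            if attr != target:
                value_counts[(attr, row[target])][value] += 1
                distinct[attr].add(value)
    return class_counts, value_counts, distinct

def classification_apply(model, case):
    """Apply: return the most probable class for one unlabeled case."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for cls, n in class_counts.items():
        score = math.log(n / total)
        for attr, value in case.items():
            # Add-one smoothing so an unseen value does not zero the product.
            seen = value_counts[(attr, cls)][value]
            score += math.log((seen + 1) / (n + len(distinct[attr])))
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

def classification_test(model, rows, target):
    """Test: fraction of held-out cases whose class is predicted correctly."""
    hits = 0
    for row in rows:
        case = {k: v for k, v in row.items() if k != target}
        hits += classification_apply(model, case) == row[target]
    return hits / len(rows)

# Made-up training and held-out records.
training = [
    {"smoker": "yes", "exercise": "low", "outcome": "ill"},
    {"smoker": "yes", "exercise": "low", "outcome": "ill"},
    {"smoker": "yes", "exercise": "high", "outcome": "ill"},
    {"smoker": "no", "exercise": "low", "outcome": "well"},
    {"smoker": "no", "exercise": "high", "outcome": "well"},
    {"smoker": "no", "exercise": "high", "outcome": "well"},
]
held_out = [
    {"smoker": "yes", "exercise": "high", "outcome": "ill"},
    {"smoker": "no", "exercise": "low", "outcome": "well"},
]
model = classification_build(training, "outcome")
prediction = classification_apply(model, {"smoker": "yes", "exercise": "low"})
held_out_accuracy = classification_test(model, held_out, "outcome")
```

Keeping the held-out records separate from the training records is what makes the Test step meaningful; testing on the build data overstates accuracy.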

Clustering
Unsupervised model that attempts to find groups within the population that share similar attributes
Algorithm: k-means, O-Cluster
[Figure: clusters C1 and C2 plotted on AGE and INCOME axes, with Centroids and Histograms (Age, Rank). Courtesy Charlie Berger, Oracle]

Clustering (cont’d)
k-means only takes numeric values, and requires the number of clusters to be specified. Good for smaller datasets with fewer attributes.
O-Cluster: more robust than k-means
Questions:
  What groups of people are present in a population, and what are their common attributes?
  How are the members distributed along those attributes?
  Are there given clusters of people related to a specific disease family? Are members more or less susceptible?
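The two k-means constraints noted above (numeric inputs, cluster count chosen up front) are visible in even the smallest implementation. The sketch below is a plain Lloyd-style k-means, not Oracle's enhanced version; seeding the centroids with the first k points is an assumption made to keep the example deterministic, and the (age, income) pairs are made up. It needs Python 3.8+ for `math.dist`.

```python
import math

def k_means(points, k, iterations=20):
    """Cluster numeric tuples into k groups; returns (centroids, assignment)."""
    centroids = [tuple(p) for p in points[:k]]  # assumption: first k points as seeds
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Hypothetical (age, income in thousands) pairs.
people = [(22, 20), (25, 24), (24, 22), (58, 80), (61, 85), (60, 78)]
centroids, assignment = k_means(people, k=2)
```

Because distances are computed on raw values, attributes on very different scales would normally be normalized first.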

Association Models
Unsupervised model that returns a set of rules determining if one or more attributes are associated with other attributes.
Scored by support/confidence
What is the likelihood of A happening if B happens?
Often used with sparsely populated data sets.
Questions:
  What is the relationship between overweight recruits, smoking, and attrition in boot camp?
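To show what "a set of rules scored by support/confidence" looks like, here is a deliberately simplified sketch that only enumerates single-item rules of the form A -> B and keeps those above two thresholds. It is not the algorithm ODM uses, the function name is hypothetical, and the recruit records are made up.

```python
from itertools import combinations

def pair_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for rules A -> B."""
    sets = [set(t) for t in transactions]
    n = len(sets)
    items = sorted(set().union(*sets))
    count = {item: sum(item in s for s in sets) for item in items}
    rules = []
    for a, b in combinations(items, 2):
        both = sum(a in s and b in s for s in sets)
        if both / n < min_support:
            continue  # the pair is too rare to report
        for antecedent, consequent in ((a, b), (b, a)):
            rule_confidence = both / count[antecedent]
            if rule_confidence >= min_confidence:
                rules.append((antecedent, consequent, both / n, rule_confidence))
    return rules

# Made-up recruit records; each set lists the flags present for one recruit.
recruits = [
    {"overweight", "smoker", "attrition"},
    {"overweight", "attrition"},
    {"smoker"},
    {"overweight", "smoker", "attrition"},
    {"smoker"},
    {"overweight"},
    set(),
    {"smoker", "attrition"},
]
rules = pair_rules(recruits, min_support=0.3, min_confidence=0.7)
```

Raising the thresholds trims the rule list; lowering them surfaces weaker associations at the cost of more noise.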

Applications/Demos
Review of the parts of the process:
  JDeveloper9i layout, model wizards, creation, run
  ODM Browser: task review, navigation, results
Creation of models in JDeveloper9i with DM4J Wizards
  Clustering Model: Build and analyze histograms
  Association Model: Build, analyze rules
  Classification Model: Build, Test, Apply, Lift
  Attribute Importance

Challenges
Most data sources have not been modeled to collect the range of data needed.
Bio-informatics opens a whole new range of study not even imagined a few years ago.
Data Stores are inconsistent.
Doctors' notes are not uniform.
Legacy Apps are a mess. (COBOL, poorly documented, personnel retired…)

More challenges
Vast amounts of data/processing
Confusion matrix on attributes with large categories.
Structuring questions “to peel away” masking factors, and be sensitive to subtle associations
Bringing it to the masses
Overcoming resistance to change.

New Native 10g Features
Text Mining – to help us search through physicians’ notes
Support Vector Machines (SVM): “Neural Networks on Steroids.”
Non-negative Matrix Factorization (NMF): Algorithm to help “boil down” many attributes into a manageable set.
Enhanced Bio-informatics support in the DB.
Transformation creation (currently alpha)

Summary
Covered a multi-disciplinary topic
Attempted to show how DM is uniquely suited to Epidemiological study
Showed the ease by which models can be made
Still, model creation requires trained personnel
Many challenges remain to fully exploit this technology.

Special Thanks to….
Mark Kelly, Oracle Data Mining
Robert Haberstoh, Oracle Data Mining
Charlie Berger, Director Oracle Data Mining

Follow-up
Please fill out the on-line survey
Session #63144
Feel free to contact me:
Scott Rappoport, OCP
Principal Technical Staff Member
MTS [email protected]