2015 find you - fuzzy big data - a case study
+
The Initial Question – Finding you?
The Approach
An Example
Issues
Digging In to Quality
Lessons Learned
A Case Study
+More questions…
What is Influence?
Who is Influential? Why?
• MDs vs PhDs
• Clinical Trials
• Grants
• Big Office
KOLs: Key Opinion Leaders
+Finding Influence: Traditional Approach
Costly “Primary” Research
Cheap “Secondary” Research
Desk Research
Bibliometrics
COUNTING
Interviews
Surveys
Friends
ESTIMATING
+Finding Influence: A Better Way?
Social Network Analysis
Really hard.
Scale requires automation
Same-named people
Tons of Data
What’s important?
+Who is the Influencer?
ANALYZE
SCQAA
AYSO1
American Youth Soccer Organization1
ANALYZE
SOCIAL NETWORK
SOCIAL NETWORK
Influencers are “socially” prestigious leaders.2
Social prestige is a sociologist’s way to measure importance to a group.2
A Social Network Analysis Primer
+
* Date Range: 2005 – 2010
The Melanoma Community
Category      Researchers  Share
Connected     21,500       56%
Disconnected  17,000       44%
Genetics
Guidelines
Basic Science
Pathology/Diagnosis
Drug Therapy/Clinical Studies
Immunology
Influence by Interest
+Who is in the crowd?
1. Data
• Medical research
• Clinical trials
• Patents
• Conferences/Symposiums
• Clinical practice
2. Relationships
• Bob “Knows” Mary.
• Bob “Works with” Mary.
3. Disambiguation
• Match “John Smith” correctly
+Who in the Crowd is Influential?
4. Social Network
• Map Relationships
5. Algorithms
• Rank Influence
6. Top Influentials
• Relative Social Prestige
+Beliefs - Assumptions
Data
• Medical research
• Clinical trials
• Patents
• Conferences/Symposiums
• Clinical practice
• Social Media (YouTube, Online Forums, Twitter etc…)
Beliefs: More Is Better. Higher Fidelity. Smooths Errors.
Relationships
• Research Head vs Assistants
• Professor vs Student
• Speaker vs Audience
• Trial Principal Investigator vs Investigator
• Primary Grant Investigator vs Sub-investigator
• Private practice doctor vs employed
Beliefs: More Are Better. More Meaning. Smooths Errors.
+Beliefs - Assumptions
Name Matching
• Unique IDs
• Name / Institution relatively unique
• Data windowing cuts problem
Beliefs: Repetitive. Easily Automated. Easily Solved.
Visualizing
• Open source software
• Academic papers
• Commercial vendors (especially in Defense/Crime networks)
Beliefs: Off the Shelf. Many examples. Easy.
+Beliefs - Assumptions
Algorithms
• Academic Papers
• Open source software and algorithms
Beliefs: Many Available. Easily Automated.
Influence
• Many books and papers
• “Sociology” concepts accepted
Beliefs: Simple Concept.
+Challenges
Data
Beliefs: More Is Better. Higher Fidelity. Smooths Errors.
Findings: More IS NOT Better. Apples and Oranges. Creates Issues. Hides Errors.
Relationships
Beliefs: More Are Better. More Meaning. Smooths Errors.
Findings: More IS NOT Better. Quadratic data growth. Distorts Meaning. Increases Errors.
Possible relations among n people: n(n-1)/2
People  Relations
10      45
100     4,950
10x the people, 110x the relations
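The n(n-1)/2 figures on this slide can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name `possible_relations` is made up for illustration.

```python
def possible_relations(people: int) -> int:
    """Number of distinct pairs among `people` individuals: n(n-1)/2."""
    return people * (people - 1) // 2


small = possible_relations(10)    # pairs among 10 people
large = possible_relations(100)   # pairs among 100 people
growth = large // small           # how much the pair count grew for 10x the people

print(small, large, growth)
```

Ten times the people gives roughly a hundred times the candidate relationships, which is why adding more data made the relationship problem harder rather than easier.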
+Challenges
Name Matching
Beliefs: Repetitive. Easily Automated. Easily Solved.
Findings: Very, Very Difficult. Not Easy to FULLY automate. Expensive.
Visualizing
Beliefs: Off the Shelf. Many examples. Easy.
Findings: Exponential growth. Distorts Understanding (complex). Increases Errors.
Really Hard!
+
Challenges
Algorithms
Beliefs: Many Available. Easily Automated.
Findings: A Few Available. Didn’t work at scale.
Influence
Beliefs: Simple Concept.
Findings: Wide industry differences. The most difficult “sales” problem.
+Solutions
Data + Relationships = Meaning
Carefully select data
• Data Windowing: By Subject, By Type
Semi-automated Data Matching
• Good Enough!
Relationships Define You
• Provide context
• Define your sphere of influence
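Data windowing can be pictured as a plain filter applied before any matching is attempted. The sketch below is an assumption about what such a filter might look like; the `Record` fields, the `window` function, and the sample rows are all hypothetical. The 2005 to 2010 range and the melanoma subject echo the example earlier in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    title: str
    subject: str   # e.g. "melanoma"
    kind: str      # e.g. "publication", "clinical_trial", "grant"
    year: int


def window(records, subject, kinds, first_year, last_year):
    """Keep only records for one subject, the chosen types, and a date range."""
    return [
        r for r in records
        if r.subject == subject
        and r.kind in kinds
        and first_year <= r.year <= last_year
    ]


# Made-up sample rows for illustration only.
records = [
    Record("Paper A", "melanoma", "publication", 2007),
    Record("Trial B", "melanoma", "clinical_trial", 2008),
    Record("Paper C", "melanoma", "publication", 2003),
    Record("Paper D", "lung cancer", "publication", 2009),
]

selected = window(records, "melanoma", {"publication"}, 2005, 2010)
print([r.title for r in selected])
```

Windowing by subject and by type keeps unlike sources from being mixed, and it shrinks the set of names that have to be disambiguated.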
+Solutions
Networks + Algorithms = Influence
Relationships -- More important than visuals.
Split Visuals and Number Crunching
Algorithms are key
Influence is Relative
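The deck does not name the ranking algorithm it used, so the sketch below swaps in normalized degree centrality, one of the simplest social-network measures, purely to illustrate "Networks + Algorithms = Influence". The function name and the three made-up papers are assumptions. Scores are divided by the largest possible number of neighbours, so they only mean something relative to the other people in the same network.

```python
from itertools import combinations


def degree_centrality(groups):
    """Share of the other people in the network each person is directly linked to."""
    neighbors = {}
    for members in groups:
        # Everyone who appears together in a group is linked pairwise.
        for a, b in combinations(members, 2):
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 2:
        return {person: 0.0 for person in neighbors}
    return {person: len(linked) / (n - 1) for person, linked in neighbors.items()}


# Hypothetical co-author lists, one list per paper.
papers = [
    ["P. Peters", "J. Jones", "S. Smith"],
    ["P. Peters", "J. Jones", "G. Gill"],
    ["J. Jones", "G. Gill", "M. Mason"],
]

scores = degree_centrality(papers)
# Highest score first; ties broken alphabetically.
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
for person, score in ranked:
    print(f"{person}: {score:.2f}")
```

Note that no drawing is involved: the number crunching runs on the relationship data alone, which is the point of splitting visuals from computation.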
+What Is Quality? What is Good?
Objective: Find Influencers.
Faster. Better. Or Cheaper?
• Speed is nice.
• Cheap is nice.
Accelerate Market Adoption
Replace inconsistent manual approaches with a systematic, measurable process
+What is Measurable?
Compare Traditional vs. New
Examine Results
Compare
• Their list to our results
• Our results to their list
John Smith, Paul Herzog, Mary Patterson
Think. Reconcile. Rework.
+What is Measurable?
First Attempts
Relationships¹   Result
All              MESS!!!
Separate         Unrelatable!
Normalized       Proven Different!
Next Iterations
Relationships²   Result
Focused Type     Promising.
¹ Publications, Clinical Trials and Grants
² Publications
Think. Reconcile. Rework.
+What is a relationship?
Think. Reconcile. Rework.
Why co-authorships?
What’s different between co-grantees, co-investigators, co-anything?
• Demonstrate actual relationship
• Information flow both ways
• Not “just” informational flow
• Long lasting and repetitive
• True social collaboration vs artifact of convenience
+Right person? Right relationship?
Think. Reconcile. Rework.
Publications are remarkably messy
• No Personal IDs
• Sabbaticals and changing organizations
• UCLA vs University of Los Angeles
+Use Academia’s Approach
Think. Reconcile. Rework.
Assume NO Match vs. Assume Match
No Match          Match
John P Jones      J Jones
J P Jones         J Jones
J Jones           J Jones
John Jones        J Jones
----------------- -----------
4 People          1 Person
(Lower Limit)     (Upper Limit)
Really 2 People: John Paul Jones, John Andrew Jones
This just didn’t work!!!
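The two bounding assumptions on this slide can be sketched as two ways of counting distinct keys. This is an illustration under stated assumptions, not the procedure any particular study uses: `count_under_both_assumptions` is a made-up name, and "match" is taken here to mean collapsing every name to first initial plus last name.

```python
def count_under_both_assumptions(names):
    """Return (people if nothing matches, people if everything collapsible matches)."""
    # Assume NO match: every distinct spelling is a different person.
    assume_no_match = len({name.strip().lower() for name in names})

    # Assume match: collapse each name to (first initial, last name).
    def collapsed(name):
        parts = name.lower().split()
        return (parts[0][0], parts[-1])

    assume_match = len({collapsed(name) for name in names})
    return assume_no_match, assume_match


names = ["John P Jones", "J P Jones", "J Jones", "John Jones"]
assume_no_match, assume_match = count_under_both_assumptions(names)
print(assume_no_match, assume_match)
```

The true answer on the slide (2 people) sits between the two counts, and neither count says which records belong to which person, so the bounds alone cannot drive an influence ranking.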
+Our rules engine
First Middle Last   Organization                             Keywords
John P Jones        UCLA                                     Cancer, Bone
J P Jones           University of California at Los Angeles  Cancer, Bone
J Jones             USC, Los Angels                          Cancer, Calcium
John Jones          UC Riverside                             Cancer, Muscle
Rules, each producing a Scored Value:
• First + Last + Organization
• Initials + Last + Close Org
• Initials + Last + Keyword(s)
• ….
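A scored rules engine of this kind might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the weights (100, 80, 50), the `ORG_ALIASES` table standing in for "Close Org", and the function names. Only the three rule shapes and the sample rows come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    first: str
    middle: str
    last: str
    org: str
    keywords: frozenset


# Hypothetical alias table; a real "close org" test would need far more than this.
ORG_ALIASES = {"university of california at los angeles": "ucla"}


def norm_org(org):
    key = org.strip().lower()
    return ORG_ALIASES.get(key, key)


def initials_match(a, b):
    """First initials agree, and middle initials agree when both are present."""
    if a.first[:1].lower() != b.first[:1].lower():
        return False
    if a.middle and b.middle:
        return a.middle[:1].lower() == b.middle[:1].lower()
    return True


def score(a, b):
    """Return a match score from the first rule that fires (weights are made up)."""
    if a.last.lower() != b.last.lower():
        return 0
    same_org = norm_org(a.org) == norm_org(b.org)
    full_first = len(a.first) > 1 and a.first.lower() == b.first.lower()
    if full_first and same_org:
        return 100  # First + Last + Organization
    if initials_match(a, b) and same_org:
        return 80   # Initials + Last + Close Org
    if initials_match(a, b) and a.keywords & b.keywords:
        return 50   # Initials + Last + Keyword(s)
    return 0


r1 = Author("John", "P", "Jones", "UCLA", frozenset({"cancer", "bone"}))
r2 = Author("J", "P", "Jones", "University of California at Los Angeles",
            frozenset({"cancer", "bone"}))
r3 = Author("J", "", "Jones", "USC, Los Angels", frozenset({"cancer", "calcium"}))
r4 = Author("John", "", "Jones", "UC Riverside", frozenset({"cancer", "muscle"}))

print(score(r1, r2), score(r1, r3), score(r1, r4))
```

In this sketch the broad keyword "cancer" alone is enough to link records from three different institutions. Each such false link invites another special-case rule, which gives a feel for how a rule set can keep growing.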
+Our rules engine
Think. Reconcile. Rework.
This just didn’t work!!!
Too Many Rules!
+Create our own approach!
Match Relationships, not Names!
Insight! People…
• Work together!
• Research the same area!
How do you match these people?
Article1 - P. Peters, J.Jones, S.Smith
Article2 - P. Peters, J.Jones, G.Gill
Article3 - J.Jones, G.Gill, M. Mason
+Start with the relationships
[Figure: one co-author graph per article. Article1 links P. Peters, J.Jones, S.Smith; Article2 links P. Peters, J.Jones, G.Gill; Article3 links J.Jones, G.Gill, M.Mason]
+Start with the relationships
[Figure: the per-article graphs merged into one network with nodes J.Jones, P. Peters, S.Smith, G.Gill, M.Mason]
Think. Reconcile. Rework.
Does this really work?
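One way to picture "match relationships, not names" is sketched below. It is not the deck's actual algorithm, which is not described in detail. The assumed rule is that two mentions of the same name string are the same person when the articles share at least one other co-author, and mentions are then grouped transitively. The function names are hypothetical, and the sketch treats co-author names as unambiguous, which real data would not allow.

```python
from itertools import combinations

articles = {
    "Article1": ["P. Peters", "J.Jones", "S.Smith"],
    "Article2": ["P. Peters", "J.Jones", "G.Gill"],
    "Article3": ["J.Jones", "G.Gill", "M. Mason"],
}


def share_coauthor(name, art_a, art_b, articles):
    """Assumed rule: same name string plus at least one shared co-author."""
    co_a = set(articles[art_a]) - {name}
    co_b = set(articles[art_b]) - {name}
    return bool(co_a & co_b)


def cluster_mentions(name, articles):
    """Group the articles mentioning `name` into clusters, one per presumed person."""
    mentions = [art for art, authors in articles.items() if name in authors]
    parent = {art: art for art in mentions}

    def find(art):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[art] != art:
            parent[art] = parent[parent[art]]
            art = parent[art]
        return art

    for a, b in combinations(mentions, 2):
        if share_coauthor(name, a, b, articles):
            parent[find(a)] = find(b)

    groups = {}
    for art in mentions:
        groups.setdefault(find(art), []).append(art)
    return sorted(sorted(group) for group in groups.values())


clusters = cluster_mentions("J.Jones", articles)
print(clusters)
```

Article1 and Article3 share no co-author directly, yet they end up in the same cluster because Article2 bridges them through P. Peters and G.Gill. The chain of relationships carries the match, not the name alone.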
+Testing
Objective:
• Is it Accurate?
Interpretation:
• Can we make reliable decisions?
• Is the customer happy?
• Can the customer spot errors?
• What is the error rate?
+Testing: Discovering Error Rate
1. Manually match 2,000 names twice, with different people each time.
2. Automatically match 2,000 names.
3. Compare each way.
Results: The automatic approach under-matched 2 names.
Impact: None.
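Two under-matched names out of 2,000 is an error rate of 0.1%. The comparison step itself might be sketched as below, assuming each matching pass is stored as a mapping from a name mention to the person it was assigned to. The function names and the four-mention sample are made up for illustration.

```python
from itertools import combinations


def matched_pairs(assignment):
    """Set of mention pairs that a matching pass judged to be the same person."""
    by_person = {}
    for mention, person in assignment.items():
        by_person.setdefault(person, []).append(mention)
    pairs = set()
    for mentions in by_person.values():
        pairs.update(frozenset(pair) for pair in combinations(sorted(mentions), 2))
    return pairs


def compare(manual, automatic):
    """Pairs the automatic pass missed (under-matched) or wrongly merged (over-matched)."""
    manual_pairs = matched_pairs(manual)
    auto_pairs = matched_pairs(automatic)
    return {
        "under_matched": manual_pairs - auto_pairs,
        "over_matched": auto_pairs - manual_pairs,
    }


# Hypothetical data: reviewers merged m3 and m4, the automatic pass kept them apart.
manual = {"m1": "A", "m2": "A", "m3": "B", "m4": "B"}
automatic = {"m1": "x", "m2": "x", "m3": "y", "m4": "z"}

result = compare(manual, automatic)
print(result)
```

Comparing pairs rather than labels means the two passes do not need to agree on person identifiers, only on who was grouped with whom.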
+Testing: Is the client happy?
1. Use Client’s KOL List and compare to ours
2. Use Our KOL List and compare to theirs
These sound alike but are different.
• Why are we missing… ? Clients use Favorites (Bias)
• Why are you missing… ? Clients avoid Conflicts
Think. Reconcile. Rework.
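The two comparisons are two different set differences, which is why they sound alike but answer different questions. A minimal sketch follows, with a made-up function name and made-up list memberships (the names are borrowed from the earlier slide).

```python
def compare_kol_lists(client_kols, our_kols):
    """Split two KOL lists into agreed names and the names only one side has."""
    client, ours = set(client_kols), set(our_kols)
    return {
        "agreed": client & ours,
        "only_on_client_list": client - ours,  # "Why are we missing… ?"
        "only_on_our_list": ours - client,     # "Why are you missing… ?"
    }


# Hypothetical lists for illustration only.
client_list = ["John Smith", "Paul Herzog"]
our_list = ["Paul Herzog", "Mary Patterson"]

report = compare_kol_lists(client_list, our_list)
print(report)
```

Each bucket needs its own explanation: names only on the client list point at client favorites or gaps in the data, while names only on the computed list point at people the client avoided or overlooked.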
+Testing: Why different?
Our results were proven to be “repeatable, measurable, and indicative of influence!”
Clients
• Different beliefs as to what is influence.
• Used different methodologies.
• Different market/cultural attitudes.
Think. Reconcile. Rework.
Why did different clients react so differently?
+Testing: Why different?
Why did different clients react so differently?
The industry had no ruler!
Everyone had a different starting point!
+Lessons Learned
Understand what the data represents.
• Co-authors ≠ co-inventors ≠ co-workers
• Relationships: one-way, two-way, information, conservative, non-conservative
Understanding is not universal.
• Just because you build an accurate ruler doesn’t mean people understand the ruler.
Data Attributes vs. Relationships
• Relationships are key
Think. Reconcile. Rework.
Be willing to…
+Lessons Learned
Good Enough is Not Good Enough.
Understand why it’s good enough.
Are errors material?
Validate your models with experts.
Challenge the Experts.
Think. Reconcile. Rework.
Be willing to…
+Lessons Learned
Be willing to fail, and try again.
A consistent, repeatable approach is more important than automating everything.
Think. Reconcile. Rework.
Be willing to…