2015 find you - fuzzy big data - a case study

Finding You – Fuzzy Big Data: A Case Study

Philip Topham

Upload: socialwellth

Post on 08-Aug-2015


Category:

Data & Analytics



TRANSCRIPT


+

The Initial Question – Finding you?

The Approach

An Example

Issues

Digging In to Quality

Lessons Learned

A Case Study

+

Background: Case Study Background

+Who drives new pharma sales?

Who in the “crowd” is Influential?

+More questions…

What is Influence?

Who is Influential? Why?
• MDs vs PhDs
• Clinical Trials
• Grants
• Big Office

KOLs: Key Opinion Leaders

+Finding Influence: Traditional Approach

Costly “Primary” Research (ESTIMATING)
• Interviews
• Surveys
• Friends

Cheap “Secondary” Research (COUNTING)
• Desk Research
• Bibliometrics
• Google

+Finding Influence: A Better Way!

Social Network Analysis

+Finding Influence: A Better Way??????????

Social Network Analysis

Really hard.

Scale requires automation

Same named people

Tons of Data

What’s important?

+

SNA Primer: A very brief overview of Social Network “Influence” Analysis

+Who is the Influencer?

[Figure: ANALYZE → SOCIAL NETWORK, shown for two groups: SCQAA and AYSO (American Youth Soccer Organization)]

Influencers are “socially” prestigious leaders.

Social prestige is a sociologist’s way to measure importance to a group.

A Social Network Analysis Primer
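The idea of ranking people by social prestige can be illustrated with degree centrality, one textbook measure of how well connected a person is. This is only a minimal sketch: the deck does not say which measure it used, and the names and edges below are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return each node's share of the other nodes it is directly tied to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(ties) / (n - 1) for node, ties in neighbors.items()}


# Hypothetical collaboration ties (not from the case study).
edges = [("Ann", "Bob"), ("Ann", "Cat"), ("Ann", "Dan"), ("Bob", "Cat")]
scores = degree_centrality(edges)
top = max(scores, key=scores.get)
```

Here `top` is the person tied to the most others. Real prestige measures usually also weigh who those ties lead to, which this sketch ignores.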

+

An Example: Melanoma

+

The Melanoma Community

Category        Researchers   Share
Connected       21,500        56%
Disconnected    17,000        44%

* Date Range: 2005 – 2010

+The Most Influential

Category           Researchers   Share
Most Influential   316           <2%
Just Connected     16,700        44%

+Influence Variances: Central versus Strategic

United States: 59%

+Influence is “NOT” quantity: Japan 4th by Volume; 13th by Influence

Japan is on the community edge

Australia: 4%

Why in the center? Highest Melanoma Incidence Rate

Influence by Interest
• Genetics
• Guidelines
• Basic Science
• Pathology/Diagnosis
• Drug Therapy/Clinical Studies
• Immunology

+

Overview: The General Approach

+Who is in the crowd?

1. Data
• Medical research
• Clinical trials
• Patents
• Conferences/Symposiums
• Clinical practice

2. Relationships
• Bob “Knows” Mary
• Bob “Works with” Mary

3. Disambiguation
• Match “John Smith” correctly

+Who in the Crowd is Influential?

4. Social Network
• Map Relationships

5. Algorithms
• Rank Influence

6. Top Influentials
• Relative Social Prestige

+

Issues: Challenges and Solutions

+Beliefs - Assumptions

Data
Belief: More Is Better. Higher Fidelity. Smooths Errors.
• Medical research
• Clinical trials
• Patents
• Conferences/Symposiums
• Clinical practice
• Social Media (YouTube, Online Forums, Twitter, etc.)

Relationships
Belief: More Are Better. More Meaning. Smooths Errors.
• Research Head vs Assistants
• Professor vs Student
• Speaker vs Audience
• Trial Principal Investigator vs Investigator
• Primary Grant Investigator vs Sub-investigator
• Private practice doctor vs employed

+Beliefs - Assumptions

Name Matching
Belief: Repetitive. Easily Automated. Easily Solved.
• Unique IDs
• Name / Institution relatively unique
• Data windowing cuts problem

Visualizing
Belief: Off the Shelf. Many examples. Easy.
• Open source software
• Academic papers
• Commercial vendors (especially in Defense/Crime networks)

+Beliefs - Assumptions

Algorithms
Belief: Many Available. Easily Automated.
• Academic papers
• Open source software and algorithms

Influence
Belief: Simple Concept.
• Many books and papers
• “Sociology” concepts accepted

+Challenges

Data
Beliefs: More Is Better. Higher Fidelity. Smooths Errors.
Findings: More IS NOT better. Apples and Oranges. Creates Issues. Hides Errors.

Relationships
Beliefs: More Are Better. More Meaning. Smooths Errors.
Findings: More IS NOT BETTER. Quadratic data growth. Distorts Meaning. Increases Errors.

Possible relations among n people: n(n-1)/2

People   Relations
10       45
100      4,950
110      x
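The n(n-1)/2 formula counts every possible pair of people once. A small sketch reproduces the table values; the function name is an illustrative choice, not something from the deck.

```python
def possible_relations(n):
    """Number of distinct pairs among n people: n(n-1)/2."""
    return n * (n - 1) // 2


for people in (10, 100):
    print(people, possible_relations(people))
```

Ten times as many people gives roughly a hundred times as many possible relations, which is why adding more data sources quickly becomes hard to manage.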

+Challenges

Name Matching
Beliefs: Repetitive. Easily Automated. Easily Solved.
Findings: Very Very Difficult. Not Easy to FULLY automate. Expensive.

Visualizing
Beliefs: Off the Shelf. Many examples. Easy.
Findings: Exponential growth. Distorts Understanding (complex). Increases Errors.

Really Hard!

+Challenges

Algorithms
Beliefs: Many Available. Easily Automated.
Findings: A Few Available. Didn’t work at scale.

Influence
Beliefs: Simple Concept.
Findings: Wide industry differences. The most difficult “sales” problem.

+Solutions

Data + Relationships = Meaning

Carefully select data
• Data Windowing: By Subject, By Type

Semi-automated Data Matching: Good Enough!

Relationships Define You
• Provide context
• Define your sphere of influence

+Solutions

Networks + Algorithms = Influence

Relationships -- More important than visuals.

Split Visuals and Number Crunching

Algorithms are key

Influence is Relative

+

Finding Your Influence is:

Finding Your Sphere of Influence Within your Social Network

+

Digging In To Data Quality

+What Is Quality? What is Good?

Objective: Find Influencers.

Faster. Better. Or Cheaper?

• Speed is nice.

• Cheap is nice.

Accelerate Market Adoption

Replace inconsistent manual approaches with systematic measurable process

+What is Measurable?

Compare Traditional vs. New. Examine Results.

Compare
• Their list to our results
• Our results to their list

• John Smith
• Paul Herzog
• Mary Patterson

Think. Reconcile. Rework.

+What is Measurable?

First Attempts

Relationships [1]   Result
All                 MESS!!!
Separate            Unrelatable!
Normalized          Proven Different!

Next Iterations

Relationships [2]   Result
Focused Type        Promising.

[1] Publications, Clinical Trials and Grants
[2] Publications

Think. Reconcile. Rework.

+What is a relationship?

Think. Reconcile. Rework.

Why co-authorships?

What’s different between co-grantees, co-investigators, co-anything?

• Demonstrate actual relationship
• Information flow both ways
• Not “just” informational flow
• Long lasting and repetitive
• True social collaboration vs artifact of convenience

+Right person? Right relationship?

Think. Reconcile. Rework.

Publications are remarkably messy

• No Personal IDs
• Sabbaticals and changing organizations
• UCLA vs University of Los Angeles

+Use Academia’s Approach

Think. Reconcile. Rework.

Assume NO Match     Assume Match
John P Jones        J Jones
J P Jones           J Jones
J Jones             J Jones
John Jones          J Jones
-----------------   -----------
4 People            1 Person
(Lower Limit)       (Upper Limit)

Really 2 People: John Paul Jones and John Andrew Jones

This just didn’t work!!!
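The two bounding assumptions can be sketched as follows. Treating every distinct name string as a separate person gives one limit, and collapsing each name to first initial plus last name gives the other. The `loose_key` helper is a hypothetical illustration, not the method used in academia or in the case study.

```python
names = ["John P Jones", "J P Jones", "J Jones", "John Jones"]


def loose_key(name):
    """Collapse a name to (first initial, last name), ignoring the rest."""
    parts = name.split()
    return (parts[0][0].upper(), parts[-1].lower())


# Assume NO match: every distinct spelling is a different person.
assume_no_match = len(set(names))

# Assume match: everyone sharing first initial and last name is one person.
assume_match = len({loose_key(n) for n in names})

print(assume_no_match, assume_match)
```

The counts come out as 4 and 1, while the slide states that the real answer is 2 people, so neither bound alone identifies who is who.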

+Our rules engine

First   Middle   Last    Organization                              Keywords
John    P        Jones   UCLA                                      Cancer, Bone
J       P        Jones   University of California at Los Angeles   Cancer, Bone
J                Jones   USC, Los Angeles                          Cancer, Calcium
John             Jones   UC Riverside                              Cancer, Muscle

Rules, each producing a Scored Value:
• First + Last + Organization
• Initials + Last + Close Org
• Initials + Last + Keyword(s)
• ….
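A rules engine of this kind might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the weights (3, 2, 1), the `ORG_ALIASES` table standing in for "Close Org", and the function names. The deck gives only the rule names and the example records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    first: str
    middle: str
    last: str
    org: str
    keywords: frozenset


# Hypothetical alias table used to decide whether two organizations are "close".
ORG_ALIASES = {
    "ucla": "ucla",
    "university of california at los angeles": "ucla",
}


def canon_org(org):
    return ORG_ALIASES.get(org.lower(), org.lower())


def initials_match(a, b):
    """First initials agree, and middle initials agree when both are present."""
    if a.first[0].lower() != b.first[0].lower():
        return False
    if a.middle and b.middle:
        return a.middle[0].lower() == b.middle[0].lower()
    return True


def score(a, b):
    """Sum the (assumed) weights of every rule that fires for a pair."""
    if a.last.lower() != b.last.lower():
        return 0
    total = 0
    if a.first.lower() == b.first.lower() and a.org.lower() == b.org.lower():
        total += 3  # First + Last + Organization
    if initials_match(a, b) and canon_org(a.org) == canon_org(b.org):
        total += 2  # Initials + Last + Close Org
    if initials_match(a, b) and a.keywords & b.keywords:
        total += 1  # Initials + Last + Keyword(s)
    return total


r1 = Record("John", "P", "Jones", "UCLA", frozenset({"cancer", "bone"}))
r2 = Record("J", "P", "Jones", "University of California at Los Angeles",
            frozenset({"cancer", "bone"}))
r3 = Record("J", "", "Jones", "USC, Los Angeles", frozenset({"cancer", "calcium"}))

print(score(r1, r2), score(r1, r3))
```

Because every record in the table shares the keyword "Cancer", the keyword rule fires for every pair here, so even this small example would need further rules to separate the records.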

+Our rules engine


Think. Reconcile. Rework.

This just didn’t work!!!

Too Many Rules!

+Hire out the Problem

Think. Reconcile. Rework.

This just didn’t work!!!

Vendors’ Solution: ???

+Create our own approach!

Match Relationships, not Names!

Insight! People…
• Work Together!!!!
• Research the same area!!!

How do you match these people?

Article1 - P. Peters, J. Jones, S. Smith

Article2 - P. Peters, J. Jones, G. Gill

Article3 - J. Jones, G. Gill, M. Mason

+Start with the relationships

[Figure, built up over several slides: the author names from the three articles are drawn as a network, and duplicate name labels are merged step by step until five people remain: J. Jones, P. Peters, S. Smith, G. Gill, M. Mason.]

Think. Reconcile. Rework.

Does this really work?

Does this really work?

+Testing

Objective:

• Is it Accurate?

Interpretation:

• Can we make reliable decisions?
• Is the customer happy?
• Can the customer spot errors?
• What is the error rate?

+Testing: Discovering Error Rate

1. Manually match (2000) names, twice! With different people.

2. Automatically match (2000) names.

3. Compare each way.

Results: The automatic approach under-matched 2 names.

Impact: None.
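One way to compare a manual match against an automatic one is to turn each into the set of record pairs it joins and then take the differences. This is a sketch of that comparison only; the record IDs and person labels are hypothetical, and the deck does not say how its comparison was implemented.

```python
from itertools import combinations


def matched_pairs(assignment):
    """Set of record pairs that an assignment places in the same person."""
    by_person = {}
    for record, person in assignment.items():
        by_person.setdefault(person, []).append(record)
    pairs = set()
    for records in by_person.values():
        pairs.update(frozenset(p) for p in combinations(sorted(records), 2))
    return pairs


def compare(manual, automatic):
    """Pairs the automatic run missed (under) or wrongly joined (over)."""
    m, a = matched_pairs(manual), matched_pairs(automatic)
    return {"under_matched": m - a, "over_matched": a - m}


# Hypothetical data: record id -> person label (labels need not agree across runs).
manual = {"r1": "A", "r2": "A", "r3": "B", "r4": "B"}
automatic = {"r1": "x", "r2": "x", "r3": "y", "r4": "z"}
report = compare(manual, automatic)
```

Comparing pairs instead of labels means the two runs do not need to use the same person identifiers.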

+Testing: Is the client happy?

1. Use Client’s KOL List and compare to ours

2. Use Our KOL List and compare to theirs

These sound alike but are different.

• Why are we missing… ? Clients use Favorites (Bias)

• Why are you missing… ? Clients avoid Conflicts

Think. Reconcile. Rework.
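The two comparisons are set differences taken in opposite directions, which is why they answer different questions. In this sketch the names come from the earlier slide, but which list each name sits on is invented for illustration.

```python
# Hypothetical list membership.
client_list = {"John Smith", "Paul Herzog"}
our_list = {"Paul Herzog", "Mary Patterson"}

# "Why are we missing...?" : on the client's list, absent from ours.
missing_from_ours = client_list - our_list

# "Why are you missing...?" : on our list, absent from the client's.
missing_from_theirs = our_list - client_list

print(sorted(missing_from_ours), sorted(missing_from_theirs))
```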

+Testing: Why different?

Our results were proven to be “repeatable, measurable, and indicative of influence!”

Clients

• Different beliefs as to what influence is.
• Used different methodologies.
• Different market/cultural attitudes.

Think. Reconcile. Rework.

Why did different clients react so differently?

+Testing: Why different?

Why did different clients react so differently?

The industry had no ruler!

Everyone had a different starting point!

+

Lessons Learned

+Lessons Learned

Understand what the data represents.
• Co-authors ≠ co-inventors ≠ co-workers
• Relationships: one-way, two-way, information, conservative, non-conservative

Understanding is not universal.
• Just because you build an accurate ruler doesn’t mean people understand the ruler.

Data Attributes vs. Relationships
• Relationships are key

Think. Reconcile. Rework.

Be willing to…

+Lessons Learned

Good Enough is Not Good Enough.
• Understand why it’s good enough.
• Are errors material?

Validate your models with experts.

Challenge the Experts.

Think. Reconcile. Rework.

Be willing to…

+Lessons Learned

Be willing to fail, and try again.

A consistent, repeatable approach is more important than automating everything.

Think. Reconcile. Rework.

Be willing to…

+Thank you