Finding Nobel prize window by PageRank


Upload: yuji-fujita

Post on 16-Apr-2017


Finding Nobel prize window by PageRank

FUJITA Yuji, Turnstone Research Inst., Nihon Univ.

The Window

Citation count vs. PageRank

Graph and Network

Graph theory: part of mathematics

Network science: interdisciplinary study of

Graph theory

Physics

Social science

Informatics

particular topics from finance, biology, ...


Graph theory

Dates back to the 1730s

Objectives: lower-dimensional topological structure

Combinatorial and topological studies

Topics: four colour theorem

Invariants

From Wikipedia


Network science

Objectives: statistics and dynamics

Social, Financial, Technological themes

Topics: six degrees of separation

Scale-free networks

PageRank


Bibliometrics

Quantitative evaluation of (academic) documents

Conventional approach: number of citations

Citation network: node = paper, edge = citation

directed graph

A more faithful metric: PageRank

Citation vs PageRank

The most cited papers do not necessarily have the best PageRank score

Top articles

Clinical Medicine: Effects of an angiotensin-converting-enzyme inhibitor, ramipril, on cardiovascular events in high-risk patients
Clinical Medicine: Vitamin E supplementation and cardiovascular events in high-risk patients
Immunology: Cytotoxic T lymphocyte-associated antigen 4 plays an essential role in the function of CD25(+)CD4(+) regulatory cells that control intestinal inflammation
Immunology: Immunologic self-tolerance maintained by CD25(+)CD4(+) regulatory T cells constitutively expressing cytotoxic T lymphocyte-associated antigen 4
Physics: String theory and noncommutative geometry
Physics: Large-N limit of non-commutative gauge theories
Molecular Biology & Genetics: Smac, a mitochondrial protein that promotes cytochrome c-dependent caspase activation by eliminating IAP inhibition
Molecular Biology & Genetics: Identification of DIABLO, a mammalian protein that promotes apoptosis by binding to and antagonizing IAP proteins
Molecular Biology & Genetics: Systematic variation in gene expression patterns in human cancer cell lines
Molecular Biology & Genetics: A gene expression database for the molecular pharmacology of cancer

The Protein Data Bank
Effects of an angiotensin-converting-enzyme inhibitor, ramipril, on cardiovascular events in high-risk patients
The genome sequence of Drosophila melanogaster
String theory and noncommutative geometry
The complete atomic structure of the large ribosomal subunit at 2.4 angstrom resolution
Smac, a mitochondrial protein that promotes cytochrome c-dependent caspase activation by eliminating IAP inhibition
Identification of DIABLO, a mammalian protein that promotes apoptosis by binding to and antagonizing IAP proteins
The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000
Class switch recombination and hypermutation require activation-induced cytidine deaminase (AID), a potential RNA editing enzyme
Cytotoxic T lymphocyte-associated antigen 4 plays an essential role in the function of CD25(+)CD4(+) regulatory cells that control intestinal inflammation

Graph expression

Embedding: drawing on sphere/space

Matrix
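
As a minimal sketch of the matrix expression, the snippet below builds an adjacency matrix from a list of citations. The four papers and their citations are hypothetical toy data, and the orientation (row = cited paper, column = citing paper) is chosen to match the row/column convention given later for the transition matrix.

```python
# Hypothetical toy data: (citing, cited) pairs among four papers.
citations = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C")]


def adjacency_matrix(edges):
    """Return (nodes, matrix) with matrix[i][j] == 1 if nodes[j] cites nodes[i]."""
    nodes = sorted({name for edge in edges for name in edge})
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for citing, cited in edges:
        # Row = cited paper, column = citing paper.
        matrix[index[cited]][index[citing]] = 1
    return nodes, matrix


nodes, A = adjacency_matrix(citations)

# The conventional metric (number of citations) is simply the row sum.
citation_counts = [sum(row) for row in A]
```

Reading along a row lists who cites a paper; reading down a column lists what a paper cites.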


PageRank overview

A link from a great node counts for more: the score is a degree weighted by the importance of the linking nodes

But how can it be computed? The definition is circular, so a naive process can get lost in a loop.

Figure from The PageRank Citation Ranking: Bringing Order to the Web


Finite state Markov chain

Node: state. Transition matrix: moving along an edge

Row: linked (cited) vector

Column: link (cite) vector

Probability vector refreshed by multiplying the transition matrix

Steady state gives PageRank
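
The refresh step can be sketched in plain Python. The three-state matrix below is a made-up example, not data from the talk; its columns hold the outgoing probabilities, so each column sums to 1, and the function names are illustrative.

```python
def step(M, p):
    """One refresh of the probability vector: p_new = M p."""
    n = len(p)
    return [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]


def steady_state(M, tol=1e-12, max_iter=10_000):
    """Repeat the refresh from a uniform start until the vector stops changing."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        q = step(M, p)
        if sum(abs(a - b) for a, b in zip(p, q)) < tol:
            return q
        p = q
    return p


# Made-up chain: state 0 moves to 1; state 1 moves to 0 or 2; state 2 moves to 0 or 1.
M = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.5],
    [0.0, 0.5, 0.0],
]
p = steady_state(M)
```

For this particular chain the iteration settles near (1/3, 4/9, 2/9), which can be checked by hand against p = M p.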

Some Markov chains have a unique steady state

The steady state is given by an eigenvector: a vector x such that Mx = ax (with a = 1 for the steady state)

Eigenvectors are given by linear algebra: how to compute them is widely known
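
As a small worked instance of the eigenvector condition, assume a two-state chain whose columns hold the outgoing probabilities. The matrix is a made-up example, not from the slides. The steady state is the eigenvector for a = 1, normalised so that its entries sum to 1:

```latex
M = \begin{pmatrix} 0.9 & 0.5 \\ 0.1 & 0.5 \end{pmatrix}, \qquad
x = \begin{pmatrix} 5/6 \\ 1/6 \end{pmatrix}, \qquad
Mx = \begin{pmatrix} 0.9 \cdot \tfrac{5}{6} + 0.5 \cdot \tfrac{1}{6} \\[2pt]
                     0.1 \cdot \tfrac{5}{6} + 0.5 \cdot \tfrac{1}{6} \end{pmatrix}
   = \begin{pmatrix} 5/6 \\ 1/6 \end{pmatrix}
   = 1 \cdot x
```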

Why does PageRank work?

Not all citations are equally significant

Fewer citations can be a signal of even greater work: fundamental work is often not cited directly

Academic cascade
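
A toy example may make the cascade concrete. In the hypothetical graph below, paper A is cited only by B, while B is cited by four other papers. The damping factor of 0.85 and the uniform handling of papers that cite nothing are conventional assumptions, not settings taken from the talk.

```python
def pagerank(edges, damping=0.85, iterations=200):
    """Plain PageRank by repeated refresh; edges are (citing, cited) pairs."""
    nodes = sorted({name for edge in edges for name in edge})
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Papers that cite nothing hand their score to everyone equally.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new[w] += damping * rank[v] / len(out_links[v])
        rank = new
    return rank


# Hypothetical cascade: C, D, E, F cite B; B cites the fundamental paper A.
edges = [("B", "A"), ("C", "B"), ("D", "B"), ("E", "B"), ("F", "B")]
citation_count = {v: sum(1 for _, cited in edges if cited == v) for v in "ABCDEF"}
rank = pagerank(edges)
```

On this toy graph A collects one citation and B four, yet A ends up with the higher score, because all of B's weight flows on to A.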

Meanings of citation

Brainchild

History

Respect

Identity

Something more than a tag

To reach the top

Many great children: each child gives birth to many works

= great scientific achievement

Limitations

Prof. Yamanaka's work (Cell, 2006) has a poor PageRank score, which is a shame, to say the least.

Spam issues: not as serious as with naive citation counts

To practice

Get citation data: buy a commercial product or scrape

Transition matrix: random surfer model

Iterate the matrix-vector product: sparse matrix operation

Data

Thomson Reuters, Elsevier, ...

Scrape the web (arXiv, ...)

A common SQL server can hold the data

NLP required
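
One way the citation data might be held in SQL is sketched below with Python's built-in sqlite3 module. The table names, column names, and the three sample papers are all hypothetical; the slides do not specify a schema or a database product.

```python
import sqlite3

# In-memory database for illustration; a real data set would live on a server.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE paper (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        year  INTEGER
    );
    CREATE TABLE citation (
        citing INTEGER NOT NULL REFERENCES paper(id),
        cited  INTEGER NOT NULL REFERENCES paper(id),
        PRIMARY KEY (citing, cited)
    );
    """
)
conn.executemany(
    "INSERT INTO paper VALUES (?, ?, ?)",
    [(1, "Paper A", 1998), (2, "Paper B", 2001), (3, "Paper C", 2003)],
)
conn.executemany("INSERT INTO citation VALUES (?, ?)", [(2, 1), (3, 1), (3, 2)])

# Conventional metric: number of citations received by each paper.
counts = conn.execute(
    "SELECT cited, COUNT(*) FROM citation GROUP BY cited ORDER BY cited"
).fetchall()

# Edge list in (citing, cited) form, ready for building a transition matrix.
edges = conn.execute("SELECT citing, cited FROM citation ORDER BY citing, cited").fetchall()
```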

Transition matrix

Not every transition matrix has a unique steady-state eigenvector

Random surfer model: make the graph connected and let the walk escape from loops

(Figure: citation graph + uniform random jumps = connected graph)
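
The random surfer construction can be sketched as a blend of the original transition matrix with uniform random jumps. The damping value 0.85 is the conventional choice and an assumption here, the example graph is made up, and the sketch assumes every column of M already sums to 1 (no paper with zero outgoing links).

```python
def random_surfer_matrix(M, damping=0.85):
    """Blend a column-stochastic M with a uniform jump to any node."""
    n = len(M)
    jump = (1.0 - damping) / n
    return [[damping * M[i][j] + jump for j in range(n)] for i in range(n)]


# Made-up graph with two separate loops: 0 <-> 1 and 2 <-> 3.
# On its own this chain has no unique steady state.
M = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
G = random_surfer_matrix(M)

column_sums = [sum(G[i][j] for i in range(4)) for j in range(4)]
all_positive = all(entry > 0 for row in G for entry in row)
```

After the blend every entry is positive, so every node can reach every other node in one step, and the columns still sum to 1.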

Adaptation to papers

An old paper cannot cite a newer one: non-uniform random surfing

Adjust decay rate
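
A possible reading of non-uniform random surfing is sketched below: the surfer restarts at recent papers more often than at old ones, which compensates for old papers being unable to cite newer work. The exponential form of the decay, the parameter names, and the toy data are assumptions for illustration; the slides only say that a decay rate is adjusted.

```python
import math


def start_distribution(years, current_year, decay_years):
    """Restart probabilities that decay exponentially with paper age."""
    weights = {p: math.exp(-(current_year - y) / decay_years) for p, y in years.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


def age_weighted_pagerank(edges, years, current_year,
                          decay_years=5.0, damping=0.85, iterations=200):
    """PageRank whose random jumps follow the age-weighted start distribution."""
    start = start_distribution(years, current_year, decay_years)
    out_links = {p: [] for p in years}
    for citing, cited in edges:
        out_links[citing].append(cited)
    rank = dict(start)
    for _ in range(iterations):
        new = {p: (1.0 - damping) * start[p] for p in years}
        for p, targets in out_links.items():
            if targets:
                for t in targets:
                    new[t] += damping * rank[p] / len(targets)
            else:
                # A paper citing nothing restarts from the start distribution.
                for q in years:
                    new[q] += damping * rank[p] * start[q]
        rank = new
    return rank


# Hypothetical papers and citations.
years = {"old": 1990, "mid": 2000, "new": 2010}
edges = [("mid", "old"), ("new", "mid")]
start = start_distribution(years, current_year=2012, decay_years=5.0)
rank = age_weighted_pagerank(edges, years, current_year=2012)
```

A smaller decay_years value concentrates the restarts on the newest papers; a very large one approaches the uniform random surfer.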

Sparse matrix

Most of the elements are zero

Compressed form reduces space and time

libcsparse: made by UFL people and others, distributed under the LGPL
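
To illustrate what a compressed form looks like, the sketch below stores only the nonzero entries in compressed-column layout and multiplies the matrix by a vector. It is plain Python written for illustration and does not call libcsparse; the function names and the small matrix are made up.

```python
def to_csc(dense):
    """Compressed-column form: column pointers, row indices, nonzero values."""
    n_rows, n_cols = len(dense), len(dense[0])
    col_ptr, row_idx, values = [0], [], []
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0:
                row_idx.append(i)
                values.append(dense[i][j])
        # col_ptr[j + 1] marks where column j ends in row_idx / values.
        col_ptr.append(len(values))
    return col_ptr, row_idx, values


def csc_matvec(col_ptr, row_idx, values, x, n_rows):
    """Compute y = M x, touching only the stored nonzero entries."""
    y = [0.0] * n_rows
    for j in range(len(col_ptr) - 1):
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += values[k] * x[j]
    return y


M = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.5],
    [0.0, 0.5, 0.0],
]
col_ptr, row_idx, values = to_csc(M)
y = csc_matvec(col_ptr, row_idx, values, [1.0, 2.0, 3.0], n_rows=3)
```

The work per product is proportional to the number of nonzero entries rather than to the square of the number of papers, which is where the saving in space and time comes from.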

Reference

L. Page, S. Brin, R. Motwani, T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web.

D. Walker, H. Xie, K.-K. Yan, S. Maslov. Ranking Scientific Publications Using a Simple Model of Network Traffic.

P. Chen, H. Xie, S. Maslov, S. Redner. Finding Scientific Gems with Google.

Hajime BABA. Google - PageRank

Acknowledgment

Mr. Kazuhisa Takei for the Ruby interface to libcsparse via FFI

Dr. Mari Jibu for citation data handling

Dr. Wataru Souma for network-science suggestions and comments

Dr. Yoshi Fujiwara for choosing this topic and invitation

Free software developers

About me

2010- Turnstone Research Inst.

2011- Nihon Univ. researcher

2009-2010 finance sector

2007-2009 Network analysis at NiCT

2001-2007 Venture firm CEO

1994-2002 Discrete math graduate student

Ski, climbing, bicycle, art