biblio metrics

Upload: dahrren-grace-jalon

Post on 18-Oct-2015


  • BIBLIOMETRICS
    Tefko Saracevic
    Rutgers University
    http://www.scils.rutgers.edu/~tefko

    Tefko Saracevic

  • What is bibliometrics?
    Pritchard: all studies which seek to quantify processes of written communication.
    Fairthorne: the quantitative treatment of the properties of recorded discourse and behavior pertaining to it.
    Recorded communication (literature) -> quantitative methods

  • Alan Pritchard, 1969
    Coined the term "bibliometrics": "the application of mathematics and statistical methods to books and other media of communication"
    Journal of Documentation (1969) 25(4):348-349

  • ... and other related metrics
    Also used to study more than books and articles
    Scientometrics: covering science in general, not just publications
    Infometrics: all information objects
    Webmetrics or cybermetrics: web connections, manifestations
      using bibliometric techniques to study the relationship or properties of different sites on the web

  • Concepts
    Basic (primitive) concepts:
      1. Subject
      2. Recorded communication -> document, information object
      3. Subject literature
    Bibliometrics is related to:
      science of science
      sociology of science - numerical methods

  • Literature studies
    Qualitative: often in humanities, librarianship
    Quantitative: bibliometrics
    Mixed

  • Reasons for quantitative studies of literature
    Analysis of structure and dynamics
      search for regularities - predictions possible
    Understanding of patterns
      order out of documentary chaos
      verification of models, assumptions
    Rationale for policies & design

  • Why quantitative studies?
    Qualitative methods often depend on assertions, authoritative statements, anecdotal evidence
    Science searches for regularities
    Success of statistical methods in social sciences
    Need for justification & basis for decisions
    Something can be counted - irresistible

  • Application in ...
    History of science
    Sociology of science
    Science policy; resource allocation
    Library selection, weeding, policies
    Information organization
    Information management
      utilization

  • Historical note
    Bibliometrics long precedes information science
    But it found an intellectual home in information science
      study of a basic phenomenon - literature
    It is not hot lately, but still produces very interesting results
    Branched out into web studies (the web is a literature as well)

  • What is studied?
    Governed by data available in documents or information resources in general - that which can be counted
    author(s)
    origin
      organization, country, language
    source
      journal, publisher, patent

  • What more
    contents
      text, parts of text, subject, classes
    representation
    citations
      to a document, in a document, co-citation
    utilization
      circulation, various uses
    links
    any other quantifiable attribute

  • Tools
    Science Citation Index
    Compilation of variables from journals in a subject
    Use data
    Publication counts from indexes, or other databases
    Web structures, links

  • Variable: authors
    number in a subject, field, institution, country
    growth, correlation with indicators like GNP, energy, etc.
    productivity, e.g. Lotka's law
    collaboration - co-authorship, associated networks
    dynamics - productive life, transience, epidemics
    papers/author in a subject
    mapping

  • Variable: origin
    Rates of production, size, growth by country, institution, language, subject
    Comparison between these
    Correlation with economic & other indicators

  • Variable: sources
    Concentration most often on journals
    Growth, dynamics, numbers
      information explosion - exponential laws
      time movements, life cycles
    Scatter - quantity/yield distribution
      Bradford's law
    Various distributions by subject, language, country

  • Variable: contents
    Analysis of texts
      distribution of words: Zipf's law
      words, phrases in various parts
      subject analysis, classification
      co-word analysis

  • Variable: representation
    frequency of use of index terms, classes
    distribution laws - key terms where?
    thesaurus structure

  • Variable: citations
    Studied a lot; many pragmatic results
      base for citation indexes, Web of Science, impact factors, co-citation studies, etc.
    Derived:
      number of references in articles
      number of citations to articles
      research front; citation classics
      bibliographic coupling

  • Citations, more
    co-citations
      author connections, subject structure, networks, maps
    centrality
      of authors, papers
    validation with qualitative methods
    impact

  • Variable: utilization
    frequency
      distribution of requests for sources, titles, e.g. 20/80 law
    relevance judgement distributions
    circulation patterns
    use patterns

  • Variable: links
    Development of link-based metrics
      in-links, out-links
    Web structure
    Web page depth; update
    PageRank vs quality

  • Examples from classic studies
    Comparative publications over centuries
    Number of journals founded over time
    Number of abstracts published over time
    National share of abstracts in chemistry
    National scientific size vs. economy size
    Bibliographic coupling and co-citation
    Web structures, links

  • Examples of laws & methods
    Lotka's law
    Bradford's law
    Zipf's law
    Impact factor
    Citation structures
    Co-citation structures

  • Alfred J. Lotka, 1926
    Statistics: the frequency distribution of scientific productivity
    Purpose: to "determine, if possible, the part which men of different calibre contribute to the progress of science"
    Looked at Chemical Abstracts Index, then Geschichtstafeln der Physik
    J. Washington Acad. Sci. 16:317-325

  • Lotka's law: x^n y = C
    The total number of authors y in a given subject, each producing x publications, is inversely proportional to x raised to some power n.
    Where:
      x = number of publications
      y = no. of authors credited with x publications
      n = constant (equals 2 for scientific subjects)
      C = constant
    Also known as the inverse square law of scientific productivity
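
    The relation x^n y = C can be turned into a small calculation. The following is a minimal sketch, not Lotka's own procedure: the function name and the sample figure of 100 single-paper authors are hypothetical, and C is taken to be the number of authors with exactly one publication (since y = C when x = 1).

```python
def lotka_expected_authors(single_paper_authors, max_papers, n=2.0):
    """Expected number of authors y credited with x papers, from x**n * y = C.

    Assumption for this sketch: C equals the number of authors who
    produced exactly one paper (x = 1 gives y = C).
    """
    c = single_paper_authors
    return {x: c / x**n for x in range(1, max_papers + 1)}


if __name__ == "__main__":
    # Hypothetical field with 100 single-paper authors and n = 2.
    for x, y in lotka_expected_authors(100, 5).items():
        print(f"{x} paper(s): about {y:.1f} authors")
```

    With n = 2, a field with 100 one-paper authors would be expected to have about 25 authors with two papers and about 11 with three.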

  • Lotka's Law - scientific publications
    x^n y = C
    [Figure: plot with axis label "No. of authors"]

  • Samuel Clement Bradford, 1934, 1948
    Distribution of quantity vs. yield of sources of information on specific subjects
      he studied journals as sources, but the approach is applicable to other sources
      Which journals produce how many articles in a subject, and how are they distributed? or
      How are articles in a subject scattered across journals?
    Purpose: to develop a method for identifying the most productive journals in a subject & to deal with what he called "documentary chaos"
    First published in: Engineering (1934) 137:85-86, then in his book Documentation (1948)

  • Bradford's law
    "If scientific journals are arranged in order of decreasing productivity of articles on a given subject, they may be divided into a nucleus of periodicals more particularly devoted to the subject and several groups or zones containing the same number of articles as the nucleus, when the numbers of periodicals in the nucleus and succeeding zones will be as a : n : n^2 : n^3"

  • Bradford's Law of Scattering: an idealized example

    No. of source journals   No. of articles per source   Total no. of articles
    1                        60                           60
    2                        35                           70
    Subtotal: 3                                           130
    1                        30                           30
    2                        25                           50
    2                        9                            18
    4                        8                            32
    Subtotal: 9                                           130
    10                       6                            60
    7                        5                            35
    5                        4                            20
    5                        3                            15
    Subtotal: 27                                          130
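
    The zone structure of the idealized example can be checked mechanically. The sketch below is an illustration under stated assumptions, not Bradford's own method: the function name is hypothetical, the rows are the (journals, articles per journal) pairs from the example, and a row is never split between two zones.

```python
# (number of source journals, articles per source), most productive first.
# Values are taken from the idealized example in the slides.
TABLE = [(1, 60), (2, 35), (1, 30), (2, 25), (2, 9),
         (4, 8), (10, 6), (7, 5), (5, 4), (5, 3)]


def bradford_zones(table, n_zones=3):
    """Split ranked journals into zones holding roughly equal article counts.

    Returns a list of (journals_in_zone, articles_in_zone) tuples.
    Whole rows are assigned to a zone; a row is never split.
    """
    total_articles = sum(count * per_source for count, per_source in table)
    target = total_articles / n_zones
    zones = []
    journals = articles = 0
    for count, per_source in table:
        journals += count
        articles += count * per_source
        # Close the zone once it reaches the target, keeping the last zone open.
        if articles >= target and len(zones) < n_zones - 1:
            zones.append((journals, articles))
            journals = articles = 0
    zones.append((journals, articles))
    return zones


if __name__ == "__main__":
    for number, (journals, articles) in enumerate(bradford_zones(TABLE), start=1):
        print(f"zone {number}: {journals} journals, {articles} articles")
```

    For this table the three zones hold 3, 9, and 27 journals with 130 articles each, so each zone needs three times as many journals as the one before it.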

  • Bradford's Law of Scattering: zones
    [Figure: zones around a nucleus]
    3 sources, 130 articles (nucleus)
    9 sources, 130 articles
    27 sources, 130 articles
    Garfield hypothesis

  • George Kingsley Zipf, 1935, 1949
    The psycho-biology of language: an introduction to dynamic philology (1935)
    Human behavior and the principle of least effort: An introduction to human ecology (1949)
    Looked, among other things, at frequency distributions of words in given texts
      counted the distribution in James Joyce's Ulysses
    Provided an explanation as to why the found distributions happen:
      Principle of least effort

  • Zipf's law: r f = c
    Where:
      r = rank (in terms of frequency)
      f = frequency (no. of times the given word is used in the text)
      c = constant for the given text
    For a given text, the rank of a word multiplied by its frequency is a constant
    Works well for high-frequency words, not so well for low-frequency words; thus a number of modifications

  • Charles F. Gosnell, 1944
    Obsolescence
    He studied obsolescence of books in academic libraries via their use
    College Res. Libr. (1944) 5:115-125
    But this was extended to the study of articles via citations, and to other sources
    Age of citations in articles in a subject:
      half-life: half of the citations are x years old, etc.
      different subjects have very different half-lives
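
    One simple way to put a number on the half-life idea is sketched below. It assumes, for illustration only, that the half-life is the median age of the references cited in a set of articles; the function name and the sample publication years are hypothetical.

```python
from statistics import median


def citation_half_life(citing_year, cited_years):
    """Median age, in years, of the cited references.

    Assumption for this sketch: half-life is taken as the median age,
    i.e. half of the citations are this old or younger.
    """
    ages = [citing_year - year for year in cited_years]
    return median(ages)


if __name__ == "__main__":
    # Hypothetical reference list of an article published in 2004.
    references = [2003, 2002, 2001, 2000, 1995, 1990, 1980]
    print(citation_half_life(2004, references))
```

    Running the same calculation over reference lists from two subjects gives a direct way to compare how quickly each literature ages.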

  • Curve of obsolescence
    [Figure: curve of obsolescence; axes: number of users, age at time of use]

  • Eugene Garfield, 1955
    Focused on scientific & scholarly communication based on citations
    Science (1955) 122:108-111
    Founded the Institute for Scientific Information (ISI)
      major product now ISI Web of Knowledge
    Impact factor for journals, based on how much a journal is cited
    Mapping of a literature in a subject
    Citation indexes/Web of Knowledge: MAJOR resources in bibliometric studies

  • Citation matrix
    [Diagram: an article linked back to the articles it cites and forward to the articles citing it]

  • Science Citation Index
    [Diagram: an article, its cited articles, and its citing articles]
    Association-of-ideas index

  • Co-citation analysis
    Articles that cite the same article are likely to both be of interest to the reader of the cited article
    [Diagram: one article with two citing articles]
    These two articles are likely to be related

  • Impact factor (IF)
    number of citations received in the current year by papers published in the journal in the previous two years
    divided by
    number of papers published in the journal in the previous two years
    IF has become over time a crucial indicator of journal quality and given ISI a monopoly position in the evaluation of journal quality
    Reported in Journal Citation Reports (1976-)
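
    The impact factor definition is a single division, shown here as a minimal sketch. The function name and the journal figures are hypothetical and are not taken from Journal Citation Reports.

```python
def impact_factor(citations_in_current_year, papers_in_previous_two_years):
    """Citations received this year by papers from the previous two years,
    divided by the number of papers published in those two years."""
    if papers_in_previous_two_years <= 0:
        raise ValueError("the journal must have published at least one paper")
    return citations_in_current_year / papers_in_previous_two_years


if __name__ == "__main__":
    # Hypothetical journal: 40 papers in 2002 and 60 in 2003,
    # which together received 250 citations during 2004.
    print(impact_factor(250, 40 + 60))
```

    In this made-up case the 2004 impact factor would be 2.5, meaning the average paper from 2002 and 2003 was cited two and a half times in 2004.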

  • Garfield's HistCite
    Bibliographic Analysis and Visualization Software
    Provides citation statistics & graphs for people, journals, institutions
      various citation scores, no. of cited references in articles
      various graphs with connections
    Example: articles and authors for JASIST (and predecessor names) for 1956-2004
      includes citations to authors

  • Conclusion
    Bibliometrics & related scientometrics, infometrics, webmetrics provide insight into a number of properties of information objects
      some general, predictive laws formulated
      structures have been exposed, graphed
      myriad data collected & analyzed
    A good area for research!

  • Sources used in making this presentation, among others
    Ruth Palmquist: Bibliometrics
    Donna Bair-Mundy: Boolean, bibliometrics, and beyond
    Short set of bibliometric exercises by J. Downie: http://people.lis.uiuc.edu/~jdownie/biblio/

    Tefko Saracevic, Rutgers University

    Alfred Lotka was a statistician with the Metropolitan Life Insurance Company. He set out to "determine, if possible, the part which men of different calibre contribute to the progress of science." To do this he looked at listings in the Chemical Abstracts Index during the years 1907 to 1916 and Auerbach's Geschichtstafeln der Physik through the year 1900. Today we would not use the same terminology. We don't speak of "men of different calibre." However, we still look at author productivity. In fact, in the world of academe, programs are judged in part on the basis of the publication productivity of the faculty, a measure in which SCILS, the Rutgers program, ranks highly. You'll find Lotka's law referred to in many articles in the information science literature.

    Lotka's law is basically this: The total number of authors y in a given subject, each producing x publications, is inversely proportional to x raised to some power n. Lotka's law is also referred to as the inverse square law of scientific productivity.

    Basically what this means is that in a given field a very large percentage of authors produce only one paper, fewer authors produce two papers, and so forth. Only a small number of authors produce a substantial number of publications. This law is still used in research in bibliometrics. For example, Egghe and Rao, in an article in the August 2002 issue of JASIST, worked on applying Lotka's law in cases where there are multiple authors of a single journal article. They referred to their analysis as fractional frequency distributions.

    A name you will frequently see in readings on bibliometrics is that of Samuel Bradford. Bradford was also looking for a quantitative means on which to base periodical selection. When he looked at abstracting and indexing journals, he found that there was an unequal distribution of articles on a given topic across journals in certain fields. Looking at data collected by E. Lancaster Jones from bibliographies of Applied Geophysics and Lubrication, Bradford sought to explain the dispersion of journal articles within specific fields. The statement he formulated regarding journal distribution has come to be known as Bradford's Law of Scattering. Later, Eugene Garfield was to discover that Bradford's Law was also true for science journals as a whole.

    Bradford found that "The references are scattered throughout all periodicals with a frequency approximately related inversely to the scope."

    Let's say we're looking at articles on a given subject that appeared in journals over a six-month period. Each line represents one or more journals. The most productive journals are listed at the top of the list, and other journals are listed in order of decreasing productivity as you continue down the screen. If we look at the top listing, this is a journal that published 60 articles on the topic during the period. On the next line we see that there are two journals, each of which published 35 articles on the topic. If you add the number of articles that appeared in these first three journals, you find that they published a total of 130 articles. Now look at the next group of journals. The top journal of the next tier published 30 articles, the next two journals produced 25 articles each, and so forth. If you add the articles produced by journals in this grouping you see that there are 130. But those 130 articles are scattered across 9 journals. Now look at the third tier. In the third tier 130 articles are scattered across 27 journals. If you look at the number of journals in each tier, you see the relationship that Bradford delineated in his law. In the first tier you have 3 journals. In the second tier you have 9 (which is 3^2) journals; and in the third tier you have 27 (which is 3^3) journals.

    In terms of periodical selection for our libraries, if we wish to cover a particular area, there will be a core nucleus of journals that will be highly productive in terms of articles related to that area. Then there will be another tier that will be exponentially less productive. And a third tier that will be another order less productive than the second. The analytical tools of bibliometrics allow us to determine which journals are the most productive in a given area, thus allowing us to make informed decisions in our collection development.

    In 1935 George Zipf wrote a book entitled The psycho-biology of language: an introduction to dynamic philology. In the preface of this book he presented the kernel of what later would be referred to as Zipf's law. Zipf was looking at the frequency distribution of words. In later publications he would develop two laws with formulae to describe frequency distribution for both frequently-occurring and less-frequently-occurring words. The law was later applied to numerous examples, not only words in texts, in his highly cited 1949 book Human behavior and the principle of least effort: An introduction to human ecology.

    This is Zipf's law for high-frequency words. Where r represents the rank of a word in terms of frequency, and f is the number of times the given word is used in the text, r times f for any word will equal a constant for the given text. In Potter's notation, r is the rank of the word, f is the frequency, and k is the constant (Potter 1988). Zipf illustrated his law with an analysis of James Joyce's Ulysses. "He showed that the tenth most frequent word occurred 2,653 times, the hundredth most frequent word occurred 265 times, the two hundredth word occurred 133 times, and so on. Zipf found, then that the rank of the word multiplied by the frequency of the word equals a constant that is approximately 26,500" (Potter 1988).

    Note: A Web-based program for counting and ranking the frequencies of the words in a text is available as the Web Frequency Indexer, created and maintained by Dr. Catherine N. Ball, Department of Linguistics, Georgetown University, at http://www.georgetown.edu/faculty/ballc/webtools/web_freqs.html
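
    The rank-times-frequency relation is easy to try on any text. The sketch below is only an illustration: the function name and the tokenizing rule (lowercase letters and apostrophes) are assumptions, and the three Ulysses data points are the ones quoted from Potter (1988) in these notes.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first.

    Assumption for this sketch: a word is any run of letters or apostrophes,
    compared case-insensitively.
    """
    words = re.findall(r"[a-z']+", text.lower())
    rows = []
    for rank, (word, freq) in enumerate(Counter(words).most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows


# (rank, frequency) pairs for Ulysses as quoted from Potter (1988).
ULYSSES = [(10, 2653), (100, 265), (200, 133)]
ULYSSES_PRODUCTS = [rank * freq for rank, freq in ULYSSES]

if __name__ == "__main__":
    print(ULYSSES_PRODUCTS)
    for row in rank_frequency("the cat and the hat and the bat"):
        print(row)
```

    The three Ulysses products are 26,530, 26,500 and 26,600, which is the "approximately 26,500" constant in the quotation. A very short text, like the one in the usage example, will not show the regularity; the law is a statement about long texts.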

    While others were using algorithms to determine which publications to collect, the findings of authors like Charles Gosnell help us to determine what to discard or put into less accessible areas such as remote storage sites. Gosnell sought "to discover lines of trend or curves of distribution by means of which this rate of obsolescence may be expressed in mathematical form." Gosnell used for his analysis lists of books for college libraries: Shaw List (1931), Mohrhardt List (1937), Shaw supplement (1940). Using statistical analysis, he developed a curve of obsolescence for books. "Books represent one of the higher forms of culture and the rate at which they are discarded and replaced may give some suggestion as to the rate of evolution of the general culture of which they form a part."

    Obsolescence was later studied through the age of the articles being cited, and the concept of the half-life of a literature in a subject was developed. Different subjects have different half-lives; there is a big difference, for example, between mathematics and molecular biology.

    Basically, obsolescence holds that materials lose their usefulness or reliability over time. Although the curve of obsolescence generally looks like the one on your screen, the rate at which works become obsolete varies by discipline.

    One of the major names in the field of bibliometrics is Eugene Garfield. Dr. Garfield's research has focused on scientific communication and information science, but his work has broader implications for virtually every scholarly discipline. In the early 1950s Dr. Garfield was part of the Welch Medical Indexing Project at Johns Hopkins University. This project was eventually to give us MeSH, the medical subject headings, and Index Medicus, now available online as Medline. Dr. Garfield was instrumental in the genesis of Current Contents, a current awareness publication for the sciences that is now published in seven discipline-specific editions. One of the contributions Dr. Garfield is known for in information science is the concept of the "impact factor" of an article or journal. This impact factor, as first posited in his 1955 article in Science, can be computed by examining the citations to the work. But probably the most important contribution of Eugene Garfield to the world of science and to scholarly research in general is that of the Science Citation Index.

    Now let's look at the notion of a citation matrix. If you look at any given scientific article, you will find that it cites previous articles: articles that the authors found relevant, articles that the current authors are building upon or refuting. In turn, the current article may be cited in future articles. And these citing articles may themselves be cited by articles down the line. The assumption is that the articles cited, the article in question, and the citing articles deal with the same subject area, the same topic, even if they use different terminology to describe that topic. Dr. Garfield's idea was to use the relationships of the citation matrix to index scientific publications.

    As we clicked through the citing article links, we moved forward through the citation matrix, in this way finding later articles on the same topic as our initial article.

    We can also use the Web of Science to do co-citation analysis. This operates on the assumption that two articles that cite the same article are likely to be related.

    Co-citation coupling: If papers A and B are both cited by paper C, they may be said to be related to one another, even though they don't directly cite each other. If papers A and B are both cited by many other papers, they have a stronger relationship. The more papers they are cited by, the stronger their relationship is.

    Bibliographic coupling "operates on a similar principle, but in a way it is the mirror image of co-citation coupling. Bibliographic coupling links two papers that cite the same articles, so that if papers A and B both cite paper C, they may be said to be related, even though they don't directly cite each other. The more papers they both cite, the stronger their relationship is."
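
    Both coupling measures reduce to counting over a table of who cites whom. The following is a minimal sketch under stated assumptions: the citation data, the paper labels, and the function names are all hypothetical, and strength is taken simply as a raw count with no normalization.

```python
def cocitation_strength(cites, a, b):
    """Number of papers whose reference lists contain both a and b."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)


def coupling_strength(cites, a, b):
    """Number of references that papers a and b have in common."""
    return len(cites.get(a, set()) & cites.get(b, set()))


# Hypothetical citation data: each paper maps to the set of papers it cites.
CITES = {
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A"},
    "A": {"X", "Y"},
    "B": {"Y", "Z"},
}

if __name__ == "__main__":
    print("co-citation strength of A and B:", cocitation_strength(CITES, "A", "B"))
    print("bibliographic coupling of A and B:", coupling_strength(CITES, "A", "B"))
```

    In this made-up data, A and B are cited together by C and D, giving a co-citation strength of 2, and they share one reference (Y), giving a bibliographic coupling strength of 1. Co-citation counts can grow as new papers appear, while the coupling count is fixed once A and B are published.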