Profiling Science Communities with Semantic-powered Web Mining
(Dynamic Monitoring of Research Communities Based on Semantic Web Mining Technology)
Zhang Zhixiong, Liu Jianhua, Xie Jing, Qian Li, Wu Sizhu, Zou Yiming
National Science Library, Chinese Academy of Sciences, China
Dec 5, 2010
Outline
Background Information
Ideas & Architecture
System implemented
Issues and Discussions
1. Background Information
A Digital Age
  From Information Retrieval to Knowledge Exploration
  From Information Services to Intelligence Services
National Science Library, Chinese Academy of Sciences
  Intelligence services: about 82 persons
  Detect scientific research activities
  Identify research trends
  Monitor the progress of a research area
1. Background Information
Sci & Tech monitoring based on Web information plays a key role in our library's intelligence services.
Projects concerned with this area:
  Science Monitoring and Evaluation based on Scientific Web Information (project of the National Key Technology R&D Program in the 11th Five-Year Plan of China, 2007-2010)
  Technologies and methodologies for topic burst detection from web resources (National Social Science Fund, 2009-2011)
  Technologies study for automatically monitoring the research activities of key research institutes (Chinese Academy of Sciences, 2009-2010)
  Automatic Monitoring System for Sci. & Tech. (Chinese Academy of Sciences, 2010-2012)
1. Background Information
Web Resources
  Timely updated, abundant, open
  Good resources for monitoring research activities
  But unstructured and non-semantic?
2. Ideas & Architecture
[Figure: Automatic Extraction, Data Mining and In-depth Analysis, Research Profiling, …]
2. Ideas & Architecture
4 Main Ideas
  (1) Monitoring the changes of science communities by continuously harvesting related information
  (2) Turning free text into time-stamped objects to support the calculation of the indicators
  (3) Building a large-scale knowledge base from time-stamped objects to achieve semantic mining of related topics
  (4) Profiling the status of science communities using visualization technologies
2. Ideas & Architecture
(1) Monitoring the changes by continuously harvesting …
[Figure: sources (Web site, News, RSS…), Dataset of 2008, Dataset of 2009, and the changes detected between them:]
  New projects
  New research activities
  New research plans
  New achievements
  New terms
  New topics
  New research areas
  ……
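Detecting what is new between two harvests can be sketched as a set difference. This is only an illustration under assumptions: the function name `detect_new_items`, the `(object_type, name)` pair layout, and the sample records are hypothetical and not taken from the system itself.

```python
def detect_new_items(previous, current):
    """Return items in the current harvest that were absent from the previous one, grouped by type."""
    new_by_type = {}
    for object_type, name in sorted(set(current) - set(previous)):
        new_by_type.setdefault(object_type, []).append(name)
    return new_by_type

# Hypothetical harvest results; each item is an (object_type, name) pair.
dataset_2008 = {("Project", "Project A"), ("Term", "semantic web")}
dataset_2009 = {("Project", "Project A"), ("Project", "Project B"),
                ("Term", "semantic web"), ("Term", "linked data")}

new_items = detect_new_items(dataset_2008, dataset_2009)
print(new_items)
```

Grouping by object type mirrors the slide's list of new projects, new terms, new topics, and so on.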
2. Ideas & Architecture
(2) Turning the free texts into time-stamped objects
Information Extraction technologies
Turning the free texts into two types of simple structured objects with time stamps:
  Object Type, Object, Time Stamp
  Object A, Object B, Relationship, Time Stamp
2. Ideas & Architecture
For example: July 13, 2010, White House Announces National HIV/AIDS Strategy
We turn it into the following time-stamped objects:
  Object Type, Object, Time Stamp
    Strategy, National HIV/AIDS Strategy, July 13, 2010
  Object A, Object B, Relationship, Time Stamp
    White House, National HIV/AIDS Strategy, Announces, July 13, 2010
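The two object shapes in this example can be written down as simple records. The sketch below is one possible representation; the class names `TypedObject` and `Relation` and their field names are assumptions chosen to mirror the slide's column names, not the system's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TypedObject:
    """(Object Type, Object, Time Stamp)"""
    object_type: str
    name: str
    time_stamp: date

@dataclass(frozen=True)
class Relation:
    """(Object A, Object B, Relationship, Time Stamp)"""
    object_a: str
    object_b: str
    relationship: str
    time_stamp: date

# The White House example from the slide.
stamp = date(2010, 7, 13)
typed = TypedObject("Strategy", "National HIV/AIDS Strategy", stamp)
relation = Relation("White House", "National HIV/AIDS Strategy", "Announces", stamp)
print(typed)
print(relation)
```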
[Figure: sample text annotated with extracted objects and relations]
LarKC: The Large Knowledge Collider
The aim of the EU FP 7 Large-Scale Integrating Project LarKC is to develop the Large Knowledge Collider (LarKC, for short, pronounced “lark”), a platform for massive distributed incomplete reasoning that will remove the scalability barriers of currently existing reasoning systems for the Semantic Web.
LarKC is sponsoring the following international scientific events: 11th International Conference on Principles of Knowledge Representation and Reasoning (KR 2008) ……
Annotation labels: relation marker, term, system platform, project, academic conference
Object extracted
Relation extracted between objects
2. Ideas & Architecture
(3) Building a large-scale knowledge base from time-stamped objects to support the calculation of the indicators
[Figure: two monitoring models, each with its own indicator system]
  Monitoring Model A: Indicator sys A (Indicator 1, Indicator 2, Indicator 3, Indicator 4, Indicator …), AAA Monitoring
  Monitoring Model B: Indicator sys B (Indicator 1, Indicator 2, Indicator 3, Indicator 4, Indicator …), BBB Monitoring
2. Ideas & Architecture
[Figure: an institution dataset, in which objects (Obj1, Obj2, Obj3, …) are paired with time stamps (Time1, Time2, Time3, …)]
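The slides do not define the individual indicators, so the following is only a sketch of how one could be computed from time-stamped objects. It assumes a hypothetical indicator, "number of objects of a given type per year", and hypothetical records in the (Object Type, Object, Time Stamp) layout.

```python
from collections import Counter
from datetime import date

def yearly_counts(objects, object_type):
    """Count objects of one type per year (a hypothetical indicator)."""
    return Counter(stamp.year for kind, _name, stamp in objects if kind == object_type)

# Hypothetical institution dataset: (object type, object, time stamp).
institution_dataset = [
    ("Project", "Obj1", date(2008, 3, 1)),
    ("Project", "Obj2", date(2009, 5, 20)),
    ("Project", "Obj3", date(2009, 11, 2)),
    ("Strategy", "Obj4", date(2009, 7, 13)),
]

project_indicator = yearly_counts(institution_dataset, "Project")
print(dict(project_indicator))
```

Because every object carries a time stamp, the same dataset can feed different indicator systems simply by changing the filter and the time grouping.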
2. Ideas & Architecture
(4) Profiling the status of science communities using visualization technologies
Web-based visualization technology using the Flex architecture
Architecture
[Figure: Research Profiling Based on Web Mining]
  Components: Information Crawl, Information Extraction, Knowledge Base Construction, Data Mining & Topic Clustering, In-Depth Analysis, Visual Analysis
  Annotate the research object: Research Ontology & Research Object, Conduct Annotation
  Key Technologies: Nutch, RSS, Newsgroup, OAI, Flex\Flare\GraphML
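GraphML is listed among the key technologies, so extracted relations can plausibly be handed to a visualization front end as a GraphML file. The sketch below uses only Python's standard library; the function name `relations_to_graphml`, the key ids `label` and `rel`, and the generated node ids are assumptions, and the real system's export format is not described in the slides.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"

def relations_to_graphml(relations):
    """Serialize (object_a, object_b, relationship) triples as a GraphML string."""
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    ET.SubElement(root, "key", {"id": "label", "for": "node",
                                "attr.name": "label", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "rel", "for": "edge",
                                "attr.name": "relationship", "attr.type": "string"})
    graph = ET.SubElement(root, "graph", {"edgedefault": "directed"})
    node_ids = {}
    # Create one node per distinct object, with a generated id and a label.
    for a, b, _relationship in relations:
        for name in (a, b):
            if name not in node_ids:
                node_ids[name] = f"n{len(node_ids)}"
                node = ET.SubElement(graph, "node", {"id": node_ids[name]})
                ET.SubElement(node, "data", {"key": "label"}).text = name
    # Create one directed edge per relation, labelled with the relationship.
    for a, b, relationship in relations:
        edge = ET.SubElement(graph, "edge", {"source": node_ids[a],
                                             "target": node_ids[b]})
        ET.SubElement(edge, "data", {"key": "rel"}).text = relationship
    return ET.tostring(root, encoding="unicode")

graphml = relations_to_graphml(
    [("White House", "National HIV/AIDS Strategy", "Announces")])
print(graphml)
```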
3. System implemented
Several monitoring systems have been implemented (or are being implemented) in CAS:
  Experimental systems on research areas such as artificial intelligence
  Systems on key institutes in energy, aeronautics & space
  System for monitoring science strategy and policy (in cooperation with the science strategy and policy team of the intelligence services in CAS)
3. System implemented
(1) Understand what to monitor
  Declaration: formal statements & declarations on key science issues
  Strategy: strategy (strategic plan) for science, technology and innovation
  Project: key initiatives & research programs
  Budget: science budget, science funding, R&D budget
  Statistics: science, technology and R&D statistics, GDE on R&D, S&T indicators
3. System implemented
(1) Understand what to monitor (continued)
  Policy: science and technology policy, innovation management, decision making, policy-making
  Adjustments: organizational adjustment, change, expansion, organizational restructuring
  Achievement: breakthroughs, scientific achievements, research achievements, outstanding research accomplishments
  Report: periodic reports, annual reports, technical reports
3. System implemented
(2) Select the target institutes to monitor
71 institutes are selected, such as:
  OSTP (Office of Science and Technology Policy)
  Research Councils UK (RCUK)
  The National Science Foundation (NSF)
  The International Energy Agency (IEA)
  SciDev.Net
  OECD
  Worldwatch Institute
  RAND
  Science Business
  Hudson Institute
  The Brookings Institution
3. System implemented
(3) Identify valuable webpages
Identify valuable information in crawled webpages using a sensitive vocabulary
  Sensitive words such as: strategic plan, vision & strategy, policies, guidelines, annual report, organization chart ……
Calculate the importance of each web page and mark it with a number of stars
[Screenshot: Mark the importance of the web page]
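The slides do not give the importance formula, so the following is a minimal sketch under a stated assumption: a page earns one star per distinct sensitive word it contains, capped at five. The names `SENSITIVE_WORDS` and `star_rating`, the cap, and the sample page are all hypothetical; only the vocabulary comes from the slide.

```python
# Sensitive vocabulary taken from the slide (lowercased for matching).
SENSITIVE_WORDS = ["strategic plan", "vision & strategy", "policies",
                   "guidelines", "annual report", "organization chart"]

def star_rating(page_text, max_stars=5):
    """Rate a page by the number of distinct sensitive words it contains, capped at max_stars."""
    text = page_text.lower()
    hits = sum(1 for word in SENSITIVE_WORDS if word in text)
    return min(hits, max_stars)

page = "Our Strategic Plan and Annual Report set out new policies."
stars = star_rating(page)
print("*" * stars)
```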
3. System implemented
(4) Identify the category
Identify the category to which each piece of intelligence belongs
  9 intelligence categories: Declaration, Strategy, Project, Budget, Statistics, Policy, Adjustments, Achievement, Report
  Using automatic classification tools
[Screenshots: Category: Strategy; Category: Projects; Category: Budget]
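The automatic classification tools are not named in the slides. As a stand-in, the sketch below uses plain keyword counting over the nine categories; it is not the authors' classifier. The keyword lists are abbreviated from the "what to monitor" slides, and the names `CATEGORY_KEYWORDS` and `classify` are assumptions.

```python
# Abbreviated keyword lists per category, adapted from the slides.
CATEGORY_KEYWORDS = {
    "Declaration": ["declaration", "formal statement"],
    "Strategy": ["strategy", "strategic plan"],
    "Project": ["initiative", "research program"],
    "Budget": ["budget", "funding"],
    "Statistics": ["statistics", "indicators"],
    "Policy": ["policy", "policy-making", "decision making"],
    "Adjustments": ["restructuring", "organizational adjustment", "expansion"],
    "Achievement": ["breakthrough", "achievement", "accomplishment"],
    "Report": ["annual report", "technical report", "periodic report"],
}

def classify(text):
    """Return the category with the most keyword occurrences, or None if nothing matches."""
    lowered = text.lower()
    scores = {category: sum(lowered.count(keyword) for keyword in keywords)
              for category, keywords in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

category = classify("Congress approves the R&D budget and new science funding")
print(category)
```

A trained text classifier would replace the scoring step in practice; the keyword version only shows the input and output of the categorization stage.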
3. System implemented
(4) Monitoring Rich Text
Rich Text: PDF files, WORD files, PPT files ……
  Reports, statistics, declarations, summaries
  High information value
Identify Rich Text files after each crawl
Cache the Rich Text files for future use
[Screenshots: Rich Text files]
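Identifying rich text files after a crawl can be approximated by checking URL file extensions. This is a sketch under assumptions: the extension list, the function name `find_rich_text_urls`, and the example URLs are hypothetical, and a production crawler would more likely check the HTTP content type as well.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed extensions for PDF, WORD and PPT files.
RICH_TEXT_EXTENSIONS = {".pdf", ".doc", ".docx", ".ppt", ".pptx"}

def find_rich_text_urls(crawled_urls, already_cached=()):
    """Return URLs that look like rich text files and are not cached yet."""
    cached = set(already_cached)
    found = []
    for url in crawled_urls:
        # Use only the path part so query strings do not hide the extension.
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        if suffix in RICH_TEXT_EXTENSIONS and url not in cached:
            found.append(url)
    return found

urls = ["http://example.org/reports/annual-2009.PDF",
        "http://example.org/news/index.html",
        "http://example.org/slides/plan.ppt?download=1"]
to_cache = find_rich_text_urls(urls)
print(to_cache)
```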
3. System implemented
(5) Object Extraction
Extract key terms and objects from the Web pages
  Information extraction
  Term extraction
[Screenshot: Original Context; Terms and Objects Extraction; Key Terms; Key Objects]
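The term extraction method itself is not described, so the sketch below substitutes a deliberately simple technique: counting two-word phrases that contain no stopword. The stopword list, the function name `extract_key_terms`, and the sample text are assumptions for illustration only.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "is", "are", "with"}

def extract_key_terms(text, top_n=3):
    """Return the most frequent two-word phrases that contain no stopword."""
    words = re.findall(r"[a-z]+", text.lower())
    bigrams = [f"{first} {second}" for first, second in zip(words, words[1:])
               if first not in STOPWORDS and second not in STOPWORDS]
    return [term for term, _count in Counter(bigrams).most_common(top_n)]

context = ("Climate change is a priority. Funding for climate change research "
           "and renewable energy grows; renewable energy is key to climate change policy.")
key_terms = extract_key_terms(context, top_n=2)
print(key_terms)
```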
3. System implemented
(6) Perform Topic navigation
Cluster the web pages in a web site for easy browsing and exploring
Topic clustering based on extracted terms
[Screenshots: Topic navigation: SciDev.Net]
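Clustering pages by their extracted terms can be illustrated with a greedy grouping on Jaccard similarity. The clustering algorithm used by the system is not stated, so this is a sketch of one simple option; the threshold of 0.3, the function names, and the sample pages are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_pages(page_terms, threshold=0.3):
    """Greedily assign each page to the first cluster whose seed page is similar enough."""
    clusters = []  # list of (seed term set, [page ids])
    for page_id, terms in page_terms.items():
        for seed_terms, members in clusters:
            if jaccard(terms, seed_terms) >= threshold:
                members.append(page_id)
                break
        else:
            # No existing cluster is close enough, so this page seeds a new one.
            clusters.append((set(terms), [page_id]))
    return [members for _seed, members in clusters]

# Hypothetical pages with their extracted terms.
pages = {
    "page1": {"climate change", "carbon emissions", "energy policy"},
    "page2": {"climate change", "carbon emissions", "adaptation"},
    "page3": {"malaria", "vaccine trials"},
    "page4": {"malaria", "vaccine trials", "public health"},
}
topic_groups = cluster_pages(pages)
print(topic_groups)
```

Each resulting group could then be presented as one entry in the topic navigation view.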
3. System implemented
(7) Identify important objects
Identify important objects in a web site
  Key project
  Key person
  Key foundation
  Key conference
  ……
[Screenshots: Identify key objects: SciDev.Net; Identify key objects: Department of Energy]
3. System implemented
(8) Identify important topics
Identify important topics in a web site
  Topics are ranked by term frequency
[Screenshots: Identify key topics: SciDev.Net; Science Business; DOE]
3. System implemented
(8) Identify hot topics
Identify the hot topics in a given period
[Screenshot: Identify hot topics: SciDev.Net]
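The slides do not define what makes a topic hot. The sketch below assumes one plausible rule: a term is hot if it is mentioned at least `min_count` times within the period and more often than in all earlier data. The rule, the function name `hot_topics`, and the sample mentions are hypothetical.

```python
from collections import Counter
from datetime import date

def hot_topics(term_mentions, start, end, min_count=2):
    """Return terms mentioned at least min_count times in [start, end] and more often than before start."""
    in_period = Counter(term for term, stamp in term_mentions if start <= stamp <= end)
    before = Counter(term for term, stamp in term_mentions if stamp < start)
    return [term for term, count in in_period.most_common()
            if count >= min_count and count > before[term]]

# Hypothetical time-stamped term mentions.
mentions = [
    ("climate change", date(2010, 3, 2)), ("climate change", date(2010, 4, 9)),
    ("climate change", date(2010, 6, 5)), ("climate change", date(2010, 6, 12)),
    ("biofuels", date(2010, 6, 3)), ("biofuels", date(2010, 6, 20)),
    ("biofuels", date(2010, 6, 28)),
    ("malaria", date(2010, 6, 15)),
]
hot = hot_topics(mentions, date(2010, 6, 1), date(2010, 6, 30))
print(hot)
```

Here "climate change" is frequent but steady, so only "biofuels" is reported as hot for June 2010.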
3. System implemented
Demo system http://124.16.154.12/OriMonitor/
4. Issues and Discussions
It is relatively easy to implement an experimental system.
It is very difficult to turn the experimental system into a practically usable system.
  Many semantic technologies need to be used
  Knowledge from experts
  Ontologies of key institutions
For the same algorithm, the knowledge base makes the difference
Thanks
Thank You for Your Attention!