Profiling Science Communities with Semantic-powered Web Mining (Dynamic Monitoring of Research Communities Based on Semantic Web Mining Technology). Zhang Zhixiong, Liu Jianhua, Xie Jing, Qian Li, Wu Sizhu, Zou Yiming. National Science Library, Chinese Academy of Sciences, China. Dec 5, 2010


Page 1: Profiling Science Communities with Semantic-powered Web Mining ( 基于语义 Web 挖掘技术实现科研团体的动态监测 ) Zhang Zhixiong, Liu Jianhua, Xie Jing Qian Li, Wu

Profiling Science Communities with Semantic-powered Web Mining

(Dynamic Monitoring of Research Communities Based on Semantic Web Mining Technology)

Zhang Zhixiong, Liu Jianhua, Xie Jing, Qian Li, Wu Sizhu, Zou Yiming

National Science Library, Chinese Academy of Sciences, China

Dec 5, 2010

Page 2

Outline

Background Information
Ideas & Architecture
System implemented
Issues and Discussions


Page 4

1. Background Information

A Digital Age
From Information Retrieval to Knowledge Exploration
From Information Services to Intelligence Services

National Science Library, Chinese Academy of Sciences
About 82 persons
Intelligence services:
detect scientific research activities
identify research trends
monitor the progress of a research area

Page 5

1. Background Information

Sci & Tech monitoring based on Web information plays a key role in our library's intelligence services.

Projects concerned with this area:

Science Monitoring and Evaluation based on Scientific Web Information (project of the National Key Technology R&D Program in the 11th Five-Year Plan of China, 2007-2010)

Technologies and methodologies for topic burst detection from web resources (National Social Science Fund, 2009-2011)

Technologies study for automatically monitoring the research activities of key research institutes (Chinese Academy of Sciences, 2009-2010)

Automatic Monitoring System for Sci. & Tech. (Chinese Academy of Sciences, 2010-2012)

Page 6

1. Background Information

Web Resources

Timely updated
Abundant
Open

Good resources for monitoring research activities

Unstructured
Non-semantic??

Page 7

Outline

Background Information
Ideas & Architecture
System implemented
Issues and Discussions

Page 8

2. Ideas & Architecture

[Diagram labels: Automatic Extraction; Data Mining and In-depth Analysis; Research Profiling]

Page 9

2. Ideas & Architecture

4 Main Ideas

Monitoring the changes of science communities by continuously harvesting related information

Turning the free texts into time-stamped objects to support the calculation of the indicators

Building a large-scale knowledge base from time-stamped objects to achieve semantic mining of related topics

Profiling the status of science communities using visualization technologies

Page 10

2. Ideas & Architecture

(1) Monitoring the changes by continuously harvesting …

[Diagram labels: Dataset of 2008; Dataset of 2009; sources: Web site, News, RSS …]

New projects
New research activities
New research plans
New achievements
New terms
New topics
New research areas
……
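The comparison between two yearly harvests can be pictured as a set difference. The following is only a minimal sketch under that assumption; the function name `new_items` and the sample terms are hypothetical and do not come from the talk.

```python
def new_items(previous, current):
    """Return items present in the current harvest but absent from the previous one."""
    return sorted(set(current) - set(previous))


# Hypothetical harvested terms, used only for illustration.
dataset_2008 = {"semantic web", "ontology", "information extraction"}
dataset_2009 = {"semantic web", "ontology", "linked data", "topic burst"}

new_terms = new_items(dataset_2008, dataset_2009)
print(new_terms)
```

The same comparison could be run separately per object type (projects, plans, achievements, terms) if the harvested items carry a type.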

Page 11

2. Ideas & Architecture

(2) Turning the free texts into time-stamped objects

Information Extraction technologies

Turning the free texts into two types of simple structured objects with time stamps:

Object Type, Object, Time Stamp

Object A, Object B, Relationship, Time Stamp

Page 12

2. Ideas & Architecture

For example: July 13, 2010, White House Announces National HIV/AIDS Strategy

We turn it into the following time-stamped objects:

Object Type, Object, Time Stamp
Strategy, National HIV/AIDS Strategy, July 13, 2010

Object A, Object B, Relationship, Time Stamp
White House, National HIV/AIDS Strategy, Announces, July 13, 2010
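One way to hold these two record shapes in code is a pair of small immutable records. This is a sketch, not the system's actual data model; the class and field names (`TypedObject`, `Relation`, and so on) are assumptions chosen to mirror the column names on the slide.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TypedObject:
    """Record of the form: Object Type, Object, Time Stamp."""
    object_type: str
    name: str
    time_stamp: date


@dataclass(frozen=True)
class Relation:
    """Record of the form: Object A, Object B, Relationship, Time Stamp."""
    object_a: str
    object_b: str
    relationship: str
    time_stamp: date


# The example from the slide, encoded with the hypothetical classes above.
stamp = date(2010, 7, 13)
strategy = TypedObject("Strategy", "National HIV/AIDS Strategy", stamp)
announcement = Relation("White House", "National HIV/AIDS Strategy", "Announces", stamp)

print(strategy)
print(announcement)
```

Keeping the time stamp as a real date rather than a string makes later filtering by period straightforward.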

Page 13

2. Ideas & Architecture

[Annotated example text]

LarKC: The Large Knowledge Collider

The aim of the EU FP 7 Large-Scale Integrating Project LarKC is to develop the Large Knowledge Collider (LarKC, for short, pronounced “lark”), a platform for massive distributed incomplete reasoning that will remove the scalability barriers of currently existing reasoning systems for the Semantic Web.

LarKC is sponsoring the following international scientific events: 11th International Conference on Principles of Knowledge Representation and Reasoning (KR 2008) ……

[Annotation labels: relation marker; term; system platform; project; academic conference]

Object extracted

Relation extracted between objects

Page 14

2. Ideas & Architecture

(3) Building large scale knowledgebase based on time-stamped objects to support the calculation of the indicators

[Diagram labels: "A institution dataset" containing objects (Obj1, Obj2, Obj3, …) and time stamps (Time1, Time2, Time3, …); Monitoring Model A and Monitoring Model B; Indicator sys A and Indicator sys B, each with Indicator 1, Indicator 2, Indicator 3, Indicator 4, …; AAA Monitoring and BBB Monitoring]
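The slides do not define any specific indicator, so the following is only an illustrative sketch of how one could be computed from time-stamped objects: a count of objects of a given type per year. The function name `objects_per_year` and the sample records are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical records in the form (object type, object, time stamp).
records = [
    ("Strategy", "National HIV/AIDS Strategy", date(2010, 7, 13)),
    ("Project", "Example Project A", date(2009, 3, 1)),
    ("Project", "Example Project B", date(2010, 5, 20)),
]


def objects_per_year(records, object_type):
    """Assumed indicator: number of objects of one type observed in each year."""
    return Counter(
        stamp.year for kind, _name, stamp in records if kind == object_type
    )


project_counts = objects_per_year(records, "Project")
print(dict(project_counts))
```

Different monitoring models could then combine several such counts into their own indicator systems.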

Page 15

2. Ideas & Architecture

(4) Profiling the status of science communities using visualization technologies

Visualization technology based on Web Flex architecture

Page 16

Architecture

Page 17

Key Technologies

[Diagram: Research Profiling Based on Web Mining]

Stages: Information Crawl; Information Extraction; Knowledge Base Construction; Data Mining & Topic Clustering; In-Depth Analysis; Visual Analysis

Labels: Nutch, RSS, Newsgroup, OAI; Research Ontology & Research Object; Conduct Annotation; Annotate the research object; Flex\Flare\GraphML

Page 18

Outline

Background Information
Ideas & Architecture
System implemented
Issues and Discussions

Page 19

3. System implemented

Several monitoring systems have been implemented (or are being implemented) in CAS:

Experimental systems on research areas such as artificial intelligence
Systems on key institutes in energy and in aeronautics & space
A system for monitoring science strategy and policy (in cooperation with the science strategy and policy team of the intelligence services in CAS)

Page 20

3. System implemented

(1) Understand what to monitor

Declaration: formal statements & declarations on key science issues
Strategy: strategy (strategic plan) for science, technology and innovation
Project: key initiatives & research programs
Budget: science budget, science funding, R&D budget
Statistics: statistics; science, technology and R&D statistics; GDE on R&D; S&T indicators

Page 21

3. System implemented

(1) Understand what to monitor

Policy: science and technology policy, innovation management, decision making, policy-making
Adjustments: organizational adjustment, change, expansion, organizational restructuring
Achievement: breakthrough, scientific achievement, research achievements, outstanding research accomplishments
Report: periodic report, annual report, technical report

Page 22

3. System implemented

(2) Select the target institutes to monitor

71 institutes are selected, such as:

OSTP (Office of Science and Technology Policy)
Research Councils UK (RCUK)
The National Science Foundation (NSF)
The International Energy Agency (IEA)
SciDev.Net
OECD
Worldwatch Institute
RAND
Science Business
Hudson Institute
The Brookings Institution

Page 23

3. System implemented

(3) Identify valuable webpages

Identify valuable information from crawled webpages using a sensitive vocabulary

Sensitive words such as: strategic plan, vision & strategy, policies, guidelines, annual report, organization chart ……

Calculate the importance of each web page and mark it with a number of stars
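The slides do not say how importance is calculated. As a rough sketch, assume a page's importance is the number of sensitive-vocabulary hits and that each hit earns one star up to a cap; the names `count_hits` and `stars` and the five-star cap are assumptions, while the vocabulary is the one listed on the slide.

```python
# Sensitive vocabulary taken from the slide (lower-cased for matching).
SENSITIVE_TERMS = [
    "strategic plan",
    "vision & strategy",
    "policies",
    "guidelines",
    "annual report",
    "organization chart",
]


def count_hits(text, terms=SENSITIVE_TERMS):
    """Count case-insensitive occurrences of sensitive terms in the page text."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in terms)


def stars(hits, max_stars=5):
    """Assumed mapping: one star per hit, capped at max_stars."""
    return min(hits, max_stars)


# Hypothetical page text.
page = "Our Strategic Plan and Annual Report set out new policies."
hits = count_hits(page)
rating = stars(hits)
print(hits, rating)
```

A production scorer would likely weight terms differently and normalise by page length; this sketch only shows the vocabulary-matching idea.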

Page 24

Mark the importance of the web page

Page 25

3. System implemented

(4) Identify the category

Identify the category to which the intelligence belongs

9 intelligence categories: Declaration, Strategy, Project, Budget, Statistics, Policy, Adjustments, Achievement, Report

Using automatic classification tools
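The talk does not name the classification tools. Purely as an illustration, a keyword-cue baseline over the nine categories could look like the sketch below; the cue lists are abridged from the "what to monitor" slides, and the scoring rule (most cue hits wins) is an assumption, not the system's method.

```python
# Cue words per category, abridged from the category descriptions in the talk.
CATEGORY_CUES = {
    "Declaration": ["declaration", "formal statement"],
    "Strategy": ["strategy", "strategic plan"],
    "Project": ["initiative", "research program"],
    "Budget": ["budget", "funding"],
    "Statistics": ["statistics", "indicators"],
    "Policy": ["policy", "decision making"],
    "Adjustments": ["restructuring", "organizational adjustment", "expansion"],
    "Achievement": ["breakthrough", "achievement", "accomplishment"],
    "Report": ["annual report", "technical report", "periodic report"],
}


def classify(text):
    """Return the category with the most cue hits, or None if nothing matches."""
    lowered = text.lower()
    scores = {
        category: sum(lowered.count(cue) for cue in cues)
        for category, cues in CATEGORY_CUES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


label = classify("The agency requests a larger R&D budget and more science funding.")
print(label)
```

A trained text classifier would normally replace the cue lists; the sketch only shows where the nine labels plug in.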

Page 26

Category: Strategy

Page 27

Category: Projects

Page 28

Category: Budget

Page 29

3. System implemented

(4) Monitoring Rich Text

Rich Text: PDF files, Word files, PPT files ……
Reports, statistics, declarations, summaries
High information value

Identify Rich Text files after each crawl
Cache the Rich Text files for future use
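A simple way to pick out rich text files from a crawl is to look at the file extension in each URL path. The sketch below assumes that approach; the extension set, the function name `is_rich_text`, and the example URLs are all hypothetical, and caching is left out.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed set of extensions for PDF, Word and PowerPoint files.
RICH_TEXT_EXTENSIONS = {".pdf", ".doc", ".docx", ".ppt", ".pptx"}


def is_rich_text(url):
    """True if the URL path ends in a known rich text extension (case-insensitive)."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in RICH_TEXT_EXTENSIONS


# Hypothetical URLs from one crawl.
crawled = [
    "http://example.org/reports/annual-report-2009.PDF",
    "http://example.org/news/index.html",
    "http://example.org/slides/strategy.ppt?download=1",
]
rich_text_urls = [url for url in crawled if is_rich_text(url)]
print(rich_text_urls)
```

Files served without an extension would need a check on the HTTP `Content-Type` header instead.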

Page 30

Rich Text files

Page 31

Rich Text files

Page 32

3. System implemented

(5) Object Extraction

Extract key terms and objects from the Web pages
Information extraction
Term extraction

Page 33

Original Context

Page 34

Terms and Objects Extraction

Key Terms

Key Objects

Page 35

3. System implemented

(6) Perform topic navigation

Cluster the web pages in a web site for easy browsing and exploring
Topic clustering based on extracted terms
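The clustering algorithm is not specified in the talk. As one possible illustration, pages can be grouped greedily by the Jaccard overlap of their extracted term sets; the threshold of 0.3, the function names, and the sample pages are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_pages(page_terms, threshold=0.3):
    """Greedy single-pass clustering: a page joins the first cluster whose seed
    page is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed term set, member page ids)
    for page_id, terms in page_terms.items():
        for seed_terms, members in clusters:
            if jaccard(terms, seed_terms) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((terms, [page_id]))
    return [members for _seed, members in clusters]


# Hypothetical pages with their extracted terms.
pages = {
    "p1": {"climate", "energy", "policy"},
    "p2": {"climate", "energy", "funding"},
    "p3": {"malaria", "vaccine"},
}
groups = cluster_pages(pages)
print(groups)
```

The result depends on page order and on the threshold, which is why this is a sketch rather than a recommended method.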

Page 36

Topic navigation: SciDev.Net

Page 37

Topic navigation: SciDev.Net

Page 38

3. System implemented

(7) Identify important objects

Identify important objects in a web site:
Key project
Key person
Key foundation
Key conference
……

Page 39

Identify key objects: SciDev.Net

Page 40

Identify key objects: SciDev.Net

Page 41

Identify key objects: SciDev.Net

Page 42

Identify key objects: Department of Energy

Page 43

Identify key objects: Department of Energy

Page 44

3. System implemented

(8) Identify important topics

Identify important topics in a web site
Topics based on term frequency
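Ranking topics by term frequency can be sketched with a plain counter over the terms extracted from each page of a site. The function name `top_topics` and the sample terms are hypothetical.

```python
from collections import Counter


def top_topics(pages_terms, n=3):
    """Rank terms by how often they occur across all pages of a site."""
    counts = Counter(term for terms in pages_terms for term in terms)
    return counts.most_common(n)


# Hypothetical extracted terms, one list per page.
site_terms = [
    ["climate change", "energy"],
    ["climate change", "biofuel"],
    ["climate change", "energy", "policy"],
]
ranking = top_topics(site_terms, n=2)
print(ranking)
```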

Page 45

Identify key topics: SciDev.Net

Page 46

Identify key topics: Science Business

Page 47

Identify key topics: DOE

Page 48

3. System implemented

(8) Identify hot topics

Identify the hot topics in a period
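How "hot" is measured is not stated in the talk. One crude possibility, shown here only as a sketch, is to flag terms whose mention count inside the period is at least some multiple of their count before it; the growth factor of 2.0, the function name `hot_topics`, and the sample mentions are assumptions.

```python
from collections import Counter
from datetime import date


def hot_topics(mentions, start, end, min_growth=2.0):
    """Terms whose count in [start, end] is at least min_growth times their
    count before start (terms unseen before are treated as having a count of 1)."""
    before = Counter(term for term, day in mentions if day < start)
    during = Counter(term for term, day in mentions if start <= day <= end)
    return sorted(
        term
        for term, count in during.items()
        if count >= min_growth * max(before[term], 1)
    )


# Hypothetical time-stamped term mentions.
mentions = [
    ("biofuel", date(2010, 1, 5)),
    ("biofuel", date(2010, 6, 2)),
    ("biofuel", date(2010, 6, 9)),
    ("biofuel", date(2010, 6, 20)),
    ("energy", date(2010, 1, 7)),
    ("energy", date(2010, 6, 3)),
]
hot = hot_topics(mentions, date(2010, 6, 1), date(2010, 6, 30))
print(hot)
```

A fairer comparison would use a baseline window of the same length as the period; the sketch compares against all earlier mentions for brevity.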

Page 49

Identify Hot topics: SciDev.Net

Page 50

3. System implemented

Demo system: http://124.16.154.12/OriMonitor/

Page 51

Outline

Background Information
Ideas & Architecture
System implemented
Issues and Discussions

Page 52

4. Issues and Discussions

It is relatively easy to implement an experimental system.

It is very difficult to turn the experimental system into a practical, usable system.

Many semantic technologies need to be used:
Knowledge from experts
Ontologies of key institutions

For the same algorithm, the knowledge base makes the difference.

Page 53

Thanks

Thank You for Your Attention!