Master’s Thesis
Query Enhancement for Patent Prior Art Search
with Keyterm Dependency Relations
and Semantic Tags
Khanh Ly Nguyen
Department of Computer Science
KAIST
2011
Query Enhancement for Patent Prior Art Search with
Keyterm Dependency Relations and Semantic Tags
Advisor: Professor Sung-Hyon Myaeng
By
Khanh Ly Nguyen
Department of Computer Science
KAIST
A thesis submitted to the faculty of KAIST in partial fulfillment of the
requirements for the degree of Master of Science in Engineering in the
Department of Computer Science. The study was conducted in accordance with
the Code of Research Ethics.¹
November 23rd, 2011
Approved by
Professor Sung-Hyon Myaeng
¹ Declaration of Ethical Conduct in Research: I, as a graduate student of KAIST, hereby declare that I have not committed any acts that may damage the credibility of my research. These include, but are not limited to: falsification, thesis written by someone else, distortion of research findings, or plagiarism. I affirm that my thesis contains honest conclusions based on my own careful research under the guidance of my thesis advisor.
Query Enhancement for Patent Prior Art Search with
Keyterm Dependency Relations and Semantic Tags
Khanh Ly Nguyen
The present dissertation has been approved by the dissertation committee
as a master’s thesis at KAIST
November 23rd, 2011
Committee head
Committee member
Committee member
Professor Sung-Hyon Myaeng
Professor Alice Oh
Professor Ho-jin Choi
ICE
20074298
Khanh Ly Nguyen. Query Enhancement for Patent Prior Art Search with Keyterm Dependency Relations and Semantic Tags. Department of Information and Communication Engineering. Advisor: Prof. Sung-Hyon Myaeng.
ABSTRACT
The increasing number of patent applications and granted patents constantly creates critical demand for patent search. Prior art search is one of the most common patent searches; its goal is to find patent documents that constitute prior art to a given patent. Current patent search systems are mostly keyword-based, and because of the complex structure and length of patent documents, they do not perform very well. In this research, we propose a new query formulation method for patent prior art search that identifies the most discriminative terms using keyterm dependency relations. Instead of using only a single field, our intention is to select the most significant field or combination of fields to find the best one for query formulation. Furthermore, we concentrate on the appropriate number of key terms that should be included in the query by performing experiments with different query sizes. Specifically, our work differs from all previously reported ones in that, instead of using only keyterm extraction based on dependency relations, we combine keyterm extraction with semantic tags identified from patent documents to find prior art patents with similar IPC codes. For prior art search evaluation, we apply a re-ranking method based on the IPC classification codes assigned to the patent document, since this method can aid the identification of prior art patents without the extra cost of expert judgments or the incompleteness of citations.
In this work, 36 experiments were conducted, and the results show that the proposed method achieves significant improvement over the baseline. The results indicate that: 1) for query formulation from a single field, e.g., a query formulated from the top 10 terms of the Abstract, an 18% improvement at the Sub-class level, 17% at the Main-group level, and 13% at the Sub-group level over the baseline method can be obtained; 2) for query formulation from combined fields, e.g., a query formulated from the top 10 terms of the Abstract and the top 10 terms of the Claims, we can achieve a 16% improvement at the Sub-class level, 16% at the Main-group level, and 13% at the Sub-group level over the baseline; 3) for query formulation combined with semantic tags, e.g., for the Abstract, a 46% improvement at the Sub-class level, 42% at the Main-group level, and 45% at the Sub-group level over the baseline can be achieved. The experimental results also show that extracting terms from the Description gives the best performance over all other fields (e.g., the Abstract and Claims fields). The reason is that the Description field specifies what the process or method of the invention is and how it differs from previous patents and technology. By identifying IFPS terms from the Description, we achieve better performance when IFPS is used as a query by itself, and the best results when it is used in combination with query selection by KDR. IFPS includes information about the areas a patent belongs to (IF), which is very helpful for identifying the IPC sub-classes of a patent document, and it includes Problems/Solutions (PS) related to the limitations of previous patents and the effects of the present invention, which may help identify the IPC main-groups or sub-groups of the query patent. We also show the effectiveness of IFPS terms when IFPS is combined with KDR terms or tf*idf; adding IFPS yields much more improvement, which makes it a good strategy for query expansion.
Our experiments show that terms about the details of the method or process of the invention are more significant for query formulation from the Abstract or Claims, while terms about limitations or effects are more significant for query formulation from the Description.
Keywords: patent retrieval, prior art retrieval, keyterm dependency relations, semantic tags, term co-occurrences
Contents
List of Tables .................................................................................... 1
List of Figures................................................................................... 3
List of Abbreviations ........................................................................ 4
Chapter 1. Introduction ..................................................................... 5
1.1 Motivation .............................................................................5
1.2 Contribution...........................................................................................6
1.3 Thesis Organization ...............................................................................7
Chapter 2. Background and Related Works ....................................... 8
2.1 IPC Taxonomy.......................................................................8
2.2 Patent document.....................................................................................9
2.3 Patent Analysis and Processing ............................................................10
2.4 Prior art Search ....................................................................................12
Chapter 3. Methodology ................................................................. 14
3.1 System Description ..............................................................14
3.2 Pre-processing & Stop-word Removal .................................................15
3.3 IFPS Extraction....................................................................................15
3.3.1 Extraction of Invention Fields: ................................................................ 17
3.3.2 Extraction of Problems and Solutions .....................................................18
3.4 Term Extraction based on Keyterm Dependency Relations...................20
3.5 Query Formulation...............................................................................21
3.6 Patent Indexing & Retrieval .................................................................23
3.7 Re-ranking based on IPC .....................................................................23
Chapter 4. Experiments & Results .................................................. 25
4.1 Data Collection and Preparation ...........................................25
4.2 Evaluation Metrics ...............................................................................26
4.3 Experimental Results ...........................................................................26
4.3.1 Data Statistics ..........................................................................................26
4.3.2 Baseline ...................................................................................................27
4.3.3 Experimental Results...............................................................................28
4.5 Discussion ...........................................................................................40
4.6 Conclusions & Future works ................................................................43
References ...................................................................................... 45
Acknowledgement .......................................................................... 52
Curriculum Vitae ............................................................................ 53
Publication...................................................................................... 54
List of Tables
Table 1. IPC classifications ..................................................................................................8
Table 2. IPC sections............................................................................................................8
Table 3. Details of Experimental Query Sets ......................................................................22
Table 4. Statistics of the relevant IPC codes .......................................................................25
Table 5. Statistics of the data extracted by KDR method.....................................................27
Table 6. Statistics of Semantic tags: Invention Fields (IF), Problems/Solutions (PS) ...........27
Table 7. Results of queries extracted from Abstract field ....................................................29
Table 8. Results of queries extracted from Claims field ......................................................29
Table 9. Results of queries extracted from Description field ...............................................30
Table 10. MAP values of queries from different fields ........................................................31
Table 11. Results of queries formulated from field combinations of Abstract and Claims ...32
Table 12. Results of queries formulated from field combinations of Abstract and Description
.......................................................................................................................... 33
Table 13. Results of queries formulated from field combinations of Claims and Description ....33
Table 14. Results of queries formulated from field combinations of Abstract, Claims and
Description ........................................................................................................33
Table 15. Comparison of KDR queries when Titles are added ...........................................34
Table 16. Results of IFPS queries compared with tf-idf queries. .........................................35
Table 17. Results of KDR queries when adding IFPS compared with tf-idf queries for
Abstract .............................................................................................................35
Table 18. Results of KDR queries when adding IFPS compared with tf-idf queries for Claims
.......................................................................................................................... 36
Table 19. Results of KDR queries when adding IFPS compared with tf-idf queries for field
combination of Abstract and Claims...................................................................36
Table 20. Results of KDR queries formulated by top 10 terms from Abstract expanded with
IFPS compared with tf*idf queries formulated by top 10 terms from Abstract and
top 58 terms from Description............................................................................37
Table 21. Results of KDR queries formulated by top 20 terms from Claims expanded with
IFPS compared with tf*idf queries formulated by top 10 terms from Claims and
top 58 terms from Description............................................................................38
Table 22. Results of KDR queries formulated by combination of top 10 terms from Abstract
plus top 20 terms from Claims expanded with IFPS compared with that of tf*idf
queries expanded with top 58 terms from Description. .......................................38
Table 23. Results of tf*idf queries formulated by the top 10 terms from Abstract when IFPS
is added..............................................................................................................39
Table 24. Experiments results of tf*idf queries formulated by top 10 terms from Abstract plus
IFPS compared with top 10 terms from Abstract plus top 58 terms from
Description ........................................................................................................39
Table 25. Results of tf*idf queries formulated by combination of top 10 terms from Abstract,
top 20 terms from Claims when IFPS is added. ..................................................40
Table 26. Example of top 10 terms extracted by KDR and tf*idf for Abstract field. ............41
Table 27. Example of top 10 terms extracted by KDR and tf*idf for Claims field. ..............41
Table 28. Example of top 30 terms extracted by KDR and tf*idf for Description field. .........41
Table 29. Example of top 40 ~ 60 terms extracted by KDR and tf*idf for Description field. ..42
List of Figures
Figure 1. Example of a section hierarchy in IPC...................................................................9
Figure 2. Example of a patent document............................................................................. 11
Figure 3. System Architecture ............................................................................................ 15
Figure 4. Example of relations between semantic tags and the IPC of H01M......................17
Figure 5. Example of Invention Field under applicant defined tag. .....................................18
Figure 6. Example of Invention Field with no applicant defined tag. ..................................18
Figure 7. Sample extracted IFPS ........................................................................................19
Figure 8. Problem Sample Patterns.....................................................................................19
Figure 9. Solution Sample Patterns.....................................................................................19
Figure 10. Example of a KDR graph ..................................................................................21
Figure 11. Steps of re-ranking based on IPC codes and Example ........................................24
List of Abbreviations
IPC   International Patent Classification
USPTO United States Patent and Trademark Office
NLP   Natural Language Processing
KDR   Keyterm Dependency Relation
IFPS  Invention Fields, Problems and Solutions
IF    Invention Field
PS Problem/Solution
NP Noun Phrase
VP Verb Phrase
SC Sub-class
MG Main-group
SG Sub-group
Chapter 1. Introduction
1.1 Motivation
Patents are legal documents granted by patenting authorities to protect inventors’ rights. Patents can
show technological details and relations, reveal business trends, inspire novel industrial solutions, or inform
investment policies, making them valuable to industry, business, law, etc. Companies and inventors who wish to file
a new patent are interested in verifying that the invention is actually new, with reference to the current state-of-
the-art. At the same time, they are interested in discovering infringements for their granted patents. Researchers
are interested in finding patent information to avoid duplicating solutions already covered by patents and/or to
freely reuse expired patents. Managers can exploit patent information to assess competitors, partners and sup-
pliers, and to identify technology trends and new business opportunities. Finally, venture capitalists and inves-
tors can leverage patent-related information to select the targets of their financial operations while third party
resellers can benefit from patent information when selecting their suppliers.
Nowadays, the number of applications and granted patents has been increasing constantly worldwide,
creating a greater demand for patent analysis and search. Patent analysis aims at obtaining relevant patents and
analyzing them in aggregate to produce patent maps [22] or discover trends [3] [23] [24]. Patent search is
often conducted by inventors, patent attorneys, and technical and business experts to find prior art and mitigate
risks. There are many types of patent search, such as prior art or novelty search, validity search, infringement
search, and clearance search. Prior art search is one of the most common; its goal is to find patent
documents that constitute prior art to a given patent [17]. A prior art search is performed before filing an
application to ascertain the patentability of an invention, to determine its novelty, or to invalidate a patent’s
claim of originality. During the application process, patent experts examine a patent against all patents
with an earlier priority date, called prior art patents, to ensure that the claims of the target patent and the
prior art patents do not overlap.
Traditionally, the patent examination process has been performed manually, which requires considerable
effort and expertise in information retrieval, domain-specific technologies, and business intelligence. In addition,
the increasing amount of patent information and the growing need to access it require the development
of automatic search tools and new methodologies that shorten search times for patent awarding and
also increase the quality of the patents granted. Current patent search systems are mostly keyword-based,
and due to the complex structure of patent documents, they do not perform very well. The success of automatic
prior art search relies on the selection of relevant search queries; however, queries are typically built by extracting
terms from textual fields of patent documents using TF/IDF [20][21], by giving preference to terms in the Title
[19], or by taking all the words, mostly from the Claims, without filtering [18]. Queries may contain many
ambiguous and vague terms that harm retrieval results, and it is difficult to know which terms are good for
formulating a query. Retrieved patent documents may be related and relevant to the query yet not contain the
exact keyword or phrase; conversely, many patents returned for a query do contain the keywords but have
no relevance to the searcher’s intent. The query size is also difficult to set: with few query terms, processing
is fast but the information need may be misrepresented; with many terms, processing time becomes
prohibitive and the query may contain many noisy terms. Therefore, good query formulation is a key factor
in achieving good effectiveness, and in this work a query enhancement method for patent prior art search with
keyterm dependency relations and semantic tags is proposed.
1.2 Contribution
Previous work on prior-art search focused on formulating queries by identifying keywords
from patent documents based on weighting schemes and by using cited patents to add additional keywords
for a higher probability of retrieving relevant results. Most of these works stress the complexity of patent
structure and use the Claims field extracted from a topic patent, considered the most informative part
of a patent, as the search query. We propose a method for better query formulation to improve prior art
search in the patent domain, based on keyterm dependency relations in combination with semantic tags (IFPS).
Instead of using only the Claims field as reported in [4] [5] [18], our idea is to use keywords from different
fields of a patent document and from combinations of those fields, to explore which is best for query
formulation. Furthermore, we concentrate on deciding the number of keywords that should be included in the
query by performing experiments with different query sizes. To improve query formulation, we suggest an
algorithm to select the most representative terms based on dependency relations between terms in the same
sentence. More specifically, instead of using only the term-ranking algorithm based on dependency relations,
we use this method in combination with semantic tags extracted from patent documents, which has not been
done before, to get better results. For prior art search evaluation, we apply a re-ranking method based on the
IPC classification codes assigned to the patent document, since this method can aid the identification of prior
art patents without the extra cost of expert judgments or the incompleteness of citations.
1.3 Thesis Organization
The organization of this thesis is as follows. Chapter 2 describes the background and related work,
introducing the IPC taxonomy and the characteristics of a patent document; we also discuss related work
on patent analysis and patent search. Chapter 3 gives details of our methodology for
query term extraction from patent documents using keyterm dependency relations and semantic tags. Chapter 4
reports evaluation results for our method using the corpus provided by NTCIR-6 and the test set we crawled
from the USPTO database; it also discusses the results and compares our approach with the
tf*idf baseline. Lastly, we conclude with a short summary and mention future work.
Chapter 2. Background and Related Works
2.1 IPC Taxonomy
The International Patent Classification (IPC) is a standard taxonomy developed by the World Intellectual
Property Organization (WIPO) for classifying patents and patent applications. The IPC covers all areas of
technology, including chemistry, mechanics, and electronics, which are classified into sections, classes, subclasses,
and groups, so that a specific topic can be identified easily and accurately. The IPC contains eight sections, about
120 classes, about 630 subclasses, 6,923 groups, and approximately 60,700 subgroups, as shown in Table 1. Each
section is designated by a capital letter from A to H, as shown in Table 2.
Table 1. IPC classifications

Table 2. IPC sections
A  Human Necessities
B  Performing Operations; Transporting
C  Chemistry; Metallurgy
D  Textiles; Paper
E  Fixed Constructions
F  Mechanical Engineering; Lighting; Heating; Weapons
G  Physics
H  Electricity

Each section is subdivided into classes, whose symbols consist of the section symbol followed by a
two-digit number; the classification symbol is thus made up of a letter denoting the IPC section, followed by a
two-digit number denoting the IPC class (e.g., H01). Optionally, the classification can be followed by a
letter denoting the IPC subclass (e.g., H01M), a number of one to three digits denoting the IPC main group
(e.g., H01M 11), a forward slash (“/”), and a number of one to three digits denoting the IPC subgroup
(e.g., H01M 11/00). An example of a section hierarchy in IPC is shown in Figure 1.
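The symbol structure just described can be decomposed mechanically. The following sketch (our illustration, not part of the thesis) splits a full IPC symbol into its hierarchy levels:

```python
import re

# A full IPC symbol such as "H01M 11/00": section letter, two-digit class,
# subclass letter, then an optional main group and subgroup.
IPC_PATTERN = re.compile(
    r"^(?P<section>[A-H])"
    r"(?P<cls>\d{2})"
    r"(?P<subclass>[A-Z])"
    r"(?:\s*(?P<main_group>\d{1,3})/(?P<subgroup>\d{1,3}))?$"
)

def parse_ipc(symbol: str) -> dict:
    """Split an IPC classification symbol into its hierarchy levels."""
    m = IPC_PATTERN.match(symbol.strip())
    if not m:
        raise ValueError(f"Not a valid IPC symbol: {symbol!r}")
    return m.groupdict()
```

For instance, `parse_ipc("H01M 11/00")` yields section H, class 01, subclass M, main group 11, and subgroup 00; a subclass-level symbol such as `"H01B"` parses with the group fields left empty.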
Figure 1. Example of a section hierarchy in IPC
H          ELECTRICITY
H01        BASIC ELECTRIC ELEMENTS
H01B       CABLES; CONDUCTORS; INSULATORS; SELECTION OF MATERIALS FOR THEIR CONDUCTIVE, INSULATING, OR DIELECTRIC PROPERTIES (selection for magnetic properties H01F 1/00; waveguides H01P; installation of cables or lines, or of combined optical and electric cables or lines H02G)
H01C       RESISTORS
…
H01M       PROCESSES OR MEANS, e.g. BATTERIES, FOR THE DIRECT CONVERSION OF CHEMICAL ENERGY INTO ELECTRICAL ENERGY (electrochemical processes or apparatus in general C25; semiconductor or other solid state devices for converting light or heat into electrical energy H01L, e.g. H01L 31/00, H01L 35/00, H01L 37/00)
H01M 2/00  Constructional details, or processes of manufacture, of the non-active parts
H01M 2/02  . Cases, jackets, or wrappings (working of plastics or substances in a plastic state)
H01M 2/04  .. Lids or covers
…

2.2 Patent document

A patent document contains many items for analysis, including structured items, which are uniform in semantics and format (e.g., patent number, application number, patent class, filing date, issue date), and unstructured items, which are free text of varying length (e.g., Title, Abstract, Claims, and Description). For
patent search, the unstructured items are the important text fields that dominate query formulation, but they
are known to be difficult to process with traditional text processing and retrieval techniques because of
technical terminology, vague terms, and complex structure. This complicates the examination of a patent
document and particularly affects the patent retrieval process, because a precise query is needed to narrow
the search and find relevant documents. Titles provide the least reliable clues for determining the relevance
of a patent because they contain relatively short keywords and phrases. Abstracts are more informative
and provide summaries of the claimed inventions. Claims include the most central content of a patent and
disclose the novelty of an invention. By reading the claims, we can determine the scope of the patent;
however, the claims may be directed to only one embodiment, method, etc. Typically, claims are written in
a patent-specific style consisting of one long sentence, starting with “We claim:” or “What is claimed is:”
followed by item lists initialized by numbers. Claims consist of multiple components (e.g., parts of a machine
or substances of a chemical compound), and the terminology used in patent claims is highly dependent on the
specific topic domain of the patent (e.g., secondary battery). There are two types of claims: independent
claims and dependent claims. Independent claims broadly describe the invention and do not refer to any other
claim, while dependent claims depend on one or several claims and add further limitations of a specific
compound or condition. The Description is often the longest text in a patent; it elaborates the same content
as the Claims in detail and is further segmented into the Field of the Invention; the Background/Prior Art,
which describes the problems that the invention solves and the related technical background; the Summary,
often a restatement of the Claims showing how the problem is solved; and the Detailed Description, a full
description of the invention with definitions, specific examples, and drawings. Some patents may not have
all these segments. Figure 2 shows an example of a patent document.
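The claim format described above (one long sentence beginning “We claim:” or “What is claimed is:”, numbered items, dependent claims referring back to earlier claims) can be operationalized with a small splitter. This is an illustrative sketch under those formatting assumptions, not code from the thesis:

```python
import re

def split_claims(claims_text):
    """Split a Claims section into numbered claims and flag dependent ones.
    A claim that refers back to another claim ("... of claim 1 ...") is
    treated as dependent; the preamble patterns are a simple heuristic."""
    body = re.sub(r"^\s*(We claim:|What is claimed is:)\s*", "",
                  claims_text, flags=re.IGNORECASE)
    # Split at newlines that are immediately followed by "N. " numbering.
    parts = re.split(r"\n\s*(?=\d+\.\s)", body.strip())
    claims = []
    for part in parts:
        m = re.match(r"(\d+)\.\s*(.*)", part, flags=re.DOTALL)
        if not m:
            continue
        num, text = int(m.group(1)), m.group(2).strip()
        dependent = re.search(r"\bclaim\s+\d+\b", text, re.IGNORECASE) is not None
        claims.append({"number": num, "text": text, "dependent": dependent})
    return claims
```

Applied to the claims of Figure 2, such a splitter would mark claim 1 (which stands alone) as independent and a claim phrased “The battery of claim 1, wherein …” as dependent.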
2.3 Patent Analysis and Processing
Patent analysis and processing have long been considered useful for product innovation,
patent maps [22], and trend discovery [3] [23] [24]. Patent documents contain important technical knowledge and
research results; however, they are lengthy and contain so much terminology that their analysis requires
considerable human effort. To obtain useful information, experts have to scan or read indexed patent documents
from long lists of noisy results, which is a time-consuming task requiring careful manual selection. With
the rapid increase in the number of patent documents, there is a need for ways to obtain useful and precise
patent information quickly. Thus, automatic tools for patent analysis and processing that assist innovators and
patent applicants are in great demand.
Figure 2. Example of a patent document
Patent No.: 7,897,284
Publication Date: March 1, 2011
Title: Lithium secondary battery
Abstract: A lithium secondary battery is provided with a positive electrode, a negative electrode (1), a separator interposed between the positive and negative electrodes…
Claims: What is claimed is:
  1. A lithium secondary battery comprising: a negative electrode comprising a negative electrode current collector and…
  2. …
Description:
  - Field of the Invention: The present invention relates to lithium secondary batteries, and more particularly…
  - Description of Related Art: Various mobile communication devices and mobile electronic devices such as laptop computers have emerged in recent years, and this has led to…
  - Summary of the Invention: Accordingly, it is an object of the present invention to provide a lithium secondary battery that is capable of minimizing…
  - Description of the Drawings: FIG. 1 is a cross-sectional view illustrating a portion of the negative electrode of one example of the lithium secondary battery…
  - Detailed Description of the Invention: The lithium secondary battery according to the present invention is provided with…

A patent document contains structured and unstructured text. There have been approaches to patent
analysis based on structured text for years [28] [29]. For unstructured text, text mining techniques have been
applied to derive information that assists patent analysis and processing tasks. In [30], a number of text mining
techniques, including text segmentation, summary extraction, feature selection, term association, cluster
generation, topic identification, and topic mapping, were developed. Sentences were extracted by simply splitting
on periods and question marks. Each sentence was then weighted by the number of keywords, title words, and
clue words it contains, the position of the paragraph containing the sentence, and the position of the sentence
within that paragraph. Natural language processing techniques have also been applied to the analysis of patent
claims [33], to similarity analysis [34] [35], and to improving the readability of patents [36] [37]. In [32], an NLP
methodology was proposed for analyzing patent claims that combines symbolic grammar formalisms with data-
intensive methods while enhancing analysis robustness. [31] focused on discovering significant-rare words from
Claims in a patent database. [33] presented a system called COA (Claim Originality Analysis) to assess a patent
by evaluating the originality of the invention described in it. [32] proposed an approach to find problem-solved
concepts in the Detailed Description of a patent document by assigning more weight to sentences appearing
at the beginning and end of the text.
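The sentence-weighting scheme attributed to [30] above can be sketched as follows; the particular weights and the linear position bonus are our illustrative assumptions, not values reported in [30]:

```python
def score_sentence(sentence, keywords, title_words, clue_words,
                   para_pos, sent_pos, n_paras, n_sents):
    """Weight a sentence by its keyword, title-word, and clue-word counts,
    and by the positions of its paragraph and of the sentence within that
    paragraph, in the spirit of the text-mining pipeline of [30].
    The 0.5 factors and position bonuses are illustrative assumptions."""
    tokens = sentence.lower().split()
    score = (sum(t in keywords for t in tokens)
             + 0.5 * sum(t in title_words for t in tokens)
             + 0.5 * sum(t in clue_words for t in tokens))
    # Favor sentences in earlier paragraphs and earlier in their paragraph.
    score += 1.0 - para_pos / max(n_paras, 1)
    score += 1.0 - sent_pos / max(n_sents, 1)
    return score
```

A summary is then obtained by extracting the highest-scoring sentences.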
2.4 Prior art Search
Since patents play an important role in Intellectual Property protection, there has recently been growing
interest in patent retrieval research. Patent retrieval research started with NTCIR-3 [1], which released patent
test collections enabling researchers to systematically evaluate their methodologies. In NTCIR-4 [2], a
search task related to prior-art search, also called the invalidity search run, was presented. The goal of the prior
art search was to identify previously published patents in the collection that constitute the closest prior art to a
given patent. Such a search is also relevant for technical surveys, for evaluating novelty, and for invalidating a
patent’s claim of originality. Prior-art search is an essential step in the examination process of patent applications;
however, it is time-consuming and laborious. Therefore, it is important to identify discriminative terms from
patent documents to formulate queries that enhance the success of automatic prior art search.
Previously, most research focused on the Claims field, applying different term weighting methods
for query generation, because Claims are thought to be the most informative part of a patent. To enhance the
initial query, query expansion techniques were applied by extracting effective and concrete terms from the
Description field. In [4], Claims are first broken into components, and each component is separately used to
extract query terms. Query expansion is performed by using these terms to extract related query terms from the
Detailed Description field of the patent document. A similar work was introduced in [14], where query terms were
components extracted from the topic claim and expanded with query terms from explanation sentences
related to those components in the Detailed Description. [26] studies the rhetorical structure of a claim and applies
an associative document retrieval method, in which a document is used as a query to search for other similar
documents. To produce an initial query, each claim is segmented into multiple components, which are then used to
search for candidate documents on a component-by-component basis. [27] uses two retrieval stages based on
query term extraction from Claims: in the first stage, the query from Claims is used to retrieve the top
1,000 patents, and in the second stage several techniques are used to re-rank them. Evaluation
results show that the effectiveness of the method varies depending on the test sets used. However, [18]
does not distill any terms from the Claims but takes all the words as one long query, and no query expansion was
- 13 -
done. In [18] [20] [21], queries are built by extracting terms from one of the text fields such as Title, Abstract,
Claims, Description. [46] shows that words from the title field are the least useful for prior-art search, and
TF/IDF and terms in Title are given preference [19].
In NTCIR-4, expert judgments were used as the relevance data for evaluation; however, only 34 query topics were developed because of the cost. Also in NTCIR-4, IPC codes were integrated with a probabilistic retrieval model to estimate the document prior. In NTCIR-5 and NTCIR-6, citations were used and thousands of query topics were developed automatically. However, citation-based evaluation has limitations: citations have different degrees of relevance; the citation language may differ from the patent application's own publication language; and citation lists are incomplete [47]. Therefore, IPC codes have been used as a feature for document filtering and patent retrieval. In [26], the authors use IPC codes for document filtering and show how this feature can help in patent retrieval.
Chapter 3. Methodology
This chapter describes our methodology for query formulation for patent prior art search.
3.1 System Description
Figure 3 shows the overall architecture of our patent retrieval system. The system is composed of query formulation based on semantic tags (IFPS), query formulation based on keyterm dependency relations, patent indexing, patent retrieval, re-ranking, and evaluation of the results.
In query formulation based on semantic tags (the left part of Figure 3), only the Description fields of a patent document are extracted as input text. IFPS extraction has two steps: extracting the Invention Field (IF) and extracting Problems-Solutions (PS). Details of IF extraction are discussed in Section 3.3. For Problem-Solution extraction, Description fields are parsed with the OpenNLP POS tagger [49]. We then apply a pattern-matching method to extract Problems and Solutions. After that, we combine IF and PS and remove all redundant words and stop-words to formulate queries.
In query formulation based on keyterm dependency relations, terms are extracted from each patent field. Each field is used as input text and pre-processed: all redundant words and stop-words are removed, and the text is segmented into sentences by stop punctuation. Each sentence is represented as a graph in which each term is a node. Node weights are calculated and ranked in descending order, and the top N terms from each field are selected to formulate queries. Queries are also formulated by merging queries from different fields.
Queries are then sent to the patent index to retrieve similar documents with relevance scores. The retrieved documents are re-ranked based on their IPC codes. Finally, we evaluate the results.
Figure 3. System Architecture
3.2 Pre-processing & Stop-word Removal
Given the input text, we segment the text into sentences by stop punctuation and delete unimportant terms from the input text field. We used van Rijsbergen's stopword list, which consists of 570 words. We also used a stopword list of 150 words that we collected manually from patent documents; these words occur frequently in patents but are meaningless for the content of a patent (e.g., figure, relates, said, apparatus, method, device). In total, we used 720 stopwords in this research.
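The pre-processing step above can be sketched as follows. This is a minimal illustration, not our actual implementation: the tiny stop-word sets stand in for van Rijsbergen's 570-word list and our 150 patent-specific words, and the punctuation set is an assumption.

```python
import re

# Hypothetical miniature stop-word lists standing in for the full
# 570-word general list and the 150 patent-specific words described above.
GENERAL_STOPWORDS = {"the", "a", "of", "is", "with", "and", "can", "be", "to"}
PATENT_STOPWORDS = {"figure", "relates", "said", "apparatus", "method", "device"}
STOPWORDS = GENERAL_STOPWORDS | PATENT_STOPWORDS

def preprocess(text):
    """Segment text into sentences by stop punctuation, then drop stop-words."""
    sentences = [s.strip() for s in re.split(r"[.!?;]", text) if s.strip()]
    result = []
    for sentence in sentences:
        terms = [t for t in re.findall(r"[a-z0-9-]+", sentence.lower())
                 if t not in STOPWORDS]
        if terms:
            result.append(terms)
    return result

print(preprocess("The apparatus relates to a lithium battery. Said device is improved."))
```

With the toy lists above, both sentences reduce to their content words only.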
3.3 IFPS Extraction
A patent can be captured by a few elements such as “What problem does the invention solve?”, “What is
the invention?”, and “What does the invention do?” [48]. The problem that an invention is going to solve is
called the Problem (P), and what the invention is and what it does to solve the problem is called the Solution (S). For example, "long-cycle-life lithium secondary cells" is the problem and "utilizing a lithium ionic reaction" is the solution.
Problems and Solutions can be shared by a number of patents in the same domain. Intuitively, Problems and Solutions are important for describing the gist of a patent without processing lengthy queries. In addition, the Invention-Field (IF) of a patent helps describe the area of technology (domain) to which a patent belongs (e.g., secondary battery). As in [45], patents belong to the same domain if they share the same semantic tags, which are defined by patent applicants (e.g., Means of solving the problems, Effects of the invention, Application field, etc.). For US patents, we examine whether semantic tags such as Invention-Field, Problems, or Solutions are related to the IPC codes, which can aid in identifying related prior art patents. By extracting Invention-Fields, Problems, and Solutions, we can therefore reduce the size of an input patent query, which helps in searching for prior art efficiently. In the domain of "secondary battery", for example, we can retrieve about 1,000 patents from the USPTO database index for each patent query; however, it would be very difficult and time-consuming to process words one by one to identify which patents are most related to the topic patent.
Figure 4 shows an example of IFPS phrases extracted from patents in the Batteries domain and how IFPS phrases assist in identifying the IPCs to which a patent belongs. As shown in the figure, IF phrases such as "rechargeable batteries", "alkaline storage batteries", or "high power nickel metal hydride batteries" contain the word "batteries", which is the same as the name of the IPC Sub-class (Batteries). Likewise, Problem phrases such as "positive electrode", "positive electrode material", or "composite positive electrode material" all contain the word "electrode", which is the same as the name of the IPC Main-group (Electrodes). Similarly, Solution phrases such as "nickel based multi metals oxide", "nickel hydroxide material", and "composite nickel electrode hydroxide particulate" all contain "nickel", which is the same as the name of the IPC Sub-group.
The task of IFPS phrase extraction is to extract Invention Fields and Problem/Solution phrases from a patent document, in the following three steps:
Step 1: Extract the Invention Fields from each patent document.
Step 2: Parse the patent document with the OpenNLP POS tagger, then apply pattern matching to extract key terms as noun phrases or verb phrases.
Step 3: Merge the key phrases from the two candidate lists and remove all stop-words and redundant words. The result is a set of IFPS phrases.
The details of Steps 1 and 2 are described as follows:
Figure 4. Example of relations between semantic tags and the IPC of H01M
3.3.1 Extraction of Invention Fields:
Invention Fields are extracted from the Description field, generally from its first sub-field. Although all patent documents have a similar structure, as described in Section 2.2, only the titles of the main fields are fixed; the names of the detailed elements are normally labeled by applicants with no standard format. Automatically identifying the Invention Fields part of patent documents is therefore a challenge: a number of patents have a separate Invention Field but use inconsistent headings such as "Field of the Invention" or "Technical Field". Other patents, instead of separating it, include the Invention Field within variations of "Background of the Invention", "Prior art", "Description of the Related Art", etc. Meanwhile, a few patents (about 10%) have no Invention Field at all. To extract Invention Fields, we extract the subfields whose headings contain variations of "Field of the Invention". As shown in Figure 5, the Description field contains a separate Invention Field under the applicant-defined tag "Field of the Invention"; the Invention Field we extract is shown in italics.
For patents that do not have a separate "Field of the Invention", we extract the sentences containing "relates to", which is mostly used for describing Invention Fields, from the variations of "Background of the Invention". As shown in Figure 6, the Description field contains a non-separated Invention Field, namely the first two sentences containing "relates to" under the tag "Background of the Invention". The reason we do not use the clue "relates to" for all extractions is that it would bring in too many sentences from the other fields (e.g., Embodiments, Detailed Description) that may not be relevant to Invention Fields.
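The two-way heuristic above (prefer a tagged "Field of the Invention" subfield; otherwise fall back to "relates to" sentences in the Background only) can be sketched as follows. The dictionary input format and heading regexes are illustrative assumptions, not our exact implementation.

```python
import re

def extract_invention_field(subfields):
    """subfields: dict mapping an applicant-defined heading -> subfield text.
    First try a separate 'Field of the Invention'-style subfield; otherwise
    fall back to 'relates to' sentences, but only within Background-like parts."""
    # Case 1: a separate, explicitly tagged Invention Field.
    for tag, text in subfields.items():
        if re.search(r"field of the invention|technical field", tag, re.I):
            return text.strip()
    # Case 2: 'relates to' sentences inside Background variants only,
    # so clue sentences from Embodiments etc. are not picked up.
    for tag, text in subfields.items():
        if re.search(r"background of the invention|prior art|related art", tag, re.I):
            sents = [s.strip() + "." for s in text.split(".") if "relates to" in s]
            if sents:
                return " ".join(sents)
    return ""  # about 10% of patents have no Invention Field

doc = {"Background of the Invention":
       "The present invention relates to a lithium secondary battery. Prior cells degrade quickly."}
print(extract_invention_field(doc))
```

Restricting the fallback to Background-like subfields mirrors the design choice explained above.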
Figure 5. Example of Invention Field under applicant defined tag.
Figure 6. Example of Invention Field with no applicant defined tag.
3.3.2 Extraction of Problems and Solutions
Problems and Solutions are also extracted from the Description field, since Problems are often found in "Background of the Invention" while Solutions are mostly found in the Summary part that follows. We use the OpenNLP POS tagger to tag the input descriptions. We manually analyzed patent documents to generate a list of clues that are commonly used across a large number of patents, and we use these linguistic clues to create 24 patterns so that Problems and Solutions can easily be extracted through pattern matching.
After extracting IFPS, we remove all redundant words and stop-words using the stopword list (Section 3.2) to formulate queries.
Figure 7 shows examples of sentences that contain Invention Fields, Problems, and Solutions in italics.
Figure 7. Sample extracted IFPS
Figure 8 and Figure 9 show examples of Problem and Solution patterns, respectively. The rationale behind developing patterns based on clues is as follows. Since Problems and Solutions can be noun phrases (NP) or verb phrases (VP), we observe certain patterns that indicate PS phrases. For example, the pattern "method/NN for/IN" is usually followed by a noun phrase, and a noun phrase can precede the pattern "can/MD be/VB provided/VBN". The patterns were extracted by analyzing the data and generalized by unifying them with common syntactic labels. For example, "can/MD be/VB provided/VBN" and "can/MD be/VB obtained/VBN" are unified as "can/MD be/VB provided/VBN|obtained/VBN".
Figure 8. Problem Sample Patterns
Figure 9. Solution Sample Patterns
Problem sample patterns (pattern : sample input text):
{NP} + can/MD be/VB provided/VBN|improved/VBN|obtained/VBN : "Thus, a nickel-metal hydride storage battery of high capacity can be provided."
{NP} + is/VBZ improved/VBN in/IN + {NP} : "…alkaline storage battery is improved in charging efficiency"
apparatus/NN|methods/NNS for/IN + {NP | V-ing + NP} : "Apparatus for integrated-circuit battery devices"
provided/VBN + {NP} : "There is provided an alkaline storage battery …"
Solution sample patterns (pattern : sample input text):
utilizing/VBG|employing/VBG|using/VBG + {NP} : "lithium secondary battery employing the nonaqueous electrolyte."
to/TO + {VBG + NP} : "fuel cell within an external/JJ circuit/NN"
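As an illustration of the pattern-matching step, two of the sample patterns can be written as regular expressions over "word/TAG" strings such as a POS tagger produces. This is a simplified sketch: the tag inventory and the two regexes below only approximate the 24 patterns, and the tagged example strings are constructed by hand.

```python
import re

# Two illustrative patterns modeled on Figures 8 and 9, written as regexes
# over "word/TAG" token strings (tag set and pattern inventory simplified).
PROBLEM_PATTERN = re.compile(
    r"((?:\S+/(?:NN|NNS|JJ)\s+)+)can/MD\s+be/VB\s+(?:provided|improved|obtained)/VBN")
SOLUTION_PATTERN = re.compile(
    r"(?:utilizing|employing|using)/VBG\s+((?:\S+/(?:NN|NNS|JJ|DT)\s*)+)")

def strip_tags(tagged):
    """Drop the /TAG suffix from each token, recovering the surface phrase."""
    return " ".join(tok.split("/")[0] for tok in tagged.split())

tagged = ("a/DT nickel-metal/JJ hydride/NN storage/NN battery/NN "
          "can/MD be/VB provided/VBN")
m = PROBLEM_PATTERN.search(tagged)
print(strip_tags(m.group(1)))  # nickel-metal hydride storage battery
```

A match on the NP preceding "can/MD be/VB provided/VBN" yields the Problem phrase, exactly as in the first row of the pattern table.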
3.4 Term Extraction based on Keyterm Dependency Relations
In the traditional term extraction process, researchers represent a document as a bag of words (BOW) and use some criteria to score and sort the words. In that way, words are assumed to be independent, and the strong dependency relations they can have for describing an event are ignored. Hence these methods often introduce noise, which reduces precision and recall. Recent studies have demonstrated the importance of dependency relations between words for topic tracking [40], text classification [38] [39], query expansion [41], and passage retrieval [42].
Our approach is based on the method for building a Keyword Dependency Profile [40], which utilizes keyword dependency relations (KDR) for topic tracking. The intuition is that a word may have strong dependency relations with other words, which is important for describing information. Keyword dependency relations are evaluated by co-occurrence in the same sentence. The weight of a keyword is high if it depends strongly on important other keywords, where a word's initial weight is its tf-idf value. For example, consider two sentences:
Sentence 1: "Thus, a nickel-metal hydride storage battery of high capacity can be provided."
Sentence 2: "Nickel based alloy layer for perpendicular recording media."
In the first sentence, "nickel" and "battery" co-occur in the same sentence, so "nickel" is probably related to the Battery domain. In the second sentence, "nickel" co-occurs with other words but not with "battery", so it is not related to the Battery domain.
Figure 10 is an example of the word graph built from the sentence "Thus, a hydride storage battery of high capacity can be provided". After removing all stop-words and punctuation, we have the list of keywords K = {hydride, storage, battery, capacity, provided}. The graph of words is created as shown in Figure 10. The number on a word is its initial importance weight, calculated by tf*idf, and the numbers beside the edges (e.g., 1, 2, 3) are the frequencies with which the two words co-occur in the same sentences. After weighting by KDR, the term weights change as shown in Figure 10. Words that have more edges and are connected to more important nodes receive higher weights; for example, "hydride" has a higher weight since it connects to important nodes such as "capacity", "storage", and "battery".
Figure 10. Example of a KDR graph
An input text is segmented into sentences. After removing all redundant words and stop-words, each sentence is represented as a graph in which each word is a node n and each edge e is the connection between two nodes. The weight of each node is calculated by the following formula:
w(n_k) = Σ_{l=1}^{m} tf(n_l) × idf(n_l) × (tf(e_{k,l}) + 1), in which
w(n_k) is the weight of node n_k,
m is the number of nodes that co-occur with node n_k in the same sentence,
n_l is a neighbor node that co-occurs with node n_k in the same sentence,
tf(n_l) is the term frequency of node n_l,
idf(n_l) is the inverse document frequency of node n_l, and
e_{k,l} is the edge connecting node n_k and node n_l, with tf(e_{k,l}) the frequency of edge e_{k,l} in the input text.
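The node-weighting step can be sketched as follows, under our reading of the formula: the weight of node k sums, over each co-occurring neighbor l, tf(n_l) × idf(n_l) × (tf(e_{k,l}) + 1). The function names and the toy document-frequency values are illustrative assumptions.

```python
import math
from collections import Counter

def kdr_weights(sentences, num_docs, doc_freq):
    """sentences: list of token lists (stop-words already removed).
    Returns one weight per term: sum over co-occurring neighbors l of
    tf(n_l) * idf(n_l) * (tf(e_{k,l}) + 1)."""
    tf = Counter(t for s in sentences for t in s)
    idf = {t: math.log(num_docs / doc_freq.get(t, 1)) for t in tf}
    edge = Counter()                      # co-occurrence frequency of term pairs
    for s in sentences:
        terms = sorted(set(s))
        for i, a in enumerate(terms):
            for b in terms[i + 1:]:
                edge[(a, b)] += 1
    weight = {t: 0.0 for t in tf}
    for (a, b), f in edge.items():        # each edge contributes to both ends
        weight[a] += tf[b] * idf[b] * (f + 1)
        weight[b] += tf[a] * idf[a] * (f + 1)
    return weight

sents = [["hydride", "storage", "battery", "capacity"]]
w = kdr_weights(sents, num_docs=100,
                doc_freq={"hydride": 5, "storage": 50, "battery": 20, "capacity": 10})
top = sorted(w, key=w.get, reverse=True)
print(top)
```

As in the Figure 10 discussion, a term's weight grows with the number and importance of its neighbors rather than with its own frequency alone.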
3.5 Query Formulation
Query formulation for prior art search selects the most informative terms from a query patent document to form an effective query that can distinguish relevant from non-relevant patents in the collection. Our experiments focus on selecting the most significant field, or combinations of fields, to explore which is best for query formulation. Instead of selecting terms from a single field only, we choose a particular number of terms from each field for a better query formulation. We determine the appropriate number of terms to include in the query through experiments with different query sizes. We do not use Titles as a separate field because they contain relatively few, short key words and phrases, but we want to see the value of Titles when combined with other fields. To get better results, instead of using only terms extracted by keyterm dependency relations, we combine them with IFPS phrases extracted from the patent documents, which has not been done before.
Table 3. Details of Experimental Query Sets
After extracting terms from a field by applying a weighting algorithm (e.g., KDR or tf*idf), query formulation is performed by taking the N terms with the highest weights and formulating them as one query. We used four types of queries in the retrieval process: queries from a separate field, queries from merged fields, queries merged with Titles, and queries merged with IFPS. Table 3 shows the details of the query sets.
For separate fields, we choose a query size of N = 10 for Abstract, since the minimum number of terms in an Abstract is 11; N = 10, 20 for Claims, since the minimum number of terms is 23; and N = 10, 20, 30, 40, 60 for Description, since the minimum number of terms in a Description is 61.
No. | Query Set | Query Description
Separate Field
1 | Abs | Top 10 words from Abstract
2 | Cla | Top N words from Claims (N = 10, 20)
3 | Des | Top N words from Description (N = 10 ~ 60)
Merged Field
4-5 | 10Abs + 10/20Cla | Top 10 words from Abstract + top 10/20 words from Claims
6 | 10Abs + 60Des | Top 10 words from Abstract + top 60 words from Description
7-8 | 10/20Cla + 60Des | Top 10/20 words from Claims + top 60 words from Description
9 | 10Abs + 10/20Cla + 60Des | Top 10 words from Abstract + top 10/20 words from Claims + top 60 words from Description
Merged with Titles
10 | Tit + 10Abs | Tit + top 10 words from Abstract
11 | Tit + 20Cla | Tit + top 20 words from Claims
12 | Tit + 60Des | Tit + top 60 words from Description
13 | Tit + 10Abs + 20Cla | Tit + top 10 words from Abstract + top 20 words from Claims
14 | Tit + 10Abs + 20Cla + 60Des | Tit + top 10 words from Abstract + top 20 words from Claims + top 60 words from Description
Merged with IFPS
15 | IFPS | IFPS phrases
16 | IFPS + 10Abs | IFPS + top 10 words from Abstract (by KDR)
17 | IFPS + 20Cla | IFPS + top 20 words from Claims (by KDR)
18 | IFPS + 10Abs + 20Cla | IFPS + top 10 words from Abstract (by KDR) + top 20 words from Claims (by KDR)
For combinations of fields, we have seven sets of queries: the top 10 terms from Abstract merged with the top 10/20 terms from Claims; the top 10 terms from Abstract merged with the top 60 terms from Description; the top 10/20 terms from Claims merged with the top 60 terms from Description; and the top 10 terms from Abstract merged with the top 10/20 terms from Claims and the top 60 terms from Description.
For combinations with Titles, we only choose the number of terms that gives the best results for each field, such as 20 terms from Claims and 60 terms from Description. For combinations with IFPS, we have four different sets of queries, as shown in Table 3. We do not combine IFPS with terms from Description, since IFPS phrases are identified from that field.
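The top-N selection and field-merging steps described above can be sketched as follows. This is a minimal illustration; the function names and toy weights are assumptions, and it shows only the mechanics of ranking and merging, not our full pipeline.

```python
def formulate_query(field_weights, top_n):
    """field_weights: term -> weight for one field (e.g. KDR or tf*idf scores).
    Returns the top-N highest-weighted terms as one query string."""
    ranked = sorted(field_weights, key=field_weights.get, reverse=True)
    return " ".join(ranked[:top_n])

def merge_queries(*queries):
    """Merge queries from different fields, dropping duplicate terms."""
    seen, merged = set(), []
    for q in queries:
        for term in q.split():
            if term not in seen:
                seen.add(term)
                merged.append(term)
    return " ".join(merged)

abstract = {"battery": 3.0, "nickel": 2.5, "electrode": 1.0}
claims = {"electrode": 4.0, "hydride": 2.0}
q = merge_queries(formulate_query(abstract, 2), formulate_query(claims, 2))
print(q)  # battery nickel electrode hydride
```

Merging with de-duplication corresponds to the "merged field" query sets in Table 3, where the same term may rank highly in more than one field.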
3.6 Patent Indexing & Retrieval
The Lemur Indri search engine, which combines a language model with an inference network framework, was used to index the patent documents and retrieve similar documents for a given query. No stemming or stop-word removal was done at this stage. For each query, we retrieved the top 1,000 patents from the corpus that contain the query terms, and each retrieved patent was assigned a relevance score. We used the Okapi BM25 ranking formula, which has been used in many retrieval systems.
3.7 Re-ranking based on IPC
Patent retrieval results can be evaluated by comparison with expert judgments, citations, or IPC codes. Among the three, we employ IPC codes to improve the ranked list of retrieved patent documents. Since IPC codes are assigned to each patent by patent experts, they avoid the limitations of the other two evaluation methods, such as cost and incomplete citations.
Figure 11 shows the steps of re-ranking based on IPC codes, with an example. As shown in Figure 11, after the retrieval process, the top N retrieved patents with their relevance scores are re-ranked using their IPC codes. The retrieved patent ids are mapped to the IPC codes in the IPC list provided by NTCIR-6. We also separate the IPC codes into Sub-class, Main-group, and Sub-group, to which weights are applied separately. Then, we calculate the average relevance score for each Sub-class, Main-group, and Sub-group code with the following formula:
Score(X) = (Σ_{i=1}^{n} score(p_i)) / n
Here, X indicates an IPC code, n is the number of patents within the top N retrieved patents to which X is assigned, and score(p_i) is the relevance score of the i-th such patent.
Figure 11. Steps of re-ranking based on IPC codes and Example
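The IPC-averaging step can be sketched as follows, assuming Score(X) is the mean relevance score of the top-N retrieved patents that carry IPC code X; the data layout (lists of ids, scores, and codes) is an illustrative assumption.

```python
from collections import defaultdict

def ipc_scores(ranked, patent_ipcs):
    """ranked: list of (patent_id, relevance_score) for the top-N retrieved.
    patent_ipcs: patent_id -> list of IPC codes (at one level, e.g. sub-class).
    Returns Score(X) = sum of relevance scores of patents assigned X / n."""
    totals, counts = defaultdict(float), defaultdict(int)
    for pid, score in ranked:
        for ipc in set(patent_ipcs.get(pid, [])):
            totals[ipc] += score
            counts[ipc] += 1
    return {ipc: totals[ipc] / counts[ipc] for ipc in totals}

ranked = [("p1", 0.9), ("p2", 0.5), ("p3", 0.1)]
ipcs = {"p1": ["H01M"], "p2": ["H01M"], "p3": ["G06F"]}
print(ipc_scores(ranked, ipcs))
```

A code shared by several highly ranked patents (here H01M) receives a high average score and can then be used to promote patents carrying it.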
Chapter 4. Experiments & Results
In this chapter, we present the experimental results. Query formulation was carried out with two methods: keyword dependency relations and semantic tags. We compared the results with tf-idf and evaluated them at three IPC levels: Sub-class, Main-group, and Sub-group. We chose tf-idf as the baseline since it has been used in numerous previous studies. The following sections describe our data collection, evaluation metrics, experimental results, and future work.
4.1 Data Collection and Preparation
For the experiments, three data sets were collected: (1) a corpus of patent documents to search; (2) a set of patent queries; and (3) relevance judgments for the patent documents in the corpus. For (1), we use the NTCIR-6 corpus, which consists of 1,315,470 patent documents published from 1993 to 2002. All fields of a patent (e.g., Title, Abstract, Claims, Description) were indexed with the Lemur toolkit [25]. For (2), we chose patent documents belonging to the Batteries domain (H01M) published from 2003 onward. Although the data is related to Batteries, our methodology can also be applied to other domains, since a patent is assigned more than one IPC code. To collect the patent documents, we issued several queries containing the International Classification for Batteries (e.g., ICL/H01M004/52 to search for H01M 4/52) on the USPTO patent search website [44] and crawled only patents published after 2003. For (3), the experiments are evaluated by comparing the IPC codes of the query patents with the IPC codes of the retrieved patents. For the retrieved patents, we use the list of IPC codes provided by NTCIR-6 for the corpus. For the query patents, we extract the IPC codes from the IPC tags in the crawled data, since the NTCIR-6 list does not cover patents' IPCs after 2002. IPC codes are separated into Sub-class, Main-group, and Sub-group. This separation allows us to apply weights to each level separately and determine their relative influence on the ranked list of retrieved documents. Table 4 shows the statistics of the relevant IPC codes of the patent query set: the total numbers of relevant IPC codes are 415, 1,001, and 1,829 for Sub-class, Main-group, and Sub-group, respectively.
Table 4. Statistics of the relevant IPC codes
Relevant IPCs Total Min Max
Sub-class 415 1 7
Main-group 1001 2 8
Sub-group 1829 6 23
4.2 Evaluation Metrics
To evaluate the experimental results, we chose the metrics most commonly used in IR; for prior art search tasks these are Mean Average Precision (MAP), Recall (R), and Precision@5. The measures were computed with the trec_eval program [50], written by Chris Buckley and commonly used in the TREC evaluation campaigns.
Recall measures the ability of a system to retrieve the relevant items in the collection for a query:
Recall = (number of relevant documents retrieved) / (total number of relevant documents in the collection)
Precision measures the ability of a system to retrieve only relevant items:
Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
P@N is the precision over different rank cutoffs. Rather than considering the entire retrieved set, which can be quite large, we pick a rank cutoff and calculate the precision among only the top N ranked documents; in this work, the cutoff is 5. A high P@5 indicates that a user can expect to see many relevant documents near the top, even if the precision of the entire retrieved set is low.
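The three definitions above can be sketched as follows (a minimal illustration; in our experiments these measures are computed by trec_eval, not by this code):

```python
def recall(retrieved, relevant):
    """Fraction of all relevant items that appear in the retrieved list."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def precision_at(retrieved, relevant, n):
    """Precision among only the top-n ranked documents (P@n)."""
    top = retrieved[:n]
    return len(set(top) & set(relevant)) / n

retrieved = ["d1", "d9", "d3", "d7", "d2", "d8"]   # ranked result list
relevant = ["d1", "d2", "d3", "d4"]                # relevance judgments
print(recall(retrieved, relevant))                 # 0.75
print(precision_at(retrieved, relevant, 5))        # 0.6
```

Here 3 of the 4 relevant documents are retrieved (recall 0.75), and 3 of the top 5 ranked documents are relevant (P@5 = 0.6).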
4.3 Experimental Results
In this section, we present and discuss the experimental results for query formulation based on the two methods: keyword dependency relations and semantic tags. The results demonstrate the effectiveness of these methods in patent prior art search.
4.3.1 Data Statistics
Table 5 and Table 6 show the statistics of keyterm dependency relations (KDR) and semantic tags (IFPS), respectively.
As shown in Table 5, the average number of graphs per document is 40, while the minimum and maximum numbers of graphs are 6 and 256, respectively. Each graph contains an average of 8 nodes with 7 edges.
Graphs have between 3 and 58 nodes, and each node has between 2 and 57 edges.
Statistics of KDR Average Min Max
#of graphs per document 40 6 256
#of nodes per graph 8 3 58
#of edges per node 7 2 57
Table 5. Statistics of the data extracted by KDR method
Statistics of IFPS Average Min Max
#of IF per document 16 0 113
#of PS per document 53 7 263
#of IFPS per document 58 11 279
Table 6. Statistics of Semantic tags: Invention Fields (IF), Problems/Solutions (PS)
As shown in Table 6, the average number of Invention-Field phrases per document is 16, while the minimum and maximum are 0 and 113, respectively. About 10% of patents have no Invention-Field, as mentioned in Section 3.3.1, which explains why the minimum can be 0. The average number of Problem/Solution phrases per document is 53, with a minimum of 7 and a maximum of 263. Only about 1% of patents have more than 260 Problem/Solution phrases; since those patents discuss the problems or solutions of previous inventions at length, more candidate phrases are generated for them than for other patents. After merging the Problem/Solution and Invention-Field phrases and removing all redundant words and stop-words, the average number of IFPS phrases per document is 58, with a minimum of 11 and a maximum of 279.
4.3.2 Baseline
We chose tf*idf as our baseline for comparison, since it is the most commonly used method in previous patent prior art search work. Tf*idf is a statistical measure of how important a word is to a document in a collection: the importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Tf*idf assigns to a term t in document d the weight:
tf*idf(t, d) = tf(t, d) * idf(t)
where tf(t, d) is the frequency with which term t appears in document d, and idf(t) is the inverse document frequency, calculated as:
idf(t) = log( |D| / |{d ∈ D : t ∈ d}| )
where |D| is the total number of documents in the corpus and |{d ∈ D : t ∈ d}| is the number of documents in which term t appears.
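The baseline weighting can be sketched as follows over a toy tokenized corpus (an illustration of the standard tf*idf definition, not our indexing code):

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of documents, each a list of terms.
    Returns, per document, term weights tf(t, d) * log(|D| / df(t))."""
    N = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))  # document frequency
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
            for doc in corpus]

corpus = [["battery", "battery", "nickel"],
          ["battery", "fuel"],
          ["fuel", "cell"]]
weights = tf_idf(corpus)
# "battery" occurs twice in document 0 and in 2 of the 3 documents:
print(weights[0]["battery"])  # 2 * log(3/2)
```

A term frequent in one document but common across the corpus (like "battery" here) is down-weighted relative to a rarer term with the same in-document frequency.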
4.3.3 Experimental Results
We ran experiments on query formulation with each method individually to test its effectiveness, and with the two methods combined to increase retrieval effectiveness. Section 4.3.3.1 presents the experimental results for keyword dependency relations (KDR), Section 4.3.3.2 the results for semantic tags (IFPS), and Section 4.3.3.3 the results of combining the two methods.
4.3.3.1 Results of Query Formulation by Keyword Dependency Relation (KDR)
Keyword Dependency Relation for Query Formulation from Separate Field
In general, KDR outperformed tf-idf in selecting terms from a patent field. KDR calculates the importance
of terms using dependency relations between key terms, while tf-idf does not reflect this information. As
mentioned in Section 2.2, Titles are very short, so using Titles as a separate field is not advantageous.
Consequently, we only use Title terms in combination with other queries to see their value. For query
formulation from separate fields, we ran experiments on three fields: Abstract, Claims, and Description.
For Abstract, we ran experiments with only 10 query terms since the minimum query size in our data is 11
terms. For Claims, we ran experiments with query lengths of 10 and 20 terms, and for Description, the query
length ranges from 10 to 60 terms. The results show that increasing the query length improves the scores;
however, once the query length exceeds a limit, adding more terms does not further improve the performance.
Tables 7~9 report the performance of query formulation by the keyword dependency relation method for
three fields: Abstract, Claims, and Description. Furthermore, Table 10 gives a brief summary of the most
important results for query formulation from each field.
Table 7 shows the experimental results for queries extracted from the Abstract field. As shown in Table 7,
KDR achieves significant MAP improvements of 18.2% for Sub-class, 17.1% for Main-group, and 13.4% for
Sub-group. Although Recall decreases very slightly for Sub-class (-0.2%) and Main-group (-1.3%), it increases
for Sub-group (+7.3%).
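The Recall, MAP, and P@5 figures reported throughout these tables can be computed as in the following sketch (the toy ranking and helper names are illustrative; the standard definitions of the measures are assumed):

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved documents that are relevant (P@k)."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def doc_recall(retrieved, relevant):
    """Fraction of relevant documents found anywhere in the ranking."""
    return sum(1 for d in retrieved if d in relevant) / len(relevant)

def average_precision(retrieved, relevant):
    """Mean of precision values at each rank where a relevant document
    appears, normalized by the number of relevant documents.
    MAP is this value averaged over all query patents."""
    hits, total = 0, 0.0
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

ranking = ["p3", "p1", "p7", "p2", "p9"]  # retrieved patents, best first
relevant = {"p1", "p2"}                   # cited prior art for the query patent
print(precision_at_k(ranking, relevant, k=5))  # 0.4
print(average_precision(ranking, relevant))    # (1/2 + 2/4) / 2 = 0.5
```

Averaging `average_precision` over every query patent in the test set yields the MAP values compared in Tables 7~25.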
Query   Method   Sub-Class                  Main-Group                 Sub-Group
Length           Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
10      Tf-idf   0.985    0.5663   0.1951   0.928    0.4422   0.3221   0.768    0.2062   0.2254
10      KDR      0.983    0.6691   0.2100   0.916    0.5178   0.3803   0.824    0.2338   0.2434
        (vs.)    -0.2%    +18.2%   +7.9%    -1.3%    +17.1%   +18.1%   +7.3%    +13.4%   +8.0%
Table 7. Results of queries extracted from Abstract field
Table 8 shows the experimental results for queries extracted from the Claims field. For top 10 term queries,
KDR improves MAP over tf-idf by 7.0% for Sub-class and 5.5% for Main-group, and is slightly worse than
tf-idf for Sub-group (-3.7%). Since the Claims field characteristically contains many components, term
frequencies in this field are higher than in other fields. That explains why, for top 20 term queries, KDR does
not work as well as tf-idf.
Query   Method   Sub-Class                  Main-Group                 Sub-Group
Length           Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
10      Tf-idf   0.988    0.6401   0.2172   0.945    0.4992   0.3566   0.840    0.2370   0.2680
        KDR      0.986    0.6852   0.2189   0.942    0.5270   0.3730   0.830    0.2282   0.2393
        (vs.)    -0.2%    +7.0%    +0.8%    -0.3%    +5.5%    +4.6%    -1.1%    -3.7%    -10.7%
20      Tf-idf   0.995    0.7299   0.2352   0.960    0.5670   0.4066   0.885    0.2747   0.3008
        KDR      0.995    0.7205   0.2352   0.965    0.5639   0.4041   0.875    0.2554   0.2721
        (vs.)    0%       -1.3%    0%       +0.5%    -0.5%    -0.6%    -1.1%    -7.0%    -9.5%
Table 8. Results of queries extracted from Claims field
Table 9 shows the experimental results for queries extracted from the Description field. As shown in Table 9,
queries with more terms gave better results. For queries of 40 to 60 terms, KDR gave better results than
tf-idf, while for queries of 10 to 30 terms it did not work as well as tf-idf. This is because short tf-idf
queries contain terms about problems or solutions, which have very high frequency in Description, while
KDR queries include abbreviations. In the example below, the top 10 term query by KDR contains 3 abbreviations
(e.g., ag, ca, cr) while the tf*idf query does not contain any abbreviations and contains more terms about
problems (e.g., charging, overvoltage, storage). Example:
Top 10 terms by KDR: "electrode positive nickel oxide temperature ag ca material effect cr".
Top 10 terms by tf*idf: "charging nickel overvoltage storage alkaline batteries absorbing positive effect oxygen".
Query   Method   Sub-Class                  Main-Group                 Sub-Group
Length           Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
10      Tf-idf   0.993    0.7243   0.2402   0.961    0.5582   0.4008   0.878    0.2662   0.2852
        KDR      0.990    0.7024   0.2279   0.944    0.5167   0.3811   0.850    0.2399   0.2623
        (vs.)    -0.3%    -3.0%    -5.1%    -1.7%    -7.4%    -4.9%    -3.2%    -9.8%    -8.0%
20      Tf-idf   0.990    0.7598   0.2475   0.964    0.5913   0.4328   0.899    0.2824   0.3000
        KDR      0.988    0.7468   0.2443   0.948    0.5633   0.4090   0.884    0.2619   0.2934
        (vs.)    -0.2%    -1.7%    -1.3%    -1.6%    -4.7%    -5.5%    -1.7%    -7.3%    -2.2%
30      Tf-idf   0.995    0.7745   0.2533   0.964    0.5958   0.4361   0.913    0.2921   0.3066
        KDR      0.990    0.7653   0.2557   0.959    0.5760   0.4131   0.907    0.2745   0.2984
        (vs.)    -0.5%    -1.2%    +1.0%    -0.5%    -3.3%    -5.3%    -0.6%    -6.0%    -2.6%
40      Tf-idf   0.998    0.7658   0.2533   0.965    0.5920   0.4344   0.912    0.2883   0.3033
        KDR      0.993    0.7773   0.2525   0.945    0.5875   0.4205   0.910    0.2796   0.3033
        (vs.)    -0.5%    +1.5%    -0.3%    -2.0%    +0.7%    -3.2%    -0.2%    -3.0%    0%
50      Tf-idf   0.995    0.7725   0.2566   0.962    0.5933   0.4344   0.917    0.2894   0.3139
        KDR      0.995    0.7894   0.2582   0.965    0.6024   0.4385   0.908    0.2863   0.3164
        (vs.)    0%       +2.2%    +0.6%    +0.3%    +1.5%    +0.9%    -0.9%    -1.1%    +0.8%
60      Tf-idf   0.995    0.7751   0.2541   0.962    0.5921   0.4336   0.909    0.2890   0.3123
        KDR      1.000    0.8083   0.2557   0.966    0.6086   0.4320   0.910    0.2885   0.3230
        (vs.)    +0.5%    +4.3%    +0.6%    +0.4%    +2.8%    -0.3%    +0.1%    -0.1%    +3.4%
Table 9. Results of queries extracted from Description field
Table 10 is a brief summary of the most important results for query formulation from each field, as
previously shown in Tables 7~9. As shown by Table 10, KDR gave better results than tf-idf. The query length
for Abstract is 10 since the minimum number of Abstract terms in our data is 11; the most appropriate query
length for Claims is 20, and for Description it is 60. Also shown by Table 10, extracting terms from Description
gave the best performance over all other fields (i.e., Abstract and Claims). The reason is that the Description
field contains a specification of what the process or method of the invention is and how it differs from previous
patents and technology. Also, Description starts with general background information about the area the invention
belongs to and then gives increasing levels of detail about the invention. Therefore, terms from Description mostly
relate to the area a patent belongs to, which helps identify the IPC Sub-class; terms about limitations of previous
patents and effects of the present invention may help identify the IPC Main-group; and terms about details of the
method or process can help identify the IPC Sub-group.
Field        Query    Method   Sub-Class   Main-group   Sub-group
             Length
Abstract     10       Tf-idf   0.5663      0.4422       0.2062
                      KDR      0.6691      0.5178       0.2338
Claims       20       Tf-idf   0.7299      0.5670       0.2747
                      KDR      0.7205      0.5639       0.2890
Description  60       Tf-idf   0.7751      0.5921       0.2885
                      KDR      0.8083      0.6086       0.2885
Table 10. MAP values of queries from different fields
Keyword Dependency Relation for Query Formulation from Combined Fields
In this section, we present experimental results for query formulation by combining fields to see the
effectiveness of field combinations. Queries were created by selecting the top N terms from field A combined
with the top N terms from field B; all redundant terms are removed. There are 4 types of combined queries:
3 are combinations of two fields (Abstract and Claims; Abstract and Description; Claims and Description), and
the other is a combination of three fields (Abstract, Claims, and Description). For Abstract, the query length is
10 since the minimum number of terms is 11. For Claims and Description, we chose query sizes of 20 and 60,
respectively, since those are the most appropriate numbers of terms for those fields, as shown by our previous
experiments (Section 4.3.3.1).
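The combination step above can be sketched as follows (the function and variable names and the toy term lists are illustrative, not from the thesis):

```python
def combine_fields(*ranked_term_lists):
    """Merge ranked top-N term lists from several fields into one query,
    keeping first-seen order and dropping redundant (duplicate) terms."""
    seen, query = set(), []
    for terms in ranked_term_lists:
        for t in terms:
            if t not in seen:
                seen.add(t)
                query.append(t)
    return query

# e.g. top terms from Abstract combined with top terms from Claims
abstract_top = ["electrode", "nickel", "oxide"]
claims_top = ["nickel", "battery", "metal"]
print(combine_fields(abstract_top, claims_top))
# ['electrode', 'nickel', 'oxide', 'battery', 'metal']
```

The same merge applies unchanged to three-field queries (Abstract + Claims + Description).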
Tables 11~14 show the results of combined queries by KDR compared with tf*idf. As shown in Tables
11~14, KDR gave better performance than the baseline for all combined queries. We obtained the best results
for the three-field combination queries, which were formulated from the top 10 terms from Abstract, the top 20
terms from Claims, and the top 60 terms from Description. Additionally, among the queries formulated from
combinations of two fields, queries formulated from the top 20 terms from Claims combined with the top 60
terms from Description achieved better results than the others.
Table 11 shows the results for queries formulated from the top 10 terms from Abstract combined with the
top 10 terms from Claims. As shown in Table 11, KDR achieves MAP improvements of 16.0% for Sub-class,
16.2% for Main-group, and 13.3% for Sub-group over the baseline.
Query: top 10 Abs + top 10 Cla
Method   Sub-Class                  Main-Group                 Sub-Group
         Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
Tf-idf   1.000    0.6676   0.2336   0.953    0.5200   0.3787   0.856    0.2475   0.2770
KDR      0.995    0.7742   0.2418   0.955    0.6041   0.4369   0.885    0.2803   0.2918
(vs.)    -0.5%    +16.0%   +3.5%    +0.2%    +16.2%   +15.4%   +3.4%    +13.3%   +5.3%
Table 11. Results of queries formulated from field combinations of Abstract and Claims
Table 12 shows the results for queries formulated from the top 10 terms from Abstract combined with the
top 60 terms from Description. As shown in Table 12, KDR achieves MAP improvements of 5.2% for Sub-class,
4.4% for Main-group, and 1.5% for Sub-group over the baseline.
Table 13 shows the results for queries formulated from the top 20 terms from Claims combined with the
top 60 terms from Description. As shown in Table 13, KDR achieves MAP improvements of 4.9% for Sub-class,
3.5% for Main-group, and 1.8% for Sub-group over the baseline.
Table 14 shows the results for queries formulated from the three-field combination of the top 10 terms
from Abstract, the top 20 terms from Claims, and the top 60 terms from Description. As shown in Table 14,
KDR achieves MAP improvements of 5.4% for Sub-class, 4.8% for Main-group, and 2.4% for Sub-group over
the baseline.
Query: top 10 Abs + top 60 Des
Method   Sub-Class                  Main-Group                 Sub-Group
         Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
Tf-idf   0.995    0.7719   0.2574   0.965    0.5902   0.4295   0.909    0.2900   0.3131
KDR      0.998    0.8124   0.2615   0.963    0.6159   0.4459   0.907    0.2944   0.3270
(vs.)    +0.3%    +5.2%    +1.6%    -0.2%    +4.4%    +3.8%    -0.2%    +1.5%    +4.4%
Table 12. Results of queries formulated from field combinations of Abstract and Description
Query: top 20 Cla + top 60 Des
Method   Sub-Class                  Main-Group                 Sub-Group
         Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
Tf-idf   0.995    0.7779   0.2590   0.964    0.5978   0.4377   0.909    0.2919   0.3205
KDR      0.998    0.8157   0.2648   0.967    0.6184   0.4434   0.992    0.2970   0.3221
(vs.)    +0.3%    +4.9%    +2.2%    +0.3%    +3.5%    +1.3%    +9.1%    +1.8%    +0.5%
Table 13. Results of queries formulated from field combinations of Claims and Description
Query: top 10 Abs + top 20 Cla + top 60 Des
Method   Sub-Class                  Main-Group                 Sub-Group
         Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
Tf-idf   0.995    0.7770   0.2557   0.966    0.5957   0.4328   0.910    0.2931   0.3189
KDR      0.998    0.8189   0.2656   0.970    0.6245   0.4533   0.923    0.3002   0.3328
(vs.)    +0.3%    +5.4%    +3.9%    +0.4%    +4.8%    +4.7%    +1.4%    +2.4%    +4.4%
Table 14. Results of queries formulated from field combinations of Abstract, Claims and Description
Keyword Dependency Relation for Query Formulation Combined with Titles
To see the value of Titles combined with other fields, 5 experiments on queries extracted by KDR and
combined with Titles were performed. We then compare the results of each field with those of the field combined
with Title (e.g., the top 10 words from Abstract compared with the same query when Title terms are added). As
can be seen from Table 15, queries extracted by KDR with Titles added improve performance compared to
queries without Titles, especially when Titles are added to queries extracted from Abstract (MAP improvements
of 16.2% for Sub-class, 14.3% for Main-group, and 19.8% for Sub-group). Also shown in Table 15, some of the
results are very slightly worse for queries combining 10 terms from Abstract, 20 terms from Claims, and 60
terms from Description: Sub-class (-0.2%) and Main-group (-0.1%), while Sub-group is slightly improved
(+0.1%). The experiments show the importance of Titles when they are added to other fields.
Query                           Sub-Class                  Main-Group                 Sub-Group
                                Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
10 Abs (KDR)                    0.983    0.6691   0.2100   0.916    0.5178   0.3803   0.824    0.2338   0.2434
Tit + 10 Abs                    0.993    0.7777   0.2410   0.951    0.5919   0.4336   0.872    0.2801   0.3033
  (vs.)                         +1.0%    +16.2%   +14.8%   +3.8%    +14.3%   +14.0%   +5.8%    +19.8%   +24.6%
20 Cla                          0.995    0.7205   0.2352   0.965    0.5639   0.4041   0.875    0.2554   0.2721
Tit + 20 Cla                    0.995    0.7837   0.2525   0.971    0.6078   0.4525   0.905    0.2864   0.3090
  (vs.)                         0%       +8.8%    +7.4%    +0.6%    +7.8%    +12.0%   +3.4%    +12.1%   +13.6%
60 Des                          1.000    0.8083   0.2557   0.966    0.6086   0.4320   0.910    0.2885   0.3230
Tit + 60 Des                    0.995    0.8233   0.2664   0.962    0.6267   0.4467   0.915    0.2967   0.3221
  (vs.)                         -0.5%    +1.9%    +4.2%    -0.4%    +3.0%    +3.4%    +0.5%    +2.8%    -0.3%
10 Abs + 20 Cla                 0.998    0.7738   0.2533   0.974    0.6050   0.4426   0.914    0.2881   0.3000
Tit + 10 Abs + 20 Cla           1.000    0.7948   0.2549   0.975    0.6208   0.4574   0.917    0.2973   0.3164
  (vs.)                         +0.2%    +2.7%    +0.6%    +0.1%    +2.6%    +3.3%    +0.3%    +3.2%    +5.5%
10 Abs + 20 Cla + 60 Des        0.998    0.8189   0.2656   0.970    0.6245   0.4533   0.923    0.3002   0.3328
Tit + 10 Abs + 20 Cla + 60 Des  0.998    0.8176   0.2664   0.969    0.6227   0.4508   0.925    0.3006   0.3328
  (vs.)                         0%       -0.2%    +0.3%    -0.1%    -0.3%    -0.6%    +0.2%    +0.1%    0%
Table 15. Comparison of KDR queries when Titles are added
4.3.3.2 Results of Query Formulation by Semantic Tags (IFPS)
We ran experiments on semantic tags that identify Invention Field (IF) and Problem/Solution (PS) phrases
in Description. We compared the results of IFPS with those of tf*idf queries formulated from 58 Description
terms, since IFPS is extracted from Description and 58 is the average number of terms in IFPS queries.
Table 16 shows the experimental results of IFPS compared with the baseline. As shown in Table 16, IFPS
achieves MAP improvements of 7.6% for Sub-class, 4.3% for Main-group, and 1.7% for Sub-group over the
baseline.
Query   Method   Sub-Class                  Main-Group                 Sub-Group
Length           Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
58      Tf*idf   0.998    0.7720   0.2533   0.962    0.5930   0.4328   0.914    0.2890   0.3131
        IFPS     0.998    0.8305   0.2689   0.972    0.6185   0.4574   0.918    0.2940   0.3213
        (vs.)    0%       +7.6%    +6.2%    +1.0%    +4.3%    +5.7%    +0.4%    +1.7%    +2.6%
Table 16. Results of IFPS queries compared with tf-idf queries.
4.3.3.3 Results of Query Formulation by Combining Keyword Dependency Relation (KDR) and
Semantic Tags (IFPS)
In order to validate the usefulness of IFPS in patent prior art search, we conducted experiments
combining keyword dependency relations and IFPS.
Table 17 shows the experimental results of KDR queries with IFPS added, compared with tf*idf queries
for the Abstract field. As shown in Table 17, KDR queries with IFPS added achieve significant MAP
improvements of 46.8% for Sub-class, 42.6% for Main-group, and 45.3% for Sub-group over the baseline. Our
experiments show that KDR gave better results than the baseline, and when IFPS is added the results are
significantly further improved.
Method               Sub-Class                  Main-Group                 Sub-Group
                     Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
10 Abs (Tf*idf)      0.985    0.5663   0.1951   0.928    0.4422   0.3221   0.768    0.2062   0.2254
10 Abs (KDR) + IFPS  0.998    0.8312   0.2664   0.970    0.6306   0.4672   0.921    0.2997   0.3328
  (vs.)              +1.3%    +46.8%   +36.5%   +4.5%    +42.6%   +45.0%   +19.9%   +45.3%   +47.6%
Table 17. Results of KDR queries when adding IFPS compared with tf-idf queries for Abstract
Table 18 shows the experimental results of KDR queries with IFPS added, compared with tf*idf queries
for the Claims field. As shown in Table 18, KDR queries with IFPS added achieve significant MAP
improvements of 12.5% for Sub-class, 11.5% for Main-group, and 10.7% for Sub-group over the baseline.
Method               Sub-Class                  Main-Group                 Sub-Group
                     Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
20 Cla (Tf*idf)      0.995    0.7299   0.2352   0.960    0.5670   0.4066   0.885    0.2747   0.3008
20 Cla (KDR) + IFPS  0.998    0.8209   0.2705   0.973    0.6321   0.4631   0.923    0.3041   0.3344
  (vs.)              +0.3%    +12.5%   +15.0%   +1.4%    +11.5%   +13.9%   +4.3%    +10.7%   +11.2%
Table 18. Results of KDR queries when adding IFPS compared with tf-idf queries for Claims
Table 19 shows the experimental results of queries formulated from the top 10 terms from Abstract
combined with the top 20 terms from Claims by KDR, with IFPS added, compared with those by tf*idf. As
shown in Table 19, KDR queries with IFPS added achieve significant MAP improvements of 11.3% for
Sub-class, 10.3% for Main-group, and 9.4% for Sub-group over the baseline.
Method                        Sub-Class                  Main-Group                 Sub-Group
                              Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
10 Abs + 20 Cla (Tf*idf)      0.998    0.7379   0.2418   0.964    0.5737   0.4066   0.878    0.2831   0.3016
10 Abs + 20 Cla (KDR) + IFPS  0.995    0.8216   0.2721   0.973    0.6330   0.4730   0.923    0.3097   0.3410
  (vs.)                       -0.3%    +11.3%   +12.5%   +0.9%    +10.3%   +16.3%   +5.1%    +9.4%    +13.1%
Table 19. Results of KDR queries when adding IFPS compared with tf-idf queries for field combination of
Abstract and Claims
The experimental results show that queries extracted by KDR and expanded with IFPS terms achieve
significant improvements over the baseline (especially for queries from Abstract, where we achieve the highest
MAP improvements of 46.8% for Sub-class, 42.6% for Main-group, and 45.3% for Sub-group). This shows that
using KDR can change the weights of terms in a way that improves retrieval performance, and that adding IFPS
terms yields much more improvement, which makes it a good strategy for query expansion.
In order to validate the effectiveness of KDR in combination with IFPS, we conducted experiments
comparing our approach with tf*idf queries expanded with the same number of terms from Description. As
explained before, IFPS is extracted from Description and 58 is its average number of terms; therefore, we
expanded the tf*idf queries with the top 58 terms from Description. This ensures that the expanded KDR and
tf*idf queries add the same number of terms.
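This size-matched expansion can be sketched as follows (the helper name and toy term lists are illustrative assumptions; the same routine serves both the KDR+IFPS queries and the tf*idf control):

```python
def expand_query(base_terms, expansion_terms, n_expand):
    """Append up to n_expand new (non-duplicate) terms to a base query.

    For KDR queries the expansion terms are IFPS phrases; for the
    tf*idf control they are the top Description terms, so both
    expanded queries add the same number of terms (58 on average).
    """
    query = list(base_terms)
    seen = set(base_terms)
    for t in expansion_terms:
        if len(query) >= len(base_terms) + n_expand:
            break  # expansion budget reached
        if t not in seen:
            seen.add(t)
            query.append(t)
    return query

kdr_abstract = ["electrode", "nickel"]
ifps_phrases = ["overvoltage", "nickel", "alkaline", "storage"]
print(expand_query(kdr_abstract, ifps_phrases, n_expand=2))
# ['electrode', 'nickel', 'overvoltage', 'alkaline']
```

Duplicates between the base query and the expansion terms ("nickel" above) are skipped rather than counted against the budget.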
Table 20 shows the experimental results of KDR queries formulated from the top 10 terms from Abstract
and expanded with IFPS, compared with tf*idf queries formulated from the top 10 terms from Abstract and the
top 58 terms from Description. As shown in Table 20, KDR queries with IFPS added achieve MAP improvements
of 8.6% for Sub-class, 7.1% for Main-group, and 3.7% for Sub-group over the baseline.
Method                    Sub-Class                  Main-Group                 Sub-Group
                          Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
10 Abs + 58 Des (Tf*idf)  0.995    0.7657   0.2566   0.962    0.5890   0.4279   0.910    0.2889   0.3139
10 Abs (KDR) + IFPS       0.998    0.8312   0.2664   0.970    0.6306   0.4672   0.921    0.2997   0.3328
  (vs.)                   +0.3%    +8.6%    +3.8%    +0.8%    +7.1%    +9.2%    +1.2%    +3.7%    +6.0%
Table 20. Results of KDR queries formulated by top 10 terms from Abstract expanded with IFPS compared
with tf*idf queries formulated by top 10 terms from Abstract and top 58 terms from Description
Table 21 shows the experimental results of KDR queries formulated from the top 20 terms from Claims
and expanded with IFPS, compared with tf*idf queries formulated from the top 20 terms from Claims and the
top 58 terms from Description. As shown in Table 21, KDR queries with IFPS added achieve MAP improvements
of 5.9% for Sub-class, 5.6% for Main-group, and 3.4% for Sub-group over the baseline.
Table 22 shows the experimental results of KDR queries formulated from the combination of the top 10
terms from Abstract plus the top 20 terms from Claims, expanded with IFPS, compared with tf*idf queries
expanded with the top 58 terms from Description. As shown in Table 22, KDR queries expanded with IFPS
achieve MAP improvements of 6.7% for Sub-class, 6.2% for Main-group, and 5.6% for Sub-group over the
baseline.
Method                    Sub-Class                  Main-Group                 Sub-Group
                          Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
20 Cla + 58 Des (Tf*idf)  0.995    0.7750   0.2557   0.966    0.5987   0.4369   0.910    0.2939   0.3213
20 Cla (KDR) + IFPS       0.998    0.8209   0.2705   0.973    0.6321   0.4631   0.923    0.3041   0.3344
  (vs.)                   +0.3%    +5.9%    +5.6%    +0.7%    +5.6%    +6.0%    +1.4%    +3.4%    +4.1%
Table 21. Results of KDR queries formulated by top 20 terms from Claims expanded with IFPS compared with
tf*idf queries formulated by top 20 terms from Claims and top 58 terms from Description
Method                             Sub-Class                  Main-Group                 Sub-Group
                                   Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
10 Abs + 20 Cla + 58 Des (Tf*idf)  0.995    0.7700   0.2541   0.966    0.5960   0.4344   0.908    0.2933   0.3180
10 Abs + 20 Cla (KDR) + IFPS       0.995    0.8216   0.2721   0.973    0.6330   0.4730   0.923    0.3097   0.3410
  (vs.)                            0%       +6.7%    +7.1%    +0.7%    +6.2%    +8.9%    +1.7%    +5.6%    +7.2%
Table 22. Results of KDR queries formulated by combination of top 10 terms from Abstract plus top 20 terms
from Claims expanded with IFPS compared with that of tf*idf queries expanded with top 58 terms from
Description.
4.3.3.4 Results of Query Formulation by Combining Tf*idf and Semantic Tags (IFPS)
In order to validate the effectiveness of IFPS in query formulation for patent prior art search, we
conducted experiments comparing terms extracted from Description by tf*idf with terms extracted by IFPS. We
added to the top N terms from Abstract (10 terms) or Claims (20 terms) the same number of tf*idf terms from
Description (58 terms), and compared the result with the top N terms from Abstract or Claims combined with
IFPS.
Table 23 shows the experimental results of tf*idf queries formulated from the top 10 terms from Abstract
plus the top 58 terms from Description, compared with the same queries with IFPS added instead. As shown in
Table 23, when IFPS is added to tf*idf queries we achieve MAP improvements of 8.5% for Sub-class, 5.7% for
Main-group, and 1.7% for Sub-group compared with terms extracted by tf*idf.
Table 24 shows the experimental results of tf*idf queries formulated from the top 20 terms from Claims
plus IFPS, compared with the same queries plus the top 58 terms from Description. As shown in Table 24, when
IFPS is added to tf*idf queries we achieve MAP improvements of 6.0% for Sub-class, 5.3% for Main-group,
and 4.4% for Sub-group compared with terms extracted by tf*idf.
Table 25 shows the experimental results of tf*idf queries formulated from the combination of the top 10
terms from Abstract plus the top 20 terms from Claims, expanded with IFPS, compared with the same queries
expanded with 58 terms from Description. As shown in Table 25, when IFPS is added to tf*idf queries we
achieve MAP improvements of 6.6% for Sub-class, 5.7% for Main-group, and 4.8% for Sub-group compared
with terms extracted by tf*idf.
Method                    Sub-Class                  Main-Group                 Sub-Group
                          Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
10 Abs + 58 Des (Tf*idf)  0.995    0.7657   0.2566   0.962    0.5890   0.4279   0.910    0.2889   0.3139
10 Abs (tf*idf) + IFPS    0.998    0.8305   0.2689   0.974    0.6228   0.4623   0.919    0.2939   0.3262
  (vs.)                   +0.3%    +8.5%    +4.8%    +1.3%    +5.7%    +8.0%    +1.0%    +1.7%    +3.9%
Table 23. Results of tf*idf queries formulated by the top 10 terms from Abstract when IFPS is added.
Method                    Sub-Class                  Main-Group                 Sub-Group
                          Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
20 Cla + 58 Des (Tf*idf)  0.995    0.7750   0.2557   0.966    0.5987   0.4369   0.910    0.2939   0.3213
20 Cla (tf*idf) + IFPS    0.998    0.8216   0.2705   0.978    0.6304   0.4656   0.925    0.3067   0.3410
  (vs.)                   +0.3%    +6.0%    +5.8%    +1.2%    +5.3%    +6.6%    +1.7%    +4.4%    +6.1%
Table 24. Results of tf*idf queries formulated by top 20 terms from Claims plus IFPS compared with top 20
terms from Claims plus top 58 terms from Description
Method                             Sub-Class                  Main-Group                 Sub-Group
                                   Recall   MAP      P@5      Recall   MAP      P@5      Recall   MAP      P@5
10 Abs + 20 Cla + 58 Des (Tf*idf)  0.995    0.7700   0.2541   0.966    0.5960   0.4344   0.908    0.2933   0.3180
10 Abs + 20 Cla (tf*idf) + IFPS    0.998    0.8212   0.2689   0.978    0.6298   0.4648   0.921    0.3073   0.3434
  (vs.)                            +0.3%    +6.6%    +5.8%    +1.2%    +5.7%    +7.0%    +1.4%    +4.8%    +8.0%
Table 25. Results of tf*idf queries formulated by combination of top 10 terms from Abstract, top 20 terms from
Claims when IFPS is added.
4.5 Discussion
In this chapter, the experiments and the implications of the approach and evaluation results are discussed.
We carried out experiments on query formulation with two methods: keyword dependency relations
(KDR) and semantic tags (IFPS). Queries were extracted by taking the top N terms from each field or from
combinations of two or three fields. The results were then evaluated for three IPC code levels (Sub-class,
Main-group, and Sub-group) by comparison with tf*idf. The experimental results show that: 1) Description is
the best field for query formulation compared with Abstract or Claims; 2) query formulation by combining the
top N terms from Abstract, Claims, and Description gives better performance than query formulation from a
separate field (e.g., the top 10 terms from Abstract plus the top 20 terms from Claims plus the top 60 terms
from Description); 3) KDR gave better performance than tf*idf since KDR can identify important terms by
changing a term's weight based on the importance of its neighboring terms; 4) IFPS gave better performance
than tf*idf; and 5) the best performance was achieved when KDR is combined with IFPS. Moreover, we found
that 6) for Sub-class, the highest results were achieved by using IFPS queries alone or KDR queries extracted
from Abstract combined with IFPS; for Main-group, the highest results were achieved by KDR queries extracted
from Abstract or Claims or both, combined with IFPS terms; and for Sub-group, the highest results were
achieved by KDR queries extracted from Claims combined with IFPS terms.
Our approach points out distinct features that improve the effectiveness of prior art patent search and that
have not been explored before. Most previous research used words from a single field as a query (e.g., the
Claims field). Instead, we show that formulating queries based on keyterm dependency relations, by selecting
the top N terms from each field and combining those terms into the search query, significantly improves the
effectiveness of prior art search. We also show that combining with IFPS terms improves the results much more
significantly, and that each field plays a different role in identifying the IPC codes of a query patent: the
Abstract field in combination with IFPS is more useful for identifying the IPC Sub-class, while terms from the
Claims field in combination with IFPS are more useful for identifying the IPC Sub-group, and Abstract or
Claims in combination with IFPS have almost the same importance for identifying the IPC Main-group.
Field     Terms extracted by KDR                          Terms extracted by tf*idf
Abstract  cr ti ca material electrode active positive     charging decrease efficiency oxide nickel oxygen
          metal oxide conductive                          battery temperature yb supplement
Table 26. Example of top 10 terms extracted by KDR and tf*idf for Abstract field.
Field        Terms extracted by KDR                          Terms extracted by tf*idf
Claims       ti cr ca material ni battery metal oxide       overvoltage increases electrically conductive
             active nickel                                   material oxygen alkaline coating oxide nickel
Table 27. Example of top 10 terms extracted by KDR and tf*idf for Claims field.

Field        Terms extracted by KDR                          Terms extracted by tf*idf
Description  electrode positive nickel oxide temperature    charging nickel overvoltage storage alkaline
             ag ca material effect cr metal negative ba     batteries absorbing positive effect oxygen
             hydride alloy increasing improve absorbing     hydride efficiency battery electrode increasing
             hydrogen problems surface composed             capacity oxide powders active add hydrogen
             electrolyte element capable                    decreases temperatures hydroxide proposals
             electrochemically battery releasing time       material increased time releasing negative
             object
Table 28. Example of top 30 terms extracted by KDR and tf*idf for Description field.

Keyword dependency relation works better than tf*idf since it is based on the relations between words, in
which the importance of a word depends on the importance of its neighboring words: if a term has more
relations to important neighbors, it is assigned more weight. Tables 26~28 give examples of terms extracted by
KDR and tf*idf. As we can see from those tables, KDR selects terms about details of the method or process
(e.g., cr, ti, ca, active, material) while tf*idf selects terms about limitations or effects (e.g., charging, decrease,
efficiency). In the Description example (Table 28), the top 10 term query by KDR contains 3 abbreviations
(e.g., ag, ca, cr) while the tf*idf query does not contain any abbreviations and contains more terms about
problems (e.g., charging, overvoltage, storage). For Abstract and Claims, KDR queries perform better than
tf*idf queries, which shows that terms about details of the method or process of a patent are more important for
prior art search than terms about limitations or effects. However, for the Description field, terms about
limitations or effects are more effective: for queries of 10~30 terms (Table 28), KDR identified terms mostly
about the method or process and performed worse than tf*idf, whereas for queries of 40~60 terms (Table 29),
KDR identified more terms about limitations or effects, which resulted in higher performance than tf*idf.
Terms from the Description field are more important than those from other fields since Description contains
terms related to the area a patent belongs to, terms about limitations of previous patents and effects of the
present invention, and terms about details of the method or process.
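The neighbor-importance idea behind KDR can be illustrated with a PageRank-style iteration over a term dependency graph (a sketch only; the damping factor, the toy graph, and the exact update rule are assumptions, as the precise KDR formula is defined elsewhere in the thesis):

```python
def keyterm_weights(edges, damping=0.85, iters=50):
    """PageRank-style scoring on an undirected term dependency graph:
    a term's weight grows with the weights of the terms it is
    connected to, so terms with many important neighbors rank high."""
    nodes = sorted({t for e in edges for t in e})
    nbrs = {t: [] for t in nodes}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    w = {t: 1.0 / len(nodes) for t in nodes}  # uniform start
    for _ in range(iters):
        # each term redistributes its weight evenly to its neighbors
        w = {
            t: (1 - damping) / len(nodes)
               + damping * sum(w[u] / len(nbrs[u]) for u in nbrs[t])
            for t in nodes
        }
    return w

edges = [("electrode", "nickel"), ("electrode", "oxide"),
         ("nickel", "oxide"), ("electrode", "positive")]
w = keyterm_weights(edges)
# "electrode" has the most dependency relations, so it ranks highest
assert max(w, key=w.get) == "electrode"
```

Sorting terms by these weights and taking the top N gives a dependency-based query, in contrast to the frequency-based tf*idf ranking.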
Field        Terms extracted by KDR                          Terms extracted by tf*idf
Description  active add storage high oxidation generated    similarly additives sized elements average
             cadmium efficiency hydroxide demand            cadmium competitively merits radiating
             reaction alkaline batteries rising reduction   proportioned conspicuous dispersibility
             temperatures capacity caused solution          increasing efficiently particle metal composed
             charging solid elements reducing contained     agglomeration agglomerate rising explanation
             place great cost sealing energy                apt industrialized caused electrochemically
             decomposition                                  sharp ca alloy cr beryllium
Table 29. Example of top 40~60 terms extracted by KDR and tf*idf for Description field.

Based on the analysis of the experiments, we found that KDR includes many abbreviations, which can
reduce its performance. This problem could be addressed by using a dictionary such as Wikipedia or WordNet;
however, words in patents are mostly very technical and may not exist in those dictionaries. Therefore,
constructing a patent dictionary is one way to resolve this problem and achieve further improvement.
Extracting terms from Description gave the best performance over all other fields (i.e., Abstract and
Claims). The reason is that the Description field contains a specification of what the process or method of the
invention is and how it differs from previous patents and technology. Also, Description starts with general
background information about the area the invention belongs to and then gives increasing levels of detail about
the invention. However, identifying those terms based on frequency alone is difficult. By identifying IFPS terms
from Description, we can achieve better performance when IFPS is used as a query by itself, and the best results
come from using it in combination with query selection by KDR. Our analysis shows that Invention Field (IF)
terms include information related to the area a patent belongs to, which can be very helpful for identifying the
IPC Sub-classes of a patent document. Since the frequency of terms that describe an invention's domain is
relatively low, IF phrases cannot be extracted by frequency-based methods. Also, Problem/Solution (PS) phrases
include information related to the limitations of previous patents and the effects of the present invention, which
may help identify the IPC Main-groups.
Our experiments also show that, when combined with IFPS, terms from Abstract are more useful for
identifying the IPC Sub-classes of a query patent; terms from Abstract or Claims are both useful for identifying
the IPC Main-groups; while terms from Claims are more useful for identifying the IPC Sub-groups of a query
patent.
Through a number of experiments performed in this work, we show that extracting terms based on dependency relations is a good way to change term weights by assigning higher weights to more important terms. We also show how IFPS terms can contribute to the effectiveness of query formulation for prior art search, especially when terms extracted by keyterm dependency relations and IFPS terms are combined into a single query.
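One plausible reading of the reweighting idea above is to boost a term's base weight in proportion to how many dependency relations it participates in with other candidate terms. The sketch below illustrates this; the function name, the degree-based boost, and its linear form are illustrative assumptions, not the thesis's exact KDR scoring formula.

```python
from collections import defaultdict

def kdr_reweight(base_weights, dependency_pairs, boost=0.5):
    """Raise the weight of terms that take part in many dependency
    relations with other candidate terms.

    base_weights: {term: frequency-based weight, e.g. tf*idf}
    dependency_pairs: iterable of (head, dependent) term pairs taken
    from dependency-parsed sentences of the patent text."""
    degree = defaultdict(int)
    for head, dep in dependency_pairs:
        # only count relations between two candidate terms
        if head in base_weights and dep in base_weights:
            degree[head] += 1
            degree[dep] += 1
    return {t: w * (1 + boost * degree[t]) for t, w in base_weights.items()}
```

Terms that are dependency hubs thus rise in the ranking even when their raw frequency is modest, which is the behavior the discussion above attributes to KDR-based term selection.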
4.6 Conclusions and Future Work
This thesis proposed a new method for query enhancement in patent prior art search, based on keyterm dependency relations and semantic tags, that outperforms the tf-idf baseline. The experiments demonstrated significant improvements in query formulation when the top N terms are extracted from each field and combined into one query, rather than using terms from a single field. We show that the query formulated from a combination of three fields (the top 10 terms from the Abstract, the top 20 terms from the Claims, and the top 60 terms from the Description) gives the best result. Our work also shows the improvement in query formulation by IFPS terms compared with the same number of terms extracted by tf*idf from the Description field. IFPS terms outperform tf*idf terms because IFPS includes Invention Field (IF) information related to the areas a patent belongs to, which can be very helpful for identifying the IPC sub-classes of a patent document, and Problems/Solutions (PS) information related to the limitations of previous patents and the effects of the present invention, which may help to identify the IPC main-groups or sub-groups of the query patent. We also show the effectiveness of IFPS terms when IFPS is combined with KDR or tf*idf terms; adding IFPS yields a much larger improvement, which suggests a good strategy for query expansion.
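The best-performing combination described above can be sketched as a simple merge of pre-ranked per-field term lists, optionally extended with IFPS terms. The function name and the dict-based interface are illustrative assumptions; only the field sizes (10/20/60) come from the results reported here.

```python
def build_combined_query(field_terms, sizes=None, extra_terms=()):
    """Merge the top-N pre-ranked terms of each field, plus optional
    extra terms (e.g. IFPS terms), into one duplicate-free bag-of-words
    query. Default sizes follow the best-performing setting reported
    in this work: 10 Abstract, 20 Claims, 60 Description terms."""
    if sizes is None:
        sizes = {"abstract": 10, "claims": 20, "description": 60}
    query, seen = [], set()
    for field, n in sizes.items():
        for term in field_terms.get(field, [])[:n]:
            if term not in seen:   # keep first occurrence, drop duplicates
                seen.add(term)
                query.append(term)
    for term in extra_terms:       # e.g. IFPS terms appended to the query
        if term not in seen:
            seen.add(term)
            query.append(term)
    return query
```

Passing the IFPS term list through `extra_terms` corresponds to the Abstract + Claims + IFPS combination that the summary below reports as the overall best query.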
Our experiments show that terms about the details of the method or process of the invention (e.g., ag, ca, cr) are more significant for query formulation from the Abstract or Claims, while terms about limitations or effects (e.g., charging, decrease, efficiency) are more significant for query formulation from the Description. In the example shown earlier, the top-10-term query produced by KDR contains three abbreviations (ag, ca, cr), while the tf*idf query contains no abbreviations and more terms about problems (e.g., charging, overvoltage, storage).
Our experiments suggest a way to improve the identification of IPC codes: identify terms from a particular field instead of using various fields or the whole document. For example, someone who wants to know only the sub-classes of a patent can focus on query terms from the Abstract and IFPS, or on terms from the Claims and IFPS for the main-groups or sub-groups.
The methods proposed in this work are applied to patent documents related to the batteries domain; however, they can be applied to other domains as well. As future work, we intend to apply our approach to a larger corpus covering various domains. We will also consider how to use dependency relations between terms to identify phrases in patent documents instead of using single words. In particular, dependency relations between IFPS terms are expected to yield further improvement and should therefore be considered. A patent term dictionary and a synonym dictionary should also be developed for better term-matching accuracy. Furthermore, ways to improve the original keyterm dependency relation method should be analyzed.
References
[1] M. Iwayama, A. Fujii, N. Kando, and A. Takano (2002). “Overview of patent retrieval task at NTCIR-3”. In Proceedings of NTCIR Workshop, 2002.
[2] A. Fujii, M. Iwayama, and N. Kando (2004). “Overview of Patent Retrieval Task at NTCIR-4”. In Proceedings of NTCIR-4 Workshop, 2004.
[3] Youngho Kim, et al (2009). “Automatic Discovery of Technology Trends from Pa-
tent”. Proceedings of the 2009 ACM symposium on Applied Computing, pp. 1480-
1487, 2009.
[4] Atsushi Fujii, Tetsuya Ishikawa (2004). “Document Structure Analysis in Associative Patent Retrieval”. NTCIR-4 Workshop, 2004.
[5] Hisao Mase, et al. (2004). “Two-Stage Patent Retrieval Method Considering Claim
Structure”. NTCIR-4 Workshop, 2004.
[6] Sumio Fujita (2004). “Revisiting Document Length Hypotheses: NTCIR-4 CLIR
and Patent Experiments at Patolis”. NTCIR-4 Workshop, 2004.
[7] Hironori Takeuchi, et al. (2004). “Experiments on Patent Retrieval at NTCIR-4
Workshop”. NTCIR-4 Workshop, 2004.
[8] Atsushi Fujii (2007). “Integrating Content and Citation Information for the NTCIR-
6 Patent Retrieval Task”. NTCIR-6 Workshop, 2007.
[9] Jungi Kim, et al. (2007). “POSTECH at NTCIR-6 English Patent Retrieval Sub-
task”. NTCIR-6 Workshop, 2007.
[10] Kazuya Konishi, Akira Kitauchi and Toru Takaki (2004). “Invalidity Patent Search System of NTT DATA”. NTCIR-4 Workshop, 2004.
[11] Hisao Mase, Makoto Iwayama (2007). “NTCIR-6 Patent Retrieval Experiments at Hitachi”, NTCIR-6 Workshop, 2007.
[12] Hidetsugu Nanba (2007). “Query Expansion using an Automatically Constructed
Thesaurus”. NTCIR-6 Workshop, 2007.
[13] Hiroki Tanioka, Kenichi Yamamoto (2007). “A Passage Retrieval System using
Query Expansion and Emphasis”, NTCIR-6 Workshop, 2007.
[14] Kazuya K. (2005). “Query Term Extraction from patent documents for invalidity
search”. Proceedings of NTCIR-5 Workshop Meeting, December 6-9, 2005, Tokyo, Ja-
pan.
[15] Jarvelin, A. and Preben, H. (2009). “UTA and SICS at CLEF-IP”. 1st CLEF-IP,
Corfu, Greece, 2009.
[16] Lopez, P. and Romary, L. (2009). “Multiple Retrieval Models and Regression
Models for Prior Art Search”. In: 1st CLEF-IP, Corfu, Greece, 2009.
[17] G. Roda, J. Tait, F. Piroi, and V. Zenz (2009). “CLEF-IP 2009: Retrieval experi-
ments in the Intellectual Property domain”, CLEF-IP 2009.
[18] Susan V. and Eva D. (2010). “Prior Art retrieval using the claims section as a bag
of words”. CLEF-IP 2010.
[19] Toucedo, J.C. and Losada, D.E. (2009). “University of Santiago de Compostela at
CLEF-IP09”. 1st CLEF-IP, Corfu, Greece, 2009.
[20] Xiaobing X. and W. Bruce C. (2009). “Transforming Patents into Prior Art Queries”. SIGIR’09.
[21] Metti Z. et al. (2010). “Prior art retrieval using various patent document fields
contents”. CLEF-IP 2010.
[22] Mai, F.-D., Hwang, F., Chien, K.-m., Wang, Y.-M., & Chen, C.-y. (2002). “Patent
map and analysis of carbon nanotube”. Science and Technology Information Center,
National Science Council, ROC.
[23] Young Gil K., et al (2008). “Visualization of patent analysis for emerging tech-
nology”. Expert Systems with Applications: An International Journal archive Volume
34 Issue 3, April, 2008.
[24] Brian Lent, et al. (1997). “Discovering trends in text databases”. In Proc. 3rd Int.
Conf. Knowledge Discovery and Data Mining, KDD, pp. 227-230.
[25] The Lemur Toolkit. http://www.lemurproject.org.
[26] Takaki, et al. (2004). “Associative Document Retrieval by Query Subtopic Analy-
sis and its Application to Invalidity Patent Search”. In: Proceedings of CIKM 2004.
[27] Mase, H., et al. (2005). “Proposal of Two Stage Patent Retrieval Method Consi-
dering the Claim Structure”. ACM Transactions on Asian Language Information
Processing 4, 2005.
[28] Archibugi, D., & Pianta, M. (1996). “Measuring technological change through pa-
tents and innovation survey”. Technovation, 16(9), 451–468.
[29] Be’de’carrax, C., & Huot, C. (1994). A new methodology for systematic exploita-
tion of technology databases. Information Processing & Management, 30(3), 407–418.
[30] Tseng, Y., Lin, C., & Lin, Y. (2007). “Text mining techniques for patent
analysis”. Information Processing and Management, 43(5), 1216–1247.
[31] Y.R. Li, L.H. Wang and C.F. Hong. (2009). “Extracting the significant-rare keywords for patent analysis”. Expert Systems with Applications 36 (2009), pp. 5200–5204.
[32] Tiwana, S., & Horowitz, E. (2009). “Extracting Problem Solved Concepts from Patent Documents”. Proceedings of the 2nd ACM workshop on Patent Information Retrieval, PaIR 2009, November 6, 2009, Hong Kong, China, 43-48.
[33] O. Babina. “Nlp-based patent information retrieval”.
http://fccl.ksu.ru/issue8/babinaNLPpatentIR.pdf.
[34] K. V. Indukuri, A. A. Ambekar, and A. Sureka. (2007). “Similarity analysis of pa-
tent claims using natural language processing techniques”. In ICCIMA ’07: Proc of the
Int’l Conf on Computational Intelligence and Multimedia Applications (ICCIMA
2007), Washington, DC, USA, 2007. IEEE CS.
[35] S. Sheremetyeva. (2003). “Natural language analysis of patent claims”. In Proc of
the ACL-2003 Workshop on Patent Corpus Processing, Morristown, NJ, USA, 2003.
ACL.
[36] A. Shinmori, M. Okumura, Y. Marukawa, and M. Iwayama. (2003). “Patent claim
processing for readability: structure analysis and term explanation”. In Proc of the
ACL-2003 Workshop on Patent Corpus Processing, pages 56–65, Morristown, NJ,
USA, 2003. ACL.
[37] S.-Y. Yang and V.-W. Soo. (2008). “Comparing the conceptual graphs extracted
from patent claims”. In SUTC ’08: Proc of the 2008 IEEE Int’l Conf on Sensor Net-
works, Ubiquitous, and Trustworthy Computing (SUTC 2008), Washington, DC, USA,
2008. IEEE CS.
[38] C. Yang, Hong Peng, J. Wang (2008). “A new Feature Extraction Approach
Based on Sentence Element Analysis”. In Computational Intelligence and Security,
CIS’08.
[39] V. Nastase, J.S. Shirabad, M. F. Caropreso (2007). “Using Dependency Relations for Text Classification”. University of Ottawa SITE Technical Report TR-2007-12.
[40] W. Zheng, et al., “Topic Tracking Based on Keywords Dependency Profile”. AIRS 2008.
[41] Renxu Sun, Chai-huat Ong, Tat-seng Chua (2006). “Mining Dependency Relations for Query Expansion in Passage Retrieval”. In SIGIR ’06: Proceedings of the
29th annual international ACM SIGIR conference on Research and development in in-
formation retrieval.
[42] H. Cui, R. Sun, K. Li, M.-Y. Kan and T.-S. Chua (2005). “Question Answering Passage Retrieval Using Dependency Relations”. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, Salvador, Brazil, Aug 15-19, pp. 400-407.
[43] Schonhofen, P. and Benczur, A.A. “Feature selection based on word-sentence relation”.
[44] The USPTO database. http://www.uspto.gov/
[45] Jae-Ho Kim, et al. “Patent document categorization based on semantic structural
information”, Information Processing and Management (2007).
[46] Xiaobing Xue and W. Bruce Croft (2009). “Automatic Query Generation for Patent Search”. In Proceedings of the 18th ACM conference on Information and Knowledge Management, CIKM ’09.
[47] Lupu, M.; Mayer, K.; Tait, J.; Trippe, A.J. (2011). “Current Challenges in Patent
Information Retrieval”. The Information Retrieval Series, Vol. 29, 2011.
[48] David Hunt, Long Nguyen, Matthew Rodgers (2007). “Patent searching: tools & techniques”.
[49] OpenNLP POS tagger: http://opennlp.sourceforge.net/
[50] trec_eval program at the TREC website: trec.nist.gov/trec_eval
Summary
Query Enhancement for Patent Prior Art Search with Keyterm Dependency Relations
and Semantic Tags
This thesis proposed a new method for query enhancement in patent prior art search, using keyterm dependency relations and semantic tags, that outperforms the tf-idf baseline. The experiments in this work show that, for query formulation from a single field, the Description is the most significant field for improving the ranking of retrieved prior art patents, with 60 terms as the appropriate query size. It is also shown that, for query formulation from combined fields, the query formulated from a combination of three fields (the top 10 terms from the Abstract, the top 20 terms from the Claims, and the top 60 terms from the Description) gives the best result. Moreover, our work shows that the best query is achieved by combining the top 10 terms from the Abstract, the top 20 terms from the Claims, and IFPS. Our proposed method also shows that query formulation using IFPS by itself can still significantly improve results over the baseline.
The methods proposed in this work are applied to patent documents related to the batteries domain; however, they can be applied to other domains as well.
Keywords: patent retrieval, prior art retrieval, keyterm dependency relations, semantic tags, term co-occurrences