Getting Started Japanese Search and Calculate Similarity with Apache Lucene

May 2016 Eiji Shinohara

Who am I?

• Name: Eiji Shinohara / 篠原英治 / @shinodogg
• Role: AWS Solutions Architect, Subject Matter Expert
  • Amazon CloudSearch
  • Amazon Elasticsearch Service

Which Search Engine/Service do you use?
• Apache Solr
• Elasticsearch
• Amazon CloudSearch
• Amazon Elasticsearch Service

On top of Apache Lucene
• Apache Solr
• Elasticsearch
• Amazon CloudSearch
• Amazon Elasticsearch Service

Have you used Apache Lucene?

• Apache Lucene is a free and open-source information retrieval software library, originally written in Java by Doug Cutting.
• It is supported by the Apache Software Foundation and is released under the Apache Software License.

https://en.wikipedia.org/wiki/Lucene

Doug Cutting – Hadoop/Nutch/Lucene
• Hadoop: MapReduce
  • The name my kid gave a stuffed yellow elephant.
• Nutch: Crawler
  • Nutch was a word my oldest son used when he was two; I think it came from "lunch".
• Lucene: Search
  • Lucene is Doug Cutting's wife's middle name, and her maternal grandmother's first name.

http://www.mwsoft.jp/programming/hadoop/where_come_from.html


Maybe the most proper naming :)

Apache Lucene
• Full-text search
• Easy to use

1. Index
  • new Document → addDocument → commit
2. Query
  • Generate a query string
3. Search
  • Search and fetch the matching documents
4. Display
  • Get contents from the fetched documents to show

http://www.lucenetutorial.com/lucene-in-5-minutes.html
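The four steps above can be sketched without any Lucene dependency. The following is a toy in-memory inverted index in plain Java, not Lucene's API: the class name `TinyIndex`, its methods, and the naive tokenization are all assumptions made for illustration. Lucene's real counterparts (analyzers, `IndexWriter`, query parsing, scoring) do far more than this.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy index; it only illustrates index -> query -> search -> display.
final class TinyIndex {
    private final List<String> docs = new ArrayList<>();                     // stored documents
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();  // term -> doc ids

    // 1. Index: store the document and record which terms it contains.
    int addDocument(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // 2 and 3. Query and search: return ids of documents containing every query term.
    List<Integer> search(String queryString) {
        TreeSet<Integer> result = null;
        for (String term : tokenize(queryString)) {
            TreeSet<Integer> ids = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        if (result == null) {
            return new ArrayList<>();  // empty query matches nothing
        }
        return new ArrayList<>(result);
    }

    // 4. Display: fetch the stored content of a hit.
    String doc(int docId) {
        return docs.get(docId);
    }

    // Assumption: lowercase and split on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addDocument("Lucene in Action");
        index.addDocument("Lucene for Dummies");
        index.addDocument("Managing Gigabytes");
        for (int docId : index.search("lucene action")) {
            System.out.println(docId + ": " + index.doc(docId));
        }
    }
}
```

Splitting on non-letters works for space-delimited languages only; Japanese text has no spaces between words, which is why the later slides use `JapaneseAnalyzer` instead.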

Evernote and LinkedIn are using Lucene
• with their own thin HTTP wrappers
• Presentations at Lucene Solr Revolution 2014

https://www.youtube.com/watch?v=drOmahIie6c
https://www.youtube.com/watch?v=8O7cF75intk

Build your own Search engine?
• Some companies are doing that

http://www.slideshare.net/lucidworks/galene-linkedins-search-architecture-presented-by-diego-buthay-sriram-sankar-linkedin/8

I'll join Lucene Solr Revolution 2016

Apache Lucene入門 (Introduction to Apache Lucene), in Japanese

http://rondhuit.com/lucene-for-bea-060710.pdf
http://www.amazon.co.jp/dp/4774127809

Lucene in Action

https://www.amazon.com/dp/1933988177

Uchida-san's Blog in Japanese

http://mocobeta-backup.tumblr.com/post/54371099587/lucene-in-action

Uchida-san: Search Consultant at Rondhuit

Lucene in Action chap5: Term Vector (2) Calculate Document Similarity

http://mocobeta-backup.tumblr.com/post/49779999073/

Lucene in Action chap5: Term Vector (2) Calculate Document Similarity
• Just tried to run it on my local MacBook Air :)
• Created 2 classes
  • Indexer
    • Indexing some documents
  • CalculationSimilarityTester
    • Comparing 2 documents
    • Calculate cosine similarity
• Using Luke for browsing the index
  • https://github.com/DmitryKey/luke
  • Uchida-san is also a Luke committer

Lucene 6.0
• I had a Lucene 5.5 environment, but...
  • Invalid directory at the location, check console for more information. Last exception:
  • java.lang.IllegalArgumentException: Could not load codec 'Lucene60'. Did you forget to add lucene-backward-codecs.jar?

Lucene 6.0
• So I created a new Maven project
• pom.xml

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-common</artifactId>
  <version>6.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers-kuromoji</artifactId>
  <version>6.0.0</version>
</dependency>

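Indexer

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public class Indexer {

    public static void main(String[] args) throws IOException {
        // JapaneseAnalyzer (Kuromoji) tokenizes Japanese text
        Analyzer analyzer = new JapaneseAnalyzer();
        // ... (omitted) ...

        File[] files = new File("/Users/xxx/lucene_test/docs/").listFiles();
        for (File file : files) {
            Document doc = new Document();
            // ... (omitted) ...
            FieldType contentsType = new FieldType();
            contentsType.setStored(true);
            contentsType.setTokenized(true);
            contentsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            // term vectors are needed later to compute TF-IDF per document
            contentsType.setStoreTermVectors(true);
            // ... (omitted) ...
            doc.add(new Field("contents", sb.toString(), contentsType));
            writer.addDocument(doc);
        }
        writer.commit();
        writer.close();
    }
}

• Read file -> add Document -> commit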

Indexer
• Files
  • Found examples on the internet :)
  • http://www.pahoo.org/e-soul/webtech/php06/php06-21-01.shtm

PHP: Hypertext Preprocessor（ピー・エイチ・ピーハイパーテキストプリプロセッサー）とは、動的に HTML データを生成することによって、動的なウェブページを実現することを主な目的としたプログラミング言語、およびその言語処理系である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語として分類される。この言語処理系自体は、C言語で記述されている。

PHP(Hypertext Preprocessor；ピー・エイチ・ピー）とは、動的に HTML データを生成することによって、動的なウェブページを実現すること目的としたプログラミング言語である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語の一種で、処理系自体は C言語で記述されている。

Indexer
• Files
  • Found examples on the internet :)
  • http://www.fisproject.jp/2015/01/cosine_similarity/
  • Exactly same

A Cat sat on the mat.

Cats are sitting on the mat.

人口から無作為に選択されて、人口に関する仮説を試験するために使用される項目となっております。


Indexer
• Run

Luke
• Index Browsing

$ mvn package
$ ./luke.sh

Calculate Document Similarity
• mocobeta/CalcCosineSimilarityTest.java
  • https://gist.github.com/mocobeta/5525864
  • Search documents from the index
  • TF-IDF from the Term Vector
• TF-IDF
  • How important a word is to a document in a collection or corpus
  • TF: how frequently a term occurs in a document
  • IDF: a measure of the rareness of a term
• Get the cosine similarity
  • The code returns the angle (arccos of the cosine), so lower is more similar
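As a rough illustration of the TF-IDF weighting described above, here is a minimal plain-Java sketch. The exact formula used in the referenced gist is omitted from the slides, so this assumes one common variant: TF = (term count in the document) / (total terms in the document) and IDF = ln(number of documents / document frequency). The class `TfIdfSketch`, its method names, and the toy corpus are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of TF-IDF; not the formula from the referenced gist.
final class TfIdfSketch {

    // Counts, for each term, how many documents contain it at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> corpus) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        return docFreq;
    }

    // Builds a term -> TF-IDF map for one tokenized document.
    // Assumes docFreq has an entry for every term in docTerms.
    static Map<String, Double> tfIdf(List<String> docTerms,
                                     Map<String, Integer> docFreq,
                                     int numDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : docTerms) {
            counts.merge(term, 1, Integer::sum);
        }
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double tf = (double) e.getValue() / docTerms.size();
            double idf = Math.log((double) numDocs / docFreq.get(e.getKey()));
            vector.put(e.getKey(), tf * idf);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Pre-tokenized toy corpus (made-up data)
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("cat", "sat", "mat"),
                Arrays.asList("cat", "sitting", "mat"),
                Arrays.asList("php", "language"));
        Map<String, Integer> docFreq = documentFrequencies(corpus);
        Map<String, Double> vector = tfIdf(corpus.get(0), docFreq, corpus.size());
        for (Map.Entry<String, Double> e : vector.entrySet()) {
            System.out.printf("%s -> %.4f%n", e.getKey(), e.getValue());
        }
    }
}
```

In the first toy document, "sat" appears in only one document and therefore gets a higher weight than "cat" or "mat", which appear in two.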

Calculate Document Similarity

public class CalcCosineSimilarityTester {

    public static void main(String[] args) throws IOException {
        // ... (omitted) ...
        // look up each document by its "path" field and build its TF-IDF vector
        TopDocs hits = searcher.search(new TermQuery(new Term("path", path1)), 1);
        int docId1 = hits.scoreDocs[0].doc;
        Map<String, Double> map1 = buildDocumentVector(docId1);

        hits = searcher.search(new TermQuery(new Term("path", path2)), 1);
        int docId2 = hits.scoreDocs[0].doc;
        Map<String, Double> map2 = buildDocumentVector(docId2);

        System.out.println(computeAngle(map1, map2));
    }

    // create HashMap (key: keyword, value: TF-IDF) for each document
    private static Map<String, Double> buildDocumentVector(int docId) throws IOException {
        // ... (omitted) ...
    }

    // calculate the angle between the vectors (arccos of the cosine similarity)
    private static double computeAngle(Map<String, Double> vec1, Map<String, Double> vec2) {
        // ... (omitted) ...
    }
}

Calculate Document Similarity

private static Map<String, Double> buildDocumentVector(int docId) throws IOException {
    Terms vector = reader.getTermVector(docId, "contents");
    // ... (omitted) ...
    // get the TF-IDF inputs from the term vector
    TermsEnum itr = vector.iterator();
    // ... (omitted) ...
    while ((ref = itr.next()) != null) {
        String term = ref.utf8ToString();
        TermFreq freq = new TermFreq(term, maxDoc);
        freq.setTc(itr.totalTermFreq());  // term count within this document
        freq.setDf(reader.docFreq(new Term("contents", term)));  // documents containing the term
        list.add(freq);
        tcSum += itr.totalTermFreq();
    }
    // build HashMap (key: keyword, value: TF-IDF)
    Map<String, Double> docVector = new HashMap<String, Double>();
    for (TermFreq freq : list) {
        // ... (omitted) ...
    }
    return docVector;
}

Calculate Document Similarity

private static double computeAngle(Map<String, Double> vec1, Map<String, Double> vec2) {
    double dotProduct = 0;  // inner product over the terms both vectors share
    for (String term : vec1.keySet()) {
        if (vec2.containsKey(term)) {
            dotProduct += vec1.get(term) * vec2.get(term);
        }
    }

    double denominator = getNorm(vec1) * getNorm(vec2);
    double ratio = dotProduct / denominator;  // cosine value

    // clamp so that floating-point rounding cannot push the cosine above 1 (acos would return NaN)
    return Math.acos(Math.min(1.0, ratio));
}

private static double getNorm(Map<String, Double> vec) {
    double sumOfSquares = 0;
    for (Double val : vec.values()) {
        sumOfSquares += val * val;
    }
    return Math.sqrt(sumOfSquares);
}

Calculate Document Similarity
• Result
  • 0.5000430658877127

PHP: Hypertext Preprocessor（ピー・エイチ・ピーハイパーテキストプリプロセッサー）とは、動的に HTML データを生成することによって、動的なウェブページを実現することを主な目的としたプログラミング言語、およびその言語処理系である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語として分類される。この言語処理系自体は、C言語で記述されている。

PHP(Hypertext Preprocessor；ピー・エイチ・ピー）とは、動的に HTML データを生成することによって、動的なウェブページを実現すること目的としたプログラミング言語である。PHP は、HTML 埋め込み型のサーバサイド・スクリプト言語の一種で、処理系自体は C言語で記述されている。


A Cat sat on the mat.

Cats are sitting on the mat.
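To make the meaning of the printed value more concrete, here is a small self-contained sketch of the relationship between cosine similarity and the angle that `computeAngle` returns. The class `AngleSketch` and the toy vectors are hypothetical, and the weights are made up rather than taken from the index used in the talk.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: cosine similarity vs. angle for sparse term vectors.
final class AngleSketch {

    // Cosine of the angle between two sparse vectors (term -> weight).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    static double norm(Map<String, Double> v) {
        double sumOfSquares = 0;
        for (double val : v.values()) {
            sumOfSquares += val * val;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Angle in radians; the cosine is clamped to [-1, 1] to guard against rounding.
    static double angle(Map<String, Double> a, Map<String, Double> b) {
        return Math.acos(Math.max(-1.0, Math.min(1.0, cosine(a, b))));
    }

    public static void main(String[] args) {
        Map<String, Double> x = new HashMap<>();
        x.put("lucene", 1.0);
        Map<String, Double> sameDirection = new HashMap<>();
        sameDirection.put("lucene", 2.0);
        Map<String, Double> noSharedTerms = new HashMap<>();
        noSharedTerms.put("php", 3.0);

        System.out.println("same direction:  " + angle(x, sameDirection));
        System.out.println("no shared terms: " + angle(x, noSharedTerms));
    }
}
```

With non-negative weights the angle ranges from 0 (same direction) to π/2, about 1.57 (no shared terms). The angle of about 0.50 radians reported above therefore corresponds to a cosine similarity of roughly 0.88.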

Lucene 6.0
• Bunch of changes...

Lucene 6.0
• N-best
  • LUCENE-6837: Add N-best output capability to JapaneseTokenizer

N-best
• Contribution from Yahoo! Japan

http://www.slideshare.net/techblogyahoo/17lucenesolr-solrjp-apache-lucene-solrnbest


Nihongo Muzukashii-ne… (Japanese is difficult, isn't it…)
• Need to analyze more or maintain dictionaries??

Nihongo Muzukashii-ne… (Japanese is difficult, isn't it…)
• Doesn't hit with "一眼レフ" (single-lens reflex)?

http://blog.yoslab.com/entry/2014/09/12/005207

N-best
• Seems cool :)
• I'm going to try…

