XECon + PHPFest 2014 presentation: Building Integrated Search with ElasticSearch, by Hoonmin Kim

Building Integrated Search with ElasticSearch

@hoonmin Hoonmin Kim

NAVER LABS

2014.11.08 Session 3-2 XECon + PHPFest 2014

Who am I?

• Hoonmin Kim
• http://github.com/hoonmin
• I work at NAVER LABS.
• Java developer and performance engineer.
• Committer on Arcus, an open-source cache solution.
• http://github.com/naver/arcus
• Married for three years, and a doting father to my daughter.

What is this talk about?

Source: http://clien.net

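What if I want to add integrated search to the XE site I run?

A Little Search Theory (?)

How would you find every work by Tolkien that contains the word "ring"?

grep "ring" *

• Run grep over every work.
• Print the works that contain the word.
• But what if you had to repeat the same scan over millions of documents for every query? Oops…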

Inverted index

• Extract the words from the text of every work and store them.
• For each word, record the ids of the works that contain it.

Word           Works containing it   Work ids
Ring           5                     1, 2, 3, 4, 5
Middle-Earth   4                     1, 3, 4, 5
Gollum         4                     2, 3, 4, 5
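The difference between the grep approach and an inverted index can be sketched in a few lines of Python. This is only an illustration: the sample texts, the tokenizer, and the function names are assumptions made up for this sketch, and Lucene's real index is far more elaborate.

```python
import re
from collections import defaultdict

# Hypothetical sample data (work id -> text), invented for illustration.
WORKS = {
    1: "The ring was forged in Middle-Earth",
    2: "Gollum lost the ring",
    3: "Gollum followed the ring across Middle-Earth",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (hyphens are kept)."""
    return re.findall(r"[a-z0-9-]+", text.lower())


def grep_scan(works, word):
    """The grep approach: re-read every work on every query."""
    return sorted(i for i, text in works.items() if word in tokenize(text))


def build_inverted_index(works):
    """Read every work once and map each word to the ids of works containing it."""
    postings = defaultdict(set)
    for work_id, text in works.items():
        for word in tokenize(text):
            postings[word].add(work_id)
    return {word: sorted(ids) for word, ids in postings.items()}


index = build_inverted_index(WORKS)
# A query is now a dictionary lookup instead of a scan over all texts.
print(index["ring"])
print(index["middle-earth"])
```

Both approaches return the same ids; the index just pays the scanning cost once, at indexing time, instead of on every query.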

Apache Lucene

http://lucene.apache.org/core/

Source: http://horicky.blogspot.kr/2013/02/text-processing-part-2-inverted-index.html

The two leading Lucene-based search engines

* A detailed comparison of Solr and ElasticSearch: http://db-engines.com/en/system/Elasticsearch%3BSolr%3BSphinx

Google Trends

[Chart: search interest over time for Lucene, Solr, and ElasticSearch]

ElasticSearch

http://elasticsearch.org

Features

• Lucene
• Schema-Free (JSON)
• Distributed
• Multi-tenancy
• RESTful APIs

Terminology

ElasticSearch   DBMS
Index           Database
Document Type   Table
Document        Row
Field           Column

JSON Document

{
  "_id": "1",
  "name": "Hoonmin Kim",
  "birth_year": 1981,
  "tags": ["Naver", "Arcus", "DevOps"],
  "location": {
    "city": "용인시"
  }
}

API: Creating an Index

curl -XPUT 'localhost:9200/resume' -d '
{
  "settings": {
    …
  },
  "mappings": {
    …
  }
}'

• resume: index name
• Request body: index settings (optional)

API: Adding a Document

curl -XPUT 'localhost:9200/resume/person/1' -d '
{
  "name": "Hoonmin Kim",
  "tags": ["Naver", "Arcus"]
}'

201 (CREATED) - newly created
200 (OK) - updated (reindexed)

• person: document type
• 1: _id

API: Retrieving a Document

curl -XGET 'localhost:9200/resume/person/1'

{
  "name": "Hoonmin Kim",
  "tags": ["Naver", "Arcus"]
}

• resume: index name
• person: document type
• 1: document id

API: Deleting a Document

curl -XDELETE 'localhost:9200/resume/person/1'
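The add, retrieve, and delete calls all follow the same /index/type/id URL pattern, which can be sketched from a script using only the Python standard library. The helper names (`build_request`, `index_document`, and so on) are made up for this sketch, the address is assumed to be the localhost:9200 used on the slides, and the Content-Type header is an addition that the curl examples do not send. `send` needs a running node, so the example only builds a request and inspects it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumed address, as on the slides


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the REST API."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"  # not in the slides' curl calls
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def index_document(index, doc_type, doc_id, document):
    """PUT /index/type/id: create the document or reindex an existing one."""
    return build_request("PUT", f"/{index}/{doc_type}/{doc_id}", document)


def get_document(index, doc_type, doc_id):
    """GET /index/type/id: fetch one document."""
    return build_request("GET", f"/{index}/{doc_type}/{doc_id}")


def delete_document(index, doc_type, doc_id):
    """DELETE /index/type/id: remove one document."""
    return build_request("DELETE", f"/{index}/{doc_type}/{doc_id}")


def send(request):
    """Send a request and decode the JSON reply (requires a running node)."""
    with urllib.request.urlopen(request) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


req = index_document("resume", "person", 1, {"name": "Hoonmin Kim", "tags": ["Naver", "Arcus"]})
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the URL scheme easy to check without a server.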

API: Multi Get

curl 'localhost:9200/resume/_mget' -d '{
  "docs": [
    {
      "_type": "person",
      "_id": "1"
    },
    …
  ]
}'

API: Search

curl -XGET 'localhost:9200/resume/_search?q=naver'

{ …
  "hits": {
    "total": 1, "max_score": 0.15342641,
    "hits": [
      { … }
    ]
  }
}

API: Search Query DSL

curl -XPOST 'localhost:9200/resume/_search' -d '
{
  "query": {
    "bool": {
      "must": [
        { "match": {"name": "kim"} }
      ]
    }
  }
}'
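Because a Query DSL request is just nested JSON, it is often easier to assemble the body in code than to write it by hand. A minimal sketch follows; the `match` and `bool_query` helpers are hypothetical names for this example, not part of any client library.

```python
import json


def match(field, text):
    """A `match` clause: full-text match of `text` against one field."""
    return {"match": {field: text}}


def bool_query(must=(), should=(), must_not=()):
    """Combine clauses into a `bool` query, leaving out empty sections."""
    sections = {"must": list(must), "should": list(should), "must_not": list(must_not)}
    return {"query": {"bool": {name: clauses for name, clauses in sections.items() if clauses}}}


# Rebuilds the request body shown on the slide.
body = bool_query(must=[match("name", "kim")])
print(json.dumps(body, indent=2))
```

The resulting JSON string can be passed to curl with -d or sent from any HTTP client.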

Distributed Cluster

[Diagram sequence]
• A single node, Node 0, acts as the master.
• Creating an index places Primary Shard 0 and Primary Shard 1 on Node 0. Each shard is a Lucene worker.
• When Node 1 joins, Primary Shard 1 is placed on Node 1 and Primary Shard 0 stays on Node 0.
• Replicas go on the opposite node: Replica Shard 1 on Node 0 and Replica Shard 0 on Node 1.
• In the last diagram, the copy of shard 0 on Node 1 changes from Replica Shard 0 to Primary Shard 0.
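The placement and promotion steps in the diagrams can be imitated with a toy model. This is not ElasticSearch code: it assumes exactly one replica per shard and a simple round-robin placement, whereas the real allocator weighs many more factors. The node names and function names are made up for the sketch.

```python
def allocate(nodes, num_shards):
    """Put each primary on one node and its single replica on a different node."""
    placement = {node: [] for node in nodes}
    for shard in range(num_shards):
        primary_node = nodes[shard % len(nodes)]
        replica_node = nodes[(shard + 1) % len(nodes)]
        placement[primary_node].append(("primary", shard))
        # With a single node there is nowhere safe to put the replica.
        if replica_node != primary_node:
            placement[replica_node].append(("replica", shard))
    return placement


def fail_node(placement, failed):
    """Drop a node and promote the replicas of the primaries it was holding."""
    lost = {shard for role, shard in placement[failed] if role == "primary"}
    survivors = {}
    for node, copies in placement.items():
        if node == failed:
            continue
        survivors[node] = [
            ("primary", shard) if role == "replica" and shard in lost else (role, shard)
            for role, shard in copies
        ]
    return survivors


cluster = allocate(["node0", "node1"], num_shards=2)
print(cluster)
print(fail_node(cluster, "node0"))
```

With two nodes and two shards, every shard has a copy on each node, so losing either node still leaves a complete set of data that can serve as primaries.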

Demo: Integrated Search for XE Boards

Goals

• Two board modules
• About 60,000 posts
• Simple title + content search
• Simple search UI
• XE search module development (TODO)

http://125.209.193.50:49088

Demo System Architecture

[Diagram sequence: components running in docker containers]
• Step 1: MariaDB (:3306) and Apache with XE (PHP) (:80).
• Step 2: adds ElasticSearch with the JDBC plugin (:9200, :9300).
• Step 3: adds a Web UI served by http-server (:80).

Installing and Running ElasticSearch

• Requires a Java runtime
• Installation
  • http://www.elasticsearch.org/download
  • apt-get, yum
• Running
  • $ elasticsearch -d

Reference: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup.html

Index Design

• Identify the tables and columns to index
  • xe_documents
  • xe_modules
  • xe_comments
• Decide what format the source data should be stored in
  • Should posts and comments be kept separate?
  • What queries will you mostly run?

Demo Index Structure

index           xe_index
document_type   document
document        { module : { … }, article : { … }, comment : { … } }

JDBC Plugin (1)

• One of ElasticSearch's plugins
• JDBC (Java DataBase Connectivity)
  • An API that lets Java programs connect to a database
• Runs the configured queries against the DB, converts the results, and stores them as documents in the index
• https://github.com/jprante/elasticsearch-river-jdbc

JDBC Plugin (2)

Source: https://github.com/jprante/elasticsearch-river-jdbc/raw/master/src/site/resources/simple-tabular-json-data.png

JDBC Plugin (3)

curl -XPUT elasticsearch:9200/_river/xe_index/_meta -d '{
  "type": "jdbc",
  "jdbc": {
    "index": "xe_index",
    "type": "document",
    "url": "jdbc:mysql://<hostname or IP>:3306/maria",
    "user": "maria",
    "password": "maria",
    "sql": "SELECT …"
  }
}'

JDBC Plugin (4)

SELECT
  d.document_srl as `_id`,
  d.document_srl as `article.document_srl`,
  ...
  m.mid as `module.mid`,
  ...
  c.comment_srl as `comments[comment_srl]`,
  ...
FROM xe_documents as d
INNER JOIN xe_modules as m
  on d.module_srl = m.module_srl
INNER JOIN xe_comments as c
  on c.module_srl = m.module_srl
  AND c.document_srl = d.document_srl;
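The dotted column aliases in the query are what give the resulting documents their nested shape. The idea can be sketched as follows. This is an illustration of the concept only, not the plugin's implementation: the sample rows and values are invented, the function names are hypothetical, and the bracket notation used for `comments[comment_srl]` is not handled.

```python
def set_path(doc, dotted_key, value):
    """Store a value under a dotted key: "article.title" -> doc["article"]["title"]."""
    *parents, leaf = dotted_key.split(".")
    node = doc
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


def rows_to_documents(rows):
    """Turn flat result rows (column alias -> value) into nested documents keyed by _id."""
    documents = {}
    for row in rows:
        doc = documents.setdefault(row["_id"], {})
        for column, value in row.items():
            if column != "_id":
                set_path(doc, column, value)
    return documents


# Hypothetical result rows; the values are invented for illustration.
rows = [
    {"_id": 101, "article.document_srl": 101, "article.title": "Hello XE", "module.mid": "free"},
    {"_id": 102, "article.document_srl": 102, "article.title": "Search tips", "module.mid": "qna"},
]

docs = rows_to_documents(rows)
print(docs[101])
```

Each row contributes fields to the document with the same `_id`, which is how one post and its module end up in a single document.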

Title + Content Search

curl -XPOST http://elasticsearch:9200/xe_index/document/_search -d '
{
  "size": 10,
  "from": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "article.title": "XE" }},
        { "match": { "article.content": "XE" }},
        { "fuzzy": { "article.content": "XE" }}
      ]
    }
  }
}'

Sorting by Most Recent

curl -XPOST http://elasticsearch:9200/xe_index/document/_search -d '
{
  "size": 10,
  "from": 0,
  "query": { … },
  "sort": [
    { "article.regdate": {"order": "desc"}},
    "_score"
  ]
}'


Sorting by Relevance

curl -XPOST http://elasticsearch:9200/xe_index/document/_search -d '
{
  "size": 10,
  "from": 0,
  "query": { … },
  "sort": [
    "_score",
    { "article.regdate": {"order": "desc"}}
  ]
}'
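The two sort variants differ only in the order of the entries in "sort": the first entry is the main sort key and the second breaks ties. A search UI could switch between them with a small helper like the one below. The function name, the page-based paging, and the placeholder query are assumptions for this sketch; the slides elide the actual query.

```python
import json


def search_body(query, page=0, size=10, newest_first=False):
    """Build a paged search body; the first sort entry is the main sort key."""
    by_date = {"article.regdate": {"order": "desc"}}
    sort = [by_date, "_score"] if newest_first else ["_score", by_date]
    return {"size": size, "from": page * size, "query": query, "sort": sort}


# Placeholder query, standing in for the one elided on the slides.
query = {"match": {"article.title": "XE"}}

body = search_body(query, page=2, newest_first=True)
print(json.dumps(body, indent=2))
```

Here "from" is derived from a zero-based page number, so page 2 with 10 results per page starts at offset 20.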


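Topics Not Covered

• Security
• Query tuning
  • Experimenting with different queries and filters
• Applying a morphological analyzer
  • Extracting accurate terms through Korean morphological analysis
• …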

Thank you.

• The source code for this talk is available on GitHub.
  • http://github.com/hoonmin/xecon2014
• E-Mail : [email protected]
• Line : @harebox