text processing team 3 김승곤 박지웅 엄태건 최지헌 임유빈. outline 1.introduction...

Upload: jodie-parks

Post on 31-Dec-2015

Page 1: Text processing Team 3 김승곤 박지웅 엄태건 최지헌 임유빈. Outline 1.Introduction 2.Text processing 3.Index techniques in database 4.Index techniques in wireless network

Text processing

Team 3

김승곤 박지웅 엄태건 최지헌 임유빈

Outline

1. Introduction

2. Text processing

3. Index techniques in database

4. Index techniques in wireless network

5. Apache Lucene

6. Apache Solr

7. Demo

5/18

5/20

What’s Text Processing?

● Mechanism for the manipulation of text
o Language processing
o Data structure
o Visualization
o Human factor
● Converting text to indexing terms

Necessity of Index

Database performance measurement criteria

1) Number of SQL statements that access the table
2) Business category
3) Real access patterns (RAP, Real Access Pattern), with duplicate access patterns removed
4) Classification by table type
5) SQL performance information
6) Number of indexes per table

Index usage example

Text processing steps

1. Lexical Analysis

2. Elimination of stopwords

3. Stemming

4. Selection of index terms

5. Building a thesaurus

Lexical Analysis

● Converting byte stream to tokens
o Numbers and digits
o Hyphens
   - Index as phrase, allow partial match
   - Proximity information
o Punctuation
o Lexer
   - By hand: easy and fast, but not flexible
   - DFA (Deterministic Finite Automaton) generator: uses a state machine

Examples
o Numbers: “Deaths from car accidents in 1989” -> {Deaths, car, accidents, from, 1989}
o Hyphens: “Work-out”
o Punctuation: “BS”, “B.S.”, “M.S.”, URLs
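A hand-written lexer for the cases above can be sketched with one regular expression. This is only an illustration: the pattern `TOKEN_RE` and the helpers `tokenize` and `hyphen_parts` are hypothetical names, and the choice to keep URLs, dotted abbreviations, and hyphenated words as single tokens is an assumption, not something the slides prescribe.

```python
import re

# Assumed token pattern (not from the slides). Alternatives are tried in order:
# URLs, dotted abbreviations, hyphenated words, then plain words and numbers.
TOKEN_RE = re.compile(
    r"https?://\S+"          # URLs kept whole
    r"|(?:[A-Za-z]\.){2,}"   # dotted abbreviations such as B.S. or M.S.
    r"|\w+(?:-\w+)+"         # hyphenated words such as Work-out
    r"|\w+"                  # ordinary words and numbers
)

def tokenize(text):
    """Convert a character stream into a list of tokens."""
    return TOKEN_RE.findall(text)

def hyphen_parts(token):
    """Return the parts of a hyphenated token, to allow partial matches."""
    return token.split("-") if "-" in token else []

tokens = tokenize("Deaths from car accidents in 1989")
mixed = tokenize("A Work-out after the B.S. exam")
```

Stopwords such as "from" and "in" are still present after this step; removing them is a separate stage of the pipeline.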

Elimination of Stopwords

● High frequency, but of little use for retrieval
o For example: the, of, and, a, in, to, is, for, with, are
o Removing stopwords: “to be or not to be” -> {be}
o Statistical approach & lookup table
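Both approaches can be sketched briefly. The lookup table below extends the slide's example list with "or" and "not" so that the example phrase reduces to {be}; that extension, and the names `STOPWORDS`, `remove_stopwords`, and `frequent_terms`, are assumptions made for illustration.

```python
from collections import Counter

# Lookup table: the slide's ten words plus "or" and "not" (assumed additions).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "with", "are",
             "or", "not"}

def remove_stopwords(tokens):
    """Lookup-table approach: drop every token found in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

def frequent_terms(docs_tokens, k):
    """Statistical approach: propose the k most frequent terms as stopwords."""
    counts = Counter(t.lower() for tokens in docs_tokens for t in tokens)
    return {term for term, _ in counts.most_common(k)}

kept = remove_stopwords("to be or not to be".split())
candidates = frequent_terms(["to be or not to be".split()], 2)
```

Here `set(kept)` is `{"be"}`, matching the slide's example, while the statistical pass flags "to" and "be" as the most frequent terms in this one phrase. A frequency list is therefore usually reviewed before it is used as a stopword table.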

Stemming

● Reduce variant word forms to a single “stem”
o Words
o Four approaches
   - Table lookup: use a dictionary
   - Successor variety: fancy suffix removal
   - Affix removal: cut prefixes and suffixes (-’s, -ing, -ed, -s, in-, ad-, pre-, sub-)
   - Character N-grams

Porter’s algorithm

● Removes suffixes in five stages
o Each rule depends on a suffix and the stem measure m
o Porter errors: organization/organ, doing/doe, past/paste, european/europe, resolve/resolution

Rule               Result
SSES -> SS         caresses -> caress
IES -> I           ponies -> poni, ties -> ti
SS -> SS           caress -> caress
S -> ∅             cats -> cat
(m>0) EED -> EE    feed -> feed
(*v*) ED -> ∅      plastered -> plaster
(*v*) ING -> ∅     motoring -> motor
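The rules in the table can be sketched as follows. This covers only the listed rules from the first stage of Porter's algorithm; the clean-up that the full algorithm applies after removing -ED or -ING, and the four later stages, are omitted. The function names are illustrative.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u,
    or a 'y' that follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Stem measure m: the number of vowel-run to consonant-run transitions."""
    pattern = "".join("c" if is_consonant(stem, i) else "v"
                      for i in range(len(stem)))
    return pattern.count("vc")

def has_vowel(stem):
    """The (*v*) condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def step1a(word):
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]          # IES -> I
    if word.endswith("ss"):
        return word               # SS -> SS
    if word.endswith("s"):
        return word[:-1]          # S -> (nothing)
    return word

def step1b(word):
    if word.endswith("eed"):
        # (m>0) EED -> EE: the measure is taken on the part before "eed"
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if has_vowel(stem) else word
    return word

def stem_partial(word):
    return step1b(step1a(word.lower()))

results = {w: stem_partial(w) for w in
           ["caresses", "ponies", "ties", "caress", "cats",
            "feed", "plastered", "motoring", "sing"]}
```

"feed" is left unchanged because the part before -EED ("f") has measure 0, and "sing" is left unchanged because "s" contains no vowel.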

Indexing Implementation in Text Database

Index Data Structure

● Different types of index data structures for querying large text collections
o Signature File
o Inverted File

Index Data Structure: Signature File

● Each signature is F bits wide
● Build each term descriptor by setting m bits to 1 and the rest to 0
● Superimpose (bitwise OR) the term descriptors of a document to obtain the document descriptor

● Probabilistic indexing method
● Querying
o Form the query descriptor for the term
o Fetch the documents whose descriptor is a superset of the query descriptor, by comparison
o Possibility of wrong result (false drop)
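The superimposed-coding idea can be sketched as below. The values of `F` and `M`, the use of SHA-256 to pick bit positions, and all function names are assumptions for illustration; the slides do not specify a hash function or sizes.

```python
import hashlib

F = 64  # signature width in bits (assumed value)
M = 3   # bits set per term descriptor (assumed value)

def term_descriptor(term):
    """Set up to M of the F bits, chosen by hashing the term."""
    sig = 0
    for i in range(M):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % F)
    return sig

def document_descriptor(terms):
    """Superimpose (OR together) the descriptors of all terms."""
    sig = 0
    for term in terms:
        sig |= term_descriptor(term)
    return sig

def candidate_docs(query_term, doc_sigs):
    """Return documents whose descriptor covers every query bit.
    The result may contain false drops, so candidates still need
    to be checked against the actual text."""
    q = term_descriptor(query_term)
    return [doc_id for doc_id, sig in doc_sigs.items() if sig & q == q]

docs = {0: ["to", "be", "or", "not"], 1: ["do", "that", "thing"]}
doc_sigs = {doc_id: document_descriptor(terms) for doc_id, terms in docs.items()}
hits = candidate_docs("be", doc_sigs)
```

A document that contains the term is always returned, because its descriptor includes all of the term's bits. The reverse does not hold, which is the source of false drops.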

Index Data Structure: Inverted File

● Store mapping from content to its location
● Structure
o a directory of terms
o posting lists of document IDs
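A directory of terms with posting lists can be sketched with a dictionary. The two sample documents are hypothetical and are not the ones shown on the slides; whitespace tokenization is also a simplifying assumption.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted posting list of document IDs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)  # doc IDs arrive in increasing order
    return index

def search_all(index, terms):
    """Return the IDs of documents that contain every query term."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["to be or not to be", "do not do that thing"]  # hypothetical documents
index = build_inverted_index(docs)
not_hits = index["not"]
both = search_all(index, ["not", "thing"])
```

A lookup such as `index["not"]` goes straight to the posting list instead of comparing the query string against every document.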

Query: not

String comparison slow! Solution: Inverted index

[Figure: inverted index lookups over documents 0 and 1 for the queries “not”, “be”, and “thing”]

Drawbacks of Signature File

● False matches (false drops) must be eliminated

● Require more disk access

● Difficult to construct and maintain

● Larger than inverted file

Drawbacks of Inverted File

● Performance challenges caused by
o a huge number of documents
o an increasingly large number of users
● Space cost of the associated inverted lists ranges from gigabytes to terabytes
● Research to improve index performance
o more efficient index structures for low space cost and fast query processing

Compression of Inverted File

● Time cost of index
o Seek and retrieve inverted list from disk into memory
o Transfer lists from memory into CPU cache
● Compression increases the number of lists that can be cached
● Compression reduces the number of disk accesses

Compression of Inverted File: d-gap

Document IDs:  4 9 20 28 45 59 81 102 130 157 178 210 237 258
d-gaps:          5 11 8 17 14 22 21 28 27 21 32 27 21
Stored list:   4 5 11 8 17 14 22 21 28 27 21 32 27 21
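The conversion can be reproduced in a few lines, using the numbers from the slide. The function names are illustrative.

```python
from itertools import accumulate

def to_dgaps(doc_ids):
    """Keep the first ID, then store each ID as the gap from its predecessor."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_dgaps(gaps):
    """Recover the original IDs with a running sum."""
    return list(accumulate(gaps))

doc_ids = [4, 9, 20, 28, 45, 59, 81, 102, 130, 157, 178, 210, 237, 258]
gaps = to_dgaps(doc_ids)
restored = from_dgaps(gaps)
```

The largest stored value drops from 258 to 32, so each entry fits in fewer bits. A bit-packing scheme can then take advantage of the smaller values.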

Compression of Inverted File: Simple-9

● Combination of bit alignment and word alignment
● Pack as many integers as possible into one 32-bit word
● Compression format: selector (4 bits) + data bits (28 bits)
● Compression method
o Example: all values are less than 8 → 3 bits each → selector c
o Example: all values are less than 32 → 5 bits each → selector e
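A compact sketch of the packing is shown below. The selector table is the commonly published Simple-9 layout (selectors a to i, stored here as indices 0 to 8), which agrees with the two examples above but is not reproduced on the slides. Placing the selector in the top four bits and choosing a sparser layout for a short tail instead of padding are both assumptions of this sketch.

```python
# (values per word, bits per value) for selectors a..i; each fits in 28 data bits.
LAYOUTS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_encode(values):
    """Greedily pack integers into 32-bit words: 4-bit selector + 28 data bits."""
    words = []
    pos = 0
    while pos < len(values):
        for selector, (count, bits) in enumerate(LAYOUTS):
            chunk = values[pos:pos + count]
            if len(chunk) == count and all(0 <= v < (1 << bits) for v in chunk):
                word = selector << 28
                for i, v in enumerate(chunk):
                    word |= v << (i * bits)
                words.append(word)
                pos += count
                break
        else:
            raise ValueError("values must be in the range [0, 2**28)")
    return words

def simple9_decode(words):
    """Read the selector of each word, then unpack its fixed-width values."""
    out = []
    for word in words:
        count, bits = LAYOUTS[word >> 28]
        mask = (1 << bits) - 1
        out.extend((word >> (i * bits)) & mask for i in range(count))
    return out

small = simple9_encode([1, 2, 3, 4, 5, 6, 7, 0, 1])   # nine values below 8
medium = simple9_encode([17, 31, 8, 20, 5])           # five values below 32
gaps = [4, 5, 11, 8, 17, 14, 22, 21, 28, 27, 21, 32, 27, 21]
packed = simple9_encode(gaps)
```

With this layout the nine small values fit in one word with selector index 2 (selector c), and the five medium values fit in one word with selector index 4 (selector e). The 14 d-gaps from the earlier example pack into three words.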

Compression of Inverted File: Bitlist

● Simple and very efficient encoding scheme
o Use an encoded number to represent a set of document IDs
o Only use 0/1 to indicate whether a document contains a specific term
o Low space requirement
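The 0/1 idea on its own can be illustrated with one integer per term, where bit d is set when document d contains the term. This is only a sketch of the underlying bit matrix, not the Bitlist encoding from the paper (its base-number grouping and DocID reassignment are not modelled). The sample documents and function names are hypothetical.

```python
def build_bit_rows(docs):
    """One integer per term: bit d is 1 if document d contains the term."""
    rows = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            rows[term] = rows.get(term, 0) | (1 << doc_id)
    return rows

def doc_ids(row):
    """Decode a row back into the list of document IDs it represents."""
    return [i for i in range(row.bit_length()) if (row >> i) & 1]

docs = ["to be or not to be", "do not do that thing"]  # hypothetical documents
rows = build_bit_rows(docs)
both = doc_ids(rows["not"] & rows["thing"])  # AND query as a bitwise AND
```

Each row is a single number that stands for a whole set of document IDs, and a conjunctive query becomes a bitwise AND of two rows.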

Bitlist structure

● naive inverted list
● 0/1 matrix
● bitlist, base number = 4
● bitlist, base number = 12

Bitlist: DocID reassignment

● before
● after

Reference

[1] Rahevar, Mrugendrasinh L., and Mehul C. Parikh. "Optimized index construction for large text collections using blocked sort-based indexing." Advanced Communication Control and Computing Technologies (ICACCCT), 2014 International Conference on. IEEE, 2014.
[2] Rao, Weixiong, et al. "Bitlist: New full-text index for low space cost and efficient keyword search." Proceedings of the VLDB Endowment 6.13 (2013): 1522-1533.
[3] Zhang, Jiangong, Xiaohui Long, and Torsten Suel. "Performance of compressed inverted list caching in search engines." Proceedings of the 17th international conference on World Wide Web. ACM, 2008.
[4] Zobel, Justin, and Alistair Moffat. "Inverted files for text search engines." ACM computing surveys (CSUR) 38.2 (2006): 6.
[5] Frakes, William B., and Ricardo Baeza-Yates. "Information retrieval: data structures and algorithms." ISBN-10 0134638379 (1992).

Indexing Techniques For Full-Text Search

In wireless broadcast environment

Wireless mobile computing

● Broadcasting
o Effective technique to disseminate information to a massive number of clients through public broadcast channels
o Why?
   - bandwidth efficiency
   - energy efficiency
   - scalability

Yon Dohn Chung, Member, IEEE, Sanghyun Yoo, and Myoung Ho Kim , “Energy- and Latency-Efficient Processing of Full-Text Searches on a Wireless Broadcast Stream”, IEEE TKDE, 2010

Full-text search in Wireless mobile computing

● Full-text search is used in various information systems

● Previous work was developed for disk storage, not “wireless channels”

● In disk-based storage, documents are stored in physical space, so clients can “jump” among different storage slots

● In on-air storage, documents are stored “sequentially” along the timeline

[Figure: a contents provider broadcasts buckets of articles (breaking news, weather reports, …) over a public channel, while a client states: I want to find articles containing “Database System”]

Full Scan!

Problem?

● Energy consumption
o A mobile device has limited battery power, so energy consumption must be reduced
● Active mode <-> doze mode
o Active mode: performs operations
o Doze mode: does nothing

Metrics

● Traditional
o Number of disk accesses
● In wireless network
o Latency (access time): duration from the time of query submission to the time when the download of the target information is complete
o Energy (tuning time): duration for which the mobile device remains in active mode

Basic scheme

● Broadcast index buckets and data buckets
● Data access protocol of client
o [Initial Probe]: Receive the bucket currently being broadcast on the air, and check whether it is the first bucket of the index
o [IndexWait]: If the current bucket is not the first bucket of the index, wait until the first bucket of the next index arrives on the air
o [DataWait]: Find the target data addresses by using the index, and wait until the target data bucket arrives on the air
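The protocol can be turned into a small cost model that reports both metrics in bucket units. Everything in it is an assumption made for illustration: the cycle layout (one index segment followed by the data segment), the client reading the whole index, the client dozing whenever it waits, and the names `access_cost` and `full_scan_cost`. It is not the scheme evaluated in the cited paper.

```python
def access_cost(tune_in, index_len, data_len, targets):
    """Return (access_time, tuning_time) in buckets for one query.

    Assumed cycle layout: index buckets at positions [0, index_len),
    followed by data_len data buckets.
    tune_in: cycle position of the bucket on the air when the client starts.
    targets: non-empty list of wanted offsets within the data segment.
    """
    cycle = index_len + data_len
    if not targets or not 0 <= tune_in < cycle:
        raise ValueError("need at least one target and 0 <= tune_in < cycle")
    if tune_in == 0:
        # Initial probe lands on the first index bucket: no IndexWait.
        index_start, tuning = 0, 0
    else:
        # Initial probe costs one active bucket, then doze until the next index.
        index_start, tuning = cycle, 1
    tuning += index_len      # read the index in active mode
    tuning += len(targets)   # DataWait: wake up only for the target buckets
    finish = index_start + index_len + max(targets) + 1
    return finish - tune_in, tuning

def full_scan_cost(index_len, data_len):
    """Without using the index the client listens to one whole cycle."""
    cycle = index_len + data_len
    return cycle, cycle

lucky = access_cost(0, 2, 8, [1, 5])
unlucky = access_cost(3, 2, 8, [1, 5])
scan = full_scan_cost(2, 8)
```

In this toy setting the indexed client stays active for 4 or 5 buckets instead of 10, at the price of a longer access time when it tunes in just after the index has passed.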

Naive: Inverted list method

● Problem
o Large IndexWait time, so AccessTime increases
● Solution
o Replication / distribution

Improved: Inverted list + Index tree method

● Distribution

Evaluation

Reference

● Yon Dohn Chung, Sanghyun Yoo, and Myoung Ho Kim, “Energy- and Latency-Efficient Processing of Full-Text Searches on a Wireless Broadcast Stream”, IEEE TKDE, 2010

Hash Indexing Scheme

● Tree-based indexing vs hash-based indexing
o Hash-based is more flexible and space efficient for full-text search in wireless data broadcast

Yang, Kai, et al. "A novel hash-based streaming scheme for energy efficient full-text search in wireless data broadcast." Database Systems for Advanced Applications. Springer Berlin Heidelberg, 2011.

Basic-Hash Indexing Scheme

Reference

Yang, Kai, et al. "A novel hash-based streaming scheme for energy efficient full-text search in wireless data broadcast." Database Systems for Advanced Applications. Springer Berlin Heidelberg, 2011.