
Text processing

Team 3

김승곤 박지웅 엄태건 최지헌 임유빈

Outline

1. Introduction

2. Text processing

3. Index techniques in database

4. Index techniques in wireless network

5. Apache Lucene

6. Apache Solr

7. Demo

What’s Text Processing?

● Mechanism for the manipulation of text

o Language processing

o Data structure

o Visualization

o Human factor

● Converting text to index terms

Necessity of Index

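Criteria for measuring database performance

1) Number of SQL statements that access the table

2) Business category

3) Real access patterns (RAP, Real Access Pattern), i.e. the access patterns left after removing duplicates

4) Classification by table type

5) SQL performance information

6) Number of indexes per table

Example of index usage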

Text processing steps

1. Lexical Analysis

2. Elimination of stopwords

3. Stemming

4. Selection of index terms

5. Building a thesaurus

Lexical Analysis

● Converting a byte stream to tokens

o Numbers and digits

o Hyphens: index as a phrase, allow partial match, proximity information

o Punctuation

o Lexer: written by hand (easy and fast, but not flexible) or built with a DFA (deterministic finite automaton) generator, which uses a state machine

“Deaths from car accidents in 1989” -> {Deaths, car, accidents, from, 1989}

“Work-out”

“BS”, “B.S.”, “M.S.”, URLs
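The lexical analysis rules above can be sketched as a small hand-written tokenizer. This is only an illustration under stated assumptions: the regular expression, the lowercasing, the choice to drop periods so that “B.S.” and “BS” map to the same token, and the choice to index a hyphenated word both as a whole and as its parts are ours, not rules taken from the slides. URLs are not handled.

```python
import re

# Assumed token shape: runs of letters/digits, optionally joined by "-" or ".",
# with an optional trailing period (so "B.S." is captured as one token).
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[-.][A-Za-z0-9]+)*\.?")


def tokenize(text):
    """Turn raw text into a list of normalized tokens."""
    tokens = []
    for match in WORD_RE.finditer(text):
        # Normalize case and drop periods: "B.S." -> "bs".
        word = match.group().lower().replace(".", "")
        tokens.append(word)
        # Hyphenated words are kept whole (phrase) and also split (partial match).
        if "-" in word:
            tokens.extend(word.split("-"))
    return tokens


print(tokenize("Deaths from car accidents in 1989"))
print(tokenize("Work-out"))
print(tokenize("B.S."))
```

Note that this sketch keeps every word, including “in”; dropping such words is the job of the stopword step.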

Elimination of Stopwords

● High frequency, but not useful

o For example: the, of, and, a, in, to, is, for, with, are

o Removing stopwords: “to be or not to be” -> {be}

o Statistical approach & Lookup table
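Both ways of finding stopwords can be sketched in a few lines. The lookup table below holds only the ten example words from the slide; `frequent_terms` is a hypothetical helper that treats the `k` most frequent tokens of a collection as stopwords, which is one simple form of the statistical approach.

```python
from collections import Counter

# Lookup table: the ten example stopwords listed on the slide.
STOPWORDS = frozenset("the of and a in to is for with are".split())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword table."""
    return [t for t in tokens if t not in stopwords]


def frequent_terms(documents, k):
    """Statistical approach (assumed form): the k most frequent tokens overall."""
    counts = Counter(token for doc in documents for token in doc)
    return {term for term, _ in counts.most_common(k)}


tokens = "to be or not to be".split()
print(remove_stopwords(tokens))
print(remove_stopwords(tokens, STOPWORDS | {"or", "not"}))
```

With the ten-word table only “to” is removed. The slide's result {be} needs a larger table that also lists “or” and “not”, as in the second call.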

Stemming

● Reduce variant word forms to a single “stem”

o Words

o Four approaches

Table lookup - use a dictionary

Successor variety - fancy suffix removal

Affix removal - cut prefixes and suffixes (-’s, -ing, -ed, -s, in-, ad-, pre-, sub-)

Character N-grams
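Of the four approaches, character N-grams is the least obvious, so here is a minimal sketch. It assumes bigrams (N = 2) and the Dice coefficient as the similarity measure, which is a common choice but not one stated on the slide. Words whose similarity exceeds some threshold would be grouped under the same stem.

```python
def bigrams(word):
    """Set of distinct adjacent character pairs in the word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over the distinct bigrams of two words."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        return 1.0
    shared = len(grams_a & grams_b)
    return 2 * shared / (len(grams_a) + len(grams_b))


# "statistics" has 7 distinct bigrams, "statistical" has 8, and they share 6.
print(dice_similarity("statistics", "statistical"))  # 2 * 6 / (7 + 8) = 0.8
```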

Porter’s algorithm

● Removes suffixes in five stages

o Each depends on a suffix and the stem measure m

o Porter errors: organization/organ, doing/doe, past/paste, european/europe, resolve/resolution

Rule                 Result

SSES -> SS           caresses -> caress
IES -> I             ponies -> poni, ties -> ti
SS -> SS             caress -> caress
S -> ∅               cats -> cat
(m>0) EED -> EE      feed -> feed
(*v*) ED -> ∅        plastered -> plaster
(*v*) ING -> ∅       motoring -> motor
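The rules in the table can be sketched as follows. This covers only the rules listed, not the full five-stage algorithm: the clean-up rules that normally follow a successful ED/ING removal are left out. The function names are ours. The measure m counts vowel-to-consonant transitions in the stem, and (*v*) means the stem contains a vowel.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' is a consonant unless it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """m = number of vowel-run -> consonant-run transitions in the stem."""
    kinds = ["c" if is_consonant(stem, i) else "v" for i in range(len(stem))]
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a == "v" and b == "c")


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step_1a(word):
    # The first (longest) matching suffix wins.
    for suffix, replacement in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word


def step_1b(word):
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if contains_vowel(stem) else word
    return word


for w in ("caresses", "ponies", "ties", "caress", "cats"):
    print(w, "->", step_1a(w))
for w in ("feed", "agreed", "plastered", "motoring", "sing"):
    print(w, "->", step_1b(w))
```

“feed” is unchanged because its stem “f” has m = 0, while “agreed” becomes “agree” because “agr” has m = 1. “sing” is unchanged because “s” contains no vowel.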

Indexing Implementation in Text Database

Index Data Structure

● Different types of index data structures for querying large text collections

o Signature File

o Inverted File

Index Data Structure: Signature File

● F bits of signature

● Make term descriptors with m bits = 1, rest = 0

● Superimpose term descriptors of document to obtain document descriptor

Index Data Structure: Signature File

● Probabilistic indexing method

● Querying

o Form query descriptor for term

o Fetch superset of query descriptor by comparison

o Possibility of wrong result (false drop)
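A minimal sketch of superimposed coding, assuming the m bit positions of a term are picked by hashing the term with SHA-256. The slides do not say how the positions are chosen, so the hashing and the defaults F = 64 and m = 3 are illustrative only.

```python
import hashlib


def term_descriptor(term, F=64, m=3):
    """F-bit descriptor with exactly m bits set, positions derived by hashing."""
    if not 0 < m <= F:
        raise ValueError("need 0 < m <= F")
    bits, counter = 0, 0
    while bin(bits).count("1") < m:
        digest = hashlib.sha256(f"{term}:{counter}".encode()).digest()
        bits |= 1 << (int.from_bytes(digest[:4], "big") % F)
        counter += 1
    return bits


def document_descriptor(terms, F=64, m=3):
    """Superimpose (bitwise OR) the descriptors of all terms in the document."""
    signature = 0
    for term in terms:
        signature |= term_descriptor(term, F, m)
    return signature


def may_contain(doc_signature, query_signature):
    """True if every query bit is set in the document signature."""
    return doc_signature & query_signature == query_signature


doc = document_descriptor(["to", "be", "or", "not"])
print(may_contain(doc, term_descriptor("not")))    # always True: no false negatives
print(may_contain(doc, term_descriptor("thing")))  # usually False, but may be a false drop
```

A `True` answer only marks a candidate. Because different terms can set the same bits, the document still has to be checked to rule out a false drop.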

Index Data Structure: Inverted File

● Store mapping from content to its location

● Structure

o a directory of terms

o posting lists of document IDs

Query: not

String comparison is slow! Solution: inverted index

Query: not Inverted index

Query: be Inverted index

Query: thing Inverted index
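The directory-plus-posting-lists structure can be sketched with a dictionary. The three documents below are hypothetical stand-ins for the collection shown on the slides, chosen so that the queries “not”, “be” and “thing” each return something.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, term):
    """One dictionary lookup replaces a string scan over every document."""
    return index.get(term.lower(), [])


# Hypothetical collection, not the one from the slides.
documents = {
    1: "to be or not to be",
    2: "a thing of beauty",
    3: "not a thing",
}
index = build_inverted_index(documents)
for query in ("not", "be", "thing"):
    print(query, "->", search(index, query))
```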

Drawbacks of Signature File

● False matches must be eliminated

● Requires more disk accesses

● Difficult to construct and maintain

● Larger than an inverted file

Drawbacks of Inverted File

● Performance challenges caused by

o a huge number of documents

o an increasingly large number of users

● The space cost of the associated inverted lists ranges from gigabytes to terabytes

● Research to improve index performance

o more efficient index structures for low space cost and fast query processing

Compression of Inverted File

● Time cost of index

o Seek and retrieve inverted list from disk into memory

o Transfer lists from memory into CPU cache

● Increase number of lists that can be cached

● Reduce number of disk accesses

Compression of Inverted File: d-gap

Document IDs: 4 9 20 28 45 59 81 102 130 157 178 210 237 258

Gaps between consecutive IDs: 5 11 8 17 14 22 21 28 27 21 32 27 21

Stored list (first ID followed by the gaps): 4 5 11 8 17 14 22 21 28 27 21 32 27 21
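The d-gap transformation on this slide can be reproduced with a few lines. The function names are ours. Decoding is a running sum over the stored list.

```python
from itertools import accumulate


def to_dgaps(doc_ids):
    """Keep the first ID, then store each ID as the difference from its predecessor."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_dgaps(gaps):
    """Recover the original sorted IDs with a running sum."""
    return list(accumulate(gaps))


doc_ids = [4, 9, 20, 28, 45, 59, 81, 102, 130, 157, 178, 210, 237, 258]
gaps = to_dgaps(doc_ids)
print(gaps)
print(from_dgaps(gaps) == doc_ids)
```

The gaps are much smaller than the IDs themselves, so a later bit-packing step needs fewer bits per value.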

Compression of Inverted File: Simple-9

● Combination of bit alignment and word alignment

● Pack as many integers as possible into one 32-bit word

● Compression format: selector (4 bits) + data bits (28 bits)

● Compression method

o Example: all values are less than 8 → 3 bits each → selector c

o Example: all values are less than 32 → 5 bits each → selector e
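A sketch of Simple-9 packing. The nine ways of splitting the 28 data bits are the standard ones for this scheme. Mapping the selector letters a to i onto codes 0 to 8 is our assumption, though it agrees with the slide (c ↔ 3 bits, e ↔ 5 bits). The order of values inside a word is also an assumption.

```python
# (number of values, bits per value) for selectors a..i, stored as codes 0..8.
SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]


def simple9_encode(values):
    """Greedily pack as many values as possible into each 32-bit word."""
    words, i = [], 0
    while i < len(values):
        for code, (count, bits) in enumerate(SELECTORS):
            chunk = values[i:i + count]
            if len(chunk) == count and all(0 <= v < (1 << bits) for v in chunk):
                word = code << 28  # 4-bit selector in the top bits
                for k, v in enumerate(chunk):
                    word |= v << (k * bits)  # 28 data bits below it
                words.append(word)
                i += count
                break
        else:
            raise ValueError("value does not fit in 28 bits")
    return words


def simple9_decode(words):
    values = []
    for word in words:
        count, bits = SELECTORS[word >> 28]
        mask = (1 << bits) - 1
        values.extend((word >> (k * bits)) & mask for k in range(count))
    return values


small = [5, 1, 7, 3, 2, 6, 4, 0, 7]  # nine values below 8
print("selector", "abcdefghi"[simple9_encode(small)[0] >> 28])
medium = [17, 31, 5, 20, 9]          # five values below 32
print("selector", "abcdefghi"[simple9_encode(medium)[0] >> 28])
```

Applied to the d-gap list 4 5 11 8 17 14 22 21 28 27 21 32 27 21, this greedy packing needs three 32-bit words instead of fourteen.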

Compression of Inverted File: Bitlist

● Simple and very efficient encoding scheme

o Use encoded number to represent a set of document IDs

o Only use 0/1 to indicate whether a document contains a specific term

o Low space requirement

● naive inverted list

● 0/1 matrix

Bitlist structure

● base number = 4

● [Figure: bitlist with base = 4 compared with bitlist with base = 12]
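The figures for this part did not survive, so the following is only a rough illustration of the idea in the bullets: a term's row of the 0/1 matrix is cut into groups of `base` documents and each group is stored as one number. It is a plain bitmap packing under our own assumptions (1-based document IDs, lowest ID in the lowest bit) and should not be read as the exact Bitlist layout from [2].

```python
def encode_row(doc_ids, num_docs, base):
    """Pack a term's 0/1 row into one integer per group of `base` documents."""
    groups = [0] * ((num_docs + base - 1) // base)
    for doc_id in doc_ids:
        index = doc_id - 1  # document IDs are assumed to start at 1
        groups[index // base] |= 1 << (index % base)
    return groups


def decode_row(groups, base):
    """Recover the sorted document IDs from the packed groups."""
    doc_ids = []
    for g, number in enumerate(groups):
        for bit in range(base):
            if number & (1 << bit):
                doc_ids.append(g * base + bit + 1)
    return doc_ids


# Naive inverted list [1, 4, 6] over 8 documents is the 0/1 row 1 0 0 1 0 1 0 0.
packed = encode_row([1, 4, 6], num_docs=8, base=4)
print(packed)                 # [9, 2]
print(decode_row(packed, 4))  # [1, 4, 6]
```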

Bitlist: DocID reassignment

● before

● after

Reference

[1] Rahevar, Mrugendrasinh L., and Mehul C. Parikh. "Optimized index construction for large text collections using blocked sort-based indexing." Advanced Communication Control and Computing Technologies (ICACCCT), 2014 International Conference on. IEEE, 2014.

[2] Rao, Weixiong, et al. "Bitlist: New full-text index for low space cost and efficient keyword search." Proceedings of the VLDB Endowment 6.13 (2013): 1522-1533.

[3] Zhang, Jiangong, Xiaohui Long, and Torsten Suel. "Performance of compressed inverted list caching in search engines." Proceedings of the 17th international conference on World Wide Web. ACM, 2008.

[4] Zobel, Justin, and Alistair Moffat. "Inverted files for text search engines." ACM computing surveys (CSUR) 38.2 (2006): 6.

[5] Frakes, William B., and Ricardo Baeza-Yates. "Information retrieval: data structures and algorithms." ISBN-10 0134638379 (1992).

Indexing Techniques For Full-Text Search

In wireless broadcast environment

Wireless mobile computing

● Broadcasting

o Effective technique to disseminate information to a massive number of clients through public broadcast channels

o Why? Bandwidth efficiency, energy efficiency, scalability

Yon Dohn Chung, Member, IEEE, Sanghyun Yoo, and Myoung Ho Kim , “Energy- and Latency-Efficient Processing of Full-Text Searches on a Wireless Broadcast Stream”, IEEE TKDE, 2010

Full-text search in Wireless mobile computing

● Full-text search is used in various information systems

● Previous works have been developed for disk storage, not “wireless channels”

● In disk-based storage, documents are stored in physical space, so clients can “jump” among different storage slots

● In on-air storage, documents are stored “sequentially” along the time line

[Figure: a contents provider broadcasts buckets of articles (breaking news, weather reports, …) on a public channel. A client that wants to find articles containing “Database System” has to do a full scan.]

Problem?

● Energy consumption

o Mobile devices have limited battery power, so energy consumption must be reduced

● Active mode <-> doze mode

o Active mode: the device performs operations

o Doze mode: the device does nothing

Metrics

● Traditional

o Number of disk accesses

● In wireless network

o Latency (access time): duration from the time of query submission to the time when the download of the target information is complete

o Energy (tuning time): duration for which the mobile device remains in active mode

Metrics

● Broadcast index buckets and data buckets

● Data access protocol of client

o [Initial Probe]: Receive the bucket currently being broadcast on the air, and check if it is the first bucket of the index

o [IndexWait] : If the current bucket is not the first bucket of the index, wait until the first bucket of the next index arrives on the air

o [DataWait]: Find the target data addresses by using the index, and wait until the target data bucket arrives on the air.
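The three protocol steps can be turned into a back-of-the-envelope model of the two metrics. The model is our simplification, not the scheme from the paper: one broadcast cycle is `n_index` index buckets followed by `n_data` data buckets, the client reads the whole index, it needs exactly one data bucket, and time is counted in buckets.

```python
def access_and_tuning_time(start_offset, target_data, n_index, n_data):
    """Return (access_time, tuning_time) in bucket units for one query.

    start_offset: position in the cycle of the bucket on air when the client tunes in.
    target_data:  position of the wanted bucket within the data part of the cycle.
    """
    cycle = n_index + n_data
    if n_index < 1 or not (0 <= start_offset < cycle and 0 <= target_data < n_data):
        raise ValueError("arguments out of range")

    tuning = 1  # [Initial Probe]: one bucket read in active mode
    if start_offset == 0:
        index_start = 0               # the probe already was the first index bucket
        tuning += n_index - 1         # read the remaining index buckets
    else:
        index_start = cycle - start_offset  # [IndexWait]: doze until the next index
        tuning += n_index                   # read the whole index

    # [DataWait]: doze until the target bucket arrives, then download it.
    access = index_start + n_index + target_data + 1
    tuning += 1
    return access, tuning


print(access_and_tuning_time(0, 0, n_index=2, n_data=5))  # (3, 3)
print(access_and_tuning_time(3, 4, n_index=2, n_data=5))  # (11, 4)
```

In the second call the client is active for only 4 of the 11 buckets and dozes for the rest, which is the gap between access time and tuning time that the indexing schemes aim to exploit.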

Basic scheme

Naive: Inverted list method

Naive: Inverted list method

● Problem

o Large IndexWait time, so AccessTime increases

● Solution

o replication / distribution

Improved: Inverted list + Index tree method

Improved: Inverted list + Index tree method

● Distribution

Evaluation

Reference

● Yon Dohn Chung, Sanghyun Yoo, and Myoung Ho Kim, “Energy- and Latency-Efficient Processing of Full-Text Searches on a Wireless Broadcast Stream”, IEEE TKDE, 2010

Hash Indexing Scheme

● Tree-based indexing vs hash-based indexing

o hash-based is more flexible and space efficient for full-text search in wireless data broadcast

Yang, Kai, et al. "A novel hash-based streaming scheme for energy efficient full-text search in wireless data broadcast." Database Systems for Advanced Applications. Springer Berlin Heidelberg, 2011.

Basic-Hash Indexing Scheme

Reference

Yang, Kai, et al. "A novel hash-based streaming scheme for energy efficient full-text search in wireless data broadcast." Database Systems for Advanced Applications. Springer Berlin Heidelberg, 2011.
