Intro to Elasticsearch
Post on 16-Jan-2017
Your Data, Your Search! 2016-06-27
Outline
- Information retrieval
- Indexing & Searching
- Elasticsearch
Information retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Search engine
A search engine is a software system designed to search for information. It is one kind of implementation of IR.
What is a search engine?
A search engine is:
- An index engine for documents
- A search engine on indexes
A search engine is more powerful for searching: it is designed for it!
Search Engine Architecture
Problems?
- How to store the data?
- How to index the data?
- How to search the data?
How to store the data?
Inverted list
How to INDEX the data?
Take the following two files:
File1: Students should be allowed to go out with their friends, but not allowed to drink beer.
File2: My friend Jerry went to school to see his students but found them drunk which is not allowed.
Step 1: Tokenizer
- Split the document into words
- Remove the punctuation
- Remove stop words (the, a, this, that, etc.)
Tokens:
File1: Students allowed go their friends allowed drink beer
File2: My friend Jerry went school see his students found them drunk allowed
Step 2: Linguistic processor
- Lowercase
- Stemming: cars -> car, etc.
- Lemmatization: drove -> drive, etc.
Terms:
File1: student allow go their friend allow drink beer
File2: my friend jerry go school see his student find them drink allow
Step 3: Index
Term      Document ID
student   1
allow     1
go        1
their     1
friend    1
allow     1
Dict, Sort, Posting list
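The three indexing steps can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene implements it: the stop-word list and the small lemma table are assumptions chosen so that the output matches the token and term lists on the slides, and the names `analyze` and `build_inverted_index` are made up for this example.

```python
import re
from collections import defaultdict

# Assumed stop-word list, picked to reproduce the token lists on the slides.
STOP_WORDS = {"should", "be", "to", "out", "with", "but", "not", "which", "is",
              "the", "a", "this", "that"}

# Toy lemma table standing in for a real stemmer/lemmatizer (assumption).
LEMMAS = {"students": "student", "allowed": "allow", "friends": "friend",
          "went": "go", "found": "find", "drunk": "drink"}


def analyze(text):
    """Steps 1 and 2: tokenize, drop stop words, lowercase, reduce to base form."""
    tokens = re.findall(r"[A-Za-z]+", text)  # splitting on letters drops punctuation
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]


def build_inverted_index(docs):
    """Step 3: map every term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    # Sort the dictionary by term and each posting list by document ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    1: "Students should be allowed to go out with their friends, "
       "but not allowed to drink beer.",
    2: "My friend Jerry went to school to see his students "
       "but found them drunk which is not allowed.",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Terms that occur in both files, such as "student" and "drink", end up with a posting list holding both document IDs, while "beer" points only to File1.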
How to SEARCH the data?
Step 1: User search query
Suppose you have the following query:
lucene AND learned NOT hadoop
Step 2: Lexical & syntax analysis
- Identify words and keywords
  - Words: lucene, learned, hadoop
  - Keywords: AND, NOT
- Build a syntax tree
[Figure: syntax tree with the operator nodes AND and NOT over the terms lucene, learned, and hadoop]
Step 3: Search
- Search in the inverted list
- Sort, conjunction, disjunction
- Scorer
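As a rough sketch of the search step, the query lucene AND learned NOT hadoop can be evaluated by combining posting lists: AND intersects two lists and NOT subtracts one list from another. The posting lists below are invented for illustration, scoring is left out, and `intersect` and `difference` are hypothetical helper names.

```python
def intersect(a, b):
    """AND: walk two sorted posting lists and keep IDs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def difference(a, b):
    """NOT: keep IDs from a that do not appear in b."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]


# Hypothetical posting lists (term -> sorted document IDs).
postings = {
    "lucene": [1, 2, 4, 7],
    "learned": [2, 4, 5, 7],
    "hadoop": [4, 9],
}

# lucene AND learned NOT hadoop
hits = difference(intersect(postings["lucene"], postings["learned"]),
                  postings["hadoop"])
print(hits)  # [2, 7]
```

Because both lists are sorted, the intersection needs only one pass over each list, which is one reason the index keeps posting lists in sorted order.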
Elasticsearch: full-text search, RESTful API, real-time search and analytics engine, open source, high availability, schema free, JSON over HTTP, Lucene based, distributed
Elasticsearch
- Distributed and highly available search engine.
  - Each index is fully sharded with a configurable number of shards.
  - Each shard can have one or more replicas.
  - Read / search operations are performed on any one of the replica shards.
- Multi-tenant with multiple types.
  - Support for more than one index.
  - Support for more than one type per index.
  - Index-level configuration (number of shards, index storage, ...).
- Document oriented.
  - No need for upfront schema definition.
  - Schema can be defined per type for customization of the indexing process.
- Various sets of APIs.
  - HTTP RESTful API.
  - Native Java API.
  - All APIs perform automatic node operation rerouting.
- (Near) real-time search.
- Reliable, asynchronous write-behind for long-term persistency.
- Built on top of Lucene.
  - Each shard is a fully functional Lucene index.
  - All the power of Lucene easily exposed through simple configuration / plugins.
- Per-operation consistency.
  - Single-document-level operations are atomic, consistent, isolated and durable.
- Open source under the Apache License, version 2 ("ALv2").
Terminologies of Elasticsearch
- Cluster
- Node
- Index
- Shard
Cluster
A cluster is a collection of one or more nodes (servers) that together hold your entire data and provide federated indexing and search capabilities across all nodes.
A cluster is identified by a unique name, which by default is "elasticsearch".
Node
A node is an Elasticsearch instance (a Java process).
A node is created when an Elasticsearch instance is started.
A random Marvel character name is assigned to the node by default.
Index
An index is a collection of documents that have somewhat similar characteristics, e.g. customer data or a product catalog.
The index is referred to by name when performing indexing, search, update, and delete operations against the documents in it.
You can define as many indexes as you want in a single cluster.
Document
A document is the most basic unit of information that can be indexed.
It is expressed in JSON as key:value pairs, e.g. {"user": "nullcon"}.
Every document is associated with a type and a unique ID.
Shard
Every index can be split into multiple shards so that its data can be distributed. The shard is the atomic part of an index, which can be distributed over the cluster if you add more nodes.
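The slides do not say how a document ends up on a particular shard. One common approach, assumed here purely for illustration, is to hash the document ID and take the result modulo the number of shards. The sketch uses `zlib.crc32` as a stand-in hash; Elasticsearch uses its own routing hash, and `pick_shard` and `NUM_SHARDS` are hypothetical names.

```python
import zlib


def pick_shard(doc_id, num_primary_shards):
    """Map a document ID to a shard number in [0, num_primary_shards)."""
    # crc32 is only a stand-in hash (assumption), chosen because it is
    # deterministic across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


NUM_SHARDS = 5  # hypothetical number of shards for the index

placement = {}
for doc_id in ["1", "2", "3", "4", "5", "6", "7", "8"]:
    placement.setdefault(pick_shard(doc_id, NUM_SHARDS), []).append(doc_id)

for shard, doc_ids in sorted(placement.items()):
    print("shard", shard, "->", doc_ids)
```

The same ID always maps to the same shard, so a later read of that document can go straight to the shard that holds it.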
A terminology comparison
Relational database    Elasticsearch
Database               Index
Table                  Type
Row                    Document
Column                 Field
Schema                 Mapping
Index                  Everything is indexed
SQL                    Query DSL
SELECT * FROM tb       GET http://
UPDATE tb SET          PUT http://
Playing with Elasticsearch
REST API: http://host:port/[index]/[type]/[_action/id]
HTTP methods: GET, POST, PUT, DELETE
Search
curl -XGET http://localhost:9200/my_index/test/_search
curl -XGET http://localhost:9200/my_index/_search
curl -XGET http://localhost:9200/_search
Meta data
curl -XGET http://localhost:9200/my_index/_status
Documents
curl -XPUT http://localhost:9200/my_index/test/1
curl -XGET http://localhost:9200/my_index/test/1
curl -XDELETE http://localhost:9200/my_index/test/1
Example: Index
curl -XPUT http://localhost:9200/my_index/test/1 -d '{"name": "joeywen", "value": 100}'
Example: Search
curl -XGET http://localhost:9200/my_index/_search -d '{"query": {"match_all": {}}}'
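The same two requests can be prepared from Python with only the standard library. This is a sketch under the assumption that a node is listening on localhost:9200 as in the curl examples; `build_request` and `send` are hypothetical helper names, and nothing is sent until `send` is called.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # same host and port as the curl examples


def build_request(method, path, body=None):
    """Prepare (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def send(request):
    """Send a prepared request and decode the JSON reply (needs a running node)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Index the document with ID 1 into my_index/test.
index_req = build_request("PUT", "/my_index/test/1",
                          {"name": "joeywen", "value": 100})

# Search my_index for all documents.
search_req = build_request("GET", "/my_index/_search",
                           {"query": {"match_all": {}}})

print(index_req.get_method(), index_req.full_url)
print(search_req.get_method(), search_req.full_url)
```

With a node running, `send(index_req)` followed by `send(search_req)` would perform the two calls.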
[Figure: search response annotated with total number of docs, relevance, search time, and max score]
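To show where those four values sit in a search reply, here is a small sketch that reads them out of a sample response. The response below is hand-written for illustration, not captured from a real node, and it assumes the response shape of Elasticsearch versions from the time of these slides, where `hits.total` is a plain integer. `summarize` is a hypothetical helper name.

```python
import json

# Hand-written sample reply (assumption), shaped like a match_all search result.
sample_response = json.loads("""
{
  "took": 3,
  "timed_out": false,
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my_index",
        "_type": "test",
        "_id": "1",
        "_score": 1.0,
        "_source": {"name": "joeywen", "value": 100}
      }
    ]
  }
}
""")


def summarize(response):
    """Pull out search time, total docs, max score, and per-document relevance."""
    hits = response["hits"]
    return {
        "search_time_ms": response["took"],   # search time in milliseconds
        "total_docs": hits["total"],          # total number of matching docs
        "max_score": hits["max_score"],       # highest relevance score
        "scores": [(h["_id"], h["_score"]) for h in hits["hits"]],  # relevance
    }


summary = summarize(sample_response)
print(summary)
```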
Creating, indexing, or deleting a single document
Plugins-Kopf
Plugins-head
Web
Q&A