Elasticsearch po slovensky (Elasticsearch in Slovak)

Post on 17-Jul-2015

  • {in Slovak}.

    Igor Rjabinin {lab.SNG}

    elastic{search}.

  • Elasticsearch: The Definitive Guide

    Elasticsearch is a real-time distributed search and analytics engine. It allows you to explore your data at a speed and at a scale never before possible. It is used for full-text search, structured search, analytics, and all three in combination.

  • WIKIPEDIA: full-text search, search-as-you-type & did-you-mean, highlighting of the search term in the results

    STACKOVERFLOW: full-text search, more-like-this

    GITHUB: search across ~130*10^9 lines of code

  • DISTRIBUTED: horizontally scalable

    APACHE LUCENE: information retrieval software library; enables high-performance, advanced search

    RESTful API: communication using JSON over HTTP, e.g. curl -X {GET,PUT,POST,DELETE}

  • SQL vs Elasticsearch

    SQL is a relational database; Elasticsearch is a search engine.

    SQL: filtering at the binary level

    Elasticsearch: full-text search + filtering at the binary level

  • SQL vs Elasticsearch

    SQL         Elasticsearch
    database    index
    table       type
    row         document

  • INSTALLATION

    > brew install elasticsearch
    (...)
    > cd /usr/local/Cellar/elasticsearch/1.3.4
    > bin/plugin -i elasticsearch/marvel/latest
    (...)
    > launchctl load ~/Library/LaunchAgents/homebrew.mxcl.elasticsearch.plist

  • TEST

    > curl -X GET localhost:9200

    { "status" : 200, "name" : "Jim Hammond", "version" : { "number" : "1.3.4", "build_hash" : "a70f3ccb52200f8f2c87e9c370c6597448eb3e45", "build_timestamp" : "2014-09-30T09:07:17Z", "build_snapshot" : false, "lucene_version" : "4.9" }, "tagline" : "You Know, for Search" }

  • COMMUNICATION

    HTTP methods: GET, POST, PUT, DELETE

    format: {METHOD} /{index}/{type}/{id}

    JSON

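  • INSERT / UPDATE

    PUT /nervosa/podujatia/1
    { "title": "WebElement", "text": "WebElement je pravidelné stretnutie ľudí zaujímajúcich sa o weby a technológie s webmi spojené." }

    (The text reads: "WebElement is a regular meetup of people interested in the web and web-related technologies.")

    Response to the first request (insert):
    { "_index": "nervosa", "_type": "podujatia", "_id": "1", "_version": 1, "created": true }

    Response when the same id is sent again (update):
    { "_index": "nervosa", "_type": "podujatia", "_id": "1", "_version": 2, "created": false }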
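The same insert/update request can be assembled from a script. The following is a minimal sketch using only the Python standard library; it assumes a node at localhost:9200 (as on the TEST slide), and the helper name build_index_request is hypothetical. It only builds the request; sending it needs a running node.

```python
import json
import urllib.request

# Assumption: a local node on localhost:9200, as on the TEST slide.
BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that inserts or updates a document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request("nervosa", "podujatia", 1, {"title": "WebElement"})
print(request.get_method(), request.full_url)

# Sending it requires a running node:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Sending the same request twice should produce the two responses shown on the slide: "created": true with "_version": 1, then "created": false with "_version": 2.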

  • BASIC SEARCH

    GET /nervosa/podujatia/_search?q=weby

    {
      "took": 5,
      "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 1,
        "max_score": 0.076713204,
        "hits": [
          {
            "_index": "nervosa", "_type": "podujatia", "_id": "1",
            "_score": 0.076713204,
            "_source": { "title": "WebElement", "text": "WebElement je pravidelné stretnutie ľudí zaujímajúcich sa o weby a technológie s webmi spojené." }
          }
        ]
      }
    }

    "took": time in ms

    "hits.total": number of documents found

    "_score": score/relevance
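To make the three annotated fields concrete, here is a small sketch that reads them out of a search response with Python's json module. The response is the one from the slide with the "_source" shortened; the function name summarize and the keys of its result are hypothetical.

```python
import json

# Response copied from the slide; "_source" is shortened for brevity.
RESPONSE = """
{
  "took": 5,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "hits": {
    "total": 1,
    "max_score": 0.076713204,
    "hits": [
      {"_index": "nervosa", "_type": "podujatia", "_id": "1",
       "_score": 0.076713204,
       "_source": {"title": "WebElement"}}
    ]
  }
}
"""


def summarize(raw):
    """Extract the query time, the hit count and the per-document scores."""
    data = json.loads(raw)
    hits = data["hits"]
    return {
        "took_ms": data["took"],
        "total": hits["total"],
        "scores": {hit["_id"]: hit["_score"] for hit in hits["hits"]},
    }


summary = summarize(RESPONSE)
print(summary)
```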

  • LUCENE QUERY PARSER

    GET /nervosa/podujatia/_search?q={QUERY}

    Terms: apple

    Phrases: "apple iphone"

    Proximity: "apple safari"~5

    Fuzzy: apple~0.8

    Wildcards: app*, *pp*

    Boosting: apple^10 safari

    Range: [2011/05/01 TO 2011/05/31], [java TO json]

    Boolean: apple AND NOT iphone; +apple -iphone; (apple OR iphone) AND NOT review

    Fields: title:iphone^15 OR body:iphone; published_on:[2011/05/01 TO "2011/05/27 10:00:00"]

  • SEARCH using QUERY DSL

    GET /nervosa/podujatia/_search
    { "query": { "match": { "title": "webelement" } } }

    GET /nervosa/podujatia/_search
    { "query": { "match": { "title": "zdruzenie" } } }

    (diacritics missing)

  • SEARCHING IN ELASTICSEARCH

    Every field is searchable. There are two kinds of search:

    structured search (filter): all events at Nervosa in 2014 (the resulting score is always 1)

    full-text search: searches text, e.g. events at Nervosa where Composer was discussed; computes a score according to relevance*

  • INVERTED INDEX

    Doc_1: The quick brown fox jumped over the lazy dog
    Doc_2: Quick brown foxes leap over lazy dogs in summer

    Term      Doc_1  Doc_2
    -------------------------
    Quick   |       |  X
    The     |   X   |
    brown   |   X   |  X
    dog     |   X   |
    dogs    |       |  X
    fox     |   X   |
    foxes   |       |  X
    in      |       |  X
    jumped  |   X   |
    lazy    |   X   |  X
    leap    |       |  X
    over    |   X   |  X
    quick   |   X   |
    summer  |       |  X
    the     |   X   |
    -------------------------

    ?q=quick brown

    Term      Doc_1  Doc_2
    -------------------------
    brown   |   X   |  X
    quick   |   X   |
    -------------------------
    Total   |   2   |  1
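The table can be reproduced with a few lines of Python. This is a minimal sketch of the idea only, not how Lucene stores its index: terms are split on whitespace and kept case-sensitive (so Quick and quick stay separate, as in the table), and the function names are hypothetical.

```python
from collections import defaultdict

DOCS = {
    "Doc_1": "The quick brown fox jumped over the lazy dog",
    "Doc_2": "Quick brown foxes leap over lazy dogs in summer",
}


def build_inverted_index(docs):
    """Map every term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # no analysis yet: terms stay case-sensitive
            index[term].add(doc_id)
    return index


def search(index, query):
    """Count, per document, how many query terms it contains."""
    totals = defaultdict(int)
    for term in query.split():
        for doc_id in index.get(term, ()):
            totals[doc_id] += 1
    return dict(totals)


index = build_inverted_index(DOCS)
print(search(index, "quick brown"))
```

Doc_1 matches both terms and Doc_2 only one, which is the "Total" row above. A search for Quick would find only Doc_2; that gap is what the analysis step is meant to close.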

  • SEARCHING IN ELASTICSEARCH

    Indexing a document:

    takes a field from the document
    selects its analyzer
    parses the text into tokens
    applies token filters
    stores the result in the index

  • ANALYSIS

    Three stages, in order: character filters, tokenizer, token filters.

  • ANALYSIS: Character filters

    Tidy up the text before tokenization, e.g. strip out [...] or transform & into and.

  • ANALYSIS: Tokenizer

    Chops the text into the terms that will be indexed. The standard tokenizer splits text into terms on word boundaries.

    "Set the shape to semi-transparent by calling set_trans(5)"

    set, the, shape, to, semi, transparent, by, calling, set_trans, 5

  • ANALYSIS: Token filters

    Applied to every token; a token filter can:

    change it (lowercase, asciifolding)
    delete it (e.g. the Slovak stopwords a, aj, že)
    add further tokens (e.g. synonyms)

    # language analyzer "english"
    "Set the shape to semi-transparent by calling set_trans(5)"

    set, shape, semi, transpar, call, set_tran, 5
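The three stages can be imitated in a few lines of Python. This is only a toy model under stated assumptions: the regular expression, the STOPWORDS set and all function names are hypothetical, it is not how Elasticsearch or Lucene implement analysis, and stemming (which turns transparent into transpar on the slide) is left out.

```python
import re

# Hypothetical stopword list, just large enough for the example sentence.
STOPWORDS = {"the", "to", "by"}


def char_filter(text):
    """Stage 1: tidy the raw text before it is tokenized."""
    return text.replace("&", " and ")


def tokenize(text):
    """Stage 2: split into runs of letters, digits and underscores."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Stage 3: lowercase every token, then drop stopwords."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in STOPWORDS]


def analyze(text):
    return token_filters(tokenize(char_filter(text)))


SENTENCE = "Set the shape to semi-transparent by calling set_trans(5)"
print(tokenize(SENTENCE))
print(analyze(SENTENCE))
```

Lowercasing the output of tokenize gives the token list from the tokenizer slide; analyze additionally drops the stopwords but, unlike the english analyzer, leaves transparent, calling and set_trans unstemmed.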

  • WHEN ANALYSIS HAPPENS

    at index time (on the text being indexed)
    at full-text search time (on the search term)
    usually the same analyzer is applied in both cases (and it should be)
    when filtering, the exact term is looked up and no analyzer is applied
    fields that will be used for filtering must be set to "index": "not_analyzed" (e.g. tags)
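Why filter fields need "not_analyzed" can be shown with a toy model. In this sketch (hypothetical function names, whitespace tokenization plus lowercasing standing in for a real analyzer), an analyzed field stores several lowercased terms, so an exact-match filter for the original value finds nothing.

```python
def index_terms(value, analyzed):
    """Return the terms that would be stored in the index for one field value."""
    if analyzed:
        # Stand-in for a real analyzer: split on whitespace and lowercase.
        return {token.lower() for token in value.split()}
    # not_analyzed: the whole value is stored as a single term.
    return {value}


def term_filter(terms, wanted):
    """A filter looks up the exact term; no analyzer is applied to `wanted`."""
    return wanted in terms


tag = "Web Design"
print(term_filter(index_terms(tag, analyzed=False), "Web Design"))
print(term_filter(index_terms(tag, analyzed=True), "Web Design"))
```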

  • MAPPING

    Schema definition: lets you specify exactly how a given field should behave:

    define its type (string/integer/double/boolean/date)
    decide whether the field should be analyzed or not
    choose which analyzer to use (at index time / at search time)

  • MAPPING

    # before changing the mapping, the index must be closed or deleted
    DELETE /nervosa

    PUT /nervosa
    {
      "mappings": {
        "podujatia": {
          "properties": {
            "title": { "type": "string" },
            "datum": { "type": "date" },
            "text":  { "type": "string", "analyzer": "english" },
            "tags":  { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }

  • AND WHAT ABOUT SLOVAK?

    Elasticsearch has no Slovak language analyzer.

    Let's build one ourselves.

  • SCENARIO

    search should work with and without diacritics: asciifolding
    convert words to lowercase: lowercase
    drop conjunctions/prepositions: stopwords filter
    also find words that are not in their base form (conjugation / declension): stemmer / lemmatizer
    also find synonyms: synonym filter
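The first three requirements can be sketched in plain Python to see why they make a query such as zdruzenie match Združenie. This is a simplified stand-in, not the Elasticsearch filters themselves: ascii_fold only removes combining marks after Unicode decomposition, the stopword set is a hypothetical three-word sample, and lemmatization and synonyms are left out.

```python
import unicodedata

# Hypothetical sample; a real Slovak stopword list is much longer.
STOPWORDS = {"a", "aj", "že"}


def ascii_fold(token):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Lowercase, drop stopwords, then fold diacritics (folding comes last)."""
    tokens = [token.lower() for token in text.split()]
    tokens = [token for token in tokens if token not in STOPWORDS]
    return [ascii_fold(token) for token in tokens]


print(analyze("Združenie a WEBY"))
print(analyze("zdruzenie") == analyze("Združenie"))
```

Stopwords are removed before folding so that a list written with diacritics (že) still matches.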

  • LEMMATIZER

    Use a hunspell dictionary (free):
    http://www.zdrojak.cz/clanky/elasticsearch-vyhledavame-cesky/
    https://github.com/essential-data/elasticsearch-sk

    LemmaGen (more accurate, but for non-commercial projects only):
    https://github.com/vhyza/elasticsearch-analysis-lemmagen

    bin/plugin --url http://bit.ly/analysis-lemmagen --install elasticsearch-analysis-lemmagen

  • ANALYZER DEFINITION

    PUT /nervosa
    {
      "settings": {
        "analysis": {
          "filter": {
            "lemmagen_filter_sk": { "type": "lemmagen", "lexicon": "sk" },
            "synonym_filter": { "type": "synonym", "synonyms_path": "synonyms/sk_SK.txt", "ignore_case": true },
            "stopwords_SK": { "type": "stop", "stopwords_path": "stop-words/stop-words-slovak.txt", "ignore_case": true }
          },
          "analyzer": {
            "slovencina_synonym": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [ "stopwords_SK", "lemmagen_filter_sk", "lowercase", "stopwords_SK", "synonym_filter", "asciifolding" ]
            },
            "slovencina": {
              "type": "custom",
              "tokenizer": "standard",