Search x AI for Making Content and Documents Easier to Find - Cognitive Search -

AI01

https://www.facebook.com/dahatake/

https://twitter.com/dahatake/

https://github.com/dahatake/

https://daiyuhatakeyama.wordpress.com/

https://www.slideshare.net/dahatake/

#decode18 #AI01

Today's content

• What you will learn
• Why?
• What?
• How? (development)
• 200
• Azure Cognitive Services
• Azure Search

Rethinking "search"

Are you thinking about the person who is searching?

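User: Want to study it? Want to adopt it? Having trouble? Offline? Online? On a PC? In the cloud?

Data:
• Machine learning: regression, SVM, Decision Tree
• Deep Learning: CNN, RNN
• Reinforcement learning: Q Learning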


Relevancy: linking the User (input values) with the Data (information)

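Natural language processing (NLP) is hard…

Sample text: "Deep Learning advanced dramatically in 2012 thanks to researchers at Google. Companies such as Microsoft and Facebook then entered the field in earnest. Since 2015, Microsoft has developed models on par with humans in several areas, such as photo identification."

companies: Google, Microsoft, Facebook

record-id   companies
1           ["v02", "v01", "v05"]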

What Deep Learning has brought

Digitization: Speech, Language, Vision

The road to automatic metadata extraction has opened!

Text (OCR): "(1) Validate enrichment pipeline"

Tags: "throwing", "ball", "girl", "grass", "basketball"

Caption: "A girl throwing a ball"

Entities
• Persons: "Anita Christiansen", "Conrad Nuber"
• Locations: "Bothell", "Woodinville"
• Organization: "Litware Insurance Corp."

John F. Kennedy (JFK)
November 22, 1963

JFK Files: Cognitive Search architecture

• Web App (azsearch.js)
• Blob Storage
• Azure Function
• Cognitive Skill Set
  • Skills: Computer Vision, OCR + Handwriting, Entity Linking, CIA Cryptonyms
  • Skill: Topics
• Azure Search
• Cosmos DB
• Azure Machine Learning

At its core, this is information retrieval (Search).

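What is Cognitive Search?

• PaaS: minimal administration
• Full-text search engine
• Facets
• Suggestions / autocomplete
• Geospatial information
• Multilingual (including Japanese)
• Scoring customization
• Synonyms
• Search logs / click logs
• etc…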


Why Search?

Can't we just do everything with AI?

Indexer

Support for a wide variety of file formats

Azure Search built-in modules

Natural language processing is built in!

Analyzer
• <lang>.microsoft (50 languages)
• <lang>.lucene (35 languages)
• keyword
• pattern
• simple
• standard
• standardasciifolding.lucene
• stop
• whitespace

CharFilter
• html_strip
• mapping
• pattern_replace

Tokenizer
• classic
• edgeNGram
• keyword_v2
• letter
• lowercase
• microsoft_language_tokenizer (43 languages)
• microsoft_language_stemming_tokenizer (＊)
• nGram
• path_hierarchy_v2
• pattern
• standard_v2
• uax_url_email
• whitespace

TokenFilter
arabic_normalization, apostrophe, asciifolding, cjk_bigram, cjk_width, classic, common_grams, dictionary_decompounder, edgeNGram_v2, elision, keep, keyword_marker, keyword_repeat, kstem, length, limit, lowercase, nGram_v2, pattern_capture, pattern_replace, phonetic, porter_stem, reverse, shingle, snowball, stemmer (＊), stemmer_override, stopwords (＊), synonym, trim, truncate, unique, uppercase, word_delimiter

(＊) Supports multiple languages, but not Japanese.

Language Analyzer support status as of May 2017

https://docs.microsoft.com/en-us/rest/api/searchservice/custom-analyzers-in-azure-search#property-reference
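The modules listed above can be combined into a custom analyzer inside an index definition. The following is a minimal sketch that only builds the JSON fragment; the analyzer name `my_html_analyzer` and the particular combination of modules are assumptions chosen for illustration, and the linked property reference remains the authority on which combinations are valid.

```python
import json


def build_custom_analyzer(name, tokenizer, token_filters, char_filters=()):
    """Return one entry for the "analyzers" array of an index definition."""
    return {
        "name": name,
        "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
        "charFilters": list(char_filters),   # applied before tokenization
        "tokenizer": tokenizer,              # exactly one tokenizer
        "tokenFilters": list(token_filters), # applied in order after tokenization
    }


analyzer = build_custom_analyzer(
    name="my_html_analyzer",  # hypothetical name
    tokenizer="standard_v2",
    token_filters=["lowercase", "cjk_width", "cjk_bigram"],
    char_filters=["html_strip"],
)
print(json.dumps(analyzer, indent=2))
```

A field would then refer to this analyzer by name through its `analyzer` property.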

Facet navigation

facetable: true

{ "name": "color", "type": "Edm.String", "searchable": false, "filterable": true, "sortable": true, "facetable": true },
{ "name": "size", "type": "Edm.Int32", "searchable": false, "filterable": true, "sortable": true, "facetable": true },
{ "name": "price", "type": "Edm.Int32", "searchable": false, "filterable": true, "sortable": true, "facetable": true },

/indexes/myindex/docs?search=…&facet=…&facet=…&facet=…

"@search.facets": {
  "color": [
    { "count": 4, "value": "Red" },
    { "count": 3, "value": "Black" },
    { "count": 3, "value": "Yellow" }
  ],
  "size": [
    { "count": 2, "value": 62 },
    { "count": 2, "value": 60 },
    ..
  ],
},

Metadata | representing structured data
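As a rough illustration of the facet request and response above, the sketch below builds a query URL with repeated `facet` parameters and turns the `@search.facets` section into a value-to-count mapping. The service name `myservice`, the search text, and the sample response are assumptions; no request is actually sent.

```python
from urllib.parse import urlencode


def build_facet_query(service, index, search_text, facets, api_version):
    """Build a search URL that asks for one facet per field in `facets`."""
    params = [("search", search_text)]
    params += [("facet", field) for field in facets]
    params.append(("api-version", api_version))
    return (f"https://{service}.search.windows.net"
            f"/indexes/{index}/docs?{urlencode(params)}")


def facet_counts(response, field):
    """Map each facet value of `field` to its document count."""
    buckets = response.get("@search.facets", {}).get(field, [])
    return {bucket["value"]: bucket["count"] for bucket in buckets}


url = build_facet_query("myservice", "myindex", "shirt",
                        ["color", "size", "price"], "2017-11-11-Preview")

# Sample response shaped like the slide's example (values are illustrative).
sample_response = {
    "@search.facets": {
        "color": [
            {"count": 4, "value": "Red"},
            {"count": 3, "value": "Black"},
            {"count": 3, "value": "Yellow"},
        ],
        "size": [{"count": 2, "value": 62}, {"count": 2, "value": 60}],
    }
}
print(url)
print(facet_counts(sample_response, "color"))
```

The counts can be rendered directly as the clickable refinements of a facet navigation pane.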

Step 1. Your first Cognitive Search

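Search processing flow

Ingest (index generation)
Documents → Content Extraction → Document terms → Analyzer → Analyzed terms → IndexWriter → Index
• Content Extraction: extracts text from files and file metadata
• Analyzer: performs text analysis, expanding, converting, and removing tokens
• Index: inverted index

Retrieve (search processing)
Query text → QueryParser (Simple / Lucene) → Query tree → Query terms → Analyzer → Analyzed terms → Search Engine → Index
• QueryParser: parses the query text and converts it into the internal query format
• Search Engine: matches tokens against the index based on the query, then ranks the results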
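The ingest and retrieve paths above can be sketched with a toy inverted index. This is a deliberately simplified illustration, not how Azure Search is implemented: the analyzer only lowercases, strips punctuation, and drops a tiny assumed stopword list, and ranking is raw term frequency.

```python
from collections import defaultdict

STOPWORDS = {"a", "the", "of"}  # assumed, tiny stopword list


def analyze(text):
    """Analyzer: lowercase, strip punctuation, remove stopwords."""
    terms = []
    for raw in text.lower().split():
        term = raw.strip(".,!?\"'()")
        if term and term not in STOPWORDS:
            terms.append(term)
    return terms


def build_index(documents):
    """IndexWriter: build term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query_text):
    """Search engine: look up analyzed query terms, rank by summed frequency."""
    scores = defaultdict(int)
    for term in analyze(query_text):
        for doc_id, frequency in index.get(term, {}).items():
            scores[doc_id] += frequency
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {
    "1": "A girl throwing a ball",
    "2": "The ball is on the grass. A ball!",
    "3": "Litware Insurance Corp.",
}
index = build_index(docs)
print(search(index, "Ball"))
print(search(index, "Litware"))
```

Note that the same `analyze` function runs on both documents and queries, which is why the query "Ball" matches the indexed term "ball".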

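Cognitive Search processing flow

Ingest (index generation)
Documents → Content Extraction → "ENRICH" (skills, annotation) → Document terms → Analyzer → Analyzed terms → IndexWriter → Index
• Content Extraction: extracts text from files and file metadata
• "ENRICH": skills annotate the extracted content
• Analyzer: performs text analysis, expanding, converting, and removing tokens
• Index: inverted index

Retrieve (search processing)
Query text → QueryParser (Simple / Lucene) → Query tree → Query terms → Analyzer → Analyzed terms → Search Engine → Index
• QueryParser: parses the query text and converts it into the internal query format
• Search Engine: matches tokens against the index based on the query, then ranks the results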


Explore

• Azure Storage
• Content Extraction
• Cognitive Skills: OCR + Handwriting, Computer Vision, Entities
• Azure Functions: Cryptonyms, Redactions
• Azure ML
• Search Index (Azure Search)


Spec

• Fully managed (PaaS)
• Indexer add-in
  • Pull only
• Pre-built skills
  • Azure Cognitive Services + α
• Region
  • South Central US or West Europe
• API version
  • api-version=2017-11-11-Preview
• Extensibility
  • Can call any REST API
• No additional cost at present!

Step 2. Developing the Indexer pipeline

Simple API and format

https://<account name>.search.windows.net

Operation               Path                               Method
Add/update index        /indexes/<indexname>               PUT
List indexes            /indexes                           GET
Get index statistics    /indexes/<indexname>/stats         GET
Delete index            /indexes/<indexname>               DELETE
Add/delete documents    /indexes/<indexname>/docs/index    POST
Search                  /indexes/<indexname>/docs          GET
Add/update skillset     /skillsets/<skillsetname>          PUT

{
  "@odata.context": "https://dahatake.search.windows.net/indexes('messages')/$metadata#Collection(Microsoft.Azure.Search.V2015_02_28_Preview.IndexResult)",
  "value": [
    { "errorMessage": null, "key": "1", "status": true, "statusCode": 201 },
    { "errorMessage": null, "key": "2", "status": true, "statusCode": 201 },
    { "errorMessage": null, "key": "3", "status": true, "statusCode": 201 }
  ]
}
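The response above is what the document endpoint (`/indexes/<indexname>/docs/index`) returns for a batch. A minimal sketch of both sides follows: building the batch payload with an `@search.action` per document, and collecting the keys that failed from a response. The field names `id` and `text` and the failing record are assumptions made up for the example.

```python
import json


def build_upload_payload(documents, action="upload"):
    """Wrap documents in the batch format expected by /docs/index."""
    return {"value": [{"@search.action": action, **doc} for doc in documents]}


def failed_keys(index_response):
    """Return the keys of documents whose per-item status is false."""
    return [item["key"] for item in index_response["value"] if not item["status"]]


payload = build_upload_payload([
    {"id": "1", "text": "hello"},   # hypothetical fields
    {"id": "2", "text": "world"},
])

# Response shaped like the slide's example, plus one invented failure.
sample_response = {
    "value": [
        {"errorMessage": None, "key": "1", "status": True, "statusCode": 201},
        {"errorMessage": "invalid document", "key": "2", "status": False, "statusCode": 400},
    ]
}
print(json.dumps(payload))
print(failed_keys(sample_response))
```

Checking the per-item `status` matters because a batch can partially succeed.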

Object structure

Document → Skillset → outputFieldMappings → Index

Indexer definition

Document → Index

{
  "name": "01-hellodoc",
  "dataSourceName": "01-hellodoc",
  "targetIndexName": "01-hellodoc",
  "skillsetName": "01-hellodoc-skillset",
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": { "name": "base64Encode" }
    }
  ],
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/organizations",
      "targetFieldName": "organizations"
    },
    …
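The definition above follows a regular pattern: one `fieldMappings` entry for the storage path, and one `outputFieldMappings` entry per enriched field. A small sketch of generating it is shown below; the helper name `build_indexer` and the extra field names `persons` and `locations` are assumptions, not part of the original sample.

```python
import json


def build_indexer(name, data_source, index, skillset, enriched_fields):
    """Build an indexer definition that maps enriched fields into the index."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "fieldMappings": [
            {
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": "metadata_storage_path",
                "mappingFunction": {"name": "base64Encode"},
            }
        ],
        # Each skill output under /document/... is copied to a same-named index field.
        "outputFieldMappings": [
            {"sourceFieldName": f"/document/{field}", "targetFieldName": field}
            for field in enriched_fields
        ],
    }


indexer = build_indexer(
    "01-hellodoc", "01-hellodoc", "01-hellodoc", "01-hellodoc-skillset",
    ["organizations", "persons", "locations"],  # last two are assumed
)
print(json.dumps(indexer, indent=2))
```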

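Procedure

Document → Index

0. [Optional] Prepare the data (Blob, etc.)
1. Create the Azure Search service
2. Create the data source
3. Create the skillset
4. Create the index
5. Create the indexer
   1. It references the data source, skillset, and index
   2. Set the run schedule
      1. If no schedule is specified, the indexer runs when it is created
6. Check behavior with the indexer status
7. Check the stored results with search

https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-concept-intro#where-do-i-start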
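Steps 2 to 7 of the procedure each correspond to one REST call. The sketch below only lists the calls in order and sends nothing; the service name `myservice` and the convention of naming the skillset `<name>-skillset` are assumptions, and each PUT would also need a JSON body and an `api-key` header.

```python
API_VERSION = "2017-11-11-Preview"  # preview version quoted on the Spec slide


def plan_requests(service, name):
    """Return (method, url) pairs for steps 2-7 of the procedure, in order."""
    base = f"https://{service}.search.windows.net"
    steps = [
        ("PUT", f"/datasources/{name}", ""),           # 2. data source
        ("PUT", f"/skillsets/{name}-skillset", ""),    # 3. skillset
        ("PUT", f"/indexes/{name}", ""),               # 4. index
        ("PUT", f"/indexers/{name}", ""),              # 5. indexer
        ("GET", f"/indexers/{name}/status", ""),       # 6. indexer status
        ("GET", f"/indexes/{name}/docs", "&search=*"), # 7. search the results
    ]
    return [
        (method, f"{base}{path}?api-version={API_VERSION}{extra}")
        for method, path, extra in steps
    ]


requests_plan = plan_requests("myservice", "01-hellodoc")
for method, url in requests_plan:
    print(method, url)
```

The order matters because the indexer refers to the other three objects by name, so they have to exist first.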

skillset: defining the pipeline processing

Built-in Skills

• Key Phrase Extraction
• Sentiment Analysis
• Organization Entity Extraction
• Location Entity Extraction
• Persons Entity Extraction
• Language Detection
• Face Detection
• Tag Extraction
• Celebrity Recognition
• Landmark Detection
• Handwriting Recognition (Preview)
• Printed Text Recognition

https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-predefined-skills
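A skillset is a JSON document listing skills with their inputs and outputs. The sketch below builds one with a single entity extraction skill. The `@odata.type` value is the named entity recognition skill as it appeared in the preview documentation, to the best of my recollection; treat it, the description text, and the helper name as assumptions and confirm against the predefined skills reference linked above.

```python
import json


def build_entity_skillset(description, categories):
    """Build a skillset with one entity extraction skill over /document/content."""
    # The skill exposes one output per category, e.g. Organization -> organizations.
    outputs = [
        {"name": category.lower() + "s", "targetName": category.lower() + "s"}
        for category in categories
    ]
    return {
        "description": description,
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Text.NamedEntityRecognitionSkill",
                "categories": list(categories),
                "defaultLanguageCode": "en",
                "inputs": [{"name": "text", "source": "/document/content"}],
                "outputs": outputs,
            }
        ],
    }


skillset = build_entity_skillset(
    "Extract organizations, persons, and locations",  # hypothetical description
    ["Organization", "Person", "Location"],
)
print(json.dumps(skillset, indent=2))
```

The `targetName` values become nodes such as `/document/organizations`, which is what an indexer's `outputFieldMappings` then reads from.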

Development tips

https://docs.microsoft.com/en-us/azure/search/search-limits-quotas-capacity#indexer-limits

Development tips

Annotation path elements: content, normalized_images, *, [0], lastname

https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-concept-annotations-syntax
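Annotation paths such as `/document/organizations/*` address nodes in the enriched document tree, with `*` meaning "every element of this collection". The resolver below is a simplified sketch of that idea only, not the service's implementation; it handles plain names and `*`, and the sample tree (including the `people` node) is invented for illustration.

```python
def resolve(tree, path):
    """Return every value matched by an annotation-style path."""
    nodes = [tree]
    for part in (segment for segment in path.split("/") if segment):
        matched = []
        for node in nodes:
            if part == "*" and isinstance(node, list):
                matched.extend(node)          # fan out over the collection
            elif part != "*" and isinstance(node, dict) and part in node:
                matched.append(node[part])    # descend into a named child
        nodes = matched
    return nodes


# Hypothetical enriched document tree.
enriched = {
    "document": {
        "content": "hello",
        "normalized_images": [{"text": "page 1"}, {"text": "page 2"}],
        "people": [{"lastname": "Christiansen"}, {"lastname": "Nuber"}],
    }
}
print(resolve(enriched, "/document/content"))
print(resolve(enriched, "/document/normalized_images/*/text"))
print(resolve(enriched, "/document/people/*/lastname"))
```

This fan-out is why a skill whose `context` ends in `/*` runs once per element of the collection.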

Development tips

{
  "fields": [
    // other fields go here.
    {
      "name": "enriched",
      "type": "Edm.String",
      "searchable": false,
      "sortable": false,
      "filterable": false,
      "facetable": false
    }
  ]
}

https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-concept-troubleshooting

"enriched" field: a memory dump!

{
  "/document": {
    "/document/normalized_images/*/Categories": [
      [
        { "detail": null, "name": "abstract_", "score": 0.00390625 },
        { "detail": null, "name": "others_", "score": 0.00390625 }
      ]
    ],
    …

But Cognitive Services have limits on file size and character count!

Built-in Skills

Character count: Split

https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-skill-textsplit
https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-skill-textmerger
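To make the character limit concrete, here is a rough sketch of what splitting does: cut long text into chunks no longer than a maximum length, preferring to cut at a sentence end. This is an illustration of the idea only and does not reproduce the built-in Split skill's behavior or parameters; the function name and the sentence-end marks are assumptions.

```python
SENTENCE_ENDS = ("。", ".", "!", "?")  # assumed sentence-end marks


def split_text(text, max_length):
    """Split text into chunks of at most max_length characters."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if end < len(text):
            # Prefer the last sentence end inside the current window.
            cut = max(text.rfind(mark, start, end) for mark in SENTENCE_ENDS)
            if cut >= start:
                end = cut + 1
        chunks.append(text[start:end])
        start = end
    return chunks


text = "Deep Learning. Azure Search. Cognitive Search."
chunks = split_text(text, 20)
print(chunks)
print("".join(chunks) == text)  # merging the chunks restores the original text
```

Each chunk can then be sent to a size-limited service separately, and the per-chunk results merged afterwards.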

Built-in Skills

Images: resizing

https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-concept-image-scenarios

Step 3. Developing a Custom Skill

But many Cognitive Services do not support the Japanese market yet!

Custom Skills

• Azure Machine Learning
• 3rd party

Custom Skills

{
  "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
  "uri": "https://myskill.azurewebsites.net/api/OrgId",
  "httpHeaders": { "Api-Key": "mySecret" },
  "context": "/document/organizations/*",
  "inputs": [
    { "name": "organizationName", "source": "/document/organizations/*" }
  ],
  "outputs": [
    { "name": "organizationId", "targetName": "organizationId" }
  ]
},
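On the receiving end, the Web API behind such a skill gets a batch of records under `values` and must return one result per `recordId`. The handler below is a minimal sketch of that contract as a pure function, without any web framework; the lookup table `ORG_IDS` and its id values are invented for illustration.

```python
# Hypothetical lookup table standing in for a real organization directory.
ORG_IDS = {"Litware Insurance Corp.": "org-001"}


def handle(request_body):
    """Process a custom skill request and return the response body."""
    results = []
    for record in request_body.get("values", []):
        name = record.get("data", {}).get("organizationName")
        result = {
            "recordId": record["recordId"],  # must echo the incoming id
            "data": {},
            "errors": [],
            "warnings": [],
        }
        if name in ORG_IDS:
            result["data"]["organizationId"] = ORG_IDS[name]
        else:
            result["warnings"].append({"message": f"Unknown organization: {name}"})
        results.append(result)
    return {"values": results}


request_body = {
    "values": [
        {"recordId": "1", "data": {"organizationName": "Litware Insurance Corp."}},
        {"recordId": "2", "data": {"organizationName": "Contoso"}},
    ]
}
response_body = handle(request_body)
print(response_body)
```

The input name `organizationName` and output name `organizationId` match the `inputs` and `outputs` declared in the skill definition, which is how the pipeline wires the values through.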

Scenario 1: Text Classification

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi…"

• Class A
• Class B
• Class C

Scenario 2: Named Entity Extraction

"… laboris nisi…"

• Entity type A
• Entity type B

Step 4. Azure Machine Learning Text Analytics Toolkit

AML Text Analytics Toolkit (TATK)

• Public Preview
• Python packages for Azure Machine Learning

http://medicalentitydetector.azurewebsites.net/

Cognitive Search + Text Analytics Toolkit

• Labeled Data
• Named Entity Extraction
• Azure ML
• Annotated Documents
• Customer Data
• Search Index

Summary

• Metadata × Relevancy
• Japanese-language processing
• Evolving along with Cognitive Services
• AML Text Analytics Toolkit

Sample code

http://aka.ms/jfkfiles

https://github.com/Microsoft/AzureSearch_JFK_Files

https://github.com/dahatake/Azure-Search-Cognitive-Search

Resources

Try Cognitive Search
• Documentation: https://docs.microsoft.com/ja-jp/azure/search/cognitive-search-concept-intro

Azure Machine Learning Package for Text Analytics
• Documentation: https://docs.microsoft.com/en-us/python/api/overview/azure-machine-learning/textanalytics
• Create a Data Science Virtual Machine: https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/overview

Cognitive Services
• Documentation: https://docs.microsoft.com/ja-jp/azure/cognitive-services/

Cognitive Services – Text Analytics
• Demo Site: https://text-analytics-demo-dev.azurewebsites.net/
