Japanese CloudSearch Use-Cases and Tech Deep Dive

Japanese Language Analysis with Amazon CloudSearch and JP Startup CloudSearch Use-Cases. Amazon Data Services Japan, Eiji Shinohara, April 7, 2015

Upload: eiji-shinohara

Post on 29-Jul-2015

Category:

Technology


TRANSCRIPT

1. Japanese Language Analysis with Amazon CloudSearch and JP Startup CloudSearch Use-Cases. Amazon Data Services Japan, Eiji Shinohara, April 7, 2015
2. Who am I? Name: Eiji Shinohara / @shinodogg. Role: AWS Solutions Architect for Startups, Amazon CloudSearch Subject Matter Expert
3. Talking to startup CTOs/engineers on a daily basis
4. #CTONight series in Japan. AWS Startup CTO Night with Amazon CTO: we had Amazon CTO Werner Vogels. TechCrunch Tokyo CTO Night powered by AWS: startup pitch contest for JP CTO of the year. IVS CTO Night & Day powered by AWS: 3 days, over 100 CTOs gathering with Infinity Ventures Summit 2014
5. (1) CTOs pitch; (2) questions and comments from Werner; (3) tech discussion for a few minutes; (4) an award given by Werner
6. AWS Startup CTO Night with Amazon CTO: Vuzz CTO Kiyota-san won the prize
7. TechCrunch Tokyo CTO Night powered by AWS: contest for JP Startup CTO of the year!
8. TechCrunch Tokyo CTO Night powered by AWS. Pitch presenters (startup CTOs). Judges (popular company CTOs): GREE, Cookpad, BizReach, Hatena, CyberAgent, Amazon
9. TechCrunch Tokyo CTO Night powered by AWS: popular news-curation service
10. IVS CTO Night & Day powered by AWS: 3-day event. Over 100 startup CTOs gathering!
11. IVS CTO Night & Day powered by AWS
12. IVS CTO Night & Day powered by AWS: survey result. 100% of participating CTOs said... WANT TO JOIN THIS EVENT AGAIN!!
13. AWS is empowering startups! Let's meet up at CTO Night in Japan
14. AWS Pop-up Loft in San Francisco
15. AWS Pop-up Loft in San Francisco
16. I'd like to have an AWS Pop-up Loft in Tokyo, so I'm honored to be here. We never disclose AWS customers' info without permission. We got agreements for all use-cases in these slides.
17. Amazon CloudSearch
18. A9: when you search on Amazon.com you will see..
19. Amazon CloudSearch: the A9 team is developing Amazon CloudSearch
20. Amazon CloudSearch
CloudSearch in Amazon: AmazonSmile, support local charities / 10s of millions of products. https://smile.amazon.com/
21. Amazon CloudSearch. CloudSearch in Amazon: Goodreads, 30 million members / 900 million books / 34 million reviews. https://www.goodreads.com/
22. Amazon CloudSearch. CloudSearch in Japan: schoo WEB-campus, an online life-long learning platform. https://schoo.jp/
23. Amazon CloudSearch. CloudSearch in Japan: schoo WEB-campus, where you can search for AWS classes
24. Search Engine: find documents by keyword in a large amount of data. Scanning incrementally like grep? It would take too long. Need to build an index in advance (inverted index). TF-IDF scoring. Multiple query parser support
25. Need knowledge to manage a search service..
26. and.. Japanese. (c) Mitsuo Aida http://www.mitsuo.co.jp/museum/info/message.html "If we take from one another, there will never be enough. If we share with each other, there will be more than enough."
27. Japanese Text Processing: morphological analysis. Japanese is NOT separated by white-space, so it needs to be analyzed. To fine-tune, dictionary maintenance is needed: an important noun (e.g. Kyary Pamyu Pamyu) should be treated as 1 noun
28. Japanese Text Processing: stemming, like: if the word ends in 'ed', remove the 'ed'; if the word ends in 'ing', remove the 'ing'; if the word ends in 'ly', remove the 'ly'. Yes, Japanese stemming is complicated enough.. http://www.cjk.org/cjk/joa/joapaper.htm
29. Japanese Text Processing: synonym addition, e.g. Venice (all these 4 words mean Venice, http://ja.wikipedia.org/wiki/). Alias: a search with pupil => student is hit; a search with student => pupil is NOT hit. Group: 1st, first, one => you can search with all keywords in the group. Stop word removal: this, that, which, who, what, ...
30. Amazon CloudSearch: fully managed cloud-based search service. Pretty easy to introduce. 34 languages supported (including Japanese). Sophisticated functions: highlight, suggest (autocomplete), geo search
31. Amazon CloudSearch
Suggestions: /suggest?q=ir&suggester=title_sug "suggest": {"query": "iro", "found": 5, "suggestions": [{"suggestion": "Iron Man", "id": "tt0371746"}, {"suggestion": "Iron Man 2", "id": "tt1228705"}, ... Reading search: the Japanese language has Kanji/Hiragana/Katakana
32. e.g. Nanboku Line subway station search
33. Using Amazon CloudSearch: create a domain
34. Using Amazon CloudSearch: data (station name, line name). Station code, station name. A lot of stations in Tokyo are served by multiple lines
35. Using Amazon CloudSearch: schema design (field definition). Japanese tokenization
36. Using Amazon CloudSearch: search with JR (most popular circle line in Tokyo)
37. Using Amazon CloudSearch: search with or
38. Automatic Scaling: by document size and search requests. Auto partitioning, auto scaling
39. Support for a variety of field types. Field types: Double, Date, Signed Integer, Text, Literal
40. Ranking and Relevance: sorting by _score
41. Ranking and Relevance: A/B testing on the AWS Management Console
42. CloudSearch - Reference Architecture
43. CloudSearch - Reference Architecture: indexing (diagram: Processing Script, Queuing, Batching, Amazon EC2, Amazon EC2, Amazon CloudSearch, Amazon SQS, Source System, Search Data Format (SDF))
44. Amazon CloudSearch Meetup in Tokyo: A9, schoo, nanapi, ChatWork, Cookpad, ADSJ, A9
45. CloudSearch use-case: ChatWork. ChatWork: business communication tool. Over 40 thousand companies are using it. About 0.5 million users. Comment from ChatWork CTO Yamamoto-san: "To handle about 5 hundred million documents, we migrated to Amazon CloudSearch. Thanks to AWS and the A9 team, it took only a few months."
46. CloudSearch use-case: ChatWork. Tanaka-san's slides: https://speakerdeck.com/tanakayuki/kai-fa-zhe-karamitacloudsearch Almost maintenance free. Positive feedback from end-users about low latency
47. CloudSearch use-case: Engineer Cross2015 (29th Jan). ChatWork is making CloudSearch noise in Japan
48. CloudSearch use-case: nanapi
49. CloudSearch use-case: nanapi
nanapi is a Life Recipe portal. About 20 million users per month. Over 0.1 million recipes. Getting popular these days
50. CloudSearch use-case: nanapi. Kagaya-san's slides: https://speakerdeck.com/violetyk/cloudsearch-nanapi-use-case Default settings work well. Easy to add a Japanese search function. Being fully managed by AWS is a huge plus
51. CloudSearch use-case: schoo
52. CloudSearch use-case: schoo. Ito-san's slides: http://www.slideshare.net/hiromitsuito71/20141017-cloud-searchschoo It took only 1 WEEK to introduce. It's so easy and nice. Of course you need to escape XSS stuff
53. Japanese language is not so easy. Yahoo! Japan search engineer Osuka-san's slide: Hasegawa-san Tanigawa-san?? Need to analyze, and only the user can know the answer
54. Launched Features - Requests from Japan: customizing Japanese tokenization; indexing bigrams
55. Tokenization Dictionary
56. Customizing Japanese Tokenization: same idea as Kuromoji
57. Customizing Japanese Tokenization: add to the tokenization dictionary https://www.youtube.com/watch?v=NLy4cvRx7Vc
58. Indexing Bigrams: CJK - Chinese/Japanese/Korean. "Multiple Languages" in the analysis scheme. Eliminates missed hits, but you may need to take care of the noise, e.g. a search for the word Kyoto also hits Tokyo
59. Indexing Bigrams: define 2 fields, 1 for tokenization and 1 for bigrams, utilizing a source field
60. Indexing Bigrams: controlling the score with ^ (caret). Search with Kyoto: Kyoto scores higher than Tokyo
61. Amazon CloudSearch slides in Japanese at Tokyo Solr Meetup. Webinar w/ A9 team
62. Inside Amazon CloudSearch
63. Indexing: each partition can be an indexing node. Balancing with ELB. Multi-threaded data load (diagram: batch, ELB, EC2 P1, EC2 P2, EC2 P3)
64. Indexing: raw batch data goes to S3, metadata to DynamoDB. The Document Service returns 200 (OK) after putting the data (diagram: batch, EC2 P1, Document Service, S3, DynamoDB, 200)
65.
Indexing: indexed binary data goes to S3 and the metadata is updated. Each node (partition) gets the indexed binary data from S3 (diagram: EC2 P1, Document Service, S3, DynamoDB, Updater Process, Solr)
66. Amazon S3: data in a bucket is kept in over 3 different places (data centers) within a Region. Durability: 99.999999999%. Just put files (text, image, movie, ...); no need to care about infrastructure. Unlimited. Cheap and pay as you go: 1GB $0.03 per month
67. Query: goes through ELB, the same as indexing. Distributed search (diagram: query, ELB, EC2 P1, EC2 P2, EC2 P3)
68. Handling Massive Query: replication by Auto Scaling. Scale EC2 capacity automatically according to server load, e.g. CPU at 70% for 5 min, then add 2 more EC2 instances (diagram: ELB, Auto Scaling Group with EC2 instances, Auto Scaling, CloudWatch monitoring)
69. Handling Massive Query: replication by Auto Scaling. Scale EC2 capacity automatically according to server load, e.g. CPU at 70% for 5 min, then add 2 more EC2 instances. Create EC2 instances and add them to ELB (diagram: ELB, Auto Scaling Group with EC2 instances, Auto Scaling, CloudWatch monitoring)
70. Handling Massive Query: EC2 P1 Auto Scaling Group, EC2 P2 Auto Scaling Group, EC2 P3 Auto Scaling Group
71. Handling Massive Query: EC2 P1 Auto Scaling Group, EC2 P2 Auto Scaling Group, EC2 P3 Auto Scaling Group, each with additional EC2 P1, EC2 P2, EC2 P3 instances
72. Scaling: depends on the data size. No downtime but eventually consistent (diagram: Small, Medium, Large)
73. Scaling: after scaling up to the largest instance, scale out. EMR (AWS Hadoop service) jobs split the partition. No downtime but eventually consistent (diagram: Index, Amazon EMR, Index P1, Index P2)
74. Modify configuration: re-indexing with EMR. No downtime but eventually consistent (diagram: Index A, Amazon EMR, Index B)
75. Inside Amazon CloudSearch: utilizing a variety of AWS services. You don't need to think about the background details, but please understand that Amazon CloudSearch is based on an eventual consistency model. Metrics with CloudWatch (launched in Mar 2015): Successful Requests, Searchable Documents, IndexUtilization, Partitions
76. Amazon CloudSearch TIPS
Initial massive data loading: use a larger instance and enough partitions for multi-threading http://www.slideshare.net/AmazonWebServices/sdd411-amazon-cloudsearch-deep-dive-and-best-practices-aws-reinvent-2014
77. Amazon CloudSearch TIPS: prepare for bursty traffic by increasing replication counts
78. Amazon CloudSearch use-case in Japan. AWS Tokyo Region 4 Years Anniversary Cake
79. CloudSearch use-case: Lancers, crowd-sourcing service
80. CloudSearch use-case: Gochi-Kuru, bento-box delivery service
81. CloudSearch is expanding in Japan. Smart/InSight: enterprise search http://smartinsight.jp/en/ CloudDays Tokyo 2014
82. CloudSearch is expanding in Japan. Pasona Career: big Japanese recruiting company. Connected Career Search https://pasona-connected.jp/
83. Want to Dive Deeper? "CloudSearch, The Amazon Web Service on top of Solr" at Lucene/Solr Revolution 2014 https://www.youtube.com/watch?v=RI1x0d-yO8A
84. Want to Dive Deeper? "Amazon CloudSearch Deep Dive & Best Practices" at AWS re:Invent 2014. Pro tips and rules of thumb. Slides: http://goo.gl/pklAzW YouTube: http://youtu.be/OeHaj1a66I4
85. FYI
86. AWS Black Belt Tech Webinar: every Wed 6PM-7PM (JST). Online seminar in Japanese http://aws.typepad.com/sajp/ w/ Adobe Connect
87. AWS Black Belt Tech Webinar: per-product deep-dive seminars by Solutions Architects http://aws.typepad.com/sajp/ Example slide: Amazon Simple Queue Service (SQS), AWS Elastic Beanstalk Worker Tier (diagram: SQS, Auto Scaling, sqsd (daemon), Elastic Beanstalk application at http://localhost:80/xxx, EC2 instance, Auto Scaling group, CloudWatch)
88. AWS Black Belt Tech Webinar: check the #awsblackbelt hashtag on Twitter
89. Thank you!!
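
Appendix sketch for slides 58 to 60 (Indexing Bigrams). The following is a minimal, self-contained Python illustration of why a bigram field removes missed hits but adds noise, and how weighting a tokenized field above the bigram field pushes the exact match to the top. It is not CloudSearch code: the strings 東京都 (Tokyo) and 京都 (Kyoto), the toy dictionary standing in for a morphological tokenizer, and the function names and weights are all assumptions made for illustration. The weights only play the role of the caret (^) boost mentioned on slide 60.

```python
def bigrams(text):
    """Return the overlapping 2-character substrings of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical stand-in for a morphological tokenizer (assumption):
# a real analyzer would consult a dictionary, as on slides 55 to 57.
TOY_TOKENS = {
    "東京都": ["東京", "都"],  # "Tokyo" + the suffix for "metropolis"
    "京都": ["京都"],          # "Kyoto"
}


def tokenize(text):
    """Look up the toy segmentation; unknown text stays as one token."""
    return TOY_TOKENS.get(text, [text])


def score(query, doc_text, token_weight=2.0, bigram_weight=1.0):
    """Weighted sum of a token-field match and a bigram-field match."""
    query_bigrams = set(bigrams(query))
    token_hit = set(tokenize(query)) <= set(tokenize(doc_text))
    # Guard: a 1-character query has no bigrams and must not match everything.
    bigram_hit = bool(query_bigrams) and query_bigrams <= set(bigrams(doc_text))
    return token_weight * token_hit + bigram_weight * bigram_hit


def search(query, docs, **weights):
    """Return ids of matching docs, highest score first (ties by id)."""
    scored = [(score(query, text, **weights), doc_id)
              for doc_id, text in docs.items()]
    hits = [(s, doc_id) for s, doc_id in scored if s > 0]
    return [doc_id for s, doc_id in sorted(hits, key=lambda p: (-p[0], p[1]))]


DOCS = {"tokyo": "東京都", "kyoto": "京都"}

if __name__ == "__main__":
    # Bigram field only: both documents match with the same score (the noise).
    print(search("京都", DOCS, token_weight=0.0))
    # Token field only: the Tokyo document is not matched at all.
    print(search("京都", DOCS, bigram_weight=0.0))
    # Both fields, token field boosted: Kyoto ranks above Tokyo.
    print(search("京都", DOCS))
```

With both fields searched, the Tokyo document is still returned (no missed hits for substring-like queries), but only because of the lower-weighted bigram field, so the exact Kyoto match ranks first.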