The architecture of search engines in Booking.com
Kang-min Liu | 2017-03-09
Amsterdam
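About Booking.com
Booking.com B.V., part of the Priceline Group (NASDAQ: PCLN), owns and operates Booking.com™, the world leader in online accommodation reservations. On average, more than 1,200,000 room nights are booked through Booking.com every day. Visitors to the Booking.com website and apps come from all over the world and span both the leisure and business travel markets.
Founded in 1996, Booking.com B.V. has always offered the best prices for every type of accommodation, from small family-run B&Bs to business apartments and five-star luxury suites. With an international outlook, Booking.com is available in more than 40 languages and lists 1,160,281 properties across 226 countries and territories.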
https://www.booking.com/content/about.zh-tw.html
Problems (Tech.)
Data Volume
● Location
○ Cities + POIs: 3M
○ Hotels: 1.2M
● Reservation
○ 1.2M per day
● Hotel Reviews
○ 100M
● Availability
○ 52B
Location Search
● Input
○ Free text
● Result
○ Hotel ID
○ City ID
○ Lat/Lon
● Names are short
○ Stopword removal does not apply
● Multi-language
● High Ambiguity
● Multi-meaning Words
○ Park Hotel
○ Park City
○ City Hotel
● Local names
○ USJ = 環球影城 (Universal Studios)
Difficulties
● MySQL
○ SELECT id FROM City WHERE name LIKE '%London%'
● Pros
○ Easy to implement
● Cons
○ Sensitive to token order
○ No scoring
○ No partial matching
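The cons listed above can be reproduced with a small experiment. This is a minimal sketch, assuming an in-memory SQLite database as a stand-in for MySQL; the table rows and the `like_search` helper are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the rows are invented examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE City (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO City (id, name) VALUES (?, ?)",
    [(1, "London"), (2, "New London"), (3, "Park City"), (4, "City Park")],
)


def like_search(fragment):
    """Return ids whose name contains `fragment` as one contiguous substring."""
    rows = conn.execute(
        "SELECT id FROM City WHERE name LIKE ? ORDER BY id",
        (f"%{fragment}%",),
    )
    return [row[0] for row in rows]


# No scoring: both rows come back with nothing to rank one above the other.
london_ids = like_search("London")

# Sensitive to token order: "City Park" is not found by "Park City".
park_city_ids = like_search("Park City")

# No partial matching: one unmatched token discards the whole query.
partial_ids = like_search("London UK")
```

The only ordering here comes from `ORDER BY id`, which says nothing about relevance.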
Solution (pre-2011)
● Elasticsearch
○ English-biased tokenization rule
○ One Index for everything, for all purposes (term suggestion + search)
● Pros
○ Tokenization / Partial matching
○ Fast Scoring + TopK
● Cons
○ Scoring is optimized for long documents. Difficult to tweak.
○ Machine downtime management
Solution (2011-2013)
● Brick
○ In-house search engine. Simply a TCP server on top of Lucene.
○ One document per translation.
○ 8 shards / 5 replicas.
○ Term suggestion + auto-correction + classification
● Pros
○ Control over the scoring of each token
○ Control over system deployment
● Cons
○ Tightly tailored to our specific problem
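To make "one document per translation" and per-token scoring concrete, here is a toy sketch. It is not Brick's implementation: the documents, the integer weights in `TOKEN_WEIGHT`, and the function names are all assumptions chosen for illustration.

```python
import heapq
from collections import defaultdict

# Hypothetical corpus: one document per (location_id, language) translation.
DOCS = [
    (10, "en", "Park City"),
    (10, "zh", "帕克城"),
    (20, "en", "Park Hotel"),
    (30, "en", "City Hotel"),
    (40, "en", "London"),
]

# Assumed weights: ambiguous, generic tokens count less than distinctive ones.
TOKEN_WEIGHT = {"hotel": 1, "city": 2, "park": 3}
DEFAULT_WEIGHT = 10


def tokenize(text):
    return text.lower().split()


def build_index(docs):
    """Inverted index: token -> set of document positions in `docs`."""
    index = defaultdict(set)
    for doc_no, (_location_id, _lang, name) in enumerate(docs):
        for token in tokenize(name):
            index[token].add(doc_no)
    return index


def search(query, docs, index, k=3):
    """Score each document by the summed weights of matched query tokens,
    keep the best translation per location, and return the top-k location ids."""
    doc_scores = defaultdict(int)
    for token in set(tokenize(query)):
        weight = TOKEN_WEIGHT.get(token, DEFAULT_WEIGHT)
        for doc_no in index.get(token, ()):
            doc_scores[doc_no] += weight
    best = {}
    for doc_no, score in doc_scores.items():
        location_id = docs[doc_no][0]
        best[location_id] = max(best.get(location_id, 0), score)
    top = heapq.nlargest(k, best.items(), key=lambda item: (item[1], -item[0]))
    return [location_id for location_id, _score in top]


INDEX = build_index(DOCS)
park_city_results = search("park city", DOCS, INDEX)
```

Because the weights are plain data, changing how much "hotel" or "city" contributes is a one-line edit, which is the kind of control the first "Pros" bullet refers to.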
Solution (2013-now)
[Diagram: the Web tier queries a grid of search nodes organized as Replica 0, Replica 1, ... Replica M. The nodes are fed from a materialized "Location x Translation" dataset built from Location + Translation.]
Availability (AV) Search
● Input
○ Where: city, country, region
○ When: check-in date
○ How long: check-out date
○ What: search options (stars, price range, etc.)
● Result
○ Available hotels
Inverted index #pre-2011
● LAMP stack (P = Perl)
● Normalized, optimized dataset
● Search ~ MySQL filter + Perl sort
● Single search worker per query
● High time complexity
● Large cities are unsearchable
[Diagram: Search on top of Inventory]
Pre-computed AV #2011+
● Materialized dataset
● Read-optimized databases (AV)
○ Aim for constant-time fetch
● Single search worker
● Failed with inventory growth
● Failed on big searches
[Diagram: Inventory, Materialization, several AV databases, Search]
Volume of AV
“The brand’s global dominance cannot be overstated: It works with
approximately 800,000 partners, offering an average of 3 room
types, 2+ rates, 30 different length of stays across 365 arrival days,
which yields something north of 52 billion price points at any given
time.”
https://www.forbes.com/sites/jonathansalembaskin/2015/09/24/booking-com-channels-its-inner-geek-toward-engagement/
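The "north of 52 billion" figure follows from multiplying the numbers in the quote. A quick check, taking "2+ rates" as exactly 2, so the result is a lower bound:

```python
# All factors come from the Forbes quote; "2+ rates" is rounded down to 2.
partners = 800_000
room_types = 3
rates = 2
lengths_of_stay = 30
arrival_days = 365

price_points = partners * room_types * rates * lengths_of_stay * arrival_days
print(f"{price_points:,}")  # 52,560,000,000
```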
Map-Reduce #2014+
● Parallelized search
○ Multiple workers per query
● Multiple MR phases
● Search-as-a-service
○ Plus all the pros and cons of services
● World search: 20s
● Overheads: IPC, serialization
[Diagram: inv, Materialization, several AV databases; the web server calls several MR workers]
MR + LocalAV #2015+
● Data in RAM
○ Bring code to data
● Java
○ Reduce constant factor
■ Distance for 100K hotels
● Perl: 0.4s
● Java: 0.04s
○ Multi-thread
■ Smaller overhead than IPC
[Diagram: inv, Materialization; the web server (scatter-gather) calls SmartAV nodes, each holding MR and AV together]
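The "distance for 100K hotels" benchmark is a tight numeric loop over coordinates, which is why the constant factor of the language matters. The slides do not say which distance formula was used; the sketch below assumes the haversine great-circle distance, and the function names and sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to guard against rounding pushing `a` slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def distances_from(center, hotels):
    """Map hotel_id -> distance in km from `center`; `hotels` maps id -> (lat, lon)."""
    lat0, lon0 = center
    return {
        hotel_id: haversine_km(lat0, lon0, lat, lon)
        for hotel_id, (lat, lon) in hotels.items()
    }


# Hypothetical sample: two hotels measured from central Amsterdam.
sample = distances_from((52.37, 4.90), {1: (52.37, 4.90), 2: (52.36, 4.88)})
```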
[Diagram: the web service calls a coordinator, which fans out to a grid of AVsearch nodes laid out as shard0, shard1, ... shardN by Replica 0, Replica 1, ... Replica M. Materialization reads inv and feeds the queues for materializing availability.]
Diagram annotations:
● Coordinator
● Static sharding: hotel_id mod N; replicas are equivalent
● Scatter-gather: random choice of replica, retry if necessary, ping nodes
● Updates for the last few hours
● In-memory indices; AV persisted
● Statically sharded (hotel_id mod k)
● Hotel data
○ Updated hourly
○ Kept in RAM. Not persisted, but easy to fetch and rebuild from MySQL.
● Availability data
○ Persisted
○ Realtime updates
○ RocksDB
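Static sharding by hotel_id mod k, combined with scatter-gather over equivalent replicas, can be sketched as follows. This is an illustration under stated assumptions, not the production design: `Node`, `build_cluster`, and `scatter_gather` are hypothetical names, shards are visited one after another rather than in parallel, and node failure is simulated with `ConnectionError`.

```python
import random

NUM_SHARDS = 4    # "k" in hotel_id mod k; illustrative value
NUM_REPLICAS = 3  # illustrative value


def shard_of(hotel_id, num_shards=NUM_SHARDS):
    """Static sharding: a hotel always lives on shard hotel_id mod k."""
    return hotel_id % num_shards


class Node:
    """Hypothetical stand-in for one search process serving one shard."""

    def __init__(self, hotel_ids, healthy=True):
        self.hotel_ids = hotel_ids
        self.healthy = healthy

    def search(self, predicate):
        if not self.healthy:
            raise ConnectionError("node is down")
        return [h for h in self.hotel_ids if predicate(h)]


def build_cluster(hotel_ids, dead=frozenset()):
    """Map shard -> list of equivalent replicas; `dead` holds (shard, replica) pairs."""
    hotel_ids = list(hotel_ids)
    cluster = {}
    for shard in range(NUM_SHARDS):
        owned = [h for h in hotel_ids if shard_of(h) == shard]
        cluster[shard] = [
            Node(owned, healthy=(shard, replica) not in dead)
            for replica in range(NUM_REPLICAS)
        ]
    return cluster


def scatter_gather(cluster, predicate):
    """Ask one randomly chosen replica per shard, retrying on another if it fails."""
    results = []
    for shard, replicas in cluster.items():
        candidates = list(replicas)
        random.shuffle(candidates)  # replicas are equivalent, so any order works
        for node in candidates:
            try:
                results.extend(node.search(predicate))
                break
            except ConnectionError:
                continue  # retry on the next replica of the same shard
        else:
            raise RuntimeError(f"no healthy replica for shard {shard}")
    return sorted(results)


cluster = build_cluster(range(1, 21), dead={(0, 0), (1, 2)})
even_hotels = scatter_gather(cluster, lambda hotel_id: hotel_id % 2 == 0)
```

Because every shard has at least one healthy replica in this example, the result is complete regardless of which replicas the shuffle picks first.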
Local AV
● Filter
○ Search criteria: Stars / WiFi / parking etc.
○ Group matching: Rooms wanted, persons per room
○ Availability: check-in and check-out dates
● Sort
○ By price, distance, review score
● Top-K
● Merge
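The four steps above can be sketched as a per-shard filter, sort, and top-K, followed by a merge of the already sorted partial results. This is a simplified sketch: the `Hotel` fields, the price-only ordering, and the sample data are assumptions, and availability is reduced to a precomputed flag instead of a date-range lookup.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Hotel:
    hotel_id: int
    stars: int
    wifi: bool
    price: int       # price for the requested stay, in a hypothetical unit
    available: bool  # availability for the requested dates, precomputed here


def sort_key(hotel):
    # Sort by price; hotel_id breaks ties so the order is deterministic.
    return (hotel.price, hotel.hotel_id)


def local_top_k(hotels, k, min_stars=0, need_wifi=False):
    """Filter, sort, and top-K on one shard; returns at most k hotels, cheapest first."""
    matches = (
        h for h in hotels
        if h.available and h.stars >= min_stars and (h.wifi or not need_wifi)
    )
    return heapq.nsmallest(k, matches, key=sort_key)


def merge_top_k(per_shard_results, k):
    """Merge sorted per-shard lists and keep the global top k."""
    merged = heapq.merge(*per_shard_results, key=sort_key)
    return list(itertools.islice(merged, k))


shard0 = [Hotel(4, 3, True, 120, True), Hotel(8, 4, True, 90, True),
          Hotel(12, 5, False, 300, True)]
shard1 = [Hotel(1, 4, True, 80, False), Hotel(5, 4, True, 100, True),
          Hotel(9, 2, True, 50, True)]

partials = [local_top_k(s, k=2, min_stars=3, need_wifi=True) for s in (shard0, shard1)]
top_ids = [h.hotel_id for h in merge_top_k(partials, k=2)]
```

Each shard returns at most k hotels, so the merge step handles at most k times the number of shards, however many hotels matched.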
Application
● MR search vs. MR search + local AV + new tech stack
● Adriatic coast (~30K hotels)
○ before: 13s, after: 30ms
● Rome (~6K hotels)
○ before: 5s, after: 20ms
● Sofia (~0.3K hotels)
○ before: 200ms, after: 10ms
Result
Conclusion
One more thing... We are hiring
Thank you :)