Spotify Teknikdagarna
How to make sense of 150 TB of data, every day
Sebastian [email protected]
Spotify
- 75 million users
- 20 million paying users
- Founded in 2008
- 3+ billion dollars paid to rights holders
- 30+ million tracks
- 1.5+ billion playlists
- 1500+ employees
It is hard to define exactly how many tracks there are. Deduplication is involved, and we have a dedicated team working on it. Once duplicates are counted out, there are about 30 million actual tracks.
What is big data?
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.
High Volume, High Velocity, High Variety
The catalog is 30 million rows, which is not that much in itself, but deduplicating it means pairwise comparison: n² comparisons over those 30 million rows.
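To see why even a "small" 30-million-row catalog becomes a big-data problem, a back-of-the-envelope sketch (the comparison rate is an illustrative assumption, not a slide figure):

```python
# Naive deduplication compares every track with every other: ~n^2 pairs.
n = 30_000_000  # ~30 million catalog rows

comparisons = n * n
print(f"{comparisons:.1e} pairwise comparisons")  # 9.0e+14

# Even at an assumed 10 million comparisons/second on one machine:
rate = 10_000_000
years = comparisons / rate / (60 * 60 * 24 * 365)
print(f"~{years:.1f} years of compute")  # ~2.9 years
```

This is why deduplication in practice needs blocking/clustering heuristics rather than brute-force comparison.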
Combining data sources
- 14 TB of user/service-related log data per day
- Streams/clicks/interactions are being logged
- Expands to 150 TB every day
So how do we do it?
Billions of lines of data every day. We anonymize the data and make sure that all of it is handled according to privacy requirements. A single machine reading at 160 MB/s would need roughly 10 days just to read 150 TB of data.
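The 10-day figure follows directly from the numbers on the slide; as a sketch:

```python
# One day's data vs. one machine's sequential read speed (slide figures).
data_tb = 150      # TB produced per day
read_mb_s = 160    # MB/s read throughput of a single machine

seconds = data_tb * 1_000_000 / read_mb_s  # 1 TB = 1,000,000 MB
days = seconds / (60 * 60 * 24)
print(f"~{days:.1f} days just to read one day's data")  # ~10.9 days
```

Reading alone takes longer than a week, before any processing happens, which is the motivation for distributing the work across a cluster.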
We utilise a cluster of 1600 nodes:
- 60 PB of disk space
- 68 TB of RAM (42 GB per server)
- 30k CPU cores
[Diagram: clients send logs over the Internet to the European and American data centres (7 TB/day each), which feed the Hadoop data centre.]
Logs sent from clients
- Sent to EU/US data centre
Spotify data architecture
Very complex
- Lots of different services
- The app is a small part of everything
- The app does not "just work" on its own
Example of Discovery: logs come from the client, pass through Hadoop into a service that recommends music, and the recommendations surface back to the user.
Approximately 60M users × 4M songs with 40 latent factors, using ALS (alternating least squares).
In short, minimise the cost function:
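The formula itself did not survive in the notes; given the implicit-feedback ALS setup described above, the standard objective (in the style of Hu, Koren & Volinsky) is likely the one intended, with user vectors $x_u$ and song vectors $y_i$ in $\mathbb{R}^{40}$:

```latex
\min_{x_*,\, y_*} \sum_{u,i} c_{ui}\left(p_{ui} - x_u^{\top} y_i\right)^2
  + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)
```

Here $p_{ui}$ indicates whether user $u$ played song $i$, $c_{ui}$ is a confidence weight derived from play counts, and $\lambda$ is a regularisation parameter; ALS alternates between solving for all $x_u$ with the $y_i$ fixed and vice versa.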
Saturday before lunch
Weekday evenings
We track usage, but the breakdown is what matters. That is the reason we save so much data.