
  • Highlights from Recommender Systems Conference

    Boston, MA, USA, 15th-19th September 2016

    Mindis Zickus, https://www.dunnhumby.com/


  • Topics

    1. Everything is a recommendation at Netflix, Quora, Amazon

    2. Adaptive and interactive recommendations

    3. Text modelling algorithms for recommendations

    4. Explore-exploit dilemma

    5. Models to generate features: Ranking content in the news feed at Facebook

    6. Deep learning is disrupting recommenders

    7. Models in production

    8. Pin recommendation at Pinterest

    9. Contextual Turn

    10. Interesting Papers, slides, algorithms

  • 1. Everything is a recommendation

    Netflix, Quora, Amazon

  • Netflix: Every piece of design and content is a recommendation selected by an algorithm

  • Amazon's personalized landing page

    Rows of different types of recommendations

  • Algorithmic recommendations support human designers at Stitch Fix

    Stitch Fix designs personalized clothing:

    1. FILL OUT YOUR STYLE PROFILE: Tell your personal stylist about your fit, size and style preferences.
    2. RECEIVE A FIX DELIVERY: Get 5 pieces of clothing delivered to your door.
    3. KEEP WHAT YOU WANT: Only pay for what you keep. Returns are easy and free.

    http://www.slideshare.net/KatherineLivins/recsys-2016-talk-feature-selection-for-human-recommenders-66187739


  • Story recommendation for journalists at Schibsted

    Algorithms recommend news stories to journalists

    Journalists can tune freshness

  • Personalized search at Google: anonymous search vs. signed-in search

  • 2. Adaptive and interactive recommendations

  • Netflix orders content rows on the front page according to the predicted user's mode of watching

    Rows of intent

    Continuation: Resume a recently-watched TV/Movie

    List: Play a title previously added to My List

    Rewatch: Rewatch a title enjoyed in the past

    Discovery: Discover a new title to watch


    Ordering of movies in rows
    Thematic coherence, relevancy
    Personalized personalization: levels of diversity
    Adaptive, intent-driven personalization
    Thumbnail image is personalised

    http://www.slideshare.net/intotheminds/balancing-discovery-and-continuation-in-recommendation-hossein-taghavi-netflix

  • Model reorders unseen rows based on previous clicks

    Graphical (Bayesian) model with Expectation Maximization inference

    Unseen rows are also reordered in real time based on real-time behaviour

  • Recommended items are adaptively personalized and diversified at Amazon Stream

    Method (sketched below): (1) a Bayesian regression model for scoring the relevance of items while leveraging uncertainty; (2) a submodular diversification framework that re-ranks the top-scoring items based on category; (3) personalized category preferences learned from the user's behavior.

    https://www.amazon.com/stream
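    The paper's exact formulation isn't reproduced here; the sketch below illustrates the general idea of submodular re-ranking: relevance plus a diminishing-returns bonus for categories that are not yet well covered (the scores, categories and the lam weight are illustrative assumptions).

```python
import math

def greedy_diversified_rerank(items, scores, categories, k, lam=0.3):
    """Greedy submodular re-ranking sketch: relevance plus a
    diminishing-returns bonus for under-represented categories."""
    selected, category_counts = [], {}
    candidates = set(range(len(items)))
    while candidates and len(selected) < k:
        def gain(i):
            c = categories[i]
            # sqrt coverage grows sub-linearly, so each extra item from an
            # already-covered category adds less marginal value (submodularity).
            before = math.sqrt(category_counts.get(c, 0))
            after = math.sqrt(category_counts.get(c, 0) + 1)
            return scores[i] + lam * (after - before)
        best = max(candidates, key=gain)
        selected.append(items[best])
        category_counts[categories[best]] = category_counts.get(categories[best], 0) + 1
        candidates.remove(best)
    return selected

# Toy usage with made-up relevance scores and categories.
print(greedy_diversified_rerank(["a", "b", "c", "d"], [0.9, 0.85, 0.8, 0.2],
                                ["shoes", "shoes", "bags", "bags"], k=3))
```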

  • Incremental Factorization Machines algorithm for adaptive recommendations

    http://recprofile.org/kitazawa.pdf
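    The paper's incremental FM details aren't reproduced here; below is a minimal sketch of the underlying idea only: a second-order factorization machine whose parameters are updated one observation at a time with SGD (class name, learning rate and regularization are illustrative assumptions).

```python
import numpy as np

class IncrementalFM:
    """Sketch of an incrementally updated factorization machine (squared loss)."""
    def __init__(self, n_features, n_factors=8, lr=0.01, reg=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        self.w0 = 0.0
        self.w = np.zeros(n_features)
        self.V = 0.01 * rng.standard_normal((n_features, n_factors))
        self.lr, self.reg = lr, reg

    def predict(self, idx):
        # idx: active indices of a binary sparse feature vector.
        v_sum = self.V[idx].sum(axis=0)
        v_sq = (self.V[idx] ** 2).sum(axis=0)
        pairwise = 0.5 * float((v_sum ** 2 - v_sq).sum())
        return self.w0 + self.w[idx].sum() + pairwise

    def partial_fit(self, idx, y):
        err = self.predict(idx) - y          # gradient factor of 0.5*(yhat - y)^2
        self.w0 -= self.lr * err
        self.w[idx] -= self.lr * (err + self.reg * self.w[idx])
        v_sum = self.V[idx].sum(axis=0)
        self.V[idx] -= self.lr * (err * (v_sum - self.V[idx]) + self.reg * self.V[idx])

# Hypothetical usage: features = [user_index, n_users + item_index]; fm.partial_fit(features, rating)
```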

  • 3. Text modelling algorithms for recommendations

  • How would you recommend a web page?

  • Content recommendation at RoverApp (ex. Flipora)

    1. Define topic hierarchy (3000 topics) e.g. Sports/Racing/Formula1

    2. Define entities within topics: Schumacher, Obama

    3. Crawl the web, get pages. Or use publishers' content.

    4. Assign each incoming document to topics and entities (sparse SVM)

    5. Define the user's interest profile as topics and entities consumed, with some decay (15,000-dimensional vector)

    6. Find the most similar docs for the user to recommend (a minimal sketch of steps 5-6 follows this list)

    7. Get CTR, update recommendations on CTR
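    A minimal sketch of steps 5-6 under assumed representations (documents and profiles as dense topic-weight vectors; the decay constant is illustrative):

```python
import numpy as np

def update_profile(profile, doc_topic_vec, decay=0.9):
    """Exponentially decay the old interest profile and add the topics/entities
    of the newly consumed document (both are topic-weight vectors)."""
    return decay * profile + doc_topic_vec

def recommend(profile, doc_matrix, top_n=10):
    """Rank candidate documents by cosine similarity to the user profile."""
    norms = np.linalg.norm(doc_matrix, axis=1) * (np.linalg.norm(profile) + 1e-12)
    scores = (doc_matrix @ profile) / (norms + 1e-12)
    return np.argsort(-scores)[:top_n]

# Toy usage with random topic vectors.
rng = np.random.default_rng(0)
docs = rng.random((1000, 50))
profile = update_profile(np.zeros(50), docs[3])
print(recommend(profile, docs, top_n=5))
```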

  • Recommendations at Google Now based on a user interest graph (the user's profile on the knowledge graph)

  • Many industry recommenders are based on or benefit from text information

    Many items have some text attributes or can be solely defined by text: tweets, search queries, SMS messages, conversations, product descriptors, web pages, stories, blogs, news, Q&A, reviews

    Methods: similarity (bag of words, TF-IDF), topic discovery with unsupervised learning (LDA), dynamics of topics, taxonomies or knowledge graphs of topics, entities (named entity recognition), sentiment, sequence (word2vec), embedding, user interest mapping

  • Original word2vec: captures words' sequential co-occurrence patterns to predict sequences of words

    Creates a neural embedding (latent factors) of a word by predicting the other words in its neighborhood in the document.

    The final objective is not the prediction itself but the word's vector of weights in the hidden matrix

  • Word2vec extensions for product recommendations

    Yahoo: Prod2vec: predict next product in purchase sequence

    https://arxiv.org/pdf/1606.07154.pdf

    Criteo: Meta-Prod2Vec: extends prod2vec by leveraging item metadata; can be used for cold-start problems

    https://arxiv.org/pdf/1607.07326v1.pdf

    Microsoft: Item2Vec: Predict other products in basket

    https://arxiv.org/ftp/arxiv/papers/1603/1603.04259.pdf

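    As an illustration of the prod2vec-style idea (not the papers' exact training setups), purchase sequences can be fed to an off-the-shelf word2vec implementation by treating product IDs as tokens; the product IDs below are made up.

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's ordered purchase sequence of product IDs.
purchase_sequences = [
    ["p42", "p7", "p13", "p7"],
    ["p7", "p99", "p42"],
    ["p13", "p42", "p7"],
]

# Skip-gram (sg=1) mirrors the word2vec setup: predict nearby products in the
# sequence; the learned vectors become the product embeddings.
model = Word2Vec(purchase_sequences, vector_size=32, window=3,
                 min_count=1, sg=1, epochs=50, seed=1)

print(model.wv.most_similar("p42", topn=3))  # products seen in similar contexts
```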

  • You have to do Embedding

    Every cool data scientist does Embedding these days

    Embedding means transporting/mapping the item or user to another n-dimensional space.

    Sparse to dense representation

    Reduces dimensionality

    Space can be clusters, latent factors, dimensions.

    Embedding methods can be clustering, PCA, LDA, matrix factorization, neural (e.g. word2vec), or deep learning

    Embedding can be hierarchical

    Distances between items in the new space give similarity.

    There might be many types of similarities (e.g. >20 at Facebook)

  • Non-materialized user representation via embedding

  • 4. Explore-exploit dilemma

  • Explore-exploit dilemma for music recommendations at Pandora

    If uncertainty/variance about an item's relevancy is high, the optimal strategy sometimes is to explore - show high-uncertainty but lower-relevancy items to users - to get more information about the item's true relevancy

    The challenge is how much to explore while avoiding "WTF" recommendations
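    Pandora's exact method isn't given in the talk; one standard way to trade relevancy against uncertainty is Thompson sampling over per-item click models, sketched here with illustrative item IDs and Beta posteriors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta posterior per item over its click probability: [clicks + 1, skips + 1].
posterior = {item: [1, 1] for item in ["song_a", "song_b", "song_c"]}

def choose_item():
    # Sample a plausible relevancy from each item's posterior and play the best
    # sample: uncertain items occasionally win, which is the "explore" part.
    samples = {i: rng.beta(a, b) for i, (a, b) in posterior.items()}
    return max(samples, key=samples.get)

def record_feedback(item, clicked):
    posterior[item][0 if clicked else 1] += 1

record_feedback(choose_item(), clicked=True)
```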

  • Ticketmaster case study: contextual bandit approach towards periodical personalized recommendations

    http://delivery.acm.org/10.1145/2960000/2959139/p23-qin.pdf?

    Background: Ticketmaster is interested in pushing periodical personalized recommendations to users, as is common for many e-commerce companies today. In many cases, users are not motivated to visit websites or launch apps to see online recommendations. Periodically pushing relevant products, such as weekly recommendation emails, SMS and notifications, reminds users of the products, prompting purchases and further exploration of online content.

    Challenge: How to refresh recommendations

    Contextual bandits:
    1. Show completely random recommendations during the first batch.
    2. Use the resulting feedback data from the first batch to initially train the models.
    3. Publish the models, and use them to serve recommendations for the second batch.
    4. Use the resulting feedback data from the second batch to update the models.
    5. Repeat (3) and (4) with subsequent batches.

    Improvement: use the hashing trick (a minimal sketch of the batch loop with hashed features follows below)

    http://engineering.richrelevance.com/personalization-contextual-bandits/

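    A minimal sketch of the batched loop above combined with the hashing trick, using scikit-learn's FeatureHasher and an online logistic model; the feature dictionaries, batch interface and exploration rate are illustrative assumptions, not Ticketmaster's implementation.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

# Hashing trick: map arbitrary (user, item) features into a fixed-width sparse
# vector so the model never needs a growing feature vocabulary.
hasher = FeatureHasher(n_features=2 ** 18, input_type="dict")
model = SGDClassifier(loss="log_loss")
rng = np.random.default_rng(0)

def serve_batch(users, items, explore_prob=0.1, fitted=False):
    """Pick one item per user: random for the first batch (or with small
    probability afterwards), otherwise the item with the highest predicted CTR."""
    chosen = []
    for user in users:
        if not fitted or rng.random() < explore_prob:
            chosen.append(rng.choice(items))
        else:
            X = hasher.transform([{"user": user, "item": it} for it in items])
            chosen.append(items[int(np.argmax(model.predict_proba(X)[:, 1]))])
    return chosen

def update_on_feedback(interactions):
    """interactions: list of ({'user': ..., 'item': ...}, clicked) pairs from one batch."""
    feats, labels = zip(*interactions)
    model.partial_fit(hasher.transform(feats), labels, classes=[0, 1])
```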

  • Filter bubble in modelling: users see and click what is recommended by models; subsequently, models learn from interactions with recommendations generated by previous models.

  • 5. Models to generate features: Ranking content in the news feed at Facebook

    http://conf.turi.com/lsrs16/wp-content/uploads/Komal_Kapoor_Ranking-and-Recommendation-for-Billions-of-Users.pptx


  • Feature Selection (BDTs)

    Prune to the most important features (~2K)

    Training time is proportional to number of examples * number of features

    Under-sample negative examples (impressions, no action) to help with # of examples

    Reduces noise and results in simpler trees

    Do this for each feed event type: train many forests

    Historical counts and propensity are some of the strongest features

  • Model Training (Logistic regression)

    We need to react quickly and incorporate new content - use a simple model

    Logistic regression is simple, fast and easy to distribute

    Treat the trees as feature transforms, each one turning the input features into a set of categorical features, one per tree.

    Use logistic regression for online learning to quickly re-learn leaf weights

    [Figure: two boosted trees splitting on features F1, F2, F3, with leaf weights]

    Throw out the boosted tree weights, use only the transforms. Input: (F1, F2, F3); output: (T1, T2) where T1 ∈ {leaves of tree 1}
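    A runnable sketch of the tree-as-feature-transform idea using scikit-learn on synthetic data (Facebook's production setup obviously differs): each sample is mapped to the leaf it lands in per tree, the leaf indices are one-hot encoded, and a logistic regression is trained on those sparse indicators.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# 1) Boosted trees learn the feature transforms.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbdt.fit(X, y)

# 2) Each sample becomes "which leaf did it land in, per tree" -- one categorical
#    feature per tree; the tree weights themselves are discarded.
leaves = gbdt.apply(X)[:, :, 0]
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves)

# 3) Logistic regression on the sparse leaf indicators; this is the part that
#    can be re-trained online as fresh feedback arrives.
lr = LogisticRegression(max_iter=1000)
lr.fit(leaf_features, y)

print(lr.predict_proba(encoder.transform(gbdt.apply(X[:5])[:, :, 0]))[:, 1])
```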

  • Stacking: Combined Tree + LR Model

    Tree application is computationally resource intensive and slow, so reuse the click tree to predict likes, comments, etc.

    Main advantage: only slightly more resource intensive than independent models; better prediction performance; transfers learnings

    [Diagram: ~thousands of raw features → thousands of tree transforms → sparse Boolean features + non-tree raw features → predicted events (click, like, comment, share, friend, outbound click, follow, hide)]

  • Other models + sparse features

    Train neural nets to predict events; discard the final layer and use the last layer's outputs as features

    Add sparse features such as text or content ID

    [Diagram: raw features feed a forest and a neural network; their outputs, plus sparse features (e.g. text or content ID), feed a logistic regression predicting click, like, comment, share, hide, outbound click, fan/follow, friend]

  • Facebook: Chain of probabilities to measure ultimate value

    [Funnel: recommendation impression → recommendation conversion → page post impression → page post engagement]

    P (engagement | impression) = P(conversion | impression) * P(post impression | conversion) * P(engagement | post impression)

  • Learnings

    Data freshness matters: simple models allow for online learning and fast reaction

    Feature generation is part of the modeling process

    Stacking supports plugging in new algorithms and features easily

    Works very well in practice

    Use skewed sampling to manage high data volumes

    Historical counters provide highly predictive features and are easy to update online

  • 6. Deep Learning is disrupting recommenders

  • Machine learning requires feature engineering that transforms the raw data (such as the pixel values of an image, or transactions) into a feature vector from which the machine learning subsystem can classify patterns in the input.

    Deep learning has multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned.

    http://www.slideshare.net/kerveros99/deep-learning-for-recommender-systems-budapest-recsys-meetup

    https://www.yammer.com/dunnhumby.com/#/uploaded_files/69393183?threadId=775785880


  • Many companies are trying to use DL in production. Last year there were 0 deep learning papers at RecSys; this year ~25% were DL applications

    DL pros: can deal with different types of input data (raw data, text, images, sequences); can handle cold start

    DL cons: black box; many parameters to tune (e.g. may need another modelling system for tuning)

    Instead of feature engineering, we now have architecture engineering

    DL papers at RecSys:

    Convolutional Matrix Factorization for Document Context-Aware Recommendation by Donghyun Kim, Chanyoung Park, Jinoh Oh, Sungyong Lee, Hwanjo Yu

    Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations by Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, Domonkos Tikk

    Materials of the DL workshop at RecSys: http://dlrs-workshop.org/dlrs-2016/program/

    http://smerity.com/articles/2016/architectures_are_the_new_feature_engineering.html

  • Google uses DL for YouTube recommendations

    DL still uses features defined by experts.

    Mentioned that Google expects to move all modelling to a common platform based on TensorFlow

    https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf


  • RNN

    The artificial neurons (for example, hidden units grouped under node s with values s_t at time t) get inputs from other neurons at previous time steps (represented in the original figure by a black square, a delay of one time step).

    In this way, a recurrent neural network can map an input sequence with elements x_t into an output sequence with elements o_t, with each o_t depending on all the previous x_t' (for t' ≤ t). The same parameters (matrices U, V, W) are used at each time step.

    Good article about DL and RNNs: https://www.yammer.com/dunnhumby.com/#/uploaded_files/69393183?threadId=775785880

    http://home.elka.pw.edu.pl/~btwardow/recsys2016_btwardow_ACCEPTED.pdf
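    Written out, this is the standard recurrent update (f and g denote the hidden and output nonlinearities; the exact activations are not specified in the slide, so treat this as the generic form):

```latex
s_t = f\left(U x_t + W s_{t-1}\right), \qquad o_t = g\left(V s_t\right)
```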

  • DL to combine usage and item text information in a single model

    https://arxiv.org/abs/1609.02116


  • 7. Models in production

  • Model accuracy vs.

    Speed and complexity of scoring

    Transparency

    Cost of training and deriving features

    Ability to explain recommendations to user

    Causal effects

    Predicting the right metrics

  • http://www.slideshare.net/xamat/recsys-2016-tutorial-lessons-learned-from-building-reallife-recommender-systems


  • Quora's production machine learning uses Luigi to run model training workflows

    Models are trained on a single machine

  • Feature generation framework at Netflix

    When experimenters design new feature encoders (functions that take raw data as input and compute features), they can immediately use them to compute new features for any time in the past, since the time machine can retrieve the appropriate snapshots and pass them to the feature encoders.

    http://techblog.netflix.com/2016/02/distributed-time-travel-for-feature.html


  • Everyone uses two-stage scoring!

    Stage 1: Candidate retrieval; aim for high recall, get thousands of item candidates

    Stage 2: Re-ranking based on more sophisticated models, real-time context, and user feedback

  • Two stages of item ranking at eBay

    1) Recall, which requires retrieving candidate items that might be similar to the given seed item,

    2) Ranking, which sorts the candidates according to their probability of being purchased.

    The input to the algorithm comes as an HTTP request to the merchandising backend (MBE) system with a given seed item. This initiates parallel calls to several services which return candidate recommendations that are similar in some way to the seed. The set of candidate recommendations are then ranked in real time. The output of the system is the top 5 ranked items, which are surfaced to the user.
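    A toy sketch of the two-stage pattern (not eBay's MBE system): cheap embedding similarity for recall, then a more expensive scorer for ranking; `rank_score` is a stand-in for the learned purchase-probability model.

```python
import numpy as np

def two_stage_recommend(seed_vec, item_vecs, rank_score, n_candidates=1000, top_k=5):
    """Stage 1 (recall): cheap similarity retrieval of a large candidate set.
    Stage 2 (ranking): a richer scorer re-ranks only those candidates in real time."""
    # Stage 1: rough "might be similar" retrieval by dot product with the seed item.
    recall_scores = item_vecs @ seed_vec
    candidates = np.argsort(-recall_scores)[:n_candidates]
    # Stage 2: sort candidates by the (more expensive) ranking model's score.
    ranked = sorted(candidates, key=rank_score, reverse=True)
    return ranked[:top_k]

# Toy usage with random embeddings and a dummy ranker standing in for the model.
rng = np.random.default_rng(0)
items = rng.standard_normal((10_000, 16))
print(two_stage_recommend(items[0], items, rank_score=lambda i: rng.random()))
```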

  • Netflix has shown that unless your dataset is huge, distributed model training is not faster than training with well-optimized code on a single machine

    http://www.slideshare.net/moustaki/some-pitfalls-of-distributed-learning


  • Argument for Scala to bridge data science and production engineers

    Some companies (Verizon, Asos, Credit Karma) are adopting Scala as a universal language for data analysis and production.

    Why Scala:

    Functional language, can write data transformation pipelines

    Can use Java libraries

    Spark is in Scala

    Similar to the continuous integration movement to integrate software development and operations.

  • Data Science ways of working at Quora

    Both ML engineers and data scientists are involved in machine learning.

    ML engineers build, implement, and maintain production machine learning systems.

    Data scientists conduct research to generate ideas about machine learning projects, and perform analysis to understand the metrics impact of machine learning systems.

    https://www.quora.com/What-is-the-difference-between-a-machine-learning-engineer-and-a-data-scientist-at-Quora

  • 8. Example: Pin recommendation at Pinterest

  • Related Pins System at Pinterest

    1. Candidate Generation: signals derived from curation, visual similarity, topic vectors, etc.; a rough estimate of what is related; generate N candidates (thousands)

    2. Ranking: machine-learned ranking model applied to the candidate set

    3. Serving: online real-time ranking and serving

    https://arxiv.org/pdf/1511.04003.pdf


  • To avoid the filter bubble, Pinterest serves a small group of users random Pins and uses that data to build models

  • Pinterest: real-time ranking is done with a random forest, using a parallelized, distributed C++ implementation of RF scoring

  • Models are built on 1% semi-random recommendations

  • 9. Contextual turn

  • Contextual recommendations

    Recommendations don't have to be personal

    The majority of recommenders used in industry are item-item (non-personalized)

    Increasing number of session-based recommenders

    When searching for a new item, what other users did in this situation matters more than what the user did previously himself

    https://home.deib.polimi.it/pagano/portfolio/papers/TheContextualTurn.pdf


  • Importance of Personalization

    The value of personalization depends on how broad your intent is. The broader the intent, the more opportunity for personalization.

    "Running shoes" can be personalized if we know gender. Personalization as re-ranking with the user as context.

  • Balancing popularity, localness and affinity at Google Now

  • Google Now: personalized search and recommendations

  • The Search-Recommendation-Notification Spectrum

  • At Quora, the value of showing a story to a user is approximated by a weighted sum of actions

  • Event     Probability   Value*

    Click     5.1%          1
    Like      2.9%          5
    Comment   0.55%         20
    Share     0.00005%      40
    Friend    0.00003%      50
    Hide      0.00002%      -100

    Total expected value: 0.306

    Multi-objective recommendations

    At Facebook different actions have different significance

    Given a potential story, how good is it? Express it as probabilities of click, like, comment, etc. Assign different weights to different events, according to significance.
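    As a quick check on the table above, the expected value is just the probability-weighted sum of the event values (percentages converted to probabilities):

```python
# Expected value of showing a story = sum over events of P(event) * weight(event).
events = {
    "click":   (0.051,      1),
    "like":    (0.029,      5),
    "comment": (0.0055,    20),
    "share":   (0.0000005, 40),
    "friend":  (0.0000003, 50),
    "hide":    (0.0000002, -100),
}

value = sum(p * w for p, w in events.values())
print(round(value, 3))  # 0.306, matching the slide's total
```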

  • 10. Interesting Papers, slides, algorithms

  • Best paper of recsys: Local Item-Item Models For Top-N Recommendation

    Original SLIM model: item-to-item similarity weights can be learned by regressing the purchase indicator of every item r_j (0/1) on the other items that users have purchased.

    Improved (local) SLIM model: by using different item-item models for different user subsets, we can capture differences in their preferences, which can lead to improved performance for top-N recommendations.

    http://dl.acm.org/citation.cfm?id=2959185&CFID=672508488&CFTOKEN=91227145
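    For reference, the original SLIM objective (Ning & Karypis) learns a sparse, non-negative item-item weight matrix W from the binary user-item matrix A; the best-paper extension fits separate local W matrices per user subset. A compact statement of the original objective:

```latex
\min_{W}\; \tfrac{1}{2}\lVert A - AW \rVert_F^2
          + \tfrac{\beta}{2}\lVert W \rVert_F^2
          + \lambda \lVert W \rVert_1
\quad \text{s.t. } W \ge 0,\; \operatorname{diag}(W) = 0
```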

  • Extracting Food Substitutes From Food Diary via Distributional Similarity

    Foods that are consumed in similar contexts are more likely to be similar dietarily.

    For example, a turkey sandwich can be considered a suitable substitute for a chicken sandwich if both tend to be consumed with french fries and salad.

  • List of algorithms used by presenters

    Logistic regression, Bayesian priors, caching, L1, L2, VW with FTRL

    GBDT, XGBOOST

    RankLib

    MF, LIBFM, field aware FM

    LDA (collapsed Gibbs sampling)

    Deep learning: RNN, CNN

    Word2vec: prod2vec, item2vec

    Graphical Bayesian models

  • Quora has open-sourced QMF, its matrix factorization library: https://github.com/quora/qmf

  • LiRa: A New Likelihood-Based Similarity Score For Collaborative Filtering https://arxiv.org/pdf/1608.08646v1.pdf


  • Submodularity to mathematically control diversity

    Adding an item from a different cluster gives more value than adding one from the same cluster

    Adaptive, Personalized Diversity for Visual Discovery at Amazon

    http://dl.acm.org/citation.cfm?id=2959171

  • Negative sampling is still an art

    Observational data are implicit: we know what the user likes, but we don't know what the user actually has seen or is aware of but intentionally hasn't clicked

    Popular, not-clicked items are one candidate source of negatives

    No single method; you have to try what works (a heuristic sketch follows below)
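    One common heuristic from the list above, sketched under illustrative assumptions (popularity-weighted sampling with an exponent alpha, in the style of word2vec's negative sampling), is to draw popular items the user did not interact with:

```python
import numpy as np

def sample_negatives(user_pos_items, item_popularity, n_neg, alpha=0.75, seed=0):
    """Sample popular items the user did NOT interact with, on the theory that
    the user was likely aware of them and skipped them. No single method works
    everywhere; alpha and the popularity prior are knobs to experiment with."""
    rng = np.random.default_rng(seed)
    items = np.array(list(item_popularity.keys()))
    probs = np.array([item_popularity[i] for i in items], dtype=float) ** alpha
    probs /= probs.sum()
    negatives = []
    while len(negatives) < n_neg:
        cand = rng.choice(items, p=probs)
        if cand not in user_pos_items:
            negatives.append(cand)
    return negatives

print(sample_negatives({"p1"}, {"p1": 500, "p2": 300, "p3": 10}, n_neg=3))
```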

  • Measurement

    Offline A/B testing

    http://leon.bottou.org/publications/pdf/tr-2012-09-12.pdf

    Targeted maximum likelihood

    http://www.targetedlearningbook.com/
