Tag Archives: Recommender Systems

STUDY: Socially aware temporally causal decoder recommender systems

Reading has many benefits for young students, such as better linguistic and life skills, and reading for pleasure has been shown to correlate with academic success. Furthermore, students have reported improved emotional wellbeing from reading, as well as better general knowledge and better understanding of other cultures. With the vast amount of reading material both online and off, finding age-appropriate, relevant and engaging content can be a challenging task, but helping students do so is a necessary step to engage them in reading. Effective recommendations that present students with relevant reading material help keep students reading, and this is where machine learning (ML) can help.

ML has been widely used in building recommender systems for various types of digital content, ranging from videos to books to e-commerce items. Recommender systems are used across a range of digital platforms to help surface relevant and engaging content to users. In these systems, ML models are trained to suggest items to each user individually based on user preferences, user engagement, and the items under recommendation. These data provide a strong learning signal for models to be able to recommend items that are likely to be of interest, thereby improving user experience.

In “STUDY: Socially Aware Temporally Causal Decoder Recommender Systems”, we present a content recommender system for audiobooks in an educational setting that takes into account the social nature of reading. We developed the STUDY algorithm in partnership with Learning Ally, an educational nonprofit dedicated to promoting reading in students with dyslexia, which provides audiobooks to students through a school-wide subscription program. Leveraging the wide range of audiobooks in the Learning Ally library, our goal is to help students find the right content to boost their reading experience and engagement. Motivated by the fact that what a person’s peers are currently reading has a significant effect on what they would find interesting to read, we jointly process the reading engagement history of students who are in the same classroom. This allows our model to benefit from live information about what is currently trending within the student’s localized social group, in this case, their classroom.


Data

Learning Ally has a large digital library of curated audiobooks targeted at students, making it well-suited for building a social recommendation model to help improve student learning outcomes. We received two years of anonymized audiobook consumption data. All students, schools and groupings in the data were anonymized, identified only by a randomly generated ID not traceable back to real entities by Google. Furthermore, all potentially identifiable metadata was shared only in aggregated form, to protect students and institutions from being re-identified. The data consisted of time-stamped records of students’ interactions with audiobooks. For each interaction we have an anonymized student ID (which includes the student’s grade level and anonymized school ID), an audiobook identifier and a date. While many schools distribute students in a single grade across several classrooms, we leverage this metadata to make the simplifying assumption that all students in the same school and in the same grade level are in the same classroom. While this provides the foundation needed to build a better social recommender model, it's important to note that this does not enable us to re-identify individuals, class groups or schools.


The STUDY algorithm

We framed the recommendation problem as a click-through rate prediction problem, where we model the conditional probability of a user interacting with each specific item conditioned on both 1) user and item characteristics and 2) the item interaction history sequence for the user at hand. Previous work suggests Transformer-based models, a widely used model class developed by Google Research, are well suited for modeling this problem. When each user is processed individually this becomes an autoregressive sequence modeling problem. We use this conceptual framework to model our data and then extend this framework to create the STUDY approach.

While this approach to click-through rate prediction can model dependencies between past and future item preferences for an individual user and can learn patterns of similarity across users at train time, it cannot model dependencies across different users at inference time. To recognize the social nature of reading and remedy this shortcoming, we developed the STUDY model, which concatenates multiple sequences of books read by each student into a single sequence that collects data from multiple students in a single classroom.

However, this data representation requires careful handling if it is to be modeled by transformers. In transformers, the attention mask is the matrix that controls which inputs can be used to inform the predictions of which outputs. The pattern of using all prior tokens in a sequence to inform the prediction of an output leads to the upper triangular attention matrix traditionally found in causal decoders. However, since the sequence fed into the STUDY model is not temporally ordered, even though each of its constituent subsequences is, a standard causal decoder is no longer a good fit for this sequence. When predicting a given token, the model must not attend to every token that precedes it in the sequence: some of those tokens might have later timestamps and thus contain information that would not be available at deployment time.

In this figure we show the attention mask typically used in causal decoders. Each row represents an input and each column represents an output. A value of 1 (shown as blue) for a matrix entry at a particular position denotes that the model can observe the input of that row when predicting the output of the corresponding column, whereas a value of 0 (shown as white) denotes the opposite.

The STUDY model builds on causal transformers by replacing the triangular attention mask with a flexible mask whose values are based on timestamps, allowing attention across different subsequences. Within each subsequence, STUDY maintains the causal triangular attention matrix of a regular transformer; across subsequences, mask values are flexible and depend on timestamps. Hence, predictions at any output point in the sequence are informed by all input points that occurred in the past relative to the current time point, regardless of whether they appear before or after the current input in the sequence. This causal constraint is important because if it is not enforced at train time, the model could learn to make predictions using information from the future, which would not be available in a real world deployment.

In (a) we show a sequential autoregressive transformer with causal attention that processes each user individually; in (b) we show an equivalent joint forward pass that results in the same computation as (a); and finally, in (c) we show that by introducing new nonzero values (shown in purple) to the attention mask we allow information to flow across users. We do this by allowing a prediction to condition on all interactions with an earlier timestamp, irrespective of whether the interaction came from the same user or not.
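
To make this concrete, here is a minimal NumPy sketch of such a mask under one plausible reading of the rule above: a position may attend to any interaction with a strictly earlier timestamp, plus the usual causal triangle within its own user's subsequence. Function and variable names are ours, not the paper's implementation.

```python
import numpy as np

def study_attention_mask(user_ids, timestamps):
    """Attention mask for one concatenated classroom sequence.

    Entry [i, j] is 1 if output position i may attend to input position j
    (rows are queries here, i.e., the transpose of the figure's layout).
    """
    user_ids = np.asarray(user_ids)
    timestamps = np.asarray(timestamps)
    pos = np.arange(len(user_ids))
    # Any interaction with a strictly earlier timestamp, from any student...
    earlier = timestamps[None, :] < timestamps[:, None]
    # ...plus the usual causal triangle within the same student's subsequence.
    within_user = (user_ids[None, :] == user_ids[:, None]) & (pos[None, :] <= pos[:, None])
    return (earlier | within_user).astype(np.float32)
```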

Experiments

We used the Learning Ally dataset to train the STUDY model along with multiple baselines for comparison. We implemented an autoregressive click-through rate transformer decoder, which we refer to as “Individual”, a k-nearest neighbor baseline (KNN), and a comparable social baseline, the social attention memory network (SAMN). We used data from the first school year for training and data from the second school year for validation and testing.

We evaluated these models by measuring the percentage of the time the next item the user actually interacted with was in the model’s top n recommendations, i.e., hits@n, for different values of n. In addition to evaluating the models on the entire test set we also report the models’ scores on two subsets of the test set that are more challenging than the whole data set. We observed that students will typically interact with an audiobook over multiple sessions, so simply recommending the last book read by the user would be a strong trivial recommendation. Hence, the first test subset, which we refer to as “non-continuation”, is where we only look at each model’s performance on recommendations when the students interact with books that are different from the previous interaction. We also observe that students revisit books they have read in the past, so strong performance on the test set can be achieved by restricting the recommendations made for each student to only the books they have read in the past. Although there might be value in recommending old favorites to students, much value from recommender systems comes from surfacing content that is new and unknown to the user. To measure this we evaluate the models on the subset of the test set where the students interact with a title for the first time. We name this evaluation subset “novel”.
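
For concreteness, the hits@n metric defined above can be computed as follows; this is a minimal sketch with illustrative variable names.

```python
import numpy as np

def hits_at_n(ranked_items, true_next_items, n=5):
    """Fraction of test events whose true next item appears in the top n.

    ranked_items: (num_events, k) array of recommended item IDs, best first (k >= n).
    true_next_items: (num_events,) array of the items actually chosen next.
    """
    top_n = np.asarray(ranked_items)[:, :n]
    hits = (top_n == np.asarray(true_next_items)[:, None]).any(axis=1)
    return hits.mean()
```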

We find that STUDY outperforms all other tested models on almost every slice we evaluated.

In this figure we compare the performance of four models: STUDY, Individual, KNN and SAMN. We measure performance with hits@5, i.e., how likely the model is to place the next title the user read within its top 5 recommendations. We evaluate the models on the entire test set (all) as well as the novel and non-continuation splits. STUDY consistently outperforms the other three models across all splits.

Importance of appropriate grouping

At the heart of the STUDY algorithm is organizing users into groups and doing joint inference over multiple users who are in the same group in a single forward pass of the model. We conducted an ablation study examining how the choice of grouping affects model performance. In our presented model we group together all students who are in the same grade level and school. We then experiment with groups defined by all students in the same grade level and district, and also with placing all students in a single group, with a random subset used for each forward pass. We also compare these models against the Individual model for reference.

We found that using groups that were more localized was more effective, with the school and grade level grouping outperforming the district and grade level grouping. This supports the hypothesis that the STUDY model is successful because of the social nature of activities such as reading — people’s reading choices are likely to correlate with the reading choices of those around them. Both of these models outperformed the other two models (single group and Individual) where grade level is not used to group students. This suggests that data from users with similar reading levels and interests is beneficial for performance.


Future work

This work is limited to modeling recommendations for user populations where the social connections are assumed to be homogeneous. In the future it would be beneficial to model a user population where relationships are not homogeneous, i.e., where categorically different types of relationships exist or where the relative strength or influence of different relationships is known.


Acknowledgements

This work involved collaborative efforts from a multidisciplinary team of researchers, software engineers and educational subject matter experts. We thank our co-authors: Diana Mincu, Lauren Harrell, and Katherine Heller from Google. We also thank our colleagues at Learning Ally, Jeff Ho, Akshat Shah, Erin Walker, and Tyler Bastian, and our collaborators at Google, Marc Repnyek, Aki Estrella, Fernando Diaz, Scott Sanner, Emily Salkey and Lev Proleev.

Source: Google AI Blog


Large-Scale Matrix Factorization on TPUs

Matrix factorization is one of the oldest, yet still widely used, techniques for learning how to recommend items such as songs or movies from user ratings. In its basic form, it approximates a large, sparse (i.e., mostly empty) matrix of user-item interactions with a product of two smaller, denser matrices representing learned item and user features. These dense matrices, in turn, can be used to recommend items to a user with which they haven't interacted before.

Despite its algorithmic simplicity, matrix factorization can still achieve competitive performance in recommender benchmarks. Alternating least squares (ALS), and especially its implicit variation, is a fundamental algorithm for learning the parameters of matrix factorization. ALS is known for its high efficiency because it scales linearly in the number of rows, columns and non-zeros, making it very well suited for large-scale problems. But for very large real-world matrix factorization datasets, a single-machine implementation does not suffice, requiring a large distributed system. Most distributed implementations of matrix factorization that employ ALS leverage off-the-shelf CPU devices, and rightfully so, due to the inherently sparse nature of the problem (the input matrix is mostly empty).
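
For intuition on why ALS scales linearly in the non-zeros, here is a single-machine sketch of one half-sweep for the explicit-feedback objective: each row's update is an independent ridge regression over only that row's observed entries. The implicit variant used at scale additionally weights the unobserved entries; this sketch is illustrative, not ALX's implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix

def als_half_sweep(Y, W, H, reg=0.1):
    """Updates every user row of W with items H held fixed, for Y ~= W @ H.T.

    Y is a csr_matrix of observed ratings. Each row update is an independent
    ridge regression over that row's non-zeros, so the sweep is trivially
    parallelizable and linear in the number of non-zeros (times k^2).
    """
    k = W.shape[1]
    for u in range(Y.shape[0]):
        cols = Y[u].indices          # items this user interacted with
        if cols.size == 0:
            continue
        H_u = H[cols]                # (nnz_u, k) gathered item embeddings
        A = H_u.T @ H_u + reg * np.eye(k)
        b = H_u.T @ Y[u].data
        W[u] = np.linalg.solve(A, b)
    return W                         # the item half-sweep is symmetric
```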

On the other hand, the recent success of deep learning, with its growing computational demands, has spurred a new wave of research and progress on hardware accelerators such as Tensor Processing Units (TPUs). TPUs afford domain-specific hardware speedups, especially for use cases like deep learning, which involves a large number of dense matrix multiplications. In particular, they allow significant speedups for traditional data-parallel workloads, such as training models with Stochastic Gradient Descent (SGD) in SPMD (single program multiple data) fashion. The SPMD approach has gained popularity in computations like training neural networks with gradient descent algorithms, and can be used for both data-parallel and model-parallel computations, where we distribute parameters of the model across available devices. Nevertheless, while TPUs have been enormously attractive for methods based on SGD, it is not immediately clear if a high performance implementation of ALS, which requires a large number of distributed sparse matrix multiplies, can be developed for a large-scale cluster of TPU devices.

In “ALX: Large Scale Matrix Factorization on TPUs”, we explore a distributed ALS design that makes efficient use of the TPU architecture and can scale well to matrix factorization problems of the order of billions of rows and columns by scaling the number of available TPU cores. The approach we propose leverages a combination of model and data parallelism, where each TPU core both stores a portion of the embedding table and trains over a unique slice of data, grouped in mini-batches. In order to spur future research on large-scale matrix factorization methods and to illustrate the scalability properties of our own implementation, we also built and released a real world web link prediction dataset called WebGraph.

The figure shows the flow of data and computation through the ALX framework on TPU devices. Similar to SGD-based training procedures, each TPU core performs identical computation for its own batch of data in SPMD fashion, which allows for synchronous computation in parallel on multiple TPU cores. Each TPU core starts by gathering all the relevant item embeddings in the Sharded Gather stage. These materialized embeddings are used to solve for user embeddings, which are scattered to the relevant shard of the embedding table in the Sharded Scatter stage.

Dense Batching for Improved Efficiency
We designed ALX specifically for TPUs, exploiting unique properties of TPU architecture while overcoming a few interesting limitations. For instance, each TPU core has limited memory and restricts all tensors to have a static shape, but each example in a mini-batch can have a wildly varying number of items (i.e., inputs can be long and sparse). To resolve this, we break exceedingly long examples into multiple smaller examples of the same shape, a process called dense batching. More details about dense batching can be found in our paper.

Illustrative example of how sparse batches are densified to increase efficiency on TPUs.
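
A toy version of the idea might look as follows; the padding value and bookkeeping here are our own simplifications, and ALX's actual implementation is described in the paper.

```python
import numpy as np

def dense_batch(item_lists, chunk_len, pad_id=0):
    """Splits variable-length examples into fixed-shape chunks for the TPU.

    Returns (chunks, row_ids): chunks has static shape (num_chunks, chunk_len),
    padded with pad_id; row_ids records which original example each chunk came
    from, so results can be re-associated after the solve.
    """
    chunks, row_ids = [], []
    for row, items in enumerate(item_lists):
        for start in range(0, len(items), chunk_len):
            piece = list(items[start:start + chunk_len])
            piece += [pad_id] * (chunk_len - len(piece))  # enforce static shape
            chunks.append(piece)
            row_ids.append(row)
    return np.array(chunks), np.array(row_ids)
```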

Uniform Sharding of Embedding Tables
With the batching problem solved, we next want to factorize a sparse matrix into two dense embedding matrices (e.g., user and item embeddings) such that the dot product of the resulting embeddings approximates the original sparse matrix — this lets us infer predictions for all positions in the original matrix, including those that were empty, which can be used to recommend items with which users haven’t interacted. Both of the resulting embedding tables (W and H in the figure below) can potentially be too large to fit in a single TPU core, thus requiring a distributed training setup for most large-scale use cases.

Most previous attempts at distributed matrix factorization use a parameter server architecture, where the model parameters are stored on highly available servers and the training data is processed in parallel by workers that are solely responsible for the learning task. In our case, since each TPU core has identical compute and memory, it's wasteful to use a core only for its memory (to store model parameters) or only for its compute (to train). Thus, we designed our system such that each core is used to do both.

Illustrative example of factorizing a sparse matrix Y into two dense embedding matrices W and H.

In ALX, we uniformly divide both embedding tables, thus fully exploiting both the size of distributed memory available and the dedicated low-latency interconnects between TPUs. This is highly efficient for very large embedding tables and results in good performance for distributed gather and scatter operations.

Uniform sharding of both embedding tables (W and H) across TPU cores (in blue).

WebGraph
Since potential applications may involve very large data sets, scalability is an important avenue for advancement in matrix factorization. To that end, we also release a large real-world web link prediction dataset called WebGraph. This dataset can be easily modeled as a matrix factorization problem where rows and columns are source and destination links, respectively, and the task is to predict destination links from each source link. We use WebGraph to illustrate the scaling properties of ALX.

The WebGraph dataset was generated from a single crawl performed by CommonCrawl in 2021 where we strip everything and keep only the link->outlinks data. Since the performance of a factorization method depends on the properties of the underlying graph, we created six versions of WebGraph, each varying in the sparsity pattern and locale, to study how well ALS performs on each.

  • To study locale-specific graphs, we filter based on two top level domains: ‘de’ and ‘in’, each producing a graph with an order of magnitude fewer nodes.
  • These graphs can still have arbitrary sparsity patterns and dangling links. Thus we further filter the nodes in each graph to have a minimum of either 10 or 50 inlinks and outlinks.

For easy access, we have made these available as a TensorFlow Datasets package. For reference, the biggest version, WebGraph-sparse, has more than 365M nodes and 30B edges. We create and publish both training and testing splits for evaluation purposes.

Results
We carefully tuned the system and quality parameters of ALX based on our observations related to precision and the choice of linear solvers. We observed that by carefully selecting the precision for storage of the embedding tables (bfloat16) and for the input to the linear solvers (float32), we were able to halve the memory required for the embeddings while still avoiding problems arising from lower precision values during the solve stage. For our linear solvers, we selected conjugate gradients, which we found to be the fastest across the board on TPUs. We use embeddings of dimension 128 and train the model for 16 epochs. In our experience, hyperparameter tuning over both the norm penalty (λ) and the unobserved weight (α) has been indispensable for good recall metrics, as shown in the table below.
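
The sketch below illustrates this precision recipe on a single row subproblem: embeddings are stored at reduced precision, but the gathered rows are upcast to float32 before a conjugate-gradient solve. It is a schematic of the idea, not ALX code; float16 stands in here for bfloat16, which plain NumPy lacks natively.

```python
import numpy as np

def solve_user_row(item_embeddings, item_idx, ratings, reg=0.1, iters=25):
    """Solves (H^T H + reg*I) w = H^T y with conjugate gradients.

    item_embeddings is stored at reduced precision to halve embedding
    memory; only the gathered rows are upcast to float32 for the solve,
    avoiding precision problems during the CG iterations.
    """
    H = item_embeddings[item_idx].astype(np.float32)
    A = H.T @ H + reg * np.eye(H.shape[1], dtype=np.float32)
    b = H.T @ ratings.astype(np.float32)
    w = np.zeros_like(b)
    r = b.copy()          # residual b - A @ w with w = 0
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        w += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < 1e-12:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w
```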

Results obtained by running ALX on all versions of WebGraph dataset. Recall values of 1.0 denote perfect recall.

Scaling Analysis
Since the input data are processed in parallel across TPU cores, increasing the number of cores decreases training time, ideally in a linear fashion. But at the same time, a larger number of cores requires more network communication (due to the sharded embedding tables). Thanks to high-speed interconnects, this overhead can be negligible for a small number of cores, but as the number of cores increases, the overhead eventually slows down the ideal linear scaling.

To confirm our hypothesis, we analyze the scaling properties of the four biggest WebGraph variants in terms of training time as we increase the number of available TPU cores. As shown below, we empirically observe the predicted linear decrease in training time up to a sweet spot, after which the network overhead slows the decline.

Scaling analysis of running time as the number of TPU cores is increased. Each figure plots the time taken to train for one epoch in seconds.

Conclusion
For easy access and reproducibility, the ALX code is open-sourced and can be easily run on Google Cloud. In fact, we illustrate that a sparse matrix like WebGraph-dense of size 135M x 135M (with 22B edges) can be factorized in a colab connected to 8 TPU cores in less than a day. We have designed the ALX framework with scalability in mind. With 256 TPU cores, one epoch of the largest WebGraph variant, WebGraph-sparse (a 365M x 365M sparse matrix), takes around 20 minutes to finish (5.5 hours for the whole training run). The final model has around 100B parameters. We hope that ALX and WebGraph will be useful to both researchers and practitioners working in these fields. The code for ALX can be found on GitHub!

Acknowledgements
The core team includes Steffen Rendle, Walid Krichene and Li Zhang. We thank many Google colleagues for helping at various stages of this project. In particular, we are grateful to the JAX team for numerous discussions, especially James Bradbury and Skye Wanderman-Milne; Blake Hechtman for help with XLA and Rasmus Larsen for useful discussions about performance of linear solvers on TPUs. Finally, we're also grateful to Nicolas Mayoraz, John Anderson, and Fernando Pereira for providing useful feedback.

Source: Google AI Blog


Reproducibility in Deep Learning and Smooth Activations

Ever queried a recommender system and found that the same search only a few moments later or on a different device yields very different results? This is not uncommon and can be frustrating if a person is looking for something specific. For the designer of such a system, it is also not uncommon for the metrics measured to change from design and testing to deployment, bringing into question the utility of the experimental testing phase. Some level of such irreproducibility can be expected as the world changes and new models are deployed. However, it also happens regularly as requests hit duplicates of the same model or as models are refreshed.

Lack of replicability, where researchers are unable to reproduce published results with a given model, has been identified as a challenge in the field of machine learning (ML). Irreproducibility is a related but more elusive problem, where multiple instances of a given model are trained on the same data under identical training conditions, but yield different results. Only recently has irreproducibility been identified as a difficult problem, but due to its complexity, theoretical studies to understand this problem are extremely rare.

In practice, deep network models are trained in highly parallelized and distributed environments. Nondeterminism in training from random initialization, parallelism, distributed training, data shuffling, quantization errors, hardware types, and more, combined with objectives with multiple local optima contribute to the problem of irreproducibility. Some of these factors, such as initialization, can be controlled, but it is impractical to control others. Optimization trajectories can diverge early in training by following training examples in the order seen, leading to very different models. Several recently published solutions [1, 2, 3] based on advanced combinations of ensembling, self-ensembling, and distillation can mitigate the problem, but usually at the cost of accuracy and increased complexity, maintenance and improvement costs.

In “Real World Large Scale Recommendation Systems Reproducibility and Smooth Activations”, we consider a different practical solution to this problem that does not incur the costs of other solutions, while still improving reproducibility and yielding higher model accuracy. We discover that the Rectified Linear Unit (ReLU), which is very popular as the nonlinearity function (i.e., activation function) used to transform values in neural networks, exacerbates the irreproducibility problem. On the other hand, we demonstrate that smooth activation functions, which have derivatives that are continuous for the whole domain, unlike those of ReLU, are able to substantially reduce irreproducibility levels. We then propose the Smooth reLU (SmeLU) activation function, which gives comparable reproducibility and accuracy benefits to other smooth activations but is much simpler.

The ReLU function (left) as a function of the input signal, and its gradient (right) as a function of the input.

Smooth Activations
An ML model attempts to learn the best model parameters that fit the training data by minimizing a loss, which can be imagined as a landscape with peaks and valleys, where the lowest point attains an optimal solution. For deep models, the landscape may consist of many such peaks and valleys. The activation function used by the model governs the shape of this landscape and how the model navigates it.

ReLU, which is not a smooth function, imposes an objective whose landscape is partitioned into many regions with multiple local minima, each providing different model predictions. With this landscape, the order in which updates are applied is a dominant factor in determining the optimization trajectory, providing a recipe for irreproducibility. Because of its non-continuous gradient, functions expressed by a ReLU network will contain sudden jumps in the gradient, which can occur internally in different layers of the deep network, affecting updates of different internal units, and are likely strong contributors to irreproducibility.

Suppose a sequence of model updates attempts to push the activation of some unit down from a positive value. The gradient of the ReLU function is 1 for positive unit values, so with every update it pushes the unit to become smaller and smaller (to the left in the panel above). At the point the activation of this unit crosses the threshold from a positive value to a negative one, the gradient suddenly changes from magnitude 1 to magnitude 0. Training attempts to keep moving the unit leftwards, but due to the 0 gradient, the unit cannot move further in that direction. Therefore, the model must resort to updating other units that can move.

We find that networks with smooth activations (e.g., GELU, Swish and Softplus) can be substantially more reproducible. They may exhibit a similar objective landscape, but with fewer regions, giving a model fewer opportunities to diverge. Unlike the sudden jumps with ReLU, for a unit with decreasing activations, the gradient gradually reduces to 0, which gives other units opportunities to adjust to the changing behavior. With equal initialization, moderate shuffling of training examples, and normalization of hidden layer outputs, smooth activations are able to increase the chances of converging to the same minimum. Very aggressive data shuffling, however, loses this advantage.

The rate that a smooth activation function transitions between output levels, i.e., its “smoothness”, can be adjusted. Sufficient smoothness leads to improved accuracy and reproducibility. Too much smoothness, though, approaches linear models with a corresponding degradation of model accuracy, thus losing the advantages of using a deep network.

Smooth activations (top) and their gradients (bottom) for different smoothness parameter values β as a function of the input values. β determines the width of the transition region between 0 and 1 gradients. For Swish and Softplus, a greater β gives a narrower region; for SmeLU, a greater β gives a wider region.

Smooth reLU (SmeLU)
Activations like GELU and Swish require complex hardware implementations to support exponential and logarithmic functions. Further, GELU must be computed numerically or approximated. These properties can make deployment error-prone, expensive, or slow. GELU and Swish are not monotonic (they start by slightly decreasing and then switch to increasing), which may interfere with interpretability (or identifiability), and they lack both a full stop region and a clean slope-1 region, properties that simplify implementation and may aid reproducibility.

The Smooth reLU (SmeLU) activation function is designed as a simple function that addresses the concerns with other smooth activations. It connects a 0 slope on the left with a slope-1 line on the right through a quadratic middle region, with continuous gradients enforced at the connection points (as an asymmetric version of a Huber loss function).
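
Concretely, with a half-width parameter β, the function described above can be written piecewise as follows; this is a sketch consistent with the description, and the paper gives the exact parameterization.

```python
import numpy as np

def smelu(x, beta=1.0):
    """Smooth reLU with half-width beta: 0 for x <= -beta, x for x >= beta,
    and the quadratic (x + beta)**2 / (4 * beta) in between, which makes the
    gradient continuous at both connection points."""
    x = np.asarray(x, dtype=np.float32)
    return np.where(x <= -beta, 0.0,
                    np.where(x >= beta, x, (x + beta) ** 2 / (4.0 * beta)))
```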

SmeLU can be viewed as a convolution of ReLU with a box. It provides a cheap and simple smooth solution that is comparable in reproducibility-accuracy tradeoffs to more computationally expensive and complex smooth activations. The figure below illustrates the transition of the loss (objective) surface as we gradually transition from a non-smooth ReLU to a smoother SmeLU. A transition of width 0 is the basic ReLU function for which the loss objective has many local minima. As the transition region widens (SmeLU), the loss surface becomes smoother. If the transition is too wide, i.e., too smooth, the benefit of using a deep network wanes and we approach the linear model solution — the objective surface flattens, potentially losing the ability of the network to express much information.

Loss surfaces (as functions of a 2D input) for two sample loss functions (middle and right) as the activation function’s transition region widens, going from ReLU to an increasingly smoother SmeLU (left). The loss surface becomes smoother as the smoothness of the SmeLU function increases.

Performance
SmeLU has benefited multiple systems, specifically recommendation systems, increasing their reproducibility by reducing, for example, recommendation swap rates. While the use of SmeLU results in accuracy improvements over ReLU, it also replaces other costly methods to address irreproducibility, such as ensembles, which mitigate irreproducibility at the cost of accuracy. Moreover, replacing ensembles in sparse recommendation systems reduces the need for multiple lookups of model parameters that are needed to generate an inference for each of the ensemble components. This substantially improves training and inference efficiency.

To illustrate the benefits of smooth activations, we plot the relative prediction difference (PD) as a function of change in some loss for the different activations. We define relative PD as the ratio between the absolute difference in predictions of two models and their expected prediction, averaged over all evaluation examples. We have observed that in large scale systems, it is sufficient, and inexpensive, to consider only two models for very consistent results.
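
A minimal sketch of this metric, assuming “expected prediction” is taken as the mean of the two models’ predictions:

```python
import numpy as np

def relative_pd(preds_a, preds_b):
    """Relative prediction difference between two identically trained models:
    |p_a - p_b| over the expected prediction, averaged over evaluation examples."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    expected = (preds_a + preds_b) / 2.0
    return float(np.mean(np.abs(preds_a - preds_b) / expected))
```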

The figure below shows curves on the PD-accuracy loss plane. For reproducibility, being lower on the curve is better, and for accuracy, being on the left is better. Smooth activations can yield a ballpark 50% reduction in PD relative to ReLU, while still potentially resulting in improved accuracy. SmeLU yields accuracy comparable to other smooth activations, but is more reproducible (lower PD) while still outperforming ReLU in accuracy.

Relative PD as a function of percentage change in the evaluation ranking loss, which measures how accurately items are ranked in a recommendation system (higher values indicate worse accuracy), for different activations.

Conclusion and Future Work
We demonstrated the problem of irreproducibility in real world practical systems, and how it affects users as well as system and model designers. While this particular issue has been given very little attention compared to the lack of replicability of research results, irreproducibility can be a critical problem. We demonstrated that a simple solution of using smooth activations can substantially reduce the problem without degrading other critical metrics like model accuracy. We also proposed a new smooth activation function, SmeLU, which has the added benefits of mathematical simplicity and ease of implementation, and can be cheap and less error-prone.

Understanding reproducibility, especially in deep networks, where objectives are not convex, is an open problem. An initial theoretical framework for the simpler convex case has recently been proposed, but more research must be done to gain a better understanding of this problem which will apply to practical systems that rely on deep networks.

Acknowledgements
We would like to thank Sergey Ioffe for early discussions about SmeLU; Lorenzo Coviello and Angel Yu for help in early adoptions of SmeLU; Shiv Venkataraman for sponsorship of the work; Claire Cui for discussion and support from the very beginning; Jeremiah Willcock, Tom Jablin, and Cliff Young for substantial implementation support; Yuyan Wang, Mahesh Sathiamoorthy, Myles Sussman, Li Wei, Kevin Regan, Steven Okamoto, Qiqi Yan, Todd Phillips, Ed Chi, Sunita Verna, and many many others for many discussions, and for integrations in many different systems; Matt Streeter and Yonghui Wu for feedback on the paper and this post; Tom Small for help with the illustrations in this post.

Source: Google AI Blog


Flexible, Scalable, Differentiable Simulation of Recommender Systems with RecSim NG

Recommender systems are the primary interface connecting users to a wide variety of online content, and therefore must overcome a number of challenges across the user population in order to serve them equitably. To this end, in 2019 we released RecSim, a configurable platform for authoring simulation environments to facilitate the study of RL algorithms (the de facto standard ML approach for addressing sequential decision problems) in recommender systems. However, as the technology has progressed, it has become increasingly important to address the gap between simulation and real-world applications, ensuring that models are flexible and easily extendible, enabling probabilistic inference of user dynamics, and addressing computational efficiency.

To address these issues, we recently released RecSim NG, the “Next Generation” of simulators for recommender systems research and development. RecSim NG is a response to a set of use cases that have emerged as important challenges in the application of simulation to real-world problems, and it tackles each of the needs outlined above: closing the gap between simulation and real-world applications, keeping models flexible and easily extendible, enabling probabilistic inference of user dynamics, and improving computational efficiency.

Overview of RecSim NG
RecSim NG is a scalable, modular, differentiable simulator implemented in Edward2 and TensorFlow. It offers a powerful, general probabilistic programming language for agent-behavior specification.

RecSim NG significantly expands the modeling capabilities of RecSim in two ways. First, the story API allows the simulation of scenarios where an arbitrary number of actors (e.g., recommenders, content consumers, content producers, advertisers) interact with one another. This enables the flexible modeling of entire recommender ecosystems, as opposed to the usual isolated user-recommender interaction setting. Second, we introduced a library of behavioral building blocks that, much like Keras layers, implement well-known modeling primitives that can be assembled to build complex models quickly. Following the object-oriented paradigm, RecSim NG uses entity patterns to encapsulate shared parameters that govern various agent behaviors, like user satisfaction, and uses templates to define large populations of agents concisely in a way that abstracts agent “individuality” without duplicating invariant behaviors.

Apart from the typical use of simulators to generate Monte Carlo samples, RecSim NG directly enables various other forms of probabilistic reasoning. While domain knowledge and intuition are key to modeling any recommendation problem, the simulation fidelity needed to bridge the so-called “sim2real” gap can only be achieved by calibrating the simulator’s model to observed data. For data-driven simulation, RecSim NG makes it easy to implement various model-learning algorithms, such as expectation-maximization (EM), generative adversarial training, etc.

Also available within RecSim NG are tools for probabilistic inference and latent-variable model learning, backed by automatic differentiation and tracing. RecSim NG exposes a small set of Edward2 program transformations tailored to simulation-specific tasks. Its log-probability module can evaluate the probabilities of trajectories according to the probabilistic graphical model induced by the simulation. This, together with the automatic differentiation provided by the TensorFlow runtime, enables the implementation of maximum-likelihood estimation and model learning within the simulation itself. RecSim NG can readily use the Markov-chain Monte Carlo (MCMC) machinery provided by TensorFlow Probability to power posterior inference and latent-variable model learning. For example, a simulation model that describes how latent user attributes (e.g., preferences, intents, satisfaction) are translated into observational data (e.g., clicks, ratings, comments) can be “run in reverse,” that is, real observational data generated by a recommender system can be used to identify the most likely configuration of latent user attributes, which in turn can be used to assess the quality of the user experience. This allows for a simulation model to be integrated directly into the full data-science and model-development workflow.

Assessing recommender ecosystem health, i.e., the long-term impact of recommendation strategies on aspects such as overall satisfaction, collective fairness, and safety, requires the simulation of large multi-agent systems in order to plausibly reproduce the interactions between the different participants of the ecosystem. This, along with the computational load of probabilistic inference tasks, requires an efficient simulation runtime. For computational performance, RecSim NG offers a TensorFlow-based runtime for running simulations on accelerated hardware. The simulation takes advantage of all optimizations offered by TensorFlow's AutoGraph compiler, including accelerated linear algebra (XLA) if available. The simulation will automatically exploit all available cores on the host machine as well as specialized hardware (if run accordingly), such as Tensor Processing Units (TPUs). The core RecSim NG architecture is back-end independent, enabling applications to be developed within other computational frameworks (such as JAX or PyTorch).

Ecosystem Modeling as an Application
To demonstrate the capabilities of RecSim NG, we present a very simplified model of multi-agent interactions among users and content providers in a stylized recommender ecosystem1. The simulation captures the dynamics of a recommender system that mediates the interaction between users and content providers by recommending slates of those providers’ content items to users over time. We adopt a simplified user model whereby each user is characterized by a static, observable “user interest vector.” This vector determines a user’s affinity with each recommended item; these affinities are then used as inputs to a choice model that determines the user’s item selection from a recommended slate. A user’s utility for any selected item is simply their affinity for the item, perturbed with Gaussian noise.
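
A plain-Python schematic of this user model might look like the following; the softmax (multinomial logit) choice rule is our illustrative assumption, since RecSim NG lets the modeler plug in any choice model.

```python
import numpy as np

rng = np.random.default_rng(0)

def user_choice(interest_vector, slate_item_features, noise_std=0.1):
    """Selects one item from a recommended slate and returns its utility.

    Affinity is the dot product of the static user interest vector with each
    item's features; the choice uses a softmax over affinities; utility is
    the chosen item's affinity perturbed with Gaussian noise.
    """
    affinities = slate_item_features @ interest_vector   # (slate_size,)
    logits = affinities - affinities.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    choice = rng.choice(len(affinities), p=probs)
    utility = affinities[choice] + rng.normal(0.0, noise_std)
    return choice, utility
```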

The aim of the recommender is to maximize cumulative user utility, over all users, over a fixed horizon. However, interesting ecosystem effects make this challenging, and emerge because of content provider behavior. Like users, each provider has an “interest vector” around which the content items it makes available are centered, reflecting that provider’s general expertise or tendencies. Providers have their own incentives for making content available: their utility is measured by the number of their items selected by any user over the recent past. Moreover, providers with higher utility generate or make available a greater number of items, increasing the “catalog” from which users (and the recommender) can choose.

We compare two different recommender policies in this setting. The first is a standard “myopic” policy that, for any user, always recommends the items that have the greatest predicted affinity for that user. Under such a policy, the behavior of providers has the potential to give rise to “rich-get-richer” phenomena: providers that initially attract users produce more items at subsequent periods, which increases the odds of attracting even more future engagement. This gradual concentration of available items around “mainstream” content providers has a negative impact on overall user utility over time. The second recommender policy is aware of these provider dynamics, which it counteracts by promoting under-served providers.2 While a simple heuristic, the provider-aware policy increases overall user utility over extended horizons.

Because the number of agents in the simulation is large, we templatize both users and content providers with reusable modeling blocks offered by RecSim NG. Since determining how to execute the simulation in parallel is non-trivial, it is critical to utilize TensorFlow’s AutoGraph and other computational optimizations.

Conclusion
Our hope is that RecSim NG will make it easier for both researchers and practitioners to develop, train and evaluate novel algorithms for recommender systems, especially algorithms intended to optimize system behavior over extended horizons, capture complex multi-agent interactions and incentives, or both. We are also investigating the release of increasingly realistic user models that can serve as benchmarks for the research community, as well as methods that can facilitate “sim2real” transfer using RecSim NG.

Further details regarding the RecSim NG framework can be found in the associated white paper, while code and colabs/tutorials are available here. A video about RecSim NG, presented at RecSys 2020, is also available.

Acknowledgements
We thank our collaborators and early adopters of RecSim NG, including the other members of the RecSim NG team: Vihan Jain, Eugene Ie, Chris Colby, Nicolas Mayoraz, Hubert Pham, Dustin Tran, Ivan Vendrov and Craig Boutilier.


1 This model is a much simpler version of that presented in this ICML-20 paper.

2 This simple heuristic policy is used only to demonstrate RecSim NG’s capabilities. More sophisticated algorithms that compute policies that explicitly maximize long-term user utility are discussed in this ICML-20 paper.

Source: Google AI Blog


Announcing ScaNN: Efficient Vector Similarity Search



Suppose one wants to search through a large dataset of literary works using queries that require an exact match of title, author, or other easily machine-indexable criteria. Such a task would be well suited for a relational database using a language such as SQL. However, if one wants to support more abstract queries, such as “Civil War poem,” it is no longer possible to rely on naive similarity metrics such as the number of words in common between two phrases. For example, the query “science fiction” is more related to “future” than it is to “earth science” despite the former having zero, and the latter having one, word in common with the query.

Machine learning (ML) has greatly improved computers’ abilities to understand language semantics and therefore answer these abstract queries. Modern ML models can transform inputs such as text and images into embeddings: high-dimensional vectors trained such that more similar inputs cluster closer together. For a given query, we can therefore compute its embedding, and find the literary works whose embeddings are closest to the query’s. In this manner, ML has transformed an abstract and previously difficult-to-specify task into a rigorous mathematical one. However, a computational challenge remains: for a given query embedding, how does one quickly find the nearest dataset embeddings? The set of embeddings is often too large for exhaustive search and its high dimensionality makes pruning difficult.

In our ICML 2020 paper, “Accelerating Large-Scale Inference with Anisotropic Vector Quantization,” we address this problem by focusing on how to compress the dataset vectors to enable fast approximate distance computations, and propose a new compression technique that significantly boosts accuracy compared to prior works. This technique is utilized in our recently open-sourced vector similarity search library (ScaNN), and enables us to outperform other vector similarity search libraries by a factor of two, as measured on ann-benchmarks.com.

The Importance of Vector Similarity Search
Embedding-based search is a technique that is effective at answering queries that rely on semantic understanding rather than simple indexable properties. In this technique, machine learning models are trained to map the queries and database items to a common vector embedding space, such that the distance between embeddings carries semantic meaning, i.e., similar items are closer together.

The two-tower neural network model, illustrated above, is a specific type of embedding-based search where queries and database items are mapped to the embedding space by two respective neural networks. In this example the model responds to natural-language queries for a hypothetical literary database.

To answer a query with this approach, the system must first map the query to the embedding space. It then must find, among all database embeddings, the ones closest to the query; this is the nearest neighbor search problem. One of the most common ways to define the query-database embedding similarity is by their inner product; this type of nearest neighbor search is known as maximum inner-product search (MIPS).
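
As a baseline for what approximate methods must beat, exact MIPS is just a full scan of the database; a minimal sketch:

```python
import numpy as np

def exact_mips(query, database, k=10):
    """Exhaustive maximum inner-product search: the O(n * d) baseline that
    approximate methods aim to beat on large databases."""
    scores = database @ query                  # inner product with every item
    top_k = np.argpartition(-scores, k)[:k]    # top-k indices, unordered
    return top_k[np.argsort(-scores[top_k])]   # ordered by descending score
```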

Because the database size can easily be in the millions or even billions, MIPS is often the computational bottleneck to inference speed, and exhaustive search is impractical. This necessitates the use of approximate MIPS algorithms that exchange some accuracy for a significant speedup over brute-force search.

A New Quantization Approach for MIPS
Several state-of-the-art solutions for MIPS are based on compressing the database items so that an approximation of their inner product can be computed in a fraction of the time taken by brute-force. This compression is commonly done with learned quantization, where a codebook of vectors is trained from the database and is used to approximately represent the database elements.

Previous vector quantization schemes quantized database elements with the aim of minimizing the average distance between each vector x and its quantized form x̃. While this is a useful metric, optimizing for it is not equivalent to optimizing nearest-neighbor search accuracy. The key idea behind our paper is that encodings with a higher average distance may actually result in superior MIPS accuracy.

The intuition for our result is illustrated below. Suppose we have two database embeddings x1 and x2, and must quantize each to one of two centers: c1 or c2. Our goal is to quantize each xi to x̃i such that the inner product <q, x̃i> is as similar to the original inner product <q, xi> as possible. This can be visualized as making the magnitude of the projection of x̃i onto q as similar as possible to the projection of xi onto q. In the traditional approach to quantization (left), we would pick the closest center for each xi, which leads to an incorrect relative ranking of the two points: <q, x̃1> is greater than <q, x̃2>, even though <q, x1> is less than <q, x2>! If we instead assign x1 to c1 and x2 to c2, we get the correct ranking. This is illustrated in the figure below.
The goal is to quantize each xi to x̃i = c1 or x̃i = c2. Traditional quantization (left) results in the incorrect ordering of x1 and x2 for this query. Even though our approach (right) chooses centers farther away from the data points, this in fact leads to lower inner product error and higher accuracy.
It turns out that direction matters as well as magnitude: even though c1 is farther from x1 than c2, c1 is offset from x1 in a direction almost entirely orthogonal to x1, while c2’s offset is parallel (for x2, the same situation applies but flipped). Error in the parallel direction is much more harmful in the MIPS problem because it disproportionately impacts high inner products, which by definition are the ones that MIPS is trying to estimate accurately.

Based on this intuition, we more heavily penalize quantization error that is parallel to the original vector. We refer to our novel quantization technique as anisotropic vector quantization due to the directional dependence of its loss function. The ability of this technique to trade increased quantization error of lower inner products in exchange for superior accuracy for high inner products is the key innovation and the source of its performance gains.
In the above diagrams, ellipses denote contours of equal loss. In anisotropic vector quantization, error parallel to the original data point x is penalized more.
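
In code, the loss for a single datapoint might be sketched as below; the weight on the parallel component is left as an explicit knob here, whereas the paper derives the actual weighting from a threshold on which inner products matter.

```python
import numpy as np

def anisotropic_loss(x, x_quantized, parallel_weight=4.0):
    """Quantization loss for one datapoint that penalizes the component of the
    error parallel to x more heavily than the orthogonal component.
    parallel_weight > 1 is an illustrative choice, not the paper's value."""
    r = x - x_quantized
    r_parallel = (r @ x) / (x @ x) * x      # error component along x
    r_orthogonal = r - r_parallel
    return parallel_weight * (r_parallel @ r_parallel) + r_orthogonal @ r_orthogonal
```
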
Anisotropic Vector Quantization in ScaNN
Anisotropic vector quantization allows ScaNN to better estimate inner products that are likely to be in the top-k MIPS results and therefore achieve higher accuracy. On the glove-100-angular benchmark from ann-benchmarks.com, ScaNN outperformed eleven other carefully tuned vector similarity search libraries, handling roughly twice as many queries per second for a given accuracy as the next-fastest library.*
Recall@k is a commonly used metric for nearest neighbor search accuracy, which measures the proportion of the true nearest k neighbors that are present in an algorithm’s returned k neighbors. ScaNN (upper purple line) consistently achieves superior performance across various points of the speed-accuracy trade-off.
ScaNN is open-source software and you can try it yourself on GitHub. The library can be installed directly via pip and has interfaces for both TensorFlow and NumPy inputs. Please see the GitHub repository for further instructions on installing and configuring ScaNN.
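
As a quick-start illustration, a searcher can be built roughly as follows, adapted from the repository's README example; exact arguments and defaults may differ across versions, so consult the GitHub instructions.

```python
import numpy as np
import scann

# Toy database of unit-normalized embeddings (dot product == cosine similarity).
dataset = np.random.rand(10000, 128).astype(np.float32)
dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)

searcher = (scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
            .tree(num_leaves=1000, num_leaves_to_search=100,
                  training_sample_size=10000)
            .score_ah(2, anisotropic_quantization_threshold=0.2)
            .reorder(100)
            .build())

neighbors, distances = searcher.search(dataset[0])  # top 10 for one query
```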

Conclusion
By modifying the vector quantization objective to align with the goals of MIPS, we achieve state-of-the-art performance on nearest neighbor search benchmarks, a key indicator of embedding-based search performance. Although anisotropic vector quantization is an important technique, we believe it is just one example of the performance gains achievable by optimizing algorithms for the end goal of improving search accuracy rather than an intermediate goal such as compression distortion.

Acknowledgements
This post reflects the work of the entire ScaNN team: David Simcha, Erik Lindgren, Felix Chern, Nathan Cordeiro, Ruiqi Guo, Sanjiv Kumar, and Zonglin Li. We’d also like to thank Dan Holtmann-Rice, Dave Dopson, and Felix Yu.



* ScaNN performs similarly well on the other datasets of ann-benchmarks.com, but the website currently shows outdated, lower numbers. See this pull request for more representative performance figures on other datasets.

Source: Google AI Blog


RecSim: A Configurable Simulation Platform for Recommender Systems

Originally posted on the Google AI Blog

Significant advances in machine learning, speech recognition, and language technologies are rapidly transforming the way in which recommender systems engage with users. As a result, collaborative interactive recommenders (CIRs)—recommender systems that engage in a deliberate sequence of interactions with a user to best meet that user's needs—have emerged as a tangible goal for online services.

Despite this, the deployment of CIRs has been limited by challenges in developing algorithms and models that reflect the qualitative characteristics of sequential user interaction. Reinforcement learning (RL) is the de facto standard ML approach for addressing sequential decision problems, and as such is a natural paradigm for modeling and optimizing sequential interaction in recommender systems. However, it remains under-investigated and under-utilized for use in CIRs in both research and practice. One major impediment is the lack of general-purpose simulation platforms for sequential recommender settings, whereas simulation has been one of the primary means for developing and evaluating RL algorithms in real-world applications like robotics.

To address this, we have developed RᴇᴄSɪᴍ (available here), a configurable platform for authoring simulation environments to facilitate the study of RL algorithms in recommender systems (and CIRs in particular). RᴇᴄSɪᴍ allows both researchers and practitioners to test the limits of existing RL methods in synthetic recommender settings. RᴇᴄSɪᴍ’s aim is to support simulations that mirror specific aspects of user behavior found in real recommender systems and serve as a controlled environment for developing, evaluating and comparing recommender models and algorithms, especially RL systems designed for sequential user-system interaction.

As an open-source platform, RᴇᴄSɪᴍ: (i) facilitates research at the intersection of RL and recommender systems; (ii) encourages reproducibility and model-sharing; (iii) aids the recommender-systems practitioner, interested in applying RL to rapidly test and refine models and algorithms in simulation, before incurring the potential cost (e.g., time, user impact) of live experiments; and (iv) serves as a resource for academic-industry collaboration through the release of “realistic” stylized models of user behavior without revealing user data or sensitive industry strategies.

Reinforcement Learning and Recommendation Systems

One challenge in applying RL to recommenders is that most recommender research is developed and evaluated using static datasets that do not reflect the sequential, repeated interaction a recommender has with its users. Even those with temporal extent, such as MovieLens 1M, do not (easily) support predictions about the long-term performance of novel recommender policies that differ significantly from those used to collect the data, as many of the factors that impact user choice are not recorded within the data. This makes the evaluation of even basic RL algorithms very difficult, especially when it comes to reasoning about the long-term consequences of some new recommendation policy—research shows changes in policy can have long-term, cumulative impact on user behavior. The ability to model such user behaviors in a simulated environment, and devise and test new recommendation algorithms, including those using RL, can greatly accelerate the research and development cycle for such problems.

Overview of RᴇᴄSɪᴍ

RᴇᴄSɪᴍ simulates a recommender agent’s interaction with an environment consisting of a user model, a document model and a user choice model. The agent interacts with the environment by recommending sets or lists of documents (known as slates) to users, and has access to observable features of simulated individual users and documents to make recommendations. The user model samples users from a distribution over (configurable) user features (e.g., latent features, like interests or satisfaction; observable features, like user demographic; and behavioral features, such as visit frequency or time budget). The document model samples items from a prior distribution over document features, both latent (e.g., quality) and observable (e.g., length, popularity). This prior, as all other components of RᴇᴄSɪᴍ, can be specified by the simulation developer, possibly informed (or learned) from application data.

The level of observability for both user and document features is customizable. When the agent recommends documents to a user, the response is determined by a user-choice model, which can access observable document features and all user features. Other aspects of a user’s response (e.g., time spent engaging with the recommendation) can depend on latent document features, such as document topic or quality. Once a document is consumed, the user state undergoes a transition through a configurable user transition model, since user satisfaction or interests might change.
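
A toy version of the choice and transition step might look as follows. The conditional-logit choice over the slate (with an explicit "no choice" option) and the satisfaction/time-budget dynamics are our own illustrative assumptions, not RᴇᴄSɪᴍ's built-in behavior.

# Hypothetical sketch only -- a conditional-logit choice over the slate with a
# "no choice" option, plus a toy transition model; names and dynamics are our
# own assumptions, not RecSim's API.
import numpy as np

rng = np.random.default_rng(seed=1)
NUM_TOPICS = 5

def choose_from_slate(user, slate):
    """Return the consumed document, or None if the user chooses nothing."""
    # Affinity = how well each document's topic matches the user's interests.
    affinities = np.array([user["interests"][doc["topic"]] for doc in slate])
    logits = np.append(affinities, 0.0)           # last logit: no-choice option
    probs = np.exp(logits) / np.exp(logits).sum()
    choice = rng.choice(len(slate) + 1, p=probs)
    return None if choice == len(slate) else slate[choice]

def transition(user, doc):
    """Update latent user state after the (possible) consumption."""
    if doc is not None:
        user["time_budget"] -= doc["length"]  # consumption costs session time
        # Satisfaction drifts toward the quality of what was consumed.
        user["satisfaction"] = 0.9 * user["satisfaction"] + 0.1 * doc["quality"]
    return user

user = {"interests": np.full(NUM_TOPICS, 1.0 / NUM_TOPICS),
        "satisfaction": 0.5, "time_budget": 60.0}
slate = [{"topic": 0, "length": 5.0, "quality": 0.8},
         {"topic": 3, "length": 2.0, "quality": 0.4}]
user = transition(user, choose_from_slate(user, slate))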

We note that RᴇᴄSɪᴍ makes it easy to author the specific aspects of user behavior that interest the researcher or practitioner, while ignoring others. This provides the critical ability to focus on modeling and algorithmic techniques designed for novel phenomena of interest (as we illustrate in the application below). This type of abstraction is often critical to scientific modeling. Consequently, high-fidelity simulation of all elements of user behavior is not an explicit goal of RᴇᴄSɪᴍ. That said, we expect that it may also serve as a platform that supports “sim-to-real” transfer in certain cases (see below).
Data flow through the components of RᴇᴄSɪᴍ. Colors represent different model components — user and user-choice models (green), document model (blue), and the recommender agent (red).

Applications

We have used RᴇᴄSɪᴍ to investigate several key research problems that arise when applying RL to recommender systems. For example, slate recommendation gives rise to challenging RL problems, since the action space grows exponentially with slate size, complicating exploration, generalization and action optimization. We used RᴇᴄSɪᴍ to develop a novel decomposition technique that exploits simple, widely applicable assumptions about user choice behavior to tractably compute Q-values of entire recommendation slates (see the sketch below). In particular, RᴇᴄSɪᴍ was used to test a number of experimental hypotheses, such as algorithm performance and robustness to different assumptions about user behavior.
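
To illustrate the flavor of such a decomposition: if one assumes the user consumes at most one item from a slate, and that the immediate reward and state transition depend only on the consumed item, then the Q-value of a slate A in state s reduces to a choice-probability-weighted sum of item-level Q-values, Q(s, A) = sum over items i in A of P(i | s, A) * Q(s, i). The sketch below instantiates this with a conditional-logit choice model; the scores and values are made-up numbers, and the code is our own illustration rather than the published implementation.

# Hedged sketch of one way a slate Q-value can decompose: assume the user
# consumes at most one item and that reward/transition depend only on that
# item; then Q(s, A) = sum_i P(i | s, A) * Q(s, i) under a conditional-logit
# choice model. All numbers below are made up for illustration.
import numpy as np

def slate_q_value(item_scores, item_q_values):
    """Combine item-level Q-values into a slate Q-value.

    item_scores:   the user's affinity score for each item in the slate
                   (a no-choice option with score 0 is appended internally).
    item_q_values: long-term value of the user consuming each item.
    """
    logits = np.append(item_scores, 0.0)       # last entry: no-choice option
    probs = np.exp(logits) / np.exp(logits).sum()
    q_values = np.append(item_q_values, 0.0)   # choosing nothing yields 0
    return float(probs @ q_values)

# Because each slate is scored from per-item quantities, candidate slates can
# be compared without enumerating the exponentially many possible slates.
print(slate_q_value(np.array([1.2, 0.3]), np.array([0.5, 0.9])))
print(slate_q_value(np.array([0.8, 0.1]), np.array([0.7, 0.2])))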

Future Work

While RᴇᴄSɪᴍ provides ample opportunity for researchers and practitioners to probe and question assumptions made by RL/recommender algorithms in stylized environments, we are developing several important extensions. These include: (i) methodologies to fit stylized user models to usage logs to partially address the “sim-to-real” gap; (ii) the development of natural APIs using TensorFlow’s probabilistic APIs to facilitate model specification and learning, as well as scaling up simulation and inference algorithms using accelerators and distributed execution; and (iii) the extension to full-factor, mixed-mode interaction models that will be the hallmark of modern CIRs—e.g., language-based dialogue, preference elicitation, explanations, etc.

Our hope is that RᴇᴄSɪᴍ will serve as a valuable resource that bridges the gap between recommender systems and RL research — the use cases above are examples of how it can be used in this fashion. We also plan to pursue it as a platform to support academic-industry collaborations, through the sharing of stylized models of user behavior that, at suitable levels of abstraction, reflect a degree of realism that can drive useful model and algorithm development.

Further details of the RᴇᴄSɪᴍ framework can be found in the white paper, while code and colabs/tutorials are available here.

Acknowledgements

We thank our collaborators and early adopters of RᴇᴄSɪᴍ, including the other members of the RᴇᴄSɪᴍ team: Eugene Ie, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu and Craig Boutilier.

By Martin Mladenov, Research Scientist and Chih-wei Hsu, Software Engineer, Google Research

Source: Google AI Blog