EXPEDIA GROUP TECHNOLOGY — DATA

Multi-Variate Web Optimisation Using Linear Contextual Bandits

Or how you can run full webpage optimisations with context-aware outcomes.

Contextual multi-armed bandits offer promising opportunities to improve web content. We describe how to use them to optimise several aspects of a page at the same time, covering a common example, a promising model, and an infrastructure sketch for deploying it live. This post is intended for data scientists and machine learning practitioners who already know the basics of multi-armed bandit algorithms.

Optimising the website

E-commerce websites have a constant need to improve the end-user experience. These changes range from tweaking the colour of a button to a complete rebuild of a page layout. Typically, a page is optimised with a series of concurrent A/B tests run by several product teams. Different ideas for separate aspects of the page are tested independently and rolled out as soon as the respective test reads positive. This has been shown to improve websites in the long term and has become a staple approach in many companies. However, it has three main drawbacks:

  • Ideas are usually tested independently, which can mean missing positive or negative interactions between them.
  • Standard A/B testing does not scale well as the number of variants increases. This becomes especially problematic when attempting to combine several tests into one, as the number of combinations of variants grows exponentially.
  • There is no scalable way to add context. When implemented, it is usually done as a segmentation (a separate test per segment), which is not the best use of the traffic as it splits it between a series of independent, and thus less precise, tests.

Given these bottlenecks, recent developments in contextual (multi-armed) bandits applied to web optimisation show a lot of promise. In a nutshell, contextual bandits are a family of algorithms which learn from online feedback to find the best possible options for customers. They naturally balance exploring the existing options and focusing on the most promising ones.

Some groundbreaking papers [2–4] have shown that these techniques can alleviate the aforementioned issues by:

  • Repurposing traffic to more promising options on the fly, which reduces the risk of poor variants running for too long and makes better use of the traffic in general.
  • Running optimisations as a (statistical/machine learning) model, which enables combining tests into a bundle and taking interactions into account.
  • Optimising for specific user sub-segments, which becomes not only possible but also maintainable.

In the remainder of this entry, we present a generic use case, go into the details of how to apply a linear model, describe the algorithms used to update the parameters and explore the layout space, and finally sketch how to put it all into production.

The use case

Let us illustrate the above with a simple example (this is not a real use case): say we have a mobile landing page like the Hotels.com (part of Expedia Group™) landing page for Rome. There are several aspects that the team would like to explore: the welcome message, the size of the image, and the complexity of the search module. Each of them has three variants.

Combining the changes of the three aspects yields 3³ = 27 unique layouts. Clearly, this does not scale well in a classic testing approach, as it splits the traffic among many variants. Yet these small changes, combined, can produce quite distinctive user experiences.

Additionally, we would like to be able to find the best possible layout for a specific context, for instance returning versus new users, or the customer's country. This type of use case is very common across the website and can be specified as follows:

  1. We are trying to find the optimal layout of a page for a certain context.
  2. This layout is composed of several target aspects (e.g. the welcome message or the size of the image).
  3. Each aspect has several variants (e.g. the welcome message can be “Good Morning”, “Welcome Back”, or removed).
  4. Similarly, the context is composed of several categories with levels; this could be, for instance, the customer's country of origin: France, Japan, Germany, etc. (We do not consider continuous variables at this stage.)
  5. We want to optimise for a specific reward (i.e. a metric or KPI), which is usually the progression rate of the page. There can be different definitions of progression, and choosing one is a crucial decision. In this case, we target the event of a user reaching another page directly from this one. A minimal representation of this setup is sketched after this list.
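To make this concrete, here is a minimal sketch of how the setup above could be represented. All names and variant values are illustrative, not taken from a real test.

```python
from itertools import product

# Hypothetical specification of the example above (illustrative names only).
aspects = {                        # the target aspects, three variants each
    "welcome_message": ["Good Morning", "Welcome Back", "none"],
    "image_size": ["small", "medium", "large"],
    "search_module": ["minimal", "standard", "full"],
}
context_categories = {             # the context categories and their levels
    "user_type": ["new", "returning"],
    "country": ["France", "Japan", "Germany"],
}

# Every unique layout is one combination of variants: 3 * 3 * 3 = 27.
layouts = [dict(zip(aspects, combo)) for combo in product(*aspects.values())]
assert len(layouts) == 27

# Reward r: 1 if the user reached another page directly from this one, else 0.
```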

The model

To tackle this use case, we use linear contextual multi-armed bandits. This is a fairly common approach, presented by Agrawal and Goyal [1], and it has been used in very influential projects such as the works of Hill et al. [2] and Chapelle et al. [3], which are the main sources of inspiration for this implementation.

Let us define the model:

  • The layouts A are represented by K categorical variables (the aspects). The context C is encoded in the same way with D categorical variables. Each categorical variable has its own number of levels.
  • We define m as the function mapping A and C into the feature space X. A feature vector has length L and is composed of the K one-hot-encoded aspects, the D one-hot-encoded contexts, and first-order interactions between categorical variables.
  • Which variables to interact is a modelling decision. We recommend including at least the first-order interactions between A and C, so the model can learn which layout works better in which context, as well as first-order interactions between the aspects of A, to capture influences of one aspect over another.
  • We link the expected value of the Bernoulli reward R to an observation x (i.e. the encoding of a and c) with a sigmoid function and a vector of parameters w.
  • For each context c, there is an optimal layout a* that maximises the expected reward.
  • As in most multi-armed bandit settings, the objective is to minimise the regret Δ, defined as the sum of differences between the expected reward of the optimal layout and that of the layout chosen at time t. These quantities are written out after this list.
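Written out, one consistent way to formalise the above (using our own notation, with σ for the sigmoid and x = m(a, c) for the feature vector of layout a in context c) is:

$$
\mathbb{E}[R \mid x] = \sigma(w^{\top} x) = \frac{1}{1 + e^{-w^{\top} x}}, \qquad x = m(a, c) \in X
$$

$$
a^{*}(c) = \arg\max_{a \in A} \, \mathbb{E}\big[R \mid m(a, c)\big], \qquad
\Delta(T) = \sum_{t=1}^{T} \Big( \mathbb{E}\big[R \mid m(a^{*}(c_{t}), c_{t})\big] - \mathbb{E}\big[R \mid m(a_{t}, c_{t})\big] \Big)
$$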

To accomplish this, we need to update the model online with the feedback generated by our customers. We propose the following updating scheme:

  • A logistic regression is trained on batches of feedback made of N tuples (rᵢ, xᵢ), with rᵢ ∈ {0,1} and xᵢ ∈ X.
  • We use a Bayesian updating strategy in which the current distribution of w serves as the prior. Combined with the likelihood of the batch, it yields the updated (posterior) distribution.
  • We assume that each weight follows an independent gaussian distribution.
  • A cost function combining this prior with the batch likelihood can be minimised to find the new vector of means of w.
  • With this likelihood function and a gaussian prior, the posterior distribution has no analytical form. It is however possible to use the Laplace approximation (which can be grossly summarised as approximating a reasonably well-behaved density function with a gaussian one; see references [3, 5] for more details), yielding a simple analytical update in which the posterior is again gaussian (see Algorithm 3). The corresponding formulas are sketched after this list.
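As a sketch, following the approach of Chapelle et al. [3] and writing the prior on each weight as $w_j \sim \mathcal{N}(m_j, q_j^{-1})$ (our notation), the new means are found by minimising

$$
\frac{1}{2} \sum_{j=1}^{L} q_j (w_j - m_j)^{2}
\; - \; \sum_{i=1}^{N} \Big( r_i \log \sigma(w^{\top} x_i) + (1 - r_i) \log\big(1 - \sigma(w^{\top} x_i)\big) \Big)
$$

and, with the minimiser $w^{\text{new}}$ and $p_i = \sigma\big((w^{\text{new}})^{\top} x_i\big)$, the Laplace approximation updates the means and precisions as

$$
m_j \leftarrow w_j^{\text{new}}, \qquad q_j \leftarrow q_j + \sum_{i=1}^{N} x_{ij}^{2} \, p_i (1 - p_i).
$$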

The implementation

A contextual multi-armed bandit essentially needs to be able to accomplish two operations: choosing a layout given a context, and updating from the feedback generated by customers. Our implementation has three key aspects:

  • To choose a layout to display, we use the Thompson sampling heuristic, which consists in drawing a vector w̃ from the current distribution of w, then choosing the arm a that maximises the expected reward given the context c.
  • To make the above scalable at web-traffic levels, the sampling is modified to perform a greedy hill-climbing search over the layout space A, reducing complexity and latency.
  • To learn from the feedback, the distribution of the parameters w is updated with a Laplace approximation on batches of new observations, made of the layouts sampled in the step above and the observed rewards.

The following sections explain each of these in detail.

Thompson sampling

Thompson sampling is one of the most common heuristics for multi-armed bandits. It can easily be applied to a linear model, as proposed by Agrawal and Goyal [1]. We assume each weight of w follows an independent gaussian distribution. Thompson sampling (Algorithm 1) consists of:

  • First, sampling a set of random parameters w̃.
  • Second, choosing the arm that maximises the expected reward given the sampled parameters w̃ and the given context c. A minimal sketch in code follows this list.
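Here is one possible sketch, assuming the independent gaussian posteriors are stored as vectors of means and precisions, and that a `feature_map(a, c)` function playing the role of m is available (both names are ours, not from the original implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def thompson_sample_layout(means, precisions, context, feature_map, all_layouts, rng):
    """Sample a weight vector from the current posterior, then pick the layout
    with the highest expected reward for the given context (Algorithm 1 in spirit)."""
    # Draw w~ from the independent gaussians N(m_j, q_j^-1)
    w_tilde = rng.normal(means, 1.0 / np.sqrt(precisions))
    # Score every candidate layout under the sampled weights and keep the best
    scores = [sigmoid(w_tilde @ feature_map(a, context)) for a in all_layouts]
    return all_layouts[int(np.argmax(scores))]
```

Note that this version scores every layout in `all_layouts`; that exhaustive scoring is exactly what the hill-climbing variant below avoids.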

Greedy hill climbing to speed up the sampling

One of the problems with Thompson sampling is the need to score all the possible layouts before selecting the maximum.

This does not scale well when the number of aspects and variants increases (i.e. O(Mᴷ) scorings per request with a fixed M levels across K aspects). Hill et al. [2] have proposed an alternative that uses a greedy hill-climbing strategy. This method does not guarantee finding the absolute maximum but significantly reduces the complexity of sampling. Described in Algorithm 2, the intuition of this approach is as follows:

  • We are given a context c, a sampled weight vector w̃ and a number of climbing steps S.
  • Select a random layout a⁰.
  • For each climbing step, select a random aspect to optimise. For all the variants of this aspect (keeping the other aspects fixed), score the resulting layouts and pick the one with the highest score.
  • Continue picking random aspects, scoring the variants and keeping the best one for the remainder of the climbing steps. Optionally, you can add a stopping rule which fires when all aspects have been explored without improvement, meaning a local maximum has been reached (not presented in Algorithm 2 for simplicity).
  • To mitigate the risk of ending in a local maximum, run the above with R parallel starts. A sketch in code follows this list.
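A hedged sketch of this search, reusing the assumptions of the previous snippet (an `aspects` mapping from aspect name to variants and a `feature_map(a, c)` function, both illustrative names):

```python
import numpy as np

def hill_climb_layout(w_tilde, context, feature_map, aspects, n_starts=3, n_steps=10, rng=None):
    """Greedy hill climbing over the layout space (in the spirit of Algorithm 2):
    instead of scoring all M^K layouts, repeatedly optimise one random aspect at a time."""
    rng = rng or np.random.default_rng()

    def score(layout):
        # The sigmoid is monotone, so maximising w~ . x is enough to rank layouts
        return w_tilde @ feature_map(layout, context)

    best_layout, best_score = None, -np.inf
    for _ in range(n_starts):                                      # R parallel starts
        layout = {k: rng.choice(v) for k, v in aspects.items()}    # random layout a0
        for _ in range(n_steps):                                   # S climbing steps
            aspect = rng.choice(list(aspects))                     # random aspect to optimise
            # Score all variants of this aspect, keeping the other aspects fixed
            candidates = [{**layout, aspect: v} for v in aspects[aspect]]
            layout = max(candidates, key=score)
        if score(layout) > best_score:
            best_layout, best_score = layout, score(layout)
    return best_layout
```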

The main advantage of this algorithm is that it drastically reduces the sampling complexity from O(Mᴷ) to O(M·R·S). There is an obvious trade-off between the risk of not selecting the currently optimal layout and the computational cost, which increases with R and S. At a high level, this can be summarised as balancing between generating less regret and keeping latency low.

Updating the parameters

Finally, we use the feedback from our customers to update the algorithm. To do so, we use the approach proposed by Chapelle et al. [3]. Algorithm 3 has two main steps:

  1. Update the means by minimising the cost function introduced above with any gradient-descent-based optimisation.
  2. Update the variances using the Laplace approximation. A minimal sketch in code follows this list.
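A sketch of this update, assuming the prior is stored as vectors of means and precisions (inverse variances), X is the N×L matrix of feature vectors and r the vector of {0, 1} rewards for the batch:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_update(means, precisions, X, r):
    """Batch update of the weight distribution, in the spirit of Algorithm 3 and
    Chapelle et al. [3]: gradient-based update of the means, then a Laplace
    approximation for the variances."""
    def cost(w):
        p = sigmoid(X @ w)
        prior = 0.5 * np.sum(precisions * (w - means) ** 2)              # gaussian prior
        log_loss = -np.sum(r * np.log(p + 1e-12) + (1 - r) * np.log(1 - p + 1e-12))
        return prior + log_loss

    # 1. Update the means with a gradient-based optimiser
    new_means = minimize(cost, x0=means, method="L-BFGS-B").x
    # 2. Update the precisions with the Laplace approximation
    p = sigmoid(X @ new_means)
    new_precisions = precisions + (X ** 2).T @ (p * (1 - p))
    return new_means, new_precisions
```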

Production sketch

To give a better understanding, here is a simple diagram of how this approach is usually put in production across the industry. The main idea is to decouple the updating (training) and sampling sides to make the infrastructure more resilient.

The trainer can be scheduled with a cron job to run hourly. The parameter store is a simple database holding the latest state of the distribution of w. Finally, the sampler can live in several instances, scaling horizontally. To reduce latency, it is a good idea to refresh the parameters periodically and keep them in the sampler's memory.
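For illustration only, a sampler-side refresh could look like the sketch below; the `store.load()` call stands in for whatever parameter store client is used and is purely hypothetical.

```python
import threading
import time

class ParameterCache:
    """Keeps an in-memory copy of the latest (means, precisions) of w and
    refreshes it periodically, so sampling requests never hit the store directly."""

    def __init__(self, store, refresh_seconds=60):
        self.store = store                   # hypothetical client exposing load()
        self.params = store.load()
        thread = threading.Thread(target=self._refresh, args=(refresh_seconds,), daemon=True)
        thread.start()

    def _refresh(self, refresh_seconds):
        while True:
            time.sleep(refresh_seconds)
            self.params = self.store.load()  # swap in the latest trained parameters
```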

Opportunities ahead

This model is simple and scalable, yet it can capture patterns in users' preferences that were not accessible before. This is especially important to support personalisation, which is a key strategic target for web commerce companies like Expedia. Additionally, this model can accelerate the page optimisation programme, as it is equivalent to running several tests at the same time while capturing positive or negative interactions. Finally, bandits naturally repurpose traffic to more promising options and reduce the opportunity cost of exploring poor ideas, effectively accelerating decision making.

In our opinion, this offers non-negligible improvements over classic A/B testing programmes, as optimising several aspects of a page at the same time for different contexts is a ubiquitous problem for many web companies. These algorithms are easy to implement, and using a limited set of parameters makes it possible to keep everything in memory, reducing the need for a complex production infrastructure.

Stay tuned for more posts on the infrastructure, successful use cases, simulations and extensions! Meanwhile, check out our previous post on how we optimised the main images of properties using a simple bandit approach:

How We Optimised Hero Images on Hotels.com using Multi-Armed Bandit Algorithms

A big thank you to Christian Sommeregger, Kirubhakaran Krishnathan, Dionysios Varelas and Travis Brady for reviewing this entry.

References

[1] Agrawal, S. and Goyal, N., 2013, February. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning (pp. 127–135).

[2] Hill, D.N., Nassif, H., Liu, Y., Iyer, A. and Vishwanathan, S.V.N., 2017, August. An efficient bandit algorithm for realtime multivariate optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1813–1821).

[3] Chapelle, O., Manavoglu, E. and Rosales, R., 2014. Simple and scalable response prediction for display advertising. ACM Transactions on Intelligent Systems and Technology (TIST), 5(4), pp.1–34.

[4] Li, L., Chu, W., Langford, J. and Schapire, R.E., 2010, April. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web (pp. 661–670).

[5] Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc. (Chapter 4.4, pp. 213–220).

Learn more about technology at Expedia Group
