If you’re ever offered a chance to throw a million or so beacons' worth of user data into a Google-developed machine-learning system, I highly recommend you say yes. That’s what we did when Google approached us a year ago about partnering with them on a pioneering research project. The results have been eye-opening.
In this post, I’ll walk through a bit of the methodology we used, as well as some of the highlights of our findings. I’ll also share the machine learning code we used — which we’ve open sourced — as well as a few tips we picked up during this project.
Before we get to that, a few things I’d like to mention:
- I had the privilege of presenting this research with Pat Meenan, the Google software engineer who drove the machine learning side of things, at Velocity last month. The video of our talk is embedded at the bottom of this post.
- For more insight, the Think with Google article also covers this research.
- I want to thank the many people from Google and SOASTA who played a huge role in this project, including Daniel An, Louis Magarshack, and James Urquhart.
What we did (and how we did it)
We started with more than a billion beacons’ worth of user data gathered and stored by online retailers that use mPulse to monitor real user experience and correlate it with business metrics. This data was aggregated, anonymized, and eventually filtered down to just over 1.1 million beacons.
We then fed the data to Google’s machine learning system. The objective was to identify user session and page attributes that were the greatest predictors of two metrics: bounce rate (the percentage of users who navigate away from a site after an initial engagement) and conversion rate (the percentage of users who successfully completed a transaction).
What kind of user data did our RUM beacon collect?
- Top-level attributes – domain, timestamp, SSL
- Session – start time, length (in pages), total load time
- User agent – browser, OS, mobile ISP
- Geography – country, city, organization, ISP, network speed
- Timers – base, custom, user-defined
- Custom metrics
- HTTP headers
Why we abandoned our deep learning model and used machine learning instead
A neural network figures out how to connect the dots between different variables. As Pat explained in our Velocity talk, in order to train a deep learning network, you just keep throwing data at it until a pattern emerges.
Training a deep learning neural network takes a lot of time, guesswork, and math — almost all of which is done by GPUs (graphics processing units: computer chips that perform rapid mathematical calculations). When you have a trained neural network, using it is really fast. It took about six hours to train the deep learning system to find patterns in our data, and then less than half an hour to use it at the end.
When we ran our beacon data through the deep learning system, its predictions had an accuracy rate of about 99.6% — which we liked. What we didn’t like was the fact that the deep learning model is essentially a black box. It can’t give you any visibility into the connections between your inputs and outputs.
So we decided to go a slightly different route and use a more interpretable machine learning model. We took the same data set but used random forests instead. A random forest is an ensemble of decision trees: the model builds many trees, each one splitting the data on different variables and making decisions, then combines their votes to see what patterns emerge.
Random forests have several advantages. You can actually investigate your findings and learn from them. The whole model comprises roughly 200 lines of code. And you can download our code on GitHub and do your own research on your own user data.
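To make the idea concrete, here is a minimal sketch (not our production code) of training a random forest and reading back which features drive its predictions. The feature names and data are invented for illustration:

```python
# Illustrative sketch: train a random forest on session-like features and
# inspect per-feature importances. All names and data here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
feature_names = ["dom_ready_ms", "full_load_ms", "num_scripts", "num_images"]

# Synthetic sessions: the bounce label is loosely driven by the first feature.
X = rng.normal(size=(1000, len(feature_names)))
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Unlike a neural network, the forest reports per-feature importances,
# so you can see which inputs its decisions actually rely on.
for name, importance in sorted(
        zip(feature_names, model.feature_importances_),
        key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.3f}")
```

This visibility into `feature_importances_` is exactly what the deep learning black box couldn’t give us.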
How we ensured our data set was meaningful
There were a few things we needed to do to ensure our data was usable:
1. Converting strings to numeric values

Machine learning works off numbers. For example, an input such as “device name is Apple” becomes a yes/no answer represented as 1/0. Having a few strings (e.g., device name) is okay, but having too many distinct values (e.g., raw user agent strings) could overwhelm the system.
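A quick sketch of what that conversion looks like in practice (the column names here are hypothetical, not our actual schema):

```python
# Hypothetical example: turning string attributes into the 0/1 columns
# an ML system expects. One new column per string value.
import pandas as pd

sessions = pd.DataFrame({
    "device_name": ["Apple", "Samsung", "Apple"],
    "ssl": [True, False, True],
})

# "device name is Apple" becomes a column holding 1 or 0.
encoded = pd.get_dummies(sessions, columns=["device_name"], dtype=int)
encoded["ssl"] = encoded["ssl"].astype(int)
print(encoded)
```

Note how a column with only a few values stays manageable, while one-hot encoding raw user agent strings would explode into thousands of columns.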
2. Balancing the classes by sampling

With our data set, the conversion rate was 3%. The problem is that the system could achieve 97% accuracy just by guessing “no” for every session. To prevent this, we sampled the data to get as close to a 50/50 mix of converted and non-converted sessions as we could. This gave the ML system a baseline where it had to make decisions based on the data, not just guess the majority class.
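One common way to get that 50/50 mix is to downsample the majority class. This is a rough sketch of the idea, with invented column names, not our exact pipeline:

```python
# Sketch: downsample the majority class ("no conversion") until the
# classes are balanced, then shuffle. Column names are illustrative.
import pandas as pd

def balance(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    minority = df[df[label] == 1]
    majority = df[df[label] == 0].sample(n=len(minority), random_state=seed)
    return pd.concat([minority, majority]).sample(frac=1, random_state=seed)

# A 3% conversion rate, as in our data set.
raw = pd.DataFrame({"converted": [1] * 30 + [0] * 970})
balanced = balance(raw, "converted")
print(balanced["converted"].mean())
```

After balancing, a model that guesses the same answer for everything scores only 50%, so it is forced to learn from the features.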
3. Holding back validation data

It’s possible to overtrain a network to the point where it can actually recognize specific user sessions (you don’t want that). So we used 80% of the data to train the ML system and held back 20% to validate our findings and confirm our accuracy rate later.
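The 80/20 split is a one-liner with scikit-learn. A small sketch, with synthetic data standing in for the real beacons:

```python
# Sketch of the 80/20 train/validation split. The stratify argument keeps
# the class mix identical in both halves. Data here is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([1, 0] * 500)

x_train, x_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

print(len(x_train), len(x_val))
```

Because the held-back 20% never influences training, accuracy measured on it is an honest estimate rather than a memorization score.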
We ended up training the system to operate with 96% accuracy on the validation data, which isn’t as high as the 99.6% we got with deep learning, but is still an acceptable level of accuracy for this kind of analysis.
4. Smoothing distribution
Machine learning systems like to have data that is normally distributed. The team used a few short lines of code to standardize all the data (rescaling each column to zero mean and unit variance) before feeding it into the ML system:

```python
# scikit-learn's StandardScaler rescales each column to zero mean and
# unit variance; the validation set reuses the statistics learned from
# the training set so the two halves stay comparable.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_val = scaler.transform(x_val)
```
5. Ensure input data is as independent of outputs as possible
When you’re trying to establish causal relationships between variables, you need to remove correlated data. For example, in our study SSL highly correlated with conversions, which was not surprising since checkout pages use SSL. Also unsurprisingly, long sessions correlated with lower bounce rates. We filtered our data to remove these kinds of correlations that could throw off our results.
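One simple way to catch this kind of leakage is to look at how strongly each feature correlates with the label. This is a hedged sketch of the idea; the threshold and column names are assumptions, not our exact pipeline:

```python
# Sketch: flag features whose correlation with the label is suspiciously
# high (e.g., SSL vs. conversions) and drop them before training.
# Columns, data, and the 0.9 threshold are invented for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
converted = rng.integers(0, 2, size=500)
df = pd.DataFrame({
    "converted": converted,
    "ssl": converted,  # checkout pages use SSL, so this leaks the label
    "num_scripts": rng.poisson(40, size=500),
})

# Absolute correlation of every feature against the label.
corr = df.corr()["converted"].abs().drop("converted")
leaky = corr[corr > 0.9].index.tolist()
print("dropping:", leaky)
df = df.drop(columns=leaky)
```

A feature that is almost perfectly correlated with the outcome usually isn’t a predictor at all; it’s a consequence, and leaving it in would let the model cheat.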
What we found

When we initiated this project, we didn’t have any hard-and-fast hypotheses. Even so, our findings made us aware that we had some unspoken assumptions, some of which were proved wrong. We’re now in the process of digging further, and we’ll present any new findings at Velocity New York this September.
1. (Almost) everything matters
When looking at a total of 93 different attributes (pictured below), you might be tempted to focus only on those that mattered the most, on the left-hand side of this graph.
Obviously, those attributes deserve attention, but it bears remembering that almost every attribute we studied was able to predict conversions and bounce rate to various degrees.
2. Number of scripts was a predictor of conversions, but…
…not in the way we expected. Sessions that converted contained 48% more scripts (including third-party scripts such as ads, analytics beacons, and social buttons) than sessions that didn’t.
3. When entire sessions were more complex, they converted less
While the previous finding tells us that more scripts correlate with increased conversions, when you add in more images and other elements that make pages more complex, those sessions converted less.
Why? The culprit might be the cumulative performance impact of all those page elements. The more elements on a page, the greater the page’s weight (total number of kilobytes) and complexity.
A typical web page today contains a hundred or so assets hosted on dozens of different servers. Many of these page assets are unoptimized, unmeasured, unmonitored — and therefore unpredictable. This unpredictability makes page loads volatile. Site owners can tackle this problem by setting performance budgets for their pages and culling unnecessary page elements. They should also audit and monitor all the third-party scripts on their sites.
4. Sessions that converted contained fewer images than sessions that didn’t
When we talk about images, we’re referring to every single graphic element on a page — from favicons to logos to product images. On a retail site, those images can quickly add up. On a typical retail page, images can easily account for up to two thirds (in other words, hundreds of kilobytes) of a page’s total weight. The result: cumulatively slow page loads throughout a session.
5. DOM Ready was the greatest predictor of bounce rate
“DOM Ready” refers to the amount of time it takes for the page’s HTML to be received and parsed by the browser. Actual page elements, such as images, haven’t appeared yet. While it isn’t shocking that DOM Ready was a predictor, it was very surprising to see that it was the number one predictor. Our team agreed that this finding needs more study.
6. Full load time was the second greatest predictor of bounce rate
Bounced sessions had median full page load times that were 53% slower than non-bounced sessions. This finding is very interesting because in recent years there’s been a growing movement within the performance community to disregard load time as a meaningful metric.
I’ll put my hand up and admit that I’ve been guilty of doing this. The rationale makes sense on paper: with so many assets, such as third-party scripts and below-the-fold content, slowing down total load time, it’s difficult to see how it works as a metric for measuring perceived user experience. But now, with such a strong correlation between load time and bounce rate, dismissing it may be premature.
7. Mobile-related measurements weren’t meaningful predictors of conversions
This came as a surprise. We looked at attributes such as device type, OS, bandwidth, and connection speed, and found that none of these were strong predictors of conversions. This is interesting because it suggests that, contrary to what many people believe, internet users don’t behave especially differently depending on what device they’re using. As Pat said in our talk, there’s no more “mobile web”. It’s just the web.
8. Start Render Time wasn’t a strong predictor of conversions
Out of the 93 different metrics we studied, Start Render Time (when content begins to display in the user’s browser) ranked 69th in its ability to predict whether a session would convert or not. This was probably the most surprising finding. Up until now, many user experience proponents who participate in the web performance community have placed some value on Start Render Time. This makes sense, because — on paper, anyway — Start Render would seem to reflect the user’s perception of when a page begins to load. But this research suggests that start render isn’t an accurate measure of the user experience — at least as it pertains to triggering more conversions.
This was one study conducted on one data set. I can’t stress this enough: We’re not proposing that organizations abandon metrics, such as start render time, that our industry has been using for years. Instead, our hope is that more organizations will explore their own data and see what patterns emerge.
Given the number of RUM tools currently on the market, it’s never been as easy to collect user data as it is now. Combine that with the relatively low barrier to entry for doing this kind of machine learning, and it’s my prediction that more and more companies will explore this opportunity. Speaking of which, don’t forget to download our machine learning code on GitHub and take it for a spin.
As a final note, there’s an observation to be made here about how performance monitoring is driven by what we’re able to measure versus what we should measure. Performance monitoring tools can gather massive amounts of data about a wide swath of metrics, but are all those metrics meaningful? To what extent do we, as people who care about monitoring the user experience, let the tail wag the dog? These are interesting questions that merit more examination. Luckily, we now have the data and the tools to do this.