Context & Uncertainty in Web Analytics
Trying to make decisions with data
“If a measurement matters at all, it is because it must have some conceivable effect on decisions and behaviour. If we can’t identify a decision that could be affected by a proposed measurement and how it could change those decisions, then the measurement simply has no value” - Douglas W. Hubbard, How to Measure Anything: Finding the Value of Intangibles in Business, 2007
Like many digital businesses we use web analytics tools that measure how visitors interact with our websites and apps. These tools provide dozens of simple metrics, but in our experience their value for informing a decision is close to zero without first applying a significant amount of time, effort and experience to interpret them.
Ideally we would like to use web analytics data to make inferences about what stories our readers value and care about. We can then use this to inform a range of decisions: what stories to commission, how many articles to publish, how to spot clickbait, which headlines to change, which articles to reposition on the page, and so on.
Finding what is newsworthy cannot and should not be as mechanistic as analysing an e-commerce store, where the connection between the metrics and what you are interested in measuring (visitors and purchases) is more direct. We know that, at best, this type of data can only weakly approximate what readers really think, and that too much reliance on data for making decisions will have predictable negative consequences. However, if there is something of value the data has to say, we would like to hear it.
Unfortunately, simple web analytics metrics fail to account for key bits of context that are vital if we want to know whether their values are higher or lower than we should expect (and therefore interesting).
Moreover, there is inherent uncertainty in the data we are using, and even if we can tell whether the value is higher or lower than expected, it is difficult to tell whether this is just down to chance.
Good analysts who are familiar with their domain often become adept at the mental gymnastics required to account for context and uncertainty, so they can derive the insights that support good decisions. But doing this systematically when presented with a sea of metrics is rarely possible, nor is it the best use of an analyst’s valuable sense-making skills. Rather than spending all their time trying to identify what is unusual, it would be better if they could apply their skills to learning why something is unusual or deciding how we might improve things. If all of our attention is focused on the lower-level what questions, we never get to the why or how questions, which is where we stand a chance of getting some value from the data.
“The value of a fact shrinks enormously without context” - Howard Wainer, Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot, 1997
Take two metrics that we would expect to be useful — how many people start reading an article (we call this readers), and how long they spend on it (we call this the average dwell time). If the metrics worked as intended, they could help us identify the stories our readers care about, but in their raw form, they tell us very little about this.
- Readers: If an article is in a more prominent position on the website or app, more people will see it and click on it.
- Dwell time: If an article is longer, on average, people will tend to spend more time reading it.
Counting the number of readers tells us more about where an article was placed, and dwell time more about the length of the article than anything meaningful.
It’s not just length and position that matter. Other context, such as the section, the day of the week, how long it has been since the article was published, and whether people are reading it on our website or apps, systematically influences these numbers. So much so that we can do a reasonable job of predicting how many readers an article will get and how long they will spend on it by looking only at its context and completely ignoring its content.
From this perspective, articles are victims of circumstance, and the raw metrics we see in so many dashboards tell us more about their circumstances than anything more meaningful: they are almost all noise and very little signal.
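As a rough illustration of the claim that context alone can predict readership, here is a minimal sketch that fits a linear model to context features only. Everything in it is an assumption made for illustration: the data is synthetic, the feature names (`position`, `word_count`, `is_app`) and coefficients are invented, and ordinary least squares on log-readers is just one simple choice of model, not a description of the model used in our newsroom.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic, made-up context features (NOT real newsroom data).
position = rng.integers(1, 21, size=n)        # 1 = most prominent slot
word_count = rng.integers(300, 3000, size=n)  # article length in words
is_app = rng.integers(0, 2, size=n)           # 1 = read in the app, 0 = website

# Synthetic log-readers: driven by context, plus noise that stands in for
# everything the context cannot explain (including the content itself).
log_readers = 9.0 - 0.08 * position + 0.3 * is_app + rng.normal(0.0, 0.4, size=n)


def fit_context_model(features, target):
    """Ordinary least squares with an intercept; returns the coefficients."""
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


def predict(coef, features):
    """Expected value of the target given only the context features."""
    X = np.column_stack([np.ones(features.shape[0]), features])
    return X @ coef


features = np.column_stack([position, np.log(word_count), is_app])
coef = fit_context_model(features, log_readers)
expected = predict(coef, features)

# Share of the variance in log-readers explained by context alone.
residual = log_readers - expected
r_squared = 1.0 - residual.var() / log_readers.var()
print(f"R^2 from context only: {r_squared:.2f}")
```

In this toy setup the model never sees anything about what an article says, yet it explains a large share of the variation in readers. What is left over, the residual, is the part worth looking at.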
Knowing this, what we really want to understand is how much better or worse an article did than we would expect, given that context. In our newsroom, we do this by turning each metric (readers, dwell time and some others) into an index that compares the actual metric for an article to its expected value. We score it on a scale from 1 to 5, where 3 is expected, 4 or 5 is better than expected and 1 or 2 is worse than expected.
Article A: a longer article in a more prominent position. Neither the number of readers nor the time they spent reading it was different from what we would expect (both indices = 3).
Article B: a shorter article in a less prominent position. Whilst it had the expected number of readers (index = 3), they spent longer reading it than we would expect (index = 4).
The figures above show how we present this information when looking at individual articles. Article A had 7,129 readers, more than four thousand more than article B, and people spent 2m 44s reading article A, almost a minute longer than article B. A simple web analytics display would pick article A as the winner on both counts by a large margin, and it would completely mislead us.
Once we take the context into account and calculate the indices, we find that both articles had about as many readers as we would expect, no more and no less. Even though article B had four thousand fewer, it was in a less prominent position, and so we wouldn’t expect as many. However, people did spend longer reading article B than we would expect, given factors such as its length (it was shorter than article A).
The indices are the output of a predictive model, which predicts a certain value (e.g. number of readers) based on the context (the features in the model). The differences between the actual values and the predicted values (the residuals in the model) then form the basis of the index, which we rescale into the 1–5 score. An additional benefit is that we also have a common scale for different measures, and a common language for discussing these metrics across the newsroom.
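The step from residual to index can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact rescaling we use: the residual is taken on a log scale, it is standardised by a `residual_sd` that would come from the model's historical errors, and the `cutoffs` are hypothetical. The function name `to_index` and the expected values in the usage example are likewise invented.

```python
import numpy as np


def to_index(actual, expected, residual_sd, cutoffs=(-1.5, -0.5, 0.5, 1.5)):
    """Map actual vs expected values onto a 1-5 index.

    residual_sd: typical size of the model's log-scale residuals (assumed
                 to be estimated elsewhere, e.g. from past articles).
    cutoffs:     hypothetical boundaries, in standard deviations, between
                 the five bands.
    """
    actual = np.asarray(actual, dtype=float)
    expected = np.asarray(expected, dtype=float)
    # Residual on a log scale, so "20% above expected" means the same
    # thing for a small article as for a large one.
    z = (np.log(actual) - np.log(expected)) / residual_sd
    # np.digitize returns 0..4 for the five bands; shift to 1..5.
    return np.digitize(z, cutoffs) + 1


# Hypothetical expected values; only the 7,129 figure appears in the text.
actual_readers = [7129, 2000, 12000]
expected_readers = [7000, 3000, 5000]
print(to_index(actual_readers, expected_readers, residual_sd=0.4).tolist())
# -> [3, 2, 5]
```

Because every metric is reduced to the same five bands, the same function can be reused for dwell time or any other measure, which is what gives the indices their common scale.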
Unless we account for context, we can only really use data for inspection: ‘Just tell me which article got me the most readers, I don’t care why’. If the article only had more readers because it was at the top of the edition, we’re not learning anything useful from the data, and at worst it creates a self-fulfilling feedback loop (more prominent articles get more readers, similar to the popularity bias that can occur in recommendation engines).
In his excellent book Upstream, Dan Heath talks about moving from data for inspection to data for learning. Data for learning is fundamental if we want to make better decisions. If we want to use data for learning in the newsroom, it’s incredibly useful to be able to identify which articles are performing better or worse than we would expect, but that is only ever the start. The real learning comes from what we do with that information, trying something different, and seeing if it has a positive effect on our readers’ experience.
“Using data for inspection is so common that leaders are sometimes oblivious to any other model.” - Dan Heath, Upstream: The Quest to Solve Problems Before They Happen, 2020
“What is not surrounded by uncertainty cannot be truth” - Richard Feynman (probably)
The metrics presented in web analytics tools are incredibly precise. 7,129 people read the article we looked at earlier. How do we compare that to an article with 7,130 readers? What about one with 8,000? When presented with numbers, we can’t help making comparisons, even if we have no idea whether the difference matters.
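One way to get a feel for whether a difference between two counts matters is a back-of-the-envelope calculation like the one below. It assumes each count behaves like a Poisson variable, which is a simplification introduced here for illustration: real readership figures are usually noisier than that, and the calculation ignores context entirely. The function name `count_difference_z` is hypothetical.

```python
import math


def count_difference_z(readers_a, readers_b):
    """Approximate z-score for the difference between two counts.

    Assumes each count is Poisson, so its variance equals its mean and
    the variance of the difference is roughly readers_a + readers_b.
    """
    return (readers_b - readers_a) / math.sqrt(readers_a + readers_b)


z_one_more = count_difference_z(7129, 7130)  # one extra reader
z_many_more = count_difference_z(7129, 8000)  # 871 extra readers

print(f"7,129 vs 7,130: z = {z_one_more:.3f}")
print(f"7,129 vs 8,000: z = {z_many_more:.2f}")
```

Under this assumption, the one-reader difference is far too small to distinguish from chance, while the gap to 8,000 is too large to be chance alone. Even then, the calculation says nothing about whether the gap is explained by position, length or other context.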
We developed our indices to avoid meaningless comparisons that didn’t take context into account, but earlier versions of our indices were displayed in a way that suggested more precision than they provided: we used a scale from 0 to 200 (with 100* as expected).
*Originally we had 0 as our expected value, but we quickly learnt that nobody likes having a negative score for their article, whereas something below 100 is more palatable.
Predictably, people started worrying about small differences in the index values between articles. ‘This article scored 92, but that one scored 103, so the second article did better. Let’s look at what we can learn from it’. Sadly, the model we use to generate the index is not that accurate, and models, like data, have uncertainty associated with them. Just as people agonise over small, meaningless differences in raw numbers, the same was happening with the indices, and so we moved to a simple five-point scale.
Most articles get a 3, which can be interpreted as ‘we don’t think there is anything to see here, the article is doing as well as we’d expect on this measure’. An index of 2 or 1 means it is doing a bit worse or a lot worse than expected, and a 4 or a 5 means it is doing a bit better or a lot better than expected.
In this format, the indices provide just enough information for us to know — at a glance — how an article is doing. We use this alongside other data visualisations of indices or raw metrics where more precision is helpful, but in all cases our aim is to help focus attention on what matters, and free up time to validate these insights and decide what to do with them.
Why are context and uncertainty so often ignored?
These problems are not new, and they are covered in many great books on data sense-making. Some of these are decades old, and more recent authors include Howard Wainer, Stephen Few and R J Andrews.
Practical guidance on dealing with uncertainty is easier to come by, but in our experience, thinking about context is trickier. From some perspectives this is odd. Predictive models — the bread and butter of data scientists — inherently deal with context as well as uncertainty, as do many of the tools for analysing time series data and detecting anomalies (such as statistical process control). But we are also taught to be cautious when making comparisons where there are fundamental differences between the things we are measuring. Since there are so many differences between the articles we publish, from length, position, who wrote them, what they are about, to the section and day of week on which they appear, we are left wondering whether we can or should use data to compare any of them. Perhaps the guidance on piecing all of this together to build better measurement metrics is less common, because how you deal with context is so contextual.
Even if you set out on this path, there are many mundane reasons to fail. Often the valuable context is unavailable. It took us months to bring basic metadata about our articles, such as length and the position in which they appear, into the same system as the web analytics data. An even bigger obstacle is how much time it takes just to maintain a reliable metrics system (digital products are constantly changing, and this often breaks the web analytics data, including ours as I wrote this). Ideas for improving metrics often stay as ideas, or as proofs of concept that are never fully rolled out, while you deal with these issues.
If you do get started, there are myriad choices to make to account for context and uncertainty, from technical to ethical, all involving value judgements. If you stick with a simple metric you can avoid these choices. Bad choices can derail you, but even if you make good ones, you can’t expect the people who use the metrics to trust them if you can’t adequately explain what you have done. By accounting for context and uncertainty you may replace a simple (but not very useful) metric with something that is in theory more useful, but whose opaqueness causes more problems than it solves. Even worse, people may place too much trust in the metric and use it without questioning it.
As for using data to make decisions, we will leave that for another post. But if the data is all noise and no signal, how do you present it in a clear way so the people using it understand what decisions it can help them make? The short answer is you can’t. If the pressure is on to present some data, it is easier to passively display it in a big dashboard filled with metrics and leave it to others to work out what to do, in the same way that passive language can shield you if you have nothing interesting to say (or bullshit, as Carl T. Bergstrom would call it). This is something else we have battled with, and we have tried to avoid replacing big dashboards filled with metrics with big dashboards filled with indices.
Adding an R for reliable and an E for explainable, we end up with a checklist to help us avoid bad — or CRUDE — metrics (Context Reliability Uncertainty Decision orientated Explainability). Checklists are always useful, as it’s easy to forget what matters along the way.
Anybody promising a quick and easy path to metrics that solve all your problems is probably trying to sell you something. In our experience, it takes time and a significant commitment by everybody involved to build something better. If you don’t have this, it’s tough to even get started.
Part of the joy and pain of applying these principles to metrics used for analytics (that is, numbers that are put in front of people who then use them to help make decisions) is that it provides a visceral feedback loop when you get it wrong. If the metrics cannot be easily understood, if they convey too little information (or too much), if they are biased or unreliable, or if they just look plain wrong compared with everything the person using them knows, you’re in trouble. Whatever the reason, you hear about it pretty quickly, and this is a good motivator for addressing problems head on if you want to maintain trust in the system you have built.
Many metrics are not designed to be consumed by humans. The metrics that live inside automated decision systems are subject to many of the same considerations, biases and value judgements. It is sobering to consider the number of changes and improvements we have made based on the positive feedback loop from people using our metrics in the newsroom on a daily basis. This is not the case with many automated decision systems.
Author: Dan Gilbert