What most of us get wrong about measuring marketing

Let’s say that you run an online store. You advertise that store using a range of different marketing channels: Google Search, Facebook, YouTube, et cetera. If someone buys something from your store having seen ads on all of these channels, which one do you give the credit to?

This is the fundamental question of how you measure marketing. In a digital world, where consumers can see hundreds of your ads before deciding to make a purchase, it can be exceptionally difficult to decide which of your ads are really driving sales.

This isn’t just an abstract question either. If you want to increase your sales by spending more on advertising, you have to decide where to put your extra budget. But in order to know the best place to put that budget, you have to understand which of your ads are actually driving sales.

There are three different approaches to answering this question, which come from three different eras of marketing.

The Rule Era

The first approach to measurement uses rules.

A classic example of a rule-based approach is what’s known as last touch measurement. This says that sales should always be attributed to whichever ad a user interacted with last.

The idea behind this is fairly easy to motivate; if a certain ad was the last ad a user interacted with prior to purchasing, it was probably the reason they made the purchase.

Then again, maybe it was the first ad we showed them which sparked their interest, so surely we should give all the credit to the first ad? This approach to measurement is known, unsurprisingly, as first touch measurement.

First touch measurement is fairly uncommon, and it’s not difficult to see why. Sure, the very first ad a user sees is going to play some role in convincing them to finally go on and make a purchase, but to say it plays the only role is a little extreme.

You might have noticed that both first and last touch measurement approaches give no credit at all to the ads in the middle of the journey. This seems wrong because, even if these ads likely didn’t play a huge role, they surely play some role.

Linear measurement seeks to address this concern by assigning credit equally to all ads that a user has seen before making a purchase.

Sure, linear measurement is simplistic, but at least it respects our intuition that each ad plays a part in convincing someone to purchase from you.
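To make the three rules concrete, here is a minimal sketch of how each one would split the credit for a single sale. The function name, the rule labels, and the example journey are all my own illustrative choices, not any platform’s actual API.

```python
def attribute(journey, rule):
    """Split the credit for one sale across the ads in a journey.

    `journey` lists the channels a user interacted with, oldest first.
    """
    if not journey:
        return {}
    credit = {channel: 0.0 for channel in journey}
    if rule == "last_touch":
        credit[journey[-1]] = 1.0   # all credit to the final ad
    elif rule == "first_touch":
        credit[journey[0]] = 1.0    # all credit to the first ad
    elif rule == "linear":
        for channel in journey:     # equal share per interaction
            credit[channel] += 1.0 / len(journey)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return credit


# Hypothetical journey: YouTube first, Google Search last
journey = ["YouTube", "Facebook", "Google Search"]
last = attribute(journey, "last_touch")
first = attribute(journey, "first_touch")
linear = attribute(journey, "linear")
```

Under last touch, Google Search takes everything; under first touch, YouTube does; under linear, each channel gets a third. None of the rules looks at what actually happened in the data, which is the weakness the next section picks up on.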

The problem with rules

To understand one of the biggest problems with rules-based approaches, I’m going to use an analogy.

I’m an average runner. I’ve got two friends (Jack and Jill) who are also average runners. We’ve got an upcoming 4x100m relay race, for which we manage to enlist one Usain Bolt as the final member of our team.

We decide that Jack will run the first leg and Jill the last, with Bolt and me running the two legs in between.

Race day comes, and we win the 4x100m relay. Once the celebrations die down, we decide to ask ourselves: who deserves credit for us winning the race?

A last touch approach to this question would say:

“Jill deserves all the credit. She was the last runner, and so the only one of us to actually cross the finish line. Clearly she deserves the credit”

Instinctively this seems wrong. Running last shouldn’t make you more responsible for the win than anyone else. It certainly shouldn’t mean that everyone but the last runner played zero part.

A first touch approach isn’t much better. It would say:

“Jack deserves all the credit. He was the first runner; he was the only one that actually started the race. Therefore he deserves the credit”

This seems wrong, for largely the same reasons that the last touch approach seems wrong.

If we looked at it from a linear perspective, we might say:

“Everyone deserves one quarter of the credit. It’s a race of four runners, so each runner plays an equal part”

As fair and amicable as this approach might seem, it doesn’t respect our intuition that Usain Bolt likely deserves more credit than the rest of us. He’s the world class sprinter, he helped us win the race the most, he deserves most of the credit. Right?

The Model Era

Let’s ask ourselves why we think that Usain Bolt deserves the most credit for our race win in this example. When we ask ourselves this question, most of us are tempted to answer something along the lines of:

“He made the most difference to the team. The team would have been so much slower without him”

One of the ways that we can quantify the difference Usain Bolt makes to the team is by looking at his counterfactual value. A runner’s counterfactual value is the difference in how the team performs when they’re swapped out for an average runner.

Swap Bolt out for an average runner, and the team performs significantly worse; Bolt has high counterfactual value. Swap me, Jack, or Jill out for an average runner, and the performance of the team is largely unchanged. This indicates that we, in contrast to Bolt, have little counterfactual value.
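Here is a rough sketch of that swap-out calculation. The leg times are numbers I’ve made up purely for illustration (including the 12-second “average” leg), so treat the output as a demonstration of the idea rather than real race data.

```python
AVERAGE_LEG = 12.0  # assumed time in seconds for an average runner's 100m leg

# Made-up leg times in seconds, for illustration only
team = {"Jack": 12.1, "Me": 11.9, "Bolt": 9.6, "Jill": 12.0}


def counterfactual_value(team, runner, average=AVERAGE_LEG):
    """Seconds the team would lose if `runner` were swapped for an average runner."""
    actual = sum(team.values())
    swapped = actual - team[runner] + average
    return swapped - actual


values = {name: round(counterfactual_value(team, name), 2) for name in team}
most_valuable = max(values, key=values.get)
```

With these assumed times, swapping Bolt out costs the team about 2.4 seconds, while swapping any of the rest of us changes the total by a tenth of a second at most.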

So how does this relate back to marketing?

Instead of runners in a team, we can think of channels that a user has interacted with in a purchase journey. Just as a relay team requires all four runners to win its race, a purchaser who interacts with four different channels is influenced by each one.

To work out the role that each channel played in pushing the consumer to purchase, and therefore how much credit to assign it, we can take a similar approach to the above.

If we want to know the role that Facebook plays in a purchase journey, we can look at consumer journeys that involved Facebook, and journeys that are identical except for the absence of Facebook. By comparing the difference in how many people purchase at the end of these two journeys, we can see the counterfactual impact that Facebook had.

If people are more likely to purchase in the journey where Facebook was present, then Facebook deserves credit for helping to drive these purchases: it’s having an incremental impact.
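As a sketch of that comparison, the snippet below takes a set of journeys and compares the purchase rate of journeys that include a channel against otherwise identical journeys that don’t. The data and function names are hypothetical, and this is only the naive comparison described above; I’m not claiming it is how Google’s DDA is implemented.

```python
def conversion_rate(journeys, channels):
    """Share of journeys touching exactly these channels that ended in a purchase."""
    outcomes = [purchased for seen, purchased in journeys
                if set(seen) == set(channels)]
    return sum(outcomes) / len(outcomes) if outcomes else None


def counterfactual_uplift(journeys, base_channels, channel):
    """Purchase rate with `channel` added, minus the rate for the same journey without it."""
    with_channel = conversion_rate(journeys, set(base_channels) | {channel})
    without_channel = conversion_rate(journeys, base_channels)
    if with_channel is None or without_channel is None:
        return None  # not enough data to compare
    return with_channel - without_channel


# Hypothetical data: (channels seen, did the user purchase?)
journeys = (
    [(("Google Search", "Facebook"), True)] * 3
    + [(("Google Search", "Facebook"), False)] * 7
    + [(("Google Search",), True)] * 1
    + [(("Google Search",), False)] * 9
)

uplift = counterfactual_uplift(journeys, ("Google Search",), "Facebook")
```

In this made-up data, 30% of journeys with Facebook end in a purchase against 10% without, so the uplift attributed to Facebook is roughly 20 percentage points.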

This is the basis for a number of different approaches to marketing measurement, most notably the Data Driven Attribution (DDA) model that Google uses to measure the impact of different channels.

The problem with models

Let’s say that I’m running Facebook ads for you, and I come to you with the following graph¹:

I then say to you:

“Purchase rates are much higher amongst people who’ve seen our ads than those who haven’t, and even higher amongst those who’ve clicked our ads. Clearly our ads are having a big impact; they’re increasing the likelihood of people purchasing”.

How convinced would you be by this? Does the graph above support what I’ve said?

In actual fact, it’s entirely possible that even if the graph above were true, I could be completely wrong in saying that our ads are having an impact. The reason this could happen comes down to delivery bias.

If the results in my earlier graph came from showing ads to people completely at random, with no regard for targeting, then the higher purchase rate amongst those who saw/clicked the ads would indicate that the ads were having an impact.

In reality though, no advertiser shows ads at random. Advertisers will almost always show ads to people who are already likely to purchase, in order to push them one step closer to purchase.

But if we’re showing ads to people who are most likely to purchase, then observing that people who see our ads are more likely to purchase doesn’t mean our ads are causing them to purchase. It could very well be just that we’re targeting people who are already likely to convert.

If this is true, then my graph from earlier doesn’t tell us anything about the causal impact our ads are having; it just tells us that we’re showing ads to people who are already on their way to purchase.
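A small simulation can make this concrete. In the sketch below, every number is an assumption of mine: each user gets a baseline chance of purchasing, the targeting favours users with a higher chance, and the ad itself is given zero causal effect.

```python
import random


def simulate_delivery_bias(n_users=20000, seed=0):
    """Return (purchase rate among exposed, purchase rate among unexposed)."""
    rng = random.Random(seed)
    exposed, unexposed = [], []
    for _ in range(n_users):
        # Assumed baseline chance of buying, between 0% and 20%
        propensity = rng.random() * 0.2
        # Targeting: the likelier a user is to buy, the likelier they see an ad
        saw_ad = rng.random() < propensity / 0.2
        # The ad has no effect: purchasing depends only on the baseline
        purchased = rng.random() < propensity
        (exposed if saw_ad else unexposed).append(purchased)
    return sum(exposed) / len(exposed), sum(unexposed) / len(unexposed)


exposed_rate, unexposed_rate = simulate_delivery_bias()
```

Even though the ad does nothing in this simulation, the purchase rate amongst people who saw it comes out well above the rate amongst those who didn’t. The gap is produced entirely by who was targeted.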

This might seem like an abstract worry, but it’s not. Today’s advertising platforms are so efficient at finding people who are already likely to purchase, that the scenario above presents a real issue: our ad delivery (who we’re showing ads to) is often heavily biased.

What does this mean for models?

The above presents a challenge to the sort of model we looked at earlier. If we try to look at the impact that a channel has by comparing journeys that feature that channel to journeys that don’t, we need to be able to assume that the channel in question isn’t just showing ads to the people that are most likely to purchase.

If the channel in question is just showing ads to users that are most likely to purchase, then it will inflate the perceived counterfactual value of that channel. This is because the channel is in effect cherry-picking the best people to show its ads to. Just because it’s showing ads to people that are likely to purchase doesn’t mean it’s causing them to purchase.

In order to get around this challenge, we need to be able to split users who have and haven’t seen ads from a particular channel into unbiased groups. By comparing purchase rates between these groups, we can then work out what causal impact the channel in question is having on users’ purchase decisions.

But to construct these groups we can’t just look at people who have/haven’t seen ads from a particular channel, as users likely to purchase are more likely to be targeted by your ads. So how can we build these groups then?

The Lift Era

One effective way of getting around the challenges posed above is to use what’s known as a lift test.

To lift test a channel, you start off by serving your ads as normal. Once you win an ad auction, and just before you show a user your ad, a coin is flipped:

If the coin lands heads, the user is shown your ad as normal, and is put in a group of people who are allowed to see your ads.
If the coin lands tails, the user isn’t shown your ad, and they’re put in a group of people who aren’t allowed to see any of your ads.

It’s crucial that the user isn’t placed into a group up until the moment before they’re shown an ad. The fact that the user is placed into a group so late on means that which group they’re placed in isn’t biased by your targeting methods. The most valuable users, those most likely to purchase, aren’t any more likely to be in one group than another.

As you continue to show your ads to new people, you continue to build up groups of people who respectively are and aren’t allowed to see your ads.

Some time later, which could be weeks or months depending on your customers’ typical purchase cycle, you can go back and compare the two groups of users. The key question to ask is: how many more purchases were there amongst people who were allowed to see my ads than amongst those who weren’t? This is a measure of the incremental impact of your ads.
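The whole procedure can be sketched as a simulation. Everything here is assumed for illustration: the baseline purchase chances, the size of the ad’s true effect, and the function name. The point is only that the coin flip happens independently of how likely a user is to buy.

```python
import random


def run_lift_test(n_users=20000, ad_effect=0.0, seed=0):
    """Simulate a lift test and return test purchase rate minus control purchase rate."""
    rng = random.Random(seed)
    test, control = [], []
    for _ in range(n_users):
        # Each targeted user has an assumed baseline chance of buying (0% to 20%)
        propensity = rng.random() * 0.2
        # The coin flip: it ignores propensity, so the groups are comparable
        in_test = rng.random() < 0.5
        # Only users in the test group can be influenced by the ad
        chance = propensity + (ad_effect if in_test else 0.0)
        purchased = rng.random() < chance
        (test if in_test else control).append(purchased)
    return sum(test) / len(test) - sum(control) / len(control)


lift_when_ads_do_nothing = run_lift_test(ad_effect=0.0)
lift_when_ads_work = run_lift_test(ad_effect=0.05)
```

When the ad has no effect, the measured lift sits close to zero despite the targeting; when the ad adds five percentage points to the chance of purchase, the measured lift lands near five points. That is the property the earlier journey comparison lacked.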

If the incremental impact is larger than expected, it suggests that whatever measurement system you had been using was under-crediting the channel you’ve lift tested. This channel actually provides much more value to you than you’d previously expected. This is likely to be true if the channel targets users early on in the purchase journey, which traditional rule-based measurement approaches often give less credit to.

Conversely, if the incremental impact is smaller than expected, it suggests your ads are being shown to people who would’ve purchased anyway. The ads you’re showing on this channel aren’t causing people to purchase; you’re just showing ads to people who are about to purchase.

The way forward

Lift testing is still fairly uncommon amongst advertisers. This isn’t helped by the fact that most platforms don’t readily support it, or require large minimum spends to even set up a test. With Facebook currently paving the way for self-serve lift tests, hopefully other platforms will follow suit.

As platforms make it easier to measure in this way, we’ll begin to see lift tests becoming the norm rather than the exception. After running their first lift tests, advertisers will become increasingly skeptical of rule-based and model-based approaches, using them only as proxies for the results they might get from a lift test.

-----

¹ This example is taken from “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook”, which is well worth a read if you’ve gotten this far.

This piece was co-written with

mack grenfell
