What events should I fire to track an experiment?
A fabulous question, and one with a number of different answers. During my travels I've found that there are four basic ways to tackle the problem:

1. Set a user property per experiment
2. Set a single user property that packs in every experiment the user is part of
3. Attach experiment properties to every event you fire
4. Fire a dedicated exposure event when a user enters an experiment
So how do we choose? Well, I think the right way to look at this is to go backwards. Rather than start with the events, let's start with the results we want to analyze. What do we want to see in order to measure whether these experiments were successful? I'd say the platonic ideal of experiment results is something like:
| Experiment | Arm | Exposures | Conversions |
| --- | --- | --- | --- |
| ez74_new_header | control | 10011 | 1031 |
| ez74_new_header | ez74-on | 10090 | 1100 |
| ez75_big_food_picture | control | 8000 | 873 |
| ez75_big_food_picture | ez75-on | 8005 | 888 |
How do we get there? We take all the exposure events, check whether there are any conversions for each exposed user, and group 'em up by experiment and arm. A sketch of that query is below.
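As a rough sketch, the roll-up you'd run against Redshift looks something like the snippet below. The table and column names (`exposure_events`, `conversion_events`, `occurred_at`) are made up for illustration, not a real schema.

```python
# Illustrative only: assumes exposure and conversion events have already been
# synced into warehouse tables named exposure_events and conversion_events.
RESULTS_QUERY = """
SELECT
    e.experiment,
    e.arm,
    COUNT(DISTINCT e.user_id) AS exposures,
    COUNT(DISTINCT c.user_id) AS conversions
FROM exposure_events e
LEFT JOIN conversion_events c
  ON c.user_id = e.user_id
 AND c.occurred_at >= e.occurred_at
GROUP BY e.experiment, e.arm
ORDER BY e.experiment, e.arm;
"""
```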
Setting a property on the user sounds nice, and the tools generally seem to suggest that's what you ought to do, but things get weird quickly. If you go with method 1 (a property per experiment), you'll quickly explode the number of columns on your user table (afaik the solution here is generally 'call your account rep to delete old user properties'). If you go with method 2, well, now you've got structured data within a property value, and YMMV about how you can query it. Best case, you'll end up doing some weird "contains" logic each time you want to query (or worse, having to regex your way to answers in SQL), something like the sketch below.
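To make that concrete, here's roughly what those queries devolve into when every experiment is packed into one user property. The property name and the encoding are hypothetical.

```python
# Hypothetical method-2 pain: all experiments jammed into a single user
# property such as "ez74-on,ez75-control", so every question becomes a
# substring (or regex) match against that blob.
PACKED_PROPERTY_QUERY = """
SELECT COUNT(*)
FROM users
WHERE experiments LIKE '%ez74-on%';  -- and hope no other key contains 'ez74-on'
"""
```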
But it gets worse! Say you gradually turn up the percentage of people you're exposing. Things get very confusing very quickly, because a property on the user can't tell you when each user's assignment changed as the percentage ramped.
The final pain with putting it on the user is that now you always need to join/query/update schemas for two tables in Redshift. Not the worst thing in the world, but it does mean 2 critical data warehousing syncs to monitor.
The problems with putting it on an event are similar to putting it on the user, but in my experience it's even tougher to get a grasp of what experiments have been run. Every event is running around with a ton of properties on it, but to see whether a user is currently in an experiment you now have to go look at their most recent events.
So what's the answer? Let's just fire a new event when we expose a user. This solves all the issues mentioned above. We can clearly search for all experiment events, and we only need three columns (see part 1 for thoughts on experiment schema design) for as many experiments as we want to run:
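Something like this, for example. The exact event and property names here are my own placeholders, not a prescription.

```python
# An example exposure event: three pieces of data (user, experiment, arm)
# cover any number of experiments. Names are placeholders.
exposure_event = {
    "event": "experiment_exposure",
    "user_id": "user_123",
    "experiment": "ez74_new_header",
    "arm": "ez74-on",
}
```

Group those by experiment and arm, join on conversions, and you get exactly the results table above.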
And voila! If we simply want to compare conversion rates between the two arms of an experiment we can stop now, but if we want to look at our conversion rates over time or look at retention, there's one more subtle problem to conquer.
Experiments are the most volatile part of your codebase. I know we're supposed to 'not look' at the results of an experiment until it's reached significance, but let's be honest: we all peek. And it's for a good reason, I swear! Experiments break all the time, and if we only look at the big-picture conversion change it's easy to miss what's really happening.
So what do we do? We add the exposure date to our SQL, or choose a conversion funnel over time in Amplitude. However, if we simply break down the exposure conversion funnel by when we've exposed people, we're almost certainly going to have 'exposed' people many times. If the experiment is 'ez74_new_header', then we've likely exposed them every single time we've rendered the header.
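In SQL terms (same illustrative table names as before), the naive version just buckets by the date of every exposure event, which is exactly where the over-counting sneaks in:

```python
# Naive conversion-over-time query: buckets by the date of every exposure
# event, so a user exposed on five different days lands in five buckets.
NAIVE_OVER_TIME_QUERY = """
SELECT
    DATE_TRUNC('day', e.occurred_at) AS exposure_day,
    e.arm,
    COUNT(DISTINCT e.user_id) AS exposures,
    COUNT(DISTINCT c.user_id) AS conversions
FROM exposure_events e
LEFT JOIN conversion_events c
  ON c.user_id = e.user_id
 AND c.occurred_at >= e.occurred_at
WHERE e.experiment = 'ez74_new_header'
GROUP BY 1, 2;
"""
```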
The big problem is that if people enter our experiments multiple times, they'll appear on multiple days. Amplitude and MixPanel each treat this situation subtly differently depending on which report you're running and what time window you're looking at, and in my experience the gotchas on both sides make analysis quite error-prone. If what you want is for each person to land in one and only one bucket, the bucket corresponding to the first time they were exposed, you're going to have a hard time getting it. You can stretch your SQL skills to find only the first exposure, of course, but then your Amplitude and your Redshift will always give you different numbers, and that's nobody's idea of a good time.
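For the record, that stretch-your-SQL-skills version looks roughly like the window-function query below (again, table names are illustrative). It works, but it's exactly the kind of hackery we'd rather not maintain alongside the point-and-click charts.

```python
# Keep only each user's first exposure per experiment via ROW_NUMBER(),
# then join conversions onto that. Correct, but easy to get subtly wrong
# and guaranteed to drift from what the analytics tools show.
FIRST_EXPOSURE_QUERY = """
WITH ranked AS (
    SELECT
        e.*,
        ROW_NUMBER() OVER (
            PARTITION BY e.user_id, e.experiment
            ORDER BY e.occurred_at
        ) AS exposure_rank
    FROM exposure_events e
),
first_exposures AS (
    SELECT * FROM ranked WHERE exposure_rank = 1
)
SELECT
    DATE_TRUNC('day', f.occurred_at) AS exposure_day,
    f.arm,
    COUNT(DISTINCT f.user_id) AS exposures,
    COUNT(DISTINCT c.user_id) AS conversions
FROM first_exposures f
LEFT JOIN conversion_events c
  ON c.user_id = f.user_id
 AND c.occurred_at >= f.occurred_at
WHERE f.experiment = 'ez74_new_header'
GROUP BY 1, 2;
"""
```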
The easiest, sanest fix, in my opinion, is to only fire the first exposure event for each user + experiment + arm combination. By only firing the first one, we no longer need any advanced SQL hackery to get the "first event"; we can simply group by. How do we go about only firing one event? We need a simple service that will tell us whether we've already done something. You can outsource this to something like Prefab.cloud, or build a simple table that tracks when you've fired an event and returns false after the first time.
Your resulting tracking code should look something like this.
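Here's a minimal sketch in Python. The `track` callable stands in for your Amplitude/Mixpanel client, and the in-memory set stands in for Prefab.cloud or a small "already fired" table in your own database.

```python
# Minimal sketch: fire the exposure event only the first time we see a given
# user + experiment + arm. Swap the in-memory set for Prefab.cloud or an
# "already fired" table, and `track` for your real analytics client.
from typing import Callable

_already_fired: set[tuple[str, str]] = set()

def first_time(user_id: str, key: str) -> bool:
    """Return True only the first time this (user, key) pair is seen."""
    if (user_id, key) in _already_fired:
        return False
    _already_fired.add((user_id, key))
    return True

def track_exposure(track: Callable, user_id: str, experiment: str, arm: str) -> None:
    """Send one (and only one) exposure event per user/experiment/arm."""
    if first_time(user_id, f"exposure:{experiment}:{arm}"):
        track(
            user_id=user_id,
            event="experiment_exposure",
            properties={"experiment": experiment, "arm": arm},
        )
```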
With some basic deduplication in place, our analysis is now as easy as Boston Cream Pie. Did I mention we're hiring and we have a cupcake policy?