How to Choose the Right Metrics for Your Experimentation Platform

Discover how collecting the right Experimentation Platform metrics will help you mature your experimentation program, unlock new organizational learnings, and strengthen your company's experimentation culture.

Xavier Gumara Rigol
6 min read · Oct 17, 2023
Image by author, created using Midjourney

Experimentation Platforms enable the design and execution of controlled experiments, such as A/B tests, where different versions of an element are presented to users to determine which performs better based on specified business metrics.

They’ve become essential tools for data-driven digital businesses because they quantify the impact of changes on user behavior, allowing Product Managers to make data-informed decisions when introducing new product features.

We believe measuring the effectiveness of an Experimentation Platform is a necessary success factor of a mature experimentation program, as it will help you unlock new organizational learnings and strengthen your company's experimentation culture.

In this article we detail the crucial metrics required to evaluate an Experimentation Platform running in production, whether built in-house or bought.

The Need for Platform Metrics

Platform Engineering, the discipline of building and operating the internal systems other employees in an organization rely on (such as compute platforms, data platforms, or software development toolkits), has been growing as a distinct practice in recent years.

In the case of experimentation platforms, some vendors provide metrics out of the box, but in-house development is frequently needed to get the full picture.

Collecting and monitoring these metrics is sometimes regarded as an administrative burden, and we often see organizations prioritizing initiatives with clearer impact over selecting, calculating, and monitoring platform metrics.

The Myth of the Single Metric Solution

Unfortunately, there isn't a single metric that can be used to measure the effectiveness of an Experimentation Platform perpetually. Instead, we recommend a multi-metric approach: depending on the maturity of your experimentation program and your business priorities, you choose which metric to optimize for in a given period.

Experimentation Platform Metrics

The following metrics are the ones we recommend for continuous monitoring of an Experimentation Platform, and we believe every vendor should include them out of the box in their products.

Number of experiments

This is the most obvious one. When companies start their experimentation journey, they will focus on increasing the number of experiments their people are able to run and trust in a given period.

Before starting to count, you must agree internally on the definition of “experiment”: should you include inconclusive experiments? Should you include experiments that were stopped because of quality issues? Should you include A/A tests?

"Number of experiments" can be split along two dimensions: one for trustworthiness (or the lack thereof) and another to differentiate A/B tests from A/A tests.

We recommend being explicit when communicating about "number of experiments" and using the right "sub-definition" at the right time. Here are the ones we use, with a small classification sketch after the list:

  • Trustworthy completed A/B test. Finished A/B test where the information has been used to make or inform a decision and data is accurate.
  • Unreliable A/B test. Finished (or stopped earlier) A/B test with data quality issues and/or bugs in the implementation.
  • Trustworthy completed A/A test. Finished A/A test where the information has been used to make or inform a decision and data is accurate.
  • Unreliable A/A test. Finished (or stopped earlier) A/A test with data quality issues and/or bugs in the implementation.
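To make the distinction concrete, here is a minimal Python sketch of how experiments could be bucketed into these sub-definitions. The field names (kind, finished, data_quality_ok, used_for_decision) are illustrative assumptions; your platform will record something equivalent under different names.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    kind: str                 # "ab" or "aa" (hypothetical field names)
    finished: bool            # reached its planned end date
    data_quality_ok: bool     # no data quality issues or implementation bugs
    used_for_decision: bool   # results were used to make or inform a decision

def classify(exp: Experiment) -> str:
    """Map an experiment onto one of the four sub-definitions above."""
    test = "A/B test" if exp.kind == "ab" else "A/A test"
    if not exp.data_quality_ok:
        return f"Unreliable {test}"
    if exp.finished and exp.used_for_decision:
        return f"Trustworthy completed {test}"
    return f"Other {test}"  # e.g. still running, or finished but never acted on

print(classify(Experiment("ab", finished=True, data_quality_ok=True, used_for_decision=True)))
# -> Trustworthy completed A/B test
```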

Here are some examples of cases where you might want to focus on one or the other:

  • Instead of setting a yearly goal to increase the number of experiments, you can be explicit about actually wanting to increase trustworthy completed A/B tests.
  • If a lot of experiments get stopped early due to quality issues, you can set goals towards decreasing unreliable A/B tests.
  • If teams are not A/A testing their setups, having a goal to run more A/A tests is a great idea.

When collecting the number of experiments, it is also beneficial to capture several dimensions. The most useful dimensions, with some possible values, are:

  • Conclusion type: win, lose, inconclusive, did not finish.
  • Team owning the experiment.
  • Status: draft, running, stopped.
  • Source of the idea: user feedback (qualitative), team, quantitative observation, business goal.
  • Theme: communication change, product change, logistics change. Tends to be industry specific.
  • Device, platform and country (if in multiple regions).

You can also include some metrics (a sample record combining both lists is sketched after this list):

  • Experiment length, in weeks.
  • Number of treatments.
  • % of population being targeted and size of each group.
  • Impressions, understood as the number of times a particular variation is displayed.
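Putting the two lists together, a single experiment record could look roughly like the sketch below. The schema and field names are illustrative assumptions, not a prescription:

```python
# A hypothetical experiment record combining the dimensions and metrics above.
experiment_record = {
    # Dimensions
    "conclusion": "win",              # win, lose, inconclusive, did_not_finish
    "team": "checkout",
    "status": "stopped",              # draft, running, stopped
    "idea_source": "user_feedback",   # user_feedback, team, quantitative, business_goal
    "theme": "product_change",
    "device": "mobile",
    "platform": "ios",
    "country": "NO",
    # Metrics
    "length_weeks": 3,
    "num_treatments": 2,
    "population_targeted_pct": 50.0,
    "group_sizes": [125_000, 125_000],
    "impressions": 1_840_000,
}
```

Storing experiments in a shape like this makes the KPIs described next straightforward aggregations.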

These dimensions and metrics provide the foundation for creating multiple Key Performance Indicators (KPIs) to track and act upon. They also enable a meta-analysis of experiments to identify the key factors contributing to successful outcomes, allowing you to either focus more on those factors or discover opportunities for improvement. Here are a few examples (with a meta-analysis sketch after the list):

  • If your goal is to increase the number of experiments, you can check the data and decide if it is more impactful to enable new teams to start experimenting, or help teams that already run A/B tests to experiment more.
  • If a certain source of ideas produces more positive results, you can double down on that.
  • Certain commercial experimentation tools use impressions as the basis for measuring usage and for billing purposes. If your platform is developed in-house and you're contemplating whether to purchase a third-party solution, monitoring impressions lets you estimate what such a license would cost.
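As a sketch of the meta-analysis described above, and assuming the records are loaded into a pandas DataFrame with the hypothetical idea_source and conclusion columns, the second example becomes a simple group-by:

```python
import pandas as pd

# experiments: one row per experiment, using the hypothetical columns from the record above
experiments = pd.DataFrame([
    {"idea_source": "user_feedback", "conclusion": "win"},
    {"idea_source": "user_feedback", "conclusion": "inconclusive"},
    {"idea_source": "team",          "conclusion": "lose"},
    {"idea_source": "team",          "conclusion": "win"},
    {"idea_source": "business_goal", "conclusion": "lose"},
])

# Share of each conclusion type per source of idea
breakdown = (
    experiments.groupby("idea_source")["conclusion"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(breakdown)
```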

Win rate / Fail rate

One of the KPIs that can be calculated with the above information is the win rate, which measures the percentage of A/B tests that produce a trustworthy and statistically significant positive outcome. Conversely, you can also track the fail rate, representing the percentage of experiments that do not yield positive results or do not reach statistical significance.
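In its simplest form the calculation looks like the sketch below; the hypothetical significant and lift fields stand in for however your platform records statistical significance and the direction of the effect.

```python
def win_and_fail_rate(experiments: list[dict]) -> tuple[float, float]:
    """Win rate: share of finished, trustworthy A/B tests with a significant positive result.
    Fail rate: the complement over the same denominator.
    Field names ("kind", "finished", "data_quality_ok", "significant", "lift") are hypothetical.
    """
    completed = [
        e for e in experiments
        if e["kind"] == "ab" and e["finished"] and e["data_quality_ok"]
    ]
    if not completed:
        return 0.0, 0.0
    wins = sum(1 for e in completed if e["significant"] and e["lift"] > 0)
    win_rate = wins / len(completed)
    return win_rate, 1 - win_rate
```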

Win rates indicate how often changes or innovations lead to improvements. A high win rate is not necessarily a good sign, as it might simply mean the product still has plenty of room to be optimized further. About two thirds of ideas fail at Microsoft, and Booking and Google Ads report failure rates of around 80 to 90% for ideas (source).

Business metrics

Organizations greatly benefit from an aggregate view of the cumulative impact of their experiments on critical business metrics. This begins with standardizing business metrics such as revenue, profitability, Customer Lifetime Value (CLV), Gross Merchandise Value (GMV), or retention in the platform, so that teams can designate them as either key metrics or guardrail metrics in their A/B tests.

The platform should then provide aggregated numbers for those metrics, so that the impact of experimentation on these vital business indicators can be measured beyond the mere number of experiments.
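A simple (and admittedly naive) way to do this is to sum the estimated impact of shipped, trustworthy experiments per business metric, as in the sketch below. The shipped and estimated_annualized_impact fields are assumptions, and summed per-experiment estimates tend to overstate the true combined effect, so treat the result as an upper bound.

```python
from collections import defaultdict

def cumulative_impact(experiments: list[dict]) -> dict[str, float]:
    """Sum estimated annualized impact per business metric across shipped experiments.

    Assumes each experiment carries a hypothetical mapping like
    {"revenue": 120_000, "retention": 0.002} for its winning variant.
    """
    totals: dict[str, float] = defaultdict(float)
    for exp in experiments:
        if exp.get("shipped") and exp.get("data_quality_ok"):
            for metric, impact in exp.get("estimated_annualized_impact", {}).items():
                totals[metric] += impact
    return dict(totals)
```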

Experiments per employee

We often hear about large companies running thousands of experiments annually, which sounds impressive. These big numbers can be somewhat misleading since they often overlook the fact that larger companies have a greater number of employees dedicated to these initiatives. This oversight can create unrealistic expectations for smaller companies in their pursuit of an effective experimentation strategy.

Measuring experiments per employee is an insightful way to assess the effectiveness and efficiency of an experimentation platform within an organization in proportion to its size and human capacity.

Cost

Lastly, it’s crucial to monitor the costs associated with running the platform. Understanding the cost of infrastructure, team salaries, and licensing fees helps organizations make informed decisions about resource allocation and budget management.

Senior managers can utilize this information to assess the ROI of the experimentation platform and make strategic decisions about scaling, optimizing, or fine-tuning the experimentation process to align with the company’s financial objectives and growth strategies.

You can track these costs in absolute numbers, providing a clear picture of the financial investment involved, or as a percentage of revenue, which offers a contextually relevant perspective.
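As a minimal sketch with placeholder numbers, both views (plus a cost-per-trustworthy-experiment ratio that ties this metric back to the earlier ones) could be computed like this:

```python
# Placeholder annual figures; replace with your own accounting data.
infrastructure_cost = 120_000      # cloud and tooling
team_cost = 600_000                # salaries of the platform team
licensing_cost = 80_000            # vendor fees, if any
annual_revenue = 50_000_000
trustworthy_ab_tests = 180

total_cost = infrastructure_cost + team_cost + licensing_cost
print(f"Platform cost (absolute): {total_cost:,.0f}")
print(f"Platform cost as % of revenue: {total_cost / annual_revenue:.2%}")
print(f"Cost per trustworthy A/B test: {total_cost / trustworthy_ab_tests:,.0f}")
```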


Final words

The metrics you select to evaluate your experimentation platform will help determine the effectiveness of your experimentation program, the maturity of your organization’s learning process, and the strength of your experimentation culture.

We advocate for a multi-metric approach that considers the maturity of your experimentation program and current business priorities to decide which metric to focus on.


Xavier Gumara Rigol

Passionate about data product management, distributed data ownership and experimentation. Engineering Manager at oda.com