Dealing with False Positive A/B Test Results in Google Play Experiments

Simon Thillay (Head of ASO at AppTweak, an ASO tool) presents a testing protocol to help reduce statistical noise in Google Play Experiments and identify possible false-positive results.

Source: Dealing with False Positive A/B Test Results in Google Play Experiments
Type: Presentation
Publication date: May 12, 2020
Added to the Vault on: May 13, 2020
These insights were shared through the free Growth Gems newsletter.
💎 #1

All samples (the audience for each variation) are not created equal: Google does not provide data on the composition of each sample, and users coming from different sources do not have the same propensity to download your app. This creates statistical noise.

03:38
💎 #2

A successful test is a conclusive test, not a test that gives you the results you wanted. You want the test to be statistically significant and reliable.
→ A negative or even neutral result can qualify as a successful test.

06:45
💎 #3

Statistical power is the probability that a statistical test will detect a difference between two values when there truly is an underlying difference. Check out the Qubit paper "Most winning A/B test results are illusory".

08:36
💎 #4

It's important that you use the same sample size for all of your samples; otherwise it's tougher to make comparisons because you don't have the same probability of having true positive results and false positive results.

10:07
💎 #5

To flag false positives when running Google Play Experiments (and mitigate the low statistical significance threshold), run A/B/B tests: 7 days minimum, traffic composition stable and similar to usual, two "B" samples to see if you get the same result, and an equal traffic split between samples.

10:38
💎 #6

If you're running a massive UA campaign in the middle of your Google Play Experiment, you are basically knocking everything down (the sudden change in traffic composition skews the results).

11:04
The "gems" from this resource are only available to premium members.
  • Unlock access to gems from over 130 mobile growth resources
  • Define your preferred categories and receive new relevant gems directly in your inbox
  • Discuss key insights (and any other mobile growth topic) in the members-only community.
Upgrade Your Plan
💎 #
1

All samples (audience for each variation) are not created equal: Google does not provide data, and users coming from different sources do not have the same propensity to download your app. This creates statistical noise. 

03:38
💎 #
2

A successful test is a conclusive test, not a test that gives you the results you wanted. You want the test to be statistically significant and reliable.
→ A negative or even neutral result can qualify as a successful test.

06:45
💎 #
3

Statistical power is the probability that a statistical test will detect a difference between two values when there truly is an underlying difference. Check out this paper by Qubit

08:36
💎 #
4

It's important that you use the same sample size for all of your samples, otherwise it's tougher to make comparison because your don't have the same probability of having true positive results and false positive results. 

10:07
💎 #
5

To flag false positive when running Google Play Experiments (and mitigate the low statistical significance), run A/B/B tests: 7 days minimum, traffic composition stable & similar to usual, 2 "B" samples to see if you get the same result, equal traffic split between samples.

10:38
💎 #
6

If you're running a massive UA campaign in the middle of your Google Play Experiment you are basically knocking everything down. 

11:04
The "gems" from this resource are only available to premium members.

Gems are the key bite-size insights "mined" from a specific mobile marketing resource, like a webinar, a panel or a podcast.
They allow you to save time by grasping the most important information in a couple of minutes, and also each include the timestamp from the source.

Become a member to:
  • Unlock access to gems from over 130 mobile growth resources
  • Define your preferred categories and receive new relevant gems directly in your inbox
  • Discuss key insights (and any other mobile growth topic) in the members-only community.
Request Access
💎 #
1

All samples (audience for each variation) are not created equal: Google does not provide data, and users coming from different sources do not have the same propensity to download your app. This creates statistical noise. 

03:38
💎 #
2

A successful test is a conclusive test, not a test that gives you the results you wanted. You want the test to be statistically significant and reliable.
→ A negative or even neutral result can qualify as a successful test.

06:45
💎 #
3

Statistical power is the probability that a statistical test will detect a difference between two values when there truly is an underlying difference. Check out this paper by Qubit

08:36
💎 #
4

It's important that you use the same sample size for all of your samples, otherwise it's tougher to make comparison because your don't have the same probability of having true positive results and false positive results. 

10:07
💎 #
5

To flag false positive when running Google Play Experiments (and mitigate the low statistical significance), run A/B/B tests: 7 days minimum, traffic composition stable & similar to usual, 2 "B" samples to see if you get the same result, equal traffic split between samples.

10:38
💎 #
6

If you're running a massive UA campaign in the middle of your Google Play Experiment you are basically knocking everything down. 

11:04
The access to discussions on each resource is only available to premium members.

Growth Gems members discuss resources and their key insights (as well as other mobile growth topics) in the community. It's the perfect way to dig deeper, ask questions and get additional perspectives.
Upgrade to premium to:
  • Unlock access to key insights from over 130 mobile growth resources
  • Define your preferred categories and receive new relevant gems directly in your inbox
  • Discuss key insights (and any other mobile growth topic) in the members-only community.
Upgrade Your Plan
The access to discussions on each resource is only available to premium members.

Growth Gems members discuss resources and their key insights (as well as other mobile growth topics) in the community. It's the perfect way to dig deeper, ask questions and get additional perspectives.
Become a member to:
  • Unlock access to gems from over 130 mobile growth resources
  • Define your preferred categories and receive new relevant gems directly in your inbox
  • Discuss key insights (and any other mobile growth topic) in the members-only community.
Request Access

You need to be logged in the community to be able to see the discussion below.
You can also head over directly to this topic in the community

Notes for this resource are currently being transferred and will be available soon.

Google Play Experiments basics

Sometimes, after applying an experiment's winning variant, you do not see the impact live.

Google-Play-Experiments-No-Live-Impact.png


[💎@03:38] All samples (the audience for each variation) are not created equal: Google does not provide data on the composition of each sample, and users coming from different sources do not have the same propensity to download your app. This creates statistical noise.


Google doesn't account for weekly seasonality. Google recommends running a test for at least 7 days, even though the product itself can declare a winner earlier.

Google-Play-Experiments-Seasonality.png

Seasonality is important: games are downloaded more on weekends, and business apps during the week.
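
To make the seasonality check concrete, here is a minimal sketch (my own illustration, not from the presentation) of grouping daily store listing data by weekday before scheduling a test; the CSV file and its column names are assumptions.

```python
# Minimal sketch (not from the presentation): check weekly seasonality in
# your store listing conversion rate before/while running an experiment.
# Assumes a hypothetical CSV export with columns: date, visitors, installs.
import csv
from collections import defaultdict
from datetime import datetime

visitors_by_weekday = defaultdict(int)
installs_by_weekday = defaultdict(int)

with open("store_listing_daily.csv") as f:  # hypothetical export file
    for row in csv.DictReader(f):
        weekday = datetime.strptime(row["date"], "%Y-%m-%d").strftime("%A")
        visitors_by_weekday[weekday] += int(row["visitors"])
        installs_by_weekday[weekday] += int(row["installs"])

# A large weekday/weekend gap is the seasonality a 7-day-minimum test smooths out.
for day, visitors in visitors_by_weekday.items():
    print(f"{day}: conversion rate = {installs_by_weekday[day] / visitors:.2%}")
```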


Learning more about A/B testing


[💎@06:45] A successful test is a conclusive test, not a test that gives you the results you wanted. You want the test to be statistically significant and reliable.

→ A negative or even neutral result can qualify as a successful test.

Expect that the vast majority of your tests will show negative or neutral results. Finding the low-hanging fruit takes time.


[💎@08:36] Statistical power is the probability that a statistical test will detect a difference between two values when there truly is an underlying difference.

Here is the link to the paper Simon referred to: "Most winning A/B test results are illusory".


Statistical significance (90% in Google Play Experiments) is achieved when your test reveals a big enough difference between the results measured in sample A and sample B.

Significance-vs-statistical-power.png
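
As a companion to the power vs. significance distinction, here is a minimal sketch (my own illustration, not from the talk) of the classic two-proportion sample-size approximation. The 90% confidence level mirrors Google Play Experiments; the baseline rate, expected lift, and 80% power target are assumed numbers.

```python
# Minimal sketch: approximate sample size per variant needed to detect a given
# lift in conversion rate with a two-proportion z-test. Illustrative only.
from math import sqrt, ceil
from scipy.stats import norm

def sample_size_per_variant(p_a, p_b, alpha=0.10, power=0.80):
    """Classic two-proportion sample-size approximation (per group)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # significance: 90% confidence -> alpha = 0.10
    z_beta = norm.ppf(power)            # statistical power: 80%
    p_bar = (p_a + p_b) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return ceil(numerator / (p_b - p_a) ** 2)

# e.g. detecting a 30% -> 31.5% conversion rate (a relative +5% lift)
print(sample_size_per_variant(0.30, 0.315), "visitors per variant")
```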


[💎@10:07] It's important that you use the same sample size for all of your samples; otherwise it's tougher to make comparisons because you don't have the same probability of having true positive results and false positive results.
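
A small simulation (illustrative, not from the presentation) shows why: with the same total traffic and the same true lift, an unequal split detects the lift less often than an equal split, so the probability of a true positive is not the same.

```python
# Minimal sketch: empirical detection rate (true positives) for an equal vs.
# unequal traffic split, same total traffic, same true lift. Assumed numbers.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def detection_rate(n_a, n_b, p_a=0.30, p_b=0.32, alpha=0.10, runs=2000):
    hits = 0
    for _ in range(runs):
        conv_a = rng.binomial(n_a, p_a)
        conv_b = rng.binomial(n_b, p_b)
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (conv_b / n_b - conv_a / n_a) / se
        hits += int(2 * norm.sf(abs(z)) < alpha)  # two-sided test at 90% confidence
    return hits / runs

print("50/50 split:", detection_rate(5000, 5000))
print("80/20 split:", detection_rate(8000, 2000))
```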

Flagging false positive differences with A/B/B tests

[💎@10:38] To flag false positives when running Google Play Experiments (and mitigate the low statistical significance threshold), run A/B/B tests: 7 days minimum, traffic composition stable and similar to usual, two "B" samples to see if you get the same result, and an equal traffic split between samples.

Experiments-ABB-Test.png
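
For context, here is a minimal sketch (assumed numbers, not the speaker's tooling) of how an A/B/B readout can be sanity-checked with simple two-proportion z-tests: if the two identical "B" samples disagree with each other, a "win" for one of them is probably noise.

```python
# Minimal sketch: read an A/B/B test by comparing every pair of samples.
# Conversion counts and sample sizes below are hypothetical.
from math import sqrt
from scipy.stats import norm

def z_test_p_value(conv_1, n_1, conv_2, n_2):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_1 + conv_2) / (n_1 + n_2)
    se = sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    return 2 * norm.sf(abs((conv_2 / n_2 - conv_1 / n_1) / se))

a, b1, b2 = (300, 1000), (340, 1000), (305, 1000)  # (conversions, visitors)

print("A  vs B1:", z_test_p_value(*a, *b1))
print("A  vs B2:", z_test_p_value(*a, *b2))
print("B1 vs B2:", z_test_p_value(*b1, *b2))
# If B1 "beats" A but B2 does not, and B1 vs B2 disagree, the B1 win is
# probably a false positive rather than a real improvement.
```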


[💎@11:04] If you're running a massive UA campaign in the middle of your Google Play Experiment, you are basically knocking everything down (the sudden change in traffic composition skews the results).


ABBTests-Examples.png

Note: the sample at the bottom is quite small.


Never guess about negative or positive trends: if the performance range for a variant spans from negative to positive, the test is inconclusive.

Experiments-Trends.png
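
As a concrete illustration (assumed numbers), a 90% confidence interval for the conversion-rate difference that runs from negative to positive is exactly this inconclusive case.

```python
# Minimal sketch: 90% confidence interval for the lift of B over A.
# If it spans negative to positive, the test is inconclusive.
from math import sqrt
from scipy.stats import norm

def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.90):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = norm.ppf(1 - (1 - confidence) / 2)
    return (p_b - p_a) - z * se, (p_b - p_a) + z * se

low, high = lift_confidence_interval(300, 1000, 315, 1000)
print(f"90% CI for the lift: [{low:+.2%}, {high:+.2%}]")
if low < 0 < high:
    print("Range spans negative to positive -> inconclusive, don't guess.")
```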


Check AppTweak's article here to read more about a study in preparation with different companies. Here is the template to share your results if you want to participate.

