Dealing with False Positive A/B Test Results in Google Play Experiments

⛏️ Guest Miner: Sylvain Gauchet
💎 x 6

Simon Thillay (Head of ASO at AppTweak, an ASO tool) presents a testing protocol to help reduce statistical noise in Google Play Experiments and identify possible false-positive results.

Source: Dealing with False Positive A/B Test Results in Google Play Experiments
Type: Presentation
Publication date: May 12, 2020
Added to the Vault on: May 13, 2020
These insights were shared through the free Growth Gems newsletter.
Gems are the key bite-size insights "mined" from a specific mobile marketing resource, like a webinar, a panel or a podcast.
They allow you to save time by grasping the most important information in a couple of minutes, and each includes a timestamp from the source.

💎 #
1

All samples (audience for each variation) are not created equal: Google does not provide data, and users coming from different sources do not have the same propensity to download your app. This creates statistical noise. 

03:38
💎 #
2

A successful test is a conclusive test, not a test that gives you the results you wanted. You want the test to be statistically significant and reliable.
→ A negative or even neutral result can qualify as a successful test.

06:45
💎 #
3

Statistical power is the probability that a statistical test will detect a difference between two values when there truly is an underlying difference. Check out this paper by Qubit: "Most winning A/B test results are illusory".

08:36
💎 #
4

It's important to use the same sample size for all of your samples; otherwise, comparisons are tougher because you don't have the same probability of getting true positive and false positive results.

10:07
💎 #
5

To flag false positives when running Google Play Experiments (and mitigate the relatively low 90% statistical significance threshold), run A/B/B tests: 7 days minimum, traffic composition stable and similar to usual, two "B" samples to check whether you get the same result, and an equal traffic split between samples.

10:38
💎 #
6

If you're running a massive UA campaign in the middle of your Google Play Experiment, you are basically knocking everything down: the traffic composition shifts, which skews the results.

11:04


Google Play Experiments basics

Sometimes, after applying an experiment's winning variation, you do not see the same impact live.


[💎@03:38] All samples (audience for each variation) are not created equal: Google does not provide data, and users coming from different sources do not have the same propensity to download your app. This creates statistical noise.


Google doesn't account for weekly seasonality. Google recommends running a test for at least 7 days, even though the product itself can declare winners earlier.

Seasonality is important: games are downloaded more on weekends, while business apps are downloaded more during the week.


Learning more about A/B testing


[💎@06:45] A successful test is a conclusive test, not a test that gives you the results you wanted. You want the test to be statistically significant and reliable.

→ A negative or even neutral result can qualify as a successful test.

Expect that the vast majority of your tests will show negative or neutral results. Finding the low-hanging fruit takes time.


[💎@08:36] Statistical power is the probability that a statistical test will detect a difference between two values when there truly is an underlying difference.

Here is the link to the paper that Simon referred to: "Most winning A/B test results are illusory".
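
As a rough illustration of the concept (not how Google computes its results), here is a minimal Python sketch that approximates the power of a two-proportion test with the normal approximation; the baseline conversion rate, lift and sample size are made-up numbers.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_a: float, p_b: float, n_per_variant: int, alpha: float = 0.10) -> float:
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the chosen alpha
    se = sqrt(p_a * (1 - p_a) / n_per_variant + p_b * (1 - p_b) / n_per_variant)
    z_true = abs(p_b - p_a) / se                   # standardized size of the true difference
    # Probability that the observed difference lands beyond the critical value
    return NormalDist().cdf(z_true - z_crit) + NormalDist().cdf(-z_true - z_crit)

# Hypothetical numbers: 30% baseline conversion, a true +2pp lift, 2,000 visitors per variant
print(f"Power: {approx_power(0.30, 0.32, 2_000):.0%}")
```

With these made-up numbers, the test would detect a real +2 point lift only around 40% of the time, which is the kind of underpowered setup the paper above warns about.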


Statistical significance (90% confidence in Google Play Experiments) is achieved when your test reveals a big enough difference between the results measured in sample A and sample B.
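
Google does not publish the exact statistics behind Play Console experiments, but as a rough mental model you can think of it as a two-proportion z-test run at a 90% confidence level; the install and visitor counts below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def significant_at_90(installs_a: int, visitors_a: int, installs_b: int, visitors_b: int) -> bool:
    """Two-sided two-proportion z-test at alpha = 0.10 (i.e. 90% confidence)."""
    p_a, p_b = installs_a / visitors_a, installs_b / visitors_b
    pooled = (installs_a + installs_b) / (visitors_a + visitors_b)   # pooled rate under "no difference"
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    p_value = 2 * (1 - NormalDist().cdf(abs((p_b - p_a) / se)))
    return p_value < 0.10

# Hypothetical counts: 600 installs out of 2,000 visitors for A vs 660 out of 2,000 for B
print(significant_at_90(600, 2_000, 660, 2_000))   # True with these made-up numbers
```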


[💎@10:07] It's important to use the same sample size for all of your samples; otherwise, comparisons are tougher because you don't have the same probability of getting true positive and false positive results.

Flagging false positive differences with A/B/B tests

[💎@10:38] To flag false positives when running Google Play Experiments (and mitigate the relatively low 90% statistical significance threshold), run A/B/B tests: 7 days minimum, traffic composition stable and similar to usual, two "B" samples to check whether you get the same result, and an equal traffic split between samples.
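
Here is a minimal sketch of how you could read an A/B/B test once you have each variant's counts. All counts and the helper below are hypothetical, and the 90% z-test is the same simplification as above, not Google's actual methodology: since B1 and B2 are identical listings, a "significant" gap between them can only be noise, which flags the run as unreliable.

```python
from math import sqrt
from statistics import NormalDist

def differs_at_90(installs_1: int, visitors_1: int, installs_2: int, visitors_2: int) -> bool:
    """Two-sided two-proportion z-test at the 90% confidence level (same simplification as above)."""
    p1, p2 = installs_1 / visitors_1, installs_2 / visitors_2
    pooled = (installs_1 + installs_2) / (visitors_1 + visitors_2)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_1 + 1 / visitors_2))
    return 2 * (1 - NormalDist().cdf(abs((p2 - p1) / se))) < 0.10

# Hypothetical exported counts (installs, visitors) for an A/B/B test with an equal traffic split
variants = {"A": (600, 2_000), "B1": (655, 2_000), "B2": (605, 2_000)}

# B1 and B2 are identical listings, so a "significant" gap between them can only be noise.
if differs_at_90(*variants["B1"], *variants["B2"]):
    print("B1 and B2 disagree: treat any A-vs-B 'win' from this run as a likely false positive.")
elif differs_at_90(*variants["A"], *variants["B1"]) and differs_at_90(*variants["A"], *variants["B2"]):
    print("Both B samples show a significant difference vs A: the effect looks more trustworthy.")
else:
    print("Inconclusive: keep the control or re-run with more traffic.")
```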


[💎@11:04] If you're running a massive UA campaign in the middle of your Google Play Experiment, you are basically knocking everything down: the traffic composition shifts, which skews the results.


Note: the sample shown at the bottom of the slide is quite small.


Never guess about negative or positive trends: if a variation's performance range spans from negative to positive, the test is inconclusive.
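
To make that rule mechanical, one way (assuming a simple normal-approximation interval, which is only a stand-in for whatever the Play Console actually reports) is to check whether the 90% confidence interval of the difference crosses zero; the counts below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def diff_ci_90(installs_a: int, visitors_a: int, installs_b: int, visitors_b: int):
    """90% confidence interval for the conversion-rate difference B - A (normal approximation)."""
    p_a, p_b = installs_a / visitors_a, installs_b / visitors_b
    se = sqrt(p_a * (1 - p_a) / visitors_a + p_b * (1 - p_b) / visitors_b)
    z = NormalDist().inv_cdf(0.95)                  # two-sided 90% interval
    diff = p_b - p_a
    return diff - z * se, diff + z * se

low, high = diff_ci_90(600, 2_000, 630, 2_000)      # hypothetical counts
verdict = "inconclusive" if low < 0 < high else "conclusive"
print(f"{verdict} ({low:+.1%} to {high:+.1%})")     # here: inconclusive, the range spans zero
```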


Check out AppTweak's article here to read more about a study in preparation with different companies, and here is the template to share your results if you want to participate.

