Simon Thillay (Head of ASO at AppTweak, an ASO tool) presents a testing protocol to help reduce statistical noise in Google Play Experiments and identify possible false-positive results.
Not all samples (the audience for each variation) are created equal: Google does not provide data on how each sample is composed, and users coming from different sources do not have the same propensity to download your app. This creates statistical noise.
A successful test is a conclusive test, not a test that gives you the results you wanted. You want the test to be statistically significant and reliable.
→ A negative or even neutral result can qualify as a successful test.
Statistical power is the probability that a statistical test will detect a difference between two values when there truly is an underlying difference. Check out the paper by Qubit, "Most winning A/B test results are illusory".
It's important that you use the same sample size for all of your samples, otherwise it's tougher to make comparisons because you don't have the same probability of getting true positive and false positive results.
To flag false positives when running Google Play Experiments (and mitigate their low statistical significance), run A/B/B tests: 7 days minimum, traffic composition stable and similar to usual, two identical "B" samples to check whether you get the same result, and an equal traffic split between samples.
If you're running a massive UA campaign in the middle of your Google Play Experiment, you are basically knocking everything down: the traffic composition no longer reflects your usual audience.
Sometimes, after applying the winning variation of an experiment, you do not see the impact live.
[💎@03:38] Not all samples (the audience for each variation) are created equal: Google does not provide data on how each sample is composed, and users coming from different sources do not have the same propensity to download your app. This creates statistical noise.
Google doesn't account for weekly seasonality. Google recommends running a test for at least 7 days, even though the product itself can declare winners earlier.
Seasonality is important: games are downloaded more on weekends and business apps during the week.
[💎@06:45] A successful test is a conclusive test, not a test that gives you the results you wanted. You want the test to be statistically significant and reliable.
→ A negative or even neutral result can qualify as a successful test.
Expect that the vast majority of your tests will show negative or neutral results. Finding the low-hanging fruit takes time.
[💎@08:36] Statistical power is the probability that a statistical test will detect a difference between two values when there truly is an underlying difference.
Here is the link to the paper that Simon referred to: "Most winning A/B test results are illusory".
Statistical significance (a 90% confidence level in Google Play Experiments) is achieved when your test reveals a big enough difference between the results measured in sample A and sample B.
[💎@10:07] It's important that you use the same sample size for all of your samples, otherwise it's tougher to make comparisons because you don't have the same probability of getting true positive and false positive results.
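Illustration (not from the talk): a minimal Python sketch of how statistical power behaves at the 90% confidence level used by Google Play Experiments, and why an unequal traffic split hurts. The conversion rates and sample sizes below are made-up numbers.

```python
from math import sqrt
from statistics import NormalDist  # standard library, Python 3.8+

def two_proportion_power(p_a, p_b, n_a, n_b, confidence=0.90):
    """Approximate power of a two-sided two-proportion z-test.

    p_a, p_b: assumed true conversion rates of the two variations.
    n_a, n_b: visitors in each sample.
    confidence: Google Play Experiments uses a 90% confidence level.
    """
    alpha = 1 - confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)              # ~1.645 for 90%
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)  # std. error of the difference
    # Chance of flagging the difference as significant (far tail ignored).
    return NormalDist().cdf(abs(p_b - p_a) / se - z_crit)

# Made-up example: 30% baseline conversion, +1 point lift, 10,000 visitors in total.
print(two_proportion_power(0.30, 0.31, 5000, 5000))  # equal split -> ~0.29
print(two_proportion_power(0.30, 0.31, 9000, 1000))  # 90/10 split -> ~0.16
```

Both calls see the same 10,000 visitors, but the unequal split roughly halves the chance of detecting the real lift, which is the point above about keeping sample sizes equal.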
[💎@10:38] To flag false positives when running Google Play Experiments (and mitigate their low statistical significance), run A/B/B tests: 7 days minimum, traffic composition stable and similar to usual, two identical "B" samples to check whether you get the same result, and an equal traffic split between samples.
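One way to picture the A/B/B check in code (a sketch, not AppTweak's tooling): run the same two-proportion test on B1 vs A, B2 vs A, and B1 vs B2; if the two identical "B" samples don't tell the same story, treat the "win" as a likely false positive. The helper names and counts below are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(conv_1, n_1, conv_2, n_2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_1, p_2 = conv_1 / n_1, conv_2 / n_2
    pooled = (conv_1 + conv_2) / (n_1 + n_2)
    se = sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_2 - p_1) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def abb_verdict(a, b1, b2, alpha=0.10):
    """a, b1, b2: (conversions, visitors) for control and the two identical B samples.

    alpha=0.10 mirrors the 90% confidence level of Google Play Experiments."""
    b1_wins = z_test_p_value(*a, *b1) < alpha
    b2_wins = z_test_p_value(*a, *b2) < alpha
    bs_differ = z_test_p_value(*b1, *b2) < alpha
    if bs_differ or (b1_wins != b2_wins):
        return "suspect: the identical B samples disagree -> likely a false positive"
    return "consistent: B1 and B2 tell the same story"

# Made-up totals after 7 days of stable traffic.
print(abb_verdict(a=(900, 3000), b1=(990, 3000), b2=(915, 3000)))
```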
[💎@11:04] If you're running a massive UA campaign in the middle of your Google Play Experiment, you are basically knocking everything down: the traffic composition no longer reflects your usual audience.
Note: the sample at the bottom is quite small.
Never guess about negative or positive trends. If you have a performance result that ranges from negative to positive, the test is inconclusive.
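A sketch of the "range spans negative to positive" rule under assumed numbers (not AppTweak's method): compute the 90% confidence interval for the conversion-rate difference and refuse to call a winner when it crosses zero.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.90):
    """Confidence interval for the conversion-rate difference (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Made-up counts: 900/3000 conversions on A, 930/3000 on B.
low, high = diff_confidence_interval(900, 3000, 930, 3000)
if low < 0 < high:
    print(f"[{low:+.2%}, {high:+.2%}] spans zero -> inconclusive, do not guess a trend")
else:
    print(f"[{low:+.2%}, {high:+.2%}] -> conclusive result")
```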
Check AppTweak's article here to read more about a study being prepared with several companies. Here is the template to share your results if you want to participate.