Ad copy testing in Google Ads: how to design rigorous RSA tests
RSAs don't test like ETAs did. Methodology, metrics, and common mistakes for designing valid Responsive Search Ads tests in 2026.
Responsive Search Ads (RSA) don’t test the way Expanded Text Ads (ETA) did. Google assembles each impression in real time from a pool of up to 15 headlines and 4 descriptions, showing up to 3 headlines and 2 descriptions at once, so comparing “ad A vs ad B” loses its meaning when both share assets. The right question isn’t “which ad wins”; it’s which assets make the set win.
This guide explains what’s actually testable in RSAs, how to design the experiment so it has statistical validity, which metrics to track, and the most common mistakes that invalidate tests before they even begin.
In 30 seconds:
- RSAs have been the only standard search ad format you can create or edit since 30 June 2022 (Google Ads Help, 2022)
- Google’s Ad Strength score doesn’t predict real performance: Optmyzr’s study of RSAs found that “Excellent” Ad Strength ads have the worst CPA ($28.68) and lowest CVR (4.97%), while “Average” ads delivered the best CPA ($12.43) and highest CVR (12.65%) (Optmyzr, 2024)
- What you test is no longer entire ads, but individual assets (headlines, descriptions) and thematic patterns
- The right method: Drafts & Experiments with parallel campaigns, minimum 4 weeks, minimum 100 conversions per arm
- Recommended cadence: one active test per ad group every 4-6 weeks, no more
Why RSAs don’t test the way they used to
In the ETA model, each ad was a closed unit: up to three headlines, up to two descriptions, one URL, always served together. You compared ad A against ad B and the statistics were straightforward. With RSAs, Google decides at each auction which combination of your assets to serve, adjusted to user, device, and search term. Two accounts with the same assets can see different combinations.
This changes three things:
- You can’t compare “ad against ad” within the same ad group because they share assets. If you add an ad B with the same 15 headlines and change 2 descriptions, Google will mix them and attribute impressions to one or the other without logic you control.
- Ad Strength is directional, not predictive. Optmyzr’s RSA study on real accounts found that “Excellent” Ad Strength ads have the worst CPA ($28.68) and lowest CVR (4.97%) in the dataset, while “Average” ads delivered the best CPA ($12.43) and highest CVR (12.65%) (Optmyzr, 2024). It’s useful to verify you’re not making basic setup mistakes (length, missing keywords), not to bet on the winner.
- The combinations report is opaque. Google shows which combinations were served and how many impressions each received, but not clicks or conversions per combination. That makes it impossible to reconstruct which headline+description pair worked best.
When I walk into a new account and see a “test” of two near-identical RSAs in the same ad group, the first thing I usually do is pause it. It wasn’t measuring what the client thought it was measuring.
What’s actually testable in an RSA
If Google assembles the combinations, what’s left for you to control? More than it looks, but on different levels.
Individual assets (headlines and descriptions)
Google classifies each asset as Best, Good, Low, or Learning based on relative performance. That signal is useful for spotting which headlines work within the current set. The concrete action: pause or replace “Low” assets every 2-4 weeks and supply variations covering different angles (price, social proof, urgency, value proposition).
This isn’t a statistical test; it’s a rotation based on Google’s internal signals. But it’s the most realistic day-to-day optimisation lever.
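The rotation check can be sketched in a few lines, assuming you have copied the asset list out of the interface by hand. The `Asset` class, its field names, and the sample assets are hypothetical; this is not a Google Ads export or API format.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    text: str
    kind: str    # "headline" or "description"
    angle: str   # e.g. "price", "social proof", "urgency"
    rating: str  # "Best", "Good", "Low", or "Learning"


def assets_to_replace(assets):
    """Return assets rated Low; Learning assets are left alone until rated."""
    return [a for a in assets if a.rating == "Low"]


def missing_angles(assets, wanted=("price", "social proof", "urgency", "value proposition")):
    """Return the wanted angles that have no headline rated Best or Good."""
    covered = {
        a.angle
        for a in assets
        if a.kind == "headline" and a.rating in ("Best", "Good")
    }
    return [angle for angle in wanted if angle not in covered]


# Hypothetical snapshot of one RSA's headlines.
assets = [
    Asset("From £29 per month", "headline", "price", "Good"),
    Asset("Rated 4.8 by 2,000 clients", "headline", "social proof", "Best"),
    Asset("Offer ends Friday", "headline", "urgency", "Low"),
    Asset("Set up in one afternoon", "headline", "value proposition", "Learning"),
]

to_replace = assets_to_replace(assets)
gaps = missing_angles(assets)
print([a.text for a in to_replace])
print(gaps)
```

Here the urgency headline is flagged for replacement, and both urgency and value proposition show up as angles that still lack a proven headline.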
Themes or copy angles at campaign level
This is where testing gets rigorous. Instead of comparing one headline against another, you compare two parallel campaigns with two distinct copy approaches:
- Campaign A: RSA with headlines focused on price + social proof
- Campaign B: RSA with headlines focused on speed/convenience + guarantee
You use the Experiments feature inside Google Ads (formerly Drafts & Experiments) to split traffic 50/50 between both. The unit of testing is no longer the ad: it’s the theme.
Pinning configuration as an extra variable
Google lets you “pin” headlines to specific positions (position 1, 2, or 3). Applying pinning reduces the number of combinations Google can generate, which paradoxically can improve relevance for specific keywords at the cost of flexibility. It’s testable as another layer: a campaign with strategic pinning vs one without, measuring CTR and CPA on both.
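To get a feel for how much pinning narrows the space, here is a rough count. It assumes every impression fills exactly 3 headline slots and 2 description slots, that order matters, and that each pinned position holds a single headline. Google can show fewer assets in practice, so treat this as an illustration rather than an exact figure. The function name is my own.

```python
from math import perm

HEADLINE_SLOTS = 3
DESCRIPTION_SLOTS = 2


def ordered_combinations(headlines, descriptions, pinned_headlines=0):
    """Count ordered ad variants when every slot is filled.

    pinned_headlines is the number of headline positions locked to one
    pinned headline each (a simplifying assumption).
    """
    free_slots = HEADLINE_SLOTS - pinned_headlines
    free_headlines = headlines - pinned_headlines
    return perm(free_headlines, free_slots) * perm(descriptions, DESCRIPTION_SLOTS)


unpinned = ordered_combinations(15, 4)
one_pin = ordered_combinations(15, 4, pinned_headlines=1)
print(unpinned, one_pin)  # 32760 2184
```

Under these assumptions, pinning a single headline to position 1 cuts the possible variants by a factor of 15.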
How to design an RSA test with statistical validity
For a test to actually tell you something real, you need three conditions at the same time: enough sample, enough time, and proper variable isolation.
Minimum sample per arm
The PPC rule of thumb is 100 conversions per arm as an absolute minimum, and 300+ if the expected difference between versions is small (under 15%). Below that, any difference you see is more likely statistical noise than signal.
In a standard A/B significance calculator (z-test for proportions), detecting a 10% relative lift in conversion rate at a 3% baseline requires roughly 53,000 clicks per arm at 95% confidence and 80% power, which is about 1,600 conversions per arm. Most accounts don’t come anywhere near that volume in a month. That’s why realistic tests either target large effects (over 20% difference) or accept 90% confidence to cut test duration.
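If you want to run the numbers for your own baseline, the textbook sample-size formula for a two-sided two-proportion z-test fits in a short function. The 80% power default is my assumption; calculators differ on power and on one- versus two-sided tests, so their figures will not all agree.

```python
from math import ceil, sqrt
from statistics import NormalDist


def clicks_per_arm(baseline_cvr, relative_lift, alpha=0.05, power=0.80):
    """Clicks needed in each arm to detect the given relative lift in CVR."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


small_lift = clicks_per_arm(0.03, 0.10)                 # 10% lift, 95% confidence
large_lift = clicks_per_arm(0.03, 0.25)                 # 25% lift, 95% confidence
relaxed = clicks_per_arm(0.03, 0.10, alpha=0.10)        # 10% lift, 90% confidence

for label, clicks in [("10% lift", small_lift), ("25% lift", large_lift), ("10% lift @ 90%", relaxed)]:
    print(f"{label}: {clicks} clicks per arm, ~{round(clicks * 0.03)} conversions at baseline")
```

With these settings the 10% lift needs on the order of 50,000 clicks per arm, while the 25% lift needs under 10,000, which is why large effects are the only realistic target for most accounts.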
Minimum time
Minimum 4 weeks, ideally 6-8. Under 4 weeks and you’re capturing day-of-week noise, seasonal spikes, and algorithm learning. Beyond 8 weeks and you start running into other problems: search behaviour shifts, Google adjusts its model, old data loses relevance.
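A quick feasibility check before launching, sketched under the assumption of an even 50/50 split and a steady weekly conversion rate. The function names and the example volumes are illustrative only.

```python
from math import ceil


def weeks_needed(weekly_conversions, target_per_arm=100, arms=2, min_weeks=4):
    """Weeks until each arm reaches its conversion target, never below min_weeks."""
    per_arm_weekly = weekly_conversions / arms
    return max(ceil(target_per_arm / per_arm_weekly), min_weeks)


def fits_window(weeks, max_weeks=8):
    """True if the test can finish inside the recommended window."""
    return weeks <= max_weeks


for weekly in (60, 30, 10):
    weeks = weeks_needed(weekly)
    print(f"{weekly} conversions/week -> {weeks} weeks, fits window: {fits_window(weeks)}")
```

At 60 conversions a week the 4-week floor applies, at 30 the test needs 7 weeks, and at 10 it would take 20 weeks, well beyond the point where the data stays comparable.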
Variable isolation
One variable changes between arms. If you’re testing headlines with different angles, keep descriptions, extensions, landing page, targeting, and bids identical between A and B. If two variables change at once, you won’t know which moved the result.
The most rigorous tests I’ve seen in accounts with budgets between £8,000 and £50,000/month are the ones that change a single dimension at a time. It’s slow, and it’s the only way to learn something actionable.
What metrics validate an RSA test?
Don’t use CTR as the primary metric. CTR responds to ad position, audience, and time of day, not just copy. The metrics that actually matter for validating an RSA test are:
| Metric | Why it matters | Common trap |
|---|---|---|
| Conversions | The business metric | Attributing late conversions to the wrong arm |
| CPA | Reflects spending efficiency | Can improve from CPC drops without copy doing anything |
| CVR (Conversion Rate) | Isolates copy’s effect on the decision | Needs high volume to be reliable |
| CTR | Indicator of relevance, not lead quality | High CTR can bring irrelevant traffic |
| Quality Score | Moves slowly, reflects set coherence | Doesn’t shift in 4-week tests, better as a diagnostic |
For B2B with long sales cycles, CPA inside Google Ads can be misleading because the real conversion (qualified lead, closed opportunity) happens months later. In those cases I use offline conversions imported from the CRM as a validation metric: copy A might generate cheaper but less qualified leads than copy B.
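The arithmetic behind that comparison is simple. The figures below are invented for illustration, with qualified-lead counts standing in for whatever your CRM import provides.

```python
def cost_per(spend, count):
    """Cost per unit; None when there is nothing to divide by."""
    return spend / count if count else None


# Hypothetical results: same spend per arm, qualified counts from the CRM.
arms = {
    "A": {"spend": 4000.0, "leads": 100, "qualified": 10},
    "B": {"spend": 4000.0, "leads": 80, "qualified": 20},
}

report = {
    name: {
        "cpa_lead": cost_per(d["spend"], d["leads"]),
        "cpa_qualified": cost_per(d["spend"], d["qualified"]),
    }
    for name, d in arms.items()
}

for name, row in report.items():
    print(name, row)
```

Copy A wins on CPA per lead (40 vs 50) and loses on cost per qualified lead (400 vs 200), so the in-platform metric alone would pick the wrong arm.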
Common mistakes that invalidate an RSA test
Why do so many tests run for months without concluding anything? When I audit accounts, the patterns that show up are always the same:
- Two near-identical RSAs in the same ad group. Google rotates them internally and shared assets contaminate attribution. If you want to compare copy approaches, use Drafts & Experiments at campaign level, not two parallel RSAs.
- Changing several variables at once. Headlines, descriptions, and landing page URL all modified at the same time. When the test ends, you can’t attribute the difference to anything. I’ve seen this in more accounts than I’d like.
- Cutting the test short on an “early winner”. In the first 2 weeks, variance is huge. What looks like a winner at 30 conversions often disappears by week 6 with 200. Resisting that temptation is the difference between testing and storytelling.
- Not documenting the hypothesis. If you launch a test without writing down “I expect copy B to reduce CPA by 15% because the urgency angle connects better with transactional searches”, the outcome (positive or negative) doesn’t teach you anything new. Documenting hypotheses is what turns tests into cumulative learning.
- Ignoring the landing page. A copy change with the same landing page only tests the “click promise”. If the landing doesn’t reinforce the ad’s message, you’ll win on CTR and lose on conversion.
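Two of the mistakes above, the undocumented hypothesis and the early stop, can be guarded against with a small written record. This is one possible shape for it, assuming the 4-week and 100-conversion minimums from this guide; the class, field names, and example values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CopyHypothesis:
    ad_group: str
    variable: str            # the single dimension that changes between arms
    metric: str              # primary metric, e.g. "CPA"
    expected_change: float   # -0.15 means "15% lower"
    rationale: str
    start: date
    min_weeks: int = 4
    min_conversions_per_arm: int = 100

    def ready_to_read(self, today, conversions_a, conversions_b):
        """True only once both the time and the volume minimums are met."""
        weeks_run = (today - self.start).days / 7
        enough_volume = min(conversions_a, conversions_b) >= self.min_conversions_per_arm
        return weeks_run >= self.min_weeks and enough_volume


hypothesis = CopyHypothesis(
    ad_group="Example ad group",
    variable="headline angle: urgency vs price",
    metric="CPA",
    expected_change=-0.15,
    rationale="Urgency connects better with transactional searches",
    start=date(2026, 1, 5),
)

early = hypothesis.ready_to_read(date(2026, 1, 19), 30, 34)    # week 2
later = hypothesis.ready_to_read(date(2026, 2, 16), 120, 131)  # week 6
print(early, later)  # False True
```

The point of the record is that the decision rule exists before the data does, so an attractive result at week 2 cannot end the test.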
Recommended cadence
A productive ad group doesn’t need to be in test mode at all times. The cadence I recommend to clients with mid-size accounts (50-200 active ad groups):
- Individual asset rotation: every 2-4 weeks, review “Low” signals and pause/replace.
- Theme or angle test (Drafts & Experiments): 1 per strategic ad group every 4-6 weeks, max 3-4 active tests in parallel across the whole account.
- Configuration test (pinning, extensions): quarterly, only in ad groups with enough volume for the result to be reliable.
More tests running at once doesn’t equal more learning. It equals less attention per test and a higher chance of finishing without knowing what really happened. If in doubt, audit the full account structure before you start testing.
Frequently asked questions
Can I test two identical RSAs by changing only one headline?
Technically yes, but the result is going to be hard to interpret. Google rotates shared assets between both ads and attributes impressions per its own model, without you controlling the distribution. It’s better to isolate the variable at campaign level using Drafts & Experiments.
How many headlines should an RSA have for proper testing?
An RSA with fewer than 8-10 headlines doesn’t take advantage of Google’s algorithm. The maximum is 15 headlines and 4 descriptions, and Google Ads won’t let you add more. Starting with 10-12 varied headlines (price, urgency, benefit, social proof, guarantee) and 3-4 descriptions is the balance point between coverage and maintenance.
How much budget do I need for reliable RSA tests?
You need conversion volume, not budget. As a reference, an account with 100 conversions/month can test large changes (over 25% difference) in 4-6 weeks. To detect small changes (under 10%), you need accounts with several thousand monthly conversions. Below 50 conversions/month, almost any test is going to be inconclusive. If you want to size the total investment (management + media), I wrote a guide on Google Ads consultant cost in 2026.
Does Ad Strength work as a testing metric?
Not for validating tests, but it is useful for auditing setup. Ad Strength warns you if you’re missing headlines, if your descriptions are too short, or if you’re repeating keywords. It doesn’t predict which ad will convert better, so as a success metric for a test it’s misleading.
Should I pin headlines?
By default, no. Pinning reduces the combinations Google can generate, limiting algorithm optimisation. Pinning makes sense when you need to guarantee brand messages or comply with legal restrictions (e.g., mandatory disclaimers). For everything else, leaving the algorithm unrestricted tends to give better average performance.
What do I do with a test showing no statistical difference after 6 weeks?
It means any difference between the versions is too small to detect at your volume, so for practical purposes they perform similarly. That’s useful information: don’t invest more time optimising copy at that level and move focus to another lever (audiences, bids, landing page). An inconclusive test isn’t a failed test.
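To check whether a result really shows no statistical difference, you can run the same pooled z-test for proportions that most significance calculators use. This is a sketch with made-up click and conversion counts, not output from any real account.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, clicks_a, conv_b, clicks_b):
    """Two-sided p-value for a difference in conversion rate (pooled z-test)."""
    p_a = conv_a / clicks_a
    p_b = conv_b / clicks_b
    pooled = (conv_a + conv_b) / (clicks_a + clicks_b)
    se = sqrt(pooled * (1 - pooled) * (1 / clicks_a + 1 / clicks_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical 6-week results with 6,000 clicks per arm.
p_close = two_proportion_p_value(200, 6000, 210, 6000)  # 5% relative gap
p_wide = two_proportion_p_value(200, 6000, 260, 6000)   # 30% relative gap

print(f"200 vs 210 conversions: p = {p_close:.3f}")
print(f"200 vs 260 conversions: p = {p_wide:.4f}")
```

The 200 vs 210 result is nowhere near the 0.05 threshold, which is the inconclusive case described above, while 200 vs 260 clears it comfortably.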
Sources
- Google Ads Help. About responsive search ads. https://support.google.com/google-ads/answer/7684791. 2025.
- Google Ads Help. About expanded text ads (sunset 30 June 2022). https://support.google.com/google-ads/answer/7056544. 2022.
- Optmyzr. Google RSA performance study: does Ad Strength predict success? https://www.optmyzr.com/blog/google-rsa-performance-study/. 2024.
- Google Ads Help. About the Experiments page (formerly drafts and experiments). https://support.google.com/google-ads/answer/10682377. 2025.
- WordStream. Responsive Search Ads best practices guide. https://www.wordstream.com/blog/responsive-search-ads. 2025.
Want a second opinion on your account’s testing structure? I offer free initial audits as a freelance Google Ads consultant — 30 minutes over video call, no commitment. We’ll review which tests make sense for your volume and which metrics to track.