Multi-model comparison
When the corpus runs out: how four AI models handle a §414(m) gap
A real TaxGPT.ai user asked about affiliated service group rules for a doctor-owned surgical partnership. The retrieved IRS sources didn’t include §414(m) directly. Each of four models took a different path: reason about the gap, fill it from training, or defer to a professional. Which one would you trust?
These are results for a group of un related doctors each with their own separate medical practice form a partnership to provide surgical services at a surgical center. the services they provide are for patients of the individual doctors own medical practices. There is no cross patient service between doctors. what are the affiliated service group rules for retirement plan for the individual doctors practices and the partnership of the surgical services
How this was assembled
This article uses the original question from a real TaxGPT.ai chat session as the starting point, then re-runs the conversation from scratch using all four models. The user simulator (gpt-4o-mini) supplies plausible follow-up replies based on stated scenario assumptions; production replies in this re-run are clean GPT-5.4 outputs (no paywall interference). The retrieved IRS sources, system prompt, and model configurations match production. Only the initial Turn 1 question is from real user data.
Scenario assumptions: Unrelated physicians, each with separate medical practices, jointly own a partnership that operates a surgical center and provides services exclusively to each doctor’s own patients. No cross-doctor patient services. Question scope: how do the §414(m) affiliated service group rules apply for retirement plan testing across the individual practices and the partnership?
§ 1. The conversation, turn by turn
Click any candidate model card to expand. The badge under each card flags whether the model materially diverged from production for that turn.
These are results for a group of un related doctors each with their own separate medical practice form a partnership to provide surgical services at a surgical center. the services they provide are for patients of the individual doctors own medical practices. There is no cross patient service between doctors. what are the affiliated service group rules for retirement plan for the individual doctors practices and the partnership of the surgical services
§ 2. Which sources did each model cite?
| Source | GPT-5.4 | Sonnet 4.6 | Opus 4.7 | Gemini 3.1 |
|---|---|---|---|---|
| [1] IRC §414 — Definitions and special rules | ✓ | ✓ | ✓ | — |
| [2] 26 CFR § 1.469-4T — Definition of activity (temporary) | — | — | — | — |
| [3] 26 CFR § 1.1402(a)-17 — Retirement payments to retired partners | — | ✓ | — | — |
| [4] 26 CFR § 1.199A-4 — Aggregation | — | — | — | — |
| [5] 26 CFR § 1.404(a)-10 — Profit-sharing plan of an affiliated group; application of section 404(a)(3)(B) | — | — | — | — |
| [6] 2025 Publ 560 (PDF) | ✓ | — | ✓ | — |
| [7] 2025 Inst 990 (Schedule H) (PDF) | — | — | — | — |
Retrieved IRS sources used in this conversation
- IRC §414 — Definitions and special rules
- 26 CFR § 1.469-4T — Definition of activity (temporary)
- 26 CFR § 1.1402(a)-17 — Retirement payments to retired partners
- 26 CFR § 1.199A-4 — Aggregation
- 26 CFR § 1.404(a)-10 — Profit-sharing plan of an affiliated group; application of section 404(a)(3)(B)
- 2025 Publ 560 (PDF)
- 2025 Inst 990 (Schedule H) (PDF)
Three strategies for the same gap
What this run reveals is not a tax disagreement — it is a methodology disagreement. Each of the four models faced the same problem: the retrieved IRS corpus included IRC §414 generally but not the §414(m) text that actually defines affiliated service groups. The authority needed to answer the question wasn’t in the materials provided. Each model responded differently.
GPT-5.4 (the model currently powering TaxGPT) reasoned at length about why it could not answer: §414 contains aggregation rules, the affiliated service group definitions live in §414(m), the excerpt provided did not include §414(m), and the user should consult the missing authority directly. Sonnet 4.6 and Opus 4.7 took the opposite approach: both acknowledged the corpus gap, then walked through the §414(m) A-Org and B-Org tests from training data and applied them to the doctor-partnership facts. Opus went further, naming §414(m)(5) (the management rule) and laying out all three ASG types in a table. Sonnet built its own comparison table and reached a different conclusion than Opus on the A-Org analysis (Sonnet said "probably no"; Opus said "likely yes"). Gemini 3.1 Pro took a third path entirely: it acknowledged §414(m) was the right framework, declined to apply the test from training data, and redirected the user to read §414(m) directly and consult an ERISA attorney. Gemini's reply was the shortest of the four (278 tokens, vs. 366 for production, 1,167 for Sonnet, and 1,437 for Opus).
For a tax product, the honest framing is that there is no clean winner here — each strategy has tradeoffs. Production’s response is informative without being authoritative. Sonnet and Opus are useful frameworks for someone doing initial research, but their answers come from training data, not from authority the reader can verify in the cited sources. Gemini’s response is the safest from a liability standpoint but offers no analysis. The fact that Sonnet and Opus reached different conclusions on the A-Org question is itself a finding: a tax professional cannot rely on either model’s framework without checking the underlying authority anyway. The right product fix is to address the corpus gap so all four models can ground their answers in the actual §414(m) text and Q&A regulations.
A side observation on speed and cost: Opus was the most thorough at 1,437 tokens but took 27 seconds; Gemini was the leanest at 278 tokens in 17 seconds; Sonnet sat in the middle. For a chat product where users wait on responses, the cost-quality tradeoff favors the leaner models when the question requires the model to reason about a gap rather than fill it. Production's 366 tokens in 9 seconds is a defensible operating point.
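The token and latency figures above can be reduced to per-token throughput with a quick back-of-envelope calculation (numbers are the ones quoted in this run; Sonnet is omitted because its latency is not given):

```python
# Back-of-envelope throughput from the latency and length figures quoted above.
# (tokens, seconds) per model; Sonnet 4.6 omitted -- no latency figure in this run.
runs = {
    "GPT-5.4 (production)": (366, 9),
    "Opus 4.7": (1437, 27),
    "Gemini 3.1": (278, 17),
}

throughput = {model: tokens / seconds for model, (tokens, seconds) in runs.items()}
for model, tps in throughput.items():
    print(f"{model}: {tps:.0f} tokens/sec")
# → GPT-5.4 (production): 41 tokens/sec
# → Opus 4.7: 53 tokens/sec
# → Gemini 3.1: 16 tokens/sec
```

One wrinkle the arithmetic surfaces: Gemini's reply is the shortest overall but also the slowest per token, so "leaner" here means fewer tokens generated, not faster generation.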
Where the analysis is uncertain — please poke holes
- Sonnet and Opus reached different conclusions on the A-Org analysis (Sonnet: "probably no"; Opus: "likely yes"). Both worked from training data on the same facts. Which one is right? The answer depends on whether the surgical partnership performing services for each doctor’s patients counts as the practices and the partnership being "regularly associated in performing services for third parties" — that’s the operative phrase from §414(m)(2)(A)(ii). A practitioner reading the case law would know; the models are guessing.
- The retrieved chunks included IRC §414 generally and several Treasury Regulations, but not the §414(m) text itself. Is this a corpus chunking issue (the §414(m) language got separated from the rest of §414), an embedding issue (the query did not surface the right chunk), or a coverage gap (§414(m) is not in the corpus at all)? Worth verifying.
- None of the four models cited Rev. Rul. 81-105, the seminal IRS ruling on ASG analysis for medical practices, nor the proposed §414(m) regulations (1983) that practitioners actually rely on. Should a general-purpose tax assistant be expected to surface these? If yes, that’s a corpus expansion question.
- Gemini’s response was the most conservative — it declined to apply the test even from training data and pointed the user to a professional. Is that the most defensible product behavior, or is it the least useful? For a tax product whose users are paying for analysis, "go consult a professional" may be honest but unhelpful. Where should the line be?
- Source [2] (26 CFR §1.469-4T on passive activity grouping), Source [4] (§1.199A-4 on QBI aggregation), and Source [5] (§1.404(a)-10 on profit-sharing plans of an affiliated group) were all retrieved but uncited by every model. Were they irrelevant, or did the models miss connections? The §1.404(a)-10 reference to "affiliated group," for example, sounds directly on point even though it's a different code section.
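The corpus question raised in the second bullet (chunking vs. embedding vs. coverage) is checkable in code. A minimal sketch, assuming access to the chunk store; the chunk texts and the toy bag-of-words embedder below are illustrative stand-ins, not TaxGPT internals:

```python
# Hypothetical diagnostic: is the missing 414(m) text a coverage gap
# (not in the corpus) or a retrieval miss (in the corpus, ranked too low)?
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words vector; a real check would use the production embedder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [  # stand-in chunks, not the real corpus
    "IRC 414 definitions and special rules general aggregation",
    "IRC 414(m) employees of an affiliated service group A-org B-org tests",
    "26 CFR 1.469-4T definition of activity passive grouping",
]
target = "414(m)"
query = "affiliated service group rules retirement plan"

# Step 1: coverage -- does any chunk contain the 414(m) text at all?
covered = [c for c in corpus if target in c]
if not covered:
    diagnosis = "coverage gap: 414(m) not in corpus"
else:
    # Step 2: retrieval -- does the query rank a 414(m) chunk first?
    q = embed(query)
    ranked = sorted(corpus, key=lambda c: cosine(q, embed(c)), reverse=True)
    diagnosis = "retrieval ok" if target in ranked[0] else "embedding/ranking miss"
print(diagnosis)
```

Step 1 separates a coverage gap from the other two causes; step 2 separates an embedding or ranking miss from healthy retrieval. Distinguishing a chunking problem would additionally require inspecting chunk boundaries within §414 itself.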
Have a tax question of your own?
Ask TaxGPT and see what the production model says with full IRS source citations.
Ask TaxGPT →
What did we miss?
If you're a CPA, EA, tax attorney, or tax tech practitioner — what did the models get wrong?
§ 3. Updates from professional discussion — last reviewed: pending