A weekly series from TaxGPT.ai

Question of the Week

Four AI models. One real tax question. The same retrieved IRS sources. Open evaluation by tax professionals — what each model got right, what it missed, and which one you would actually trust.

Cadence: Weekly · Models compared: GPT-5.4, Sonnet 4.6, Opus 4.7, Gemini 3.1 Pro · Audience: CPAs, EAs, tax attorneys, tax tech practitioners

§ 1 Issues

Each issue takes a real question from the TaxGPT.ai chat, runs all four models against the same retrieved IRS sources, and shows what they said.

No. 1

Published May 5, 2026 · 3 models compared · 1-turn conversation

When the corpus runs out: how AI models handle a §414(m) gap

A real TaxGPT.ai user asked about affiliated service group rules for a doctor-owned surgical partnership. The retrieved IRS sources didn’t include §414(m) directly. Two models filled the gap from training data; one stayed in its lane. Which approach is correct?

§414(m) · Affiliated Service Groups · Retirement Plans · Citation Honesty

Read the full comparison →

About this series

Question of the Week is a transparency project. Tax software products generally show you one AI answer and ask you to trust it. We show you four — what GPT-5.4, Claude Sonnet 4.6, Claude Opus 4.7, and Gemini 3.1 Pro each say when given the same real user question and the same retrieved IRS sources — and we invite tax professionals to evaluate the outputs.

Each issue includes:

A real user question from a TaxGPT.ai chat session
The retrieved IRS sources (Code, Treasury Regulations, Publications, Rev. Procs.)
Each model’s full answer with token counts, latency, and citation tracking
A source attribution matrix showing which model cited which authority
Editorial commentary on where the models diverged and why
A "where the analysis is uncertain" section inviting professional critique
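For readers who want a concrete picture of the source attribution matrix, here is a minimal sketch of how such a grid could be built. It is an illustration only, not the code behind the series: the function names `attribution_matrix` and `render`, the model labels, and the sample citations are all hypothetical.

```python
from typing import Dict, List


def attribution_matrix(cited: Dict[str, List[str]]) -> Dict[str, Dict[str, bool]]:
    """Map each authority to a {model: cited?} row.

    `cited` is a hypothetical input: model label -> authorities it cited.
    """
    authorities = sorted({a for cites in cited.values() for a in cites})
    return {a: {model: a in cites for model, cites in cited.items()}
            for a in authorities}


def render(matrix: Dict[str, Dict[str, bool]], models: List[str]) -> str:
    """Render the matrix as plain text, one authority per row."""
    lines = ["authority | " + " | ".join(models)]
    for authority, row in matrix.items():
        marks = ["x" if row[m] else "-" for m in models]
        lines.append(authority + " | " + " | ".join(marks))
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {"model_a": ["§414(m)", "§401(a)"], "model_b": ["§401(a)"]}
    print(render(attribution_matrix(sample), list(sample)))
```

A row with an "x" under only some models marks an authority that not every model cited, which is where the editorial commentary would focus.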

The goal is not to declare a winner. The goal is to help tax professionals understand how different AI models handle the same question, where current tax-AI products have gaps, and where human judgment still matters.

Have a tax question of your own?

Ask TaxGPT directly and see what the production model says, with full IRS source citations.

Ask TaxGPT →

Frequently asked questions

Are these real tax questions?

Yes. Each issue starts with the actual user message from a real TaxGPT.ai chat session. When the original chat ended at the paywall, we re-run the conversation from scratch using a user simulator so all four models get a complete conversation. This is disclosed in each issue’s methodology footnote.

Are these meant as professional tax advice?

No. Question of the Week is a transparency exercise comparing how AI models handle tax questions. It is not professional tax advice. For specific tax situations, consult a CPA, EA, or tax attorney.

How are the models compared?

All four models receive identical conversation history, the same retrieved IRS source chunks from our Pinecone corpus, and the same system prompt used in production TaxGPT.ai chat. Only the underlying language model differs. Each issue captures token count, latency, citations, and a divergence analysis showing where each model differs from the production answer.
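The controlled-comparison idea can be sketched in a few lines. This is a hedged illustration under stated assumptions, not our production harness: `ModelFn`, `ModelResult`, `run_comparison`, and `divergence` are hypothetical names, each model is represented by a plain callable rather than a real vendor API, the token count is a whitespace approximation, and the citation pattern only recognizes simple "§414(m)"-style references.

```python
import re
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interface: (system_prompt, history, source_chunks) -> answer
ModelFn = Callable[[str, List[dict], List[str]], str]

# Simplified pattern for Code section references such as "§414(m)" or "§ 401(a)(4)".
CITATION_RE = re.compile(r"§\s?\d+[A-Za-z]?(?:\([A-Za-z0-9]+\))*")


@dataclass
class ModelResult:
    model: str
    answer: str
    latency_s: float
    token_count: int  # whitespace split, an approximation of real tokenization
    citations: List[str]


def extract_citations(text: str) -> List[str]:
    """Return the unique section references in `text`, normalized and sorted."""
    return sorted({m.replace(" ", "") for m in CITATION_RE.findall(text)})


def run_comparison(models: Dict[str, ModelFn], system_prompt: str,
                   history: List[dict], chunks: List[str]) -> Dict[str, ModelResult]:
    """Give every model identical inputs; only the model callable varies."""
    results = {}
    for name, fn in models.items():
        start = time.perf_counter()
        answer = fn(system_prompt, list(history), list(chunks))  # copies keep inputs identical
        latency = time.perf_counter() - start
        results[name] = ModelResult(name, answer, latency,
                                    len(answer.split()), extract_citations(answer))
    return results


def divergence(results: Dict[str, ModelResult], baseline: str) -> Dict[str, dict]:
    """Compare each model's citations against the baseline (production) model."""
    base = set(results[baseline].citations)
    return {name: {"extra": sorted(set(r.citations) - base),
                   "missing": sorted(base - set(r.citations))}
            for name, r in results.items() if name != baseline}
```

Passing copies of the history and source chunks to each callable is a deliberate choice in this sketch: it prevents one model's wrapper from mutating the inputs another model sees, which would break the "only the model differs" premise.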

Can I contribute to the discussion?

Yes — each issue invites professional critique. Comment on the LinkedIn post that accompanies each issue, or use our contact page. Substantive contributions are credited and added to the issue’s "Updates from professional discussion" section seven days after publication.

What models are tested?

The current test set is OpenAI GPT-5.4 (which powers TaxGPT.ai chat in production), Anthropic Claude Sonnet 4.6, Anthropic Claude Opus 4.7, and Google Gemini 3.1 Pro Preview. The model lineup updates as new frontier models become available; coverage is disclosed per issue.