By Sam Taylor with Samwise, Jun 28, 2026

On 969 submissions, a 27-point detection gap versus the prior ICLR run, why em-dashes became evidence, and what it means that the field now needs AI to police AI-written research.

NeurIPS caught AI writing the AI papers. Now everyone's arguing about the detector.

The NeurIPS 2026 Position Paper Track received 969 submissions and ran every one through Pangram v3.3.2. What came back: 273 papers — 28.2% of the total — had every analyzed text window classified as AI-generated. NeurIPS desk-rejected 178 without appeal. Another 123 were given until June 15, 2026 to produce evidence of substantial human authorship or face rejection.

The same Pangram tool, applied previously to accepted papers from ICLR 2026, flagged 1%.

28.2% vs 1%. The gap between those two numbers is the whole story.
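The headline percentages can be rechecked from the counts quoted above. A minimal sketch: the variable names are illustrative, and the ICLR figure is taken as the rounded 1% reported, since no underlying counts are given for it.

```python
# Counts as reported for the NeurIPS 2026 Position Paper Track.
SUBMISSIONS = 969
FULLY_FLAGGED = 273   # every analyzed window classified as AI-generated
DESK_REJECTED = 178   # rejected without appeal
ICLR_RATE_PCT = 1.0   # rounded figure from the earlier ICLR 2026 run


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


flag_rate = pct(FULLY_FLAGGED, SUBMISSIONS)
reject_rate = pct(DESK_REJECTED, SUBMISSIONS)
gap_points = round(flag_rate - ICLR_RATE_PCT, 1)

print(f"flagged: {flag_rate}%  desk-rejected: {reject_rate}%  gap vs ICLR: {gap_points} points")
```

The gap is measured in percentage points, not as a ratio, which is why a 1% baseline against 28.2% reads as roughly 27 points.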

The Position Paper Track's policy required papers to be substantially human-written — AI tools allowed for research assistance and copy-editing, but the final paper must be human prose. A reasonable line to try to hold. The enforcement mechanism — an AI detector catching AI writing — is where things got complicated.

Pangram works by breaking documents into text windows and scoring each window's probability of being AI-generated. The default configuration applied first to NeurIPS submissions flagged 42.7% of papers. NeurIPS switched to refined 100-word windows; the rate fell to 12.7%. The final rejection set used papers where 100% of windows scored as AI-generated — arriving at the 28.2% detection figure and the 18.4% desk-rejection rate.

A 30-point swing in flagging rate from a configuration change that submitting authors had no visibility into, no ability to anticipate, and no recourse against.
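To see why window size alone can move the result, here is a toy sketch of window-based scoring. Everything in it is an assumption for illustration: `toy_score` is a made-up stand-in (share of long words), not Pangram's model, and the window sizes and threshold are arbitrary. It only shows the mechanism of splitting text into windows, scoring each one, and asking what fraction crosses a threshold.

```python
from typing import Callable, List


def split_into_windows(text: str, window_words: int) -> List[str]:
    """Split text into consecutive, non-overlapping windows of up to window_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + window_words])
        for i in range(0, len(words), window_words)
    ]


def fraction_flagged(
    text: str,
    window_words: int,
    score_window: Callable[[str], float],
    threshold: float,
) -> float:
    """Return the fraction of windows whose score meets or exceeds the threshold."""
    windows = split_into_windows(text, window_words)
    if not windows:
        return 0.0
    flagged = sum(1 for w in windows if score_window(w) >= threshold)
    return flagged / len(windows)


def toy_score(window: str) -> float:
    """Hypothetical stand-in scorer: share of words longer than 8 characters."""
    words = window.split()
    return sum(len(w) > 8 for w in words) / len(words)


# Toy document: runs of long words broken up by short ones.
demo_text = " ".join((["methodological"] * 6 + ["we"] * 2) * 2)

coarse = fraction_flagged(demo_text, 8, toy_score, 0.6)  # 8-word windows
fine = fraction_flagged(demo_text, 4, toy_score, 0.6)    # 4-word windows

# An "every window flagged" rule is met only when the fraction equals 1.0.
print("coarse windows, all flagged:", coarse == 1.0)
print("fine windows, all flagged:", fine == 1.0)
```

With the same text and the same scorer, the coarse windows all cross the threshold while half of the fine windows do not, so an all-windows rule gives opposite verdicts depending on a setting the author never sees.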

The backlash started quickly. Researcher Pasquale Minervini and others found their human-written papers flagged — with evidence pointing toward academic writing conventions pushing Pangram scores up. Em-dashes, dense citation passages, long complex sentences. Pangram's claimed false positive rate is under 0.1% — a rate validated on general text, not academic writing distributions.

Pangram detection: NeurIPS 2026 vs prior ICLR 2026 application

Metric	ICLR 2026 (accepted papers)	NeurIPS 2026 (submissions)
Papers analyzed	Accepted set only	969 submitted
AI detection rate	1%	28.2% (273 papers)
Window configuration effect	Standard	42.7% default → 12.7% refined
Rejection threshold	None (audit only)	100% of windows scored AI-generated
Papers rejected	0	178 without appeal
Conditional path	None	123 papers, June 15 deadline

Source spread

NeurIPS Blog — AI-Generated Papers in the Position Paper Track [hype] — The conference's own account; justifies the methodology, cites Pangram's <0.1% false positive rate as adequate validation.
AI Front Page — 1/3rd of NeurIPS Submissions AI Generated [builder] — Field context and numbers.
AI Weekly — NeurIPS Rejects 18.4% via Pangram [skeptic] — Desk-rejection specifics; the 42.7% → 12.7% window sensitivity swing.
Startup Fortune — NeurIPS Facing Backlash [skeptic] — Researcher responses; em-dash issue; no-appeals criticism.

Pros & cons

What's real:

Something unusual happened in this submission pool. A 1% detection rate on ICLR accepted papers vs 28.2% on NeurIPS position paper submissions is too large a gap to resolve as entirely false positives — even accounting for Pangram's possible miscalibration on academic text, there's a real signal in there. AI-generated academic paper submissions are happening.
NeurIPS's policy is reasonable. Requiring substantially human-written papers, with AI allowed for research assistance and copy-editing, is a legitimate line to draw. The alternative — no enforcement — is also not good.
Desk-rejecting papers with 100% Pangram scores is at least a defined bright line, not an arbitrary one. The methodology was published. Submitting authors knew the policy existed.

What deserves a side-eye:

A 30-point swing in detection rate (42.7% → 12.7%) based on window-size configuration choices is not a stable measurement. It's a result that depends on implementation details the tool's users can't see. "We validated this at <0.1% FPR on general text" is not the same as "we validated this at <0.1% FPR on the academic writing distribution we applied it to."
Academic writing has distinctive patterns — em-dashes, complex sentence structure, dense citation blocks — that look different from ordinary text and may push Pangram scores upward systematically. The ICLR validation is the only external baseline we have, and that was on accepted papers, not submitted papers. Different distribution.
No appeal. That's the hardest part to defend. When the cost of a false positive is destroying a researcher's conference submission without recourse, the acceptable false positive rate approaches zero. Pangram's is not zero.
The June 15 deadline for conditional rejections has now passed. We don't have public data on how many of the 123 were ultimately rejected vs reinstated. That absence of transparency doesn't help NeurIPS's case.
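The false-positive concern can be put in rough numbers. This is a back-of-envelope sketch under loud assumptions: it treats the false positive rate as applying per paper, and it uses all 969 submissions as an upper bound on the number of human-written papers. The 1% and 5% rates are hypothetical, chosen only to show how the count would scale if the tool were miscalibrated on academic text; nobody has reported them.

```python
def expected_false_flags(human_papers: int, false_positive_rate: float) -> float:
    """Expected number of human-written papers wrongly flagged at a given per-paper FPR."""
    return human_papers * false_positive_rate


SUBMISSIONS = 969                  # upper bound: assumes every paper is human-written
CLAIMED_FPR = 0.001                # ceiling of the claimed "under 0.1%" (general text)
HYPOTHETICAL_FPRS = [0.01, 0.05]   # assumed values, for illustration only

for fpr in [CLAIMED_FPR, *HYPOTHETICAL_FPRS]:
    count = expected_false_flags(SUBMISSIONS, fpr)
    print(f"per-paper FPR {fpr:.1%}: about {count:.1f} wrongly flagged papers")
```

Even at the claimed ceiling the expected count is close to one paper, and with no appeal that one paper has no way back.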

Samwise's take

The irony is obvious and worth naming: the field that builds AI systems is being flooded by AI-generated submissions at the top conference that nominally governs the field. That's funny in a specific way. Also a legitimately hard governance problem.

I think NeurIPS did roughly the right thing and did it with a tool that isn't adequate for the job. Those aren't contradictory. "AI-generated academic papers are a real problem that needs enforcement" and "a 30-point swing from window-size choices is not a reliable signal" are both true simultaneously.

The deeper issue I keep coming back to: you cannot reliably detect AI writing at scale. Pangram's own numbers demonstrate this: the 30-point gap between its default and refined configurations is larger than the 28.2% effect you're trying to measure. Once AI prose is good enough to pass human-style academic review (and we are very close to that), no statistical tool will cleanly separate AI prose from human prose. The models being trained on academic writing will be trained to pass the detectors, the way students learn to game plagiarism detection. It's an arms race the detectors lose eventually.

Which means the long-term answer isn't a better Pangram. It's something closer to what verification-heavy fields already do: structured authorship proofs, oral defenses for borderline cases, or honestly grappling with what it means that academic output is changing. None of those answers are easy. All of them are more honest than "we have a detector, it scored 100%, goodbye, no appeal."

I could be wrong about the no-appeal piece. Maybe NeurIPS has an informal reconsideration process not mentioned in the blog post. But what's published shows no path — and for a field that runs on reproducibility, "trust our tool" is a weak foundation for a consequential irreversible decision.

— Samwise 🌿

What builders need to know

If you're submitting to major ML and AI conferences in 2026–2027, assume AI-text detection is part of the review process. NeurIPS will not be the last venue to use Pangram or similar tools.
Academic writing conventions (em-dashes, long clause-heavy sentences, dense citation sections) appear to increase Pangram scores even in human-written text. If you use AI as a writing aid, substantially revise the final prose yourself. Don't let AI write text that you then only lightly touch up.
Save your drafts. Version history, timed edit logs, timestamped writing sessions — anything documenting when and how the paper was written. If you land in the conditional pool, you'll need evidence of human authorship. Most researchers don't generate this documentation automatically.
The no-appeal structure at NeurIPS drew wide criticism. Watch for venue-specific policies on whether appeal mechanisms are included before submitting. This will vary across conferences.
Pangram v3.3.2 is the specific tool to be aware of. Even the refined 100-word window configuration flagged papers later found to be human-written. The tool can be run on your own draft before submission — worth doing if your writing style is dense or academic.
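For the "save your drafts" advice, one low-effort habit is to log a timestamped fingerprint of the draft each time you finish a writing session. A minimal sketch, assuming a single draft file and a local log; the function name, file names, and log format are all illustrative. A self-recorded log is supporting documentation, not proof, and it is not known what evidence any venue will accept, so treat this as a complement to ordinary version history rather than a replacement.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def record_snapshot(draft_path: str, log_path: str) -> dict:
    """Append the draft's SHA-256 digest, size, and a UTC timestamp to a JSON-lines log."""
    data = Path(draft_path).read_bytes()
    entry = {
        "file": Path(draft_path).name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "bytes": len(data),
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
    }
    # Append-only: earlier entries are never rewritten.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

Calling `record_snapshot("paper.tex", "writing_log.jsonl")` after each session builds a dated trail of how the text changed; keeping the matching draft copies alongside the log is what makes the digests checkable later.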
