The Session PR: Reviewing How the Code Was Made, Not Just What Changed
The framing
Pull requests have not really changed in fifteen years. The artifact under review is a diff, plus a description, plus some discussion. A reviewer reads the diff, runs the tests in their head, and decides whether the change is correct, safe, and aligned with the codebase.
That model works when a human typed the code. The diff is a faithful record of the author's thinking, because the author had to think every line into existence. If the diff is small and clean, you can be reasonably confident the thinking behind it was small and clean too.
With coding agents, that assumption breaks. A 30-line diff can be the residue of a 3-hour session that included six wrong plans, two rolled-back experiments, four tool calls into the wrong file, and a moment where the human said "no, do it the other way" and the agent quietly changed direction. The diff is still 30 lines. But the process that produced it is invisible.
So the question is: what if the next form of pull request is not a diff at all, but a session? What if you ship your agent transcript, your choices, your plan, and your decisions, and your reviewer evaluates that — not just the resulting code?
I'll call it a Session PR for the rest of this post.
What a Session PR would actually contain
A Session PR is not just "diff plus chat log dumped into a comment". It is a structured artifact. The minimum useful version has roughly these parts:
- Intent. A short statement of what you were trying to do and why. Not a commit message — the actual user-level outcome, including constraints (do not break X, must stay backward compatible, follow the existing pattern in module Y).
- Plan. The plan the agent (or you) produced before touching code, and any revisions. If the plan changed mid-session, the diff alone hides this; the Session PR makes it explicit.
- Session transcript. The prompts, the agent's tool calls (file reads, searches, edits, shell commands), and the agent's reasoning where available. Trimmed of noise but preserved in order.
- Choices and overrides. The points where you, the human, redirected the agent. "I rejected approach A because it would have changed the public API." "I asked it to not touch the auth module." These are the highest-signal moments in the session.
- Rejected paths. Branches the agent or you tried and abandoned. A reviewer learns a lot from what didn't ship.
- Verification record. What was actually run: tests, type checks, lint, manual checks, what passed, what was skipped, and why.
- The diff. Still there. Still required. But now framed as the output of the session, not the whole story.
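The parts above imply a structured export format. No such standard exists today, so here is a minimal sketch of what one could look like, using Python dataclasses; every field name and the example values are illustrative assumptions, not a schema any agent actually emits:

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical Session PR export. Field names are illustrative, not a
# standard -- the point is that each section of the artifact is a
# first-class, machine-readable part of the review unit.

@dataclass
class Override:
    at_step: int      # index into the transcript where the human redirected
    instruction: str  # what the human told the agent
    reason: str       # why -- the highest-signal part of the record

@dataclass
class SessionPR:
    intent: str                   # user-level outcome plus constraints
    plan: list[str]               # plan steps, including mid-session revisions
    transcript: list[dict]        # ordered prompts, tool calls, edits
    overrides: list[Override]     # human redirections
    rejected_paths: list[str]     # approaches tried and abandoned
    verification: dict[str, str]  # check name -> "passed" / "skipped: why"
    diff: str                     # the resulting patch: one section, not the whole

session = SessionPR(
    intent="Add retry to the upload client; do not change the public API",
    plan=["read existing client", "add bounded retry with backoff", "run tests"],
    transcript=[{"step": 0, "type": "tool_call", "tool": "read_file",
                 "args": {"path": "client.py"}}],
    overrides=[Override(at_step=4,
                        instruction="Do not touch the auth module",
                        reason="Out of scope; needs its own review")],
    rejected_paths=["retry at the transport layer (would change the public API)"],
    verification={"unit tests": "passed",
                  "load test": "skipped: no staging environment"},
    diff="--- a/client.py\n+++ b/client.py\n...",
)

# The export serializes cleanly, so it can travel with the PR as a file.
print(json.dumps(asdict(session), indent=2))
```

The diff being a plain field alongside the others is the framing in miniature: it is one output of the session, serialized next to the intent and the overrides rather than standing alone.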
The reviewer's job changes. They are no longer only asking "is this code correct?" They are also asking: "was the process sound? Did the human exercise judgment in the right places? Were the constraints respected? Was anything important skipped?"
Why this matters now
Three things have shifted at once, and together they make the diff-only PR genuinely insufficient.
First, the volume of agent-authored code is rising fast. When a single engineer is producing the output of five engineers via parallel agents (the conductor model), reviewers cannot keep up by reading diffs line by line. They need a higher-bandwidth signal about whether the work was done correctly.
Second, the failure modes are different. Human-authored bugs tend to be local: a typo, a wrong condition, an off-by-one. Agent-authored bugs tend to be systemic: a misunderstood constraint, an outdated assumption baked into the plan, a tool call against the wrong file that "worked" but landed in the wrong place. You catch human bugs by reading code. You catch agent bugs by reading process.
Third, accountability is shifting. When something ships and breaks production, the question "who decided this?" becomes harder. With a Session PR, the answer is in the transcript. The human's overrides, the agent's tool calls, and the plan that was approved are all on the record. This is closer to how regulated industries already think about decisions: the decision is not the outcome, the decision is the trail.
Is this already a thing? A short tour of what exists
This idea is in the air. Several adjacent practices and tools point at it from different angles, but none of them — yet — make the session the primary review artifact. Here is the lay of the land.
AI review bots (Codex, similar tools)
OpenAI's Codex code review feature integrates with GitHub: you mention @codex review on a PR and it reviews the diff, flags high-priority issues, and can fix them on request. It is configurable via AGENTS.md and supports focused reviews like @codex review for security regressions.
This is useful, but it is still diff-centric. The bot reads what changed and gives feedback on the result. It does not, as a rule, ingest the human's session with another agent and review the process behind the change. The reviewer is an AI; the artifact is still a diff.
Stacked diffs (Sapling, Graphite, ghstack)
Stacked diff workflows break a feature into a chain of small, independently reviewable commits. This is a real improvement over the giant PR — it gives reviewers a sense of progression, lets them approve incrementally, and preserves history that GitHub usually flattens.
But stacked diffs are still about decomposing the output, not exposing the process. The reviewer sees a sequence of clean commits, not the messy reality of how the work actually unfolded. A stacked diff is a polished narrative; a Session PR would be the actual ledger.
Session provenance and "agent attestations"
This is the closest existing concept. Some recent writing — for example, Propel Code's piece on session provenance — argues that AI-authored PRs should include a structured record of how the agent produced the change: prompt and policy IDs, tool calls, checkpoint outcomes, human overrides, and intent. They propose risk-based routing: docs changes do not need provenance; auth or payments changes do, and it should block merge until the provenance is reviewed.
This is recognizably the same family of idea. The difference is emphasis. Provenance, as usually framed, is metadata attached to the diff — a paper trail you check when something goes wrong, or a compliance artifact for tier-2/tier-3 changes. The Session PR framing flips this: the session is the PR. The diff is one section of it. You are not adding provenance to a code review; you are reviewing the session, of which the code is one output.
Build provenance (SLSA, Docker attestations)
In the supply-chain security world, there is a mature concept of build provenance: a signed record of how a binary was produced — what source, what builder, what dependencies. SLSA-style attestations are now standard in serious release pipelines.
This is the right idea applied to a different layer. Build provenance answers "how was this artifact built from this source?" A Session PR answers "how was this source produced from this intent?" One layer up. Same shape, same value: an auditable record between an input and an output.
"Intent-based" experiments
There are experimental tools framing git itself as intent-aware (e.g. projects like iam-git-with-intent) and CLIs (e.g. Entire CLI) that capture agent sessions alongside commits and make them searchable. These exist, but they are early, niche, and not yet wired into mainstream review platforms.
So is the Session PR new?
The honest answer: the components mostly exist, but the review pattern does not. No major platform today says: "the unit of review is the session; the diff is one of its outputs." Provenance is treated as compliance metadata, not as the primary artifact. AI review bots read diffs, not human-agent sessions. Stacked diffs polish the output rather than exposing the process.
So this is not invented from scratch. It is a recombination — but the recombination is the new thing, and it is where the trend lines are visibly converging.
What changes when the session is the artifact
If teams adopt Session PRs even partially, several things shift in a healthy direction.
Code review becomes process review. The reviewer's first question is no longer "does this code look right?" but "does this session look right?" That is closer to how senior engineers already mentor — they ask how you arrived at a decision, not just whether the decision is defensible in isolation.
Bad sessions become reviewable. Today, if an engineer pushes a clean diff produced by a chaotic agent session — wrong plans, ignored constraints, lucky landings — there is no way for a reviewer to see that. With a Session PR, the chaos is visible, and the reviewer can ask: "why did you accept this plan?" or "what made you override the agent here?"
Good sessions become teachable. A well-run session — clear intent, explicit plan, decisive overrides at the right moments, clean verification — becomes a training artifact for the team. New engineers can read senior engineers' sessions and learn how they direct agents, where they intervene, and where they let the agent run.
Risk routing becomes natural. The provenance literature already proposes tiering changes by risk. With a Session PR, this tiering has somewhere to live. A docs change might ship with a minimal session summary. An auth change ships with the full transcript, plan, overrides, and verification, and a human reviewer signs off on the process before the code.
Accountability moves to the right place. The human who ran the session is accountable for the choices in the session. The agent is accountable for executing within those choices. The reviewer is accountable for evaluating the session as a whole. Everyone has a clear line.
What it costs
This is not free. A few real costs are worth naming.
Sessions are noisy. A raw transcript is too long for a reviewer to read. Session PRs only work if there is good tooling to summarize, fold, and highlight the high-signal moments — the overrides, the rejected plans, the constraint violations — while collapsing routine tool calls. This is a UX problem, and it is solvable, but it is not solved yet.
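The folding problem is tractable, though. A first cut is just event filtering: keep the high-signal events expanded and collapse runs of routine tool calls into a single summary entry. A minimal sketch, assuming a hypothetical event shape where each transcript entry carries a `type` field:

```python
# Sketch of transcript folding for a session viewer. The event shape
# ({"type": ...}) and the signal categories are assumptions, not a spec.

HIGH_SIGNAL = {"override", "plan_revised", "rejected_path", "check_failed"}

def fold_transcript(events):
    """Collapse runs of routine events; surface high-signal ones in order."""
    folded, routine = [], []

    def flush():
        # Replace a run of routine events with one collapsed summary entry.
        if routine:
            folded.append({"type": "collapsed",
                           "count": len(routine),
                           "summary": f"{len(routine)} routine tool calls"})
            routine.clear()

    for event in events:
        if event["type"] in HIGH_SIGNAL:
            flush()
            folded.append(event)  # overrides etc. stay expanded
        else:
            routine.append(event)
    flush()
    return folded

events = [
    {"type": "tool_call", "tool": "read_file"},
    {"type": "tool_call", "tool": "search"},
    {"type": "override", "note": "Do not touch the auth module"},
    {"type": "tool_call", "tool": "edit_file"},
]
# Folds to: a collapsed run of 2, the override, a collapsed run of 1.
print(fold_transcript(events))
```

A real viewer needs more than this — expandable runs, search, diffing against the plan — but the core move is exactly this partition between routine and high-signal events.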
Privacy and IP exposure. Transcripts can contain sensitive prompts, internal context, customer data pasted in for debugging, and credentials. Shipping a session as a review artifact requires good redaction and access controls. This is the same problem build provenance had to solve, and it has known patterns.
Reviewer cognitive load shifts but does not necessarily drop. Reading a session is a different skill than reading a diff. Some reviewers will be worse at it initially. Teams will need norms for what "a reviewable session" looks like — analogous to how teams developed norms for "a reviewable PR" over the last decade.
Not every change needs it. Forcing a full Session PR on a one-line typo fix is theatre. Risk-based routing is essential: small, low-risk changes keep the lightweight diff PR; meaningful changes get a Session PR. The point is to introduce a new option, not to replace the old one.
What I'd build first
If I were prototyping this for a real engineering team, the smallest useful version would be:
- A standardized session export from coding agents (transcript + tool calls + overrides + plan), as a structured file attached to the PR.
- A session viewer in the review UI that shows the diff alongside the structured session — collapsible, searchable, with overrides and rejected plans surfaced as first-class events.
- A risk-tier policy in the repo (something like an AGENTS.md extension) that says which kinds of changes require a Session PR and which can ship with a normal diff PR.
- A review checklist focused on process: was the intent clear, was the plan sound, were constraints respected, were overrides justified, was verification adequate.
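The risk-tier policy is the simplest of these to sketch. Here is a hedged illustration in Python — the tier names, path patterns, and requirements are invented for the example, and a real policy would live in a config file rather than code:

```python
import fnmatch

# Hypothetical risk-tier policy mapping changed paths to review
# requirements. Tiers and patterns are illustrative, not a standard.
# Note: fnmatch's "*" matches across "/", so "**" here is just emphasis.
POLICY = [
    ("docs/**",        "tier-1", "diff PR with a minimal session summary"),
    ("**/auth/**",     "tier-3", "full Session PR, process sign-off required"),
    ("**/payments/**", "tier-3", "full Session PR, process sign-off required"),
    ("**",             "tier-2", "Session PR with transcript and verification"),
]

def review_requirement(changed_paths):
    """Return (tier, requirement): the strictest tier across changed files."""
    best = ("tier-0", "no review")
    for path in changed_paths:
        for pattern, tier, requirement in POLICY:
            if fnmatch.fnmatch(path, pattern):
                # String comparison works because tiers share one-digit names.
                if tier > best[0]:
                    best = (tier, requirement)
                break  # first matching rule decides this file's tier
    return best

# One auth file in the changeset escalates the whole change to tier-3.
print(review_requirement(["src/auth/token.py", "docs/guide.md"]))
```

The escalation rule — the riskiest touched file sets the tier for the whole change — is the design choice that matters; everything else is plumbing.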
That is enough to test whether reviewers actually catch more problems, and whether teams ship better code. Everything else — fancy provenance signing, cross-PR search across sessions, reviewer AI that summarizes sessions for you — can come later.
Closing synthesis
The pull request format we use today was designed for a world where humans typed code. It centered on the diff because the diff was the most informative artifact: it directly reflected the author's reasoning.
In a world where agents produce code under human direction, the diff is no longer the most informative artifact. The session is. The session contains the intent, the plan, the choices, the overrides, the rejections, and the verification — everything a reviewer actually needs to judge whether the change was made well, not just whether it looks right.
Pieces of this idea exist already. AI review bots, stacked diffs, session provenance, build attestations, and intent-aware git experiments are all reaching for the same thing from different angles. None of them, today, treat the session as the primary review artifact.
That last step — making the session the unit of review — is, as far as I can tell, the genuinely new move. And it is probably inevitable. Once a team is shipping meaningful volumes of agent-authored code, reviewing only diffs starts to feel like inspecting a finished cake without ever asking what went into the bowl. You can do it. But you should not be surprised when something tastes wrong and you cannot tell why.
The Session PR is how you start asking what went into the bowl.