Totem · SHUR IQ · System Note

The Dynamic Workflow System

How we hand a job to a team of AI agents — some working at once, some checking each other — and stitch their work back into one verified result. Built on Opus 4.8, proven this week on the American Heart Association report.

Internal · for the team · 9 June 2026
Dynamic Workflows

One job, a team of agents, one clean answer

A dynamic workflow is a small, fixed program that hands a job to a team of AI agents — some working at the same time, some checking each other's work — and stitches their answers back into one result. It runs on Opus 4.8. We use it instead of asking a single agent to do everything in one pass.

The shift came from watching where a single agent breaks. When one agent does a big job alone, it fails in two ways, and a workflow fixes each one directly.

One agent, working alone
was It runs out of attention. Asked to read a long report end to end, it misses things — a repeated point on page 9, a weak headline three sections down. And no one checks it, so a confident wrong answer ships unnoticed.
now We divide the work so each agent reads through one lens, and nothing falls through the cracks. Then we add an adversary — a separate agent that re-checks every claim against the live file before anything is applied, and throws out the false alarms.

Divide, and add adversaries. Those are the two moves. Below is what each one looks like in practice.

Divide the work

In our grammar review, four reviewers read the same report at once, each owning one slice: one watches for self-reference, one for inverted phrasing, one for repeated ideas, one for jargon and weak headlines. Coverage comes from having more eyes, each looking for one thing.

Add adversaries

Nothing an agent claims is taken on faith. A synthesizer re-checks every finding against the actual file, drops the false positives, and returns only what holds up. On the grammar review that turned 20 raw findings into 12 verified edits — the other 8 did not survive the check.

Why it's better, plainly

The same rulebook runs every time, so the output is consistent. Agents work in parallel, so it's fast. And every agent's reasoning is saved, so any result can be traced back and audited later.

This is not a thought experiment. Across one day, four dynamic workflows ran against our work and produced shippable output.

4
workflows run in a day
20 → 12
grammar findings → verified edits
1.2M
tokens in the deepest re-run (32 min)
3
live sites shipped

grammar-review

Score the report against every editorial rule

A rubric-compiler turned our written rules into one shared scoresheet, four reviewers each judged the report through their own lens, and a synthesizer merged them into an apply-ready edit list — 20 findings in, 12 verified edits out.

methodology-audit

Check whether the best analytical methods were even used

An inventory step listed what the report drew on, five auditors each scored one dimension of rigor, and a synthesizer returned the verdict: rigor 52 out of 100 — a clear signal the analysis had been left shallow.

deep-rerun

Rebuild the analysis with the methods that were missing

Four analysts re-did the underlying work, seven composers wrote the sections, and a packager assembled them — 1.2 million tokens over 32 minutes, 7 sections regenerated. This is the run behind the AHA v07 report, the gold standard our Report Studio models are tuned to match.

viz-polish

Fix squished graph text by actually looking at the page

Five browser agents opened the live visualizations, took screenshots, fixed the cramped text, and re-checked their own work against the rendered result — each one looping screenshot, fix, verify until the graphs read cleanly.

One detail worth holding onto, because it shows the adversary working in our favor and not just against us. The fact-check gate corrected a competitor's revenue upward — Oura from roughly $350M to roughly $1B. A weaker process would have quietly left the smaller number in. The correction made the report's value-leak argument stronger, because the money walking out the door was bigger than we'd first written.

The day's output: the AHA v07 gold-standard report rebuilt end to end (aha-v07-opus48.pages.dev), the report collection site (aha-report-collection.pages.dev), and a working demo of the engine itself (shuriq-grammar-engine.pages.dev).

The shape

Every workflow runs the same three stages

We stopped asking one agent to do a big job in one pass. Instead the work splits into three stages that always run in order: one agent builds the shared rulebook, several agents judge in parallel, and one agent checks the findings against reality before anything ships.

Stage 1

Compile

One agent reads the source of truth — our written rules, the data, the prior report — and turns it into a single shared reference everyone else scores against. This is the "everyone judges from the same page" step. When we re-ran the grammar review on the AHA report, this agent compiled every editorial rule into one rubric before a single reviewer started reading.

Stage 2

Fan out

Several agents run at the same time, each owning one lens. On the grammar review, four reviewers read the same report at once: one watching for self-reference, one for inverted phrasing, one for repeated ideas, one for jargon and weak headlines. They read and judge only — they change nothing. Because no agent edits the file, none of them can collide.

Stage 3

Synthesize

One agent merges the parallel findings, removes duplicates, re-checks every claim against the live file, and drops anything that does not hold up. What comes back is one clean, apply-ready answer. On the grammar review, the four reviewers raised 20 findings; the synthesizer verified them down to 12 real edits. A human applies the result once, at the end, in one controlled pass.
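The three stages above can be sketched in a few lines. This is a minimal illustration and not the production workflow: the reviewers are simple phrase matchers standing in for model calls, and every name here (`Finding`, `make_phrase_reviewer`, `run_workflow`) is hypothetical.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class Finding:
    lens: str  # which reviewer raised it
    find: str  # the exact text the reviewer wants changed


# A reviewer takes (rubric, draft) and returns findings. It never edits the draft.
Reviewer = Callable[[str, str], Awaitable[list[Finding]]]


async def compile_rubric(sources: list[str]) -> str:
    # Stage 1 (stub): one agent turns the written rules into one shared reference.
    return "\n".join(s.strip() for s in sources if s.strip())


def make_phrase_reviewer(lens: str, phrases: list[str]) -> Reviewer:
    # Stand-in for a model-backed reviewer: flags known phrases it sees in the draft.
    async def reviewer(rubric: str, draft: str) -> list[Finding]:
        return [Finding(lens, p) for p in phrases if p in draft]

    return reviewer


def synthesize(findings: list[Finding], draft: str) -> list[Finding]:
    # Stage 3: merge, drop duplicates, and drop anything not present in the live draft.
    seen: set[str] = set()
    kept: list[Finding] = []
    for f in findings:
        if f.find in seen or f.find not in draft:
            continue
        seen.add(f.find)
        kept.append(f)
    return kept


async def run_workflow(
    sources: list[str], reviewers: list[Reviewer], draft: str
) -> list[Finding]:
    rubric = await compile_rubric(sources)
    # Stage 2: all reviewers run concurrently against the same read-only draft.
    batches = await asyncio.gather(*(r(rubric, draft) for r in reviewers))
    raw = [f for batch in batches for f in batch]
    return synthesize(raw, draft)
```

Because the reviewers only return findings and never write, the fan-out needs no locking. The only write happens after `synthesize`, in the single controlled pass a human applies.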

Today four of these workflows ran back to back: the grammar review above, a methodology audit (an inventory agent, five dimension auditors, one synthesizer — rigor scored 52 out of 100), a deep re-run that rebuilt seven sections of the report (four analysts, seven composers, a packager — 1.2 million tokens across 32 minutes), and a browser-driven polish pass where five agents fixed squished graph text by taking live screenshots. Three sites shipped from that work: the gold-standard AHA report, its report collection site, and the Grammar Engine demo.

Verified, not trusted

Nothing an agent claims is taken on faith. The synthesize stage re-checks every finding against the live file before it counts. Today's fact-check gate caught a competitor number that was too low: Oura's revenue was closer to a billion dollars than the ~$350M we had, and the gate corrected it upward. The gate puts the truth ahead of protecting the conclusion, and here the truth helped. A bigger rival means more leaked value, which made the report's core argument stronger.

Accumulation

A correction made once becomes a rule. A rule becomes an agent's standing instruction. That instruction then runs on every report from then on. Nothing we learn has to be learned twice — the system gets sharper each pass instead of starting fresh, which is why the AHA report can serve as the gold standard the Report Studio is tuned to match.
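One way to picture the correction-to-rule-to-instruction chain is a small rule book that only grows. This is a sketch under assumptions: the `RuleBook` class, its method names, and the instruction wording are all hypothetical and not the stored format in the vault.

```python
from dataclasses import dataclass, field


@dataclass
class RuleBook:
    rules: list[str] = field(default_factory=list)

    def learn(self, correction: str) -> bool:
        """Record a correction as a rule. Returns False if it is already known."""
        rule = " ".join(correction.split())  # collapse stray whitespace
        if rule.lower() in (r.lower() for r in self.rules):
            return False  # learned once already, nothing to add
        self.rules.append(rule)
        return True

    def standing_instruction(self, role: str) -> str:
        """Render every rule into the instruction an agent carries on each run."""
        lines = [f"You are the {role}. Apply every rule below on every report:"]
        lines += [f"{i}. {r}" for i, r in enumerate(self.rules, 1)]
        return "\n".join(lines)
```

Calling `learn` twice with the same correction leaves a single rule, which is the "nothing has to be learned twice" property in miniature.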

3
stages, every time
4
workflows ran today
20→12
findings verified to real edits
52/100
methodology rigor scored

How the work actually runs

Four workflows, four team shapes

A workflow is a small crew of agents we spin up for one job, each with a defined role, that hands its output to the next. We ran all four today to rebuild the AHA report from scratch — the gold standard our Report Studio models are tuned to match.

3
sites shipped today
7
sections regenerated
1.2M
tokens, one rerun
32 min
start to finish

Grammar review

Checks a draft against our writing rules and proposes a fix for whatever breaks them. One agent compiles the rulebook into a checklist, four reviewers read the draft in parallel, each hunting a different class of problem, and a synthesizer merges their notes into a single edit list, dropping duplicates, contradictions, and false positives. Today: 20 flagged findings became 12 verified edits.

Methodology audit

Grades how rigorous the analysis underneath a report really is. One agent takes inventory of every claim and source, then five auditors each pressure-test one dimension of quality, and a synthesizer rolls their scores into one honest number. Today it came back at 52 out of 100 — a blunt signal of where the work still needs to be tighter.

Deep rerun

Regenerates a whole report end to end when the foundations have moved. Four analysts rebuild the research, seven composers each write their assigned section, and a packager assembles the finished site. Today this crew burned 1.2 million tokens in 32 minutes and regenerated 7 sections of the AHA report from the ground up.

Viz polish

Cleans up the charts a reader actually sees. Five browser agents open the live graphs the way a visitor would, spot what's broken (squished, overlapping labels), and fix the layout until the text breathes. The crew runs on Playwright, a tool that lets an agent drive a real browser and read the rendered page, not just the code.
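The screenshot, fix, verify loop in the viz polish can be sketched without a browser by treating each label as a rectangle. This is a minimal sketch, not the agents' actual code: `Box`, `count_collisions`, and `polish` are hypothetical names, and the `measure` and `fix` hooks stand in for whatever the browser agent does on the rendered page.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    w: float
    h: float


def overlaps(a: Box, b: Box) -> bool:
    # Axis-aligned rectangles overlap when they intersect on both axes.
    # Boxes that only touch at an edge do not count as overlapping.
    return a.x < b.x + b.w and b.x < a.x + a.w and a.y < b.y + b.h and b.y < a.y + a.h


def count_collisions(boxes: list[Box]) -> int:
    # Number of label pairs that overlap.
    return sum(overlaps(a, b) for i, a in enumerate(boxes) for b in boxes[i + 1:])


def polish(
    measure: Callable[[], list[Box]],
    fix: Callable[[], None],
    max_rounds: int = 5,
) -> bool:
    """Measure the rendered labels, fix, and re-measure until clean or out of rounds."""
    for _ in range(max_rounds):
        if count_collisions(measure()) == 0:
            return True
        fix()
    return count_collisions(measure()) == 0
```

In a Playwright-driven run, `measure` would plausibly read label positions from the live page (for example through a locator's bounding box) and `fix` would adjust the chart layout. Both are assumptions here. The bounded `max_rounds` keeps an agent from looping forever on a chart it cannot repair.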

The fact-check gate, working as designed
was We had a competitor, Oura, sized at roughly $350M.
now The gate corrected it upward to about $1B, and a bigger rival made the report's value-leak argument stronger.

Anatomy

What the grammar-review workflow is made of

The grammar review that tuned today's gold-standard AHA report is a three-stage pipeline of small, single-purpose agents. One compiles the rules everyone scores against, four review the draft from four different angles at once, and one adversary verifies every finding against the live file before a single word changes. Nothing ships on faith.

Stage 1 · Compile

The rubric-compiler

One agent reads every source of editorial law — the grammar spec plus eleven standing correction-memories the team has accumulated — and distills it into a single deduplicated rubric. For the AHA run it collapsed all of that into 15 distinct rules. The point is simple: all four reviewers judge from the same page, so we get four lenses on one standard instead of four people inventing their own.
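The deduplication step can be illustrated with a simple normalizer. This is a sketch under assumptions: the real compiler is a model-backed agent that can merge rules with different wording, while this version only merges rules that differ in case, spacing, or punctuation. The `R01`-style ids are hypothetical and not the rubric's real numbering.

```python
import re


def normalize(rule: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace.
    return " ".join(re.sub(r"[^a-z0-9\s]", "", rule.lower()).split())


def compile_rubric(spec_rules: list[str], memories: list[str]) -> dict[str, str]:
    """Merge the grammar spec and correction-memories into one deduplicated rubric."""
    rubric: dict[str, str] = {}
    seen: set[str] = set()
    for rule in [*spec_rules, *memories]:
        key = normalize(rule)
        if not key or key in seen:
            continue  # empty line, or a rule we already hold
        seen.add(key)
        rubric[f"R{len(rubric) + 1:02d}"] = rule.strip()
    return rubric
```

Spec rules go first, so when a correction-memory restates a spec rule the spec's wording is the one that survives.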

Stage 2 · Review

Four reviewers, in parallel

Four agents read the same draft at the same time, each hunting one class of problem: self-reference, inversion-and-slop language, argument progression, and scaffolding-or-headline defects. The reviewers run side by side without coordinating, so the lenses overlap on purpose and the same line can get flagged twice. That overlap is exactly what the next stage is built to clean up.

Stage 3 · Synthesize

The synthesizer (the adversary)

The reviewers' raw findings overlap, conflict, and contain false positives. The synthesizer re-reads the actual current text, re-checks every finding against it, drops the noise, merges duplicates, and returns one clean, ordered, apply-ready edit list. In the AHA run it took 20 raw findings down to 12 verified edits — every find-string confirmed to match the live file exactly once before the edit was allowed out. A confident wrong finding never reaches the file, because a separate agent re-grounds it against reality first.
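The "matches the live file exactly once" check is small enough to show directly. A minimal sketch, assuming each finding arrives as a find/replace pair. The `Edit` type and the rejection labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    find: str     # text expected in the live file
    replace: str  # proposed replacement


def verify_edits(
    live_text: str, proposed: list[Edit]
) -> tuple[list[Edit], list[tuple[Edit, str]]]:
    """Split proposed edits into verified ones and rejected ones with a reason."""
    verified: list[Edit] = []
    rejected: list[tuple[Edit, str]] = []
    seen: set[str] = set()
    for edit in proposed:
        # str.count returns non-overlapping occurrences; an empty find never counts.
        hits = live_text.count(edit.find) if edit.find else 0
        if edit.find in seen:
            rejected.append((edit, "duplicate"))
        elif hits == 0:
            rejected.append((edit, "not found in live file"))
        elif hits > 1:
            rejected.append((edit, "ambiguous: matches more than once"))
        else:
            seen.add(edit.find)
            verified.append(edit)
    return verified, rejected
```

Requiring exactly one match does two jobs: zero matches exposes a finding the reviewer imagined or that went stale, and multiple matches exposes an edit that could land in the wrong place.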

15
rules compiled into one rubric
4
reviewers, run in parallel
20→12
findings verified into edits

At the center of the rubric sit three gate rules for argument progression — the failure mode where one idea gets restated in slot after slot, each version well-written and true, but adding nothing the others don't already carry. Token-level checks never catch a repeated idea; these three do.

R-DIST.1 — Distinctness

Checks that an enumerated set — the gap cards, the recommendations, the risks — is genuinely distinct, with no two items making the same load-bearing claim from different angles. It runs early, on the bare propositions before any prose exists, and a collision blocks the render until the duplicate is regenerated into its own idea. It also writes the claim ledger that the next gate reads.

R-PROG.1 — Progression

A reviewer must be able to name the one thing each section adds that no earlier section added. If a section only re-proves a claim already established upstream, it fails and blocks publish. Where R-DIST.1 keeps the set distinct up front, R-PROG.1 keeps the finished, ordered report moving forward — it reads the ledger R-DIST.1 wrote.

The deterministic backstop

A cheap, certain check that runs first and needs no model: it flags a unit restating an upstream idea, and catches two list items whose headlines collide (same actors, same claim shape) before any judgment call is spent. It's the high-confidence pre-filter for the two smarter gates above, and it blocks publish on the same path as any other rule.
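A model-free headline collision check might look like the sketch below. The source does not say how the backstop measures a collision, so the token-overlap (Jaccard) measure, the stopword list, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re
from itertools import combinations

# Small illustrative stopword list, not the system's actual one.
STOPWORDS = frozenset({"a", "an", "the", "and", "of", "to", "in", "is", "for", "on"})


def tokens(headline: str) -> frozenset[str]:
    words = re.findall(r"[a-z0-9]+", headline.lower())
    return frozenset(w for w in words if w not in STOPWORDS)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    # Shared tokens divided by all tokens; 0.0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def colliding_headlines(headlines: list[str], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Return index pairs of headlines whose token overlap meets the threshold."""
    sets = [tokens(h) for h in headlines]
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(sets), 2)
        if jaccard(a, b) >= threshold
    ]
```

A check like this is deterministic and costs nothing to run, which is why it can sit in front of the two model-backed gates as a pre-filter. It will miss paraphrases that share few words, and those are left to the gates.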

The four reviewers, each with one job:

Self-reference reviewer

Catches the report narrating itself — phrases that describe what the brief is doing instead of just doing it. The argument should make its case, never announce that it's about to.

Inversion & slop reviewer

Hunts the most obvious AI tell — the "not X, but Y" construction — along with filler and buzzword language across the whole draft. Drop the negation, lead with the affirmative claim.

Progression reviewer

Reads the report end to end and tests whether each section earns its place by adding something new. In the AHA run it found the central thesis faceted across four slots and drove each one to carry a distinct point instead.

Scaffolding & headlines reviewer

Strips internal section labels and method-jargon that leaked into the body, and checks that headlines name the actual story — the winners and the losing incumbent — rather than abstract enumerations.
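The "not X, but Y" construction the inversion reviewer hunts is regular enough that a pattern can pre-flag candidates. This is a rough sketch, not the reviewer's actual logic: the regex only looks for "not" followed by "but" inside one sentence, so it will miss contractions such as "isn't" and may flag sentences that are not true inversions.

```python
import re

# "not", then up to 80 characters that stay inside the sentence, then "but".
INVERSION = re.compile(r"\bnot\b[^.!?;]{1,80}?\bbut\b", re.IGNORECASE)


def find_inversions(text: str) -> list[str]:
    """Return each 'not ... but' span found inside a single sentence."""
    return [m.group(0) for m in INVERSION.finditer(text)]
```

The character class stops at sentence punctuation, so "It did not soften the thesis. But it helped." is left alone. A pattern like this can only nominate candidates; deciding whether the negation should go remains a reviewer's judgment.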

What the progression gate fixed in the AHA report
was The thesis ("owns the science while others own the everyday relationship") was re-proved in the editor's letter, in a gap card, and twice more in the closing block. Four slots, one idea.
now Each unit carries a distinct contribution. The closing block, for instance, lands on convergence and a decade-long head start instead of restating the thesis a fourth time.
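The two model-backed gates, R-DIST.1 and R-PROG.1, share a claim ledger, and the bookkeeping around it can be sketched as follows. This is a simplified stand-in: the real gates judge whether two claims are the same idea, while this sketch only treats claims as equal when their normalized text matches. The function names and the ledger shape are hypothetical.

```python
def normalize(claim: str) -> str:
    return " ".join(claim.lower().split())


def check_distinct(claims: dict[str, str]) -> tuple[dict[str, str], list[tuple[str, str]]]:
    """R-DIST.1-style pass: build the ledger and report items that repeat a claim."""
    ledger: dict[str, str] = {}              # normalized claim -> first item that made it
    collisions: list[tuple[str, str]] = []   # (original item, duplicate item)
    for item_id, claim in claims.items():
        key = normalize(claim)
        if key in ledger:
            collisions.append((ledger[key], item_id))
        else:
            ledger[key] = item_id
    return ledger, collisions


def check_progression(
    sections: list[tuple[str, list[str]]], ledger: dict[str, str]
) -> list[tuple[str, list[str]]]:
    """R-PROG.1-style pass: flag sections that add no claim beyond earlier sections."""
    established: set[str] = set()
    failures: list[tuple[str, list[str]]] = []
    for name, claims in sections:
        keys = {normalize(c) for c in claims}
        if not keys - established:
            # Name the ledger items this section merely restates.
            failures.append((name, sorted(ledger.get(k, "unlisted") for k in keys)))
        established |= keys
    return failures
```

A section passes as soon as it carries one claim that no earlier section carried, which mirrors the "name the one thing each section adds" test.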

The full verbatim text of every rule and every agent instruction lives in the vault notes and the Dynamic-Workflows base. This page is the working anatomy; the statute lives there.

The worked example

The AHA report, rebuilt in a day

We took a report that already read well and put it through the full set of dynamic workflows. By the end of the day it was the gold standard the rest of the studio is now tuned to match. Here is what happened, in order.

AHA is a brand intelligence report we'd already written and published. It read cleanly. The prose was tight, the argument landed, and on a first read nothing looked wrong. That is exactly the case these workflows are built for, because a report can read well and still be only half-rigorous underneath.

So before touching the writing, we ran a methodology audit on the analysis itself. One workflow took an inventory of the report, handed it to five auditors who each scored a different dimension of rigor, and a synthesizer pulled their scores together. The verdict on the analysis was 52 out of 100. The negative-space work — finding what competitors weren't saying — was strong. The value-flow work (how money actually moves through the category) and the ontology work (the underlying map of who's who and what's what) were thin. The thesis was right; the rigor under it was half-built.

The audit's core finding
was "The thesis is right, but the rigor is half-built — strong negative-space, thin value-flow and ontology. 52/100."
now A rebuilt analysis: the value leak quantified in dollars, a toll ladder showing where each player extracts margin, the actuarial mechanism spelled out, the sponsor-conflict named, a dual-population ranking, and a sixth strategic move the first pass never found.

That rebuild was its own workflow — the deep re-run. Four analysts went back into the raw material and rebuilt the analysis from the ground up. Seven composers then wrote the regenerated sections. A packager assembled the result. It burned through 1.2 million tokens in 32 minutes and produced seven freshly written sections, each carrying analysis the original simply didn't have.

workflow 1

Methodology audit

Inventory, then five dimension auditors, then a synthesizer. Scored the analysis at 52/100 and named exactly where it was thin.

workflow 2

Deep re-run

Four analysts rebuilt the analysis, seven composers rewrote the sections, a packager assembled it. 1.2M tokens, 32 minutes, seven regenerated sections.

workflow 3

Grammar review

A rubric-compiler set the standard, four reviewers read the prose against it, a synthesizer reconciled them. 20 findings narrowed to 12 verified edits.

workflow 4

Viz polish

Five browser agents drove the live pages with Playwright, caught graphs whose text was squished, and fixed them where readers would actually see them.

With the analysis rebuilt, the grammar engine cleaned the prose. A rubric-compiler wrote the standard to judge against, four reviewers each read the draft for different problems, and a synthesizer reconciled their notes. They raised 20 findings; after the synthesizer checked each one, 12 became real edits. The rest were noise, and the workflow's job was to tell the difference rather than apply all 20.

Then the fact-check gate went after the numbers — and this is the part worth sitting with. The gate made five corrections, and they pushed the figures up. A key competitor we'd sized at roughly $350M was actually closer to $1B. That correction made the report's central argument stronger: if the players capturing the value are bigger than we thought, the value leaking out of our client's position is bigger too. The fact-check didn't soften the thesis. It reinforced it.

Last, the Playwright viz-polish workflow. Five browser agents opened the actual published pages, looked at the charts the way a reader would, and found graphs where the text had gotten squished. They fixed them on the live pages, so the version a person opens is the version that's correct.

52/100
rigor score the audit gave the original analysis
12
agents in the deep re-run (4 analysts, 7 composers, 1 packager)
20→12
grammar findings narrowed to verified edits
5
fact-check corrections — all pushing the numbers upward

The report that came out the other side is the AHA v07 gold standard. It's the version the Report Studio models are now tuned to match, the anchor of the AHA report collection, and the reference behind the Grammar Engine demo. Same brand, same starting draft — a different report, because every workflow did one specific job and handed a stronger draft to the next.

One day, proving itself

What the system did today

Today was the system running on itself. We rebuilt a gold-standard report end to end, shipped two more sites around it, and let four dynamic workflows do the work we used to do by hand. Everything below went live in a single day.

Three sites are live right now:

The gold-standard report

The AHA v07 report, rebuilt from scratch today. This is the bar the Report Studio is tuned to hit — the reference every other report is measured against. aha-v07-opus48.pages.dev

The report collection

A home for the AHA reports as a set, so you can move between them and see how the work holds together. aha-report-collection.pages.dev

The grammar-engine demo

A live look at the rules that keep reports honest — the engine that catches repetition and weak arguments before anything ships. shuriq-grammar-engine.pages.dev

Four workflows ran across the day. Each one is a team of focused agents — they split the job up, check each other, and hand back a finished result.

Workflow 1

Grammar review

A rule-builder set the standard, four reviewers read the draft against it, and a synthesizer pulled their notes together. It surfaced 20 issues; 12 became real, checked edits to the report.

Workflow 2

Methodology audit

We took inventory of the method, then sent five auditors at it — one per dimension — and a synthesizer scored the whole thing. The honest verdict: 52 out of 100 on rigor. A real number we can now improve against.

Workflow 3

Deep re-run

Four analysts and seven writers regenerated the report from the ground up, with a packager assembling the final piece. It used 1.2 million tokens in 32 minutes and rewrote 7 sections.

Workflow 4

Visual polish

Five browser agents opened the live graphs, found text that was squished and hard to read, and fixed it in place — the kind of finish work a person would otherwise do by eye.

One moment from today is worth holding onto. The fact-check step caught a competitor number and corrected it upward — Oura's revenue moved from roughly $350M to roughly $1B. That could have softened the report. It did the opposite.

The fact-check that made the case stronger
was Oura at ~$350M, a competitor that looked containable
now Oura at ~$1B, a bigger rival capturing value AHA is leaving on the table, which sharpens the report's core point

We also mapped the concepts behind all of this as a knowledge graph, so the relationships between the pieces are visible at a glance: infranodus.com/sensecollective/totem-dynamic-workflows.

3
sites live
4
workflows run
12
verified edits
7
sections regenerated
1.2M
tokens, one re-run

Here is the principle the day proves: the system gets better by running. Every correction we make — a fixed number, a tightened argument, a cleaner graph — becomes a permanent rule. Today's work doesn't fade when the day ends. It compounds.