How we hand a job to a team of AI agents — some working at once, some checking each other — and stitch their work back into one verified result. Built on Opus 4.8, proven this week on the American Heart Association report.
A dynamic workflow is a small, fixed program that hands a job to a team of AI agents — some working at the same time, some checking each other's work — and stitches their answers back into one result. It runs on Opus 4.8. We use it instead of asking a single agent to do everything in one pass.
The shift came from watching where a single agent breaks. When one agent does a big job alone, it fails in two ways: it misses problems a narrower focus would catch, and its own claims go unchecked. A workflow fixes each one directly.
Divide, and add adversaries. Those are the two moves. Below is what each one looks like in practice.
In our grammar review, four reviewers read the same report at once, each owning one slice: one watches for self-reference, one for inverted phrasing, one for repeated ideas, one for jargon and weak headlines. Coverage comes from having more eyes, each looking for one thing.
Nothing an agent claims is taken on faith. A synthesizer re-checks every finding against the actual file, drops the false positives, and returns only what holds up. On the grammar review that turned 20 raw findings into 12 verified edits — the other 8 did not survive the check.
The same rulebook runs every time, so the output is consistent. Agents work in parallel, so it's fast. And every agent's reasoning is saved, so any result can be traced back and audited later.
This is not a thought experiment. Across one day, four dynamic workflows ran against our work and produced shippable output.
A rubric-compiler turned our written rules into one shared scoresheet, four reviewers each judged the report through their own lens, and a synthesizer merged them into an apply-ready edit list — 20 findings in, 12 verified edits out.
An inventory step listed what the report drew on, five auditors each scored one dimension of rigor, and a synthesizer returned the verdict: rigor 52 out of 100 — a clear signal the analysis had been left shallow.
Four analysts re-did the underlying work, seven composers wrote the sections, and a packager assembled them — 1.2 million tokens over 32 minutes, 7 sections regenerated. This is the run behind the AHA v07 report, the gold standard our Report Studio models are tuned to match.
Five browser agents opened the live visualizations, took screenshots, fixed the cramped text, and re-checked their own work against the rendered result — each one looping screenshot, fix, verify until the graphs read cleanly.
One detail worth holding onto, because it shows the adversary working in our favor and not just against us. The fact-check gate corrected a competitor's revenue upward — Oura from roughly $350M to roughly $1B. A weaker process would have quietly left the smaller number in. The correction made the report's value-leak argument stronger, because the money walking out the door was bigger than we'd first written.
The day's output: the AHA v07 gold-standard report rebuilt end to end (aha-v07-opus48.pages.dev), the report collection site (aha-report-collection.pages.dev), and a working demo of the engine itself (shuriq-grammar-engine.pages.dev).
We stopped asking one agent to do a big job in one pass. Instead the work splits into three stages that always run in order: one agent builds the shared rulebook, several agents judge in parallel, and one agent checks the findings against reality before anything ships.
One agent reads the source of truth — our written rules, the data, the prior report — and turns it into a single shared reference everyone else scores against. This is the "everyone judges from the same page" step. When we re-ran the grammar review on the AHA report, this agent compiled every editorial rule into one rubric before a single reviewer started reading.
Several agents run at the same time, each owning one lens. On the grammar review, four reviewers read the same report at once: one watching for self-reference, one for inverted phrasing, one for repeated ideas, one for jargon and weak headlines. They read and judge only — they change nothing. Because no agent edits the file, none of them can collide.
One agent merges the parallel findings, removes duplicates, re-checks every claim against the live file, and drops anything that does not hold up. What comes back is one clean, apply-ready answer. On the grammar review, the four reviewers raised 20 findings; the synthesizer verified them down to 12 real edits. A human applies the result once, at the end, in one controlled pass.
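The three stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the production workflow: the function names (`compile_rubric`, `review`, `synthesize`, `run_workflow`) are hypothetical, and each reviewer is a stand-in that matches a fixed phrase where a real agent would call a model.

```python
import asyncio

def compile_rubric(sources):
    """Stage 1: merge rule sources into one deduplicated, ordered rubric."""
    seen, rubric = set(), []
    for rule in (r.strip() for src in sources for r in src):
        key = rule.lower()
        if rule and key not in seen:
            seen.add(key)
            rubric.append(rule)
    return rubric

async def review(lens, rubric, text):
    """Stage 2: one read-only reviewer (hypothetical stand-in for an agent).
    It flags every line containing the phrase its lens watches for.
    The rubric is passed in so every reviewer shares one standard;
    this simple stand-in does not consult it."""
    await asyncio.sleep(0)  # yield control, as a real model call would
    return [{"lens": lens["name"], "find": line}
            for line in text.splitlines() if lens["phrase"] in line]

def synthesize(findings, text):
    """Stage 3: drop duplicates and anything not present in the live text."""
    verified, seen = [], set()
    for finding in findings:
        if finding["find"] in seen or finding["find"] not in text:
            continue
        seen.add(finding["find"])
        verified.append(finding)
    return verified

async def run_workflow(sources, lenses, text):
    rubric = compile_rubric(sources)
    # Reviewers run concurrently and never modify `text`, so they cannot collide.
    batches = await asyncio.gather(*(review(lens, rubric, text) for lens in lenses))
    raw = [finding for batch in batches for finding in batch]
    return rubric, raw, synthesize(raw, text)

# Invented sample data, for illustration only.
draft = "This brief will show the gap.\nIt is not a trend, but a shift."
lenses = [
    {"name": "self-reference", "phrase": "This brief"},
    {"name": "inversion", "phrase": "not a"},
    {"name": "jargon", "phrase": "brief"},
]
sources = [["No self-reference", "no self-reference ", "Lead with the claim"]]
rubric, raw, edits = asyncio.run(run_workflow(sources, lenses, draft))
print(len(rubric), len(raw), len(edits))
```

In this toy run two lenses flag the same line, so the synthesizer returns fewer findings than the reviewers raised. Nothing in the sketch writes to the draft, which matches the single controlled human pass at the end.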
Today four of these workflows ran back to back: the grammar review above, a methodology audit (an inventory agent, five dimension auditors, one synthesizer — rigor scored 52 out of 100), a deep re-run that rebuilt seven sections of the report (four analysts, seven composers, a packager — 1.2 million tokens across 32 minutes), and a browser-driven polish pass where five agents fixed squished graph text by taking live screenshots. Three sites shipped from that work: the gold-standard AHA report, its report collection site, and the Grammar Engine demo.
Nothing an agent claims is taken on faith. The synthesize stage re-checks every finding against the live file before it counts. Today's fact-check gate caught a competitor number that was too low: we had Oura's revenue at roughly $350M, and it was closer to $1B. The gate corrected it upward, which made the report's core argument about leaked value stronger. It would have made the correction either way, because it puts accuracy ahead of protecting the conclusion.
A correction made once becomes a rule. A rule becomes an agent's standing instruction. That instruction then runs on every report from then on. Nothing we learn has to be learned twice — the system gets sharper each pass instead of starting fresh, which is why the AHA report can serve as the gold standard the Report Studio is tuned to match.
A workflow is a small crew of agents we spin up for one job. Each agent has a defined role and hands its output to the next. We ran four workflows today to rebuild the AHA report from scratch, producing the gold standard our Report Studio models are tuned to match.
Checks a draft against our writing rules and fixes what breaks them. One agent compiles the rulebook into a checklist, four reviewers read the draft in parallel — each hunting a different class of problem — and a synthesizer merges their notes into a single edit list, dropping duplicates and contradictions. Today: 20 flagged findings became 12 verified edits.
Grades how rigorous the analysis underneath a report really is. One agent takes inventory of every claim and source, then five auditors each pressure-test one dimension of quality, and a synthesizer rolls their scores into one honest number. Today it came back at 52 out of 100 — a blunt signal of where the work still needs to be tighter.
Regenerates a whole report end to end when the foundations have moved. Four analysts rebuild the research, seven composers each write their assigned section, and a packager assembles the finished site. Today this crew burned 1.2 million tokens in 32 minutes and regenerated 7 sections of the AHA report from the ground up.
Cleans up the charts a reader actually sees. Five browser agents open the live graphs the way a visitor would, spot what's broken — squished, overlapping labels — and fix the layout until the text breathes. Run with Playwright, the tool that lets an agent drive a real browser and read the rendered page, not just the code.
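One way to picture the "spot, fix, re-check" loop is as a geometry test on label bounding boxes. The sketch below is an assumption about how such a check could work, not a description of what the browser agents actually ran. The box format mirrors the dict that Playwright's `Locator.bounding_box()` returns (`x`, `y`, `width`, `height`); the functions `overlapping_labels` and `polish`, and the `measure` and `fix` callables, are hypothetical.

```python
from itertools import combinations

def boxes_overlap(a, b, padding=0.0):
    """True when two axis-aligned boxes intersect, or sit closer than `padding`."""
    return (a["x"] < b["x"] + b["width"] + padding and
            b["x"] < a["x"] + a["width"] + padding and
            a["y"] < b["y"] + b["height"] + padding and
            b["y"] < a["y"] + a["height"] + padding)

def overlapping_labels(boxes, padding=0.0):
    """Return index pairs of labels whose boxes collide."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(boxes), 2)
            if boxes_overlap(a, b, padding)]

def polish(measure, fix, max_rounds=5):
    """Measure, fix, and re-measure until no labels overlap or rounds run out.
    `measure` returns the current boxes; `fix` receives the colliding pairs."""
    for _ in range(max_rounds):
        collisions = overlapping_labels(measure())
        if not collisions:
            return True
        fix(collisions)
    return not overlapping_labels(measure())

# Invented sample layout: the first two labels collide, the third is clear.
labels = [
    {"x": 0, "y": 0, "width": 50, "height": 12},
    {"x": 40, "y": 5, "width": 50, "height": 12},
    {"x": 200, "y": 0, "width": 50, "height": 12},
]

def nudge_right(collisions):
    # Toy fix: push the second label of each colliding pair to the right.
    for _, j in collisions:
        labels[j]["x"] += 60

print(overlapping_labels(labels))
print(polish(lambda: labels, nudge_right))
```

The point of the design is that the check reads the rendered geometry, not the chart's source code, so a layout that looks fine in code but collides on screen still fails.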
The grammar review that tuned today's gold-standard AHA report is a three-stage pipeline of small, single-purpose agents. One compiles the rules everyone scores against, four review the draft from four different angles at once, and one adversary verifies every finding against the live file before a single word changes. Nothing ships on faith.
One agent reads every source of editorial law — the grammar spec plus eleven standing correction-memories the team has accumulated — and distills it into a single deduplicated rubric. For the AHA run it collapsed all of that into 15 distinct rules. The point is simple: all four reviewers judge from the same page, so we get four lenses on one standard instead of four people inventing their own.
Four agents read the same draft at the same time, each hunting one class of problem: self-reference, inversion-and-slop language, argument progression, and scaffolding-or-headline defects. The lenses overlap on purpose, so the same line can get flagged twice. That duplication is what the next stage is built to clean up.
The reviewers' raw findings overlap, conflict, and contain false positives. The synthesizer re-reads the actual current text, re-checks every finding against it, drops the noise, merges duplicates, and returns one clean, ordered, apply-ready edit list. In the AHA run it took 20 raw findings down to 12 verified edits — every find-string confirmed to match the live file exactly once before the edit was allowed out. A confident wrong finding never reaches the file, because a separate agent re-grounds it against reality first.
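The "matches the live file exactly once" rule is simple enough to show directly. The sketch below is a minimal illustration, assuming each proposed edit is a dict with a `find` string and a `replace` string; the function name `verify_edits` and the edit shape are hypothetical, not the synthesizer's actual interface.

```python
def verify_edits(live_text, proposed):
    """Split proposed edits into verified and rejected.
    An edit is verified only if its find-string occurs exactly once
    in the live text (str.count counts non-overlapping occurrences)."""
    verified, rejected = [], []
    for edit in proposed:
        hits = live_text.count(edit["find"]) if edit["find"] else 0
        target = verified if hits == 1 else rejected
        target.append({**edit, "hits": hits})
    return verified, rejected

# Invented sample data, for illustration only.
live = "The market is not small, but growing. The market leaks value."
proposed = [
    {"find": "not small, but growing", "replace": "growing"},  # one match
    {"find": "The market", "replace": "The category"},         # two matches: ambiguous
    {"find": "synergy", "replace": "overlap"},                 # no match: stale or invented
]
verified, rejected = verify_edits(live, proposed)
print([e["find"] for e in verified])
print([(e["find"], e["hits"]) for e in rejected])
```

Zero matches means the finding no longer describes the file; more than one means the edit cannot be applied unambiguously. Either way it is held back before anyone touches the text.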
At the center of the rubric sit three gate rules for argument progression — the failure mode where one idea gets restated in slot after slot, each version well-written and true, but adding nothing the others don't already carry. Token-level checks never catch a repeated idea; these three do.
Checks that an enumerated set — the gap cards, the recommendations, the risks — is genuinely distinct, with no two items making the same load-bearing claim from different angles. It runs early, on the bare propositions before any prose exists, and a collision blocks the render until the duplicate is regenerated into its own idea. It also writes the claim ledger that the next gate reads.
A reviewer must be able to name the one thing each section adds that no earlier section added. If a section only re-proves a claim already established upstream, it fails and blocks publish. Where R-DIST.1 keeps the set distinct up front, R-PROG.1 keeps the finished, ordered report moving forward — it reads the ledger R-DIST.1 wrote.
A cheap, certain check that runs first and needs no model: it flags a unit restating an upstream idea, and catches two list items whose headlines collide (same actors, same claim shape) before any judgment call is spent. It's the high-confidence pre-filter for the two smarter gates above, and it blocks publish on the same path as any other rule.
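A model-free headline check of this kind could be approximated with plain word overlap. The sketch below is an assumption, not the rule's real implementation: it uses Jaccard similarity over content words as a crude proxy for "same actors, same claim shape", and the stopword list, the 0.6 threshold, and the function names are all hypothetical.

```python
import re
from itertools import combinations

# Small illustrative stopword list; a real check would need a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "for", "on"}

def content_words(headline):
    """Lowercase the headline and keep only non-stopword tokens."""
    return {w for w in re.findall(r"[a-z0-9']+", headline.lower())
            if w not in STOPWORDS}

def colliding_headlines(headlines, threshold=0.6):
    """Return (i, j, score) for headline pairs whose word sets overlap heavily."""
    collisions = []
    for (i, a), (j, b) in combinations(enumerate(headlines), 2):
        words_a, words_b = content_words(a), content_words(b)
        if not words_a or not words_b:
            continue
        score = len(words_a & words_b) / len(words_a | words_b)
        if score >= threshold:
            collisions.append((i, j, round(score, 2)))
    return collisions

# Invented headlines, for illustration only.
headlines = [
    "Wearables capture the data hospitals ignore",
    "Hospitals ignore the data wearables capture",
    "Insurers price risk without patient context",
]
print(colliding_headlines(headlines))
```

A check like this is cheap and deterministic, which is why it can run before any model judgment is spent. It will miss paraphrased duplicates that share few words, and those are left to the two smarter gates.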
The four reviewers, each with one job:
Catches the report narrating itself — phrases that describe what the brief is doing instead of just doing it. The argument should make its case, never announce that it's about to.
Hunts the most obvious AI tell — the "not X, but Y" construction — along with filler and buzzword language across the whole draft. Drop the negation, lead with the affirmative claim.
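The "not X, but Y" construction is regular enough that a pattern match can serve as a first pass. The sketch below is an illustration under assumptions: the pattern, the six-word window, and the name `find_inversions` are hypothetical, and a regex like this will both miss variants and flag some legitimate sentences, so it would feed a reviewer, not replace one.

```python
import re

# "not", then one to six words, an optional comma, then "but".
# Sentence punctuation between "not" and "but" breaks the match.
NOT_BUT = re.compile(r"\bnot\b(?:\s+\w+){1,6}?,?\s+but\b", re.IGNORECASE)

def find_inversions(text):
    """Return (line_number, matched_span) for each 'not X, but Y' hit."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in NOT_BUT.finditer(line):
            hits.append((line_no, match.group(0)))
    return hits

# Invented sample text, for illustration only.
sample = (
    "Revenue is not flat, but rising.\n"
    "Revenue is rising.\n"
    "This is not ready. But we ship."
)
print(find_inversions(sample))
```

Only the first line is flagged. The second already leads with the affirmative claim, and in the third the period separates "not" from "But", so it is a different construction.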
Reads the report end to end and tests whether each section earns its place by adding something new. In the AHA run it found the central thesis restated across four slots and drove each one to carry a distinct point instead.
Strips internal section labels and method-jargon that leaked into the body, and checks that headlines name the actual story — the winners and the losing incumbent — rather than abstract enumerations.
The full verbatim text of every rule and every agent instruction lives in the vault notes and the Dynamic-Workflows base. This page is the working anatomy; those sources hold the statute.
We took a report that already read well and put it through the full set of dynamic workflows. By the end of the day it was the gold standard the rest of the studio is now tuned to match. Here is what happened, in order.
AHA is a brand intelligence report we'd already written and published. It read cleanly. The prose was tight, the argument landed, and on a first read nothing looked wrong. That is exactly the case these workflows are built for, because a report can read well and still be only half-rigorous underneath.
So before touching the writing, we ran a methodology audit on the analysis itself. One workflow took an inventory of the report, handed it to five auditors who each scored a different dimension of rigor, and a synthesizer pulled their scores together. The verdict on the analysis was 52 out of 100. The negative-space work — finding what competitors weren't saying — was strong. The value-flow work (how money actually moves through the category) and the ontology work (the underlying map of who's who and what's what) were thin. The thesis was right; the rigor under it was half-built.
Closing that gap was its own workflow: the deep re-run. Four analysts went back into the raw material and rebuilt the analysis from the ground up. Seven composers then wrote the regenerated sections. A packager assembled the result. The run used 1.2 million tokens in 32 minutes and produced seven freshly written sections, each carrying analysis the original simply didn't have.
Inventory, then five dimension auditors, then a synthesizer. Scored the analysis at 52/100 and named exactly where it was thin.
Four analysts rebuilt the analysis, seven composers rewrote the sections, a packager assembled it. 1.2M tokens, 32 minutes, seven regenerated sections.
A rubric-compiler set the standard, four reviewers read the prose against it, a synthesizer reconciled them. 20 findings narrowed to 12 verified edits.
Five browser agents drove the live pages with Playwright, caught graphs whose text was squished, and fixed them where readers would actually see them.
With the analysis rebuilt, the grammar engine cleaned the prose. A rubric-compiler wrote the standard to judge against, four reviewers each read the draft for different problems, and a synthesizer reconciled their notes. They raised 20 findings; after the synthesizer checked each one, 12 became real edits. The rest were noise, and the workflow's job was to tell the difference rather than apply all 20.
Then the fact-check gate went after the numbers — and this is the part worth sitting with. The gate made five corrections, and they pushed the figures up. A key competitor we'd sized at roughly $350M was actually closer to $1B. That correction made the report's central argument stronger: if the players capturing the value are bigger than we thought, the value leaking out of our client's position is bigger too. The fact-check didn't soften the thesis. It reinforced it.
Last, the Playwright viz-polish workflow. Five browser agents opened the actual published pages, looked at the charts the way a reader would, and found graphs where the text had gotten squished. They fixed them on the live pages, so the version a person opens is the version that's correct.
The report that came out the other side is the AHA v07 gold standard. It's the version the Report Studio models are now tuned to match, the anchor of the AHA report collection, and the reference behind the Grammar Engine demo. Same brand, same starting draft — a different report, because every workflow did one specific job and handed a stronger draft to the next.
Today was the system running on itself. We rebuilt a gold-standard report end to end, shipped two more sites around it, and let four dynamic workflows do the work we used to do by hand. Everything below went live in a single day.
Three sites are live right now:
The AHA v07 report, rebuilt from scratch today. This is the bar the Report Studio is tuned to hit — the reference every other report is measured against. aha-v07-opus48.pages.dev
A home for the AHA reports as a set, so you can move between them and see how the work holds together. aha-report-collection.pages.dev
A live look at the rules that keep reports honest — the engine that catches repetition and weak arguments before anything ships. shuriq-grammar-engine.pages.dev
Four workflows ran across the day. Each one is a team of focused agents — they split the job up, check each other, and hand back a finished result.
A rule-builder set the standard, four reviewers read the draft against it, and a synthesizer pulled their notes together. It surfaced 20 issues; 12 became real, checked edits to the report.
We took inventory of the method, then sent five auditors at it — one per dimension — and a synthesizer scored the whole thing. The honest verdict: 52 out of 100 on rigor. A real number we can now improve against.
Four analysts and seven writers regenerated the report from the ground up, with a packager assembling the final piece. It burned 1.2 million tokens in 32 minutes and rewrote 7 sections.
Five browser agents opened the live graphs, found text that was squished and hard to read, and fixed it in place — the kind of finish work a person would otherwise do by eye.
One moment from today is worth holding onto. The fact-check step caught a competitor number and corrected it upward: Oura's revenue moved from roughly $350M to roughly $1B. That could have softened the report. It did the opposite, because a bigger competitor means more value is leaking away than we had first written.
We also mapped the concepts behind all of this as a knowledge graph, so the relationships between the pieces are visible at a glance: infranodus.com/sensecollective/totem-dynamic-workflows.
Here is the principle the day proves: the system gets better by running. Every correction we make — a fixed number, a tightened argument, a cleaner graph — becomes a permanent rule. Today's work doesn't fade when the day ends. It compounds.