Team · MCP · Pro

One AI alone gets things confidently wrong.
Three frontier models catch what one missed.
Team gets them in the room: a blind round first, then discussion until they reach agreement.

We love Claude Code, Gemini, Codex and Grok. They are well-trained and smart. But they make mistakes. Imagine sending a tough prompt to Claude Opus, Gemini Pro, Codex and Grok at the same time. Each one goes into a private room and does its own research. They come back to the table and share their findings. Then they work collaboratively, recognising each other’s good ideas and errors, building on them round-robin for several turns, until they reach agreement. They put a final proposal to you. That is what Twira Team does, and the results are remarkable.

AI alone

  • One model, one set of blind spots
  • Confident-but-wrong answers go unchecked
  • Manual: spawn three sessions, copy-paste, reconcile
  • No structured discussion phase
  • Same model stuck in the same loop
  • No record of what was debated

Twira Team powertool

  • Three or four frontier models in one call
  • Blind round catches independent errors
  • Round-robin discussion sharpens the answer
  • Optional synthesis pass reconciles outputs
  • Different providers, different training data
  • Every session saved with a display ID

One AI hallucinates alone. Three AIs catch each other. Team gets them in the room.

ask

Borrow one model’s answer on one question, pure blind round, no discussion.

review

Multiple models peer-review the same code or decision. Stress-tested.

brainstorm

Wide ideation across models. Nothing criticised. Option space mapped.

debate

Adversarial, two sides argued out. Strongest case for each.

synthesise

Optional final pass, reconcile every voice into one answer.

You ask

Should this rate-limiter be token-bucket or leaky-bucket?

Twira instantly

  • spawns claude-opus, gemini-3-1-pro, grok-4 in parallel, no model sees the others yet
  • each investigates independently and writes its blind answer
  • all three answers go on the table
  • round-robin discussion, each model responds to the others
  • optional synthesis reconciles the three into one verdict

Three frontier brains. One answer. Documented disagreement where it matters.

Without Twira | With Twira
one model’s blind spots | three models’ overlap
"looks confident to me" | structured disagreement
spawn manually, copy-paste, reconcile | one command
no record of what was debated | display ID per session, chain to next
one provider, one viewpoint | pick the brain mix on purpose
single perspective | blind round + discussion

How the agent uses this

Agent calls `team` via MCP. `mode: "ask" / "review" / "brainstorm" / "debate"`. Set the topic, optionally pass `models` (or use `preset: "pro" / "standard" / "fast"`) and `turns`. Read-only by default, `edit: true` is opt-in. Long-running call; the MCP server streams progress at each phase.
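As a rough illustration, the call described above could be assembled like this. It is a minimal sketch: `tools/call` with `name` and `arguments` is the standard MCP request shape, but the argument keys (`topic`, `models` as a list, and so on) are inferred from this page, and the helper `build_team_call` is hypothetical rather than part of Twira. Treating `models` and `preset` as alternatives is also an assumption.

```python
import json

VALID_MODES = {"ask", "review", "brainstorm", "debate"}


def build_team_call(topic, mode="review", models=None, preset=None,
                    turns=None, edit=False, request_id=1):
    """Build a JSON-RPC 2.0 `tools/call` request for the `team` tool.

    Hypothetical helper: the argument names mirror the options listed on
    this page, but the exact schema is an assumption.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")

    # Read-only unless the caller opts in with edit=True.
    arguments = {"mode": mode, "topic": topic, "edit": edit}

    # Assumption: an explicit model list takes precedence over a preset.
    if models:
        arguments["models"] = list(models)
    elif preset:
        arguments["preset"] = preset

    if turns is not None:
        arguments["turns"] = turns

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "team", "arguments": arguments},
    }


request = build_team_call(
    "Should this rate-limiter be token-bucket or leaky-bucket?",
    models=["claude-opus", "gemini-3-1-pro", "grok-4"],
    turns=2,
)
print(json.dumps(request, indent=2))
```

The sketch only builds the request; sending it and handling the progress notifications is left to whichever MCP client your harness uses.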

When you reach for it

  • A risky architectural decision: get three frontier models to debate before you commit; the case for each side gets sharper in the back-and-forth.
  • A complex bug your current model has been struggling with: bring in two more fresh sessions; they often see solutions the model with its blinkers on has been missing.
  • A code review on a security- or money-sensitive change: peer review across Anthropic + OpenAI + Google catches what one provider’s blind spots miss.
  • A hard question outside your current model’s strength: `ask` Gemini for the algorithmic angle, `ask` Codex for the implementation; both answers come back to your session.
  • A brainstorm on a feature scope: every model generates ideas, nothing criticised; you get the option space mapped before you narrow.
  • A chained workflow: `brainstorm` the options, `follows:` it with `review` on the top three, `follows:` that with `debate` on the final two; one clear decision at the end.

See it work

$ twira team review --models claude-opus,gemini-3-1-pro,grok-4 \
  --topic "Should this rate-limiter be token-bucket or leaky-bucket?" \
  --files src/limiter/* --synthesise
✓ Blind round started · 3 models · claude-opus, gemini-3-1-pro, grok-4 (running in parallel, wall-clock = slowest model)
✓ Blind: claude-opus done (4,217 chars, 4m 12s)
✓ Blind: gemini-3-1-pro done (3,891 chars, 5m 38s)
✓ Blind: grok-4 done (3,540 chars, 6m 21s)
⚠ Models diverged on burst-handling, discussion rounds will run
✓ Discussion round 1 of 2 complete · 4m 47s
✓ Discussion round 2 of 2 complete · 5m 12s
✓ Synthesis: claude-opus reconciled all responses · 1m 58s
✓ Saved to .TwiraTeam/2026-05-17-184230-a3f1-Review.md
Total elapsed: 18m 23s · this session can run longer on harder topics
Display ID: 2026-05-17-184230-a3f1 (use `--follows` for the next session)

Team takes time, and burns real tokens

A multi-model peer review typically runs 15–20 minutes and bills across every provider you invoke (your keys, your bills). It is a forward investment: the time and tokens you spend here save many times that downstream. It is the right call for architectural decisions, security reviews, hard refactors, anything you would otherwise have to back out of. It is not for every line of code.

Peer review is my favourite mode. It is remarkable for solving hard problems, finding bugs, and working through solutions a single model struggles with. You can see each model finding different things, and watch them work together to build something that none of them could do alone.

Chris, Founder

Technical depth, for engineers who want it

In your editor

You’ve done this manually before. Paste the same question into Claude, ChatGPT, and Gemini. Switch between tabs to compare answers. Copy them into a fourth tool for synthesis. It works, for the high-stakes calls where one model isn’t enough, but it’s manual, error-prone, and slow.

What Team does

Team automates the multi-model panel review you would otherwise do by hand. Pick the mode (ask, review, brainstorm, or debate) and the model mix (13 frontier models across Anthropic, OpenAI, Google, xAI; presets `pro` / `standard` / `fast` if you don’t want to specify). The session runs five phases: you set the assignment; each model investigates blind (no peer pressure); all answers go on the table; a round-robin discussion sharpens the response; an optional synthesis reconciles everything into one verdict. Every session writes a markdown + JSON transcript with a display ID, so you can chain the next session against the previous one (`brainstorm` → `review the top three` → `debate the final two`). Read-only by default: agents can read, search, diagnose, but not edit, unless you opt in.

How it actually works

One AI alone can hallucinate. Two AIs notice each other's mistakes. Three AIs notice the mistakes two AIs missed. Four frontier AIs that go off blind, then come back to the table and bounce ideas off each other, produce results that are hard to get any other way. Different models have different training data, different reasoning styles, different blind spots. When their answers converge, you can be far more confident in the result; when they diverge, you have a map of exactly where the hard parts of the decision are. Team is the Twira command that orchestrates that.

How a team session actually works: set the assignment, send the models off blind, bring them back to the table, round-robin until done. Every team session walks the same five phases.

(1) Set the assignment: you give Team a topic (a question, a decision, a piece of code) and optional context (specific files, symbol lookups, search queries, blast-radius output). That context gets bundled into the same prompt every model sees.

(2) Blind investigation: each model goes off into its own room. No model sees any other model's response. Each one investigates independently, using the Twira tools they have access to (Code Read, Search, Diagnose, Lore, etc.), and writes their answer. This is where fresh perspectives come from: a model that has been struggling with the problem alone in your session has its blinkers on; a fresh session with no prior context often sees solutions the original missed.

(3) All on the table: once every model has finished its blind round, every response is put on the table for every other model to see.

(4) Round-robin discussion: now they talk. Each model responds to the others in a round-robin, default 2 turns per model, configurable up or down, even down to 0 if you only want the blind round.

(5) Optional synthesis: set synthesise: true and a final model reconciles every response into one answer. Skip it and you keep every model's individual voice.
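The five phases can be sketched in a few lines. This is an illustrative outline, not Twira's implementation: `run_session` and the `ask_model(model, prompt)` callback are hypothetical names, the prompt wording is invented, and using the first model in the list for synthesis is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def run_session(topic, models, ask_model, turns=2, synthesise=False):
    """Walk the five phases with a caller-supplied ask_model(model, prompt) -> str."""

    def table_text(entries):
        # Render everything said so far as one block of text.
        return "\n\n".join(f"[{who}] {text}" for who, text in entries)

    # Phases 1-2: every model gets the same assignment and answers in
    # parallel; no model sees another model's response.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        answers = list(pool.map(lambda m: ask_model(m, topic), models))
    blind = dict(zip(models, answers))

    # Phase 3: all blind answers go on the table.
    transcript = [(m, blind[m]) for m in models]

    # Phase 4: round-robin discussion; turns=0 keeps only the blind round.
    for _ in range(turns):
        for m in models:
            prompt = f"{topic}\n\nOn the table so far:\n{table_text(transcript)}"
            transcript.append((m, ask_model(m, prompt)))

    # Phase 5: optional synthesis (assumption: the first model reconciles).
    verdict = None
    if synthesise:
        prompt = f"Reconcile these into one answer:\n{table_text(transcript)}"
        verdict = ask_model(models[0], prompt)

    return {"blind": blind, "transcript": transcript, "verdict": verdict}


def fake_model(model, prompt):
    # Stand-in for a real provider call, so the sketch runs offline.
    return f"{model} replied to {len(prompt)} chars"


result = run_session("token-bucket or leaky-bucket?",
                     ["model-a", "model-b", "model-c"],
                     fake_model, turns=2, synthesise=True)
print(len(result["transcript"]), "entries on the table")
```

With three models and two turns, the table holds three blind answers plus six discussion replies; the synthesis verdict is kept separately so every individual voice survives.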

The four modes: pick the one that fits the question. Each mode is the same five-phase machine; what changes is the style of conversation the agents have in the discussion rounds.

Ask: borrow expertise from a specific AI for a single question. Different AIs are stronger in different areas. You are mostly working in one harness (say, Claude Code), but on this one question Gemini's analysis would be sharper, or Codex would investigate it more thoroughly. Ask lets your Claude session pose the question to Gemini or Codex (or any combination) and get their answer back without leaving your session. Pure blind round, no discussion. "What does this expert think, independently of mine?"

Brainstorm: every idea on the table, nothing criticised. Models go wide. They generate ideas, combinations, refinements. Nobody pushes back. The output is a deliberately diverse spread of possibilities, including ones that might not work. Use this when you want the option space mapped, not narrowed.

Peer review: refine, validate, criticise constructively. The opposite of brainstorm. Models stress-test the same idea or piece of code, find flaws, propose improvements, validate strengths. Round-robin discussion sharpens each critique. Use this on code you are about to ship, a decision you are about to commit to, a refactor you have just finished.

Debate: protagonist and antagonist, adversarial. Models argue both sides. One takes a position; another challenges it; the round-robin escalates and refines. By the end you have the strongest case for each side and a much clearer view of where the actual disagreement is. Use this on architectural forks, build-vs-buy, framework choices, anything where reasonable engineers can land in different places.

13 frontier models across 4 providers: you pick the mix. Twira ships with adapters for every major frontier provider so you can mix worldviews on purpose. Anthropic: claude-opus, claude-sonnet, claude-haiku. OpenAI: codex. Google: gemini-pro3, gemini-flash3, gemini-3-1-pro. xAI: grok-3, grok-4, grok-standard, grok-fast, grok-mini, grok-code. Specify exactly which models you want (--models claude-opus,gemini-3-1-pro,grok-4) or use a preset for auto-selection: pro picks the premium tier from each provider, standard picks the workhorse tier, fast picks the speed tier. The point is not "use all 13"; it is "deliberately choose the brain mix this question deserves."

Context loading: every team agent starts on the same page as you. Before the blind round even begins, Team can bundle code context into the prompt every model sees: files (specific paths to include verbatim), symbols (names looked up via the Twira index), search (results from a query you specify), blast_radius (the dependents-and-callers graph for a file). The agents do not start with abstractions; they start with the same concrete code you would put in front of a human reviewer.

Read-only by default: no unwanted file edits. Every team agent is spawned with file-edit permission OFF. They can read, search, diagnose, query, and use every other Twira capability, but they cannot modify files. If you want a team session that actually applies changes, set edit: true explicitly. The conservative default exists because team sessions take time and cost real tokens; the last thing you want is one model rewriting your file while three others are still thinking. Read-only also means a session has no side effects on your files, so you can rerun it against the same inputs and compare the results.

Session chaining: brainstorm → review → debate. Every session writes its full transcript to .TwiraTeam/ (markdown + JSON) with a date-first display ID (e.g. 2026-05-17-184230-a3f1-Brainstorm.md). Pass that display ID as follows: <id> to the next session and the agents start with the full synthesis and findings of the prior session as context. The killer workflow: brainstorm to generate options, peer-review the top three, debate the final two. Three sessions; one chained context; one clear final answer. Alternatively, pass continueFrom: <conversation_id> to resume a paused session exactly where it stopped.
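If you script around the transcripts, the display ID can be recovered from a file name. The sketch below is based only on the example name shown above: the pattern (date, six-digit time read as HHMMSS, short suffix, mode) is an inference from that one example, and `parse_transcript_name` and `latest_display_id` are hypothetical helpers, not Twira functions.

```python
import re
from datetime import datetime

# Assumed layout: YYYY-MM-DD-HHMMSS-<suffix>-<Mode>.md or .json
TRANSCRIPT_NAME = re.compile(
    r"^(?P<id>(?P<stamp>\d{4}-\d{2}-\d{2}-\d{6})-(?P<suffix>[0-9a-z]+))"
    r"-(?P<mode>[A-Za-z]+)\.(?:md|json)$"
)


def parse_transcript_name(filename):
    """Return display ID, start time and mode, or None if the name does not match."""
    match = TRANSCRIPT_NAME.match(filename)
    if match is None:
        return None
    return {
        "display_id": match["id"],
        "started": datetime.strptime(match["stamp"], "%Y-%m-%d-%H%M%S"),
        "mode": match["mode"].lower(),
    }


def latest_display_id(filenames):
    """Pick the most recent session; date-first IDs sort chronologically as text."""
    parsed = [p for p in map(parse_transcript_name, filenames) if p]
    if not parsed:
        return None
    return max(parsed, key=lambda p: p["display_id"])["display_id"]


info = parse_transcript_name("2026-05-17-184230-a3f1-Brainstorm.md")
print(info["display_id"], info["mode"])
```

The value returned by `latest_display_id` is what you would pass as `follows:` to continue the chain.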

Convergence detection: stop early when models already agree. Review and debate sessions detect early convergence by default: after the blind round, if all models' answers are substantively in agreement, the discussion rounds are skipped. The session completes in roughly half the time with the same answer. Set early_convergence: false to force the discussion rounds regardless, or early_convergence: true to turn detection on explicitly.
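The decision logic can be pictured as follows. How Team actually judges "substantively in agreement" is not described here, so this sketch substitutes a crude text-similarity check from Python's standard library; the 0.8 threshold and the function names are assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def has_converged(answers, threshold=0.8):
    """True if every pair of blind answers is at least `threshold` similar.

    String similarity is a stand-in: a real check would compare the
    substance of the answers, not their wording.
    """
    return all(
        SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
        for a, b in combinations(answers, 2)
    )


def discussion_turns(answers, turns=2, early_convergence=True):
    """Skip the discussion rounds when detection is on and the answers agree."""
    if early_convergence and has_converged(answers):
        return 0
    return turns


agreeing = ["Use a token bucket."] * 3
print(discussion_turns(agreeing))                           # detection on
print(discussion_turns(agreeing, early_convergence=False))  # discussion forced
```

Setting `early_convergence=False` corresponds to forcing the discussion rounds even when the blind answers already match.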

It will take time. The quality of the result is the trade. A four-model peer review with two discussion rounds and an optional synthesis is not a fast operation: five to fifteen minutes is typical, longer for deep sessions. Every phase boundary emits a progress notification so you see what is happening in real time: blind round started, model N done, blind complete, discussion round 1 started, and so on. This is by design. "Four senior engineers in a room debating your design" takes longer than one engineer answering off the cuff, and produces a wildly better answer. The product is built around the second outcome.

Recursion guard and cooperative cancellation. Every team-spawned agent has every other Twira tool available except team itself, no team-calls-team recursion. Long sessions can be cancelled gracefully: an MCP notifications/cancelled from the calling agent (or the human via the harness) triggers a SIGTERM-then-SIGKILL on every in-flight subprocess, with the same grace path used for timeouts.
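The SIGTERM-then-SIGKILL pattern mentioned above looks roughly like this with Python's standard `subprocess` module. It is a generic sketch of the technique, not Twira's code: `stop_gracefully` and the five-second default grace period are assumptions, and the sleeping child processes stand in for provider CLIs.

```python
import subprocess
import sys


def stop_gracefully(procs, grace_seconds=5.0):
    """Ask every running process to exit, then force-kill any that linger."""
    # First pass: polite request (SIGTERM on POSIX) to everything still running.
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()

    # Second pass: give each process the grace period, then SIGKILL on POSIX.
    # The waits run one after another, so the worst case is N * grace_seconds.
    for proc in procs:
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()  # reap the process so it does not linger as a zombie


# Stand-ins for in-flight model subprocesses.
procs = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    for _ in range(2)
]
stop_gracefully(procs, grace_seconds=2.0)
print([proc.poll() is not None for proc in procs])
```

Sending the polite signal to every process before waiting on any of them lets the children shut down concurrently instead of one at a time.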

Setup. Team is enabled on every Pro licence. The first run prompts you for any provider CLIs missing on the system (it shells out to claude, codex, gemini, and the xAI client) and writes a short setup note for each.

What it isn’t

  • Team takes time. A four-model peer review with two discussion rounds is typically 5–15 minutes, longer for deep sessions with synthesis. This is by design: it is the cost of getting a peer-reviewed answer instead of a single guess.
  • Read-only by default. Spawned agents can read, search, diagnose, query, but they cannot edit files unless you explicitly pass `edit: true`. The conservative default is what makes Team safe to leave running.
  • Team-spawned agents do not have access to Team themselves, so there is no recursion. Every other Twira tool is available to them.
  • Requires the provider CLIs installed on your machine (`claude`, `codex`, `gemini`, the xAI client). Team shells out to each. The first run prompts you for any missing CLI with a setup note.
  • Pro tier. The orchestration layer is included; the actual token costs are billed by each provider you invoke (your keys, your bills).

One install. Your agent will know the difference in the first session.

$ curl -fsSL twira.com/install.sh | sh