
Opinions: Contrarian Views on AI Agents

These views come from three months of real operations. Not theory — conclusions hammered out by reality. You may disagree. That’s fine.


Opinion 1: Build an organization, not a product

Most people use AI agents to build products: an app, a website, a bot.

I’m not building a product. I’m building an organization.

A 24/7 digital company staffed by AI agents: the COO runs operations, Coder is the engineer, Research is the analyst, Marketing handles marketing, QA does QA. Each has a role, processes, and performance standards.

Why organization over product?

  • Products become obsolete. The app you build today might be irrelevant in 6 months.
  • Organizations keep producing. The same agent team can build unlimited products.
  • Organizations self-improve. Every failure gets written into SOUL.md. Product bug fixes are one-time. Organizational mechanism improvements are permanent.

Spend 80% of time building organization (processes, monitoring, communication, QA). 20% on specific products. Not the reverse.

Comparison: Frameworks like CrewAI and AutoGen provide multi-agent collaboration scaffolding, but they lean toward “use agents to build a product.” Our approach is closer to “use agents to build an organization.” Not mutually exclusive, but different starting points.


Opinion 2: SOUL.md is the most important code


Agent behavior isn’t determined by code — it’s determined by system prompts. SOUL.md is the agent’s “operating system.”

Well-written SOUL.md:

  • Clear role boundaries (what you do, what you don’t)
  • Specific operational rules (not “be safe,” but “never write API keys into chat”)
  • Failure behavior constraints (“stop retrying after 2 failures”)
  • Communication protocols (“must @mention COO after completion”)

Poorly-written SOUL.md:

  • “You are a helpful AI assistant”
  • No constraints, relies entirely on agent judgment
  • Rules too long, agent ignores half

Rule of thumb: Keep SOUL.md under 200 lines. Beyond 200, agents start ignoring later rules. Put critical rules first, mark hard constraints with ⛔.

```markdown
# Good SOUL.md structure
## 1. Identity (first 10 lines)
Who you are, your role, core responsibilities
## 2. Hard constraints (first 50 lines)
⛔-marked rules that must never be violated
Put these first — model compliance is highest for the first 50 lines
## 3. Workflow (lines 50-150)
Standard operating procedures, communication protocols, reporting formats
## 4. Context (lines 150-200)
Current project state, known issues, notes
```
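The layout above can be checked mechanically. Here is a minimal lint sketch, assuming a local `SOUL.md` file; the function name and thresholds are my own, chosen to mirror the rule of thumb above:

```python
from pathlib import Path

MAX_LINES = 200            # rule of thumb: agents start ignoring rules past ~200 lines
HARD_CONSTRAINT_ZONE = 50  # ⛔ rules should sit in the first 50 lines

def lint_soul(path: str = "SOUL.md") -> list[str]:
    """Return a list of warnings about SOUL.md structure."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    warnings = []
    if len(lines) > MAX_LINES:
        warnings.append(f"{len(lines)} lines (> {MAX_LINES}); trim later sections")
    # Hard constraints placed late are the ones most likely to be ignored
    late_hard = [i + 1 for i, line in enumerate(lines)
                 if "⛔" in line and i >= HARD_CONSTRAINT_ZONE]
    if late_hard:
        warnings.append(f"⛔ rules found after line {HARD_CONSTRAINT_ZONE}: {late_hard}")
    return warnings
```

Running this in CI (or a pre-commit hook) keeps SOUL.md from drifting past the limits as rules accumulate.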

Further reading: Anthropic publishes detailed system prompt best practices, and Cursor Rules patterns offer many transferable ideas.


Opinion 3: Agents don’t need long-term memory


Intuition says agents should remember all history. In practice, this causes context overflow and hallucination.

My approach: short-term memory + on-demand search:

  • Today’s log auto-loaded (memory/YYYY-MM-DD.md)
  • Historical info retrieved via semantic search as needed
  • Long-term decisions written to MEMORY.md (manually maintained, under 200 lines)

This is 10x more effective than “remember everything.” Agent context windows are finite — filling them with irrelevant history = degrading current task quality.

Analogy: You don’t need to remember what you ate every day last year. You need to know what you’re allergic to. The former is historical data; the latter is decision rules. Same for agents.

Many people research better agent memory via vector databases (embedding → store → retrieve). We tried it but found:

  1. Unreliable recall — semantic search recall rate isn’t high enough for critical info
  2. Noisy results — “relevant” content mixed with irrelevant info, diluting useful context
  3. High maintenance — requires additional infrastructure (vector DB, embedding API)
  4. Non-deterministic — you can’t be sure what the agent will retrieve each time

Our alternative:

  • SOUL.md + MEMORY.md = deterministic documents, loaded every session, 100% reliable
  • Daily logs = short-term memory, only today’s, not drowned by history
  • When historical info needed = human specifies (“refer to memory/2026-02-08.md”)

Detailed memory configs: See Workflows — Memory Management


Opinion 4: Failure is the system’s most valuable input


Every failure should become a permanent mechanism. Not “be more careful next time” — write it as a rule, script, or constraint.

Our process:

  1. Failure happens
  2. Ask the agent: “Walk me through your reasoning”
  3. Identify root cause (lost context? wrong tool? bad assumption?)
  4. Implement prevention: edit SOUL.md, add memory note, create new check
  5. Log to memory/improvements-log.md
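Step 5 can be as simple as appending a structured entry to the log file. A minimal sketch; the function name and field labels are my own, not a fixed schema:

```python
from datetime import date
from pathlib import Path

def log_improvement(failure: str, root_cause: str, new_rule: str,
                    log_path: str = "memory/improvements-log.md") -> None:
    """Append one failure -> mechanism entry to the improvements log."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = (
        f"## {date.today():%Y-%m-%d}: {failure}\n"
        f"- Root cause: {root_cause}\n"
        f"- New rule: {new_rule}\n\n"
    )
    # Append-only: the log is a permanent record, never rewritten
    with path.open("a", encoding="utf-8") as f:
        f.write(entry)
```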

Hard rule: never just retry. Always add a guardrail first.

This is Harness Engineering — every failure makes the system stronger. After months, most common failure patterns are covered by rules. Agents become more reliable not because models improved, but because constraints became more comprehensive.

| Failure | Root cause | New rule |
| --- | --- | --- |
| Shipped broken product | No QA step | ⛔ QA Gate — QA must PASS |
| Context froze COO | No context monitoring | ⛔ Auto-compact at 60% |
| API quota burned overnight | Infinite retries | ⛔ Stop after 2 failures |
| Agent silent for 6 hours | No timeout mechanism | 30-min task timeout alert |
| @mention not working | Format/config errors | End-to-end communication test checklist |
| SOUL.md update not taking effect | Session caching | Must restart gateway after config changes |
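Some of these rules can be enforced in the harness itself, not just in the prompt. A sketch of the "stop after 2 failures" guardrail; `run_with_guardrail` and the `escalate` callback are placeholder names of my own:

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

MAX_ATTEMPTS = 2  # ⛔ stop after 2 failures; escalate instead of burning quota

def run_with_guardrail(task: Callable[[], T],
                       escalate: Callable[[Exception], None]) -> Optional[T]:
    """Run a task at most MAX_ATTEMPTS times, then escalate to a human."""
    last_error: Optional[Exception] = None
    for _attempt in range(MAX_ATTEMPTS):
        try:
            return task()
        except Exception as e:  # placeholder: narrow this in real code
            last_error = e
    escalate(last_error)  # never retry forever; surface the failure instead
    return None
```

The point is that the constraint lives in code, so it holds even when the agent forgets its instructions.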

After three months: ~30% of SOUL.md rules were distilled from failures. These rules are worth more than any individual task output.


Opinion 5: Human attention is the real bottleneck


With 5 agents running simultaneously, the bottleneck isn’t compute, API quota, or model capability. It’s your attention.

5 agents producing simultaneously → you need to review 5 outputs → each needs quality judgment → each judgment needs context → your brain is single-threaded.

This is why QA Gate isn’t wasted time — it’s an attention lever. QA does first-pass review; you only need to see PASS/FAIL and a summary.

Deeper insight: As agent count grows, you don’t need more agents — you need better filtering. Information must be tiered. Only things requiring human decisions reach you. Everything else, agents handle.

  1. COO is the only interface — human only talks to COO. COO handles all dispatch and aggregation.
  2. Agents shouldn’t ask questions — if the brief is good, no follow-up needed. If follow-up is needed, the brief was bad.
  3. Automated decisions — P2/P3 tasks, agents decide execution. Only P0/P1 need human input.
  4. Batch reporting — don’t report after every task. Batch 3-5 together.
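The routing and batching rules above can be sketched as a small triage layer. The priority labels P0-P3 follow the text; the class name and batch size of 3 are assumptions of mine:

```python
from dataclasses import dataclass, field

BATCH_SIZE = 3  # batch 3-5 reports together before surfacing them

@dataclass
class Triage:
    """Route task results: P0/P1 reach the human now, the rest are batched."""
    pending: list = field(default_factory=list)

    def report(self, priority: str, summary: str) -> list:
        """Return messages the human should see right now (possibly none)."""
        if priority in ("P0", "P1"):
            return [f"[{priority}] {summary}"]  # needs a human decision
        self.pending.append(f"[{priority}] {summary}")  # agents decide; queue it
        if len(self.pending) >= BATCH_SIZE:
            batch, self.pending = self.pending, []
            return batch
        return []
```

In practice this layer would sit in the COO agent, since it is the only interface the human talks to.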

Goal: Human spends 30 minutes/day managing agents. Rest on things only humans can do. Not there yet. Working toward it.


Opinion 6: Speed is worthless, reliability is everything


A system that gives 80% results every time is far more valuable than one that gives 100% sometimes and 0% other times.

The former you can trust, delegate to, go to sleep. The latter you must watch, always ready to firefight.

This is why I spent enormous time on QA, monitoring, and constraints — it looks "slow," but it's building trust. When you trust the system, you can truly let go, and once you do, its output far exceeds what you'd get by watching it.

Analogy: You wouldn’t give money to a fund manager who sometimes makes 100% and sometimes loses everything. You’d pick the one with steady 15% annual returns. Agent systems are the same.


The common theme across all six opinions: AI agents’ value isn’t how smart they are — it’s how much you can trust them. Trust comes from reliability. Reliability comes from constraints and monitoring.

Being smart is the model company’s job. Your job is building systems that make smart reliable.

If you have different experiences and opinions — open a GitHub Issue for discussion. Multi-agent systems are still early stage. We need more real cases to validate or challenge these views.


More practical content: Architecture | Debug Playbook | Workflows