Zup Innovation

Building an Internal
Coding Agent at Zup

Lessons and Open Questions
Gustavo Pinto
Introduction

The gap between prototype and production

LLM-based coding agents can accelerate routine development tasks. But building one that performs well on benchmarks is fundamentally different from deploying one that developers actually use.

Research agents ≠ production agents. Production demands engineering.

Introduction

When engineering challenges go unaddressed

GitHub issue: Critical Bug — Agent executed destructive rm -rf command without safeguards
Real incident: openai/codex#3934
Section 02

Origins of
CodeGen

Origins

From conversational assistant to autonomous agent

StackSpot AI
IDE conversational assistant—developers ask questions, but the tool cannot act on the environment
Agentic Gap
Adding agent capabilities to the enterprise codebase would require changes the platform roadmap could not absorb
Independent POC
Two team members build a CLI-based coding agent outside the main codebase
CodeGen
Production-ready agentic system used daily by real developers at Zup
Section 03

CodeGen
Internals

Architecture

Three-tier architecture

CLI

Node.js client
  • User interaction & local tool execution
  • Executes tools on the developer's machine
  • Supports IDE plugins & VM executors via same protocol
WebSocket

Backend API

FastAPI
  • Auth, routing & task lifecycle (REST)
  • WebSocket for bidirectional tool dispatch
  • SSE for web portal read-only streams
Orchestration

Maestro

Orchestration engine
  • Bootstrap: collects OS, git history, project structure
  • Sends system prompt + tool manifest to LLM
  • Iterative loop: tool call → execute → feed back
ReAct loop: Reason → Act (tool call) → Observe (result) → iterate until the final response
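The loop above can be sketched in a few lines. `call_llm` and `run_tool` are hypothetical stand-ins for the model API and the client-side tool executor; the names and message shapes are illustrative, not CodeGen's actual interfaces.

```python
def agentic_loop(call_llm, run_tool, user_prompt, max_steps=10):
    """Minimal Reason -> Act -> Observe loop (illustrative sketch)."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        reply = call_llm(messages)               # Reason: the model thinks
        if reply.get("tool_call") is None:       # no tool requested
            return reply["content"]              # final answer ends the loop
        call = reply["tool_call"]
        result = run_tool(call["name"], call["args"])   # Act: execute the tool
        messages.append({"role": "assistant", "content": "", "tool_call": call})
        messages.append({"role": "tool", "content": result})  # Observe: feed back
    raise RuntimeError("agentic loop exceeded max_steps without finishing")
```

The stop criterion (a reply with no tool call) and the step cap are exactly the kind of execution semantics the Maestro component owns explicitly.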
Architecture

Tool manifest

Tool design quality is a first-class determinant of agent effectiveness.

read

Retrieves file contents. Enforces a read-before-edit policy to prevent stale-context errors and hallucinated edits.

edit

Targeted string replacement rather than full-file rewriting—mitigates LLM truncation failures on large files.

shell

Executes terminal commands subject to multiple guardrail layers: command blocking, human approval mode, full audit logging.
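The three guardrail layers for the shell tool can be sketched as sequential checks. `BLOCKED`, `guarded_shell`, and the callbacks are illustrative assumptions, not CodeGen's real blocklist or API.

```python
import shlex

BLOCKED = {"rm", "mkfs", "dd"}  # illustrative blocklist, not CodeGen's actual one

def guarded_shell(command, execute, approve, audit_log, approval_mode=True):
    """Run `command` through three layers: command blocking, approval, audit log."""
    argv = shlex.split(command)
    # Layer 1: command blocking
    if argv and argv[0] in BLOCKED:
        audit_log.append(("blocked", command))
        raise PermissionError(f"command blocked: {argv[0]}")
    # Layer 2: human approval mode
    if approval_mode and not approve(command):
        audit_log.append(("denied", command))
        return None
    # Layer 3: full audit logging, recorded before execution
    audit_log.append(("executed", command))
    return execute(command)
```

Every path through the function writes to the audit log, so denied and blocked attempts are just as visible as executed ones.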

Example: tool specification
"WriteFile": {
  "name": "WriteFile",
  "description": "Writes content to a specified file, creating it if it doesn't exist and overwriting if it does.",
  "parameters": {
    "type": "object",
    "properties": {
      "file_path": {
        "type": "string",
        "description": "The path to the file."
      },
      "content": {
        "type": "string",
        "description": "The content to write."
      }
    },
    "required": ["file_path", "content"]
  }
}
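A backend can check a model's tool call against this schema before dispatching it. The sketch below is a minimal hand-rolled validator over the `required` and `type` fields, an illustration rather than CodeGen's actual validation code.

```python
# Abbreviated copy of the WriteFile manifest entry shown above.
WRITE_FILE_SPEC = {
    "name": "WriteFile",
    "parameters": {
        "type": "object",
        "properties": {
            "file_path": {"type": "string"},
            "content": {"type": "string"},
        },
        "required": ["file_path", "content"],
    },
}

def validate_tool_call(spec, args):
    """Return an error message for a bad call, or None if the call is valid."""
    params = spec["parameters"]
    missing = [k for k in params["required"] if k not in args]
    if missing:
        return f"missing required parameters: {missing}"
    for key, value in args.items():
        expected = params["properties"].get(key, {}).get("type")
        if expected == "string" and not isinstance(value, str):
            return f"parameter {key!r} must be a string"
    return None
```

Returning a message instead of raising lets the orchestrator feed the error back into the loop so the model can retry with corrected arguments.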
Section 04

Design
Decisions

Design Decisions

14 decisions across three dimensions

Architecture & Framework

  1. LangChain's chain model was inadequate for the agentic loop
  2. Manual implementation gave more control than early framework adoption
  3. Transitioning toward modern orchestration as frameworks converged
  4. FastAPI chosen for native async support
  5. Session-scoped memory already delivers substantial UX value
  6. Reasoning delegated to the LLM, not hand-coded in the orchestrator
  7. Strong model capabilities don't eliminate need for orchestration

Tool Design & Safety

  1. Tool design quality proved more impactful than prompt-only tuning
  2. Edit tool uses targeted string replacement, not full-file rewrites
  3. Read-before-edit policy prevents hallucinated edits
  4. Shell tool requires multiple guardrail layers
  5. Policy consistency across tools is required for effective safety
  6. Strict code quality enforcement as additional safety net

Human Oversight & Adoption

  1. Approval mode serves as trust-calibration during onboarding
  2. Separating planning from execution addresses single-pass limitations
  3. Progressive deployment mirrors individual trust-building patterns
  4. Most decisions involved balancing competing concerns
Architecture & Framework

LangChain's chain model was inadequate for the agentic loop

At project inception, the team experimented with LangChain as an orchestration framework. Its early abstractions were designed around linear chains—a unidirectional pipeline where each step feeds into the next.

Input
PromptTemplate
LLM
OutputParser
Output
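The unidirectional pipeline above amounts to a straight function composition; `llm` here is a hypothetical model call used for illustration.

```python
def chain(llm, user_input):
    # Each step feeds into the next, exactly once.
    prompt = f"Answer the question: {user_input}"  # PromptTemplate
    raw = llm(prompt)                              # LLM
    return raw.strip()                             # OutputParser -> Output
```

Nothing in this shape lets a tool result flow back into the model: the output of the last step simply leaves the pipeline.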
Architecture & Framework

LangChain's chain model was inadequate for the agentic loop

The agentic coding assistant required a cyclical interaction pattern: the model requests a tool, the client executes it, and the result is fed back into the model repeatedly until the task is complete.

START: user prompt
Agentic loop:
  1. Reason: LLM thinks
  2. Act: tool call
  3. Observe: tool result
  loop: feed the result back until done
END: final answer

This fundamental mismatch forced the team to work around the framework rather than with it.

Architecture & Framework

Manual implementation gave more control than early framework adoption

We adopted but later abandoned LangChain.
The team implemented the agentic loop directly—the Maestro component. This gave explicit control over stop criteria, tool dispatch, WebSocket communication, and error propagation.

For novel or poorly understood interaction patterns, manual implementation accelerates learning and provides clearer ownership of execution semantics—advantages that outweigh framework convenience during early project phases.
Architecture & Framework

Transitioning toward modern orchestration as frameworks converged

As the agentic pattern became widespread, frameworks evolved to support it natively. LangChain introduced LangGraph, offering first-class support for cyclical tool-calling loops.

When the team evaluated it, they found the design closely resembled what they had already built by hand. This convergence validated the original choices and kept the transition cost low.

Building manually first and adopting frameworks later—once they mature to match actual requirements—avoids both premature abstraction and long-term maintenance burden.
Architecture & Framework

Creating agentic apps in 2026

Tool Design & Safety

Tool design > prompt design

Tool specification is a first-class engineering concern—not an afterthought.

Tool Design & Safety

Shell safety & cross-tool policy · guardrails for shell

• Block tool: block specific commands
• Approval mode
• Log every command

⚠️ PROBLEM Blocking direct file deletion is useless if shell remains unrestricted—a shell command achieves the same effect.
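One way to address this is to route every tool with overlapping capabilities through a single shared policy check, so the shell cannot do what a direct delete cannot. The paths, names, and `rm` parsing below are illustrative assumptions, not CodeGen's actual policy.

```python
PROTECTED_PATHS = ("/", "/etc")  # illustrative protected roots

def policy_allows_delete(path):
    """Single policy shared by every tool capable of deleting files."""
    return not any(path == p or path.startswith(p + "/") for p in PROTECTED_PATHS)

def delete_tool(path):
    if not policy_allows_delete(path):
        raise PermissionError(f"policy: refusing to delete {path}")
    ...  # perform the deletion (omitted)

def shell_tool(command):
    # The shell tool consults the SAME policy instead of its own ad-hoc rules,
    # so `rm /etc/passwd` is refused just like a direct delete would be.
    argv = command.split()
    if argv and argv[0] == "rm":
        targets = [a for a in argv[1:] if not a.startswith("-")]
        if not all(policy_allows_delete(t) for t in targets):
            raise PermissionError(f"policy: refusing shell command {command!r}")
    ...  # execute the command (omitted)
```

The safety property lives in one function rather than being duplicated, and inevitably drifting, across tools.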

Tool Design & Safety

The 3 layers of tool safety

Human Oversight

Trust is earned, not mandated

APPROVAL MODE

Every file edit and shell command requires explicit human confirmation.

AUTONOMOUS MODE --yolo

Agent operates with minimal interruption.

Developers begin in approval mode, then organically migrate to autonomous mode as confidence grows.

Human Oversight

Balancing competing concerns

Throughout CodeGen's development, the team repeatedly faced decisions where improving one dimension came at the cost of another:

Safety via approval modes
vs.
added latency & friction
Session memory for UX
vs.
infrastructure complexity
Reasoning delegated to LLM
vs.
less deterministic control
Section 06

Open
Questions

Open Questions

For researchers and tool builders

1

Is there a methodology for designing tools that LLMs invoke correctly?

2

Where should reasoning live—in the model or in the orchestrator?

3

How do we enforce safety when tools have overlapping capabilities?

4

How do agents earn autonomy beyond a binary on/off switch?

5

What should agents remember—and what should they forget?

6

How do we QA code that agents wrote?

Thanks, G.