In Part 1, I talked about what feels broken in agent development. In Part 2, I looked at what’s out there. Now let me walk through what I’m building.
This is Part 3 of a 3-Part Series
- Why It Feels Broken - What’s wrong with agent development
- What’s Out There - The current landscape
- How HoloDeck Works (You are here)
The Core Idea
The insight that got me started: agents are systems with measurable components that can be optimized systematically. We did this for ML. Why not agents?
The Analogy
| Traditional ML | Agent Engineering |
|---|---|
| NN Architecture | Agent Artifacts (prompts, instructions, tools, memory) |
| Loss Function | Evaluators (NLP metrics, LLM-as-judge, custom scoring) |
| Hyperparameters | Configuration (temperature, top_p, max_tokens, model) |
| Training Loop | Agent Execution Framework |
| Evaluation Metrics | Agent Benchmarks & Test Suites |
| Model Checkpoints | Agent Versions & Snapshots |
ML engineers don’t manually tweak neural network weights. So why are we manually tweaking agent behavior? We should be able to:
- Version our artifacts - Track which prompts, tools, and instructions we’re using
- Measure systematically - Define evaluators that quantify agent performance
- Optimize through configuration - Run experiments across temperature, top_p, context length, and tool selection (see the sketch after this list)
- Test rigorously - Benchmark against baselines, compare variants, ship only what passes
That’s the approach I’m taking.
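To make the "optimize through configuration" point concrete, here’s a hypothetical sketch: two variants of the same agent that differ only in sampling settings, scored against the same test suite. The `variants` field is illustrative, not actual HoloDeck schema.

```yaml
# Hypothetical sketch: "variants" is illustrative, not actual HoloDeck schema
name: "Support Agent Experiment"

variants:
  - id: "conservative"
    model: { provider: "openai", name: "gpt-4o-mini", temperature: 0.2 }
  - id: "creative"
    model: { provider: "openai", name: "gpt-4o-mini", temperature: 0.9 }

# Every variant runs the same tests; ship whichever variant scores higher
tests: "tests/support_suite.yaml"
```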
How HoloDeck Works
Three design principles:
- Configuration-First - Pure YAML defines agents, not code
- Measurement-Driven - Evaluation baked in from the start
- CI/CD Native - Agents deploy like code
Architecture
```
┌─────────────────────────────────────────────────────────┐
│                    HOLODECK PLATFORM                     │
└─────────────────────────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│    Agent     │   │  Evaluation  │   │  Deployment  │
│    Engine    │   │  Framework   │   │    Engine    │
└──────────────┘   └──────────────┘   └──────────────┘
        │                  │                  │
        ├─ LLM Providers   ├─ AI Metrics      ├─ FastAPI
        ├─ Tool System     ├─ NLP Metrics     ├─ Docker
        ├─ Memory          ├─ Custom Evals    ├─ Cloud Deploy
        └─ Vector Stores   └─ Reporting       └─ Monitoring
```
Config-First Design
You define your entire agent in YAML. Here’s a simple example:
name: "My First Agent"
description: "A helpful AI assistant"
model:
provider: "openai"
name: "gpt-4o-mini"
temperature: 0.7
max_tokens: 1000
instructions:
inline: |
You are a helpful AI assistant.
Answer questions accurately and concisely.
No Python. No custom code. You define what your agent does; HoloDeck handles how it runs.
Then interact via the CLI:
```bash
# Initialize a new project
holodeck init my-chatbot

# Edit your agent configuration
# (customize agent.yaml as needed)

# Chat with your agent interactively
holodeck chat agent.yaml

# Run automated tests
holodeck test agent.yaml

# Deploy as a local API
holodeck deploy agent.yaml --port 8000
```
Pretty minimal. The docs cover more complex setups: tools, memory, and evaluators.
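As a taste, here’s a hypothetical sketch of a richer config. The field names are illustrative; the docs have the real schema.

```yaml
# Illustrative sketch only; see the docs for the actual schema
tools:
  - type: "vectorstore"      # retrieval over your own documents
    name: "product_docs"
    source: "./docs/"

memory:
  type: "conversation"       # carry context across turns

evaluations:
  - metric: "f1"             # NLP metric
    threshold: 0.7
  - metric: "llm_judge"      # LLM-as-judge scoring
    criteria: "Answer is accurate and concise"
```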
When You Need Code
YAML doesn’t cover everything. For programmatic test execution, dynamic configuration, or complex workflows, there’s a Python SDK:
```python
from holodeck.config.loader import ConfigLoader
from holodeck.lib.test_runner.executor import TestExecutor
import os

# Load configuration with environment variable support
os.environ["OPENAI_API_KEY"] = "sk-..."
loader = ConfigLoader()
config = loader.load("agent.yaml")

# Run tests programmatically
executor = TestExecutor()
results = executor.run_tests(config)

# Access detailed results with metrics
for test_result in results.test_results:
    print(f"Test: {test_result.test_name}")
    print(f"Status: {test_result.status}")
    print(f"Metrics: {test_result.metrics}")
```
Start with YAML, drop into code when you need to. The SDK gives you:
- ConfigLoader - Dynamic configuration loading
- TestExecutor - Test orchestration
- Agent Models - Pydantic validation
- Evaluators - NLP metrics and LLM-as-judge scoring
DevOps Integration
Agents should work like software:
Version Control - Agent configs are versioned. Track changes, rollback if needed.
Testing Pipeline - Run agents through test suites. Compare across versions.
```bash
holodeck test agents/customer_support.yaml
holodeck deploy agents/ --env staging --monitor
```
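Test scenarios themselves are YAML too. A hypothetical sketch of what a suite might contain; the exact field names live in the docs:

```yaml
# Hypothetical test-suite sketch; exact schema may differ
tests:
  - name: "refund_policy"
    input: "What is your refund policy?"
    expected: "Refunds are available within 30 days of purchase."
    evaluations:
      - metric: "f1"
        threshold: 0.7
```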
Monitoring - OpenTelemetry integration following GenAI Semantic Conventions:
- Standard trace, metric, and log collection
- GenAI attributes: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.*` (example below)
- Cost tracking
- Works with Jaeger, Prometheus, Datadog, Honeycomb, LangSmith
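For example, a single LLM call becomes a span carrying attributes like these. The attribute names come from the GenAI semantic conventions; the values are illustrative.

```yaml
# Example span attributes (illustrative values)
gen_ai.system: "openai"
gen_ai.request.model: "gpt-4o-mini"
gen_ai.usage.input_tokens: 412
gen_ai.usage.output_tokens: 96
```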
CI/CD - Works with GitHub Actions, GitLab CI, Jenkins, whatever you use.
```yaml
# .github/workflows/deploy-agents.yml
on: [push]

jobs:
  test-agents:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install holodeck    # package name assumed; install the CLI first
      - run: holodeck test agents/
      - run: holodeck deploy agents/ --env production
```
What I’m Going For
I got tired of:
- Needing to know 10 different frameworks to do anything
- Writing custom orchestration for every project
- Manual testing and “hope it works” deployments
What I wanted:
- Accessible - YAML-based, code optional
- Measurable - Evaluation from day one
- Reliable - Systematic testing and versioning
- Portable - Not locked to any cloud
Current State
HoloDeck is in active development. What’s working now:
- CLI - Commands for init, chat, test, validate, and deploy
- Interactive Chat - CLI chat with streaming and multimodal support
- Tools - Vector store integration, MCP (Model Context Protocol) support
- Test Cases - YAML-based test scenarios, multimodal file support
- Evaluations - NLP metrics (F1, ROUGE, BLEU, METEOR) and LLM-as-judge scoring
- Configuration Management - Environment variable substitution (sketched below), config merging, validation
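The environment-variable substitution is what keeps secrets out of versioned configs. A sketch, with the `${VAR}` syntax assumed:

```yaml
# Illustrative sketch; substitution syntax assumed, check the docs
model:
  provider: "openai"
  name: "gpt-4o-mini"
  api_key: "${OPENAI_API_KEY}"   # resolved from the environment at load time
```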
What’s Next
Actively building:
- API Serving - Deploy agents as REST APIs with FastAPI
- Observability - OpenTelemetry integration with GenAI semantic conventions
Down the Road
Eventually:
- Cloud Deployment - Native integration with AWS, GCP, Azure
- Multi-Agent Orchestration - Advanced patterns for agent communication
- Cost Analytics - LLM usage tracking and optimization
On Ownership
Here’s something that bothers me: we don’t outsource our entire software development lifecycle to cloud providers. We choose our own version control, CI/CD, testing frameworks, deployment targets. Why should agents be different?
| Software Development | Agent Development (today) |
|---|---|
| Git (self-hosted or any provider) | Agent definitions (locked to platform) |
| CI/CD (Jenkins, GitHub Actions, GitLab) | Testing & validation (vendor-specific) |
| Testing frameworks (Jest, pytest, JUnit) | Evaluation (proprietary metrics) |
| Deployment (your infrastructure) | Runtime (cloud-only) |
Cloud platforms are convenient, but you give up:
- Portability - Your agent definitions are tied to proprietary formats
- Flexibility - Limited to their supported models and patterns
- Cost control - Usage-based pricing that scales against you
- Data sovereignty - Your prompts and responses live on their servers
I wanted something different: portable YAML definitions, any LLM (cloud or local), your own evaluation criteria, deploy anywhere, integrate with existing CI/CD. That’s what HoloDeck is trying to be.
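In practice, portability should mean retargeting the same agent definition at a local model by swapping one block. A hypothetical sketch; actual provider support varies, so check the docs:

```yaml
# Hypothetical: point the same agent at a local model
model:
  provider: "ollama"     # illustrative provider name
  name: "llama3.1"
  temperature: 0.7
```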
Try It Out
HoloDeck focuses on a few things: config-driven agents, systematic testing, and fitting into your existing workflow. Not trying to be everything to everyone.
If any of this resonates, check out the docs.
Series Recap
- Part 1: Why It Feels Broken - The problem
- Part 2: What’s Out There - The landscape
- Part 3: How HoloDeck Works (You are here)