In Part 1, I talked about what feels broken in agent development. In Part 2, I looked at what’s out there. Now let me walk through what I’m building.
This is Part 3 of a 3-Part Series
- Why It Feels Broken - What’s wrong with agent development
- What’s Out There - The current landscape
- How HoloDeck Works (You are here)
The Core Idea
The insight that got me started: agents are systems with measurable components that can be optimized systematically. We did this for ML. Why not agents?
The Analogy
| Traditional ML | Agent Engineering |
|---|---|
| NN Architecture | Agent Artifacts (prompts, instructions, tools, memory) |
| Loss Function | Evaluators (NLP metrics, LLM-as-judge, custom scoring) |
| Hyperparameters | Configuration (temperature, top_p, max_tokens, model) |
| Training Loop | Agent Execution Framework |
| Evaluation Metrics | Agent Benchmarks & Test Suites |
| Model Checkpoints | Agent Versions & Snapshots |
ML engineers don’t manually tweak neural network weights. So why are we manually tweaking agent behavior? We should be able to:
- Version our artifacts - Track which prompts, tools, and instructions we’re using
- Measure systematically - Define evaluators that quantify agent performance
- Optimize through configuration - Run experiments across temperature, top_p, context length, and tool selection (see the sketch after this list)
- Test rigorously - Benchmark against baselines, compare variants, ship only what passes
That’s the approach I’m taking.
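To make the "optimize through configuration" point concrete, here’s a hypothetical sketch: two variants of the same agent that differ only in sampling settings, scored against the same test suite. The `variants` field is illustrative, not actual HoloDeck schema.

```yaml
# Hypothetical sketch: "variants" is illustrative, not actual HoloDeck schema
name: "Support Agent Experiment"

variants:
  - id: "conservative"
    model: { provider: "openai", name: "gpt-4o-mini", temperature: 0.2 }
  - id: "creative"
    model: { provider: "openai", name: "gpt-4o-mini", temperature: 0.9 }

# Every variant runs the same tests; ship whichever variant scores higher
tests: "tests/support_suite.yaml"
```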
How HoloDeck Works
Three design principles:
- Configuration-First - Pure YAML defines agents, not code
- Measurement-Driven - Evaluation baked in from the start
- CI/CD Native - Agents deploy like code
Architecture
```
┌─────────────────────────────────────────────────────────┐
│                    HOLODECK PLATFORM                     │
└─────────────────────────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│    Agent     │   │  Evaluation  │   │  Deployment  │
│    Engine    │   │  Framework   │   │    Engine    │
└──────────────┘   └──────────────┘   └──────────────┘
        │                  │                  │
        ├─ LLM Providers   ├─ AI Metrics      ├─ FastAPI
        ├─ Tool System     ├─ NLP Metrics     ├─ Docker
        ├─ Memory          ├─ Custom Evals    ├─ Cloud Deploy
        └─ Vector Stores   └─ Reporting       └─ Monitoring
```
Config-First Design
You define your entire agent in YAML. Here’s a simple example:
name: "My First Agent"
description: "A helpful AI assistant"
model:
provider: "openai"
name: "gpt-4o-mini"
temperature: 0.7
max_tokens: 1000
instructions:
inline: |
You are a helpful AI assistant.
Answer questions accurately and concisely.
No Python. No custom code. You define what your agent does; HoloDeck handles how it runs.
Then interact via the CLI:
```bash
# Initialize a new project
holodeck init my-chatbot

# Edit your agent configuration
# (customize agent.yaml as needed)

# Chat with your agent interactively
holodeck chat agent.yaml

# Run automated tests
holodeck test agent.yaml

# Deploy as a local API
holodeck deploy agent.yaml --port 8000
```
Pretty minimal. The docs cover more complex setups: tools, memory, and evaluators.
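As a taste, here’s a hypothetical sketch of a richer config. The field names are illustrative; the docs have the real schema.

```yaml
# Illustrative sketch only; see the docs for the actual schema
tools:
  - type: "vectorstore"      # retrieval over your own documents
    name: "product_docs"
    source: "./docs/"

memory:
  type: "conversation"       # carry context across turns

evaluations:
  - metric: "f1"             # NLP metric
    threshold: 0.7
  - metric: "llm_judge"      # LLM-as-judge scoring
    criteria: "Answer is accurate and concise"
```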
When You Need Code
YAML doesn’t cover everything. For programmatic test execution, dynamic configuration, or complex workflows, there’s a Python SDK:
```python
from holodeck.config.loader import ConfigLoader
from holodeck.lib.test_runner.executor import TestExecutor
import os

# Load configuration with environment variable support
os.environ["OPENAI_API_KEY"] = "sk-..."
loader = ConfigLoader()
config = loader.load("agent.yaml")

# Run tests programmatically
executor = TestExecutor()
results = executor.run_tests(config)

# Access detailed results with metrics
for test_result in results.test_results:
    print(f"Test: {test_result.test_name}")
    print(f"Status: {test_result.status}")
    print(f"Metrics: {test_result.metrics}")
```
Start with YAML, drop into code when you need to. The SDK gives you:
- ConfigLoader - Dynamic configuration loading
- TestExecutor - Test orchestration
- Agent Models - Pydantic validation
- Evaluators - NLP metrics and LLM-as-judge scoring
DevOps Integration
Agents should work like software:
Version Control - Agent configs are versioned. Track changes, rollback if needed.
Testing Pipeline - Run agents through test suites. Compare across versions.
```bash
holodeck test agents/customer_support.yaml
holodeck deploy agents/ --env staging --monitor
```
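Test scenarios themselves are YAML too. A hypothetical sketch of what a suite might contain; the exact field names live in the docs:

```yaml
# Hypothetical test-suite sketch; exact schema may differ
tests:
  - name: "refund_policy"
    input: "What is your refund policy?"
    expected: "Refunds are available within 30 days of purchase."
    evaluations:
      - metric: "f1"
        threshold: 0.7
```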
Monitoring - OpenTelemetry integration following GenAI Semantic Conventions:
- Standard trace, metric, and log collection
- GenAI attributes: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.*` (example below)
- Cost tracking
- Works with Jaeger, Prometheus, Datadog, Honeycomb, LangSmith
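For example, a single LLM call becomes a span carrying attributes like these. The attribute names come from the GenAI semantic conventions; the values are illustrative.

```yaml
# Example span attributes (illustrative values)
gen_ai.system: "openai"
gen_ai.request.model: "gpt-4o-mini"
gen_ai.usage.input_tokens: 412
gen_ai.usage.output_tokens: 96
```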
CI/CD - Works with GitHub Actions, GitLab CI, Jenkins, whatever you use.
```yaml
# .github/workflows/deploy-agents.yml
on: [push]

jobs:
  test-agents:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install holodeck    # package name assumed; install the CLI first
      - run: holodeck test agents/
      - run: holodeck deploy agents/ --env production
```
What I’m Going For
I got tired of:
- Needing to know 10 different frameworks to do anything
- Writing custom orchestration for every project
- Manual testing and “hope it works” deployments
What I wanted:
- Accessible - YAML-based, code optional
- Measurable - Evaluation from day one
- Reliable - Systematic testing and versioning
- Portable - Not locked to any cloud
Current State
HoloDeck is in active development. What’s working now:
- CLI - Commands for init, chat, test, validate, and deploy
- Interactive Chat - CLI chat with streaming and multimodal support
- Tools - Vector store integration, MCP (Model Context Protocol) support
- Test Cases - YAML-based test scenarios, multimodal file support
- Evaluations - NLP metrics (F1, ROUGE, BLEU, METEOR) and LLM-as-judge scoring
- Configuration Management - Environment variable substitution (sketched below), config merging, validation
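The environment-variable substitution is what keeps secrets out of versioned configs. A sketch, with the `${VAR}` syntax assumed:

```yaml
# Illustrative sketch; substitution syntax assumed, check the docs
model:
  provider: "openai"
  name: "gpt-4o-mini"
  api_key: "${OPENAI_API_KEY}"   # resolved from the environment at load time
```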
What’s Next
Actively building:
- API Serving - Deploy agents as REST APIs with FastAPI
- Observability - OpenTelemetry integration with GenAI semantic conventions
Down the Road
Eventually:
- Cloud Deployment - Native integration with AWS, GCP, Azure
- Multi-Agent Orchestration - Advanced patterns for agent communication
- Cost Analytics - LLM usage tracking and optimization
On Ownership
Here’s something that bothers me: we don’t outsource our entire software development lifecycle to cloud providers. We choose our own version control, CI/CD, testing frameworks, deployment targets. Why should agents be different?
| Software Development | Agent Development (today) |
|---|---|
| Git (self-hosted or any provider) | Agent definitions (locked to platform) |
| CI/CD (Jenkins, GitHub Actions, GitLab) | Testing & validation (vendor-specific) |
| Testing frameworks (Jest, pytest, JUnit) | Evaluation (proprietary metrics) |
| Deployment (your infrastructure) | Runtime (cloud-only) |
Cloud platforms are convenient, but you give up:
- Portability - Your agent definitions are tied to proprietary formats
- Flexibility - Limited to their supported models and patterns
- Cost control - Usage-based pricing that scales against you
- Data sovereignty - Your prompts and responses live on their servers
I wanted something different: portable YAML definitions, any LLM (cloud or local), your own evaluation criteria, deploy anywhere, integrate with existing CI/CD. That’s what HoloDeck is trying to be.
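In practice, portability should mean retargeting the same agent definition at a local model by swapping one block. A hypothetical sketch; actual provider support varies, so check the docs:

```yaml
# Hypothetical: point the same agent at a local model
model:
  provider: "ollama"     # illustrative provider name
  name: "llama3.1"
  temperature: 0.7
```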
Try It Out
HoloDeck focuses on a few things: config-driven agents, systematic testing, and fitting into your existing workflow. Not trying to be everything to everyone.
If any of this resonates, check out the docs.
Series Recap
- Part 1: Why It Feels Broken - The problem
- Part 2: What’s Out There - The landscape
- Part 3: How HoloDeck Works (You are here)