15 Nov 2024 ~ 6 min read

HoloDeck Part 3: How I'm Approaching Agent Development


In Part 1, I talked about what feels broken in agent development. In Part 2, I looked at what’s out there. Now let me walk through what I’m building.


This is Part 3 of a 3-Part Series

  1. Why It Feels Broken - What’s wrong with agent development
  2. What’s Out There - The current landscape
  3. How HoloDeck Works (You are here)

The Core Idea

The insight that got me started: agents are systems with measurable components that can be optimized systematically. We did this for ML. Why not agents?

The Analogy

Traditional ML → Agent Engineering:

  • NN Architecture → Agent Artifacts (prompts, instructions, tools, memory)
  • Loss Function → Evaluators (NLP metrics, LLM-as-judge, custom scoring)
  • Hyperparameters → Configuration (temperature, top_p, max_tokens, model)
  • Training Loop → Agent Execution Framework
  • Evaluation Metrics → Agent Benchmarks & Test Suites
  • Model Checkpoints → Agent Versions & Snapshots

ML engineers don’t manually tweak neural network weights. So why are we manually tweaking agent behavior? We should be able to:

  • Version our artifacts - Track which prompts, tools, and instructions we’re using
  • Measure systematically - Define evaluators that quantify agent performance
  • Optimize through configuration - Run experiments across temperature, top_p, context length, tool selection (see the sketch after this list)
  • Test rigorously - Benchmark against baselines, compare variants, ship only what passes
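To make that concrete, here's a rough sketch of what "optimize through configuration" could look like: two variants of the same agent that differ only in sampling parameters, run against the same tests and compared. This is purely illustrative; the variants structure below is my own invention, not actual HoloDeck syntax.

# Illustrative only: comparing two configuration variants of one agent.
# The "variants" structure is a hypothetical sketch, not HoloDeck's schema.
variants:
  - name: "baseline"
    model:
      name: "gpt-4o-mini"
      temperature: 0.7
      top_p: 1.0
  - name: "low-temp"
    model:
      name: "gpt-4o-mini"
      temperature: 0.2
      top_p: 0.9
# Run both through the same test suite and keep whichever scores better.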

That’s the approach I’m taking.


How HoloDeck Works

Three design principles:

  1. Configuration-First - Pure YAML defines agents, not code
  2. Measurement-Driven - Evaluation baked in from the start
  3. CI/CD Native - Agents deploy like code

Architecture

┌──────────────────────────────────────────────────┐
│                HOLODECK PLATFORM                 │
└──────────────────────────────────────────────────┘
                           │
        ┌──────────────────┼─────────────────┐
        │                  │                 │
┌───────┴──────┐  ┌────────┴─────┐  ┌────────┴─────┐
│    Agent     │  │  Evaluation  │  │  Deployment  │
│    Engine    │  │  Framework   │  │    Engine    │
└───────┬──────┘  └────────┬─────┘  └────────┬─────┘
        │                  │                 │
        ├─ LLM Providers   ├─ AI Metrics     ├─ FastAPI
        ├─ Tool System     ├─ NLP Metrics    ├─ Docker
        ├─ Memory          ├─ Custom Evals   ├─ Cloud Deploy
        └─ Vector Stores   └─ Reporting      └─ Monitoring

Config-First Design

You define your entire agent in YAML. Here’s a simple example:

name: "My First Agent"
description: "A helpful AI assistant"
model:
  provider: "openai"
  name: "gpt-4o-mini"
  temperature: 0.7
  max_tokens: 1000
instructions:
  inline: |
    You are a helpful AI assistant.
    Answer questions accurately and concisely.

No Python. No custom code. You define what your agent does, HoloDeck handles how it runs.

Then interact via the CLI:

# Initialize a new project
holodeck init my-chatbot

# Edit your agent configuration
# (customize agent.yaml as needed)

# Chat with your agent interactively
holodeck chat agent.yaml

# Run automated tests
holodeck test agent.yaml

# Deploy as a local API
holodeck deploy agent.yaml --port 8000

Pretty minimal. The docs cover more complex setups—tools, memory, evaluators.
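For a flavor of what those richer setups might look like, here's a hedged sketch of a config that attaches a vector-store tool and memory to the same kind of agent. The tools and memory field names are my assumptions based on the features listed in this post, so check the docs for the real schema.

name: "Support Agent"
description: "Answers questions from a product knowledge base"
model:
  provider: "openai"
  name: "gpt-4o-mini"
  temperature: 0.3
instructions:
  inline: |
    Answer customer questions using the knowledge base.
    Cite the source document when you can.
# The sections below are illustrative; field names are assumptions,
# not copied from the HoloDeck docs.
tools:
  - type: "vector_store"
    name: "kb_search"
    description: "Semantic search over the product knowledge base"
memory:
  enabled: true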


When You Need Code

YAML isn’t everything. For programmatic test execution, dynamic configuration, or complex workflows, there’s an SDK:

from holodeck.config.loader import ConfigLoader
from holodeck.lib.test_runner.executor import TestExecutor
import os

# Load configuration with environment variable support
os.environ["OPENAI_API_KEY"] = "sk-..."
loader = ConfigLoader()
config = loader.load("agent.yaml")

# Run tests programmatically
executor = TestExecutor()
results = executor.run_tests(config)

# Access detailed results with metrics
for test_result in results.test_results:
    print(f"Test: {test_result.test_name}")
    print(f"Status: {test_result.status}")
    print(f"Metrics: {test_result.metrics}")

Start with YAML, drop into code when you need to. The SDK gives you config loading with environment-variable support, programmatic test execution, and structured access to per-test results and metrics.


DevOps Integration

Agents should work like software:

Version Control - Agent configs are versioned. Track changes, rollback if needed.

Testing Pipeline - Run agents through test suites. Compare across versions.

holodeck test agents/customer_support.yaml
holodeck deploy agents/ --env staging --monitor

Monitoring - OpenTelemetry integration following GenAI Semantic Conventions:

  • Standard trace, metric, and log collection
  • GenAI attributes: gen_ai.system, gen_ai.request.model, gen_ai.usage.*
  • Cost tracking
  • Works with Jaeger, Prometheus, Datadog, Honeycomb, LangSmith
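To make that concrete, a single traced LLM call under the GenAI semantic conventions carries standardized span attributes roughly like the following (the values are made up for illustration):

# Example span attributes under the OpenTelemetry GenAI semantic conventions
# (attribute names come from the conventions; values are illustrative)
gen_ai.system: "openai"
gen_ai.request.model: "gpt-4o-mini"
gen_ai.request.temperature: 0.7
gen_ai.response.model: "gpt-4o-mini-2024-07-18"
gen_ai.usage.input_tokens: 412
gen_ai.usage.output_tokens: 187

Because the attribute names are standardized, any backend that understands the conventions can aggregate token usage across agents without custom parsing.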

CI/CD - Works with GitHub Actions, GitLab CI, Jenkins, whatever you use.

# .github/workflows/deploy-agents.yml
on: [push]
jobs:
  test-agents:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # install the HoloDeck CLI here, then test and deploy
      - run: holodeck test agents/
      - run: holodeck deploy agents/ --env production

What I’m Going For

I got tired of:

  • Needing to know 10 different frameworks to do anything
  • Writing custom orchestration for every project
  • Manual testing and “hope it works” deployments

What I wanted:

  • Accessible - YAML-based, code optional
  • Measurable - Evaluation from day one
  • Reliable - Systematic testing and versioning
  • Portable - Not locked to any cloud

Current State

HoloDeck is in active development. What’s working now:

  • CLI - Commands for init, chat, test, validate, and deploy
  • Interactive Chat - CLI chat with streaming and multimodal support
  • Tools - Vector store integration, MCP (Model Context Protocol) support
  • Test Cases - YAML-based test scenarios, multimodal file support (see the sketch after this list)
  • Evaluations - NLP metrics (F1, ROUGE, BLEU, METEOR) and LLM-as-judge scoring
  • Configuration Management - Environment variable substitution, config merging, validation
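To show the general shape, here's a hedged sketch of a YAML test case scored with an NLP metric and an LLM-as-judge check. The key names are my approximation, so treat it as a sketch and defer to the docs.

# Approximate sketch of a test case; key names may differ from the actual schema.
tests:
  - name: "refund_policy_question"
    input: "What is your refund policy?"
    expected_output: "Refunds are available within 30 days of purchase."
    evaluations:
      - metric: "f1"
        threshold: 0.7
      - metric: "llm_judge"
        criteria: "The answer is accurate, concise, and polite."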

What’s Next

Actively building:

  • API Serving - Deploy agents as REST APIs with FastAPI
  • Observability - OpenTelemetry integration with GenAI semantic conventions

Down the Road

Eventually:

  • Cloud Deployment - Native integration with AWS, GCP, Azure
  • Multi-Agent Orchestration - Advanced patterns for agent communication
  • Cost Analytics - LLM usage tracking and optimization

On Ownership

Here’s something that bothers me: we don’t outsource our entire software development lifecycle to cloud providers. We choose our own version control, CI/CD, testing frameworks, deployment targets. Why should agents be different?

Software Development vs. Agent Development (today):

  • Git (self-hosted or any provider) vs. agent definitions (locked to platform)
  • CI/CD (Jenkins, GitHub Actions, GitLab) vs. testing & validation (vendor-specific)
  • Testing frameworks (Jest, pytest, JUnit) vs. evaluation (proprietary metrics)
  • Deployment (your infrastructure) vs. runtime (cloud-only)

Cloud platforms are convenient, but you give up:

  • Portability - Your agent definitions are tied to proprietary formats
  • Flexibility - Limited to their supported models and patterns
  • Cost control - Usage-based pricing that scales against you
  • Data sovereignty - Your prompts and responses live on their servers

I wanted something different: portable YAML definitions, any LLM (cloud or local), your own evaluation criteria, deploy anywhere, integrate with existing CI/CD. That’s what HoloDeck is trying to be.

Try It Out

HoloDeck focuses on a few things: config-driven agents, systematic testing, and fitting into your existing workflow. Not trying to be everything to everyone.

If any of this resonates, check out the docs.


Series Recap

  1. Part 1: Why It Feels Broken - The problem
  2. Part 2: What’s Out There - The landscape
  3. Part 3: How HoloDeck Works (You are here)


Hi, I'm Justin. I'm a Lead Software & AI Engineer at the Department of Employment and Workplace Relations. You can connect with me on LinkedIn, email me at [email protected], or download my resume.