15 Nov 2024 ~ 7 min read

HoloDeck Part 2: What's Out There for AI Agents


In Part 1, I talked about why agent development feels broken. Before building something myself, I spent time looking at what’s already out there. Here’s what I found.


This is Part 2 of a 3-Part Series

  1. Why It Feels Broken - What’s wrong with agent development
  2. What’s Out There (You are here)
  3. What I’m Building - HoloDeck’s approach and how it works

The Landscape

A bunch of platforms tackle parts of this problem. I wanted something open-source, self-hosted, and config-driven—something that fits into existing CI/CD workflows without vendor lock-in. That shaped how I evaluated these tools.
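To give a flavour of what "config-driven" means in practice, a declarative agent definition might look something like this. This is a hypothetical sketch — the field names are illustrative, not HoloDeck's actual schema (that's Part 3):

```yaml
# Hypothetical agent definition -- field names are illustrative only
agent:
  name: support-triage
  model:
    provider: ollama        # self-hosted model, no cloud dependency
    name: llama3
  instructions: |
    Classify incoming support tickets and route them to the right queue.
  evaluation:
    metrics: [f1, rouge]    # NLP metrics scored against reference outputs
    judge: llm              # LLM-as-judge for open-ended criteria
```

The point isn't the specific fields — it's that the whole definition lives in version control and runs the same way locally and in CI.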


Developer Tools & Frameworks

LangSmith (LangChain Team)

LangSmith is really good at what it does—production observability and tracing for LangChain apps. If you’re already in the LangChain ecosystem and need monitoring, it’s solid.

| Aspect | HoloDeck | LangSmith |
| --- | --- | --- |
| Deployment Model | Self-hosted (open-source) | SaaS only |
| CI/CD Integration | CLI-based, works in any pipeline | API-based, needs cloud connectivity |
| Agent Definition | Pure YAML | Python code + LangChain SDK |
| Primary Focus | Agent experimentation & deployment | Production observability & tracing |
| Agent Orchestration | Multi-agent patterns | Not designed for multi-agent workflows |
| Agent Evaluation | Custom criteria, LLM-as-judge, NLP metrics (BLEU, METEOR, ROUGE, F1) | LLM-as-judge, custom evaluators |
| Self-Hosted LLMs | Native support (Ollama, vLLM, OpenAI-compatible) | Via LangChain integrations |

Different tools for different problems. LangSmith is about monitoring production apps; I was looking for something to help with the build-and-test loop.
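A quick aside on the NLP metrics that keep showing up in these comparisons (BLEU, ROUGE, F1): they're all variations on scoring an agent's output against a reference answer. As a simplified illustration — not any tool's actual implementation — token-level F1 can be computed in a few lines:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 overlap between a prediction and a reference answer,
    in the style of SQuAD-type evaluation."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the agent booked a flight", "the agent booked the flight"))  # ≈ 0.8
```

LLM-as-judge complements metrics like this: string overlap is cheap and deterministic, but it can't score open-ended criteria like tone or helpfulness.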


MLflow GenAI (Databricks)

MLflow is a beast for ML experiment tracking. Their GenAI additions are interesting, but it’s designed for model comparison rather than agent workflows. If you’re already using MLflow for ML ops, the GenAI features slot in nicely.

| Aspect | HoloDeck | MLflow GenAI |
| --- | --- | --- |
| CI/CD Integration | CLI-native | Python SDK + REST API |
| Infrastructure | Lightweight, portable | Heavy (ML tracking server, often Databricks) |
| Agent Support | Purpose-built for agents | Focused on model evaluation |
| Multi-Agent | Native orchestration patterns | Single model/variant comparison |
| Complexity | Minimal (YAML) | Higher (ML engineering mindset) |
| Agent Evaluation | Custom criteria, LLM-as-judge, NLP metrics | LLM-as-judge, custom scorers |

The infrastructure overhead was the main thing that put me off. I wanted something lighter.


Microsoft PromptFlow

PromptFlow has a nice visual approach—you can see your flows as graphs, which is great for understanding what’s happening. But it’s really about individual functions and tools, not full agent orchestration.

| Aspect | HoloDeck | PromptFlow |
| --- | --- | --- |
| CI/CD Integration | CLI-first | Python SDK, Azure-centric |
| Scope | Full agent lifecycle | Individual tools & functions |
| Design Target | Multi-agent workflows | Single tool/AI function development |
| Configuration | Pure YAML | Visual flow graphs + low-code Python |
| Agent Orchestration | Multi-agent patterns | Not designed for multi-agent |
| Self-Hosted | Yes | Limited (designed for Azure) |
| Agent Evaluation | Custom criteria, LLM-as-judge, NLP metrics | LLM-as-judge (GPT-based), F1/BLEU/ROUGE |

If you’re building individual AI functions and live in Azure, PromptFlow makes sense. For agent-level work, it’s not quite there.
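Every tool in this roundup offers some flavour of LLM-as-judge evaluation. Stripped of any particular SDK, the pattern is simple: prompt one model to grade another's output, then parse a score out of the reply. A minimal, framework-free sketch — the prompt wording and 1–5 scale are my own, not any vendor's:

```python
import re

def build_judge_prompt(question: str, answer: str, criteria: str) -> str:
    """Assemble a grading prompt for a judge model (hypothetical wording)."""
    return (
        "You are grading an AI agent's answer.\n"
        f"Criteria: {criteria}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a line of the form 'Score: N/5'."
    )

def parse_judge_score(reply: str):
    """Extract N from a 'Score: N/5' line, or None if the judge went off-script."""
    match = re.search(r"Score:\s*([1-5])/5", reply)
    return int(match.group(1)) if match else None

print(parse_judge_score("The answer is accurate and concise. Score: 4/5"))  # → 4
```

The hard parts the platforms actually compete on — judge model choice, rubric design, handling off-format replies — all live around this core loop.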


The Cloud Providers

All three major clouds have agent platforms now. They’re impressive, but they come with the obvious trade-off: you’re locked into their ecosystem.

Azure AI Foundry (Microsoft)

Azure AI Foundry is Microsoft’s enterprise play. It integrates with the whole Microsoft stack—Teams, Copilot, etc. If you’re already a Microsoft shop, there’s a lot to like.

| Aspect | HoloDeck | Azure AI Foundry |
| --- | --- | --- |
| Deployment Model | Self-hosted (open-source) | SaaS (Azure-dependent) |
| CI/CD Integration | CLI, works anywhere | Azure DevOps/GitHub Actions |
| Agent Definition | Pure YAML | Semantic Kernel SDK + Logic Apps |
| Primary Focus | Experimentation & deployment | Enterprise agent orchestration |
| Agent Orchestration | Multi-agent patterns | Multi-agent via Semantic Kernel |
| Self-Hosted | Yes | No (Azure required) |
| Agent Evaluation | Custom criteria, LLM-as-judge, NLP | LLM-as-judge, NLP metrics |

The Semantic Kernel framework is interesting, but the Azure dependency is real.


Amazon Bedrock AgentCore (AWS)

Bedrock AgentCore is AWS’s managed agent service. Good for running agents at scale if you’re already on AWS and using their model offerings.

| Aspect | HoloDeck | Amazon Bedrock AgentCore |
| --- | --- | --- |
| Deployment Model | Self-hosted (open-source) | SaaS (AWS-managed) |
| CI/CD Integration | CLI, works anywhere | AWS CodePipeline/API-based |
| Agent Definition | Pure YAML | Code (SDK + LangGraph, CrewAI, etc.) |
| Primary Focus | Experimentation & deployment | Enterprise agent operations at scale |
| Agent Orchestration | Multi-agent patterns | Multi-agent collaboration (supervisor modes) |
| Self-Hosted | Yes | No (AWS required) |
| Agent Evaluation | Custom criteria, LLM-as-judge, NLP | LLM-as-judge, custom metrics, RAG eval |
| Self-Hosted LLMs | Native support (Ollama, vLLM) | Bedrock models only |

If you want to use local models or run outside AWS, this isn’t really an option.


Vertex AI Agent Engine (Google Cloud)

Vertex AI Agent Engine is Google's entry into the agent space. The A2A protocol for multi-agent communication is interesting, but like the others, you're tied to GCP.

| Aspect | HoloDeck | Vertex AI Agent Engine |
| --- | --- | --- |
| Deployment Model | Self-hosted (open-source) | SaaS (GCP-managed) |
| CI/CD Integration | CLI, works anywhere | Cloud Build/GitHub Actions |
| Agent Definition | Pure YAML | Code (ADK, LangChain, LangGraph) |
| Primary Focus | Experimentation & deployment | Production agent runtime |
| Agent Orchestration | Multi-agent patterns | Multi-agent via A2A protocol |
| Self-Hosted | Yes | No (GCP required) |
| Agent Evaluation | Custom criteria, LLM-as-judge, NLP | LLM-as-judge (Gemini), ROUGE/BLEU |
| Self-Hosted LLMs | Native support (Ollama, vLLM) | vLLM in Model Garden (complex setup) |

Similar story—great if you’re committed to GCP, but not portable.


What’s Missing

After looking at all of these, here’s what I couldn’t find:

  • Self-hosted and cloud-agnostic - Everything is either SaaS or tied to a specific cloud
  • Declarative agent definition - Most require SDK code, not just config
  • Vendor-neutral CI/CD - The integrations assume you’re using their ecosystem
  • Testing + evaluation + deployment in one place - Usually you’re stitching together multiple tools
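For that last point, the kind of pipeline I had in mind looks something like this — a hypothetical GitHub Actions sketch, where the `holodeck` subcommands are illustrative placeholders, not a documented CLI:

```yaml
# Hypothetical CI job -- the holodeck subcommands are illustrative only
jobs:
  agent-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: holodeck test agents/support-triage.yaml    # run test cases
      - run: holodeck eval agents/support-triage.yaml    # score against metrics
      - run: holodeck deploy agents/support-triage.yaml  # promote if checks pass
```

Nothing here assumes a particular cloud — the same commands should work on a laptop, a self-hosted runner, or any CI system that can run a binary.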

This is the gap I’m trying to fill with HoloDeck. Not saying it’s better than these tools—they’re solving different problems. But if you care about portability and owning your workflow, there wasn’t much out there.


Quick Reference

| If you need… | Look at… |
| --- | --- |
| Production observability for LangChain | LangSmith |
| ML experiment tracking at scale | MLflow |
| Visual prompt flow design on Azure | PromptFlow |
| Enterprise agents in the Microsoft ecosystem | Azure AI Foundry |
| Managed agents on AWS | Bedrock AgentCore |
| Production runtime on GCP | Vertex AI Agent Engine |
| Self-hosted, config-driven, CI/CD-native | HoloDeck |

Next Up

In Part 3, I’ll walk through how HoloDeck works—the design decisions, the YAML config approach, the SDK, and what’s actually built vs. what’s still on the roadmap.

Continue to Part 3 →



Hi, I'm Justin. I'm a Lead Software & AI Engineer at the Department of Employment and Workplace Relations. You can connect with me on LinkedIn, email me at [email protected], or download my resume.