Uri Maayan

Project

Ozer-AI

DAG-based agent orchestration for production LLM workflows

Role
AI Engineer
Stack
Python, FastAPI, LangGraph, PostgreSQL, Redis, Docker
Metrics
Ongoing production deployment; evaluation harness runs 200+ test cases per iteration
  • agents
  • llm
  • orchestration
  • production-ai
  • evaluation

LLM-powered workflows fail in production in ways that are hard to anticipate: agents loop, contradict earlier steps, or confidently produce wrong outputs that downstream nodes treat as ground truth. The bottleneck isn’t building the agent. It’s making it reliable enough to deploy without a human watching every run.

Ozer-AI is a production agent orchestration system built around a DAG execution model and a structured evaluation harness. The DAG makes data flow explicit and debuggable; the harness makes regressions visible before they ship.

What I built

The orchestration layer represents workflows as directed acyclic graphs. Each node is a typed unit — a prompt template, a tool call, a validation step, or a routing decision — with declared inputs, outputs, retry policy, timeout, and fallback behavior. Edges carry typed outputs between nodes.

The engine executes nodes in topological order with configurable concurrency: independent branches run in parallel, dependent nodes block until their inputs resolve, and failed runs checkpoint at the last clean node boundary so restarts don’t replay expensive upstream work.

The evaluation harness ships alongside every workflow. It runs the graph against a curated set of inputs with expected outputs and behavioral assertions — not just “did it return a string” but “did the routing node correctly classify this edge case, did the tool-call node stay within its declared output schema, did the summarizer preserve the key constraint from the input.”
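The shape of such a harness might look like the sketch below. The case names, assertion lambdas, and output keys are all hypothetical; the real test set and assertions are project-specific:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    inputs: dict
    assertions: list[Callable[[dict], bool]]   # behavioral checks on outputs

def run_suite(workflow: Callable[[dict], dict],
              cases: list[EvalCase]) -> dict:
    """Run every case through the workflow and collect (case, assertion)
    pairs that failed; an empty failure list means the suite passed."""
    failures = []
    for case in cases:
        outputs = workflow(case.inputs)
        for i, check in enumerate(case.assertions):
            if not check(outputs):
                failures.append((case.name, i))
    return {"total": len(cases), "failures": failures}

# Assertions target behavior, not just shape (illustrative keys):
cases = [
    EvalCase(
        name="refund-edge-case",
        inputs={"query": "refund for damaged item"},
        assertions=[
            lambda o: o["route"] == "refund_tool",    # routing correctness
            lambda o: isinstance(o["summary"], str),  # schema conformance
            lambda o: "damaged" in o["summary"],      # key constraint preserved
        ],
    ),
]
```

Recording which assertion index failed, not just which case, is what makes a regression report actionable: it points at the node and behavior that broke rather than at the workflow as a whole.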

Hard parts

Routing reliability. LLM-based routing nodes — classifying whether a query goes to tool A or tool B — are the highest-variance component in the graph. Small prompt changes shift the distribution. The harness tracks routing accuracy per node across every iteration; a regression that drops any node below threshold blocks deployment.
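The per-node gate can be sketched as below; the threshold value and node names are illustrative assumptions, not the project's actual numbers:

```python
# Hypothetical deployment gate: per-node routing accuracy vs. a threshold.
ROUTING_THRESHOLD = 0.95   # illustrative value

def routing_accuracy(predictions: list[str], expected: list[str]) -> float:
    """Fraction of routing decisions that match the expected label."""
    correct = sum(p == e for p, e in zip(predictions, expected))
    return correct / len(expected)

def gate_deployment(
    per_node_results: dict[str, tuple[list[str], list[str]]],
) -> list[str]:
    """Return the routing nodes whose accuracy fell below threshold on
    this iteration; a non-empty list blocks the deploy."""
    return [
        node for node, (pred, exp) in per_node_results.items()
        if routing_accuracy(pred, exp) < ROUTING_THRESHOLD
    ]
```

Gating per node rather than on aggregate accuracy matters: a graph-wide average can stay flat while one routing node quietly degrades on the edge cases it exists to handle.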

Stateful restarts. Checkpointing at node boundaries sounds simple but requires careful handling of side-effecting nodes (tool calls that write state, API calls with rate limits). The engine marks side-effecting nodes as non-idempotent; on restart, it replays from the last idempotent checkpoint before the failed node rather than from the node itself.
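One way the restart-point selection might look, assuming a linear path of node names and a set of nodes marked non-idempotent (checkpoints taken only at idempotent boundaries); a sketch, not the engine's actual logic:

```python
def restart_point(path: list[str], failed: str,
                  non_idempotent: set[str]) -> str:
    """Walk back from the failed node to the most recent node that is
    safe to re-execute; replay resumes from that checkpoint, with
    everything before it restored from checkpointed state."""
    idx = path.index(failed)
    for i in range(idx - 1, -1, -1):
        if path[i] not in non_idempotent:   # idempotent boundary found
            return path[i]
    return path[0]   # no safe earlier node: replay from the start
```

The point of marking nodes non-idempotent is precisely that no checkpoint is ever placed on them, so a restart can never land a replay inside a sequence of unrepeatable side effects.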

Evaluation methodology

The test set is curated, not synthetic — real inputs sampled from production traffic with human-verified expected outputs. Synthetic test generation produces coverage gaps in exactly the adversarial cases where agent graphs fail. The manual curation cost is front-loaded and pays off across every subsequent iteration.

Result

The harness runs 200+ test cases per iteration, making it possible to refactor agent logic with confidence — something that felt intractable without the eval infrastructure.

Production deployment ongoing. The evaluation harness is the part of the system that has paid back the most — it’s the difference between “we think this change is safe” and “we know the 200 cases it passes and the 3 it doesn’t.”