Project: Ozer-AI
DAG-based agent orchestration for production LLM workflows
- Role: AI Engineer
- Stack: Python, FastAPI, LangGraph, PostgreSQL, Redis, Docker
- Metrics: Ongoing production deployment; evaluation harness runs 200+ test cases per iteration
LLM-powered workflows fail in production in ways that are hard to anticipate: agents loop, contradict earlier steps, or confidently produce wrong outputs that downstream nodes treat as ground truth. The bottleneck isn’t building the agent. It’s making it reliable enough to deploy without a human watching every run.
Ozer-AI is a production agent orchestration system built around a DAG execution model and a structured evaluation harness. The DAG makes data flow explicit and debuggable; the harness makes regressions visible before they ship.
What I built
The orchestration layer represents workflows as directed acyclic graphs. Each node is a typed unit — a prompt template, a tool call, a validation step, or a routing decision — with declared inputs, outputs, retry policy, timeout, and fallback behavior. Edges carry typed outputs between nodes. The engine executes nodes in topological order with configurable concurrency: independent branches run in parallel, dependent nodes block until their inputs resolve, and failed runs checkpoint at the last clean node boundary so restarts don’t replay expensive upstream work.
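The execution model described above can be sketched in a few lines. This is a minimal illustration, not the actual Ozer-AI engine: the `Node` and `run_dag` names are hypothetical, and it omits typed edges, retry policies, timeouts, fallbacks, and checkpointing, keeping only the topological scheduling with parallel independent branches.

```python
from concurrent.futures import ThreadPoolExecutor, Future
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    """One unit of work: runs fn over the outputs of its dependencies."""
    name: str
    fn: Callable[..., object]
    deps: list[str] = field(default_factory=list)

def run_dag(nodes: dict[str, Node], max_workers: int = 4) -> dict[str, object]:
    """Execute nodes in topological order; independent branches run in parallel."""
    results: dict[str, object] = {}
    remaining = dict(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining:
            # A node is ready once every dependency has resolved.
            ready = [n for n in remaining.values() if all(d in results for d in n.deps)]
            if not ready:
                raise ValueError("cycle or missing dependency in graph")
            # Submit all ready nodes at once so independent branches overlap.
            futures: dict[str, Future] = {
                n.name: pool.submit(n.fn, *(results[d] for d in n.deps)) for n in ready
            }
            for name, fut in futures.items():
                results[name] = fut.result()
                del remaining[name]
    return results

# Example: two independent branches off "fetch", joined at the end.
graph = {
    "fetch": Node("fetch", lambda: "raw"),
    "clean": Node("clean", lambda s: s.upper(), deps=["fetch"]),
    "count": Node("count", lambda s: len(s), deps=["fetch"]),
    "join": Node("join", lambda c, n: f"{c}:{n}", deps=["clean", "count"]),
}
```

Here "clean" and "count" run concurrently once "fetch" resolves, and "join" blocks until both finish.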
The evaluation harness ships alongside every workflow. It runs the graph against a curated set of inputs with expected outputs and behavioral assertions — not just “did it return a string” but “did the routing node correctly classify this edge case, did the tool-call node stay within its declared output schema, did the summarizer preserve the key constraint from the input.”
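A behavioral assertion, as opposed to a string match, is just a predicate over the workflow's output. The sketch below is a toy illustration under assumed names (`EvalCase`, `run_harness`, `toy_workflow` are all hypothetical, not the project's API); it shows the shape of a case that checks routing and schema-level properties rather than exact output text.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One curated input plus behavioral assertions on the workflow's output."""
    name: str
    input: dict
    checks: list[tuple[str, Callable[[dict], bool]]]  # (description, predicate)

def run_harness(workflow: Callable[[dict], dict], cases: list[EvalCase]) -> dict[str, list[str]]:
    """Run every case; return the descriptions of failing checks, keyed by case."""
    failures: dict[str, list[str]] = {}
    for case in cases:
        output = workflow(case.input)
        failed = [desc for desc, check in case.checks if not check(output)]
        if failed:
            failures[case.name] = failed
    return failures

# Toy stand-in for a real graph: routes refund queries to a billing tool.
def toy_workflow(inp: dict) -> dict:
    route = "billing" if "refund" in inp["query"] else "general"
    return {"route": route, "answer": f"handled by {route}"}

cases = [
    EvalCase(
        name="refund-routing",
        input={"query": "I want a refund"},
        checks=[
            ("routes to billing", lambda o: o["route"] == "billing"),
            ("answer names the route", lambda o: "billing" in o["answer"]),
        ],
    ),
]
```

An empty failure map means every behavioral assertion passed; anything else names exactly which property broke on which case.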
Hard parts
Routing reliability. LLM-based routing nodes — classifying whether a query goes to tool A or tool B — are the highest-variance component in the graph. Small prompt changes shift the distribution. The harness tracks routing accuracy per node across every iteration; a regression that drops any node below threshold blocks deployment.
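A per-node regression gate of this kind reduces to two small functions. This is a hedged sketch, not the project's code: the function names and the 0.95 threshold are illustrative assumptions.

```python
def routing_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of routing decisions that match the labeled route."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def blocked_nodes(per_node_accuracy: dict[str, float], threshold: float = 0.95) -> list[str]:
    """Routing nodes below threshold; a non-empty list blocks deployment."""
    return [node for node, acc in per_node_accuracy.items() if acc < threshold]
```

The gate is deliberately per-node: aggregate accuracy can stay flat while one router silently degrades, so each node is compared against its own threshold every iteration.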
Stateful restarts. Checkpointing at node boundaries sounds simple but requires careful handling of side-effecting nodes (tool calls that write state, API calls with rate limits). The engine marks side-effecting nodes as non-idempotent; on restart, it replays from the last idempotent checkpoint before the failed node rather than from the node itself.
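The restart-point selection above can be sketched as a backward scan. This is a simplified model under assumed names (`Step`, `restart_index` are hypothetical): it treats the workflow as a linear sequence of completed steps and finds where to resume so that no completed side-effecting step is replayed.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    idempotent: bool  # False for side-effecting steps (writes, rate-limited calls)
    completed: bool

def restart_index(steps: list[Step], failed_at: int) -> int:
    """Resume just after the last completed idempotent step before the failure.

    Checkpoints are taken at idempotent node boundaries, so restarting here
    restores clean state without re-running a completed side-effecting step.
    """
    for j in range(failed_at - 1, -1, -1):
        if steps[j].completed and steps[j].idempotent:
            return j + 1
    return 0  # no safe checkpoint: replay from the start
```

For example, if a side-effecting "post" step fails after an idempotent "summarize" completed, the run resumes at "post" with state restored from the checkpoint after "summarize", not by replaying the earlier write.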
Evaluation methodology
Every workflow ships with its harness: 200+ curated test cases per iteration, each pairing an input with expected outputs and behavioral assertions. The harness tracks per-node metrics across iterations (routing accuracy in particular), and any node that regresses below its threshold blocks deployment.
Result
The harness runs 200+ test cases per iteration, which made refactoring agent logic with confidence routine — something that felt intractable without the eval infrastructure.
Production deployment ongoing. The evaluation harness is the part of the system that has paid back the most — it’s the difference between “we think this change is safe” and “we know the 200 cases it passes and the 3 it doesn’t.”