In production-grade AI agents, success hinges less on model selection than on robust data plumbing, control mechanisms, and observability. This article walks through the ‘doc-to-chat’ pipeline: a reference architecture for agentic Q&A, copilots, and workflow automation whose answers respect permissions and are audit-ready.

Understanding the ‘Doc-to-Chat’ Pipeline

A ‘doc-to-chat’ pipeline ingests enterprise documents, standardizes them, enforces governance policies, indexes embeddings alongside relational features, and serves retrieval and generation capabilities behind authenticated APIs, with human-in-the-loop (HITL) checkpoints. It’s essentially a hardened version of retrieval-augmented generation (RAG) with LLM guardrails, governance, and OpenTelemetry-backed tracing.

Seamless Integration with Existing Stack

To integrate cleanly, use standard service boundaries like REST/JSON or gRPC over a trusted storage layer. For tabular data, Apache Iceberg offers ACID transactions, schema evolution, partition evolution, and snapshots, all crucial for reproducible retrieval and backfills. For vectors, either pick a system that coexists with SQL filters, such as pgvector, which collocates embeddings with business keys and ACL tags in PostgreSQL, or a dedicated engine like Milvus for high-QPS approximate nearest neighbor (ANN) search with disaggregated storage and compute. Many teams use both: SQL+pgvector for transactional joins and Milvus for heavy retrieval.
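The policy-aware join pattern can be sketched in pure Python, with an in-memory stand-in for what pgvector expresses in SQL (`WHERE acl = ANY(...) ORDER BY embedding <=> query LIMIT k`). All IDs, tags, and vectors below are illustrative assumptions:

```python
import math

# Illustrative corpus: each record carries an embedding plus an ACL tag,
# mirroring how pgvector collocates vectors with business keys in PostgreSQL.
DOCS = [
    {"id": "hr-001",  "acl": "hr",      "vec": [0.9, 0.1, 0.0]},
    {"id": "eng-042", "acl": "eng",     "vec": [0.1, 0.8, 0.1]},
    {"id": "eng-077", "acl": "eng",     "vec": [0.2, 0.7, 0.1]},
    {"id": "fin-003", "acl": "finance", "vec": [0.0, 0.1, 0.9]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def policy_aware_search(query_vec, allowed_acls, k=2):
    """Filter by ACL first, then rank by similarity -- the caller never
    sees documents outside its permission set."""
    visible = [d for d in DOCS if d["acl"] in allowed_acls]
    visible.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["id"] for d in visible[:k]]

print(policy_aware_search([0.15, 0.75, 0.1], {"eng"}))
```

The key design point is that the ACL filter is applied in the same query as the similarity ranking, so permissioning cannot drift out of sync with retrieval.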

Coordinating Agents, Humans, and Workflows

Production agents require explicit coordination points for human approval, correction, or escalation. Services like Amazon A2I provide managed HITL loops, and frameworks like LangGraph model these human checkpoints inside agent graphs, treating approvals as first-class steps in the graph rather than ad hoc callbacks. The pattern is: LLM → confidence/guardrail checks → HITL gate → side-effects, with every artifact persisted for auditability and future re-runs.
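A minimal sketch of that gate, with a hypothetical confidence threshold and an in-memory queue standing in for a managed service like A2I or a LangGraph interrupt:

```python
from dataclasses import dataclass, field

APPROVAL_QUEUE = []  # stand-in for a managed HITL service such as Amazon A2I

@dataclass
class Draft:
    answer: str
    confidence: float          # produced by upstream guardrail/eval checks
    audit_log: list = field(default_factory=list)

def hitl_gate(draft, threshold=0.8):
    """LLM -> checks -> HITL gate -> side-effects. Low-confidence drafts are
    parked for a human; every decision is appended to the audit log."""
    draft.audit_log.append(f"gate: confidence={draft.confidence:.2f}")
    if draft.confidence < threshold:
        APPROVAL_QUEUE.append(draft)
        draft.audit_log.append("routed to human review")
        return "pending_review"
    draft.audit_log.append("auto-approved")
    return "approved"

print(hitl_gate(Draft("Refund issued.", 0.95)))    # high confidence
print(hitl_gate(Draft("Contract voided.", 0.40)))  # low confidence
```

Only the gate's verdict flows onward to side-effects; the draft itself, its confidence, and the audit trail are what make later re-runs and audits possible.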

Ensuring Reliability Before Model Interaction

Treat reliability as layered defenses. Implement language and content guardrails to pre-validate inputs/outputs for safety and policy compliance, using options like Bedrock Guardrails, NeMo Guardrails, or Guardrails AI. Detect and redact personally identifiable information (PII) using tools like Microsoft Presidio. Enforce row-/column-level access control and audit across catalogs using Unity Catalog, and evaluate RAG with reference-free metrics like faithfulness and context precision/recall using tools like Ragas.

Scaling Indexing and Retrieval Under Real Traffic

Two key axes matter: ingest throughput and query concurrency. Normalize ingest at the lakehouse edge, writing to Iceberg for versioned snapshots, then embedding asynchronously to enable deterministic rebuilds and point-in-time re-indexing. For vector serving, Milvus’s shared-storage, disaggregated compute architecture supports horizontal scaling with independent failure domains. Use hybrid retrieval (BM25 + ANN + reranker) for structured+unstructured fusion, and store structured features next to vectors to support filters and re-ranking features at query time.
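One common way to fuse the BM25 and ANN result lists is reciprocal rank fusion (RRF); a sketch with hypothetical document IDs:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1/(k + rank).
    k=60 is the constant from the original RRF paper; it damps the
    influence of any single ranker's top positions."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc-7", "doc-2", "doc-9"]   # lexical ranking
ann_hits  = ["doc-2", "doc-4", "doc-7"]   # vector ranking
print(rrf_fuse([bm25_hits, ann_hits]))
```

RRF needs only ranks, not scores, so it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales; a learned reranker can then refine the fused top-k.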

Monitoring Beyond Logs

To monitor effectively, stitch together traces, metrics, and evaluations. Emit OpenTelemetry spans across ingestion, retrieval, model calls, and tool use. LLM observability platforms such as LangSmith (which natively ingests OTEL traces and interoperates with external APMs), Arize Phoenix, Langfuse, and Datadog add tracing, evaluations, cost tracking, and enterprise readiness. Schedule RAG evaluations on canary sets and live-traffic replays to track faithfulness and grounding drift over time. Add schema profiling/mapping at ingestion so observability stays attached to data-shape changes and can explain retrieval regressions when upstream sources shift.
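The span-per-stage idea can be sketched with a stdlib stand-in; real code would use the OpenTelemetry SDK and export to LangSmith or an APM, but the shape is the same: each stage records a timed span, and all spans share one trace ID so they stitch into a single trace:

```python
import contextlib
import time
import uuid

SPANS = []  # collected spans; an OTEL exporter would ship these out-of-process

@contextlib.contextmanager
def span(name, trace_id, **attrs):
    """Record one timed span per pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
            **attrs,
        })

trace = uuid.uuid4().hex
with span("retrieval", trace, index="milvus-main"):
    time.sleep(0.01)  # stand-in for the ANN call
with span("llm_call", trace, model="hypothetical-model"):
    time.sleep(0.01)  # stand-in for generation

print([s["name"] for s in SPANS])
```

Because every span carries the trace ID plus attributes like the index or model name, a regression can be localized to a stage (slow retrieval vs. slow generation) rather than debugged from flat logs.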

Example: Doc-to-Chat Reference Flow

Ingest: connectors → text extraction → normalization → Iceberg write (ACID, snapshots).
Govern: PII scan (Presidio) → redact/mask → catalog registration with ACL policies.
Index: embedding jobs → pgvector (policy-aware joins) and Milvus (high-QPS ANN).
Serve: REST/gRPC → hybrid retrieval → guardrails → LLM → tool use.
HITL: low-confidence paths route to A2I/LangGraph approval steps.
Observe: OTEL traces to LangSmith/APM + scheduled RAG evaluations.
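The stages above can be wired together as a minimal orchestration sketch; every stage here is a stub, and all function names and values are illustrative assumptions rather than a real implementation:

```python
def ingest(raw):      return {"text": raw.strip(), "snapshot": "iceberg-v1"}
def govern(doc):      return {**doc, "text": doc["text"].replace("123-45-6789", "<SSN>")}
def index(doc):       return {**doc, "indexed_in": ["pgvector", "milvus"]}
def retrieve(q, doc): return [doc] if q.lower() in doc["text"].lower() else []
def answer(q, ctx):   return {"answer": f"{len(ctx)} source(s) found",
                              "confidence": 0.9 if ctx else 0.2}

def doc_to_chat(raw_doc, question):
    """Ingest -> Govern -> Index -> Serve, with a HITL flag on low confidence."""
    doc = index(govern(ingest(raw_doc)))
    ctx = retrieve(question, doc)
    result = answer(question, ctx)
    result["needs_human"] = result["confidence"] < 0.8
    return result

print(doc_to_chat("Refund policy: 30 days. SSN 123-45-6789", "refund"))
```

The value of writing the flow this way is that each stage has a single, swappable seam: the stubbed `govern` can become Presidio, `retrieve` can become hybrid pgvector/Milvus search, and the HITL flag can route into an A2I or LangGraph approval step without reshaping the pipeline.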

Why ‘5% AI, 100% Software Engineering’ in Practice

Most outages and trust failures in agent systems aren’t model regressions; they’re data quality, permissioning, retrieval decay, or missing telemetry issues. The controls mentioned—ACID tables, ACL catalogs, PII guardrails, hybrid retrieval, OTEL traces, and human gates—determine whether the same base model remains safe, fast, and trustworthy for users. Invest in these controls first; swap models later if needed.

In conclusion, building reliable, scalable, and observable AI agents is predominantly about robust software engineering, with data plumbing, controls, and observability taking center stage. The ‘doc-to-chat’ pipeline serves as a blueprint for achieving this, with its focus on data standardization, governance, indexing, serving, coordination, reliability, scaling, and monitoring.
