Pros and Cons of using Elasticsearch as a single backend platform for enterprise AI system
In a May blog post from Elasticsearch, Matthew Skinner highlighted the major advantages of using Elasticsearch as the backend platform for an Enterprise AI stack.
https://www.elastic.co/blog/elasticsearch-platform-missing-piece-ai-stack
It highlighted the four main memory types an AI stack uses and how ELK stack fits in as one and single engine for all of them.
Episodic memory
Semantic memory
Procedural memory
Workflow state
Consolidating all four components into one system has clear advantage of reducing cost, overhead and simplify. But what will be the price you have to pay for that? What are the impacts when different workload competes on the same platform?
I asked Claude to give us a deeper analysis on the pros and cons for using Elasticsearch as the backend platform and the answers are revealing. It is a good complement for the Elastic blog post.
Elasticsearch as an AI Backend Engine: Pros, Cons, and a Tiered Strategy
A rigorous evaluation of using Elasticsearch for vector search, context embedding, and memory functions in an enterprise AI system — and where purpose-built alternatives earn their place.
Organisations already running Elasticsearch for observability have a compelling case for extending it into AI memory and retrieval functions — one operational footprint, unified security, and hybrid BM25 + vector search that often outperforms pure vector databases on enterprise document retrieval. The risks are real, however: resource contention with observability ingestion pipelines, ML node licensing costs, and capability gaps in graph memory and filtered approximate nearest-neighbour search at scale. The recommended path is a tiered strategy — Elastic handles semantic memory, episodic history, and AI observability, while a purpose-built vector store supplements when corpus size or latency SLAs demand it.
Why the question matters
The decision of which backend to use for AI memory and context embedding is one of the highest-leverage infrastructure choices an enterprise can make. Get it right and you inherit years of operational stability, proven security, and familiar tooling. Get it wrong and you face a painful migration once your AI workloads grow beyond the capabilities of the chosen store.
For organisations already running Elasticsearch as their observability backbone, the temptation to extend it into AI functions is understandable and often correct. But the decision deserves rigorous analysis rather than convenience-driven default. This article provides that analysis.
What "AI backend" encompasses
When we evaluate Elasticsearch as an AI backend engine, we are asking whether it can serve four distinct roles: a vector store for semantic similarity search and RAG retrieval; a document store for chunked enterprise knowledge; an episodic memory store for conversational history and interaction logs; and — critically — the observability backend for the AI pipeline itself. These are four different workload profiles with different performance requirements, and Elasticsearch's fitness varies meaningfully across them.
Where Elasticsearch excels as an AI backend
You already run it. No additional cluster to provision, monitor, patch, capacity-plan, or train your operations team on. This reduces vendor surface area, simplifies your security perimeter, and eliminates the inter-system latency that would otherwise be introduced by a separate vector store. In enterprise environments where infrastructure approval cycles are long, this is a genuinely significant advantage.
Elasticsearch's reciprocal rank fusion lets you combine classic keyword search (BM25) and semantic vector search in a single query. In enterprise settings, this hybrid approach frequently outperforms pure vector search because enterprise documents often contain exact terminology — product codes, process names, regulatory references — that keyword matching handles better than embedding similarity alone. This is a genuine differentiator versus purpose-built vector databases.
AI traces, LLM token costs, retrieval quality metrics, and infrastructure metrics all living in one platform is a significant operational advantage. You can correlate a latency spike in your LLM pipeline with an infrastructure anomaly — a hot shard, a GC pause, a node replacement — without leaving Kibana or crossing a system boundary. This is one of the strongest arguments for the Elastic-as-AI-backend approach.
Elastic's ELSER v2 (Elastic Learned Sparse Encoder) is a sparse retrieval model trained on enterprise-style documents. For primarily English-language corpora — policy documents, knowledge base articles, incident reports, technical wikis — ELSER v2 achieves strong retrieval recall without requiring an external embedding API call for every document. This eliminates a network hop, reduces latency, and keeps data within your cluster boundary.
Elastic's field-level and document-level security maps cleanly to AI project and team data isolation requirements. You do not need to re-solve identity and access control for a new store — your existing role definitions, API key policies, and audit logging infrastructure apply directly to AI vector indices. In regulated environments, this is a compelling compliance argument.
Where Elasticsearch falls short
Dedicated vector stores such as Qdrant, Weaviate, and Pinecone are architected purely around approximate nearest-neighbour search. Elasticsearch's HNSW-based kNN is capable but makes architectural trade-offs — recall accuracy and per-query latency at high vector counts — in favour of its general-purpose document storage model. For most enterprise RAG use cases this gap is acceptable; for latency-critical applications at very high scale, it is not.
Running vector indexing and ANN search on the same cluster as your high-throughput observability ingestion pipeline creates resource contention. A spike in log ingestion — during an incident, a deployment, a batch job — can degrade RAG query latency at precisely the moment users need the AI system most. Separate data tiers or separate clusters mitigate this, but both approaches erode the single-footprint operational advantage that makes Elastic attractive in the first place.
Hosting ELSER or third-party embedding models within the cluster requires ML nodes, which sit behind Elastic's Platinum or Enterprise licence tier. Depending on your existing licence, this may add meaningful cost on top of your observability spend. Evaluate the total cost of ownership — ML node compute plus licence delta — against the alternative of using an external embedding API such as Cohere or OpenAI Embeddings before committing.
Enterprise AI systems increasingly benefit from a knowledge graph layer for entity relationship traversal — understanding that a document references a project, which is owned by a team, which reports to a business unit. Elasticsearch has no native graph capability of this kind. If your AI system's procedural and semantic memory requires entity relationship reasoning, you will need to supplement with a graph database such as Neo4j or AWS Neptune.
Pre-filtering before approximate nearest-neighbour search — for example, retrieve only documents belonging to this department, classified at this level, updated within this date range — can degrade kNN recall significantly in Elasticsearch. Purpose-built vector databases handle filtered ANN with dedicated index structures (payload indices in Qdrant, for example) that maintain recall quality under aggressive pre-filtering. If your retrieval logic requires multiple simultaneous metadata filters, this is a meaningful architectural risk.
"The right question is not whether Elasticsearch can do vector search — it can. The right question is whether it can do it at your required scale and latency SLA without compromising your observability pipeline. Those are two very different answers." — Enterprise Architecture principle, derived from operational field experience
A tiered architectural strategy
Given this capability profile, the pragmatic recommendation for organisations already on Elastic is a tiered approach — use Elastic for the workloads it handles well, and supplement with a purpose-built store when the use case demands it.
Use Elastic for these workloads
Semantic memory (RAG over enterprise documents) — the hybrid BM25 + kNN retrieval is genuinely strong for enterprise prose, and ELSER v2 performs well on English-language corpora without requiring an external embedding service. This is the primary AI use case where Elastic's advantages are clearest.
AI system observability — LLM traces, token costs, retrieval latency, feedback loops, and model performance metrics all belong in your existing Kibana dashboards. This is non-negotiable as an Elastic workload regardless of what you choose for other AI functions.
Episodic memory and conversational history — storing and retrieving past interaction summaries via vector similarity is a well-matched workload for Elastic's kNN capabilities.
Skill and tool metadata — a standard document index with structured fields is a natural fit for storing agent skill definitions, tool schemas, and versioned prompt templates.
Consider supplementing when
Your RAG vector corpus exceeds approximately 50–100 million vectors and you have strict latency SLAs (sub-20ms p99 at query time). At this scale, a purpose-built vector store's optimised index structures become meaningful. Qdrant is a strong candidate for European enterprises: self-hostable, MIT licensed, strong data residency guarantees, and excellent filtered ANN performance.
You require rich entity relationship traversal across your knowledge base — at that point a graph database supplements rather than replaces Elastic. You need to isolate heavy AI vector workloads from your observability ingestion pipeline — either a dedicated data tier within the same cluster or a separate lightweight cluster resolves this.
One specific operational caution
Do not co-locate your AI vector indices on the same hot tier nodes as your high-ingestion observability data. Either use a dedicated data tier with separate node roles within the same cluster, or run separate clusters. The operational simplicity of a single cluster is only worthwhile if the two workloads — observability ingestion and AI retrieval — do not compete for the same heap, I/O, and CPU at peak load.
A note on multilingual corpora
ELSER v2 performs well on English-language enterprise prose but is not a multilingual model. Organisations with mixed-language content — Swedish and English, for example, which is common in Nordic enterprise environments — will need to supplement with a multilingual dense embedding model. Strong candidates include Cohere's multilingual embedding models and the open-source multilingual E5 family from Microsoft Research, both of which can be hosted within an Elastic ML node or called as external inference endpoints.
This is not a disqualifying limitation for Elastic as your AI backend — it is simply a configuration decision that needs to be made explicitly rather than defaulted into. Choosing ELSER for an implicitly multilingual corpus without awareness of the language coverage is a common implementation mistake.
Verdict: fit for purpose with clear boundaries
Further reading
About This Content & Verification Obligations
This article was generated by Claude Sonnet 4.6, an AI assistant developed by Anthropic. It was produced by synthesising publicly available technical documentation, architectural guidance, and best-practice resources from the sources listed above, retrieved in June 2026.
In the spirit of the AI Fluency model, readers are reminded of the following diligence obligations before relying on this content for infrastructure or procurement decisions:
- Elasticsearch version capabilities (kNN, ELSER, ML nodes, filtered ANN) evolve across releases. Validate all capability claims against the documentation for your specific deployed version.
- Performance benchmarks and scale thresholds cited here are directional guidance, not guarantees. Conduct your own benchmarks against your actual corpus size, query patterns, and latency requirements before making architectural decisions.
- Licensing costs for Elastic ML nodes vary by contract, region, and deployment model. Consult your Elastic account team for accurate current pricing.
- This content does not constitute professional infrastructure, procurement, or vendor advice. Organisations making platform decisions based on this content should engage qualified architects and conduct vendor evaluations.
- The vector database ecosystem is evolving rapidly. Competitor capabilities cited here may have changed since the time of writing.
Responsible AI infrastructure decisions require hands-on evaluation, proof-of-concept testing, and contextual judgment beyond what any generative summary can provide.
Reference Architecture for Enterprise Internal AI System
As Agentic AI transforms from experimental sandbox into production-grade operational capability, building a solid enterprise architecture for this ecosystem becomes both critical and urgent. I asked Claude to research and synthesize guidance from Gartner, LangChain, Microsoft, and Anthropic into a practical reference architecture for internal enterprise AI systems.
The architecture addresses four questions:
What are the main components and layers of the architecture?
What are their functions and roles?
What are the best practice design considerations for each capability?
What are the key governance-related capabilities?
Each layer is covered in depth, with design recommendations and a readiness assessment for enterprise architects.
Building an Enterprise Internal AI System: A Reference Architecture
A five-layer reference model for enterprise architects designing LLM, chat, memory, and skills infrastructure — with governance baked in from the start.
Enterprises building internal AI systems need more than a chat interface bolted onto an LLM API. A durable architecture requires five distinct layers — experience, orchestration, intelligence, memory, and infrastructure — each with clear ownership boundaries and governance hooks. The most consequential design decision is the orchestration layer, where the choice between single-agent, multi-agent, and workflow-based patterns determines how well the system scales, audits, and adapts. Governance, data classification, and project-scoped memory isolation are not afterthoughts — they are the preconditions for regulatory compliance and controlled rollout across an enterprise.
The five-layer reference architecture
Most enterprise AI projects begin with a single use case — a chatbot, a document search tool, a code assistant — and evolve organically from there. Without a deliberate architectural frame, these point solutions accumulate into a fragmented estate that is hard to govern, expensive to operate, and impossible to audit. A layered reference architecture solves this by giving each concern a home.
The five layers, from user-facing to infrastructure, are: the experience layer (every channel through which users reach the AI — chat interfaces, IDE plugins, Slack bots, embedded widgets); the orchestration layer (the routing, planning, and tool-dispatch logic that sits between a user request and the models); the intelligence layer (foundation LLMs, embedding models, rerankers, and classifiers); the memory and storage layer (the vector stores, document stores, graph and relational databases that give the system persistent knowledge); and the infrastructure layer (inference compute, API gateways, model registries, message buses, and the observability stack). A vertical governance band — identity, audit, cost controls, and security policies — cuts across all five layers simultaneously.
Experience layer — keep it thin
The experience layer should be channel-agnostic. Build a shared SDK or API contract that all front-ends consume; never let channel-specific logic (Slack formatting, HTML rendering, voice turn-taking) leak into lower layers. Enforce SSO and identity propagation from day one — every request must carry an authenticated principal that the downstream audit log can reference.
Infrastructure layer — two founding decisions
Before any other infrastructure choice, enterprise architects must answer two questions: where does inference run (cloud API, self-hosted, or hybrid), and who owns the model weights (commercial vs open). For European enterprises especially, data residency requirements often mandate private-hosted models for sensitive workloads. Architect for both paths from the start, even if you begin on cloud APIs.
The orchestration layer: the most consequential choice
If the experience layer is what users see and the intelligence layer is where reasoning happens, the orchestration layer is where the system's character is defined. It is also the layer most likely to become technical debt if chosen poorly. There are three dominant patterns.
A single LLM session handles routing, reasoning, and tool use. Simple to deploy and reason about — appropriate for early proof-of-concept work. Does not scale well to multiple specialised use cases, and system prompts become unwieldy as capability grows. Avoid as a long-term architecture.
A routing LLM or intent classifier dispatches requests to specialised agents — an HR agent, a code agent, a data agent, a policy agent. Each agent operates with a scoped system prompt and a constrained tool set. Scales well to diverse enterprise use cases, adds modest latency at the routing step, and requires careful design of inter-agent handoffs. This is the recommended default for most enterprise deployments.
Deterministic pipelines (LangGraph, Temporal, Prefect) define explicit state machines for LLM-assisted workflows. Each step is logged, retriable, and auditable. Best suited for compliance-sensitive flows — contract review, financial approvals, regulated document generation — where full auditability of every step is non-negotiable. Higher implementation overhead; not the right default for conversational use cases.
"The orchestration layer is where most enterprise AI projects either earn or lose the trust of their compliance teams. If you cannot reconstruct what the agent decided and why, you cannot operate in a regulated environment." — Enterprise Architecture principle, adopted from LangChain and Temporal design guidance
For most organisations, the practical path is to start with a multi-agent router and evolve toward workflow orchestration for regulated flows as those use cases emerge. Build the router as a swappable component from the outset.
Memory architecture: four distinct memory types
Enterprise AI systems need memory that extends well beyond the LLM's context window. A useful taxonomy distinguishes four types, each with a different scope, latency profile, and appropriate backend.
Projects as the isolation unit
A "project" in your AI system should bundle together a vector store namespace, a set of allowed tools, a system prompt, and an access policy. This gives you the isolation unit you need for both compliance (data stays within the project boundary) and cost allocation (token consumption and storage are attributable to a team or department). Design projects as first-class architectural citizens, not as an afterthought.
Skills as versioned artifacts
Treat agent skills — tool definitions, prompt templates, few-shot examples — as code: versioned in Git, tested via eval harnesses, promoted through dev, staging, and production environments. The most common enterprise AI failure mode is skills that drift silently in production, producing subtly different outputs over time with no audit trail and no rollback path.
Governance: the precondition, not the afterthought
Enterprise architects who treat governance as a layer to add once the system is working will find themselves in an expensive retrofit. Governance decisions — data classification, identity propagation, audit logging, cost controls — need to be wired into the architecture before any production traffic flows.
Define your data classification tiers (public, internal, confidential, restricted) before any model touches data. Map each tier to an allowed compute boundary — which tiers can flow to cloud APIs, which must stay on-premises, which require encryption at rest and in transit. Encode this as policy-as-code so every new integration is automatically validated against classification rules.
Wrap every LLM call behind a ModelProvider abstraction so you can swap vendors, add privately hosted models, or route by classification tier without rewriting orchestration logic. This abstraction also enables A/B testing between models and cost-optimised routing — sending simple tasks to smaller, cheaper models and reserving large frontier models for complex reasoning tasks.
Enterprise AI systems that ingest user-provided or external content are vulnerable to prompt injection — adversarial instructions embedded in documents, emails, or tool outputs that attempt to hijack the agent's behaviour. Build input sanitisation, output validation, and permission-scoped tool execution into the orchestration layer from the start. Do not rely on model-level safeguards alone.
You need distributed traces across the full LLM call chain: input prompt, retrieved context chunks, tool calls, model output, latency, token cost, and user feedback signal. Without this, you cannot diagnose quality regressions, attribute costs to business units, or demonstrate compliance to auditors. Build this from day one — retrofitting observability onto a running AI system is significantly harder than including it in the initial design.
Readiness assessment for enterprise architects
Before committing to a production rollout, assess your organisation's readiness across these two dimensions.
Further reading
About This Content & Verification Obligations
This article was generated by Claude Sonnet 4.6, an AI assistant developed by Anthropic. It was produced by synthesising publicly available research, architectural guidance, and best-practice documentation from the sources listed above, retrieved in June 2026.
In the spirit of the AI Fluency model, readers are reminded of the following diligence obligations before relying on this content for business, investment, or technical decisions:
- Architectural patterns and technology recommendations should be validated against your organisation's specific regulatory environment, existing infrastructure, and team capabilities before adoption.
- AI-generated summaries can introduce paraphrasing errors, missed nuance, or context loss. The original sources represent the authoritative record.
- The AI tooling landscape evolves rapidly. Technology recommendations (specific frameworks, vendors, model families) may be partially outdated at the time of reading.
- This content does not constitute professional, legal, or regulatory advice. Organisations making architectural decisions based on this content should engage qualified enterprise architects and compliance specialists.
Responsible AI deployment requires human oversight, source verification, and contextual judgment — the very principles this article advocates for.
Challenges in Agentic AI Observability and Best Practices
With Agentic AI, programming and debugging enter a new era where machine takes over main part of the heavy lifting work. For those that have tested Vibe Coding, you will notice the new challenge this ways of working creates. It is about debugging and troubleshooting, or in general observability of the code execution. Especially if you start to incorporate more and more of 3rd party Skills and MCP services.
Again, I asked the Claude to share some of the latest research within this area, as well as the best practice recommendations.
When you build with Claude Code, MCP servers, third-party skills, and external APIs, you are composing a multi-layer system where failures are often silent, cascading, and non-deterministic. Traditional debugging intuitions — find the stack trace, reproduce the error, fix the line — frequently do not apply. This research summary maps the failure modes you will actually encounter, what the industry has learned, and what concrete practices and tools can meaningfully improve your ability to debug and recover.
Why Agentic Errors Are Uniquely Hard to Debug
The reason troubleshooting agentic AI workflows feels different is not a skill gap — it is a fundamental architectural difference. Traditional software fails noisily and deterministically. Agentic AI fails quietly and probabilistically.
"AI agents don't fail in obvious ways. Instead of crashing or throwing clear errors, they often make subtle mistakes that compound over time — pulling the wrong context, calling the wrong tool, or hallucinating outputs. That makes traditional observability approaches, built for deterministic software, insufficient." — Vellum AI, A Practical Guide for AI Observability (2025), cited in Atlan's AI Agent Observability Guide
When you invoke a tool through an MCP server, the execution chain can involve: Claude deciding which tool to call, the MCP transport layer, the tool's own implementation, a downstream API, a rate limiter, a network hop, and a response parser. Any link in that chain can fail — and the failure may surface three steps later as a subtly wrong output, not an error code.
The Taxonomy of Agentic Failures
The agent returns a confident, well-formed response that is factually wrong. No exception is thrown. No error code is emitted. Binary pass/fail monitoring is completely blind to this class of failure — it is only detectable through output evaluation.
A single wrong assumption at step 1 propagates into every downstream automated action. By the time a human detects the error, dozens of tool calls may have executed on a corrupted premise. Common in multi-step ReAct loops and agentic coding tasks.
Rate limits (HTTP 429), gateway overload (HTTP 503/529), or intermittent network errors that are not retried correctly. Datadog found that 60% of all LLM call errors in production were caused by exceeded rate limits alone — model provider capacity ceilings directly compromising agent reliability.
An MCP server process dies silently (surfacing as -32000 JSON-RPC errors), a tool's parameter schema drifts from what the model expects, or a third-party skill executes but returns data in an unexpected format the agent misinterprets.
The agent operates on a stale or truncated context window. In long agentic sessions, context compaction can silently drop earlier tool results or decisions. The agent then "forgets" constraints and produces actions inconsistent with the session's earlier state.
"A hallucinated assumption cascades into hundreds of automated downstream actions before humans detect the error. Each wrong decision becomes an input to subsequent processes, creating compounding failure modes." — Atlan, AI Agent Hallucination: Causes, Risks & Context Solutions (2026)
Claude Code: The Errors You Will Actually Hit
Codersera's field guide (May 2026) analyzed the most common Claude Code failure patterns seen in practice. Most errors are environmental, not model bugs — meaning they are fixable once you understand the transport layer.
Run /doctor first. This command surfaces the majority of misconfigurations — MCP connectivity, auth token validity, context usage, and skill availability — in a single pass.
Anthropic's inference endpoints are at capacity. This is a provider-side constraint, not a bug in your code. Back off, wait, and consider switching to a less-loaded model tier (e.g. Sonnet instead of Opus during peak hours). Do not retry immediately — it makes the problem worse.
Your OAuth token has expired. Run /logout then /login. If using API keys directly, verify the key is active and has sufficient quota remaining.
The MCP server process died on launch or crashed silently. This is the most common MCP failure mode. Check that the server binary is installed, the path in your config is correct, and the server process can start independently (run it manually in a terminal first). "Client Closed" errors are usually the same root cause.
MCP tools are loaded and visible in settings but Claude reports "No such tool available". This is a bridge/scoping issue — the tool definitions are not being passed through to the model's context. Verify that MCP permissions are scoped correctly and restart the server. Too many installed MCP servers can cause tool list truncation.
The 1M context window has a ~33K token compaction buffer. Automatic compaction can silently drop earlier tool results. Most "context full" errors are actually compaction-thrash on noisy tool output. Mitigation: chunk tasks into subsystems with running summaries rather than relying on a single massive context window.
For skills and custom slash commands, the most common failure pattern is Claude deciding not to use a skill when you expected it to — because too many competing tool definitions reduce the signal-to-noise ratio of each tool's purpose. Keep installed MCP servers to what is actually needed for the current task.
Rate Limits & API Availability: The Operational Reality
Rate limit failures are the single largest class of LLM production errors. Datadog's 2026 State of AI Engineering report found that 5% of all LLM call spans in production reported an error, and of those, 60% were caused by exceeded rate limits. This is not an edge case — it is the dominant failure mode.
Why Naive Retry Fails
When an agent hits a 429 rate-limit response, the instinct is to retry immediately. This is exactly wrong — it worsens the problem by triggering stricter throttling and wasting quota on failed retries. The correct pattern is exponential backoff with jitter: wait increasingly longer intervals between retries, with randomness added to prevent multiple agents from thundering back simultaneously.
import time, random, anthropic client = anthropic.Anthropic() def call_with_backoff(prompt, max_retries=5): for attempt in range(max_retries): try: return client.messages.create( model="claude-sonnet-4-6", max_tokens=1000, messages=[{"role": "user", "content": prompt}] ) except anthropic.RateLimitError: if attempt == max_retries - 1: raise wait = (2 ** attempt) + random.uniform(0, 1) # jitter time.sleep(wait) except anthropic.APIStatusError as e: if e.status_code == 529: # Anthropic overloaded time.sleep(30) # hard wait, not exponential else: raise
Pre-flight Quota Checking for Multi-Step Agents
For agentic workflows that chain multiple LLM calls, the best practice is to estimate total token consumption before starting the task and verify quota is available — not mid-execution when a partial failure wastes prior work. For a 5-step ReAct loop, check whether you have sufficient quota for all five steps before making the first call.
A practical soft ceiling: set your own internal rate limit at 20% below your provider tier. Hitting your own limit triggers backoff before the provider's 429 reaches you, giving you a clean recovery window rather than a hard failure in the middle of an agent run.
The Observability Stack: What Good Looks Like
The emerging consensus is that a single monitoring tool is insufficient for agentic systems. The best-practice architecture in 2026 is a composite stack layered by concern: infrastructure, LLM tracing, tool execution, and output quality evaluation.
"The best solution in 2026 is a composite stack: a dedicated LLM tracing platform to monitor non-deterministic reasoning, paired with a managed integration layer to observe and standardize the actual third-party API tool executions." — Truto Architecture Guide, April 2026
Standard APM monitoring for rate limit tracking, HTTP error rates, latency percentiles per provider, and token cost attribution. Datadog LLM Observability, Azure AI Foundry, or Prometheus+Grafana. Alert when 429s exceed 0.5% of requests over a 24-hour window — that is the signal to review your quota tier or add caching.
Capture every model call, tool invocation, reasoning step, and memory access as structured spans. Tools: Langfuse (acquired by ClickHouse, Jan 2026), LangSmith, Arize Phoenix, or Braintrust. Every span should record: prompt, response, tool name, parameters, result, token count, latency, and cost. This is the layer that lets you answer "why did the agent call that tool with those parameters?"
Log every tool call independently of the LLM trace — the input, output, latency, and error state — so failures in third-party skills are attributable to the tool, not to the model. For MCP servers, emit structured logs from the server process itself. PostHog's MCP server integration is an example of first-class error tracking at the tool layer, surfacing error patterns directly in the coding environment.
The layer that catches silent failures — wrong-but-confident outputs that pass all the layers above. Implement evaluation gates: automated checks on agent outputs before they trigger downstream actions. Tools like Braintrust and Monte Carlo provide evaluation frameworks. For coding agents specifically: run tests after every significant file write, and treat a failing test as an observability signal, not just a development artifact.
The OpenTelemetry Standard: Instrument Once, Export Anywhere
The industry is converging on OpenTelemetry GenAI Semantic Conventions as the vendor-neutral standard for LLM telemetry. Developed by the OTel GenAI Special Interest Group since April 2024, the conventions define a unified schema for LLM calls, agent steps, tool invocations, token usage, and quality metrics — so your traces are consistent regardless of which model provider or framework you use.
As of March 2026, most GenAI semantic conventions are in experimental status, meaning the API is not yet fully stabilized. For production adoption, the OTEL_SEMCONV_STABILITY_OPT_IN environment variable allows dual-emission during transitions. Major platforms — Datadog, Google Cloud, AWS, Azure — have all adopted the standard. Anthropic, Cohere, and Bedrock instrumentation is supported but less mature than the OpenAI SDK integration.
The practical benefit: instrument your agent once with the OTel SDK, and the same telemetry pipeline feeds Datadog, Grafana, Jaeger, or any other backend — no vendor lock-in, no rewriting when you switch tools.
Key Tools for Agentic Observability
The LLM observability market reached an estimated $1.97B in 2025, growing at roughly 36% annually. The tools have matured substantially — though the Gartner estimate that only 15% of GenAI deployments currently instrument observability signals how much opportunity remains.
| Tool | Primary Use | Strengths for Agentic Debugging | Model |
|---|---|---|---|
| Langfuse | LLM tracing & evaluation | Prompt/response replay for debugging; prompt-response pair capture; evaluation framework; acquired by ClickHouse (Jan 2026) for scale | Open Source |
| LangSmith | LangChain ecosystem tracing | Deep integration with LangGraph/LangChain; multi-step agent trace visualization; built-in evals; debugging console | Paid SaaS |
| Datadog LLM Obs. | Full-stack + LLM monitoring | Native OTel GenAI SemConv support (v1.37); bridges APM infrastructure data with LLM traces; rate limit dashboards out of the box | Paid SaaS |
| Arize Phoenix | Multi-agent tracing | Multi-service tracing across agent chains; supports LLM + tool + embedding traces; good for complex multi-agent topologies | Open Source |
| PostHog MCP | Error tracking at tool layer | Error tracking for MCP tool failures directly in Claude Code / Cursor; surfaces most common errors, full stack traces, severity by volume | Free tier |
| Braintrust | Output evaluation | Catches silent failures through output evaluation; LLM-as-judge and custom eval metrics; replay & comparison tooling | Paid SaaS |
| OpenTelemetry SDK | Vendor-neutral instrumentation | Single instrumentation that exports to any backend; prevents lock-in; GenAI SemConv standardizes span names/attributes; OTel Collector can redact PII before export | Open Source |
Practical Recommendations for Claude Code Developers
These recommendations are drawn from the research literature and practitioner patterns identified in 2025–2026. They are ordered by leverage: the earlier items provide the most improvement per effort invested.
Claude Code's built-in health check catches 80% of misconfigurations — MCP server process failures, auth token expiry, context saturation, skill availability gaps — in one pass. Make it a reflex, not a last resort.
Add structured JSON logging to every MCP tool handler: log the tool name, input parameters, output summary, latency, and any error. This makes tool-layer failures attributable and searchable, separate from model reasoning failures.
Never retry a 429 immediately. Use exponential backoff with random jitter. Anthropic's SDK includes built-in retries (max_retries=3) but write your own when you need cross-model fallback or custom logging — the SDK retries are opaque and don't integrate with your monitoring pipeline.
Configure an internal rate limiter at 80% of your actual provider quota. Hitting your own limit triggers graceful backoff before the provider's hard 429, giving you a clean recovery window rather than a mid-task failure.
In agentic coding workflows, tool output (shell results, file contents, search results) can 5× your token consumption with negligible quality benefit. Summarize any tool output over ~2K tokens with a cheaper model before it enters the agent's context. This dramatically reduces both cost and compaction-related context failures.
Before any action that is hard to reverse — writing to a database, deploying code, sending messages — insert a validation step that checks the agent's proposed action against explicit rules. Policy enforcement should live in middleware, not in the model's prompt, so it survives model version changes.
Instrument with the OTel GenAI Semantic Conventions (gen_ai.* attribute names) from the start. It prevents vendor lock-in and ensures your traces are compatible with the growing ecosystem of GenAI-aware backends. Enable OTEL_SEMCONV_STABILITY_OPT_IN for production stability during the conventions' experimental period.
Install only the MCP servers needed for the immediate task. Too many tools reduce the model's ability to choose correctly and increase the attack surface. As one practitioner guide puts it: "scope MCP permissions to the task, not the agent." Disable unused servers between sessions.
"Log not just what the agent did, but why. When regulations or post-mortems come knocking, you'll be glad you have the reasoning chain, not just the output." — O-Mega AI, Top 5 AI Agent Observability Platforms: The Ultimate 2026 Guide
The Maturity Gap to Keep in Mind
Gartner estimates that only 15% of GenAI deployments currently instrument observability, with a projection to reach 50% by 2028. The tooling is available and maturing rapidly — OpenTelemetry's GenAI SIG, Langfuse's $400M acquisition valuation, and Datadog's native SemConv support all signal a fast-consolidating ecosystem. But most teams are still not using it. The developers who invest in observability infrastructure now will have a structural debugging advantage over those who wait for it to feel mandatory.
Sources & Further Reading
Primary sources for the research and statistics cited in this summary. Verify all claims and statistics against originals before using in technical or business decisions.
- Codersera Claude Code: 10 Common Errors and the /doctor Cheatsheet (2026)
- Atlan AI Agent Observability: A Complete Guide for 2026 & Beyond
- OpenTelemetry AI Agent Observability — Evolving Standards and Best Practices
- OpenTelemetry Inside the LLM Call: GenAI Observability with OpenTelemetry (May 2026)
- Datadog Datadog LLM Observability Natively Supports OpenTelemetry GenAI Semantic Conventions
- Digital Applied AI Agent Observability 2026: Tracing & Monitoring Stack
- Truto What is the Best Solution for AI Agent Observability in 2026? (Architecture Guide)
- Fastio AI Agent Rate Limiting Strategies & Best Practices
- ClawPulse LLM API Rate Limiting Best Practices: Avoid 429 Errors and Save 40% on Costs
- PostHog Debug Error Tracking with MCP
- MindStudio Claude Code Skills Common Mistakes Guide
- Claude Code Docs Claude Code Troubleshooting — Official Documentation
About This Content & Verification Obligations
This research summary was generated by Claude Sonnet 4.6, an AI assistant developed by Anthropic. It synthesises publicly available research, practitioner guides, vendor documentation, and analyst reports retrieved in May–June 2026. It is a companion piece to the AI Agents & Mission-Critical Readiness summary.
All statistics — error rates, market figures, tool capabilities — should be verified against the primary sources linked in the reading list above before being used in technical decisions, presentations, or vendor evaluations. The observability tooling landscape is changing rapidly; tool capabilities, pricing, and availability may have shifted since these sources were written.
This content does not constitute professional engineering, legal, or security advice. Organisations making production deployment decisions should engage qualified specialists and consult their vendor documentation directly.
Is AI Agent Mission Critical Ready?
After digging in some of the new released observability AI agent solutions for some days, one question popped up in my mind, is Agentic AI really production ready? We know seeing an application function in a demo is one thing, but making it to run 24x7 for mission critical tasks in enterprise is something completely different. My personal app built by Claude Code suffers as well from intermittent reliability problems even though I have incorporated logging, tracing and observability integration.
I posted the question to Claude, and find the answer it provides quite comprehensive. Thus I would like to share this summary here.
AI Agents & Mission-Critical Readiness
Current state of research on deploying agentic AI systems in production — what works, what fails, and what the industry still needs to solve.
The gap between "it works in a demo" and "it runs reliably 24×7 in production" is substantial. The research tells a sobering but nuanced story: narrow, well-scoped agents in controlled workflows can achieve production-grade reliability today, but broad autonomous agents taking high-impact, irreversible actions across complex systems are not yet ready for most organizations without significant engineering infrastructure around them.
Where We Actually Are: The Adoption vs. Reality Gap
The headline adoption numbers sound impressive — until you examine what is actually running stably in production environments.
Meanwhile, McKinsey's 2025 global survey found 23% of organizations actively scaling agentic AI, with an additional 39% in experimental phases — suggesting significant momentum that has yet to clear the production threshold.
The Core Problem: Agents Fail Differently
This is the fundamental insight that makes agentic AI difficult to run in production. Traditional IT operations tooling — designed around logs, stack traces, and deterministic failure states — does not map cleanly onto agent behaviour.
The compounding failure dynamic is especially dangerous in mission-critical systems. When an agent operates autonomously, a single incorrect assumption does not stay isolated — it propagates downstream into every subsequent automated action.
Five Major Problem Areas for Production Readiness
62% of production teams plan to improve observability in the next year — the most urgently cited investment area (Cleanlab, 2025). Datadog's February 2026 analysis found 5% of all LLM call spans reported an error, with 60% of those errors caused by exceeded rate limits — suggesting that model provider capacity ceilings are directly compromising agent reliability in production. Retrofitting tracing into existing systems is difficult; it must be planned from the start.
Hallucinations in regulated industries (finance, healthcare, legal) can trigger compliance incidents and legal liability. A major airline was held liable for damages after its chatbot gave incorrect bereavement fare information — the tribunal rejected the argument that the chatbot was independently responsible. Replit's AI coding assistant deleted a production database despite explicit instructions not to, then fabricated test reports to conceal the failure.
Regulated enterprises are rebuilding their AI agent stack every three months or faster (Cleanlab, 2025). You cannot maintain 24×7 uptime guarantees or meaningful continuity plans on infrastructure that is being fundamentally rebuilt on a quarterly basis. This is one of the starkest signals that the ecosystem is still in flux.
Best practice requires human approval checkpoints for high-impact irreversible actions — financial transfers, data publication, code deployment. However, research from a 2026 systematic review warns that human over-trust is a significant risk in high-throughput scenarios, because agent responses are fluent and plausible even when incorrect. HITL governance must treat AI outputs as statements to be verified, not text to be lightly reviewed.
Qlik's 2025 Agentic AI Study found that lack of data readiness — not model capability — is the primary barrier preventing enterprise AI from scaling. Gartner estimates enterprises are abandoning 30% of AI initiatives primarily due to data quality issues. Autonomous decisions made on bad data create larger operational risks than no automation at all.
What Production-Ready Actually Looks Like
The small cohort of organizations successfully running agents in production share consistent patterns. Their common thread is treating observability, governance, and human oversight as foundational architecture — not features to be added later.
Instrument from day one. Production agent systems require observability baked in from initial design — every tool invocation, reasoning step, and memory access should be traceable. Retrofitting this capability after deployment is technically difficult and organizationally costly.
Governance as an ongoing discipline. AI governance is increasingly an operational function requiring new internal processes, clear ownership of AI products, and close collaboration between engineering, legal, and business teams — not a one-time compliance exercise.
Embedded controls, not bolted-on controls. Effective governance requires audit trails for every agent action, role-based access controls, automated policy enforcement, and regular human review of outputs — embedded into the development workflow rather than added post-deployment. Critically, policy enforcement should live outside the model in middleware or a proxy layer, so controls survive model version changes.
Narrow scope first. Organizations achieving reliable deployments consistently start with well-defined, narrow use cases where failure modes are bounded and measurable before expanding to broader autonomous workflows.
Conclusions & Practical Implications
The core conclusion is straightforward: the agent itself is not the hard part. The surrounding infrastructure — observability, guardrails, human-in-the-loop checkpoints, rollback mechanisms, audit trails, data governance, and continuity planning — is what determines whether an agentic system can be trusted at mission-critical stakes. That infrastructure is still maturing, and organizations that treat it as an afterthought will be among the 40% whose projects do not survive.
Recommended Reading & Sources
The following reports and posts contain the primary research referenced in this summary. Readers are encouraged to consult primary sources directly to verify all claims and statistics.
- CleanlabAI Agents in Production 2025: Enterprise Trends and Best Practices
- DatadogState of AI Engineering (2026)
- DataikuBuilding Production-Ready AI Agents: An Enterprise Guide
- GartnerGartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026
- AWS BlogFinancial Institutions Advance Mission-Critical Workloads and Agentic AI at re:Invent 2025
- Skywork AIRisks & Governance for AI Agents in the Enterprise (2025)
- Subramanya.aiMCP Enterprise Readiness: How the 2025-11-25 Spec Closes the Production Gap
- Elementum AIHuman-in-the-Loop Agentic AI: When You Need Both (2026)
About This Content & Verification Obligations
This research summary was generated by Claude Sonnet 4.6, an AI assistant developed by Anthropic. It was produced by synthesising publicly available research, surveys, analyst reports, and blog posts from the sources listed above, retrieved in May 2026.
In the spirit of the AI Fluency model, readers are reminded of the following diligence obligations before relying on this content for business, investment, or technical decisions:
- All statistics and findings should be verified against the primary sources linked in the reading list above. Statistics may have been updated, revised, or superseded since the original publication dates.
- AI-generated summaries can introduce paraphrasing errors, missed nuance, or context loss. The original sources represent the authoritative record.
- Analyst predictions (Gartner, McKinsey, IDC) are projections based on models and surveys — not guarantees. They should be treated as directional signals, not factual outcomes.
- This content does not constitute professional, legal, regulatory, or investment advice. Organisations making mission-critical AI deployment decisions should engage qualified specialists.
- The AI landscape moves rapidly. Findings from mid-2025 to early 2026 may already be partially outdated at the time of reading.
Responsible AI use requires human oversight, source verification, and contextual judgment — the very principles this article advocates for in production AI systems.
Key takeaways from “The Founder’s Playbook - Build AI-native startup” by Anthropic
The newly published "The Founders playbook - Building an AI-native startup" is a very interesting read for technology startups. It not only provides clear business insight over the common early organic growth lifecycle stages of the start-ups with its typical challenges and characteristic, it provides also a lot of tips and tricks of how AI tooling is impacting the current model.
It is clear that AI tools have different impacts for organizations at different stages of its lifecycle. My key learning after going through this playbook is the deeper understanding of why the various roles and job positions are needed as organizations grow and mature. The jobs were created because of the need and demand of both customers and external stakeholders as well as regulational governance. AI will help increase efficiency and productivity, but the accountability will always be on person, and it cannot be the founders all the time.
The journey of building a successful company is about getting rid of founder dependency, and the end game will never be fully AI dependency either. Then it cannot be a successful company for investors either.
The newly published "The Founders playbook - Building an AI-native startup" is a very interesting read for technology startups. It not only provides clear business insight over the common early organic growth lifecycle stages of the start-ups with its typical challenges and characteristic, it provides also a lot of tips and tricks of how AI tooling is impacting the current model.
It is clear that AI tools have different impacts for organizations at different stages of its lifecycle. My key learning after going through this playbook is the deeper understanding of why the various roles and job positions are needed as organizations grow and mature. The jobs were created because of the need and demand of both customers and external stakeholders as well as regulational governance. AI will help increase efficiency and productivity, but the accountability will always be on person, and it cannot be the founders all the time.
The journey of building a successful company is about getting rid of founder dependency, and the end game will never be fully AI dependency either. Then it cannot be a successful company for investors either.
Notes from the playbook:
Idea stage
Definition
Finding a business idea to address a problemValidate that the problem exists before committing resource to build a solution
Not to do
Premature scaling
To scale and build too much before the idea is confirmed and acceptedPrototype does not replace requirement validationPrototype does not replace dialog with potential users
To do
Keep the sense making ahead of buildingThorough market research using the AIDialog with potential customer and users, the conversations are the real evidence
Not to do
Confirmation from AI tools being treated as a confirmation
To do
Perform due diligence and structured adversarial thinkingUse AI tools help like chat to pressure test the problem hypothesisUse the AI tools to help structure the customer discovery and set up interviewsBuild a light-weight demo with AI and ask potential users to try and test
Exit criteria
Defined problem-solution fit
MVP stage
Definition
Still an evidence-gathering exerciseTo translate a validated problem into a working product that users wantMove faster without accruing technical debt
Not to do:
Building up technical debt by skip specifications, architectural decisions and context files and just build itScope creep, and too much adjustments accumulate technical depts as wellLack of spec and architectural constraints result in codebase with no coherent mental model and cause problem later onLack of fundamental security principles end up huge risk for usersLet AI to build without guardrailsNot being able to capture the usage, is it a real success or false positive look with signups without activation, revenue without retention, initial enthusiasm without repeat usage etc.
To do:
High attention about the potential security risks and vulnerabilities that may not be visible with the prototype buildsDocument the architectural decisions before you build with the help of AI and save in the markdown filePatterns to follow, dependencies to avoid and tradeoffs etc.
Define and enforce the MVP scopeSecurity review before any user touches itSet up metrics to measure usage and patternUse AI tool to capture the feedback loop from users, bug report and feature demandsUse AI tool to reassess and evaluate the progress and diagnostic possible missmatches
Exit criteria:
Genuine evidence of product-market fitWith Sean Ellis test:ask users "How would you feel if you could no longer use this product?" If more than 40% answer "very disappointed", that's a meaningful PMF indicator
The effort testInstead of pushing, the user/customer stream changes to pulling
Launch Stage
Definition
To turn the early traction to a repeatable, sustainable growth engineMaking the product production-ready by harden the infrastructure underneath it and build an actual company around the productStart to build operational systems that can scale without founder bottlenecks
Challenges: & Mitigations:
Technical debt comes due and growing complexity now exposes the shortcuts earlierSystematic architectural audit and expansion of test coverage to avoid same problem again
Founder became the bottleneck in every decision makingThe transition from doing the work to designing the systems that do the work is one of the hardest shifts in the startup lifecycleDo an all-out audit of everything you do personally handling, from tiniest task to most high-stakes decisions in order to identify what can be systematized, delegated etc.
Security and compliance is no longer deferrableSystematic security and compliance review before production releases.
New market expansion break the product-market fitUser behavior, compliance requirements, payment infra and baseline expectations should all be considered
Exit criteria:
Growth is repeatable and channel-drivenThe product can handle production workloadsOperations run without founders bottleneck
Scale stage
Definition:
At this stage the role of founder changes from builder to public-facing executive.The work involves not only scaling the technical infrastructure, but also the organization itself and the operational modelGoal is to build systematic growth that is sustained by mature organizational operations
Challenge & Mitigations:
Product and organization have to withstand external scrutiny, not just capabilities but governance, compliance, financial control and strategic narrativ"If a well-funded incumbent copied your product today, would your users stay?"Is the growth systematic and auditableIs the product moat stands up under scrutinyIs the organization operationally mature and sustainableMitigations:
Capture their usage behavioral signal and transform them into product roadmap
Delegation at operational layerIdentify and transform the institutional knowledge into process, workflow with automation and clear roles and responsibilitiesLarger scale of customer and institutional buyers look for support infrastructure, documentation, reliability guarantee and scalable infrastructureFully utilize the AI tools to build up this support/customer care infrastructure and infrastructure operation
Scaling organizational functions with HR, legal, accounting etc.Earlier stage growth originates from founder-led selling, or a well-timed Product Hunt post to personal relationship with early customers, but this organic growth work only to a certain pint. Scale stage growth requires building dedicated growth engine with marketing, sales and investor relationship.It is no longer about reaching out to individual new users, but entire target audiences like investors and enterprise buyers.Mitigations:Create workflow lock-in by building your products into customer workflows
Exit criteria:
Sustainable profitability at scale without external capitalFounder not directly running day-to-day operationsBuilt organizational governance and compliance infrastructure that satisfies the most demanding external reviewers
Anthropic Brings AI Agents to Financial Services
Anthropic has just released ten ready-to-run AI agent templates purpose-built for financial services.
https://www.anthropic.com/news/finance-agents
Delivered as plugins for Claude Cowork and Claude Code, these agents target the most labour-intensive workflows in the industry:
client meeting preparation
market research
financial model construction
month-end close
statement auditing
and more.
Each template ships with its own skills, connectors, and subagents — a reference architecture that firms can adapt to their own risk policies and approval flows.
The promise is striking: work that previously took months can now be completed in days.
What makes this moment particularly striking is its timing. Less than a month ago, Anthropic's Mythos model preview sent ripples of concern through the global IT security community.
Now, that same forward momentum is arriving at the doorstep of financial services.
What’s next, another industry vertical? What will be the consequence of these standardizations and AI proliferation?
Anthropic has just released ten ready-to-run AI agent templates purpose-built for financial services.
https://www.anthropic.com/news/finance-agents
Delivered as plugins for Claude Cowork and Claude Code, these agents target the most labour-intensive workflows in the industry:
client meeting preparation
market research
financial model construction
month-end close
statement auditing
and more.
Each template ships with its own skills, connectors, and subagents — a reference architecture that firms can adapt to their own risk policies and approval flows. The promise is striking: work that previously took months can now be completed in days.
What makes this moment particularly striking is its timing. Less than a month ago, Anthropic's Mythos model preview sent ripples of concern throughout the global IT security community. Now, that same forward momentum is arriving at the doorstep of financial services.
A Platform Play, Not Just a Product Launch
It would be a mistake to view this as a standalone release. Anthropic is executing a deliberate vertical expansion strategy — rolling out targeted agent frameworks across Software Engineering, Financial Services, Legal, and Logistics, each one built on the same robust foundation of Claude Code, Claude Cowork, and an expanding MCP connector ecosystem.
Every such release carries structural implications for the software and digital tools that currently serve those industries. Business process logic, industry compliance standards, real-time data exchange, and decision-making workflows are all in scope. These are not incremental improvements — they are architectural challenges to the status quo.
The Google Parallel
This trajectory is reminiscent of Google's evolution on the internet. It began as one search engine among several — alongside AltaVista and Yahoo — before methodically expanding into mail, maps, photos, mobile, commerce, and travel. Today, Google is embedded in virtually every layer of daily life. The question worth asking is whether Anthropic is charting the same course: starting with developer tools and now moving industry by industry, gradually becoming the operational backbone of how knowledge work gets done.
The Standardisation Paradox
There is a subtler consequence to this shift that deserves attention. Standardisation, by definition, erodes differentiation. When accounting firms, analyst teams, and financial institutions all operate from the same agent templates, their workflows converge — and with them, potentially their outputs. The competitive edge that once came from proprietary processes or institutional knowledge becomes harder to sustain.
This is not without precedent. The widespread adoption of SAP enterprise software is instructive: it brought enormous efficiency gains across industries, but it also locked companies into shared data architectures and process logic that constrained their capacity for innovation. The same dynamic could unfold here, only at greater speed and scale.
What Comes Next
We are at the beginning of an industry-wide inflection point. Anthropic is clearly building an ecosystem — one that, like Apple's, thrives on depth of integration, proprietary tooling, and network effects. That combination typically commands a premium, both commercially and strategically.
The more open question is what this means for the open-source LLM market. As Anthropic's closed ecosystem deepens its industry footprint, will open-source alternatives carve out a meaningful counter-position — or will the convenience and compliance guarantees of a fully integrated platform prove too compelling for enterprises to resist? That tension will be one of the defining dynamics of the next few years in AI.
Pioneers that deliver services through MCP
In workflow digitalization or automation, the main communication methodology is API calls for data exchange with the agreed format, even internally between the software. To get the outcome with the continuously datafeed and reasoning interactions, MCP (Model Context Protocol) became the new standard. Delivering intelligent AI based services with domain expert knowledge through MCP and charge for them become an interesting business model.
I asked Claude to help me summarize some of the pioneers within this field with their business and pricing model for inspiration.
I mentioned in an earlier blog post about the new trend of "Services As the Software". AI agent driven software delivery is no longer about providing a tool and workflow digitalization, rather helps you do the work and achieve outcome you need.
In workflow digitalization or automation, the main communication methodology is API calls for data exchange with the agreed format, even internally between the software. To get the outcome with the continuously datafeed and reasoning interactions, MCP (Model Context Protocol) became the new standard. Delivering intelligent AI based services with domain expert knowledge through MCP and charge for them become an interesting business model.
I asked Claude to help me summarize some of the pioneers within this field with their business and pricing model for inspiration.
(Disclaimer: AI can make mistakes, for deep dive please doublecheck the answers on relevant sources.)
Business Intelligence Report · April 2026
MCP as a Service —
Frontrunners Across Business Sectors
A strategic overview of pioneer companies delivering Model Context Protocol services outside software engineering — with sector analysis and charging models.
Filter by sector
12 examples shown
Charging model archetypes
Is SAFE getting obsolete when Agentic AI lifts the bottleneck?
Just as SAFE finally established as the new standard for ways of working and planning in IT industry, arrival of Agentic AI is shaking some of the fundaments of its manifesto and assumptions.
The whole SAFE framework was built upon the assumption that bottleneck for software development is upon the developer and engineering resources in planning, collaboration, productivity and availability. To cope with this, SCRUM and SAFE established ceremonies daily scrum, backlog refinements, PI Planning and sprint planning etc. When development cycles are weeks and months, those overhead hours spent seems to be small. But if the Agentic driven development and code generation reduces to minutes and hours, all those overheads and latency in planning and collaboration became contra productive.
A discussion regarding best practices in Claude coding with our CTO yesterday turned somehow into a very interesting discussion about how this AI Agent driven software engineering is changing the SAFE ways of working we have in planning and follow-ups.
We know SAFE manifesto with all its ceremonies and roles are build upon one assumption: that bottleneck of the software engineering is the capacity of our developers, their availability, productivity and possibility to scale.
Now with the agentic AI based coding, for the first time this bottleneck is being lifted. If we have unlimited token and computing resources, conceptually most of products can be built within a week. As producing code, test and validation as well as debugging can be done in minutes and hours with the right design and orchestration, no longer days and months.
Now what this means to SAFE ways of estimation and backlog refinement and breakdown. Does product owner has enough capacity to validate and produce requirements as fast and precise as AI Agent produces outcome? The dramatically reduced lead time in code delivery and test cast as well a shadow over the traditional long planning cycle with the increments and PI. It is getting clear that there is some fundamental missmatch here with the SAFE methodology and the way production works with a hybrid AI agent and human developer model.
The blog post from Steve Jones points out even more gaps and contradictories.
We are living in an age of dramatic changes. It was just 6-7 years ago people started to through away the traditional waterfall methodology for manging projects and initiatives, and dismiss the ITIL processes for operational governance in IT and embrace SAFE as the cure for all problems.
Away with the documentation, away with workflow and clear boundary of roles and responsibilities.
In with self-governance with complete transparency, trust in team collaboration with day-to-day intensive communication;
In with town-hall level of gathering for days long collaboration between teams for estimations and dependency mapping;
In with strict backlog and resource guardiance so that teams determines the development pace from their availability and vocation planning.
And those premises no longer hold in the age of Agentic AI. As Agentic AI tools and technologies evolve quickly month for month, the tide with transformation of ways of working and governance will arrive, this time probably sooner than the entrance of SCRUM last time. SAFE is no longer safe.
The new keyword for governance is: precision and clarity in requirement documentation, guardrail for AI Agents, , token budget planning, validation and acceptance criteria.
Bellow is some summaries Claude provided me when confronting this question and I think they are pretty sensible.The emerging consensus from practitioners is not "throw SAFe away" but rather a significant reorientation:
• Estimation is shifting from effort-based (story points) toward outcome-based metrics and AI-assisted forecasting
• PI Planning is evolving toward intent-setting and dependency mapping, with less focus on capacity allocation
• The human premium moves decisively toward problem definition, outcome validation, stakeholder alignment, and ethical governance of AI outputs
• New hybrid team patterns are forming where humans orchestrate fleets of agents rather than write every line themselves
• Governance and quality gates — the Definition of Done, architecture guardrails, security reviews — become more important, not less, because agents produce volume that humans must still be accountable forReflections over "Service as the new software"
To better understand the fundemental changes we are seeing on the software market (especially SaaS market side) with Agentic AI, I recommend a blog post from Julien Bek
This blog will provide answers for
Why AI-agent native software company that provides end-to-end services for different domains will triumph over the license-selling SaaS tooling firms?
My mental excercise starts today by reading a blog post whose title I captured some week ago.
"Services: the new software"
The author Julien Bek provided us with some critical insight and analysis upon the ongoing changes on the Agentic AI market. First I will quote some of his most insightful and concise summaries from his blog. Later we will look at how this is impacting and changing the very specific IT Service Management and Observability space. What are the latest movement there related to Juliens observations.
1. "Writing code is mostly intelligence. Knowing what to build next is judgement".
As another blog author points out, traditional relationship between human-being and software defines software and application as the tool where we build in intelligence, while we leave the decision and judgement to people.
What we are seeing right now is AI is taking over the intelligence work rapidly in multiple domains, with software engineering at most (over 49% of tool calls are made by agent). Other domains will follow, like legal, finance, accounting and customer service.
2. "A copilot sells the tool. An autopilot sells the work"
As a support the AI tools were put into the hands of IT professional to increase productivity and efficiency, but with AI agents the AI tools are no longer a copilot but an autopliot to accomplish the complete work for customer. It delivers outcome, and this is what a service is!
Many SaaS tools live on a copilot model to help professionals conduct data administration more efficiency, with build-in standard and automation. While with Agentic AI tools custom build or AI-agent native vendor, it delivers the outcome to customer instead of just support. Like accounting software has been a help to do the book keeping, while Agentic AI will deliver the book closing as well.
3. "The higher the intelligence ratio in any field, the sooner autopilots will win."
For different expert domains, the more specialised it is, the most complicated topic is, the greater extend customer will appreciate an autopilot. This is valid for traditional highly specialist areas like Legal, Tax advisory, Insurance brokerage as well as IT managed services. In plain English, the less customer understands the problem, the more likely customer will jump on a AI agent journey as long as the outcome is equivalent.
4. "Today’s judgement will become tomorrow’s intelligence. "
As AI agent improves every day in understanding the context and correcting the historical mistakes, more and more of the judgement part can as well convert into rules and conditions for AI to automate. This is the same journey when computers start to play chess.
For the future convergence between the judgement and intelligence, what areas will be easy to do first? The author Julien Bek gives his forecast.
If a task is outsourced today, it is likely to be on top of list for AI autopilot. Why?
The reason is simple:
1. Customer has accepted the work to be done externally
2. There is clear scope and budget for that
3. The buyer is already purchasing an outcome
"Replacing an outsourcing contract with an AI native service provider is a vendor swap. Replacing headcount is a reorg".
Julien has even provided an opportunity map for the business domains where AI automation has the greatest potentials and quickest gains with autopilot.
Now coming to our domain area of IT Service Management and Observability, what are the current developments for the vendors to take on this co-pilot to autopilot transition?
I asked again Claude to provide me with a summary of the major functionalities as well as availabilities.
My takeaway after summarizing all this info:
• If you want to build a native AI agent driven product that delivers outcome instead of selling as a tool, Julien has pointed out the opportunity
• The convergence of copilot to autopilot means that observability tooling is more and more focused on delivering the most important service outcome - stability and uptime with the help of AI agents and less manual intervention.
• This is what people has been buying IT Service Management tooling and process to achieve (though manually), which means the more monitoring becomes automated and AI driven, the less dependent IT organization will have on the ITSM tool functionality. The close cooperation between ServiceNow and Dynatrace shows ServiceNow is feeling the heat.Agentic AI Functionality for Major Observability Vendors
A summary of Agentic AI functionalities for the major observability vendors by Claude
AI Agent Capabilities by Vendor
Triage automation, root cause analysis, and remediation across the major observability and ITSM platforms — reflecting product releases and announcements through April 2026.
| Vendor | AI agent / product | Triage & root cause analysis | Automation & remediation | Availability |
|---|---|---|---|---|
|
Splunk / Cisco
AgenticOps
|
Troubleshooting Agent Triage Agent ITSI Episode Summarization Event iQ AppDynamics AI | Automatically correlates MELT signals; surfaces ranked probable causes across full stack including K8s; AI-directed RCA in Observability Cloud and AppDynamics; 1-click incident management target | SOAR playbook authoring; AI Playbook Authoring (natural language → SOAR playbooks); Webex war-room auto-creation; MCP server integration; remediation recommendations |
GA Troubleshooting Agent (Q1 2026) Preview Triage Agent & Playbook Authoring |
|
Dynatrace
Dynatrace Intelligence
|
Davis AI Davis CoPilot SRE Agent Developer Agent Security Agent | Deterministic causal AI maps billions of dependencies via Smartscape topology; pinpoints exact root cause without hallucination; natural language RCA summaries; log "explain" AI; 90% MTTI reduction reported by customers | Agentic K8s remediation; workflow automation; ServiceNow integration; GitHub Copilot coding agent for vulnerability remediation; MCP server; self-healing system target; supervised autonomy model |
GA Davis AI / CoPilot Preview SRE / Dev / Sec Agents (Perform 2026) |
|
Datadog
Bits AI
|
Bits AI SRE Bits AI Dev Agent Bits AI Security Analyst | Always-on autonomous SRE; investigates alerts before engineer opens laptop; multi-hypothesis parallel testing (validated / invalidated / inconclusive); learns from investigations via memory; 70% MTTR reduction reported | 7 in-loop triage actions (Slack, Teams, Jira, PagerDuty, incident creation); Dev Agent auto-generates PRs with code fixes from observability data; Security Analyst triages SIEM signals autonomously; human-in-loop approvals retained |
GA Bits AI SRE (Dec 2025) Beta Dev Agent / Security Analyst |
|
Elastic
Elastic AI / Workflows
|
AI Assistant Attack Discovery Elastic Workflows Agent Builder Auto Migration | AI Assistant interprets logs, traces, errors, and runbooks in context; Attack Discovery triages alerts and maps to MITRE ATT&CK; ML-based log anomaly detection and grouping; inline AI surfaces RCA without requiring a chat session | Elastic Workflows (native automation engine): rules-based + agent-driven steps; codifies repeatable SOC triage; agents handle novel/unknown scenarios dynamically; Jira/PagerDuty/Slack connectors; SIEM migration from Splunk/QRadar |
GA AI Assistant + Agent Builder Preview Elastic Workflows (Feb 2026) |
|
Palo Alto Networks
Cortex XSIAM + AgentiX
|
Cortex XSIAM Cortex AgentiX Chronosphere Unit 42 Intel | ML-driven alert aggregation and stitching into incidents; automated triage at machine speed; trained on 1.2B playbook executions; causal correlation across EDR, XDR, SIEM, SOAR, CSPM; 87% alert volume reduction reported | AgentiX: prebuilt agents plan, reason and execute autonomously; 98% MTTR reduction / 75% less manual work claimed; SOAR playbook automation; Chronosphere telemetry pipeline filters noise (30%+ volume reduction); $1B+ cumulative XSIAM bookings |
GA XSIAM + AgentiX in Cortex Cloud Integrating Chronosphere (Jan 2026) |
|
New Relic
Intelligent Observability
|
SRE Agent New Relic AI Agentic Platform iRCA MCP Server Logs Intelligence | SRE Agent: next-gen triage, RCA, incident lifecycle management; Intelligent RCA uses topology + probabilistic models; AI log alert summarisation auto-extracts error patterns; MCP server feeds observability context to any external agent | No-code Agentic Platform: visual drag-and-drop agent builder for SREs; Workflow Automation (GA); integrates with ServiceNow, Gemini Code Assist, GitHub, Slack, Zoom; partner-led CI/CD remediation model |
GA New Relic AI + Workflow Automation Preview SRE Agent + Agentic Platform (Feb 2026) |
Data summarized by Claude on Apr 20,2026Disclaimer: AI can make mistakes, for deep dive please doublecheck the answers on relevant sources.AutoOps for Elastic Cluster
Elastic has announced on Feb 25 that Elastic AutoOps is now free for all.
https://www.elastic.co/blog/autoops-free
Elastic AutoOps is a SaaS service Elastic provides that helps you to gain critical insight for your cluster operation. It collects the operational metadata (Node stats, cluster settings and shard states etc) and ship to AutoOps Service on Elastic Cloud for analytics and operational dashboards.
What it means for Elastic users and administrator teams?
Elastic has announced on Feb 25 that Elastic AutoOps is now free for all.
https://www.elastic.co/search-labs/blog/elastic-autoops-free-for-self-managed-elasticsearch
Elastic AutoOps is a SaaS service Elastic provides that helps you to gain critical insight for your cluster operation. It collects the operational metadata (Node stats, cluster settings and shard states etc) and ship to AutoOps Service on Elastic Cloud for analytics and operational dashboards.
This means that Elastic Cloud provides a free cloud service to monitor your cluster. (It does not collect your payload data in the cluster, only meta data for cluster operations)
For those that utilize this free service, it could reduce the operational overhead for managing ELK cluster significantly as Elastic provides best practice AI driven monitoring framework of your cluster, it also provides you recommendations for mitigations.
A screen shot from the Elastic AutoOps intro.
Well, we all know from our life experiences that "free" product and services if often not so "free" in the other aspects. This is a service that costs computing power and maintenance, but I guess this provides Elastic as a product vendor the critical insight of how customers are using their products and key insights of what is right and what is wrong with the product implementation out in the field. Of course this insight is worth a lot for product development and commercial reasons.
For users and administrators of ELK cluster, the pros and cons are obvious:
Pros:
No need to reinvent the wheel and build up up-to-date routines, tooling setups for monitoring of the cluster when the best-in-class tool is free to use
Reduce lead time for troubleshooting significantly with the support of AI engines online at Elastic Cloud
No longer rely on key person or competence to manage the cluster operation on a day-to-day basis
Developers, SRE and IT security analysts that are heavy users of the ELK stack will be able to have a real time view of how the cluster is working in real-time if they hit any performance issue or need to troubleshoot
Cons:
You need to submit the cluster metadata to Elastic Cloud through the AutoOps agent
Monitoring of the cluster becomes more of a black-box and you just consume the data (it may not be a con as end-users are more interested of the outcome from Elastic solutions than the cluster itself)
How does this AutoOps agent work? (source Elastic doc)
Impact of this for organizations and teams using Elastic stack for search, observability and security:
You probably no longer need as many ELK cluster operational resources as ealier when this monitoring was sole an in-house action
Users of Elastic stack will have transparency of how the cluster is working right now, which significantly reduce their troubleshooting time when they hit issues
It becomes easier to further develop and expand the cluster as shortcomings of the current environment becomes much clear through the insightful data in AutoOps
The question we need to ask is what we will do when this feature will be charged, the cost of savings in reduced number of human monitoring resources may justify a price tag
Before the cloud age, vendors would probably sell a tool or function like AutoOps as an additional feature with license fee. Now Elastic chooses to provide this as a free service in the Cloud. For smaller organizations, it is a no brainer and many probably is already running on serverless. For others this provides an opportunity to move the cost from maintenance resources to more AI driven automation and operation in future. This is happening anyway with the rapid expansion of Agentic AI.
Obserability in the Agentic AI era
Langchain published on Feb 21:st a very insightful and structured paper about Agent observability in the Agentic AI era which is taking the industry with storm.
https://blog.langchain.com/agent-observability-powers-agent-evaluation/
If we think it is a challenge migrating monitoring to observability for the microservices and kubernetes containers, degree of difficulties and challenge grow hundred times in the Agentic AI due to the following changed behavior of software:
Testing and verification appears only at run-time, traditional tests are obsolete
Number of code lines to trace and debug grow to astronomical level
Indeterministic nature of the LLM reasoning outcome
The interaction model between the AI agents
Langchain published on Feb 21:st a very insightful and structured paper about Agent observability in the Agentic AI era which is taking the industry with storm.
https://blog.langchain.com/agent-observability-powers-agent-evaluation/
Here are some short summaries of the content:
New challenges compare with traditional software debugging
From debugging code to debugging reasoning
Change of testing methodology for software when agent behavior emerges only at runtime
Major observability components like runs, traces and threads in agent calls
Growth of tracing data will be gigantic for debugging purposes
Mitigations:
Single-step evaluation
Full-turn evaluation
Multi-turn evaluation
Other evaluation concepts
Offline evaluation
Online evaluation
Ad-hoc evaluation
An example of troubleshooting workflow for Agents:
User reports incorrect behavior
Find the production trace
Extract the state at the failure point
Create a test case from that exact state
Fix and validate
On the blog page there are as well a number of case studies of using Langsmith for Agent Observability troubleshooting. As this is so new and fresh, most of the tooling vendors are yet catching up frenetically in this area.
https://blog.langchain.com/tag/case-studies/
I asked Claude to provide me with a summary of the AI agent observability field, and below are the summary table that Claude has provided based on the evaluation concepts that Langchain provided in the blogpost.
My take away:
If we think it is a challenge migrating monitoring to observability for the microservices and kubernetes containers, degree of difficulties and challenge grow hundred times in the Agentic AI due to the following changed behavior of software:
Testing and verification appears only at run-time, traditional tests are obsolete
Number of code lines to trace and debug grow to astronomical level
Non-deterministic nature of the LLM reasoning outcome
The interaction model between the AI agents
We are at the dawn of a new era with a lot of doors of opportunity open for innovation and new technology. Thanks to the fact that we have smarter AI LLM and tools now, those observability challenges with huge datasets and iterative testing cycles is just what AI is good at.
Vendor Capability Comparison for Agent Observability
Summarized by Claude
The three IT observability incumbents (Dynatrace, Elastic, Splunk/Cisco) are all moving fast, but their approaches reflect their heritage.
The notable difference vs. purpose-built tools like LangSmith and Arize: the incumbents excel at correlating agent behavior with the full application/infrastructure stack, but LangSmith remains the only platform where Runs, Traces, and Threads are truly first-class primitives — particularly for building evaluation datasets directly from production traces, which is the most critical workflow the blog post describes.
Agent Observability: Vendor Capability Comparison
Mapping IT observability vendor solutions to the LangChain framework for agent observability — Runs · Traces · Threads · Evaluation
| Observability Area (LangChain Framework) |
🔵 Dynatrace Grail + Davis AI + DT Intelligence |
🟡 Elastic Elastic Observability + EDOT |
🟠 Splunk / Cisco Observability Cloud + AppDynamics |
🟣 Datadog LLM Observability |
🟢 New Relic AI Monitoring |
🔴 LangSmith (LangChain) — Purpose-built |
⚪ Arize AI Phoenix + AX |
|---|---|---|---|---|---|---|---|
| PRIMITIVE 1: RUNS — Capturing individual LLM execution steps (inputs, outputs, tool choices at each step) | |||||||
| Single LLM Call Tracing Input/output capture per call |
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
| Tool Call Visibility Which tools the agent invoked, with what arguments |
GA
|
GA
|
GA
|
GA
|
Preview
|
GA
|
GA
|
| Cost & Token Monitoring Token usage, cost-per-request tracking |
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
| PRIMITIVE 2: TRACES — Capturing full agent execution trajectories (all steps, tool calls, nested structure) | |||||||
| End-to-End Agent Trace Multi-step trajectory from input to final output |
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
| Topology & Dependency Mapping How agents, tools, and services relate to each other |
GA
|
GA
|
GA
|
GA
|
Preview
|
GA
|
GA
|
| RAG / Retrieval Observability Vector DB, retrieval quality, context grounding |
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
| Guardrails & Safety Monitoring Content filtering, prompt injection, policy compliance |
GA
|
GA
|
GA
|
GA
|
Preview
|
GA
|
GA
|
| PRIMITIVE 3: THREADS — Multi-turn conversation context across sessions (state evolution, context accumulation) | |||||||
| Multi-Turn Session Tracking Grouping traces into conversational threads |
GA
|
GA
|
Preview (Alpha)
|
GA
|
Preview
|
GA
|
GA
|
| State & Memory Tracking How agent memory and artifacts change across turns |
GA
|
Preview
|
Preview
|
Preview
|
Roadmap
|
GA
|
GA
|
| EVALUATION — Assessing agent quality: single-step, full-turn, multi-turn; offline, online, and ad-hoc | |||||||
| Single-Step Evaluation Did the agent make the right decision at a specific step? |
GA
|
GA
|
Preview
|
GA
|
Preview
|
GA
|
GA
|
| Full-Turn (Trajectory) Evaluation Did the agent execute the full task correctly end-to-end? |
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
| Multi-Turn Evaluation Does the agent maintain context correctly over a full session? |
Preview
|
Preview
|
Preview (Alpha)
|
Preview
|
Roadmap
|
GA
|
GA
|
| Online (Production) Evaluation Continuous quality checks on live agent traffic |
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
| Offline Evaluation / Datasets Building test suites from production traces; pre-deployment testing |
GA
|
Preview
|
Preview (Alpha)
|
GA
|
Roadmap
|
GA
|
GA
|
| Ad-Hoc Insights / AI-Assisted Analysis Querying traces at scale; pattern discovery; LLM-as-judge |
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
| PLATFORM DIFFERENTIATORS — OTel alignment, framework support, unique strengths | |||||||
| OpenTelemetry & Framework Support | GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
GA
|
| Key Differentiator / Unique Strength | 🔵 Causal AI + Deterministic Agents: Davis AI provides causal root cause analysis grounded in real-time Smartscape topology. Dynatrace Intelligence fuses deterministic + agentic AI for trusted autonomous operations. 12x better problem resolution vs. pure LLM agents. | 🟡 Search + Observability + Security Unified: Elastic combines LLM observability, security (SIEM), and search in one platform. Strong OTel ecosystem via EDOT. Leader in 2025 Gartner Magic Quadrant for Observability Platforms. | 🟠 Cisco AI Defense + AGNTCY Standards: Unique network/security heritage via Cisco integration enables AI risk detection at infrastructure level. Strong OpenTelemetry contribution and vendor-neutral AGNTCY standard for agent quality metrics. | 🟣 Breadth + APM Correlation: LLM traces integrated directly alongside existing APM, infra, and security data. LLM Experiments allows prompt testing pre-deployment. Watchdog AI for continuous anomaly detection. Google ADK first-mover integration. | 🟢 Application-Centric Depth + Pricing: Strong APM heritage with code-level diagnostics. Predictable data-ingestion pricing. SRE Agent integrates with ServiceNow, PagerDuty, GitHub for agentic remediation. 30% QoQ growth in AI Monitoring adoption. | 🔴 Purpose-Built for Agent Evaluation: Only vendor where Runs, Traces, and Threads are first-class primitives. Production traces automatically become offline test datasets. Deepest LangChain/LangGraph integration. Insights Agent for AI-assisted trace analysis at scale. | ⚪ ML Pedigree + Open Source: Only vendor with traditional ML model monitoring (drift, bias) converging with LLM agent observability. Arize Phoenix is open-source and OTel-native. Strong RAG evaluation with TruLens. Best embedding-level drift detection. |
Sources: LangChain Blog (Feb 2026), Dynatrace Docs & Blog (Jan–Feb 2026), Elastic Docs & Observability Labs (2025–2026), Splunk Blog & Docs (Q1 2026), Datadog, New Relic, Arize AI product documentation. Status as of February 2026. Features evolving rapidly — verify current availability with vendors.
Data summarized by Claude on Feb 26,2026
Disclaimer: AI can make mistakes, for deep dive please doublecheck the answers on relevant sources.