Iris, a newly released open-source project published under the MIT license, is positioning itself as the first evaluation and observability tool built natively on the Model Context Protocol (MCP). Rather than instrumenting agents through SDK decorators or OpenTelemetry sidecars — the two dominant paradigms in the crowded LLMOps market — Iris operates as an MCP server itself. This means any MCP-compatible agent framework can automatically discover and invoke its observability capabilities through the same protocol used to call any other tool, with no custom integration code required. The project, available on npm as @iris-eval/mcp-server and via Docker, exposes three core MCP tools: log_trace for recording execution spans, tool calls, latency, and token costs; evaluate_output for quality assessment across completeness, relevance, safety, and custom rule dimensions; and get_traces for querying stored trace history with filtering and pagination. Trace data persists in a local SQLite database, with an optional web dashboard for visual inspection.
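Because Iris is itself an MCP server, an agent invokes these tools through the protocol's standard `tools/call` request rather than through an SDK. A minimal sketch of that envelope: the JSON-RPC framing comes from the MCP specification, but every argument field name here (`span_name`, `latency_ms`, `tokens`, `cost_usd`) is an illustrative assumption, not Iris's documented schema.

```typescript
// Hypothetical JSON-RPC 2.0 envelope for calling Iris's log_trace tool.
// The "tools/call" method and params shape come from the MCP spec; the
// argument field names are illustrative guesses, not Iris's real schema.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: string;                        // which MCP tool to invoke
    arguments: Record<string, unknown>;  // tool-specific payload
  };
}

const logTraceCall: ToolCallRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "log_trace",
    arguments: {
      span_name: "search_products",              // span being recorded (assumed field)
      latency_ms: 412,                           // wall-clock latency (assumed field)
      tokens: { prompt: 1200, completion: 350 }, // token counts (assumed field)
      cost_usd: 0.0041,                          // estimated cost (assumed field)
    },
  },
};

// The same envelope would carry evaluate_output or get_traces calls;
// only params.name and params.arguments change.
console.log(JSON.stringify(logTraceCall, null, 2));
```

The point of the protocol-native design is visible here: there is nothing Iris-specific in the framing, so any MCP client that can call a search or database tool can log a trace the same way.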

The architectural distinction matters in a market where incumbents are racing to add MCP support. Datadog, New Relic, and Grafana Cloud have all added MCP instrumentation in recent months, but their implementations observe the MCP client — adding telemetry to the agent side that calls MCP tools. Iris inverts this model entirely: it is the MCP server, so observability becomes a first-class, auto-discoverable capability for any MCP-native agent in the same way a search or database tool would be. The project supports native integration with CrewAI and LangChain, and can be added to Claude Desktop's MCP configuration with a few lines of JSON, enabling Claude to natively call Iris tools for logging and evaluation. HTTP transport mode ships with API key authentication via Bearer tokens, rate limiting, Helmet security headers, Zod-validated inputs, and ReDoS-safe regex validation for custom eval rules.
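The Claude Desktop hookup follows the standard MCP server registration pattern: an entry in `claude_desktop_config.json`. A sketch under assumptions, since the server key (`"iris"`) and the `npx -y` invocation are the conventional pattern for npm-published MCP servers rather than anything confirmed by Iris's docs:

```json
{
  "mcpServers": {
    "iris": {
      "command": "npx",
      "args": ["-y", "@iris-eval/mcp-server"]
    }
  }
}
```

Once registered, the `log_trace`, `evaluate_output`, and `get_traces` tools show up in Claude's tool list through MCP discovery, with no further wiring.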

Iris enters a well-funded competitive landscape. LangSmith, the commercial observability platform from LangChain, dominates within the LangChain ecosystem; its parent company has raised $260 million at a $1.25 billion valuation. Arize AI, maker of the open-source Phoenix tracing library, raised a $70 million Series C in February 2025 and runs roughly 50 million evaluations per month. Langfuse, the open-source OTel-native alternative with over 19,000 GitHub stars, was acquired by ClickHouse in January 2026. Against these incumbents, Iris's current evaluation model — rule-based checks with shallow semantic scoring — lacks the LLM-as-judge depth, annotation queues, and team collaboration features that drive revenue for the mid-tier platforms. Its local SQLite persistence also places it firmly in the <a href="/news/2026-03-14-loupe-lightweight-local-tracing-dashboard-for-llm-apps-and-agent-systems">developer and self-hosted tier</a>, with no cloud storage or human review workflow.

At version 0.1.2 with a single GitHub star as of mid-March 2026, Iris is early. Its protocol-native architecture is a real moat — one that grows as MCP adoption expands. The protocol, created by Anthropic and since adopted by OpenAI and Microsoft, already has client support in Claude, ChatGPT, VS Code, and Cursor. If that convergence holds, incumbents bolting MCP support onto existing architectures will be playing catch-up with a tool built from the protocol up.