# hidai25 / eval-view
Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.
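The core idea (snapshot an agent's tool-call behavior, then diff later runs against a golden baseline) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not EvalView's actual API: the trace dictionaries and the `diff_trace` helper are assumptions invented for this example.

```python
# Minimal sketch of golden-baseline regression detection for agent tool calls.
# NOT EvalView's real API -- an illustration of the concept. The real tool also
# scores output quality (semantic similarity, LLM-as-judge) before reporting.

def diff_trace(baseline, current):
    """Classify a run against its golden baseline, coarsest check first."""
    b_calls = [(call["tool"], call["params"]) for call in baseline["calls"]]
    c_calls = [(call["tool"], call["params"]) for call in current["calls"]]
    if b_calls != c_calls:
        return "TOOLS_CHANGED"   # different tools, order, or parameters
    if baseline["output"] != current["output"]:
        return "OUTPUT_CHANGED"  # same tool path, but the output drifted
    return "PASSED"

# Hypothetical support-agent traces:
baseline = {
    "calls": [{"tool": "search_kb", "params": {"q": "refund"}},
              {"tool": "send_reply", "params": {"tone": "polite"}}],
    "output": "Refunds take 5-7 business days.",
}
current = {
    "calls": [{"tool": "send_reply", "params": {"tone": "polite"}}],
    "output": "Refunds take 5-7 business days.",
}

print(diff_trace(baseline, current))  # -> TOOLS_CHANGED (search_kb was skipped)
```

The point of the sketch: an ordinary assertion-based test would pass here, because the agent still returned a plausible answer. Only a diff of the recorded tool sequence reveals that the knowledge-base lookup was silently dropped.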
## Repository Overview (README excerpt)
Regression testing for AI agents. Snapshot behavior, detect regressions, block broken agents before production.

---

EvalView sends test queries to your agent, records everything (tool calls, parameters, sequence, output, cost, latency), and diffs it against a golden baseline. When something changes, you know immediately.

Normal tests catch crashes. Tracing shows what happened after the fact. EvalView catches the harder class: the agent returns but silently takes the wrong tool path, skips a clarification, or degrades output quality after a model update.

## Quick Start

**Already have a local agent running?**

**No agent yet?**

**Want a real working agent?** Starter repo: evalview-support-automation-template, an LLM-backed support automation agent with built-in EvalView regression tests.

**Other entry paths:**

## How It Works

- ** ** — detects your running agent, creates a starter test suite
- ** ** — runs tests, saves traces as baselines (picks judge model on first run)
- ** ** — replays tests, diffs against baselines, opens HTML report with results
- ** ** — runs checks continuously with optional Slack alerts

**Your data stays local.** Nothing is sent to EvalView servers.

## Two Modes, One CLI

EvalView has two complementary ways to test your agent:

**Regression Gating** — *"Did my agent change?"* Snapshot known-good behavior, then detect when something drifts.

**Evaluation** — *"How good is my agent?"* Auto-generate tests and score your agent's quality right now.

Both modes start the same way: → → then pick your path.

## What It Catches

| Status | Meaning | Action |
|--------|---------|--------|
| ✅ **PASSED** | Behavior matches baseline | Ship with confidence |
| ⚠️ **TOOLS_CHANGED** | Different tools called | Review the diff |
| ⚠️ **OUTPUT_CHANGED** | Same tools, output shifted | Review the diff |
| ❌ **REGRESSION** | Score dropped significantly | Fix before shipping |

## Four Scoring Layers

| Layer | What it checks | Needs API key? | Cost |
|-------|----------------|:--------------:|------|
| **Tool calls + sequence** | Exact tool names, order, parameters | No | Free |
| **Code-based checks** | Regex, JSON schema, contains/not_contains | No | Free |
| **Semantic similarity** | Output meaning via embeddings | | ~$0.00004/test |
| **LLM-as-judge** | Output quality scored by LLM | Any provider key | ~$0.01/test |

The first two layers alone catch most regressions — fully offline, zero cost. The LLM judge supports GPT-5.4, Claude Opus/Sonnet, Gemini, DeepSeek, Grok, and Ollama (free local). EvalView asks which model to use on first run, or use .

## Multi-Turn Testing

Each turn is evaluated independently. EvalView checks per-turn tool usage, forbidden tools, and output content — then tells you exactly which turn broke:

Capture multi-turn conversations from real traffic:

## Key Features

| Feature | Description | Docs |
|---------|-------------|------|
| **Baseline diffing** | Tool call + parameter + output regression detection | Docs |
| **Multi-turn testing** | Per-turn tool, forbidden_tools, and output checks | Docs |
| **Multi-turn capture** | records conversations as tests | Docs |
| **Multi-reference baselines** | Up to 5 variants for non-deterministic agents | Docs |
| ** ** | Safety contracts — hard-fail on any violation | Docs |
| **Semantic similarity** | Embedding-based output comparison | Docs |
| **Production monitoring** | with Slack alerts and JSONL history | Docs |
| **A/B comparison** | | Docs |
| **Test generation** | — auto-create test suites | Docs |
| **Silent model detection** | Alerts when LLM provider updates the model version | Docs |
| **Gradual drift detection** | Trend analysis across check history | Docs |
| **Statistical mode (pass@k)** | Run N times, require a pass rate | Docs |
| **HTML trace replay** | Auto-opens after check with full trace details | Docs |
| **Verified cost tracking** | Token breakdown (in/out) with model pricing rates | Docs |
| **Judge model picker** | Choose GPT, Claude, Gemini, DeepSeek, or Ollama (free) | Docs |
| **Pytest plugin** | fixture for standard pytest | Docs |
| **PR comments + alerts** | Cost/latency spikes, model changes, collapsible diffs | Docs |
| **GitHub Actions job summary** | Results visible in Actions UI, not just PR comments | Docs |
| **Git hooks** | Pre-push regression blocking, zero CI config | Docs |
| **LLM judge caching** | ~80% cost reduction in statistical mode | Docs |
| **Skills testing** | E2E testing for Claude Code, Codex, OpenClaw | Docs |

## Supported Frameworks

Works with **LangGraph, CrewAI, OpenAI, Claude, Mistral, HuggingFace, Ollama, MCP, and any HTTP API**.

| Agent | E2E Testing | Trace Capture |
|-------|:-----------:|:-------------:|
| LangGraph | ✅ | ✅ |
| CrewAI | ✅ | ✅ |
| OpenAI Assistants | ✅ | ✅ |
| Claude Code | ✅ | ✅ |
| Ollama | ✅ | ✅ |
| Any HTTP API | ✅ | ✅ |

Framework details → | Flagship starter → | Starter examples →

## CI/CD Integration

Block broken agents in every PR. EvalView posts a comment with pass/fail, diffs, cost/latency spike warnings, and model change alerts — plus a job summary visible in the Actions UI.

**What lands on your PR:**

When something breaks:

Also works with pre-push hooks — zero CI config:

Full CI/CD guide →

## Production Monitoring

New regressions trigger Slack alerts. Recoveries send all-clear. No spam on persistent failures.

Monitor config options →

## Pytest Plugin

## Claude Code (MCP)

8 tools: , , , , , , ,

MCP setup details

Then just ask Claude: "did my refactor break anything?" and it runs inline.

## Why EvalView?

| | LangSmith | Braintrust | Promptfoo | **EvalView** |
|---|:---:|:---:|:---:|:---:|
| **Primary focus** | Observability | Scoring | Prompt comparison | **Regression detection** |
| Tool call + parameter diffing | — | — | — | **Yes** |
| Golden baseline regression | — | Manual | — | **Automatic** |
| PR comments with alerts | — | — | — | **Cost, latency, model change** |
| Works without API keys | No | No | Partial | **Yes** |
| Productio…
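The statistical mode mentioned in the feature list (run a test N times, require a minimum pass rate, since LLM agents are non-deterministic) reduces to a simple gate. The sketch below is an assumption about how such a gate works in general, not EvalView's implementation; the function name and the 70% threshold are invented for illustration.

```python
# Hypothetical pass@k-style gate for flaky, non-deterministic agent tests.
# Each entry in `outcomes` is the pass/fail result of one replay of the same
# test case against its baseline.

def passes_statistical_gate(outcomes, required_rate=0.7):
    """Require at least `required_rate` of the replays to pass."""
    return sum(outcomes) / len(outcomes) >= required_rate

# 8 of 10 replays matched the baseline -> gate passes at a 70% threshold
runs = [True] * 8 + [False] * 2
print(passes_statistical_gate(runs))  # -> True
```

In CI, the boolean would typically be turned into an exit code (`sys.exit(0 if ok else 1)`) so a single flaky replay does not block a PR, while a genuine regression, which fails most replays, still does.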