What is a coding agent and how is it evaluated? A coding agent is an AI system that operates in a software engineering loop: it reads a problem description, writes code, executes it in a sandbox, observes the output or error, and iterates until the code is correct. Unlike one-shot code generation, coding agents rely on tool use (file read/write, terminal execution) and multi-step planning to handle complex tasks. SWE-bench is a widely used evaluation benchmark: it tests whether an agent can resolve real GitHub issues by producing a code patch that makes the tests tied to each issue's fix pass without breaking tests that already passed. Pass@k (the probability that at least one of k generated solutions passes all tests) is the primary metric for function-level code generation quality.
What Is Coding Agent Research?
A coding agent is an AI system that doesn't just generate code snippets — it operates in a software engineering loop: reads a problem, writes code, runs it, observes the output or error, and iterates until it produces something correct.
The research field accelerated with SWE-bench, a benchmark that evaluates whether AI models can resolve real GitHub issues — full repository context, real test suites, real pull requests. It exposed a large gap between "generates plausible-looking code" and "actually fixes the bug."
Building a capable coding agent requires three things working together:
A fine-tuned code model: a base model trained specifically on code (CodeLlama, DeepSeek-Coder, Qwen-Coder), further adapted to your codebase, language, or task type through supervised fine-tuning (SFT).
An execution environment: a sandboxed runtime that actually runs the generated code and returns real output such as test pass/fail results, error messages, and stdout. Without execution feedback, the agent can't self-correct.
An agentic loop: the architecture that ties model and environment together, covering tool use for file reading/writing, multi-step planning, error parsing, and retry logic. LangGraph is a widely used framework for building stateful, multi-step agent workflows.
The evaluation side is equally important. Pass@k on HumanEval tells you about function-level generation. SWE-bench tells you about real software engineering tasks. Most teams need both, plus custom benchmarks for their specific codebase and task distribution.
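To make the pass@k number concrete, here is a minimal sketch of the combinatorial estimator commonly used alongside HumanEval: draw n samples per problem, count the c that pass all tests, and estimate the chance that a random subset of k contains at least one pass. The function names are illustrative, not part of any particular harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
```

For example, with 10 samples per problem and 3 passing, `pass_at_k(10, 3, 1)` is about 0.3, while `pass_at_k(10, 3, 5)` is much higher because any of five draws may hit a passing sample.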
Who This Is For
Coding agent startups that need evaluation infrastructure and fine-tuned models
Enterprises building internal AI dev tools (code review, bug fix, PR generation)
AI labs benchmarking coding ability across model versions
Research teams implementing software engineering agent papers
What We Build
SWE-bench Style Evaluation Harness
Implement GitHub issue → code fix evaluation pipelines modeled on SWE-bench. Repository setup, issue parsing, patch generation, test execution, and pass@k scoring. Adaptable to your own codebase or benchmark.
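As a rough illustration of the scoring step, SWE-bench treats a task as resolved when the tests that failed before the patch now pass and the tests that passed before still pass. The sketch below assumes test outcomes have already been collected; the `TaskResult` fields and function names are hypothetical, chosen for this example only.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    fail_to_pass: list[str]        # tests expected to flip from failing to passing
    pass_to_pass: list[str]        # tests that must keep passing
    passed_after_patch: set[str]   # tests observed passing with the patch applied


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every required test passes after the patch."""
    required = set(result.fail_to_pass) | set(result.pass_to_pass)
    # Guard against tasks with no target tests, which would trivially "resolve".
    return bool(result.fail_to_pass) and required <= result.passed_after_patch


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Keeping the two test lists separate makes it easy to report regressions (a broken previously passing test) apart from plain failures to fix the issue.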
Code Generation Model Fine-Tuning
Fine-tune CodeLlama, DeepSeek-Coder, StarCoder, and Qwen-Coder on domain-specific code corpora — internal APIs, proprietary frameworks, specific languages. Includes before/after evaluation on HumanEval and custom benchmarks.
Self-Correcting Code Agent Pipeline
Generate → test → fix loop implementation. Model generates code, execution environment runs tests, model receives error feedback and re-attempts. Configurable retry budget and termination conditions.
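The generate, test, fix loop can be sketched in a few lines. This is a framework-free illustration, not a LangGraph implementation: `generate` and `run_tests` are placeholder callables you would back with a model and a sandbox, and the "same feedback twice" stop rule is just one possible termination condition.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunReport:
    passed: bool
    feedback: str = ""  # error text or failing-test output fed back to the model


@dataclass
class LoopResult:
    solved: bool
    attempts: int
    code: Optional[str]
    history: list[str] = field(default_factory=list)


def generate_test_fix(
    task: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    run_tests: Callable[[str], RunReport],
    max_attempts: int = 3,
) -> LoopResult:
    """Retry generation with execution feedback until tests pass or the budget ends."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    code: Optional[str] = None
    feedback: Optional[str] = None
    history: list[str] = []
    attempt = 0
    for attempt in range(1, max_attempts + 1):
        # The generator sees the task, its previous attempt, and the last error.
        code = generate(task, code, feedback)
        report = run_tests(code)
        if report.passed:
            return LoopResult(True, attempt, code, history)
        feedback = report.feedback
        history.append(feedback)
        # Stop early when two consecutive attempts fail identically (no progress).
        if len(history) >= 2 and history[-1] == history[-2]:
            break
    return LoopResult(False, attempt, code, history)
```

Passing the callables in keeps the loop testable with fakes and leaves the model and sandbox choices open.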
Execution-Based Evaluation
Evaluate generated code by running it — not by string matching. Unit test pass rates, functional correctness scoring, timeout and error categorization. Works with Python, JavaScript, SQL, and Bash.
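A minimal sketch of execution-based scoring for Python candidates follows. It runs the candidate plus its tests in a fresh interpreter with a wall-clock timeout and buckets the outcome. The category names and the `run_candidate` helper are illustrative assumptions, and a plain subprocess is not a sandbox, so untrusted code would still need container isolation around this step.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, test_code: str, timeout_s: float = 10.0) -> dict:
    """Run candidate code followed by its tests and classify the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(code + "\n\n" + test_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout_s, cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return {"outcome": "timeout", "stderr": ""}

    # The final traceback line names the exception type.
    lines = proc.stderr.strip().splitlines()
    last = lines[-1] if lines else ""
    if proc.returncode == 0:
        outcome = "pass"
    elif last.startswith(("SyntaxError", "IndentationError", "TabError")):
        outcome = "syntax_error"
    elif last.startswith("AssertionError"):
        outcome = "test_failure"
    else:
        outcome = "runtime_error"
    # Keep only the tail of stderr so reports stay small.
    return {"outcome": outcome, "stderr": proc.stderr[-2000:]}
```

Counting these outcome labels across a benchmark run gives the error categorization described above; other languages would need their own runner and error parsing.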
Agentic Coding Workflow with Tool Use
LangGraph-based coding agent with tool use — file read/write, terminal execution, web search, documentation lookup. Multi-step task decomposition with state management.
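To show what the tool layer under such an agent might look like, here is a small framework-independent sketch of file and terminal tools confined to one workspace directory, plus a dispatcher for tool calls. It does not use LangGraph itself; the `Workspace` class, the tool names, and the `{"tool": ..., "args": ...}` call shape are assumptions made for this example.

```python
import subprocess
from pathlib import Path


class Workspace:
    """File and terminal tools restricted to a single root directory."""

    def __init__(self, root: str):
        self.root = Path(root).resolve()

    def _resolve(self, rel_path: str) -> Path:
        path = (self.root / rel_path).resolve()
        if not path.is_relative_to(self.root):  # requires Python 3.9+
            raise PermissionError(f"{rel_path} escapes the workspace")
        return path

    def read_file(self, path: str) -> str:
        return self._resolve(path).read_text(encoding="utf-8")

    def write_file(self, path: str, content: str) -> str:
        target = self._resolve(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        return f"wrote {len(content)} characters to {path}"

    def run_command(self, argv: list[str], timeout_s: float = 30.0) -> str:
        proc = subprocess.run(
            argv, cwd=self.root, capture_output=True, text=True, timeout=timeout_s
        )
        return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"


def dispatch(ws: Workspace, call: dict) -> str:
    """Execute one tool call and return its result as an observation string."""
    tools = {
        "read_file": ws.read_file,
        "write_file": ws.write_file,
        "run_command": ws.run_command,
    }
    fn = tools.get(call.get("tool"))
    if fn is None:
        return f"error: unknown tool {call.get('tool')!r}"
    try:
        return fn(**call.get("args", {}))
    except Exception as exc:
        # Errors become observations so the agent can react instead of crashing.
        return f"error: {type(exc).__name__}: {exc}"
```

Returning errors as plain strings is a deliberate choice here: the model sees a failed tool call the same way it sees a failing test, and can plan its next step from it.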
Repository-Level Code Understanding
Build pipelines for understanding large codebases — AST parsing, dependency graph extraction, semantic search over code, function-level summarization. Foundation layer for repo-level coding agents.
Tech Stack
Python · LangGraph · LangChain · CodeLlama · DeepSeek-Coder · StarCoder · HumanEval · SWE-bench · Docker (sandboxed execution) · Tree-sitter · W&B
Deliverables
Evaluation harness codebase with scoring pipeline
Fine-tuned code model weights + evaluation report
Agent pipeline implementation (LangGraph)
Benchmark results vs. base model
Full documentation and run instructions
How to Work With Us
We offer two ways to engage, depending on whether you have a defined deliverable or ongoing capacity needs.
Option 1 — Scoped Sprint Contract
A fixed-scope engagement for a defined deliverable.
Best for: One-time projects with a clear endpoint — a benchmark suite, a fine-tuning run, an eval harness
Timeline: 4–16 weeks depending on scope
Structure: Scoping call → fixed deliverable, timeline, and acceptance criteria → delivery
Pricing: Project-based, scoped after a short call
Option 2 — Dedicated Research Pod (Monthly Retainer)
An ongoing team of research engineers working full-time on coding agent and software engineering research for your organization.
Best for: AI labs and startups with continuous post-training work — not a single deliverable, but an evolving backlog
Structure: A dedicated pod (2–3 engineers + senior lead) directed by you month-to-month. Output shifts with your priorities — a SWE-bench-style eval harness this month, something else next.
Billing: Monthly retainer, Net 7/15
Pricing: From $12,000–$24,000/month for a 3-engineer pod (per-engineer rates below)
Frequently Asked Questions
What is SWE-bench and why does it matter for coding agents? SWE-bench is a benchmark that tests whether an AI model can resolve real GitHub issues from popular open-source repositories. Each task gives the model access to the full repository codebase and a natural language description of a bug or feature request, and the model must produce a code patch. The patch counts as a fix only if the tests that failed before the change now pass and the tests that passed before still pass. It is widely regarded as one of the more demanding standard coding benchmarks because it requires real repository navigation, multi-file edits, and understanding of existing code structure, not just generating isolated functions. At the time of writing, top models score 40–60% on SWE-bench Verified; typical fine-tuned models without agent scaffolding score much lower.
What's the difference between a code generation model and a coding agent? A code generation model takes a prompt and produces code — one shot. A coding agent runs in a loop: it generates code, executes it in a sandbox, reads the output or error, and uses that feedback to revise its output in subsequent turns. The agent architecture allows self-correction, which dramatically improves success rates on complex tasks. Most production coding tools (Cursor, GitHub Copilot Workspace, Devin) are agents, not one-shot generators.
Which code model should I fine-tune for my use case? DeepSeek-Coder V2 is currently the strongest open-weight code model across most benchmarks. CodeLlama is well-studied with extensive community tooling. Qwen-Coder is strong on multilingual code tasks. For most enterprise use cases involving a specific internal codebase, the base model matters less than the quality of the domain-specific fine-tuning data. We benchmark multiple candidates on your tasks before selecting one for a full fine-tuning run.
How do you sandbox code execution safely? We use Docker containers with resource limits (CPU, memory, time), network isolation, and read-only filesystem mounts for the codebase. Each execution runs in a fresh container to prevent state pollution between test cases. This makes it safe to run untrusted generated code at scale — which is required for both training (RL with execution rewards) and evaluation (pass@k scoring).
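For illustration, the container settings described above map onto standard `docker run` flags roughly as follows. This sketch only builds the argument list and does not start a container. The helper name, the default limits, and the use of a `timeout` binary inside the image for the time limit are assumptions for this example.

```python
def docker_run_argv(
    image: str,
    repo_dir: str,
    command: list[str],
    cpus: float = 1.0,
    memory: str = "512m",
    pids_limit: int = 128,
    timeout_s: int = 60,
) -> list[str]:
    """Build a `docker run` argument list for one isolated, throwaway execution."""
    return [
        "docker", "run",
        "--rm",                         # fresh container per run, removed afterwards
        "--network", "none",            # no network access
        "--cpus", str(cpus),            # CPU limit
        "--memory", memory,             # memory limit
        "--pids-limit", str(pids_limit),
        "--read-only",                  # read-only root filesystem
        "--tmpfs", "/tmp",              # writable scratch space only
        "-v", f"{repo_dir}:/repo:ro",   # codebase mounted read-only
        "-w", "/repo",
        image,
        # Assumes the image ships a `timeout` binary (e.g. GNU coreutils).
        "timeout", str(timeout_s), *command,
    ]
```

A call such as `docker_run_argv("python:3.12-slim", "/srv/repo", ["python", "-m", "unittest"])` yields a list that can be handed to `subprocess.run`; enforcing the time limit inside the container avoids relying on the client process being killed.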
Can you build a coding agent for a private internal codebase? Yes. Repository-level coding agents require understanding your specific codebase: its conventions, internal APIs, and project structure. We build semantic search indices over your codebase, fine-tune on internal code examples, and configure the agent's tool use for your specific development environment. All work is done under NDA, and if required, no data leaves your infrastructure.
What evaluation metrics matter most for coding agents? Pass@k — the probability that at least one of k generated solutions passes all tests — is the standard metric for code generation quality. For agents, you also want to track: number of execution turns required to reach a correct solution (efficiency), failure mode distribution (syntax errors vs. logic errors vs. test mismatches), and performance by task difficulty tier. We deliver structured evaluation reports across all these dimensions, not just a single pass@k number.
Related Services
LLM Benchmark & Evaluation
RL Environment Design
Supervised Fine-Tuning (SFT) Research & Implementation
Most coding agent engagements start with the evaluation harness. Get the benchmark working first — then we build the agent.






