SkillOpt: Training AI Agent Skills Like Neural Networks

AI agents have a skill problem. You give a language model a system prompt — or “skill” — and it performs a task. You tweak the instructions, test again, tweak more. It’s slow, manual, and unreliable. Hand-written skills don’t generalize. One-shot LLM-generated skills are hit-or-miss. Self-revision loops drift without discipline. None of these approaches reliably improves over the starting point.

A new paper from Microsoft Research, SkillOpt, proposes a different approach entirely: stop tuning the model and start tuning the skill document. Treat the natural-language skill as a trainable parameter, subject it to the same optimization discipline that makes neural network training reproducible, and ship a compact markdown file that makes any frozen model perform dramatically better.

The results are striking. Across 52 evaluated combinations of target models, benchmarks, and execution harnesses, SkillOpt is best or tied-best in every single one. On GPT-5.5, the optimized skill lifts average accuracy by +23.5 points in direct chat, +24.8 inside the Codex agentic loop, and +19.1 inside Claude Code. The deployed artifact? A single markdown file, typically 300–2,000 tokens, with zero additional inference overhead.

The Core Idea: Skills as External State

SkillOpt reframes agent optimization. Instead of fine-tuning model weights or endlessly hand-editing prompts, it treats the skill document itself as the external trainable state of a frozen agent. The model stays fixed. The skill evolves.

The optimization loop mirrors a standard deep learning training cycle, but operates entirely in text space:

1. Rollout — The Forward Pass

The frozen target model executes a batch of tasks using the current skill document. It records complete trajectories — messages, tool calls, verifier feedback, metadata, and final scores. This is the training data for skill optimization.

2. Reflect — The Backward Pass

A separate optimizer model (often a larger or same-scale LLM) analyzes two minibatches: one of failures and one of successes. By examining what went wrong and what worked, the optimizer identifies concrete, reusable procedural improvements — not vague advice, but specific instructions to add, delete, or replace in the skill.

3. Edit — Bounded Updates

Candidate edits are merged and ranked under a budget. This budget acts as a textual learning rate: it limits how much the skill can change in a single step, preventing a single bad reflection from rewriting the entire document. The paper uses a default learning rate of 4 with cosine decay across epochs — familiar territory for anyone who’s trained a neural network.

4. Gate — Validation

The candidate skill is evaluated on a held-out selection split. If it doesn’t strictly improve the validation score, the edit is rejected outright. This turns reflection from unconditional self-editing into a proper propose-and-test optimization loop. Rejected edits aren’t thrown away — they’re stored in a buffer that serves as negative feedback for future reflections.

Stability Mechanisms

What makes SkillOpt more than just “ask an LLM to improve a prompt” is its emphasis on training stability — the same concerns that govern weight-space optimization:

Textual Learning Rate

The edit budget limits how many modifications can be applied per step. Without it, the optimizer can overwrite useful rules with broad rewrites. The ablation study shows this matters: removing the learning rate budget drops SearchQA by 2.5 points, SpreadsheetBench by 1.8, and LiveMath by 4.0.

Rejected-Edit Buffer

When a proposed edit fails validation, it goes into a buffer. Future reflections see this buffer and learn which directions are harmful. Without the buffer, SpreadsheetBench drops by 4.6 points — the optimizer keeps proposing the same failed modifications.

Epoch-Wise Slow/Meta Update

At epoch boundaries, the system computes a broader update based on accumulated evidence. The “meta skill” is an optimizer-side document that tracks what types of edits have historically worked. This provides longer-horizon feedback without bloating the deployed skill. Without both slow and meta updates, SpreadsheetBench drops by a dramatic 22.5 points.

Benchmark Results

The paper evaluates across seven target models (GPT-5.5, GPT-5.4, GPT-5.4-mini, GPT-5.4-nano, GPT-5.2, Qwen3.5-4B, and Qwen3.6-35B-A3B), six benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMath, and ALFWorld), and three execution harnesses (direct chat, Codex CLI, and Claude Code CLI). Key results on direct chat:

GPT-5.5: +23.5 average gain across all benchmarks
GPT-5.4-nano: +26.7 average gain (notably +49.4 on DocVQA)
Qwen3.5-4B: +19.2 average gain (notably +50.7 on ALFWorld)
Qwen3.6-35B-A3B: +9.1 average gain (smaller but consistent)

Inside agentic coding harnesses, the gains are even larger on tool-heavy benchmarks: GPT-5.5 sees +57.5 on SpreadsheetBench inside Codex and +58.3 inside Claude Code. The skill teaches the coding agent not just what to do, but how to use its tools effectively.

SkillOpt also clears every baseline — human-written skills, one-shot LLM skills, Trace2Skill, TextGrad, GEPA, and EvoSkill — on all six benchmarks.

Skill Transfer

One of the most practical findings is that optimized skill artifacts transfer across contexts without re-optimization:

Cross-model: A skill optimized for GPT-5.4 on LiveMath transferred to GPT-5.4-nano, gaining +5.6 points.
Cross-harness: A SpreadsheetBench skill trained inside Codex transferred to Claude Code, gaining +59.7 points.
Self-optimizer: Even GPT-5.4-nano can optimize its own skills — using itself as both optimizer and target yielded +11.9 on SpreadsheetBench.

This means you can invest optimization compute once, using a strong optimizer model, then deploy the resulting best_skill.md to cheaper or smaller models in production. The deployment artifact is a single file with zero inference overhead — the target model doesn’t call any additional models or APIs at runtime.

Getting Started

The SkillOpt code is open-source under MIT and supports Azure OpenAI, standard OpenAI-compatible endpoints, Anthropic Claude, Qwen (via local vLLM), and MiniMax as backends. A minimal training run looks like this:

git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .

# Configure API credentials (Azure OpenAI endpoint is required)
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_KEY="YOUR_API_KEY"

# Train a skill on SearchQA
python scripts/train.py \
    --config configs/searchqa/default.yaml \
    --split_dir /path/to/searchqa_split \
    --azure_openai_endpoint "$AZURE_OPENAI_ENDPOINT" \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

The default protocol runs 4 epochs with a batch size of 40, reflection minibatches of 8, textual learning rate of 4 with cosine decay, and strict hard validation gating. Each run produces a structured output directory containing the best validated skill, per-step snapshots, and full training history.

To evaluate a trained skill without re-running optimization:

python scripts/eval_only.py \
    --config configs/searchqa/default.yaml \
    --skill ckpt/searchqa/gpt5.5_skill.md \
    --split all \
    --split_dir /path/to/searchqa_split \
    --azure_openai_endpoint "$AZURE_OPENAI_ENDPOINT"

Why This Matters

SkillOpt challenges the assumption that improving agent behavior requires improving the model. Instead, it demonstrates that the procedure an agent follows — expressed as natural language — is itself a trainable artifact. The training loop has the same properties that make weight-space optimization reliable: bounded updates, validation gating, and memory of what doesn’t work.

For practitioners, the practical implications are significant. You can optimize a skill once using a strong (and expensive) model, then deploy the resulting markdown file to cheaper models in production with zero additional latency. The skill transfers across model scales, execution environments, and even to related benchmarks. This is a fundamentally different deployment model from fine-tuning — lighter, more portable, and easier to iterate on.

The codebase also ships with a set of pre-trained skill artifacts for GPT-5.5 across multiple benchmarks, so you can evaluate the approach immediately without running your own optimization. Whether you’re building coding agents, tool-using assistants, or embodied AI, SkillOpt offers a systematic path from “write a prompt and hope” to “train a skill and ship.”