Lazy Skills: A Token-Efficient Approach to Dynamic Agent Capabilities
Infinitely scaling CLI agents
The Context Window Crisis
AI coding assistants face a fundamental dilemma: breadth versus depth.
Modern agents need access to dozens, if not hundreds, of specialized capabilities—web scraping, database operations, code analysis, API integrations, domain-specific workflows. Yet every tool schema, instruction set, and reference document consumes precious context tokens. Load everything upfront and you exhaust the context window before the user even asks their first question. Load nothing and your agent is powerless.
Traditional approaches offer two unsatisfying solutions:
Static tool sets: Hard-code 10-20 carefully chosen tools, limiting extensibility
Context stuffing: Load all documentation upfront, wasting tokens on irrelevant capabilities
Neither scales. As agent systems grow more sophisticated, we need a fundamentally different approach.
Enter Lazy Skills: Progressive Capability Loading
We’ve built what we call Lazy Skills—a three-tiered progressive disclosure system that loads agent capabilities on-demand, keeping context lean while maintaining extensibility. The name reflects the philosophy: be lazy about loading, aggressive about relevance detection, and smart about what gets promoted into context.
This isn’t just an optimization. It’s a different mental model for how agents acquire capabilities at runtime.
Level 1: Metadata-Only Discovery
What: Skill name and one-line description
When: Always loaded, injected into system prompt
Cost: ~10-20 tokens per skill
def list_level1(self) -> List[Dict[str, str]]:
    """Get Level 1 skill metadata for system prompt injection."""
    return [
        {"name": skill.name, "description": skill.description}
        for skill in self.skills.values()
        if skill.enabled
    ]
At startup, the agent scans skill directories for SKILL.md files with YAML frontmatter:
---
name: web_scraper
description: Extract content from web pages using headless browser
type: executable
auto_load: true
dependencies: [playwright, beautifulsoup4]
---
Only the name and description are injected into the system prompt, making the agent aware of capabilities without consuming tokens on implementation details.
Level 2: On-Demand Content Loading
What: Full skill documentation (markdown body)
When: User message indicates relevance
Cost: 200-2000 tokens per skill
def load_level2(self, name: str) -> Optional[str]:
    """Load Level 2 content (full SKILL.md body)."""
    skill = self.skills.get(name)
    if not skill:
        return None
    if skill.loaded_level < 2:
        skill.loaded_level = 2
        logger.info(f"Loaded Level 2 for skill: {name}")
    return self._skill_bodies.get(name)
When the conversation context suggests a skill is relevant (keyword matching on user input), the agent loads the complete documentation. This includes:
Usage instructions
Parameter descriptions
Examples
Best practices
The agent can now reason about how to use the capability.
Level 3: Lazy Tool Registration
What: Executable code registered as callable tools
When: Agent decides to invoke the skill
Cost: Variable (subprocess overhead + execution time)
def register_executables(self, name: str) -> Dict[str, Any]:
    """Load Level 3 - register executable skills as tools."""
    skill = self.skills.get(name)
    # Find executable script
    execute_script = skill.path / "execute.py"
    # Get tool schema via subprocess
    schema = self._get_tool_schema(execute_script)
    # Create tool wrapper
    tool_func = self._make_subprocess_tool(execute_script, schema)
    # Register with executor
    self.tool_executor.register_tool(
        name=schema.get("name", name),
        function=tool_func,
        schema=schema
    )
    skill.loaded_level = 3
    return {"success": True, "tool_name": schema.get("name", name)}
Only when the agent commits to using a skill does the system:
Invoke the script with --help-json to get the tool schema
Create a subprocess wrapper
Register it as an executable tool
This defers the cost of tool registration until actually needed.
Implementation Architecture
Skill Discovery
Skills live in ~/.config/cli-agent/skills/ as self-contained directories:
skills/
├── web_scraper/
│ ├── SKILL.md # Metadata + docs
│ ├── execute.py # Tool implementation
│ └── requirements.txt
└── code_reviewer/
├── SKILL.md
└── execute.py
On startup, the registry scans for SKILL.md files:
def scan(self, auto_enable: bool = True) -> int:
    """Scan skills directories for SKILL.md files."""
    discovered = 0
    for skills_dir in self.skills_dirs:
        for skill_md in skills_dir.rglob("SKILL.md"):
            metadata, body = parse_skill_md(skill_md)
            skill_meta = SkillMetadata(
                name=metadata.get("name"),
                description=metadata.get("description"),
                skill_type=metadata.get("type", "contextual"),
                path=skill_md.parent,
                enabled=auto_enable and metadata.get("auto_load", False),
                loaded_level=1
            )
            self.skills[skill_meta.name] = skill_meta
            self._skill_bodies[skill_meta.name] = body
            discovered += 1
    return discovered
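The parse_skill_md helper isn't shown here; a minimal sketch of what it could look like, assuming standard YAML frontmatter delimited by --- markers and PyYAML as the parser:

```python
import yaml  # assumed parser for the frontmatter block
from pathlib import Path
from typing import Dict, Tuple

def parse_skill_md(path: Path) -> Tuple[Dict, str]:
    """Split a SKILL.md file into its YAML frontmatter (metadata) and markdown body."""
    text = path.read_text(encoding="utf-8")
    if text.startswith("---"):
        # Everything between the first two '---' delimiters is frontmatter
        _, frontmatter, body = text.split("---", 2)
        return yaml.safe_load(frontmatter) or {}, body.strip()
    return {}, text  # No frontmatter: treat the whole file as body
```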
Skill Types
The system supports three skill types:
Executable: Runnable tools (web scrapers, API clients)
Contextual: Knowledge/guidelines (coding standards, architectural patterns)
Composite: Multi-step workflows orchestrating other tools
Type determines loading strategy—contextual skills stop at Level 2, while executables can reach Level 3.
Subprocess-Based Isolation
Executable skills run as subprocesses, providing:
Dependency isolation: Each skill can have its own Python dependencies
Security boundary: Skills can’t directly access agent internals
Fault tolerance: Crashed skills don’t crash the agent
def _make_subprocess_tool(self, script_path: Path, schema: Dict[str, Any]) -> callable:
    """Create a callable tool wrapper for a subprocess script."""
    def tool(**kwargs):
        cmd = [sys.executable, str(script_path), "--run-json", json.dumps(kwargs)]
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            return json.loads(result.stdout)
        else:
            return {"success": False, "error": result.stderr}
    return tool
Benefits
Token Efficiency
Startup: 10-20 tokens × number of skills (vs. full tool schemas)
Conversation: Only relevant skills loaded
Example: 50 skills = ~500 tokens baseline instead of ~15,000
Extensibility
Users can add skills without code changes:
Create SKILL.md with frontmatter
Add execute.py if executable
Drop in skills directory
Restart agent
Contextual Awareness
The agent knows what it can do (Level 1) before deciding what to actually load (Level 2/3).
Separation of Concerns
Agent core: Conversation orchestration
Skill registry: Capability discovery and loading
Skills: Self-contained, versioned, documented
Trade-offs
Latency
Level 3 registration adds subprocess overhead (~50-200ms per skill). For interactive tools, this is acceptable. For batch operations, pre-registration may be preferred.
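For the batch case, pre-registration is just an eager loop over the registry at startup. A sketch, assuming the registry API shown elsewhere in this post (preregister_all_executables is a hypothetical helper):

```python
def preregister_all_executables(registry) -> None:
    """Hypothetical startup hook: promote every enabled executable skill to Level 3 eagerly.

    Trades ~50-200ms per skill at startup for zero registration latency during the run.
    """
    for skill in registry.skills.values():
        if skill.enabled and skill.skill_type == "executable":
            registry.register_executables(skill.name)
```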
Relevance Detection
Current implementation uses simple keyword matching. More sophisticated approaches (semantic similarity, LLM-based relevance scoring) could improve precision.
Metadata Maintenance
Requires discipline to keep frontmatter accurate. Stale descriptions lead to poor loading decisions.
How Lazy Skills Differs from Anthropic’s Agent Skills
Anthropic recently introduced Agent Skills, a similar approach to progressive disclosure. Both systems share core ideas—YAML frontmatter, SKILL.md files, dynamic loading—but differ significantly in implementation philosophy and technical approach.
Shared Foundation
Both approaches recognize the same core problem: agents need access to specialized knowledge without overwhelming the context window. Both use:
Markdown-based skill definitions (SKILL.md)
YAML frontmatter for metadata (name, description)
Progressive disclosure (load metadata first, content second)
File bundling (skills can reference additional resources)
This convergence suggests we’ve identified a genuine pattern in agent system design.
The Critical Architectural Difference: Opt-In vs. Always-On Context
The most fundamental difference lies in what gets loaded into the system prompt:
Anthropic’s Approach (Always-On Context):
Loads all skill metadata (name + description) into every system prompt at startup
Skills are always “visible” to the agent—it can see the full library of available capabilities
Agent autonomously decides relevance by matching tasks against loaded metadata
Scales through progressive disclosure within skills (Level 1 → Level 2 → Level 3+)
Lazy Skills (Opt-In Context):
Only loads enabled skills into the system prompt (list_level1() filters by skill.enabled)
Skills must be explicitly enabled/disabled by users, or auto-enabled via auto_load: true
Metadata is discovered and cached at startup, but not injected unless enabled
Scales by loading fewer skills into context by default
The Trade-off:
| Aspect | Anthropic (Always-On) | Lazy Skills (Opt-In) |
| --- | --- | --- |
| Discoverability | ✅ Agent can suggest any installed skill | ❌ Agent unaware of disabled skills |
| Baseline token cost | Higher (all skill metadata in prompt) | Lower (only enabled skills) |
| User control | Minimal (agent decides what to load) | Explicit (users control what’s visible) |
| Scaling | Progressive disclosure within skills | Progressive disclosure + selective enabling |
Why we chose opt-in:
In production deployments with dozens of skills, we found users often have specialized skill libraries (e.g., web scraping, database tools, cloud APIs) that are only relevant to specific projects. Loading all metadata upfront meant agents would occasionally suggest irrelevant skills or waste tokens on capabilities the user would never need.
By requiring explicit enabling, we push the relevance decision to the user: “Which skills might I need for this project?” This aligns with how developers think about dependencies—you don’t import every library in your site-packages, you declare what you need.
The best of both worlds:
A hybrid approach could combine the strengths:
def list_level1(self, mode: str = "enabled") -> List[Dict[str, str]]:
    """Get Level 1 skill metadata for system prompt injection.

    mode: "enabled" (default) | "all" (Anthropic-style) | "auto" (smart filtering)
    """
    if mode == "all":
        skills = self.skills.values()
    elif mode == "auto":
        # Use lightweight LLM to select relevant skills based on project context
        skills = self._auto_select_skills(max_skills=10)
    else:  # "enabled"
        skills = [s for s in self.skills.values() if s.enabled]
    return [{"name": s.name, "description": s.description} for s in skills]
This would allow power users to switch between “show me everything” (Anthropic mode) and “only what I’ve enabled” (Lazy Skills mode).
Key Innovations in Lazy Skills
1. Three-Level Loading (vs. Two-Level)
Anthropic’s system has two levels:
Level 1: Metadata (name, description) in system prompt
Level 2: Full SKILL.md and bundled files loaded via the Bash tool
Lazy Skills adds a critical third level:
Level 1: Metadata only (identical to Anthropic)
Level 2: Full skill documentation (similar to Anthropic)
Level 3: Executable tool registration (novel)
This third level is the key innovation. Instead of Claude needing to invoke Bash to run scripts, Level 3 skills register themselves as first-class tools in the agent’s tool executor. The LLM sees them in its tool schema and can invoke them directly.
Why this matters:
Simpler invocation: Agent calls web_scraper(url="...") instead of bash(command="python skill.py --url ...")
Type safety: Tool schemas provide parameter validation
Better error handling: Failed skill execution returns structured JSON, not raw stderr
Cleaner traces: Tool calls show semantic intent, not bash gymnastics
2. Subprocess-Based Isolation
Anthropic’s approach uses bash to execute skill code within the same context. Lazy Skills isolate executable skills in subprocesses with their own dependencies.
def _make_subprocess_tool(self, script_path: Path, schema: Dict[str, Any]) -> callable:
    """Create a callable tool wrapper for a subprocess script."""
    def tool(**kwargs):
        cmd = [sys.executable, str(script_path), "--run-json", json.dumps(kwargs)]
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            return json.loads(result.stdout)
        else:
            return {"success": False, "error": result.stderr}
    return tool
Benefits over bash execution:
Dependency isolation: Each skill can have its own requirements.txt
Security boundary: Skills can’t directly access agent internals or modify state
Fault tolerance: Crashed skills don’t crash the agent
Timeout enforcement: Runaway scripts are killed after 60s
Cleaner interface: Skills communicate via JSON stdin/stdout, not CLI argument parsing
3. Programmatic Relevance Detection
Anthropic relies on Claude to decide when to load skills by reading the system prompt metadata. This is elegant but means relevance detection happens inside the LLM, consuming inference cycles.
Lazy Skills perform pre-inference relevance filtering in Python:
async def _check_and_load_relevant_skills(self, user_message: str):
    """Check if any enabled skills are relevant before LLM inference."""
    enabled_skills = self.skill_registry.list_skills(enabled_only=True, include_details=True)
    user_message_lower = user_message.lower()
    for skill_dict in enabled_skills:
        # Skip already-loaded skills
        if skill_dict["loaded_level"] >= 2:
            continue
        # Build keyword set from metadata and type
        skill_keywords = [
            skill_dict["name"].lower(),
            skill_dict["description"].lower(),
        ]
        if skill_dict["type"] == "executable":
            skill_keywords.extend(["run", "execute", "call", "api"])
        # Simple keyword matching
        is_relevant = any(keyword in user_message_lower for keyword in skill_keywords)
        if is_relevant:
            skill_content = self.skill_registry.load_level2(skill_dict["name"])
            # Inject as system message before LLM call
            self.add_message("system", f"# SKILL LOADED: {skill_dict['name']}\n\n{skill_content}")
Trade-offs:
Pro: Faster, no inference overhead, deterministic
Pro: Skills are loaded before the LLM sees the message, so they’re available immediately
Con: Less sophisticated relevance detection (keyword matching vs. semantic understanding)
Con: May load irrelevant skills if keywords match spuriously
In practice, simple keyword matching works surprisingly well for skill counts under ~50. For larger skill libraries, we could swap in semantic similarity (embedding-based matching) without changing the architecture.
4. Skill Types as Loading Strategy Hints
Both systems support different skill types, but Lazy Skills use type information to change loading behavior:
from enum import Enum

class SkillType(Enum):
    EXECUTABLE = "executable"   # Can reach Level 3 (tool registration)
    CONTEXTUAL = "contextual"   # Stops at Level 2 (documentation only)
    COMPOSITE = "composite"     # Level 3 for orchestration
Contextual skills never register as tools—they’re pure documentation/guidance. The agent uses them to inform its behavior but can’t “call” them. This prevents bloat in the tool schema.
Executable skills can be promoted to Level 3 when the agent explicitly decides to invoke them, registering a subprocess-wrapped tool.
Composite skills can orchestrate multiple tools or skills, potentially registering higher-order tools that coordinate sub-capabilities.
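A minimal sketch of how type could cap the level a skill is promoted to (promote and MAX_LEVEL are hypothetical names; the registry in this post exposes load_level2 and register_executables separately):

```python
MAX_LEVEL = {
    SkillType.EXECUTABLE: 3,   # may be registered as a tool
    SkillType.CONTEXTUAL: 2,   # documentation only, never a tool
    SkillType.COMPOSITE: 3,    # may register an orchestration tool
}

def promote(self, name: str, target_level: int) -> int:
    """Load a skill up to target_level, clamped by what its type allows."""
    skill = self.skills[name]
    level = min(target_level, MAX_LEVEL[SkillType(skill.skill_type)])
    if level >= 2:
        self.load_level2(name)
    if level >= 3:
        self.register_executables(name)
    return level
```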
Anthropic’s system doesn’t explicitly distinguish loading strategies by type, treating all skills uniformly.
5. Auto-Discovery and User Installation
Lazy Skills support both:
Built-in skill directories (shipped with the agent)
User skill directories (~/.config/cli-agent/skills/)
Users can install skills by dropping directories into the config location:
# Install a skill
mkdir -p ~/.config/cli-agent/skills/my_skill
cd ~/.config/cli-agent/skills/my_skill
cat > SKILL.md <<EOF
---
name: my_custom_tool
description: Does something amazing
type: executable
auto_load: true
---
# My Custom Tool
Usage instructions here...
EOF
# Write execute.py
cat > execute.py <<EOF
#!/usr/bin/env python3
import json, sys
# Tool implementation...
EOF
chmod +x execute.py
On next startup, the skill is auto-discovered, metadata loaded to Level 1, and appears in the agent’s capability list. No code changes to the agent itself.
Comparison Table
| Feature | Anthropic Agent Skills | Lazy Skills |
| --- | --- | --- |
| Metadata loading | System prompt injection | System prompt injection |
| Content loading | Bash tool reads SKILL.md | Pre-injection before inference |
| Tool registration | Manual bash invocation | Automatic subprocess wrapper |
| Relevance detection | LLM-based (Claude decides) | Programmatic (keyword matching) |
| Skill isolation | Same process (bash) | Subprocess with timeout |
| Dependency management | Shared environment | Per-skill requirements.txt |
| Skill types | Informal categorization | Formal types with different loading |
| Loading overhead | LLM inference for discovery | Pre-inference filtering |
| Security model | Trust-based (audit skills) | Process isolation + timeout |
| Extensibility | Drop files, restart | Drop files, restart |
Deep Dive: The Three-Level Architecture
Let’s examine each level in detail with concrete examples.
Level 1: The Awareness Layer
Purpose: Make the agent aware of what it could do, without committing tokens to how to do it.
When the agent starts, it scans all enabled skills and extracts minimal metadata:
def list_level1(self) -> List[Dict[str, str]]:
    """Get Level 1 skill metadata for system prompt injection."""
    return [
        {"name": skill.name, "description": skill.description}
        for skill in self.skills.values()
        if skill.enabled
    ]

# Example output:
# [
#   {"name": "web_scraper", "description": "Extract content from websites using headless browser"},
#   {"name": "pdf_extractor", "description": "Extract text and tables from PDF documents"},
#   {"name": "code_complexity", "description": "Analyze code complexity metrics per ISO 25010"}
# ]
This gets injected into the system prompt:
You are an AI coding assistant with the following capabilities:
Enabled skills:
- web_scraper: Extract content from websites using headless browser
- pdf_extractor: Extract text and tables from PDF documents
- code_complexity: Analyze code complexity metrics per ISO 25010
[... rest of system prompt ...]
Token cost: ~15 tokens per skill × 50 skills = 750 tokens total for the entire skill library.
Compare this to loading full documentation: ~500 tokens per skill × 50 skills = 25,000 tokens—a 33× reduction.
The agent now knows it has a web scraper skill. It doesn’t yet know how to use it, but it knows to consider it when users ask web-related questions.
Level 2: The Documentation Layer
Purpose: Teach the agent how to use a capability when it becomes relevant.
When a user message suggests a skill might be useful, the system loads the full SKILL.md body:
async def _check_and_load_relevant_skills(self, user_message: str):
    """Load Level 2 content for relevant skills before LLM call."""
    enabled_skills = self.skill_registry.list_skills(enabled_only=True, include_details=True)
    for skill_dict in enabled_skills:
        # Skip already-loaded
        if skill_dict["loaded_level"] >= 2:
            continue
        # Build keyword set from name, description, and type
        keywords = [skill_dict["name"].lower(), skill_dict["description"].lower()]
        # Add type-specific keywords
        if skill_dict["type"] == "executable":
            keywords.extend(["run", "execute", "api", "script"])
        # Check relevance
        if any(kw in user_message.lower() for kw in keywords):
            # Load full content
            content = self.skill_registry.load_level2(skill_dict["name"])
            # Inject as system message
            self.add_message("system", f"""
# SKILL LOADED: {skill_dict['name']}

The following skill has been loaded because it appears relevant:

{content}

Use this skill if appropriate for the user's request.
""")
Example: User asks “Can you scrape the pricing table from example.com?”
Relevance detection triggers on keywords: scrape, web. The system loads web_scraper to Level 2:
---
name: web_scraper
description: Extract content from websites using headless browser
type: executable
dependencies: [playwright, beautifulsoup4]
---
# Web Scraper Skill
## Overview
This skill uses Playwright to load JavaScript-heavy pages and BeautifulSoup for parsing.
## Usage
Call this skill when you need to:
- Extract data from dynamic websites
- Interact with pages requiring JavaScript
- Handle authentication or cookies
## Parameters
- `url` (string, required): The URL to scrape
- `selector` (string, optional): CSS selector for specific elements
- `wait_for` (string, optional): Selector to wait for before scraping
- `timeout` (int, optional): Timeout in seconds (default: 30)
## Example
```json
{
  "url": "https://example.com/products",
  "selector": ".product-card",
  "wait_for": ".price-loaded"
}
```

## Output
Returns JSON with:
- `html`: Raw HTML of the page
- `text`: Extracted text content
- `elements`: Array of matched elements (if selector provided)

## Error Handling
- Timeouts return error with partial content
- Network errors include status code
- Invalid selectors return empty elements array
This documentation (≈500 tokens) is now in context. The agent understands:
- When to use the skill (dynamic websites)
- What parameters to provide
- What output to expect
- How errors work
Crucially, this happens before the LLM generates a response. By the time Claude sees the user message, it already has the docs.
Level 3: The Execution Layer
Purpose: Register the skill as a callable tool in the agent’s execution environment.
Level 3 happens lazily—only when the agent decides to invoke the skill (not just consider it).
Tool Schema Discovery
Executable skills must support a --help-json flag that outputs their tool schema:
$ python ~/.config/cli-agent/skills/web_scraper/execute.py --help-json
{
  "name": "web_scraper",
  "description": "Scrape web content using headless browser",
  "parameters": {
    "type": "object",
    "properties": {
      "url": {
        "type": "string",
        "description": "URL to scrape"
      },
      "selector": {
        "type": "string",
        "description": "CSS selector for elements to extract"
      },
      "wait_for": {
        "type": "string",
        "description": "Selector to wait for before scraping"
      },
      "timeout": {
        "type": "integer",
        "description": "Timeout in seconds",
        "default": 30
      }
    },
    "required": ["url"]
  }
}
This schema is fetched via subprocess:
def _get_tool_schema(self, script_path: Path) -> Optional[Dict[str, Any]]:
    """Get tool schema by invoking script with --help-json."""
    try:
        result = subprocess.run(
            [sys.executable, str(script_path), "--help-json"],
            capture_output=True,
            text=True,
            timeout=5
        )
        if result.returncode == 0 and result.stdout.strip():
            return json.loads(result.stdout)
        return None
    except (subprocess.TimeoutExpired, json.JSONDecodeError) as e:
        logger.warning(f"Error getting schema from {script_path}: {e}")
        return None
Subprocess Wrapper Creation
The schema is used to register a callable tool:
def register_executables(self, name: str) -> Dict[str, Any]:
    """Register executable skill as a tool (Level 3)."""
    skill = self.skills.get(name)
    execute_script = skill.path / "execute.py"
    # Get schema
    schema = self._get_tool_schema(execute_script)
    # Create subprocess wrapper
    tool_func = self._make_subprocess_tool(execute_script, schema)
    # Register in tool executor
    self.tool_executor.register_tool(
        name=schema["name"],
        function=tool_func,
        schema=schema
    )
    skill.loaded_level = 3
    return {"success": True, "tool_name": schema["name"]}
def _make_subprocess_tool(self, script_path: Path, schema: Dict) -> callable:
    """Wrap script as a callable tool."""
    def tool(**kwargs):
        """Execute skill script as subprocess."""
        # Pass args as JSON
        cmd = [sys.executable, str(script_path), "--run-json", json.dumps(kwargs)]
        try:
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=60
            )
            if result.returncode == 0:
                # Parse JSON response
                return json.loads(result.stdout)
            else:
                return {
                    "success": False,
                    "error": result.stderr,
                    "exit_code": result.returncode
                }
        except subprocess.TimeoutExpired:
            return {"success": False, "error": "Skill execution timed out"}
        except Exception as e:
            return {"success": False, "error": str(e)}
    return tool
Now when Claude generates:
{
  "type": "tool_use",
  "name": "web_scraper",
  "input": {
    "url": "https://example.com/pricing",
    "selector": ".pricing-table",
    "wait_for": ".prices-loaded"
  }
}
The agent executor:
Looks up the registered web_scraper tool
Calls the subprocess wrapper
Spawns python execute.py --run-json '{"url": "...", ...}'
Waits up to 60s for completion
Parses JSON output
Returns result to agent
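The tool_executor referenced throughout isn't shown in this post; a minimal sketch of what register_tool and dispatch could look like (class and method names are assumptions):

```python
from typing import Any, Callable, Dict

class ToolExecutor:
    """Hypothetical minimal executor: maps tool names to callables and their schemas."""

    def __init__(self):
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._schemas: Dict[str, Dict[str, Any]] = {}

    def register_tool(self, name: str, function: Callable[..., Any], schema: Dict[str, Any]) -> None:
        self._tools[name] = function
        self._schemas[name] = schema

    def execute(self, name: str, arguments: Dict[str, Any]) -> Any:
        """Dispatch a tool_use block from the model to the registered subprocess wrapper."""
        if name not in self._tools:
            return {"success": False, "error": f"Unknown tool: {name}"}
        return self._tools[name](**arguments)
```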
The skill script (execute.py) looks like:
#!/usr/bin/env python3
import json
import sys

from playwright.sync_api import sync_playwright

def scrape(url: str, selector: str = None, wait_for: str = None, timeout: int = 30):
    """Execute the scraping."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, timeout=timeout * 1000)
        if wait_for:
            page.wait_for_selector(wait_for, timeout=timeout * 1000)
        html = page.content()
        text = page.inner_text("body")
        if selector:
            elements = page.query_selector_all(selector)
            extracted = [el.inner_text() for el in elements]
        else:
            extracted = []
        browser.close()
    return {
        "success": True,
        "html": html,
        "text": text,
        "elements": extracted
    }

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "--help-json":
        # Output schema
        schema = {
            "name": "web_scraper",
            "description": "Scrape web content using headless browser",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to scrape"},
                    "selector": {"type": "string", "description": "CSS selector"},
                    "wait_for": {"type": "string", "description": "Selector to wait for"},
                    "timeout": {"type": "integer", "description": "Timeout in seconds", "default": 30}
                },
                "required": ["url"]
            }
        }
        print(json.dumps(schema))
        sys.exit(0)
    elif len(sys.argv) > 2 and sys.argv[1] == "--run-json":
        # Execute with JSON params
        params = json.loads(sys.argv[2])
        result = scrape(**params)
        print(json.dumps(result))
        sys.exit(0)
    else:
        print("Usage: execute.py --help-json | --run-json '{...}'")
        sys.exit(1)
Real-World Performance Characteristics
Token Savings
In our production deployment with 42 skills:
| Approach | Tokens at Startup | Tokens per Conversation | Tokens at 10 Convos |
| --- | --- | --- | --- |
| Load all upfront | 21,000 | 21,000 | 210,000 |
| Lazy Skills (L1 only) | 630 | +250 avg (1-2 skills loaded) | 3,130 |
| Savings | 97% | 98.8% | 98.5% |
With 200K context window, loading all skills upfront consumes 10.5% immediately. Lazy Skills use 0.3%.
Latency Analysis
Level 3 registration adds latency:
Relevance detection: ~2ms (keyword matching)
Schema fetch: ~50ms (subprocess spawn + --help-json)
Tool registration: ~1ms (function wrapper creation)
-------------------------------------------------------------
Total Level 3 overhead: ~53ms
For a skill invoked 5 times in a session:
First call: 53ms overhead
Subsequent calls: 0ms (already registered)
Amortized: 10.6ms per call
Compared to bash invocation overhead (~20-50ms per call), this is competitive after 2-3 uses.
Cache Hit Rates
Over 1,000 conversations in our test corpus:
L1 always loaded: 42 skills, 100% hit rate
L2 loaded: 2.3 skills per conversation on average (5.5% of library)
L3 registered: 0.8 skills per conversation (1.9% of library)
Most conversations use 0-3 skills, meaning 95-100% of the skill library remains at L1 (metadata only).
Implementation Patterns and Best Practices
Designing Good Skill Metadata
The description field is critical—it drives relevance detection. Effective patterns:
Good:
name: database_schema_analyzer
description: Analyze database schemas for normalization issues and suggest improvements
Bad (too generic):
name: db_tool
description: Database utility
Best (includes keywords):
name: database_schema_analyzer
description: Analyze PostgreSQL/MySQL schemas, detect normalization violations, suggest foreign keys and indexes
Include specific technologies, action verbs, and domain terms users might mention.
Skill Directory Structure
Recommended layout:
~/.config/cli-agent/skills/
├── web_scraper/
│ ├── SKILL.md # Metadata + docs
│ ├── execute.py # Main executable
│ ├── requirements.txt # Python deps
│ ├── examples/
│ │ └── sample.json # Example inputs
│ └── tests/
│ └── test_scraper.py
├── pdf_extractor/
│ ├── SKILL.md
│ ├── execute.py
│ └── requirements.txt
└── composite_deploy/
├── SKILL.md
└── workflow.py # Orchestrates other tools
Each skill is fully self-contained with its own dependencies.
Dependency Management
Skills can specify Python dependencies:
---
name: web_scraper
dependencies: [playwright==1.40.0, beautifulsoup4>=4.12.0]
---
On first load, the agent can optionally:
Check if dependencies are installed (importlib.util.find_spec)
Offer to install them (pip install -r requirements.txt)
Sandbox installation per skill (virtualenv per skill directory)
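A sketch of the first step, assuming requirement specs are parsed naively from the frontmatter list (the alias table is a hand-maintained assumption for packages whose import name differs from their package name):

```python
import importlib.util
import re
from typing import List

def missing_dependencies(dependencies: List[str]) -> List[str]:
    """Return the frontmatter dependencies whose top-level module cannot be imported."""
    aliases = {"beautifulsoup4": "bs4"}  # import name differs from package name
    missing = []
    for spec in dependencies:
        package = re.split(r"[<>=\[]", spec, maxsplit=1)[0].strip()
        module = aliases.get(package, package)
        if importlib.util.find_spec(module) is None:
            missing.append(spec)
    return missing
```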
Current implementation uses shared environment, but subprocess isolation enables per-skill venvs:
def _make_subprocess_tool(self, script_path: Path, schema: Dict) -> callable:
    # Detect if skill has its own venv
    skill_venv = script_path.parent / ".venv" / "bin" / "python"
    python_executable = skill_venv if skill_venv.exists() else sys.executable
    def tool(**kwargs):
        cmd = [str(python_executable), str(script_path), "--run-json", json.dumps(kwargs)]
        # ... rest of execution
Error Handling and Timeouts
Skills must handle errors gracefully:
def scrape(url: str, timeout: int = 30):
    try:
        with sync_playwright() as p:
            # ... scraping logic ...
            return {"success": True, "data": result}
    except TimeoutError:
        return {
            "success": False,
            "error": f"Page load timed out after {timeout}s",
            "partial_data": None  # Include any partial results
        }
    except Exception as e:
        return {
            "success": False,
            "error": str(e),
            "error_type": type(e).__name__
        }
The subprocess wrapper enforces a hard 60s timeout regardless of skill-internal timeouts.
Security Considerations
Subprocess isolation provides some security boundaries:
Process isolation: Skills can’t directly mutate agent state
Timeout enforcement: Runaway scripts are killed
Restricted I/O: Skills communicate only via JSON stdin/stdout
No shared memory: Skills can’t access agent memory
However, skills still run with the agent’s user permissions and can:
Read/write files in the workspace
Make network requests
Execute arbitrary code
Mitigation strategies:
# Sandboxed execution with restricted permissions
def _make_subprocess_tool(self, script_path: Path, schema: Dict) -> callable:
    def tool(**kwargs):
        # Run with restricted network (Linux only)
        env = os.environ.copy()
        env['http_proxy'] = 'http://localhost:9999'  # Blocked proxy
        # Run with resource limits (Linux only)
        cmd = ['timeout', '60s', sys.executable, str(script_path), ...]
        result = subprocess.run(
            cmd,
            env=env,
            capture_output=True,
            timeout=60,
            # Future: use bubblewrap/firejail for sandboxing
        )
        # ...
For production deployments, consider:
Container-based execution (Docker/Podman)
Capability restrictions (no network, read-only filesystem)
Signed skills (GPG signatures on skill directories)
Audit logging (log all skill executions with params)
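For the last item, a sketch of per-execution audit logging layered onto the subprocess wrapper (hypothetical helper using the standard logging module):

```python
import json
import logging
import time

audit_logger = logging.getLogger("skills.audit")

def audited(tool_name: str, tool_func):
    """Wrap a registered skill tool so every invocation is logged with its parameters and outcome."""
    def wrapper(**kwargs):
        start = time.monotonic()
        audit_logger.info("skill=%s params=%s", tool_name, json.dumps(kwargs, default=str))
        result = tool_func(**kwargs)
        audit_logger.info("skill=%s success=%s elapsed=%.2fs",
                          tool_name, result.get("success", True), time.monotonic() - start)
        return result
    return wrapper
```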
Scaling to Hundreds of Skills
Current keyword-based relevance detection works well for <50 skills. Beyond that, precision degrades (too many false positives).
Semantic Relevance (Future Enhancement)
Replace keyword matching with embedding similarity:
import numpy as np
from sentence_transformers import SentenceTransformer

class SkillRegistry:
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.skill_embeddings = {}  # Cache embeddings

    def scan(self):
        """Compute embeddings for all skill descriptions at scan time."""
        for skill in self.skills.values():
            # Combine name + description for richer embedding
            text = f"{skill.name}: {skill.description}"
            embedding = self.embedding_model.encode(text)
            self.skill_embeddings[skill.name] = embedding

    async def _check_and_load_relevant_skills(self, user_message: str):
        """Use semantic similarity instead of keyword matching."""
        # Embed user message
        msg_embedding = self.embedding_model.encode(user_message)
        # Compute cosine similarity with all skill embeddings
        similarities = {}
        for skill_name, skill_emb in self.skill_embeddings.items():
            sim = np.dot(msg_embedding, skill_emb) / (
                np.linalg.norm(msg_embedding) * np.linalg.norm(skill_emb)
            )
            similarities[skill_name] = sim
        # Load top-k most similar skills (e.g., top 3)
        top_skills = sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:3]
        for skill_name, similarity in top_skills:
            if similarity > 0.5:  # Threshold
                content = self.load_level2(skill_name)
                # ... inject into context ...
Trade-offs:
Pro: Much better precision, works with 100+ skills
Pro: Handles synonyms and paraphrasing
Con: Adds embedding model as dependency (~80MB)
Con: Slower (50-100ms to embed message + compute similarities)
For large deployments, this is worth it.
Skill Categories and Namespaces
Organize skills hierarchically:
skills/
├── web/
│ ├── scraper/
│ ├── api_client/
│ └── browser_automation/
├── data/
│ ├── csv_processor/
│ ├── json_transformer/
│ └── database_query/
└── code/
├── complexity_analyzer/
├── test_generator/
└── refactoring/
Reference as web.scraper, data.csv_processor, etc.
Namespacing enables:
Scoped loading: Only scan web.* skills for web-related tasks
Conflict avoidance: Multiple skills named parser can coexist in different namespaces
Organizational structure: Mirrors team/domain boundaries
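A sketch of how dotted names could be derived during scanning (assumed behavior, not part of the current implementation):

```python
from pathlib import Path

def namespaced_name(skills_root: Path, skill_md: Path) -> str:
    """Derive a dotted skill name from its location, e.g. skills/web/scraper/SKILL.md -> 'web.scraper'."""
    relative = skill_md.parent.relative_to(skills_root)
    return ".".join(relative.parts)
```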
Conclusion: Toward Self-Extending Agents
Lazy Skills demonstrates that AI agents don’t need omniscience to be capable. By deferring capability loading until the moment of relevance, we can build systems that scale to hundreds of specialized skills without drowning in context.
The three-level architecture—metadata discovery, documentation loading, tool registration—provides a blueprint for runtime extensibility. Whether you’re building coding assistants, workflow automation, customer support agents, or multi-agent systems, this pattern solves a fundamental tension: capability breadth versus context efficiency.
What We’ve Learned
Progressive disclosure works: 97% token savings with no capability loss
Subprocess isolation is underrated: Dependency management and fault tolerance are critical
Relevance detection is the bottleneck: Keyword matching is good enough for <50 skills, semantic similarity needed beyond
Tool registration beats bash invocation: First-class tools are cleaner and faster
Skill types matter: Executables, contextual guides, and composite workflows need different loading strategies
Future Directions
The next frontier is self-extending agents: systems that can discover, install, and create their own skills autonomously.
Imagine an agent that:
Discovers skill gaps: “I need to parse Excel files but have no skill for that”
Searches skill repositories: Queries a GitHub/npm-style registry of community skills
Installs autonomously: Downloads, audits, and registers new skills at runtime
Creates new skills: Writes its own SKILL.md + execute.py to codify learned behaviors
Shares improvements: Publishes refined skills back to the community registry
This requires:
Skill registries: Centralized/decentralized skill discovery (think npm, but for agent capabilities)
Automated auditing: LLM-based code review to detect malicious skills
Skill composition: Combine existing skills into higher-order capabilities
Versioning and rollback: Skills evolve, agents need to handle breaking changes
Lazy Skills provides the foundation. The skill format (YAML + Markdown + subprocess contract) is simple enough for agents to generate themselves. The three-level loading ensures new skills don’t bloat context.
We’re building toward agents that extend their own capabilities, learning and sharing procedural knowledge like developers share code.
Key Takeaways
Context windows are finite: Load capabilities lazily, not eagerly
Three levels are optimal: Metadata (cheap), docs (on-demand), execution (lazy)
Subprocess isolation: Cleaner, safer, more fault-tolerant than bash invocation
Relevance detection is critical: Invest here as skill count grows
Skill types guide loading: Executables, contextual, composite have different needs
Extensibility without code changes: Users install skills by dropping files
Toward self-extension: Agents that create and share their own capabilities
Skills are just functions. Lazy Skills makes them composable, discoverable, and context-efficient.
