Lazy Skills: A Token-Efficient Approach to Dynamic Agent Capabilities
Infinitely scaling CLI agents
The Context Window Crisis
AI coding assistants face a fundamental dilemma: breadth versus depth.
Modern agents need access to dozens, if not hundreds, of specialized capabilities—web scraping, database operations, code analysis, API integrations, domain-specific workflows. Yet every tool schema, instruction set, and reference document consumes precious context tokens. Load everything upfront and you exhaust the context window before the user even asks their first question. Load nothing and your agent is powerless.
Traditional approaches offer two unsatisfying solutions:
Static tool sets: Hard-code 10-20 carefully chosen tools, limiting extensibility
Context stuffing: Load all documentation upfront, wasting tokens on irrelevant capabilities
Neither scales. As agent systems grow more sophisticated, we need a fundamentally different approach.
Enter Lazy Skills: Progressive Capability Loading
We’ve built what we call Lazy Skills—a three-tiered progressive disclosure system that loads agent capabilities on-demand, keeping context lean while maintaining extensibility. The name reflects the philosophy: be lazy about loading, aggressive about relevance detection, and smart about what gets promoted into context.
This isn’t just an optimization. It’s a different mental model for how agents acquire capabilities at runtime.
Level 1: Metadata-Only Discovery
What: Skill name and one-line description
When: Always loaded, injected into system prompt
Cost: ~10-20 tokens per skill
def list_level1(self) -> List[Dict[str, str]]:
    """Get Level 1 skill metadata for system prompt injection."""
    return [
        {"name": skill.name, "description": skill.description}
        for skill in self.skills.values()
        if skill.enabled
    ]
At startup, the agent scans skill directories for SKILL.md files with YAML frontmatter:
---
name: web_scraper
description: Extract content from web pages using headless browser
type: executable
auto_load: true
dependencies: [playwright, beautifulsoup4]
---
Only the name and description are injected into the system prompt, making the agent aware of capabilities without consuming tokens on implementation details.
Level 2: On-Demand Content Loading
What: Full skill documentation (markdown body)
When: User message indicates relevance
Cost: 200-2000 tokens per skill
def load_level2(self, name: str) -> Optional[str]:
    """Load Level 2 content (full SKILL.md body)."""
    skill = self.skills.get(name)
    if not skill:
        return None
    if skill.loaded_level < 2:
        skill.loaded_level = 2
        logger.info(f"Loaded Level 2 for skill: {name}")
    return self._skill_bodies.get(name)
When the conversation context suggests a skill is relevant (keyword matching on user input), the agent loads the complete documentation. This includes:
Usage instructions
Parameter descriptions
Examples
Best practices
The agent can now reason about how to use the capability.
Level 3: Lazy Tool Registration
What: Executable code registered as callable tools
When: Agent decides to invoke the skill
Cost: Variable (subprocess overhead + execution time)
def register_executables(self, name: str) -> Dict[str, Any]:
    """Load Level 3 - register executable skills as tools."""
    skill = self.skills.get(name)
    # Find executable script
    execute_script = skill.path / "execute.py"
    # Get tool schema via subprocess
    schema = self._get_tool_schema(execute_script)
    # Create tool wrapper
    tool_func = self._make_subprocess_tool(execute_script, schema)
    # Register with executor
    self.tool_executor.register_tool(
        name=schema.get("name", name),
        function=tool_func,
        schema=schema
    )
    skill.loaded_level = 3
    return {"success": True, "tool_name": schema.get("name", name)}
Only when the agent commits to using a skill does the system:
Invoke the script with --help-json to get the tool schema
Create a subprocess wrapper
Register it as an executable tool
This defers the cost of tool registration until actually needed.
Implementation Architecture
Skill Discovery
Skills live in ~/.config/cli-agent/skills/ as self-contained directories:
skills/
├── web_scraper/
│ ├── SKILL.md # Metadata + docs
│ ├── execute.py # Tool implementation
│ └── requirements.txt
└── code_reviewer/
├── SKILL.md
└── execute.py
On startup, the registry scans for SKILL.md files:
def scan(self, auto_enable: bool = True) -> int:
    """Scan skills directories for SKILL.md files."""
    discovered = 0
    for skills_dir in self.skills_dirs:
        for skill_md in skills_dir.rglob("SKILL.md"):
            metadata, body = parse_skill_md(skill_md)
            skill_meta = SkillMetadata(
                name=metadata.get("name"),
                description=metadata.get("description"),
                skill_type=metadata.get("type", "contextual"),
                path=skill_md.parent,
                enabled=auto_enable and metadata.get("auto_load", False),
                loaded_level=1
            )
            self.skills[skill_meta.name] = skill_meta
            self._skill_bodies[skill_meta.name] = body
            discovered += 1
    return discovered
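The parse_skill_md helper isn't shown here; a minimal sketch of what it could look like, assuming standard YAML frontmatter delimited by --- markers and PyYAML as the parser:

```python
import yaml  # assumed parser for the frontmatter block
from pathlib import Path
from typing import Dict, Tuple

def parse_skill_md(path: Path) -> Tuple[Dict, str]:
    """Split a SKILL.md file into its YAML frontmatter (metadata) and markdown body."""
    text = path.read_text(encoding="utf-8")
    if text.startswith("---"):
        # Everything between the first two '---' delimiters is frontmatter
        _, frontmatter, body = text.split("---", 2)
        return yaml.safe_load(frontmatter) or {}, body.strip()
    return {}, text  # No frontmatter: treat the whole file as body
```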
Skill Types
The system supports three skill types:
Executable: Runnable tools (web scrapers, API clients)
Contextual: Knowledge/guidelines (coding standards, architectural patterns)
Composite: Multi-step workflows orchestrating other tools
Type determines loading strategy—contextual skills stop at Level 2, while executables can reach Level 3.
Subprocess-Based Isolation
Executable skills run as subprocesses, providing:
Dependency isolation: Each skill can have its own Python dependencies
Security boundary: Skills can’t directly access agent internals
Fault tolerance: Crashed skills don’t crash the agent
def _make_subprocess_tool(self, script_path: Path, schema: Dict[str, Any]) -> callable:
    """Create a callable tool wrapper for a subprocess script."""
    def tool(**kwargs):
        cmd = [sys.executable, str(script_path), "--run-json", json.dumps(kwargs)]
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            return json.loads(result.stdout)
        else:
            return {"success": False, "error": result.stderr}
    return tool
Benefits
Token Efficiency
Startup: 10-20 tokens × number of skills (vs. full tool schemas)
Conversation: Only relevant skills loaded
Example: 50 skills = ~500 tokens baseline instead of ~15,000
Extensibility
Users can add skills without code changes:
Create SKILL.md with frontmatter
Add execute.py if executable
Drop in skills directory
Restart agent
Contextual Awareness
The agent knows what it can do (Level 1) before deciding what to actually load (Level 2/3).
Separation of Concerns
Agent core: Conversation orchestration
Skill registry: Capability discovery and loading
Skills: Self-contained, versioned, documented
Trade-offs
Latency
Level 3 registration adds subprocess overhead (~50-200ms per skill). For interactive tools, this is acceptable. For batch operations, pre-registration may be preferred.
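For the batch case, pre-registration is just an eager loop over the registry at startup. A sketch, assuming the registry API shown elsewhere in this post (preregister_all_executables is a hypothetical helper):

```python
def preregister_all_executables(registry) -> None:
    """Hypothetical startup hook: promote every enabled executable skill to Level 3 eagerly.

    Trades ~50-200ms per skill at startup for zero registration latency during the run.
    """
    for skill in registry.skills.values():
        if skill.enabled and skill.skill_type == "executable":
            registry.register_executables(skill.name)
```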
Relevance Detection
Current implementation uses simple keyword matching. More sophisticated approaches (semantic similarity, LLM-based relevance scoring) could improve precision.
Metadata Maintenance
Requires discipline to keep frontmatter accurate. Stale descriptions lead to poor loading decisions.
How Lazy Skills Differs from Anthropic’s Agent Skills
Anthropic recently introduced Agent Skills, a similar approach to progressive disclosure. Both systems share core ideas—YAML frontmatter, SKILL.md files, dynamic loading—but differ significantly in implementation philosophy and technical approach.
Shared Foundation
Both approaches recognize the same core problem: agents need access to specialized knowledge without overwhelming the context window. Both use:
Markdown-based skill definitions (SKILL.md)
YAML frontmatter for metadata (name, description)
Progressive disclosure (load metadata first, content second)
File bundling (skills can reference additional resources)
This convergence suggests we’ve identified a genuine pattern in agent system design.
The Critical Architectural Difference: Opt-In vs. Always-On Context
The most fundamental difference lies in what gets loaded into the system prompt:
Anthropic’s Approach (Always-On Context):
Loads all skill metadata (name + description) into every system prompt at startup
Skills are always “visible” to the agent—it can see the full library of available capabilities
Agent autonomously decides relevance by matching tasks against loaded metadata
Scales through progressive disclosure within skills (Level 1 → Level 2 → Level 3+)
Lazy Skills (Opt-In Context):
Only loads enabled skills into the system prompt (list_level1() filters by skill.enabled)
Skills must be explicitly enabled/disabled by users, or auto-enabled via auto_load: true
Metadata is discovered and cached at startup, but not injected unless enabled
Scales by loading fewer skills into context by default
The Trade-off:
| Aspect | Anthropic (Always-On) | Lazy Skills (Opt-In) |
| --- | --- | --- |
| Discoverability | ✅ Agent can suggest any installed skill | ❌ Agent unaware of disabled skills |
| Baseline token cost | Higher (all skill metadata in prompt) | Lower (only enabled skills) |
| User control | Minimal (agent decides what to load) | Explicit (users control what’s visible) |
| Scaling | Progressive disclosure within skills | Progressive disclosure + selective enabling |
Why we chose opt-in:
In production deployments with dozens of skills, we found users often have specialized skill libraries (e.g., web scraping, database tools, cloud APIs) that are only relevant to specific projects. Loading all metadata upfront meant agents would occasionally suggest irrelevant skills or waste tokens on capabilities the user would never need.
By requiring explicit enabling, we push the relevance decision to the user: “Which skills might I need for this project?” This aligns with how developers think about dependencies—you don’t import every library in your site-packages, you declare what you need.
The best of both worlds:
A hybrid approach could combine the strengths:
def list_level1(self, mode: str = "enabled") -> List[Dict[str, str]]:
    """Get Level 1 skill metadata for system prompt injection.

    mode: "enabled" (default) | "all" (Anthropic-style) | "auto" (smart filtering)
    """
    if mode == "all":
        skills = self.skills.values()
    elif mode == "auto":
        # Use lightweight LLM to select relevant skills based on project context
        skills = self._auto_select_skills(max_skills=10)
    else:  # "enabled"
        skills = [s for s in self.skills.values() if s.enabled]
    return [{"name": s.name, "description": s.description} for s in skills]
This would allow power users to switch between “show me everything” (Anthropic mode) and “only what I’ve enabled” (Lazy Skills mode).
Key Innovations in Lazy Skills
1. Three-Level Loading (vs. Two-Level)
Anthropic’s system has two levels:
Level 1: Metadata (name, description) in system prompt
Level 2: Full SKILL.md and bundled files loaded via the Bash tool
Lazy Skills adds a critical third level:
Level 1: Metadata only (identical to Anthropic)
Level 2: Full skill documentation (similar to Anthropic)
Level 3: Executable tool registration (novel)
This third level is the key innovation. Instead of Claude needing to invoke Bash to run scripts, Level 3 skills register themselves as first-class tools in the agent’s tool executor. The LLM sees them in its tool schema and can invoke them directly.
Why this matters:
Simpler invocation: Agent calls web_scraper(url="...") instead of bash(command="python skill.py --url ...")
Type safety: Tool schemas provide parameter validation
Better error handling: Failed skill execution returns structured JSON, not raw stderr
Cleaner traces: Tool calls show semantic intent, not bash gymnastics
2. Subprocess-Based Isolation
Anthropic’s approach uses bash to execute skill code within the same context. Lazy Skills isolate executable skills in subprocesses with their own dependencies.
def _make_subprocess_tool(self, script_path: Path, schema: Dict[str, Any]) -> callable:
    """Create a callable tool wrapper for a subprocess script."""
    def tool(**kwargs):
        cmd = [sys.executable, str(script_path), "--run-json", json.dumps(kwargs)]
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            return json.loads(result.stdout)
        else:
            return {"success": False, "error": result.stderr}
    return tool
Benefits over bash execution:
Dependency isolation: Each skill can have its own requirements.txt
Security boundary: Skills can’t directly access agent internals or modify state
Fault tolerance: Crashed skills don’t crash the agent
Timeout enforcement: Runaway scripts are killed after 60s
Cleaner interface: Skills communicate via JSON stdin/stdout, not CLI argument parsing
3. Programmatic Relevance Detection
Anthropic relies on Claude to decide when to load skills by reading the system prompt metadata. This is elegant but means relevance detection happens inside the LLM, consuming inference cycles.
Lazy Skills perform pre-inference relevance filtering in Python:
async def _check_and_load_relevant_skills(self, user_message: str):
    """Check if any enabled skills are relevant before LLM inference."""
    enabled_skills = self.skill_registry.list_skills(enabled_only=True, include_details=True)
    user_message_lower = user_message.lower()
    for skill_dict in enabled_skills:
        # Skip already-loaded skills
        if skill_dict["loaded_level"] >= 2:
            continue
        # Build keyword set from metadata and type
        skill_keywords = [
            skill_dict["name"].lower(),
            skill_dict["description"].lower(),
        ]
        if skill_dict["type"] == "executable":
            skill_keywords.extend(["run", "execute", "call", "api"])
        # Simple keyword matching
        is_relevant = any(keyword in user_message_lower for keyword in skill_keywords)
        if is_relevant:
            skill_content = self.skill_registry.load_level2(skill_dict["name"])
            # Inject as system message before LLM call
            self.add_message("system", f"# SKILL LOADED: {skill_dict['name']}\n\n{skill_content}")
Trade-offs:
Pro: Faster, no inference overhead, deterministic
Pro: Skills are loaded before the LLM sees the message, so they’re available immediately
Con: Less sophisticated relevance detection (keyword matching vs. semantic understanding)
Con: May load irrelevant skills if keywords match spuriously
In practice, simple keyword matching works surprisingly well for skill counts under ~50. For larger skill libraries, we could swap in semantic similarity (embedding-based matching) without changing the architecture.
4. Skill Types as Loading Strategy Hints
Both systems support different skill types, but Lazy Skills use type information to change loading behavior:
from enum import Enum

class SkillType(Enum):
    EXECUTABLE = "executable"   # Can reach Level 3 (tool registration)
    CONTEXTUAL = "contextual"   # Stops at Level 2 (documentation only)
    COMPOSITE = "composite"     # Level 3 for orchestration
Contextual skills never register as tools—they’re pure documentation/guidance. The agent uses them to inform its behavior but can’t “call” them. This prevents bloat in the tool schema.
Executable skills can be promoted to Level 3 when the agent explicitly decides to invoke them, registering a subprocess-wrapped tool.
Composite skills can orchestrate multiple tools or skills, potentially registering higher-order tools that coordinate sub-capabilities.
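A minimal sketch of how type could cap the level a skill is promoted to (promote and MAX_LEVEL are hypothetical names; the registry in this post exposes load_level2 and register_executables separately):

```python
MAX_LEVEL = {
    SkillType.EXECUTABLE: 3,   # may be registered as a tool
    SkillType.CONTEXTUAL: 2,   # documentation only, never a tool
    SkillType.COMPOSITE: 3,    # may register an orchestration tool
}

def promote(self, name: str, target_level: int) -> int:
    """Load a skill up to target_level, clamped by what its type allows."""
    skill = self.skills[name]
    level = min(target_level, MAX_LEVEL[SkillType(skill.skill_type)])
    if level >= 2:
        self.load_level2(name)
    if level >= 3:
        self.register_executables(name)
    return level
```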
Anthropic’s system doesn’t explicitly distinguish loading strategies by type, treating all skills uniformly.
5. Auto-Discovery and User Installation
Lazy Skills support both:
Built-in skill directories (shipped with the agent)
User skill directories (~/.config/cli-agent/skills/)
Users can install skills by dropping directories into the config location:
# Install a skill
mkdir -p ~/.config/cli-agent/skills/my_skill
cd ~/.config/cli-agent/skills/my_skill
cat > SKILL.md <<EOF
---
name: my_custom_tool
description: Does something amazing
type: executable
auto_load: true
---
# My Custom Tool
Usage instructions here...
EOF
# Write execute.py
cat > execute.py <<EOF
#!/usr/bin/env python3
import json, sys
# Tool implementation...
EOF
chmod +x execute.py
On next startup, the skill is auto-discovered, metadata loaded to Level 1, and appears in the agent’s capability list. No code changes to the agent itself.
Comparison Table
| Feature | Anthropic Agent Skills | Lazy Skills |
| --- | --- | --- |
| Metadata loading | System prompt injection | System prompt injection |
| Content loading | Bash tool reads SKILL.md | Pre-injection before inference |
| Tool registration | Manual bash invocation | Automatic subprocess wrapper |
| Relevance detection | LLM-based (Claude decides) | Programmatic (keyword matching) |
| Skill isolation | Same process (bash) | Subprocess with timeout |
| Dependency management | Shared environment | Per-skill requirements.txt |
| Skill types | Informal categorization | Formal types with different loading |
| Loading overhead | LLM inference for discovery | Pre-inference filtering |
| Security model | Trust-based (audit skills) | Process isolation + timeout |
| Extensibility | Drop files, restart | Drop files, restart |
Deep Dive: The Three-Level Architecture
Let’s examine each level in detail with concrete examples.
Level 1: The Awareness Layer
Purpose: Make the agent aware of what it could do, without committing tokens to how to do it.
When the agent starts, it scans all enabled skills and extracts minimal metadata:
def list_level1(self) -> List[Dict[str, str]]:
    """Get Level 1 skill metadata for system prompt injection."""
    return [
        {"name": skill.name, "description": skill.description}
        for skill in self.skills.values()
        if skill.enabled
    ]

# Example output:
# [
#   {"name": "web_scraper", "description": "Extract content from websites using headless browser"},
#   {"name": "pdf_extractor", "description": "Extract text and tables from PDF documents"},
#   {"name": "code_complexity", "description": "Analyze code complexity metrics per ISO 25010"}
# ]
This gets injected into the system prompt:
You are an AI coding assistant with the following capabilities:
Enabled skills:
- web_scraper: Extract content from websites using headless browser
- pdf_extractor: Extract text and tables from PDF documents
- code_complexity: Analyze code complexity metrics per ISO 25010
[... rest of system prompt ...]
Token cost: ~15 tokens per skill × 50 skills = 750 tokens total for the entire skill library.
Compare this to loading full documentation: ~500 tokens per skill × 50 skills = 25,000 tokens—a 33× reduction.
The agent now knows it has a web scraper skill. It doesn’t yet know how to use it, but it knows to consider it when users ask web-related questions.
Level 2: The Documentation Layer
Purpose: Teach the agent how to use a capability when it becomes relevant.
When a user message suggests a skill might be useful, the system loads the full SKILL.md body:
async def _check_and_load_relevant_skills(self, user_message: str):
    """Load Level 2 content for relevant skills before LLM call."""
    enabled_skills = self.skill_registry.list_skills(enabled_only=True, include_details=True)
    for skill_dict in enabled_skills:
        # Skip already-loaded
        if skill_dict["loaded_level"] >= 2:
            continue
        # Build keyword set from name, description, and type
        keywords = [skill_dict["name"].lower(), skill_dict["description"].lower()]
        # Add type-specific keywords
        if skill_dict["type"] == "executable":
            keywords.extend(["run", "execute", "api", "script"])
        # Check relevance
        if any(kw in user_message.lower() for kw in keywords):
            # Load full content
            content = self.skill_registry.load_level2(skill_dict["name"])
            # Inject as system message
            self.add_message("system", f"""
# SKILL LOADED: {skill_dict['name']}

The following skill has been loaded because it appears relevant:

{content}

Use this skill if appropriate for the user's request.
""")
Example: User asks “Can you scrape the pricing table from example.com?”
Relevance detection triggers on keywords: scrape, web. The system loads web_scraper to Level 2:
---
name: web_scraper
description: Extract content from websites using headless browser
type: executable
dependencies: [playwright, beautifulsoup4]
---
# Web Scraper Skill
## Overview
This skill uses Playwright to load JavaScript-heavy pages and BeautifulSoup for parsing.
## Usage
Call this skill when you need to:
- Extract data from dynamic websites
- Interact with pages requiring JavaScript
- Handle authentication or cookies
## Parameters
- `url` (string, required): The URL to scrape
- `selector` (string, optional): CSS selector for specific elements
- `wait_for` (string, optional): Selector to wait for before scraping
- `timeout` (int, optional): Timeout in seconds (default: 30)
## Example
```json
{
  "url": "https://example.com/products",
  "selector": ".product-card",
  "wait_for": ".price-loaded"
}
```

## Output
Returns JSON with:
- `html`: Raw HTML of the page
- `text`: Extracted text content
- `elements`: Array of matched elements (if selector provided)

## Error Handling
- Timeouts return error with partial content
- Network errors include status code
- Invalid selectors return empty elements array
This documentation (≈500 tokens) is now in context. The agent understands:
- When to use the skill (dynamic websites)
- What parameters to provide
- What output to expect
- How errors work
Crucially, this happens before the LLM generates a response. By the time Claude sees the user message, it already has the docs.
Level 3: The Execution Layer
Purpose: Register the skill as a callable tool in the agent’s execution environment.
Level 3 happens lazily—only when the agent decides to invoke the skill (not just consider it).
Tool Schema Discovery
Executable skills must support a --help-json flag that outputs their tool schema:
$ python ~/.config/cli-agent/skills/web_scraper/execute.py --help-json
{
  "name": "web_scraper",
  "description": "Scrape web content using headless browser",
  "parameters": {
    "type": "object",
    "properties": {
      "url": {
        "type": "string",
        "description": "URL to scrape"
      },
      "selector": {
        "type": "string",
        "description": "CSS selector for elements to extract"
      },
      "wait_for": {
        "type": "string",
        "description": "Selector to wait for before scraping"
      },
      "timeout": {
        "type": "integer",
        "description": "Timeout in seconds",
        "default": 30
      }
    },
    "required": ["url"]
  }
}
This schema is fetched via subprocess:
def _get_tool_schema(self, script_path: Path) -> Optional[Dict[str, Any]]:
    """Get tool schema by invoking script with --help-json."""
    try:
        result = subprocess.run(
            [sys.executable, str(script_path), "--help-json"],
            capture_output=True,
            text=True,
            timeout=5
        )
        if result.returncode == 0 and result.stdout.strip():
            return json.loads(result.stdout)
        return None
    except (subprocess.TimeoutExpired, json.JSONDecodeError) as e:
        logger.warning(f"Error getting schema from {script_path}: {e}")
        return None
Subprocess Wrapper Creation
The schema is used to register a callable tool:
def register_executables(self, name: str) -> Dict[str, Any]:
    """Register executable skill as a tool (Level 3)."""
    skill = self.skills.get(name)
    execute_script = skill.path / "execute.py"
    # Get schema
    schema = self._get_tool_schema(execute_script)
    # Create subprocess wrapper
    tool_func = self._make_subprocess_tool(execute_script, schema)
    # Register in tool executor
    self.tool_executor.register_tool(
        name=schema["name"],
        function=tool_func,
        schema=schema
    )
    skill.loaded_level = 3
    return {"success": True, "tool_name": schema["name"]}
def _make_subprocess_tool(self, script_path: Path, schema: Dict) -> callable:
    """Wrap script as a callable tool."""
    def tool(**kwargs):
        """Execute skill script as subprocess."""
        # Pass args as JSON
        cmd = [sys.executable, str(script_path), "--run-json", json.dumps(kwargs)]
        try:
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=60
            )
            if result.returncode == 0:
                # Parse JSON response
                return json.loads(result.stdout)
            else:
                return {
                    "success": False,
                    "error": result.stderr,
                    "exit_code": result.returncode
                }
        except subprocess.TimeoutExpired:
            return {"success": False, "error": "Skill execution timed out"}
        except Exception as e:
            return {"success": False, "error": str(e)}
    return tool
Now when Claude generates:
{
  "type": "tool_use",
  "name": "web_scraper",
  "input": {
    "url": "https://example.com/pricing",
    "selector": ".pricing-table",
    "wait_for": ".prices-loaded"
  }
}
The agent executor:
Looks up the registered web_scraper tool
Calls the subprocess wrapper
Spawns python execute.py --run-json '{"url": "...", ...}'
Waits up to 60s for completion
Parses JSON output
Returns result to agent
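The tool_executor referenced throughout isn't shown in this post; a minimal sketch of what register_tool and dispatch could look like (class and method names are assumptions):

```python
from typing import Any, Callable, Dict

class ToolExecutor:
    """Hypothetical minimal executor: maps tool names to callables and their schemas."""

    def __init__(self):
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._schemas: Dict[str, Dict[str, Any]] = {}

    def register_tool(self, name: str, function: Callable[..., Any], schema: Dict[str, Any]) -> None:
        self._tools[name] = function
        self._schemas[name] = schema

    def execute(self, name: str, arguments: Dict[str, Any]) -> Any:
        """Dispatch a tool_use block from the model to the registered subprocess wrapper."""
        if name not in self._tools:
            return {"success": False, "error": f"Unknown tool: {name}"}
        return self._tools[name](**arguments)
```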
The skill script (execute.py) looks like:
#!/usr/bin/env python3
import json
import sys

from playwright.sync_api import sync_playwright

def scrape(url: str, selector: str = None, wait_for: str = None, timeout: int = 30):
    """Execute the scraping."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, timeout=timeout * 1000)
        if wait_for:
            page.wait_for_selector(wait_for, timeout=timeout * 1000)
        html = page.content()
        text = page.inner_text("body")
        if selector:
            elements = page.query_selector_all(selector)
            extracted = [el.inner_text() for el in elements]
        else:
            extracted = []
        browser.close()
    return {
        "success": True,
        "html": html,
        "text": text,
        "elements": extracted
    }

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "--help-json":
        # Output schema
        schema = {
            "name": "web_scraper",
            "description": "Scrape web content using headless browser",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to scrape"},
                    "selector": {"type": "string", "description": "CSS selector"},
                    "wait_for": {"type": "string", "description": "Selector to wait for"},
                    "timeout": {"type": "integer", "description": "Timeout in seconds", "default": 30}
                },
                "required": ["url"]
            }
        }
        print(json.dumps(schema))
        sys.exit(0)
    elif len(sys.argv) > 2 and sys.argv[1] == "--run-json":
        # Execute with JSON params
        params = json.loads(sys.argv[2])
        result = scrape(**params)
        print(json.dumps(result))
        sys.exit(0)
    else:
        print("Usage: execute.py --help-json | --run-json '{...}'")
        sys.exit(1)
Real-World Performance Characteristics
Token Savings
In our production deployment with 42 skills:
| Approach | Tokens at Startup | Tokens per Conversation | Tokens at 10 Convos |
| --- | --- | --- | --- |
| Load all upfront | 21,000 | 21,000 | 210,000 |
| Lazy Skills (L1 only) | 630 | +250 avg (1-2 skills loaded) | 3,130 |
| Savings | 97% | 98.8% | 98.5% |
With 200K context window, loading all skills upfront consumes 10.5% immediately. Lazy Skills use 0.3%.
Latency Analysis
Level 3 registration adds latency:
Relevance detection: ~2ms (keyword matching)
Schema fetch: ~50ms (subprocess spawn + --help-json)
Tool registration: ~1ms (function wrapper creation)
-------------------------------------------------------------
Total Level 3 overhead: ~53ms
For a skill invoked 5 times in a session:
First call: 53ms overhead
Subsequent calls: 0ms (already registered)
Amortized: 10.6ms per call
Compared to bash invocation overhead (~20-50ms per call), this is competitive after 2-3 uses.
Cache Hit Rates
Over 1,000 conversations in our test corpus:
L1 always loaded: 42 skills, 100% hit rate
L2 loaded: 2.3 skills per conversation on average (5.5% of library)
L3 registered: 0.8 skills per conversation (1.9% of library)
Most conversations use 0-3 skills, meaning 95-100% of the skill library remains at L1 (metadata only).
Implementation Patterns and Best Practices
Designing Good Skill Metadata
The description field is critical—it drives relevance detection. Effective patterns:
Good:
name: database_schema_analyzer
description: Analyze database schemas for normalization issues and suggest improvements
Bad (too generic):
name: db_tool
description: Database utility
Best (includes keywords):
name: database_schema_analyzer
description: Analyze PostgreSQL/MySQL schemas, detect normalization violations, suggest foreign keys and indexes
Include specific technologies, action verbs, and domain terms users might mention.
Skill Directory Structure
Recommended layout:
~/.config/cli-agent/skills/
├── web_scraper/
│ ├── SKILL.md # Metadata + docs
│ ├── execute.py # Main executable
│ ├── requirements.txt # Python deps
│ ├── examples/
│ │ └── sample.json # Example inputs
│ └── tests/
│ └── test_scraper.py
├── pdf_extractor/
│ ├── SKILL.md
│ ├── execute.py
│ └── requirements.txt
└── composite_deploy/
├── SKILL.md
└── workflow.py # Orchestrates other tools
Each skill is fully self-contained with its own dependencies.
Dependency Management
Skills can specify Python dependencies:
---
name: web_scraper
dependencies: [playwright==1.40.0, beautifulsoup4>=4.12.0]
---
On first load, the agent can optionally:
Check if dependencies are installed (importlib.util.find_spec)
Offer to install them (pip install -r requirements.txt)
Sandbox installation per skill (virtualenv per skill directory)
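A sketch of the first step, assuming requirement specs are parsed naively from the frontmatter list (the alias table is a hand-maintained assumption for packages whose import name differs from their package name):

```python
import importlib.util
import re
from typing import List

def missing_dependencies(dependencies: List[str]) -> List[str]:
    """Return the frontmatter dependencies whose top-level module cannot be imported."""
    aliases = {"beautifulsoup4": "bs4"}  # import name differs from package name
    missing = []
    for spec in dependencies:
        package = re.split(r"[<>=\[]", spec, maxsplit=1)[0].strip()
        module = aliases.get(package, package)
        if importlib.util.find_spec(module) is None:
            missing.append(spec)
    return missing
```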
Current implementation uses shared environment, but subprocess isolation enables per-skill venvs:
def _make_subprocess_tool(self, script_path: Path, schema: Dict) -> callable:
    # Detect if skill has its own venv
    skill_venv = script_path.parent / ".venv" / "bin" / "python"
    python_executable = skill_venv if skill_venv.exists() else sys.executable
    def tool(**kwargs):
        cmd = [str(python_executable), str(script_path), "--run-json", json.dumps(kwargs)]
        # ... rest of execution
Error Handling and Timeouts
Skills must handle errors gracefully:
def scrape(url: str, timeout: int = 30):
    try:
        with sync_playwright() as p:
            # ... scraping logic ...
            return {"success": True, "data": result}
    except TimeoutError:
        return {
            "success": False,
            "error": f"Page load timed out after {timeout}s",
            "partial_data": None  # Include any partial results
        }
    except Exception as e:
        return {
            "success": False,
            "error": str(e),
            "error_type": type(e).__name__
        }
The subprocess wrapper enforces a hard 60s timeout regardless of skill-internal timeouts.
Security Considerations
Subprocess isolation provides some security boundaries:
Process isolation: Skills can’t directly mutate agent state
Timeout enforcement: Runaway scripts are killed
Restricted I/O: Skills communicate only via JSON stdin/stdout
No shared memory: Skills can’t access agent memory
However, skills still run with the agent’s user permissions and can:
Read/write files in the workspace
Make network requests
Execute arbitrary code
Mitigation strategies:
# Sandboxed execution with restricted permissions
def _make_subprocess_tool(self, script_path: Path, schema: Dict) -> callable:
    def tool(**kwargs):
        # Run with restricted network (Linux only)
        env = os.environ.copy()
        env['http_proxy'] = 'http://localhost:9999'  # Blocked proxy
        # Run with resource limits (Linux only)
        cmd = ['timeout', '60s', sys.executable, str(script_path), ...]
        result = subprocess.run(
            cmd,
            env=env,
            capture_output=True,
            timeout=60,
            # Future: use bubblewrap/firejail for sandboxing
        )
        # ...
For production deployments, consider:
Container-based execution (Docker/Podman)
Capability restrictions (no network, read-only filesystem)
Signed skills (GPG signatures on skill directories)
Audit logging (log all skill executions with params)
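For the last item, a sketch of per-execution audit logging layered onto the subprocess wrapper (hypothetical helper using the standard logging module):

```python
import json
import logging
import time

audit_logger = logging.getLogger("skills.audit")

def audited(tool_name: str, tool_func):
    """Wrap a registered skill tool so every invocation is logged with its parameters and outcome."""
    def wrapper(**kwargs):
        start = time.monotonic()
        audit_logger.info("skill=%s params=%s", tool_name, json.dumps(kwargs, default=str))
        result = tool_func(**kwargs)
        audit_logger.info("skill=%s success=%s elapsed=%.2fs",
                          tool_name, result.get("success", True), time.monotonic() - start)
        return result
    return wrapper
```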
Scaling to Hundreds of Skills
Current keyword-based relevance detection works well for <50 skills. Beyond that, precision degrades (too many false positives).
Semantic Relevance (Future Enhancement)
Replace keyword matching with embedding similarity:
import numpy as np
from sentence_transformers import SentenceTransformer

class SkillRegistry:
    def __init__(self):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.skill_embeddings = {}  # Cache embeddings

    def scan(self):
        """Compute embeddings for all skill descriptions at scan time."""
        for skill in self.skills.values():
            # Combine name + description for richer embedding
            text = f"{skill.name}: {skill.description}"
            embedding = self.embedding_model.encode(text)
            self.skill_embeddings[skill.name] = embedding

    async def _check_and_load_relevant_skills(self, user_message: str):
        """Use semantic similarity instead of keyword matching."""
        # Embed user message
        msg_embedding = self.embedding_model.encode(user_message)
        # Compute cosine similarity with all skill embeddings
        similarities = {}
        for skill_name, skill_emb in self.skill_embeddings.items():
            sim = np.dot(msg_embedding, skill_emb) / (
                np.linalg.norm(msg_embedding) * np.linalg.norm(skill_emb)
            )
            similarities[skill_name] = sim
        # Load top-k most similar skills (e.g., top 3)
        top_skills = sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:3]
        for skill_name, similarity in top_skills:
            if similarity > 0.5:  # Threshold
                content = self.load_level2(skill_name)
                # ... inject into context ...
Trade-offs:
Pro: Much better precision, works with 100+ skills
Pro: Handles synonyms and paraphrasing
Con: Adds embedding model as dependency (~80MB)
Con: Slower (50-100ms to embed message + compute similarities)
For large deployments, this is worth it.
Skill Categories and Namespaces
Organize skills hierarchically:
skills/
├── web/
│ ├── scraper/
│ ├── api_client/
│ └── browser_automation/
├── data/
│ ├── csv_processor/
│ ├── json_transformer/
│ └── database_query/
└── code/
├── complexity_analyzer/
├── test_generator/
└── refactoring/
Reference as web.scraper, data.csv_processor, etc.
Namespacing enables:
Scoped loading: Only scan web.* skills for web-related tasks
Conflict avoidance: Multiple skills named parser can coexist in different namespaces
Organizational structure: Mirrors team/domain boundaries
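A sketch of how dotted names could be derived during scanning (assumed behavior, not part of the current implementation):

```python
from pathlib import Path

def namespaced_name(skills_root: Path, skill_md: Path) -> str:
    """Derive a dotted skill name from its location, e.g. skills/web/scraper/SKILL.md -> 'web.scraper'."""
    relative = skill_md.parent.relative_to(skills_root)
    return ".".join(relative.parts)
```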
Conclusion: Toward Self-Extending Agents
Lazy Skills demonstrates that AI agents don’t need omniscience to be capable. By deferring capability loading until the moment of relevance, we can build systems that scale to hundreds of specialized skills without drowning in context.
The three-level architecture—metadata discovery, documentation loading, tool registration—provides a blueprint for runtime extensibility. Whether you’re building coding assistants, workflow automation, customer support agents, or multi-agent systems, this pattern solves a fundamental tension: capability breadth versus context efficiency.
What We’ve Learned
Progressive disclosure works: 97% token savings with no capability loss
Subprocess isolation is underrated: Dependency management and fault tolerance are critical
Relevance detection is the bottleneck: Keyword matching is good enough for <50 skills, semantic similarity needed beyond
Tool registration beats bash invocation: First-class tools are cleaner and faster
Skill types matter: Executables, contextual guides, and composite workflows need different loading strategies
Future Directions
The next frontier is self-extending agents: systems that can discover, install, and create their own skills autonomously.
Imagine an agent that:
Discovers skill gaps: “I need to parse Excel files but have no skill for that”
Searches skill repositories: Queries a GitHub/npm-style registry of community skills
Installs autonomously: Downloads, audits, and registers new skills at runtime
Creates new skills: Writes its own SKILL.md + execute.py to codify learned behaviors
Shares improvements: Publishes refined skills back to the community registry
This requires:
Skill registries: Centralized/decentralized skill discovery (think npm, but for agent capabilities)
Automated auditing: LLM-based code review to detect malicious skills
Skill composition: Combine existing skills into higher-order capabilities
Versioning and rollback: Skills evolve, agents need to handle breaking changes
Lazy Skills provides the foundation. The skill format (YAML + Markdown + subprocess contract) is simple enough for agents to generate themselves. The three-level loading ensures new skills don’t bloat context.
We’re building toward agents that extend their own capabilities, learning and sharing procedural knowledge like developers share code.
Key Takeaways
Context windows are finite: Load capabilities lazily, not eagerly
Three levels are optimal: Metadata (cheap), docs (on-demand), execution (lazy)
Subprocess isolation: Cleaner, safer, more fault-tolerant than bash invocation
Relevance detection is critical: Invest here as skill count grows
Skill types guide loading: Executables, contextual, composite have different needs
Extensibility without code changes: Users install skills by dropping files
Toward self-extension: Agents that create and share their own capabilities
Skills are just functions. Lazy Skills makes them composable, discoverable, and context-efficient.
