MCP Tool Integration as Systems Thinking (Part 3): System Behavior & Policies
In Part 1, we built architectural foundations. In Part 2, we designed for resilience. Now we address system-wide behavior: how tools are discovered, how errors are handled consistently, how performance emerges, and how tool selection becomes strategic.
Policy beats improvisation at scale.
Series Navigation
- Part 1: Foundation & Architecture
- Part 2: Resilience & Runtime Behavior
- Part 3: System Behavior & Policies (this article)
- Part 4: Advanced Patterns & Production
Tool Discovery Is a Governance Problem
As systems grow, the question shifts from "how do we call tools?" to "which tools should exist at all?"
Dynamic discovery and registration enable flexibility, but they also require governance. A tool registry becomes a source of truth, not just a convenience.
Effective registries capture intent:
- What the tool does
- What guarantees it provides
- How expensive or slow it is
- What permissions it requires
This metadata later enables smarter routing, better fallbacks, and informed deprecation decisions.
Tool Metadata Structure
Tool Metadata Schema:
┌────────────────────────────────────────────────────┐
│ IDENTIFICATION                                     │
│ • toolId: unique identifier                        │
│ • name: human-readable name                        │
│ • version: semantic version                        │
│ • description: purpose and capabilities            │
├────────────────────────────────────────────────────┤
│ CAPABILITIES                                       │
│ • capabilities: ['search', 'realtime-data']        │
│ • tags: ['production-ready', 'external']           │
├────────────────────────────────────────────────────┤
│ PERFORMANCE CHARACTERISTICS                        │
│ • estimatedLatency: fast|medium|slow               │
│   (fast<100ms, medium<1s, slow>1s)                 │
│ • rateLimit: {requests: 100, period: '1m'}         │
│ • costPerCall: 0.001 USD                           │
├────────────────────────────────────────────────────┤
│ RELIABILITY GUARANTEES                             │
│ • sla: '99.5%'                                     │
│ • retryable: true                                  │
│ • idempotent: true                                 │
├────────────────────────────────────────────────────┤
│ SECURITY REQUIREMENTS                              │
│ • requiredPermissions: ['network.external']        │
│ • dataClassification: public|internal|sensitive    │
│ • piiHandling: none|anonymize|encrypt              │
├────────────────────────────────────────────────────┤
│ INPUT/OUTPUT SCHEMA                                │
│ • input: type definitions + validation rules       │
│ • output: expected structure                       │
├────────────────────────────────────────────────────┤
│ OPERATIONAL                                        │
│ • fallbacks: ['cached_search', 'wiki_search']      │
│ • healthCheckEndpoint: URL                         │
└────────────────────────────────────────────────────┘
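To make the schema concrete, here is one illustrative registry entry written as a plain Python dict. The tool name and every value are invented for the example; a real registry would validate entries against the schema before accepting them.

```python
# Illustrative registry entry mirroring the metadata schema above.
# All values (tool name, SLA, fallbacks, cost) are invented examples.
web_search_tool = {
    "toolId": "web_search",
    "name": "Web Search",
    "version": "1.2.0",
    "description": "Full-text web search with freshness ranking",
    "capabilities": ["search", "realtime-data"],
    "tags": ["production-ready", "external"],
    "estimatedLatency": "medium",   # fast < 100ms, medium < 1s, slow > 1s
    "rateLimit": {"requests": 100, "period": "1m"},
    "costPerCall": 0.001,           # USD
    "sla": "99.5%",
    "retryable": True,
    "idempotent": True,
    "requiredPermissions": ["network.external"],
    "dataClassification": "public",
    "piiHandling": "none",
    "fallbacks": ["cached_search", "wiki_search"],
}
```

Keeping entries as declarative data like this is what makes the routing and fallback logic later in this article possible: the router never needs to know about any specific tool, only about the metadata fields.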
Tool Discovery Algorithm
FUNCTION discoverTools(sourcePath):
    toolDefinitions = scanDirectory(sourcePath)
    FOR EACH definition IN toolDefinitions:
        TRY:
            validateToolMetadata(definition)
            registry.register(definition)
            LOG_INFO("Tool discovered", {
                toolId: definition.toolId,
                version: definition.version,
                capabilities: definition.capabilities
            })
        CATCH error:
            LOG_ERROR("Tool registration failed", {
                toolId: definition.toolId,
                error: error
            })

FUNCTION findToolsByCapability(capability):
    RETURN registry.query({
        where: {
            capabilities CONTAINS capability,
            tags CONTAINS 'production-ready'
        },
        orderBy: 'reliability.sla' DESC
    })
FUNCTION routeToOptimalTool(intent, constraints):
    candidates = findToolsByCapability(intent.capability)
    // Filter by constraints
    IF constraints.maxLatency:
        candidates = filter(candidates, latency < constraints.maxLatency)
    IF constraints.maxCost:
        candidates = filter(candidates, cost < constraints.maxCost)
    IF constraints.requiredSLA:
        candidates = filter(candidates, sla >= constraints.requiredSLA)
    IF candidates IS EMPTY:
        RETURN null // No tool satisfies the constraints; caller must fall back
    // Score and rank
    scored = scoreTools(candidates, intent.priority)
    RETURN scored[0] // Best match
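A minimal, self-contained sketch of this routing logic over an in-memory tool list; the two tools, their latency and cost numbers, and the flat constraint parameters are all invented for illustration:

```python
# Sketch: capability-based routing over an in-memory registry.
# TOOLS and all numbers are invented; a real registry would be queried.
TOOLS = [
    {"toolId": "web_search", "capabilities": ["search"],
     "tags": ["production-ready"], "latencyMs": 400,
     "costPerCall": 0.001, "sla": 99.5},
    {"toolId": "cached_search", "capabilities": ["search"],
     "tags": ["production-ready"], "latencyMs": 50,
     "costPerCall": 0.0, "sla": 99.9},
]

def find_tools_by_capability(capability):
    # Only production-ready tools qualify; best SLA first.
    matches = [t for t in TOOLS
               if capability in t["capabilities"]
               and "production-ready" in t["tags"]]
    return sorted(matches, key=lambda t: t["sla"], reverse=True)

def route_to_optimal_tool(capability, max_latency_ms=None,
                          max_cost=None, required_sla=None):
    candidates = find_tools_by_capability(capability)
    if max_latency_ms is not None:
        candidates = [t for t in candidates if t["latencyMs"] < max_latency_ms]
    if max_cost is not None:
        candidates = [t for t in candidates if t["costPerCall"] < max_cost]
    if required_sla is not None:
        candidates = [t for t in candidates if t["sla"] >= required_sla]
    return candidates[0] if candidates else None

print(route_to_optimal_tool("search", max_latency_ms=100)["toolId"])  # cached_search
```

Note how every filter reads a metadata field from the registry entry: the routing policy stays generic, and adding a tool never requires touching the router.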
Governance Questions
Before adding a tool:
- Does this capability already exist?
- What’s the cost per invocation?
- What’s the expected failure rate?
- Who owns maintenance?
- What’s the deprecation plan?
Before removing a tool:
- What depends on it?
- What’s the migration path?
- Are there usage analytics?
- What’s the communication plan?
Error Handling Is a Policy Decision
Error handling should not be improvised at call sites. It should be a policy applied consistently across the system.
That policy answers questions like:
- Which errors trigger retries, and how often
- Which errors alert humans
- Which errors are safe to surface to agents
- When a tool should be disabled automatically
When these rules are centralized, the system behaves coherently under stress. When they aren’t, behavior becomes unpredictable and hard to trust.
Error Classification Decision Tree
graph TD
A[Error Occurred] --> B{Error Type?}
B -->|Timeout/Network| C[TRANSIENT]
B -->|HTTP 429| D[RATE_LIMIT]
B -->|HTTP 401/403| E[AUTHENTICATION]
B -->|HTTP 4xx| F[VALIDATION]
B -->|HTTP 5xx| C
B -->|Unknown| G[UNKNOWN]
C --> H{Circuit open?}
H -->|Yes| I[FAIL_FAST +<br/>Use Fallback]
H -->|No| J{Retries<br/>exhausted?}
J -->|Yes| K[FAIL +<br/>Use Fallback]
J -->|No| L[RETRY +<br/>Exponential Backoff]
D --> M[RETRY +<br/>Linear Backoff<br/>Honor retry-after]
E --> N{Credentials<br/>refreshed?}
N -->|No| O[Refresh +<br/>RETRY once]
N -->|Yes| P[FAIL +<br/>Alert Operator<br/>Disable Tool]
F --> Q[FAIL +<br/>Surface to Agent<br/>Validation Error]
G --> R[FAIL +<br/>Alert Operator]
Error Handling Algorithm
FUNCTION handleError(error, context):
    classification = classifyError(error)
    SWITCH classification.category:
        CASE TRANSIENT:
            RETURN handleTransient(error, context, classification)
        CASE AUTHENTICATION:
            RETURN handleAuth(error, context)
        CASE RATE_LIMIT:
            RETURN handleRateLimit(error, context, classification)
        CASE VALIDATION:
            RETURN {action: FAIL, surfaceToAgent: true, guidance: error.message}
        DEFAULT:
            RETURN {action: FAIL, alertOperator: true}

FUNCTION classifyError(error):
    IF error.type IN [TIMEOUT, CONNECTION_REFUSED]:
        RETURN {category: TRANSIENT, retryable: true, maxRetries: 3, backoff: EXPONENTIAL}
    IF error.statusCode == 429:
        RETURN {category: RATE_LIMIT, retryable: true, delayMs: error.retryAfter OR 60000}
    IF error.statusCode IN [401, 403]:
        RETURN {category: AUTHENTICATION, retryable: false, alertOperator: true}
    IF error.statusCode >= 500:
        RETURN {category: TRANSIENT, retryable: true, maxRetries: 3}
    IF error.statusCode >= 400:
        RETURN {category: VALIDATION, retryable: false, surfaceToAgent: true}
    RETURN {category: UNKNOWN, retryable: false, alertOperator: true}
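The classification rules translate almost line-for-line into runnable code. In this sketch, errors are modeled as plain dicts; a real integration would inspect exception types and response objects instead:

```python
# Minimal classifier mirroring the decision tree above.
# Errors are modeled as dicts with optional "type" / "statusCode" fields.
TIMEOUT, CONNECTION_REFUSED = "timeout", "connection_refused"

def classify_error(error):
    if error.get("type") in (TIMEOUT, CONNECTION_REFUSED):
        return {"category": "TRANSIENT", "retryable": True,
                "maxRetries": 3, "backoff": "exponential"}
    status = error.get("statusCode", 0)
    if status == 429:
        # Honor the server's retry-after hint when present.
        return {"category": "RATE_LIMIT", "retryable": True,
                "delayMs": error.get("retryAfter") or 60000}
    if status in (401, 403):
        return {"category": "AUTHENTICATION", "retryable": False,
                "alertOperator": True}
    if status >= 500:
        return {"category": "TRANSIENT", "retryable": True, "maxRetries": 3}
    if status >= 400:
        return {"category": "VALIDATION", "retryable": False,
                "surfaceToAgent": True}
    return {"category": "UNKNOWN", "retryable": False, "alertOperator": True}
```

Because classification is pure data-in, data-out, it is trivial to unit test exhaustively, which is exactly what you want for the one function every error in the system flows through.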
FUNCTION handleTransient(error, context, classification):
    IF isCircuitOpen(context.toolId):
        RETURN {action: FAIL_FAST, useFallback: true}
    IF context.retryCount >= classification.maxRetries:
        recordFailure(context.toolId)
        RETURN {action: FAIL, useFallback: true}
    delayMs = 2^(context.retryCount) * 1000 // Exponential backoff
    RETURN {action: RETRY, delayMs: delayMs}
FUNCTION handleAuth(error, context):
    IF NOT context.credentialsRefreshed:
        refreshCredentials(context.toolId)
        context.credentialsRefreshed = true // Ensure we only retry once
        RETURN {action: RETRY, delayMs: 0}
    alertOperator(severity=HIGH, toolId=context.toolId)
    RETURN {action: FAIL, disableTool: true}
// Circuit Breaker Pattern
FUNCTION recordFailure(toolId):
    breaker = circuitBreakers.get(toolId)
    breaker.failures += 1
    breaker.lastFailure = NOW()
    IF breaker.failures >= THRESHOLD:
        breaker.state = OPEN
        LOG_ERROR("Circuit breaker opened", {toolId})
        // Auto-reset after timeout
        scheduleTask(after=CIRCUIT_TIMEOUT, action=() => {
            breaker.state = HALF_OPEN
            breaker.failures = 0
        })
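The breaker itself is small enough to show in full. This sketch injects a clock instead of scheduling a reset task, which makes the OPEN to HALF_OPEN transition deterministic in tests; the threshold and timeout values are illustrative:

```python
# Minimal circuit breaker sketch. Threshold/timeout values are illustrative;
# the clock is injectable so the timeout path can be tested without sleeping.
import time

class CircuitBreaker:
    def __init__(self, threshold=5, timeout_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.timeout_s = timeout_s
        self.clock = clock
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def record_success(self):
        # Any success fully closes the breaker in this simplified model.
        self.failures = 0
        self.state = "CLOSED"

    def allow_request(self):
        if self.state == "OPEN":
            # After the timeout, let one probe request through (half-open).
            if self.clock() - self.opened_at >= self.timeout_s:
                self.state = "HALF_OPEN"
                return True
            return False
        return True
```

A production version would also cap how many probes run concurrently in HALF_OPEN (the `halfOpenRequests` knob in the policy config below this section).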
Policy Configuration Example
errorHandling:
  retryPolicy:
    maxAttempts: 3
    backoffStrategy: exponential
    baseDelayMs: 1000
  circuitBreaker:
    failureThreshold: 5
    timeoutMs: 30000
    halfOpenRequests: 3
  authentication:
    autoRefresh: true
    maxRefreshAttempts: 1
    alertOnFailure: true
  validation:
    surfaceToAgent: true
    includeFieldErrors: true
  rateLimiting:
    honorRetryAfter: true
    defaultBackoffMs: 60000
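As a sanity check on the retry policy, the full schedule of delays can be derived directly from the config values. The config dict below mirrors the YAML above; loading it from a file is omitted:

```python
# Sketch: deriving the retry schedule from the policy config above.
# The dict mirrors the YAML; YAML parsing is omitted.
error_handling = {
    "retryPolicy": {
        "maxAttempts": 3,
        "backoffStrategy": "exponential",
        "baseDelayMs": 1000,
    },
}

def retry_delays_ms(policy):
    base = policy["baseDelayMs"]
    if policy["backoffStrategy"] == "exponential":
        # Attempt n waits base * 2^n, matching the handleTransient formula.
        return [base * 2**attempt for attempt in range(policy["maxAttempts"])]
    return [base] * policy["maxAttempts"]  # fixed-delay fallback

print(retry_delays_ms(error_handling["retryPolicy"]))  # [1000, 2000, 4000]
```

Centralizing the formula next to the config keeps the documented policy and the runtime behavior from drifting apart.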
Performance Emerges From Architecture
In multi-tool environments, performance is not the result of fast tools alone. It emerges from how tools are composed, cached, and orchestrated.
Small inefficiencies multiply when:
- Tools are called redundantly
- Connections are not reused
- Results are not cached
- Orchestration is overly sequential
Good performance engineering focuses less on micro-optimizations and more on flow: minimizing unnecessary work and making latency predictable.
Performance Optimization Patterns
flowchart LR
A[Tool Request] --> B{In-flight<br/>request?}
B -->|Yes| C[Wait for<br/>existing]
B -->|No| D{Cached?}
D -->|Yes & Fresh| E[Return cached]
D -->|No/Expired| F[Get pooled<br/>connection]
F --> G[Execute]
G --> H[Cache result]
H --> I[Release connection]
I --> J[Return result]
C --> J
E --> J
Key Patterns
1. REQUEST DEDUPLICATION
Problem: Same request called multiple times simultaneously
Solution: Track in-flight requests, share result
cacheKey = hash(toolId + normalizedInput)
IF inflightRequests.contains(cacheKey):
    RETURN AWAIT inflightRequests.get(cacheKey)
promise = executeActual(toolId, input)
inflightRequests.set(cacheKey, promise)
TRY:
    result = AWAIT promise
    RETURN result
FINALLY:
    inflightRequests.remove(cacheKey)
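With asyncio, the same idea looks like this: concurrent callers with the same cache key await a single shared task. `execute_actual` is a stand-in for the real tool call, and the call counter exists only to demonstrate the deduplication:

```python
# Sketch of request deduplication: concurrent identical requests share one
# in-flight task. execute_actual stands in for the real tool invocation.
import asyncio, hashlib, json

inflight = {}
calls = {"count": 0}

async def execute_actual(tool_id, payload):
    calls["count"] += 1
    await asyncio.sleep(0.01)  # simulate network I/O
    return {"tool": tool_id, "echo": payload}

def cache_key(tool_id, payload):
    normalized = json.dumps(payload, sort_keys=True)
    return tool_id + ":" + hashlib.sha256(normalized.encode()).hexdigest()

async def execute_deduped(tool_id, payload):
    key = cache_key(tool_id, payload)
    if key in inflight:
        return await inflight[key]  # piggyback on the in-flight call
    task = asyncio.create_task(execute_actual(tool_id, payload))
    inflight[key] = task
    try:
        return await task
    finally:
        inflight.pop(key, None)  # later identical requests execute fresh

async def main():
    return await asyncio.gather(
        *[execute_deduped("search", {"q": "mcp"}) for _ in range(5)])

results = asyncio.run(main())
print(calls["count"])  # 1: five concurrent callers, one real execution
```

Awaiting the same task from several coroutines is safe in asyncio; the task caches its result, so late arrivals still get the answer after the creator has cleaned up the map.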
2. CONNECTION POOLING
Problem: Creating connections is expensive
Solution: Reuse idle connections
pool = {connections: [], maxSize: 10, activeCount: 0}
IF pool.hasIdleConnection():
    RETURN pool.pop()
ELSE IF pool.activeCount < pool.maxSize:
    pool.activeCount++
    RETURN createNewConnection()
ELSE:
    WAIT_FOR availableConnection()
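A minimal synchronous pool following the same decision order: reuse an idle connection, create a new one while under the cap, otherwise block until a release. `factory` stands in for real connection setup; a production pool would add health checks and acquire timeouts:

```python
# Minimal connection pool sketch. `factory` stands in for real connection
# setup; production pools also need health checks and acquire timeouts.
import queue

class ConnectionPool:
    def __init__(self, factory, max_size=10):
        self.factory = factory
        self.max_size = max_size
        self.idle = queue.Queue()
        self.active_count = 0

    def acquire(self):
        try:
            return self.idle.get_nowait()   # 1) reuse an idle connection
        except queue.Empty:
            pass
        if self.active_count < self.max_size:
            self.active_count += 1
            return self.factory()           # 2) create, up to max_size
        return self.idle.get()              # 3) block until one is released

    def release(self, conn):
        self.idle.put(conn)

pool = ConnectionPool(factory=lambda: object(), max_size=2)
a = pool.acquire()
b = pool.acquire()
pool.release(a)
c = pool.acquire()   # reuses `a` instead of creating a third connection
print(c is a)        # True
```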
3. SMART CACHING (TTL by tool characteristics)
Problem: One-size-fits-all caching is inefficient
Solution: Adaptive TTL based on metadata
FUNCTION getCacheTTL(toolId):
    metadata = registry.getMetadata(toolId)
    IF metadata.latency == 'slow':
        RETURN 5_minutes // Expensive, cache longer
    ELSE IF 'realtime-data' IN metadata.capabilities:
        RETURN 30_seconds // Fresh data needed
    ELSE:
        RETURN 1_minute // Default
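The same rule as runnable code, with the registry lookup stubbed by a dict; the tool names are invented:

```python
# Adaptive cache TTL driven by tool metadata. The registry is stubbed
# with a dict; tool names and metadata are invented examples.
REGISTRY = {
    "slow_report": {"latency": "slow", "capabilities": []},
    "live_prices": {"latency": "fast", "capabilities": ["realtime-data"]},
    "wiki_search": {"latency": "medium", "capabilities": ["search"]},
}

def cache_ttl_seconds(tool_id):
    meta = REGISTRY[tool_id]
    if meta["latency"] == "slow":
        return 5 * 60   # expensive call: cache longer
    if "realtime-data" in meta["capabilities"]:
        return 30       # freshness matters: cache briefly
    return 60           # default
```

This is another payoff of the metadata registry from earlier: cache policy is derived from declared tool characteristics rather than hard-coded per call site.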
4. PARALLEL EXECUTION WITH CONCURRENCY LIMITS
Problem: Unlimited parallelism overwhelms system
Solution: Sliding window concurrency control
FUNCTION executeParallel(tasks[], maxConcurrency=5):
    results = []
    executing = []
    FOR EACH task IN tasks:
        promise = execute(task)
        results.APPEND(promise)
        executing.APPEND(promise)
        IF length(executing) >= maxConcurrency:
            completed = AWAIT any(executing) // Wait for one to complete
            executing.remove(completed)
    RETURN AWAIT all(results)
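The sliding-window control above can also be expressed with a semaphore, which asyncio provides directly. This sketch records peak concurrency to show the cap holds; the demo tasks are invented:

```python
# Concurrency-limited parallel execution via a semaphore; peak concurrency
# is tracked to verify the cap. Demo tasks are invented examples.
import asyncio

async def execute_parallel(tasks, max_concurrency=5):
    sem = asyncio.Semaphore(max_concurrency)
    stats = {"running": 0, "peak": 0}

    async def run(task):
        async with sem:  # at most max_concurrency inside this block
            stats["running"] += 1
            stats["peak"] = max(stats["peak"], stats["running"])
            result = await task()
            stats["running"] -= 1
            return result

    # gather preserves input order in its results.
    results = await asyncio.gather(*(run(t) for t in tasks))
    return results, stats["peak"]

async def demo():
    async def work(i):
        await asyncio.sleep(0.01)
        return i * 2
    tasks = [lambda i=i: work(i) for i in range(10)]
    return await execute_parallel(tasks, max_concurrency=3)

results, peak = asyncio.run(demo())
```

The semaphore version avoids the bookkeeping of tracking which promise completed, at the cost of not starting task N+1 the instant a slot frees mid-await; for tool-call workloads the difference is negligible.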
5. CACHE KEY NORMALIZATION
Problem: Same input, different key (order, formatting)
Solution: Normalize before hashing
FUNCTION getCacheKey(toolId, input):
    // Sort keys for consistent ordering
    normalized = stringify(input, sortKeys=true)
    hash = hashFunction(normalized)
    RETURN toolId + ":" + hash
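A runnable version of the normalization rule, using sorted JSON keys and SHA-256:

```python
# Normalize input (sorted keys, compact separators) before hashing so that
# logically identical payloads always map to the same cache key.
import hashlib, json

def get_cache_key(tool_id, payload):
    normalized = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(normalized.encode()).hexdigest()
    return f"{tool_id}:{digest}"

k1 = get_cache_key("search", {"q": "mcp", "limit": 10})
k2 = get_cache_key("search", {"limit": 10, "q": "mcp"})
print(k1 == k2)  # True: key order no longer matters
```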
Performance Metrics to Track
• Cache Hit Rate: hits / (hits + misses)
  Target: >80% for cacheable operations
• Connection Pool Utilization: active / maxSize
  Target: 60-80% (headroom for spikes)
• Deduplication Rate: deduplicated / totalRequests
  Indicates redundant call patterns
• P50/P95/P99 Latency: Response time percentiles
  Watch for bimodal distributions
• Throughput: Requests/second sustained
  Should scale with concurrency
Tool Selection Is an Exercise in Restraint
One of the most mature signals in an MCP system is not how many tools it uses, but how many it deliberately leaves out.
Tool selection is where strategy shows up:
- Community tools are excellent defaults for standard capabilities
- Custom tools make sense when differentiation matters
- Redundancy should exist for resilience, not indecision
Every tool added increases operational surface area. Thoughtful systems earn complexity deliberately.
Tool Evaluation Scorecard
Before adding a tool, evaluate:
Need (0-10 points)
- Does this solve a real user problem?
- Is there an existing tool that could work?
- What’s the cost of not having this?
Quality (0-10 points)
- What’s the documented SLA?
- How well-maintained is it?
- Are there test cases or examples?
Operational Cost (0-10 points, inverted)
- How complex is integration?
- What dependencies does it add?
- What’s the monitoring burden?
Strategic Fit (0-10 points)
- Does this align with platform direction?
- Will this still matter in 6 months?
- Does this enable future capabilities?
Threshold: Require 30+ points to add a tool.
Tool Deprecation Signals
Remove tools when:
- Usage drops below 1% of total tool calls for 30 days
- Better alternatives exist with higher satisfaction
- Maintenance cost exceeds value delivered
- Strategy shifts away from the capability
Policy Checklist
Your MCP system demonstrates good governance when:
- Tool metadata is comprehensive and up-to-date
- Discovery is automated with validation
- Error handling follows consistent policies
- Circuit breakers protect against cascading failures
- Performance patterns are applied uniformly
- Tool selection has clear criteria
- Deprecation has a defined process
- Operators can query tool health programmatically
Coming Next: Part 4 — Advanced Patterns & Production
In the final part, we’ll explore:
- Tool composition and orchestration patterns
- Security as structural design
- Testing for failure at scale
- Production readiness principles
Continue to Part 4: Advanced Patterns & Production
Reflection
Policies scale where improvisation fails. By centralizing decisions about discovery, errors, performance, and selection, you create systems that behave predictably under stress.
The best systems make governance invisible to users but obvious to operators.