Caching System¶
Understanding Substation's multi-level caching architecture.
Overview¶
Substation achieves 60-80% API call reduction through MemoryKit, a sophisticated multi-level caching system that dramatically improves performance while maintaining data freshness.
The Problem: OpenStack APIs are slow (2+ seconds per call). Your workflow requires hundreds of API calls. Without caching, you'd be waiting minutes for simple operations.
The Solution: Intelligent caching with resource-specific TTLs and multi-level hierarchy.
Cache Architecture¶
Multi-Level Hierarchy¶
Substation implements a three-tier cache system, similar to CPU cache architecture:
Request → L1 Cache → L2 Cache → L3 Cache → API

| Tier | Latency | Share of requests |
|---|---|---|
| L1 Cache | < 1ms | 80% hit |
| L2 Cache | ~5ms | 15% hit |
| L3 Cache | ~20ms | 3% hit |
| OpenStack API | 2+ sec | 2% (cache miss) |
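As a rough sketch (the type and property names below are illustrative, not taken from the MemoryKit source), the three tiers differ only in latency and persistence:

```swift
/// Illustrative model of the three cache tiers and their nominal characteristics.
enum CacheTier: CaseIterable {
    case l1, l2, l3

    /// Approximate retrieval latency, as documented above.
    var nominalLatency: Duration {
        switch self {
        case .l1: return .milliseconds(1)
        case .l2: return .milliseconds(5)
        case .l3: return .milliseconds(20)
        }
    }

    /// Only the on-disk L3 tier survives an application restart.
    var survivesRestart: Bool { self == .l3 }
}
```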
L1 Cache (Memory - Hot Data)¶
- Speed: < 1ms retrieval
- Hit Rate: 80% of requests
- Storage: In-memory (RAM)
- Persistence: Cleared on restart
- Purpose: Frequently accessed data
Characteristics:
- Lightning fast access
- Limited size (memory constrained)
- Most recent and frequently used data
- First to be evicted under memory pressure
L2 Cache (Larger Memory - Warm Data)¶
- Speed: ~5ms retrieval
- Hit Rate: 15% of requests
- Storage: Larger in-memory pool
- Persistence: Cleared on restart
- Purpose: Less frequently accessed data
Characteristics:
- Still fast, slightly slower than L1
- Larger capacity than L1
- Recently used but not hot data
- Second priority for eviction
L3 Cache (Disk - Cold Data)¶
- Speed: ~20ms retrieval
- Hit Rate: 3% of requests
- Storage: On-disk cache
- Persistence: Survives restarts
- Purpose: Historical data and startup acceleration
Characteristics:
- Slowest cache tier (but still faster than API)
- Survives application restarts
- Persistent storage
- Enables fast startup with warm cache
Total Cache Performance¶
- Combined Hit Rate: 98% (L1 + L2 + L3)
- Cache Miss Rate: 2% (requires an API call)
Result: Only 2% of operations hit the slow OpenStack API.
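As a rough, illustrative estimate using the nominal latencies above: 0.80 × 1ms + 0.15 × 5ms + 0.03 × 20ms + 0.02 × 2000ms ≈ 42ms per request on average, versus roughly 2000ms when every request hits the API, i.e. roughly 47× faster on average.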
Resource-Specific TTLs¶
Different resource types have different volatility, so we cache them differently:
TTL Strategy¶
| Resource Type | TTL | Rationale |
|---|---|---|
| Authentication Tokens | 3600s (1 hour) | Keystone token lifetime |
| Service Endpoints | 1800s (30 min) | Semi-static infrastructure |
| Flavors, Images | 900s (15 min) | Rarely change in production |
| Networks, Subnets, Routers | 300s (5 min) | Moderately dynamic |
| Security Groups | 300s (5 min) | Change occasionally |
| Servers, Volumes, Ports | 120s (2 min) | Highly dynamic (state changes frequently) |
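Expressed as a lookup table, the strategy might look like the sketch below. The dictionary keys and the `cacheTTLs` name are illustrative; the real values are hardcoded in MemoryKit (see Cache Tuning below).

```swift
import Foundation

/// Illustrative TTL table mirroring the strategy above (values in seconds).
let cacheTTLs: [String: TimeInterval] = [
    "auth_token":       3600,   // Keystone token lifetime
    "service_endpoint": 1800,   // semi-static infrastructure
    "flavor":            900,   // rarely changes in production
    "image":             900,
    "network":           300,   // moderately dynamic
    "subnet":            300,
    "router":            300,
    "security_group":    300,
    "server":            120,   // highly dynamic
    "volume":            120,
    "port":              120,
]
```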
Why These TTLs?¶
Long TTL (15+ minutes):
- Flavors: Admins add new sizes occasionally
- Images: OS images rarely change once uploaded
- Benefit: Fewer API calls, better performance
Medium TTL (5 minutes):
- Networks: Created occasionally, stable once created
- Security Groups: Rules change but not constantly
- Benefit: Balance between freshness and performance
Short TTL (2 minutes):
- Servers: State changes frequently (building, active, error)
- Volumes: Attach/detach operations common
- Ports: Network interfaces dynamic
- Benefit: Reasonably fresh data
Very Long TTL (1 hour):
- Auth Tokens: Keystone tokens last 1 hour anyway
- Service Endpoints: These never change (until they do)
- Benefit: Minimal auth overhead
Cache Operations¶
Cache Hit (The Fast Path)¶
When data is in cache:
1. Request arrives
2. L1 cache checked (< 1ms)
3. If found and fresh (TTL not expired):
   - Data returned immediately
   - No API call needed

80% of requests take this path.
Cache Miss (The Slow Path)¶
When data is not in cache or expired:
1. Request arrives
2. L1 cache miss
3. L2 cache checked (~5ms): miss
4. L3 cache checked (~20ms): miss
5. OpenStack API called (2+ seconds)
6. Response stored in all cache levels
7. Data returned to user
8. Future requests hit cache
Only 2% of requests take this full path.
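Conceptually, both paths collapse into a single read-through lookup: try each tier in order, fall back to the API only on a full miss, then back-fill every tier. The sketch below is illustrative; the real MultiLevelCacheManager API differs.

```swift
/// Simplified read-through lookup over the cache tiers (cheapest first).
/// `fetch` is the expensive OpenStack API call that runs only on a full miss.
func readThrough<T>(
    key: String,
    tiers: [(get: (String) async -> T?, set: (String, T) async -> Void)],
    fetch: () async throws -> T
) async throws -> T {
    // Fast path: return from the first tier that still holds a fresh entry.
    for tier in tiers {
        if let hit = await tier.get(key) {
            return hit
        }
    }

    // Slow path (~2% of requests): pay the 2+ second API cost once...
    let value = try await fetch()

    // ...then back-fill every tier so the next lookup is a fast hit.
    for tier in tiers {
        await tier.set(key, value)
    }
    return value
}
```

The back-fill step is what turns a single slow API call into many subsequent sub-millisecond hits.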
Cache Invalidation¶
Manual Invalidation:
- Press `c` in Substation to purge ALL caches
- Clears L1, L2, and L3
- Next operations slower while cache rebuilds
Automatic Invalidation:
- TTL expiration (resource-specific timeouts)
- Memory pressure (automatic eviction at 85% usage)
- Explicit updates (after create/delete operations)
When to Purge Manually:
- Data looks stale or wrong
- Just made major changes outside Substation
- Debugging data issues
- After OpenStack cluster issues
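To make the automatic triggers concrete, the sketch below shows a TTL-stamped entry and an explicit invalidation helper. All names and key formats here are assumptions, not MemoryKit's actual API.

```swift
import Foundation

/// One cached entry carrying its own resource-specific TTL.
struct CachedEntry<Value> {
    let value: Value
    let storedAt = Date()
    let ttl: TimeInterval

    /// TTL expiration: the entry silently becomes a cache miss once it outlives its TTL.
    var isFresh: Bool { Date().timeIntervalSince(storedAt) < ttl }
}

// Explicit invalidation after a write: drop the affected keys so the next
// read goes back to the API (hypothetical key formats).
var cache: [String: CachedEntry<Data>] = [:]

func invalidateServer(id: String) {
    cache.removeValue(forKey: "server:\(id)")      // the individual resource
    cache.removeValue(forKey: "servers:list")      // any list that contained it
}
```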
Memory Management¶
Memory Pressure Handling¶
Substation monitors memory usage and automatically manages cache:
Thresholds:
- Normal Operation: < 85% memory usage
- Eviction Starts: 85% memory usage
- Target After Eviction: 75% memory usage
Eviction Order:
- L1 cache entries (oldest first)
- L2 cache entries (oldest first)
- L3 cache preserved (on-disk)
Why This Approach:
- Prevents out-of-memory (OOM) crashes
- Maintains system stability
- Preserves disk cache for restart
- Automatic and transparent
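A minimal sketch of that behavior, using the documented thresholds (start at 85%, stop at 75%) but otherwise hypothetical names:

```swift
/// Illustrative eviction pass driven by the documented thresholds.
func handleMemoryPressure(
    currentUsage: () -> Double,    // fraction of memory in use, 0.0–1.0
    evictOldestL1: () -> Bool,     // returns false when L1 has nothing left to evict
    evictOldestL2: () -> Bool      // returns false when L2 has nothing left to evict
) {
    guard currentUsage() >= 0.85 else { return }   // normal operation below 85%

    // Evict oldest L1 entries first, then L2; the on-disk L3 cache is left alone.
    while currentUsage() > 0.75 {
        if evictOldestL1() { continue }
        if evictOldestL2() { continue }
        break   // nothing left to evict in memory
    }
}
```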
Expected Memory Usage¶
- Base Application: ~200MB
- Cache for 10,000 resources: ~100MB
- Typical Total: 200-400MB
For Large Deployments:
- 50,000 resources: ~500MB
- 100,000 resources: ~800MB
This is normal and expected.
Cache Statistics¶
Monitoring Cache Performance¶
Press `h` in Substation for the Health Dashboard:
Key Metrics:
- Cache Hit Rate: Target 80%+, typical 85-90%
- Memory Usage: Target < 85%, eviction starts at 85%
- Average Response Time: < 100ms cached, 2+ seconds uncached
- Eviction Count: Should be low in normal operation
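These metrics boil down to simple counters. A sketch of how they could be derived (illustrative only, not necessarily how PerformanceMonitor.swift computes them):

```swift
/// Illustrative cache statistics derived from raw hit/miss counters.
struct CacheStats {
    var hits = 0
    var misses = 0
    var totalResponseTime: Duration = .zero

    /// Hit rate as a percentage; the dashboard target is 80%+.
    var hitRatePercent: Double {
        let total = hits + misses
        return total == 0 ? 0 : Double(hits) / Double(total) * 100
    }

    /// Mean response time across all requests, cached and uncached.
    var averageResponseTime: Duration {
        let total = hits + misses
        return total == 0 ? .zero : totalResponseTime / total
    }
}
```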
Performance Indicators¶
Good Performance:
- Cache hit rate: 80%+
- Memory usage: 50-75%
- Low eviction count
- Response times < 100ms
Degraded Performance:
- Cache hit rate: < 60%
- Memory usage: > 85%
- High eviction count
- Frequent cache misses
Action Required:
- Hit rate < 60%: Check TTL configuration, may need adjustment
- Memory > 85%: Close other apps, increase system RAM
- High evictions: Reduce cache sizes or increase memory
Cache Tuning¶
Adjusting TTLs for Your Environment¶
Stable Environments (Dev/Staging):
- Increase TTLs (less API load)
- Servers: 300s (5 min) instead of 120s
- Networks: 600s (10 min) instead of 300s
- Flavors/Images: Keep at 900s (15 min)
Chaotic Environments (Production with Auto-Scaling):
- Decrease TTLs (fresher data)
- Servers: 60s (1 min) instead of 120s
- Networks: 180s (3 min) instead of 300s
- Accept lower cache hit rates (60%+ is good)
Current Implementation:
TTLs are hardcoded in `CacheManager.swift:100`. Future versions may expose this in configuration.
Memory Tuning¶
Increase Cache Size (if you have RAM):
- More memory = more cached items
- Better hit rates
- Fewer API calls
Decrease Cache Size (if memory constrained):
- Less memory usage
- More evictions
- Lower hit rates
- More API calls
Current Implementation:
Memory limits are auto-calculated based on available system RAM. Future versions may expose manual configuration.
Implementation Details¶
MemoryKit Components¶
Located in `/Sources/MemoryKit/`:

| Component | Purpose |
|---|---|
| `MultiLevelCacheManager.swift` | L1/L2/L3 orchestration |
| `CacheManager.swift` | Core caching logic |
| `MemoryManager.swift` | Memory pressure handling |
| `TypedCacheManager.swift` | Type-safe cache ops |
| `PerformanceMonitor.swift` | Metrics tracking |
| `MemoryKit.swift` | Public API |
| `MemoryKitLogger.swift` | Logging |
| `ComprehensiveMetrics.swift` | Metrics aggregation |
Cache Key Strategy¶
Cache keys are constructed from:
- Resource type (server, network, volume, etc.)
- Resource ID (UUID)
- Query parameters (for list operations)
Example cache keys (illustrative format only; the exact scheme is internal to MemoryKit):
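```
server:3f9c2a1e-...                    # single resource by UUID
servers:list?project_id=abc123        # list operation with query parameters
network:7d41b0c9-...
flavors:list
```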
Thread Safety¶
All cache operations are actor-based:
- No locks or mutexes required
- Guaranteed thread safety
- Swift 6 strict concurrency enforced
- Zero data race conditions
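A minimal sketch of what actor isolation provides here (the `SimpleCache` type is hypothetical, not MemoryKit's TypedCacheManager):

```swift
import Foundation

/// Minimal actor-backed cache: the actor serializes every access to `storage`,
/// so concurrent readers and writers cannot race, with no locks or mutexes.
actor SimpleCache<Value: Sendable> {
    private struct Entry {
        let value: Value
        let expiresAt: Date
    }

    private var storage: [String: Entry] = [:]

    func value(forKey key: String) -> Value? {
        guard let entry = storage[key], entry.expiresAt > Date() else {
            storage[key] = nil          // lazily drop missing or expired entries
            return nil
        }
        return entry.value
    }

    func store(_ value: Value, forKey key: String, ttl: TimeInterval) {
        storage[key] = Entry(value: value, expiresAt: Date().addingTimeInterval(ttl))
    }
}

// Callers hop onto the actor with `await`; Swift 6 strict concurrency verifies
// at compile time that no unsynchronized access is possible:
//   let cached: [Server]? = await cache.value(forKey: "servers:list")
```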
Best Practices¶
For Operators¶
- Let the cache work - Don't constantly press `c`
- Monitor hit rates - Press `h` to check cache performance
- Purge strategically - Only when data is truly stale
- Accept short delays on first load - Cache warming is normal
For Developers¶
- Respect TTLs - Don't bypass cache unless necessary
- Monitor memory - Watch for memory leaks
- Test under load - Validate cache behavior with 10K+ resources
- Profile eviction - Ensure eviction works under pressure
Troubleshooting¶
Low Cache Hit Rate¶
Symptoms: Hit rate < 60% in Health Dashboard
Causes:
- Constantly pressing `c` (cache purge)
- TTLs too short for environment
- High memory pressure (frequent evictions)
- Resources changing very rapidly
Solutions:
- Stop purging cache manually
- Let cache warm up (first loads are slow)
- Check memory usage (< 85% is good)
- For stable environments, consider longer TTLs (future)
High Memory Usage¶
Symptoms: Memory > 85%, frequent evictions
Causes:
- Too many resources (50K+ servers)
- Other applications using RAM
- Memory leak (unlikely, but report if suspected)
Solutions:
- Close other applications
- Filter views with `/` (reduces active dataset)
- Use project-scoped credentials (fewer resources)
- Increase system RAM
Stale Data¶
Symptoms: Resources not appearing, wrong states
Causes:
- TTL hasn't expired yet
- Cache holds old data
- OpenStack cluster had issues
Solutions:
- Press `c` to purge cache
- Press `r` to refresh view
- Fresh data loads from API
Remember: Caching is the secret to Substation's performance. 60-80% fewer API calls means your OpenStack cluster thanks you, and your operations are lightning fast.
Cache wisely, operate swiftly.