Performance Troubleshooting

Common performance problems and their solutions.

Common Performance Problems

1. High Memory Usage

Symptoms:

  • Substation using > 500MB memory
  • System feels sluggish
  • OOM killer threatening your app

Causes:

  • Cache sizes too large for your environment
  • Too many resources being cached
  • Memory leaks (shouldn't happen, but check)

Diagnosis:

# Check current memory usage
ps aux | grep substation

# Monitor in Substation
# Press 'h' for health dashboard
# Check memory utilization metrics

Solutions:

  1. Immediate relief - Purge caches:
      • Press c in Substation (purges all caches)
      • The next operations are slower while the cache rebuilds
      • Memory usage should drop immediately

  2. Short-term fix - Reduce cache TTLs:

// In CacheManager.swift
case .server: return 60.0    // Reduce from 120s to 60s
case .network: return 180.0  // Reduce from 300s to 180s

  3. Long-term fix - Lower the cache eviction threshold so eviction kicks in sooner:

// In MemoryManager.swift
let evictionThreshold = 0.75  // Evict at 75% instead of 85%
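
For reference, the eviction threshold is just the fraction of the cache's memory budget at which eviction starts. A minimal sketch of the idea (hypothetical names and budget, not the actual MemoryManager code):

// Illustrative only: start evicting once usage crosses a fraction of the budget.
struct EvictionPolicy {
    var budgetBytes: Int
    var evictionThreshold: Double   // e.g. 0.75 = start evicting at 75% of budget

    func shouldEvict(currentBytes: Int) -> Bool {
        Double(currentBytes) >= Double(budgetBytes) * evictionThreshold
    }
}

let policy = EvictionPolicy(budgetBytes: 200 * 1_048_576, evictionThreshold: 0.75)
print(policy.shouldEvict(currentBytes: 160 * 1_048_576))  // true: 160MB is 80% of a 200MB budget

Lowering the threshold trades cache warmth for headroom: evictions start earlier, so memory stays lower but more operations miss the cache.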

Memory Reality Check:

  • Target: < 200MB steady state
  • With 10K resources: < 300MB
  • If using 500MB+: Something's wrong
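
If you prefer a programmatic check from Swift (macOS only; a minimal sketch using the Mach task_info API, not something Substation exposes), you can read the process's resident memory and compare it with the targets above:

import Foundation

// Resident memory of the current process via Mach task_info (Darwin only).
func residentMemoryMB() -> Double? {
    var info = mach_task_basic_info()
    var count = mach_msg_type_number_t(MemoryLayout<mach_task_basic_info>.size / MemoryLayout<natural_t>.size)
    let kr = withUnsafeMutablePointer(to: &info) { infoPtr in
        infoPtr.withMemoryRebound(to: integer_t.self, capacity: Int(count)) { rawPtr in
            task_info(mach_task_self_, task_flavor_t(MACH_TASK_BASIC_INFO), rawPtr, &count)
        }
    }
    guard kr == KERN_SUCCESS else { return nil }
    return Double(info.resident_size) / 1_048_576.0
}

if let rss = residentMemoryMB() {
    switch rss {
    case ..<200: print("\(Int(rss)) MB - within the steady-state target")
    case ..<300: print("\(Int(rss)) MB - acceptable with ~10K resources")
    case ..<500: print("\(Int(rss)) MB - higher than expected; consider purging caches")
    default:     print("\(Int(rss)) MB - something's wrong")
    }
}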

2. Slow API Response Times

Symptoms:

  • Operations take > 5 seconds
  • UI feels frozen
  • Watching paint dry would be more exciting

Causes (in order of likelihood):

  1. Your OpenStack API is slow (90% of cases)
  2. Network latency between you and OpenStack (8% of cases)
  3. Substation bug (2% of cases, report it)

Diagnosis:

# Enable wiretap mode to see ALL API calls
substation --cloud mycloud --wiretap

# Check the log file
tail -f ~/substation.log

# Look for:
# - API call duration (should be < 2s)
# - Retry attempts (exponential backoff in action)
# - 500 errors (OpenStack having a bad day)
# - Timeouts (OpenStack having a REALLY bad day)

Solutions:

If it's the OpenStack API (usually):

  1. Check OpenStack service health:

openstack endpoint list
openstack server list --all-projects --limit 1  # Test response time

  2. Check database connections on the OpenStack controller:

# On controller node
mysql -e "SHOW PROCESSLIST;" | wc -l  # Connection count

  3. Check load on OpenStack API nodes:

# On API nodes
top  # Check CPU usage
iostat  # Check disk I/O

  4. Consider scaling your OpenStack control plane:
      • Add more API workers
      • Add more database read replicas
      • Optimize database queries

  5. Accept that OpenStack is slow (sad but true)

If it's network latency:

  1. Measure latency:

ping your-openstack-api.com

  2. Check the network path:

traceroute your-openstack-api.com

  3. Consider running Substation closer to OpenStack:
      • Same datacenter
      • Same network segment
      • Direct connect instead of VPN

  4. Use a VPN or direct connect if traffic crosses the public internet

If it's Substation (unlikely but possible):

  1. Update to latest version
  2. Check GitHub issues: https://github.com/cloudnull/substation/issues
  3. Report issue with wiretap logs
  4. We'll investigate (we care about performance)

The Hard Truth About OpenStack Performance

OpenStack APIs are slow. This is a known issue. Years of discussion. Multiple summits. Countless patches. Still slow.

Why?

  • Database queries are expensive (especially with 50K servers)
  • Keystone auth adds overhead to every request
  • Neutron network queries involve complex joins
  • Nova compute queries hit multiple tables

What Substation Does:

  • Caches aggressively (60-80% API reduction)
  • Parallelizes where possible (search, batch ops)
  • Uses HTTP/2 connection pooling
  • Implements exponential backoff retry

But: If the API takes 5 seconds to respond, we can't make it 1 second. The bottleneck is OpenStack, not Substation.
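
For reference, the retry behaviour mentioned above is standard exponential backoff: wait a little, double the wait on each failure, give up after a few attempts. A minimal, self-contained sketch (illustrative defaults, not Substation's actual implementation):

import Foundation

// Illustrative only: retry a throwing operation with exponential backoff and jitter.
func withRetry<T>(maxAttempts: Int = 3,
                  baseDelay: TimeInterval = 0.5,
                  _ operation: () throws -> T) throws -> T {
    var attempt = 0
    while true {
        do {
            return try operation()
        } catch {
            attempt += 1
            guard attempt < maxAttempts else { throw error }
            // 0.5s, 1s, 2s, ... plus a little jitter to avoid synchronized retries.
            let delay = baseDelay * pow(2.0, Double(attempt - 1)) + Double.random(in: 0...0.1)
            Thread.sleep(forTimeInterval: delay)
        }
    }
}

// Usage (client and listServers are hypothetical):
// let servers = try withRetry { try client.listServers() }

Backoff smooths over transient 500s and timeouts, but it cannot hide a consistently slow API; each retry only adds to the total wait.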

3. Low Cache Hit Rates

Symptoms:

  • Cache hit rate < 60% (check health dashboard with h)
  • Performance feels slow despite caching
  • API calls happening too frequently

Causes:

  • TTLs too short (cache expires too quickly)
  • Resources changing very frequently
  • Cache eviction happening too often (memory pressure)
  • Using wrong cache keys (bug, report it)

Diagnosis:

# Check cache statistics in health dashboard
# Press 'h' in Substation

# Look for:
# - Hit rate < 60%: TTLs too short or high churn
# - High eviction count: Memory pressure
# - Miss rate > 40%: Cache not working

Solutions:

1. Increase TTLs (if environment is stable):

// In CacheManager.swift:100
case .server: return 300.0  // Increase from 2min to 5min
case .network: return 600.0  // Increase from 5min to 10min

Benefits: Fewer API calls, better hit rate
Risks: Staler data, slower to see new resources

2. Reduce memory pressure (if evictions are high):

  • Press c to purge stale data
  • Increase cache eviction threshold (evict at 90% instead of 85%)
  • Add more RAM to your system

3. Accept high churn (if resources change constantly):

  • Some environments are just chaotic
  • Production with auto-scaling = high churn
  • Lower hit rates are expected
  • 60% is acceptable in high-churn environments

Cache Hit Rate Reality:

  • Target: 80%+ in stable environments
  • Reality: 70%+ in production
  • Acceptable: 60%+ in chaotic environments
  • Concerning: < 60% (investigate)
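
To make those numbers concrete, hit rate is simply hits / (hits + misses). A tiny sketch that classifies a rate against the targets above (hypothetical types, not Substation's internal API):

// Illustrative only.
struct CacheStats {
    var hits: Int
    var misses: Int

    var hitRate: Double {
        let total = hits + misses
        return total == 0 ? 0 : Double(hits) / Double(total)
    }
}

let stats = CacheStats(hits: 720, misses: 280)   // 72% hit rate
switch stats.hitRate {
case 0.8...:    print("Great: stable-environment target met")
case 0.7..<0.8: print("Fine: typical for production")
case 0.6..<0.7: print("Acceptable in high-churn environments")
default:        print("Concerning: investigate TTLs, evictions, and churn")
}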

4. Poor Search Performance

Symptoms:

  • Searches take > 2 seconds consistently
  • Search results appear slowly
  • Some services timeout (partial results)

Causes:

  • OpenStack APIs are slow (again, usually this)
  • Too many resources to search through
  • Network latency
  • Search cache not effective

Diagnosis:

# Enable wiretap to see search API calls
substation --cloud mycloud --wiretap

# Check logs for:
# - Which services are slow
# - Timeout messages
# - API response times

tail -f ~/substation.log | grep -i search

Solutions:

1. Check OpenStack service health:

# Nova slow?
openstack server list --limit 1  # Test response time

# Neutron slow?
openstack network list --limit 1

# All services slow?
openstack token issue  # Test Keystone auth

2. Review search patterns:

  • Use more specific queries ("prod-web-01" not just "prod")
  • Filter by specific services if you know which one
  • Accept partial results on timeout

3. Check search cache:

  • In the health dashboard (h key), check:
      • Search cache hit rate (target: 70%)
      • Search cache size
      • Recent searches
  • Searches are cached (repeat searches should be instant)

The 5-Second Search Timeout

We timeout searches at 5 seconds. This is intentional.

Why?

  • If OpenStack can't respond in 5 seconds, something's wrong
  • Better to show partial results than wait forever
  • Operator waiting > 5 seconds = operator rage

When it happens:

  • Service is down (partial results, missing that service)
  • Service is overloaded (partial results, missing that service)
  • Network is broken (partial results, timeouts)

What to do:

  • Check the service that timed out
  • Review OpenStack logs for that service
  • Consider this a canary (something's wrong with OpenStack)
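
To make the partial-results behaviour concrete, here is a minimal sketch of a fan-out search with a hard deadline (made-up service names and placeholder results, not Substation's actual search code):

import Foundation

// Illustrative only: query services concurrently, keep whatever arrived within
// 5 seconds, and report the rest as timed out.
let services = ["nova", "neutron", "cinder"]
let queue = DispatchQueue(label: "search", attributes: .concurrent)
let group = DispatchGroup()
let lock = NSLock()
var results: [String: [String]] = [:]

for service in services {
    queue.async(group: group) {
        let found = ["example-resource"]   // stand-in for a real API call
        lock.lock(); results[service] = found; lock.unlock()
    }
}

_ = group.wait(timeout: .now() + 5)        // hard 5-second deadline
lock.lock()
let missing = services.filter { results[$0] == nil }
lock.unlock()
print("Results from \(services.count - missing.count)/\(services.count) services; timed out: \(missing)")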

5. UI Rendering Issues

Symptoms:

  • UI feels sluggish or janky
  • Frame rate < 30 FPS
  • Screen updates delayed

Causes:

  • Terminal performance issues
  • SSH connection latency
  • Too many screen updates
  • Rendering overhead

Diagnosis:

# Check rendering metrics
# Press 'h' for health dashboard
# Look at "Rendering FPS" metric

# Target: 60 FPS
# Acceptable: 30+ FPS
# Poor: < 30 FPS

Solutions:

1. Reduce auto-refresh frequency:

  • Press a to toggle auto-refresh
  • Increase interval from 5s to 10s or 30s
  • Manual refresh with r when needed
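
As a rough illustration of why this helps, every tick of the auto-refresh timer triggers a data reload and a redraw, so widening the interval from 5s to 30s cuts that work sixfold. A minimal sketch (hypothetical names; not Substation's actual refresh loop):

import Foundation

// Illustrative only: a repeating refresh timer with a configurable interval.
final class AutoRefresh {
    private var timer: Timer?

    func start(every interval: TimeInterval, _ refresh: @escaping () -> Void) {
        timer?.invalidate()
        timer = Timer.scheduledTimer(withTimeInterval: interval, repeats: true) { _ in refresh() }
    }

    func stop() {                      // roughly what toggling 'a' does
        timer?.invalidate()
        timer = nil
    }
}

// Usage (reloadVisibleList is hypothetical):
// let refresher = AutoRefresh()
// refresher.start(every: 30) { reloadVisibleList() }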

2. Check terminal performance:

  • Try different terminal emulator
  • Check SSH connection latency
  • Consider tmux/screen for session persistence

3. Reduce data volume:

  • Filter lists to show fewer items
  • Use pagination
  • Limit detail view depth

4. Check system resources:

# CPU usage
top

# Memory usage
free -h

# Disk I/O
iostat

6. Authentication Timeouts

Symptoms:

  • "Authentication failed" errors
  • Timeouts during login
  • Token expiration errors

Causes:

  • Keystone service slow or down
  • Network latency
  • Token TTL too short
  • Invalid credentials

Diagnosis:

# Test Keystone directly
openstack token issue

# Check Keystone service health
curl -I https://keystone.example.com:5000/v3

# Check auth logs
tail -f ~/substation.log | grep -i auth

Solutions:

1. Check Keystone health:

# On Keystone controller
systemctl status apache2  # or httpd
journalctl -u apache2 -f  # Check logs

2. Increase timeouts:

OpenStackConfig(
    authUrl: "https://keystone.example.com:5000/v3",
    timeout: 60,  // Increase from 30s
    retryCount: 5  // Increase from 3
)

3. Check token cache:

  • Authentication tokens cached for 1 hour
  • Press c to clear cache if tokens seem stale
  • Verify token TTL in Keystone configuration
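
For context, token caching just means reusing a Keystone token until it is close to expiry instead of re-authenticating on every request. A minimal sketch (hypothetical types, not Substation's actual auth code):

import Foundation

// Illustrative only: reuse a cached token until shortly before it expires.
struct CachedToken {
    let value: String
    let expiresAt: Date

    // Treat the token as stale a minute early to avoid racing the expiry.
    var isValid: Bool { expiresAt.timeIntervalSinceNow > 60 }
}

final class TokenCache {
    private var current: CachedToken?

    func token(issuing issue: () throws -> CachedToken) rethrows -> String {
        if let cached = current, cached.isValid {
            return cached.value        // cache hit: no round trip to Keystone
        }
        let fresh = try issue()        // cache miss: authenticate again
        current = fresh
        return fresh.value
    }

    func purge() { current = nil }     // roughly what pressing 'c' does for stale tokens
}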

4. Verify credentials:

# Test with CLI
openstack --os-cloud mycloud server list

Performance Monitoring Checklist

Daily Checks

  • Check health dashboard (h key)
  • Verify cache hit rate > 60%
  • Verify memory usage < 300MB
  • Check for API timeouts
  • Review search performance

Weekly Reviews

  • Run full benchmark suite
  • Compare with baseline metrics
  • Review performance trends
  • Check for regressions
  • Update documentation

Monthly Maintenance

  • Review TTL configurations
  • Optimize cache sizes
  • Clean up old benchmark data
  • Update performance baselines
  • Plan optimization work

When to Seek Help

Report issues when:

  • Performance degrades 10%+ suddenly
  • Solutions in this guide don't help
  • You suspect a Substation bug
  • Benchmarks show scores < 0.6

Where to report:

  • GitHub issues: https://github.com/cloudnull/substation/issues

Information to provide:

  • Substation version
  • OpenStack version
  • Resource counts (servers, networks, etc.)
  • Performance metrics
  • Logs with --wiretap enabled

Quick Diagnosis Flowchart

Performance Issue
    │
    ├─ Memory High?
    │   ├─ Yes → Press 'c' to purge cache → Reduce TTLs → Lower eviction threshold
    │   └─ No → Continue
    │
    ├─ API Slow?
    │   ├─ Yes → Check OpenStack health → Check network → Enable wiretap
    │   └─ No → Continue
    │
    ├─ Cache Hit Rate Low?
    │   ├─ Yes → Increase TTLs → Reduce evictions → Check for high churn
    │   └─ No → Continue
    │
    ├─ Search Slow?
    │   ├─ Yes → Check service health → Use specific queries → Check cache
    │   └─ No → Continue
    │
    └─ UI Sluggish?
        ├─ Yes → Reduce auto-refresh → Check terminal → Reduce data volume
        └─ No → Report issue (might be a bug)

See Also:

Remember: Most performance issues are OpenStack-side, not Substation-side. Always check OpenStack service health first.