Skip to content

[BUG]: SSL context cache thrash, missing test assertion and OAuth error handling inconsistencies #4239

@bogdanmariusc10

Description

@bogdanmariusc10

🐞 Bug Summary

SSL context cache performs full wipe on overflow causing cache thrash, and OAuth error handling is inconsistent across token exchange flows.


🧩 Affected Component

Select the area of the project impacted:

  • mcpgateway - API
  • mcpgateway - UI (admin panel)
  • mcpgateway.wrapper - stdio wrapper
  • Federation or Transports
  • CLI, Makefiles, or shell scripts
  • Container setup (Docker/Podman/Compose)
  • Other (explain below)

🔁 Steps to Reproduce

Issue 1 - SSL Context Cache Thrash:

  1. Configure >100 gateways with distinct custom CA certificates
  2. Generate steady traffic rotating through these gateways
  3. Observe cache behavior when 101st unique CA cert is encountered
  4. Monitor cache hit rates and SSL context creation overhead

Issue 2 - OAuth Error Handling Inconsistency:

  1. Compare error handling in refresh_token vs _client_credentials_flow
  2. Note manual status_code == 200 check vs raise_for_status() call
  3. Observe different error handling patterns across OAuth flows

🤔 Expected Behavior

Issue 1: SSL context cache should use LRU (Least Recently Used) eviction when reaching capacity, not clear all entries

Issue 2: All OAuth flows should use consistent error handling pattern (preferably raise_for_status())


📓 Logs / Error Output

Issue 1 - SSL Context Cache (mcpgateway/utils/ssl_context_cache.py:200-215):

def get_cached_ssl_context(...):
    # ... cache key generation ...
    
    if cache_key in _ssl_context_cache:
        entry = _ssl_context_cache[cache_key]
        if not _is_cache_entry_expired(entry):
            return entry["context"]
        del _ssl_context_cache[cache_key]
    
    # ⚠️ Cache overflow: full wipe instead of LRU eviction
    if len(_ssl_context_cache) >= _SSL_CONTEXT_CACHE_SIZE:
        _ssl_context_cache.clear()  # Clears ALL entries
    
    # ... create new SSL context ...

Impact: Every insert beyond the 100-entry limit clears all prior entries, causing:

  • Cache thrash under steady traffic with >100 distinct CA certs
  • Repeated SSL context creation overhead
  • Degraded performance for high-traffic deployments

Issue 2 - Inconsistent Error Handling (mcpgateway/services/oauth_manager.py:1383):

# refresh_token - manual status check
if response.status_code == 200:
    # ...
else:
    raise OAuthError(...)

# _client_credentials_flow and _password_flow - raise_for_status()
response.raise_for_status()

🧠 Environment Info

Key Value
Version or commit main (pre-existing issues)
Runtime Python 3.11+
Platform / OS All platforms
Container N/A

🧩 Additional Context

Discovered during: PR #4048 review (OAuth CA certificate propagation)

Severity:

  • Issue 1 (SSL Cache): High - Affects production deployments with many gateways using custom CA certificates
  • Issue 2 (OAuth Consistency): Medium - Pre-existing asymmetry that should be unified for maintainability

Suggested Fixes:

Issue 1 - SSL Cache LRU Eviction:

from collections import OrderedDict

_ssl_context_cache: OrderedDict[str, Dict[str, Any]] = OrderedDict()

def get_cached_ssl_context(...):
    # ... cache key generation ...
    
    if cache_key in _ssl_context_cache:
        entry = _ssl_context_cache[cache_key]
        if not _is_cache_entry_expired(entry):
            # Move to end (mark as recently used)
            _ssl_context_cache.move_to_end(cache_key)
            return entry["context"]
        del _ssl_context_cache[cache_key]
    
    # LRU eviction: remove oldest entry
    if len(_ssl_context_cache) >= _SSL_CONTEXT_CACHE_SIZE:
        _ssl_context_cache.popitem(last=False)  # Remove oldest
    
    # ... create new SSL context ...
    _ssl_context_cache[cache_key] = {
        "context": ssl_context,
        "created_at": datetime.now()
    }

Issue 2 - Unify OAuth Error Handling:

# Standardize all OAuth flows to use raise_for_status()
async def refresh_token(...):
    # ...
    response.raise_for_status()  # Instead of manual status_code check
    # ...

Metadata

Metadata

Labels

SHOULDP2: Important but not vital; high-value items that are not crucial for the immediate releaseapiREST API Related itembugSomething isn't working

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions