
[Bug report] Gravitino-server Metaspace OOM after long-running #10844

@freesinger

Description


Version

main branch (v1.1.0 actually)

Describe what's wrong

Gravitino-server can hit java.lang.OutOfMemoryError: Metaspace after running for a long time. Once that happens, requests may start failing with 401/500 responses (often with an empty or truncated body), which is misleading because the underlying cause is the JVM OOM.

Error message and/or stacktrace

  • lance rest
{
    "error": "Unable to process: Received HTTP 500 response with empty body",
    "code": 500,
    "type": "RESTException",
    "detail": "org.apache.gravitino.exceptions.RESTException: Unable to process: Received HTTP 500 response with empty body\n\tat org.apache.gravitino.client.ErrorHandlers$RestErrorHandler.accept(ErrorHandlers.java:1333)\n\tat org.apache.gravitino.client.ErrorHandlers$CatalogErrorHandler.accept(ErrorHandlers.java:549)\n\tat org.apache.gravitino.client.ErrorHandlers$CatalogErrorHandler.accept(ErrorHandlers.java:488)\n\tat 
....",
    "instance": "demo_catalog2$demo_schema"
}
  • gravitino
2026-04-22 15:22:17.699
WARN [Gravitino-webserver-41] [org.apache.gravitino.utils.PrincipalUtils.doAs(PrincipalUtils.java:50)] - doAs method occurs an unexpected error
java.lang.OutOfMemoryError: Metaspace

How to reproduce

Capturing jcmd output from an OOM'ed process showed many long-lived org.apache.gravitino.hive.client.HiveClientClassLoader instances, each retaining hundreds of classes, which is consistent with classloader churn combined with blocked class unloading.

[Screenshot: jcmd output showing many retained HiveClientClassLoader instances]
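Class and Metaspace growth can also be tracked from inside the process via the JVM's standard management beans, without a heap dump. This is a generic monitoring sketch, not Gravitino code:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceProbe {
    public static void main(String[] args) {
        // A steadily growing gap between loaded and unloaded class counts,
        // while the server only serves repeat traffic, suggests classloader
        // churn with blocked unloading.
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        System.out.printf("classes loaded=%d unloaded=%d live=%d%n",
            cl.getTotalLoadedClassCount(),
            cl.getUnloadedClassCount(),
            cl.getLoadedClassCount());

        // Metaspace usage is exposed as one of the memory pool beans.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                System.out.printf("Metaspace used=%d bytes%n",
                    pool.getUsage().getUsed());
            }
        }
    }
}
```

Polling this periodically (or scraping the same beans via JMX) gives a growth curve that corroborates the jcmd snapshots.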

Additional context

We suspect the root cause is Hive client pool cache misses/churn, amplified by frequent token refreshes, which repeatedly create isolated Hive-client classloaders:

  • We use a custom cloud IAM-based Hive authenticator which fetches/refreshes a short-lived token and injects it into the Hive client configuration.
  • If the token (or derived config) participates in the Hive client pool cache key, each refresh results in a new key, causing cache misses and continuous creation of new HiveClientFactory / HiveClientClassLoader.
  • Even if old pools are evicted, class unloading may still be blocked by global/static caches, ThreadLocals, shutdown hooks, etc., leading to Metaspace growth and eventually OOM.
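If the pool key does incorporate the refreshed token, one mitigation is to derive the cache key only from the stable parts of the configuration, so a token refresh reuses the pooled client and its classloader. A minimal sketch, where the property names (`hive.metastore.token`, `iam.session.token`) are illustrative, not Gravitino's actual keys:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: build a Hive client pool cache key from stable
// configuration only, excluding short-lived credentials so token refreshes
// do not produce a new key (and thus a new isolated classloader).
public class PoolKeys {
    // Properties assumed to change on every token refresh (illustrative names).
    private static final Set<String> VOLATILE =
        Set.of("hive.metastore.token", "iam.session.token");

    public static String stableKey(Map<String, String> conf) {
        // TreeMap gives a deterministic ordering for the key string.
        TreeMap<String, String> filtered = new TreeMap<>(conf);
        filtered.keySet().removeAll(VOLATILE);
        return filtered.toString();
    }
}
```

With this shape, two configurations that differ only in the token map to the same pool entry; the refreshed token would then be injected into the existing client rather than keyed into the cache.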

This matches the “classloader cannot be reclaimed/unloaded” behavior that is known to happen in Hive/Hadoop ecosystems when classloader-bound resources are not fully cleaned.
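Even with eviction in place, an isolated classloader's Metaspace is only reclaimed once GC can prove the loader unreachable, i.e. no live thread, static cache, ThreadLocal, or shutdown hook still references it or any class it defined. A generic sketch of eviction-time cleanup (illustrative, not Gravitino's actual code):

```java
import java.net.URL;
import java.net.URLClassLoader;

// Illustrative wrapper for an isolated client classloader. Closing it
// releases the loader's open jar handles, but the Metaspace it backs is
// freed only after GC proves the loader and its classes unreachable.
final class IsolatedClient implements AutoCloseable {
    private final URLClassLoader loader;

    IsolatedClient(URL[] classpath) {
        // Parent = null isolates the client's classes from the server.
        this.loader = new URLClassLoader(classpath, null);
    }

    @Override
    public void close() throws Exception {
        // Eviction must also clear any loader-bound state (registered
        // shutdown hooks, ThreadLocals, entries in global caches) before
        // this loader can actually be unloaded.
        loader.close();
    }
}
```

The comments mark the cleanup steps that typically get missed in Hive/Hadoop ecosystems; closing the loader alone is not sufficient if such references remain.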
