
[Bug report] Gravitino-server Metaspace OOM after long-running #10844

@freesinger

Description


Version

main branch (v1.1.0 actually)

Describe what's wrong

Gravitino-server can hit java.lang.OutOfMemoryError: Metaspace after running for a long time. Once that happens, requests may start failing with 401/500 responses (often with an empty or truncated body), which is misleading because the underlying cause is the JVM OOM.

Error message and/or stacktrace

  • lance rest
{
    "error": "Unable to process: Received HTTP 500 response with empty body",
    "code": 500,
    "type": "RESTException",
    "detail": "org.apache.gravitino.exceptions.RESTException: Unable to process: Received HTTP 500 response with empty body\n\tat org.apache.gravitino.client.ErrorHandlers$RestErrorHandler.accept(ErrorHandlers.java:1333)\n\tat org.apache.gravitino.client.ErrorHandlers$CatalogErrorHandler.accept(ErrorHandlers.java:549)\n\tat org.apache.gravitino.client.ErrorHandlers$CatalogErrorHandler.accept(ErrorHandlers.java:488)\n\tat 
....",
    "instance": "demo_catalog2$demo_schema"
}
  • gravitino
2026-04-22 15:22:17.699
WARN [Gravitino-webserver-41] [org.apache.gravitino.utils.PrincipalUtils.doAs(PrincipalUtils.java:50)] - doAs method occurs an unexpected error
java.lang.OutOfMemoryError: Metaspace

How to reproduce

Capturing jcmd output from an OOM'ed process showed many long-lived org.apache.gravitino.hive.client.HiveClientClassLoader instances, each retaining hundreds of classes, which is consistent with classloader churn combined with blocked class unloading.

[Screenshot: jcmd output showing many retained HiveClientClassLoader instances]
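Class and Metaspace growth can also be tracked from inside the process via the JVM's standard management beans, without a heap dump. This is a generic monitoring sketch, not Gravitino code:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceProbe {
    public static void main(String[] args) {
        // A steadily growing gap between loaded and unloaded class counts,
        // while the server only serves repeat traffic, suggests classloader
        // churn with blocked unloading.
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        System.out.printf("classes loaded=%d unloaded=%d live=%d%n",
            cl.getTotalLoadedClassCount(),
            cl.getUnloadedClassCount(),
            cl.getLoadedClassCount());

        // Metaspace usage is exposed as one of the memory pool beans.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                System.out.printf("Metaspace used=%d bytes%n",
                    pool.getUsage().getUsed());
            }
        }
    }
}
```

Polling this periodically (or scraping the same beans via JMX) gives a growth curve that corroborates the jcmd snapshots.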

Additional context

We suspect the root cause is Hive client pool cache misses/churn, amplified by frequent token refreshes, which repeatedly create isolated Hive-client classloaders:

  • We use a custom cloud IAM-based Hive authenticator which fetches/refreshes a short-lived token and injects it into the Hive client configuration.
  • If the token (or derived config) participates in the Hive client pool cache key, each refresh results in a new key, causing cache misses and continuous creation of new HiveClientFactory / HiveClientClassLoader.
  • Even if old pools are evicted, class unloading may still be blocked by global/static caches, ThreadLocals, shutdown hooks, etc., leading to Metaspace growth and eventually OOM.
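If the pool key does incorporate the refreshed token, one mitigation is to derive the cache key only from the stable parts of the configuration, so a token refresh reuses the pooled client and its classloader. A minimal sketch, where the property names (`hive.metastore.token`, `iam.session.token`) are illustrative, not Gravitino's actual keys:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: build a Hive client pool cache key from stable
// configuration only, excluding short-lived credentials so token refreshes
// do not produce a new key (and thus a new isolated classloader).
public class PoolKeys {
    // Properties assumed to change on every token refresh (illustrative names).
    private static final Set<String> VOLATILE =
        Set.of("hive.metastore.token", "iam.session.token");

    public static String stableKey(Map<String, String> conf) {
        // TreeMap gives a deterministic ordering for the key string.
        TreeMap<String, String> filtered = new TreeMap<>(conf);
        filtered.keySet().removeAll(VOLATILE);
        return filtered.toString();
    }
}
```

With this shape, two configurations that differ only in the token map to the same pool entry; the refreshed token would then be injected into the existing client rather than keyed into the cache.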

This matches the “classloader cannot be reclaimed/unloaded” behavior that is known to happen in Hive/Hadoop ecosystems when classloader-bound resources are not fully cleaned.
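Even with eviction in place, an isolated classloader's Metaspace is only reclaimed once GC can prove the loader unreachable, i.e. no live thread, static cache, ThreadLocal, or shutdown hook still references it or any class it defined. A generic sketch of eviction-time cleanup (illustrative, not Gravitino's actual code):

```java
import java.net.URL;
import java.net.URLClassLoader;

// Illustrative wrapper for an isolated client classloader. Closing it
// releases the loader's open jar handles, but the Metaspace it backs is
// freed only after GC proves the loader and its classes unreachable.
final class IsolatedClient implements AutoCloseable {
    private final URLClassLoader loader;

    IsolatedClient(URL[] classpath) {
        // Parent = null isolates the client's classes from the server.
        this.loader = new URLClassLoader(classpath, null);
    }

    @Override
    public void close() throws Exception {
        // Eviction must also clear any loader-bound state (registered
        // shutdown hooks, ThreadLocals, entries in global caches) before
        // this loader can actually be unloaded.
        loader.close();
    }
}
```

The comments mark the cleanup steps that typically get missed in Hive/Hadoop ecosystems; closing the loader alone is not sufficient if such references remain.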
