Add unified Cache-layer management for GLM-5 DSA Indexer keys #45595
louzongzhi wants to merge 1 commit into huggingface:main
Conversation
[For maintainers] Suggested jobs to run (before merge): run-slow: glm_moe_dsa
cc @ArthurZucker @vasqu, who reviewed the last issue (#45424)!
Hi @Rocketknight1 @ArthurZucker @vasqu, thanks for the attention! I'd like to clarify the background of IndexCache and the context of this PR.

IndexCache is a mechanism explicitly described in the GLM-5 papers and official code. It allows Shared (S) layers to reuse top-k indices from Full (F) layers to accelerate inference. You can find the details in arXiv:2602.15763 and arXiv:2603.12201, and the official implementation is at THUDM/IndexCache.

The GLM-5 implementation in transformers did not include IndexCache until #45424. Without it, Shared layers were unable to reuse indices, which deviated from the official behavior described in the papers. In #45424, I added IndexCache support following the official implementation, so the functional behavior now aligns with the official repo.

A side note on scope: this PR (#45595) does not change the functional logic of IndexCache. That logic remains the same as in the THUDM official implementation (i.e., the behavior as of my #45424 submission). What this PR does is migrate the cache management from the internal `register_buffer` into the standard `past_key_values` (`Cache`) infrastructure.

If anyone has questions about the IndexCache behavior or the migration approach, I'm happy to explain further. Thanks!
What does this PR do?
This PR migrates the GLM-5 DSA IndexCache key cache from a self-managed `register_buffer` (as introduced in #45424) into the standard `past_key_values` (`Cache`) per-layer infrastructure. Indexer keys now share the same lifecycle as attention KV caches, enabling transparent support for beam-search reordering, cache cropping, batch selection/repetition, and offloading.

Background & Motivation
GLM-5 integrates the DeepSeek Sparse Attention (DSA) Indexer. In #45424, we added support for the IndexCache mechanism (THUDM/IndexCache, arXiv:2603.12201) to accelerate inference by caching indexer keys and allowing Shared (S) layers to reuse indices from Full (F) layers.
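As a conceptual illustration of the index-reuse idea (my own sketch, not the official IndexCache code): a Full (F) layer selects the top-k cached positions from its indexer scores, and a Shared (S) layer gathers keys at those same positions instead of recomputing its own selection. The function names here are hypothetical.

```python
import torch

def full_layer_topk(scores: torch.Tensor, k: int) -> torch.Tensor:
    # scores: (batch, num_cached_positions) indexer relevance scores.
    # A Full (F) layer picks its k most relevant cached positions.
    return scores.topk(k, dim=-1).indices  # (batch, k)

def shared_layer_gather(keys: torch.Tensor, reused_idx: torch.Tensor) -> torch.Tensor:
    # A Shared (S) layer reuses those positions instead of recomputing top-k.
    # keys: (batch, num_cached_positions, head_dim)
    batch, _, dim = keys.shape
    idx = reused_idx.unsqueeze(-1).expand(batch, reused_idx.shape[-1], dim)
    return keys.gather(1, idx)  # (batch, k, head_dim)

idx = full_layer_topk(torch.rand(2, 16), k=4)
selected = shared_layer_gather(torch.rand(2, 16, 8), idx)
assert selected.shape == (2, 4, 8)
```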
However, the implementation in #45424 managed these cached keys inside `GlmMoeDsaIndexer` via `self.register_buffer("_cached_keys", None, persistent=False)`. While this works for basic generation, it creates architectural friction when building downstream models on top of GLM-5:

- `Cache.reset`, `reorder_cache`, `crop`, `batch_repeat_interleave`, and `batch_select_indices` do not propagate to the Indexer's isolated buffer.
- Downstream code must re-implement lifecycle handling that `DynamicCache` already does.

To simplify our own downstream model code and benefit the broader GLM-5 ecosystem, we are upstreaming this unification first.
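The friction can be demonstrated in a few lines (a simplified sketch with stand-in tensors, not the real classes): a tensor managed by the `Cache` gets permuted on beam reordering, while a module-local buffer standing in for `_cached_keys` is left untouched.

```python
import torch

# One row per beam so reordering is visible.
kv_cache = torch.arange(4.0).reshape(2, 2)  # lives in past_key_values
indexer_buf = kv_cache.clone()              # module-local register_buffer

beam_idx = torch.tensor([1, 0])
kv_cache = kv_cache.index_select(0, beam_idx)  # Cache.reorder_cache reaches this

# Nothing reorders the isolated buffer: the two are now out of sync,
# which would silently corrupt beam search.
assert not torch.equal(kv_cache, indexer_buf)
```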
Changes

`src/transformers/cache_utils.py`

- `CacheLayerMixin`: added an `indexer_keys` attribute and an abstract method `update_cached_keys()`.
- `DynamicLayer`: implemented `update_cached_keys()` by concatenating along the sequence dimension (`dim=1`); synchronized `crop`, `batch_repeat_interleave`, `batch_select_indices`, `reset`, and `reorder_cache` to also operate on `indexer_keys`.
- `StaticLayer`: added a passthrough `update_cached_keys()` returning the input as-is.
- `Cache`: added `update_cached_keys(cached_keys, layer_idx)` and `reset_cached_keys(layer_idx)` to dispatch per layer, mirroring the existing `update()` / `reset()` API.

`src/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py`

- `GlmMoeDsaIndexer`: removed `self.register_buffer("_cached_keys", None, persistent=False)`. `forward()` now accepts `past_key_values: Cache | None` instead of `use_cache: bool`. Prefill (`seq_len > 1`) triggers `past_key_values.reset_cached_keys(layer_idx)` before computing scores; decode (`seq_len == 1`) appends keys via `past_key_values.update_cached_keys(k, layer_idx)`.
- `GlmMoeDsaAttention`: updated the `self.indexer(...)` call to pass `past_key_values=past_key_values` directly.

Behavior equivalence
| Before (#45424) | After (this PR) |
|---|---|
| `if seq_len > 1: self._cached_keys = None` | `if seq_len > 1: past_key_values.reset_cached_keys(layer_idx)` |
| `if use_cache: cat([self._cached_keys, k])` | `if past_key_values is not None: past_key_values.update_cached_keys(k, layer_idx)` |
| `else: k_cached = k` | `else: k_cached = k` |
Verified locally with a tiny `hidden_size=256` config:

- `reorder_cache`
- `batch_select_indices` / `batch_repeat_interleave`
- `crop` truncation
- the `skip_topk` reuse path
- the no-cache path (`past_key_values=None`)
- `Cache.reset()` clears `indexer_keys`
- `indexer_keys` accumulate correctly across turns

Backward compatibility

- `generate()` behavior is unchanged.
- The `_cached_keys` buffer was never part of saved state (`persistent=False`), so existing checkpoints remain fully compatible.
We are actively building a new LLM architecture on top of the GLM-5 backbone, leveraging the DSA Indexer and IndexCache design. Unifying the Indexer cache into the standard `Cache` layer is prerequisite infrastructure work for open-sourcing that model. We will share more details once training converges. Stay tuned!
Fixes # (N/A)
Code Agent Policy
The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.
PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.
This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read
CONTRIBUTING.md.

Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of who to tag.
Please tag fewer than 3 people.