Implement the text-only portion of the Google DeepMind Gemma 4 architecture:

- Hybrid attention: alternating sliding window and full attention layers
- Dual RoPE: proportional RoPE for full attention, default for sliding
- Per-Layer Embeddings (PLE): per-layer token-dependent gating
- KV sharing: later layers reuse KV from earlier layers of the same type
- Q/K/V normalization: RMS normalization on query, key, and value
- Per-layer scalar: learned scaling factor per transformer block
- Optional MoE: mixture-of-experts FFN blocks (26B-A4B variant)

Architectures: :base (Gemma4TextModel), :for_causal_language_modeling (Gemma4ForCausalLM). The multimodal Gemma4ForConditionalGeneration is not yet supported.

Uses a custom decoder loop rather than Layers.Transformer.blocks/2 because the model requires features not available in the shared infrastructure: per-layer embeddings threaded through the block loop, cross-block KV sharing state, per-layer head dimension variation, and value normalization.

Includes an integration test verified against Python transformers reference values (atol < 5e-5).
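The dual RoPE scheme can be sketched as follows: sliding-window layers use a default rotary base while full-attention layers use a rescaled one. This is a minimal NumPy sketch; the bases (10,000 and 1,000,000), the head dimension, and the exact "proportional" scaling scheme are illustrative assumptions, not the checkpoint's actual configuration.

```python
import numpy as np

def rope_angles(positions, head_dim, base):
    # Standard RoPE frequency schedule: one inverse frequency per pair of dims.
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)  # (seq_len, head_dim // 2)

def apply_rope(x, angles):
    # x: (seq_len, head_dim); rotate each (even, odd) dimension pair.
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

positions = np.arange(8)
q = np.random.randn(8, 16)
# Sliding layers: default base. Full layers: larger base (assumed values),
# stretching the rotation periods for long-range attention.
q_sliding = apply_rope(q, rope_angles(positions, 16, base=10_000.0))
q_full = apply_rope(q, rope_angles(positions, 16, base=1_000_000.0))
```

Since RoPE is a pure rotation, it preserves vector norms and leaves position 0 unchanged regardless of the base.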
Adds `Bumblebee.Text.Gemma4`, implementing the text-only portion of the Gemma 4 architecture from Google DeepMind, supporting both the E4B (4.5B dense) and the 26B-A4B (Mixture-of-Experts) variants.

### Architecture features

- Hybrid attention: alternating sliding window and full attention layers
- Dual RoPE: proportional RoPE for full attention, default for sliding
- Per-Layer Embeddings (PLE): per-layer token-dependent gating
- KV sharing: later layers reuse KV from earlier layers of the same type
- Q/K/V normalization: RMS normalization on query, key, and value
- Per-layer scalar: learned scaling factor per transformer block
- Per-layer head dimension variation (`attention_head_size` / `global_attention_head_size`)
- Optional MoE: mixture-of-experts FFN blocks (26B-A4B variant)
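The interplay of hybrid attention and KV sharing can be made concrete with a small schedule-derivation sketch. This is an illustrative model, not Bumblebee's code: the sliding-to-full ratio and the number of KV-sharing layers are assumed parameters standing in for whatever the checkpoint config specifies.

```python
def layer_schedule(num_layers, sliding_pattern=5, num_kv_shared_layers=4):
    """Return (attention_type, kv_source_layer) for each layer.

    Every `sliding_pattern`-th layer uses full attention; the rest use
    sliding-window attention. The final `num_kv_shared_layers` layers do
    not compute their own KV: they reuse the KV of the most recent
    earlier (non-shared) layer of the same attention type.
    """
    types = ["full" if (i + 1) % sliding_pattern == 0 else "sliding"
             for i in range(num_layers)]
    first_shared = num_layers - num_kv_shared_layers
    schedule = []
    for i, t in enumerate(types):
        if i < first_shared:
            schedule.append((t, i))  # computes its own KV
        else:
            # reuse KV from the last non-shared layer of the same type
            src = max(j for j in range(first_shared) if types[j] == t)
            schedule.append((t, src))
    return schedule

schedule = layer_schedule(10, sliding_pattern=5, num_kv_shared_layers=4)
```

With these assumed parameters, layers 7–9 (0-indexed 6–8) are sliding layers that reuse layer 5's KV, and the last full layer reuses layer 4's KV.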
### Architectures

- `:base` — `Gemma4TextModel`
- `:for_causal_language_modeling` — `Gemma4ForCausalLM`

The multimodal `Gemma4ForConditionalGeneration` is not yet supported.
### Custom decoder loop

Uses a custom decoder loop rather than `Layers.Transformer.blocks/2` because the model requires features not available in the shared infrastructure: per-layer embeddings (PLE) threaded through the block loop, cross-block KV sharing state, per-layer head dimension variation, a per-layer scalar, and value normalization.
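To illustrate why a generic block helper falls short, here is a minimal sketch of a decoder loop that threads the extra state (per-layer embeddings, a cross-block KV cache, per-layer scalars). All shapes and the `kv_proj`/`attend`/`gate` callables are stand-ins for illustration, not Bumblebee's actual API.

```python
import numpy as np

def decoder(hidden, per_layer_embeds, layers):
    # kv_cache is keyed by source layer index, so a later block can reuse
    # KV computed by an earlier block of the same attention type.
    kv_cache = {}
    for i, layer in enumerate(layers):
        src = layer["kv_source"]
        if src in kv_cache:
            k, v = kv_cache[src]          # KV sharing: reuse earlier KV
        else:
            k, v = layer["kv_proj"](hidden)
            kv_cache[src] = (k, v)
        attn_out = layer["attend"](hidden, k, v)
        # PLE: per-layer token-dependent gating of the block output,
        # followed by the learned per-layer scalar.
        gated = attn_out * layer["gate"](per_layer_embeds[i])
        hidden = hidden + layer["scalar"] * gated
    return hidden
```

Exactly this kind of loop-carried state (`kv_cache`, indexed `per_layer_embeds`) is what a shared `blocks/2`-style abstraction has no hook for.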
### Registry entries

- `Gemma4ForCausalLM` → `{Bumblebee.Text.Gemma4, :for_causal_language_modeling}`
- `Gemma4TextModel` → `{Bumblebee.Text.Gemma4, :base}`
- `gemma4` / `gemma4_text` → `:gemma` tokenizer type
### Testing

Integration test verified against Python transformers reference values (atol < 5e-5) using a tiny-random checkpoint.

Note for maintainers: the integration test references `{:hf, "bumblebee-testing/tiny-random-Gemma4ForCausalLM"}`. The checkpoint can be generated using this script:

Unit tests cover config loading (E4B and 26B MoE) and the forward pass (sliding/full attention, partial rotary, MoE, masking, softcapping).
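The tolerance check the integration test performs can be sketched on the Python side as an elementwise comparison against precomputed reference logits. The numbers below are dummies; in the real test the reference values come from running the same tiny-random checkpoint through Python transformers.

```python
import numpy as np

# Dummy stand-ins for reference logits (from transformers) and candidate
# logits (from Bumblebee), deviating by far less than the 5e-5 tolerance.
reference_logits = np.array([[0.123456, -0.654321, 0.000789]])
candidate_logits = reference_logits + 2e-6

# rtol=0 makes this a pure absolute-tolerance check, matching "atol < 5e-5".
np.testing.assert_allclose(candidate_logits, reference_logits, atol=5e-5, rtol=0)
max_abs_diff = np.max(np.abs(candidate_logits - reference_logits))
```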