Abstract
We present Gemma-Prune, a multi-stage compression pipeline that reduces the Gemma 3 4B IT QAT vision-language model from 2.8 GB to 2.1 GB while preserving both text and image understanding capabilities. Our six-stage pipeline combines vocabulary pruning, vision encoder quantization with dimension padding, text layer pruning, resolution reduction, MLP neuron pruning, and weight splitting.
On Apple Silicon, the compressed model achieves 110 tokens/s generation (vs 90 original), 184 tokens/s image prompt processing (vs 54 original), and 2.2 GB peak memory for text-only inference (vs 2.9 GB original). We identify critical failure modes including vision layer sensitivity asymmetry and BPE coverage thresholds for vocabulary pruning.
Model Architecture
Gemma 3 4B IT is a vision-language model combining a text decoder with a SigLIP vision encoder and a multimodal projector:
Text Decoder
34-layer Transformer, 262K vocab, hidden=2560, 8 heads (GQA 4 KV), 4-bit QAT
Vision Encoder
SigLIP 27-layer, hidden=1152, 16 heads, 896×896 input, 14×14 patches
Multimodal Projector
Linear projection: vision hidden (1152) → text hidden (2560)
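For orientation, the same numbers collected in one place. This is a descriptive sketch only; the field names are illustrative and do not correspond to MLXVLM's configuration types.
// One-struct summary of the dimensions listed above.
struct Gemma3ArchitectureSummary {
    // Text decoder: 4-bit QAT, 262K vocabulary
    let textLayers = 34, textHidden = 2560
    let queryHeads = 8, kvHeads = 4                 // grouped-query attention
    // SigLIP vision encoder
    let visionLayers = 27, visionHidden = 1152, visionHeads = 16
    let imageSize = 896, patchSize = 14             // (896 / 14)^2 = 4096 patches
    // Multimodal projector: Linear(visionHidden -> textHidden), 1152 -> 2560
}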
Six-Stage Compression Pipeline
Each stage targets a distinct source of redundancy. The techniques compose without interference and accumulate to a total savings of 709 MB.
Vocabulary Pruning
262K → 144K tokens via ASCII vocabulary scan + token map indirection.
Vision fc2 Quantization
Dimension padding (4304 → 4352) solves prime factor alignment for 4-bit quantization.
Text Layer Pruning
Remove layers 31–33 from 34-layer decoder. Representation converges by layer 30.
Resolution Reduction
896×896 → 672×672 pixels. Quadratic attention cost drops ~68%.
MLP Neuron Pruning
60–100% dead neurons in layers 14–30. Activation-guided pruning with group alignment.
Weight Splitting
1.9 GB language + 231 MB vision. Text-only: load language only (~2.2 GB peak).
Vocabulary Pruning & BPE Coverage
The first step reduces the embedding matrix from 262K to 144K tokens using a token map indirection scheme:
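A minimal sketch of the indirection, assuming the set of retained token IDs has already been selected; the function and parameter names are illustrative, not the pipeline's actual code.
// Build tokenMap[originalID] -> row index into the pruned embedding matrix,
// with -1 marking removed tokens. At inference time the model gathers through
// this map before the embedding lookup (see the Token Map Support snippet in
// the inference-engine section), so prompts keep using original Gemma token IDs.
func buildTokenMap(keptTokenIDs: [Int], originalVocabSize: Int) -> [Int32] {
    var tokenMap = [Int32](repeating: -1, count: originalVocabSize)
    for (newID, oldID) in keptTokenIDs.enumerated() {
        tokenMap[oldID] = Int32(newID)
    }
    return tokenMap
}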
A critical finding: dictionary-only pruning (80K tokens) causes generation collapse. BPE tokenizers produce merged tokens like “_the”, “_and”, “_is” that don't appear in any dictionary. Missing these tokens fragments generation output completely.
The solution is an ASCII vocabulary scan: collect every token whose surface form uses only ASCII characters. This raises coverage from 80K to 144K tokens, fully resolving the quality issues and establishing a practical lower bound of ~144K tokens for English-language deployment.
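A minimal sketch of the scan, assuming the tokenizer's decoded surface forms (with word-boundary markers rendered as plain spaces) are available as an array indexed by token ID; names are illustrative.
// Keep every token whose decoded surface form is pure ASCII. This retains the
// BPE-merged tokens ("_the", "_and", ...) that a dictionary-based scan misses.
func asciiTokenIDs(surfaceForms: [String]) -> [Int] {
    surfaceForms.indices.filter { id in
        surfaceForms[id].allSatisfy { $0.isASCII }
    }
}
// Feeding the result into buildTokenMap(keptTokenIDs:originalVocabSize:) above
// yields the ~144K-entry compact vocabulary.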
The Prime Factor Problem
SigLIP's intermediate dimension is 4304 = 16 × 269, where 269 is prime. This means the fc2 weight matrix cannot be directly quantized with MLX group sizes (32, 64, 128).
The solution: zero-pad to 4352 = 68 × 64, aligned to group size 64. This is mathematically lossless — appended zero neurons contribute nothing to dot products, preserving layer output exactly.
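The padded size is just a round-up to the next group boundary; a minimal sketch (the helper name is illustrative):
// Smallest multiple of `groupSize` that is >= `dim`. 4304 = 16 x 269 (269 is
// prime), so it cannot be grouped at 32/64/128; rounding up to the 64-element
// boundary gives 4352 = 68 x 64.
func paddedDimension(_ dim: Int, groupSize: Int = 64) -> Int {
    ((dim + groupSize - 1) / groupSize) * groupSize
}
// paddedDimension(4304) == 4352. The 48 appended positions carry zero weights,
// so every dot product, and therefore the layer output, is unchanged.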
Vision Encoder Sensitivity
A surprising asymmetry: removing 3 text decoder layers (159 MB) causes only minor quality loss, while removing just 4 vision encoder layers (35 MB) causes complete image understanding failure — pizza misidentified as “skin texture”, hallucinated descriptions throughout.
Why Vision Layers Are Irreplaceable
- Contrastive pre-training: SigLIP creates tightly interdependent layers, unlike the autoregressive text decoder's more modular representations.
- Information bottleneck: Vision must compress 1.35M pixel values into just 144 tokens — every layer contributes critically.
- Layer count: 27 vision layers are already compact; the 34-layer text decoder has more built-in redundancy.
Resolution as Thermal Management
Reducing resolution from 896×896 to 672×672 pixels exploits quadratic attention scaling:
| Resolution | Patches | Attention Cost | Status |
|---|---|---|---|
| 896×896 (original) | 64×64 = 4096 | 1.0x | Baseline |
| 672×672 (selected) | 48×48 = 2304 | 0.32x | Good quality |
| 448×448 (rejected) | 32×32 = 1024 | 0.06x | Repetition loops |
A 25% linear reduction yields ~68% attention cost savings — more effective for mobile thermal management than layer reduction, which only scales linearly.
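The scaling is easy to verify numerically; a small sketch, assuming cost proportional to the square of the patch count and the 14-pixel patch size noted above:
// Relative self-attention cost after a resolution change: the patch count grows
// with the square of the side length, and attention with the square of the
// patch count, so cost scales with the fourth power of resolution.
func attentionCostRatio(from oldSide: Int, to newSide: Int, patch: Int = 14) -> Double {
    let oldPatches = Double((oldSide / patch) * (oldSide / patch))
    let newPatches = Double((newSide / patch) * (newSide / patch))
    return (newPatches / oldPatches) * (newPatches / oldPatches)
}
// attentionCostRatio(from: 896, to: 672) ~= 0.32   (about 68% savings)
// attentionCostRatio(from: 896, to: 448) ~= 0.06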
MLP Neuron Pruning
Activation profiling across 20 forward passes reveals that layers 14–30 have 60–100% dead neurons (mean activation < 0.5). Layer 29 is nearly 100% dead. Layers 0–13 maintain healthy activation patterns and are protected from pruning.
For each layer, we retain neurons with the highest activation magnitudes, subject to a maximum 25% reduction ratio and quantization group alignment (group_size=64). This removes 188 MB of dead weight.
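A minimal sketch of the per-layer selection, assuming per-neuron mean activation magnitudes have already been collected during the 20 profiling passes; names are illustrative, not the pipeline's actual code.
// Choose which MLP neurons to keep in one decoder layer: retain the highest-
// activation neurons, never drop more than 25% of the layer, and round the
// kept count up to a multiple of the 4-bit quantization group size.
func neuronsToKeep(meanActivations: [Float],
                   maxReduction: Double = 0.25,
                   groupSize: Int = 64) -> [Int] {
    let total = meanActivations.count
    let minKeep = total - Int(Double(total) * maxReduction)
    let keepCount = min(total, ((minKeep + groupSize - 1) / groupSize) * groupSize)
    // Rank by activation magnitude, keep the strongest `keepCount`, and return
    // the surviving indices in their original order for weight slicing.
    return meanActivations.enumerated()
        .sorted { $0.element > $1.element }
        .prefix(keepCount)
        .map { $0.offset }
        .sorted()
}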
Results
Text Generation Performance
| Model | Disk Size | Prompt (t/s) | Gen (t/s) | Peak Mem |
|---|---|---|---|---|
| Original | 2.8 GB | 109 | 90 | 2,910 MB |
| Lite (Steps 1–4) | 2.3 GB | ~120 | ~110 | ~2,500 MB |
| Mobile (All Steps) | 2.1 GB | 120 | 110 | 2,231 MB |
Image Understanding Performance
| Model | Prompt (t/s) | Gen (t/s) | Peak Mem | Quality |
|---|---|---|---|---|
| Original (896px) | 54 | 27 | ~5,500 MB | Excellent |
| Step 3 (896px) | 73 | 61 | 4,850 MB | Good |
| Mobile (672px) | 184 | 104 | 4,358 MB | Good |
Failed Experiments
- Aggressive resolution reduction (448×448): insufficient visual information causes token repetition loops.
- Vision encoder layer pruning (removing 4 layers): complete image understanding failure with hallucinated descriptions throughout.
- Dictionary-only vocabulary pruning (80K tokens): missing BPE-merged tokens cause generation quality collapse.
Incremental Compression Analysis
| Step | Operation | Disk Size | Savings |
|---|---|---|---|
| Baseline | Original QAT 4-bit | 2.8 GB | — |
| Step 1 | Vocabulary pruning | 2.54 GB | 170 MB |
| Step 2 | Vision fc2 quantization | 2.35 GB | 191 MB |
| Step 3 | Text layer pruning | 2.19 GB | 159 MB |
| Step 4 | Resolution reduction | 2.19 GB | ~1 MB |
| Step 5 | Neuron pruning | 2.00 GB | 188 MB |
| Step 6 | Weight splitting | 2.1 GB | — (709 MB cumulative) |
Swift Inference Engine
The full inference pipeline is implemented in Swift using Apple's MLX framework with the MLXVLM library, requiring no Python dependencies.
Token Map Support
import MLX

// Reference wrapper for a constant MLXArray (used here to hold the token map).
class MLXArrayBox {
    let value: MLXArray
    init(_ value: MLXArray) { self.value = value }
}

// Inside the pruned embedding layer: remap original Gemma token IDs to
// compact-vocabulary indices via the token map, then look up the 144K-row
// embedding table.
func callAsFunction(_ x: MLXArray) -> MLXArray {
    let indices = tokenMap.value.gathered(x.flattened())
    return compactEmbedding(indices)
        .reshaped(x.shape + [embeddingDimension])
}

Per-Layer Intermediate Sizes
// When per-layer MLP pruning is active, each decoder layer reads its own
// intermediate size; unpruned layers fall back to the global value.
for i in 0..<config.numHiddenLayers {
    let intermediateSize =
        config.perLayerIntermediateSizes?[i]
        ?? config.intermediateSize
    layers.append(TransformerLayer(config, intermediateSize))
}

Quick Start
gemma-cli --model /path/to/compressed-model \
--prompt "Describe this image" \
--image photo.jpg \
--max-tokens 200 \
--temperature 0.0

Pre-built Models
Three model configurations are available, all on HuggingFace and GitHub:
Original
Baseline QAT 4-bit. 34 layers, 262K vocab, 896px, 256 image tokens.
Lite
Steps 1–4. 31 layers, 144K vocab, 672px, 144 image tokens.
Mobile
All steps. Per-layer MLP pruning. Split: 1.9 GB language + 231 MB vision.
Citation
@article{atomgradient2026gemmaprune,
title={Gemma-Prune: A Multi-Stage Compression Pipeline
for Deploying Gemma 3 4B Vision-Language Model
on Mobile Devices},
author={AtomGradient},
year={2026},
url={https://github.com/AtomGradient/swift-gemma-cli}
}