Research
February 17, 2026
12 min read

Gemma-Prune: Compressing Gemma 3 4B Vision-Language Model for Mobile Devices

A multi-stage compression pipeline reduces the Gemma 3 4B vision-language model from 2.8 GB to 2.1 GB, achieving 22% faster text generation, 3.4x faster image processing, and 23% lower peak memory on Apple Silicon.

View on GitHub →
2.1 GB final model size · 22% faster generation · 3.4x faster vision · 23% less memory

Abstract

We present Gemma-Prune, a multi-stage compression pipeline that reduces the Gemma 3 4B IT QAT vision-language model from 2.8 GB to 2.1 GB while preserving both text and image understanding capabilities. Our six-stage pipeline combines vocabulary pruning, vision encoder quantization with dimension padding, text layer pruning, resolution reduction, MLP neuron pruning, and weight splitting.

On Apple Silicon, the compressed model achieves 110 tokens/s generation (vs 90 original), 184 tokens/s image prompt processing (vs 54 original), and 2.2 GB peak memory for text-only inference (vs 2.9 GB original). We identify critical failure modes including vision layer sensitivity asymmetry and BPE coverage thresholds for vocabulary pruning.

Model Architecture

Gemma 3 4B IT is a vision-language model combining a text decoder with a SigLIP vision encoder and a multimodal projector:

Text Decoder: 34-layer Transformer, 262K vocabulary, hidden size 2560, 8 attention heads (GQA with 4 KV heads), 4-bit QAT
Vision Encoder: SigLIP, 27 layers, hidden size 1152, 16 heads, 896×896 input, 14×14 patches
Multimodal Projector: linear projection from the vision hidden size (1152) to the text hidden size (2560)
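As a reference point for the later stages, here is a minimal Swift sketch that groups these hyperparameters into configuration types. The type and field names are illustrative, not the exact MLXVLM configuration API.

// Illustrative configuration types mirroring the architecture above.
// Names are hypothetical; the real configs live in the MLXVLM library.
struct TextDecoderConfig {
    var numHiddenLayers = 34      // pruned to 31 in Stage 3
    var vocabularySize = 262_144  // ~262K, pruned to ~144K in Stage 1
    var hiddenSize = 2_560
    var numAttentionHeads = 8
    var numKeyValueHeads = 4      // grouped-query attention
}

struct VisionEncoderConfig {
    var numHiddenLayers = 27      // SigLIP tower
    var hiddenSize = 1_152
    var numAttentionHeads = 16
    var imageSize = 896           // reduced to 672 in Stage 4
    var patchSize = 14
}

struct MultimodalProjectorConfig {
    var inputSize = 1_152         // vision hidden size
    var outputSize = 2_560        // text hidden size
}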

Six-Stage Compression Pipeline

Each stage targets a distinct source of redundancy. The techniques compose without interference and accumulate to a total savings of 709 MB.

1. Vocabulary Pruning (-170 MB): 262K → 144K tokens via an ASCII vocabulary scan plus token map indirection.
2. Vision fc2 Quantization (-191 MB): dimension padding (4304 → 4352) solves the prime-factor alignment problem for 4-bit quantization.
3. Text Layer Pruning (-159 MB): remove layers 31–33 from the 34-layer decoder; representations converge by layer 30.
4. Resolution Reduction (~3x faster vision): 896×896 → 672×672 pixels; quadratic attention cost drops ~68%.
5. MLP Neuron Pruning (-188 MB): 60–100% dead neurons in layers 14–30; activation-guided pruning with quantization group alignment.
6. Weight Splitting (~700 MB less peak memory for text-only use): split into a 1.9 GB language file and a 231 MB vision file; text-only inference loads only the language weights (~2.2 GB peak). See the sketch after this list.
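Stage 6 adds no disk savings of its own; the win is conditional loading. A minimal sketch of the idea, assuming the split produces separate language and vision weight files (the file names below are placeholders, not necessarily the shipped layout):

// Placeholder shard names; the actual file layout is defined by the export step.
enum ModelShard: String {
    case language = "model-language.safetensors"  // ~1.9 GB
    case vision = "model-vision.safetensors"      // ~231 MB
}

// Text-only prompts skip the vision shard entirely, which is what keeps
// peak memory near the ~2.2 GB reported for text-only inference.
func shardsToLoad(hasImageInput: Bool) -> [ModelShard] {
    hasImageInput ? [.language, .vision] : [.language]
}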

Vocabulary Pruning & BPE Coverage

The first step reduces the embedding matrix from 262K to 144K tokens using a token map indirection scheme:

embed(t) = E'[M[t]]

A critical finding: dictionary-only pruning (80K tokens) causes generation collapse. BPE tokenizers produce merged tokens like “_the”, “_and”, “_is” that don't appear in any dictionary. Missing these tokens fragments generation output completely.

The solution is an ASCII vocabulary scan — collecting all tokens whose surface forms use only ASCII characters. This brings coverage from 80K to 144K tokens, completely resolving quality issues. This establishes a practical lower bound of ~144K tokens for English deployment.
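A minimal sketch of how the token map M can be built once the ASCII scan has produced the set of retained token IDs. The function and parameter names are illustrative; the shipped lookup path appears in the Swift engine section below.

/// Build the token map M: original vocabulary index -> row of the compact
/// embedding matrix E'. Pruned tokens are redirected to the <unk> row.
func buildTokenMap(originalVocabSize: Int,
                   keptTokenIDs: [Int],
                   unknownTokenID: Int) -> [Int32] {
    // Assign each kept token a row in E', preserving the scan order.
    var newRow = [Int: Int32]()
    for (row, tokenID) in keptTokenIDs.enumerated() {
        newRow[tokenID] = Int32(row)
    }
    // Anything that was pruned falls back to <unk> so lookups never fail.
    let unkRow = newRow[unknownTokenID] ?? 0
    return (0..<originalVocabSize).map { newRow[$0] ?? unkRow }
}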

The Prime Factor Problem

SigLIP's intermediate dimension is 4304 = 16 × 269, where 269 is prime. This means the fc2 weight matrix cannot be directly quantized with MLX group sizes (32, 64, 128).

The solution: zero-pad to 4352 = 68 × 64, aligned to group size 64. This is mathematically lossless — appended zero neurons contribute nothing to dot products, preserving layer output exactly.
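A small pure-Swift illustration (toy sizes) of why the padding is lossless: appending zero columns to fc2 and matching zero activations leaves every output value unchanged. In the real pipeline the 1152×4304 fc2 weight is padded to 1152×4352 before 4-bit quantization.

/// Append `count` zero columns to each row of a weight matrix so that the
/// input dimension becomes a multiple of the quantization group size.
func padColumns(_ weight: [[Float]], by count: Int) -> [[Float]] {
    weight.map { $0 + [Float](repeating: 0, count: count) }
}

func matVec(_ w: [[Float]], _ x: [Float]) -> [Float] {
    w.map { zip($0, x).reduce(0) { $0 + $1.0 * $1.1 } }
}

// Toy check: 2 outputs, 5 inputs, padded up to a group boundary of 8.
let w: [[Float]] = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]
let x: [Float] = [0.1, 0.2, 0.3, 0.4, 0.5]
let wPadded = padColumns(w, by: 3)
let xPadded = x + [0, 0, 0]
assert(matVec(w, x) == matVec(wPadded, xPadded))  // outputs are identical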

Vision Encoder Sensitivity

A surprising asymmetry: removing 3 text decoder layers (159 MB) causes only minor quality loss, while removing just 4 vision encoder layers (35 MB) causes complete image understanding failure — pizza misidentified as “skin texture”, hallucinated descriptions throughout.

Why Vision Layers Are Irreplaceable

  1. Contrastive pre-training: SigLIP creates tightly interdependent layers, unlike the autoregressive text decoder's more modular representations.
  2. Information bottleneck: Vision must compress 1.35M pixel values into just 144 tokens — every layer contributes critically.
  3. Layer count: 27 vision layers are already compact; the 34-layer text decoder has more built-in redundancy.

Resolution as Thermal Management

Reducing resolution from 896×896 to 672×672 pixels exploits quadratic attention scaling:

| Resolution | Patches | Attention Cost | Status |
| --- | --- | --- | --- |
| 896×896 (original) | 64×64 = 4096 | 1.0x | Baseline |
| 672×672 (selected) | 48×48 = 2304 | 0.32x | Good quality |
| 448×448 (rejected) | 32×32 = 1024 | 0.06x | Repetition loops |

A 25% linear reduction yields ~68% attention cost savings — more effective for mobile thermal management than layer reduction, which only scales linearly.
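The table values fall straight out of the quadratic scaling; a quick sanity check in Swift (14-pixel patches, costs relative to the 896px baseline):

/// Self-attention cost relative to the 896px baseline, assuming cost
/// scales with the square of the number of patch tokens.
func relativeAttentionCost(resolution: Int, baseline: Int = 896, patchSize: Int = 14) -> Double {
    let tokens = Double((resolution / patchSize) * (resolution / patchSize))
    let baselineTokens = Double((baseline / patchSize) * (baseline / patchSize))
    return (tokens / baselineTokens) * (tokens / baselineTokens)
}

print(relativeAttentionCost(resolution: 672))  // ≈ 0.32
print(relativeAttentionCost(resolution: 448))  // ≈ 0.06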

MLP Neuron Pruning

Activation profiling across 20 forward passes reveals that layers 14–30 have 60–100% dead neurons (mean activation < 0.5). Layer 29 is nearly 100% dead. Layers 0–13 maintain healthy activation patterns and are protected from pruning.

For each layer, we retain neurons with the highest activation magnitudes, subject to a maximum 25% reduction ratio and quantization group alignment (group_size=64). This removes 188 MB of dead weight.
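A sketch of the selection rule under those constraints, assuming per-neuron mean activation magnitudes have already been profiled (the function name and defaults are illustrative):

import Foundation

/// Pick which MLP neurons to keep in one layer: rank by mean activation
/// magnitude, drop at most 25% of neurons, and keep the retained count
/// aligned to the quantization group size.
func selectNeuronsToKeep(meanActivations: [Float],
                         maxReduction: Double = 0.25,
                         groupSize: Int = 64) -> [Int] {
    let total = meanActivations.count
    let minimumKept = Int(ceil(Double(total) * (1.0 - maxReduction)))
    // Round the kept count up to the next multiple of groupSize so the
    // remaining quantized weight rows stay group-aligned.
    let keptCount = min(total, ((minimumKept + groupSize - 1) / groupSize) * groupSize)
    // Keep the most active neurons, returned in index order for stable slicing.
    return meanActivations.indices
        .sorted { meanActivations[$0] > meanActivations[$1] }
        .prefix(keptCount)
        .sorted()
}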

Results

Text Generation Performance

| Model | Disk Size | Prompt (t/s) | Gen (t/s) | Peak Mem |
| --- | --- | --- | --- | --- |
| Original | 2.8 GB | 109 | 90 | 2,910 MB |
| Lite (Steps 1–4) | 2.3 GB | ~120 | ~110 | ~2,500 MB |
| Mobile (All Steps) | 2.1 GB | 120 | 110 | 2,231 MB |

Image Understanding Performance

| Model | Prompt (t/s) | Gen (t/s) | Peak Mem | Quality |
| --- | --- | --- | --- | --- |
| Original (896px) | 54 | 27 | ~5,500 MB | Excellent |
| Step 3 (896px) | 73 | 61 | 4,850 MB | Good |
| Mobile (672px) | 184 | 104 | 4,358 MB | Good |

Failed Experiments

448px resolution: insufficient visual information causes token repetition loops.
Vision layer removal (4 layers, 35 MB): complete image understanding failure with hallucinated descriptions.
80K vocabulary (dictionary-only): missing BPE-merged tokens cause generation quality to collapse.

Incremental Compression Analysis

| Step | Operation | Disk Size | Savings |
| --- | --- | --- | --- |
| Baseline | Original QAT 4-bit | 2.8 GB | |
| Step 1 | Vocabulary pruning | 2.54 GB | 170 MB |
| Step 2 | Vision fc2 quantization | 2.35 GB | 191 MB |
| Step 3 | Text layer pruning | 2.19 GB | 159 MB |
| Step 4 | Resolution reduction | 2.19 GB | ~1 MB |
| Step 5 | Neuron pruning | 2.00 GB | 188 MB |
| Step 6 | Weight splitting | 2.1 GB | 709 MB total |

Swift Inference Engine

The full inference pipeline is implemented in Swift using Apple's MLX framework with the MLXVLM library, requiring no Python dependencies.

Token Map Support

import MLX

// Reference box for the token map tensor; boxing keeps it from being
// picked up as a regular module parameter.
class MLXArrayBox {
    let value: MLXArray
    init(_ value: MLXArray) { self.value = value }
}

// Compact embedding lookup: embed(t) = E'[M[t]].
// `tokenMap` holds M (original token ID -> compact row); `compactEmbedding` holds E'.
func callAsFunction(_ x: MLXArray) -> MLXArray {
    // Remap original token IDs into rows of the pruned embedding table.
    let indices = tokenMap.value.gathered(x.flattened())
    // Look up the compact rows, then restore the input shape with the
    // embedding dimension appended.
    return compactEmbedding(indices)
        .reshaped(x.shape + [embeddingDimension])
}

Per-Layer Intermediate Sizes

// Construct the decoder layers, allowing each layer its own (possibly pruned) MLP width.
for i in 0..<config.numHiddenLayers {
    // Fall back to the uniform intermediate size when no per-layer value is present.
    let intermediateSize =
        config.perLayerIntermediateSizes?[i]
        ?? config.intermediateSize
    layers.append(TransformerLayer(config, intermediateSize))
}
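The only configuration change this requires is an optional per-layer size array. A hedged sketch of how such a field could be decoded; the JSON key names are assumptions about this project's export format, not guaranteed config keys:

// Sketch only: the struct and key names are assumptions, not the exact
// MLXVLM configuration types.
struct TextConfig: Codable {
    var numHiddenLayers: Int
    var intermediateSize: Int
    var perLayerIntermediateSizes: [Int]?

    enum CodingKeys: String, CodingKey {
        case numHiddenLayers = "num_hidden_layers"
        case intermediateSize = "intermediate_size"
        case perLayerIntermediateSizes = "per_layer_intermediate_sizes"
    }
}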

Quick Start

gemma-cli --model /path/to/compressed-model \
          --prompt "Describe this image" \
          --image photo.jpg \
          --max-tokens 200 \
          --temperature 0.0

Pre-built Models

Three model configurations are available, all on HuggingFace and GitHub:

Original (2.8 GB): baseline QAT 4-bit. 34 layers, 262K vocab, 896px input, 256 image tokens.
Lite (2.3 GB): Steps 1–4. 31 layers, 144K vocab, 672px input, 144 image tokens.
Mobile (2.1 GB): all steps, including per-layer MLP pruning. Split into 1.9 GB language + 231 MB vision weights.

Citation

@article{atomgradient2026gemmaprune,
  title={Gemma-Prune: A Multi-Stage Compression Pipeline
         for Deploying Gemma 3 4B Vision-Language Model
         on Mobile Devices},
  author={AtomGradient},
  year={2026},
  url={https://github.com/AtomGradient/swift-gemma-cli}
}