Research
February 17, 2026
12 min read

Gemma-Prune: Compressing Gemma 3 4B Vision-Language Model for Mobile Devices

A multi-stage compression pipeline reduces the Gemma 3 4B vision-language model from 2.8 GB to 2.1 GB, achieving 22% faster text generation, 3.4x faster image processing, and 23% lower peak memory on Apple Silicon.

View on GitHub →
2.1 GB final model size · 22% faster generation · 3.4x faster vision · 23% less memory

Abstract

We present Gemma-Prune, a multi-stage compression pipeline that reduces the Gemma 3 4B IT QAT vision-language model from 2.8 GB to 2.1 GB while preserving both text and image understanding capabilities. Our six-stage pipeline combines vocabulary pruning, vision encoder quantization with dimension padding, text layer pruning, resolution reduction, MLP neuron pruning, and weight splitting.

On Apple Silicon, the compressed model achieves 110 tokens/s generation (vs 90 original), 184 tokens/s image prompt processing (vs 54 original), and 2.2 GB peak memory for text-only inference (vs 2.9 GB original). We identify critical failure modes including vision layer sensitivity asymmetry and BPE coverage thresholds for vocabulary pruning.

Model Architecture

Gemma 3 4B IT is a vision-language model combining a text decoder with a SigLIP vision encoder and a multimodal projector:

Text Decoder: 34-layer Transformer, 262K vocabulary, hidden size 2560, 8 attention heads (GQA with 4 KV heads), 4-bit QAT
Vision Encoder: SigLIP, 27 layers, hidden size 1152, 16 heads, 896×896 input, 14×14 patches
Multimodal Projector: linear projection from the vision hidden size (1152) to the text hidden size (2560)
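As a reference point for the later stages, here is a minimal Swift sketch that groups these hyperparameters into configuration types. The type and field names are illustrative, not the exact MLXVLM configuration API.

// Illustrative configuration types mirroring the architecture above.
// Names are hypothetical; the real configs live in the MLXVLM library.
struct TextDecoderConfig {
    var numHiddenLayers = 34      // pruned to 31 in Stage 3
    var vocabularySize = 262_144  // ~262K, pruned to ~144K in Stage 1
    var hiddenSize = 2_560
    var numAttentionHeads = 8
    var numKeyValueHeads = 4      // grouped-query attention
}

struct VisionEncoderConfig {
    var numHiddenLayers = 27      // SigLIP tower
    var hiddenSize = 1_152
    var numAttentionHeads = 16
    var imageSize = 896           // reduced to 672 in Stage 4
    var patchSize = 14
}

struct MultimodalProjectorConfig {
    var inputSize = 1_152         // vision hidden size
    var outputSize = 2_560        // text hidden size
}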

Six-Stage Compression Pipeline

Each stage targets a distinct source of redundancy. The techniques compose without interference and accumulate to a total savings of 709 MB.

1. Vocabulary Pruning (-170 MB): 262K → 144K tokens via an ASCII vocabulary scan plus token map indirection.
2. Vision fc2 Quantization (-191 MB): dimension padding (4304 → 4352) solves the prime-factor alignment problem for 4-bit quantization.
3. Text Layer Pruning (-159 MB): remove layers 31–33 from the 34-layer decoder; representations converge by layer 30.
4. Resolution Reduction (~3x faster vision): 896×896 → 672×672 pixels; quadratic attention cost drops ~68%.
5. MLP Neuron Pruning (-188 MB): 60–100% dead neurons in layers 14–30; activation-guided pruning with quantization group alignment.
6. Weight Splitting (~700 MB less peak memory for text-only use): split into a 1.9 GB language file and a 231 MB vision file; text-only inference loads only the language weights (~2.2 GB peak). See the sketch after this list.
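Stage 6 adds no disk savings of its own; the win is conditional loading. A minimal sketch of the idea, assuming the split produces separate language and vision weight files (the file names below are placeholders, not necessarily the shipped layout):

// Placeholder shard names; the actual file layout is defined by the export step.
enum ModelShard: String {
    case language = "model-language.safetensors"  // ~1.9 GB
    case vision = "model-vision.safetensors"      // ~231 MB
}

// Text-only prompts skip the vision shard entirely, which is what keeps
// peak memory near the ~2.2 GB reported for text-only inference.
func shardsToLoad(hasImageInput: Bool) -> [ModelShard] {
    hasImageInput ? [.language, .vision] : [.language]
}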

Vocabulary Pruning & BPE Coverage

The first step reduces the embedding matrix from 262K to 144K tokens using a token map indirection scheme:

embed(t) = E'[M[t]]

A critical finding: dictionary-only pruning (80K tokens) causes generation collapse. BPE tokenizers produce merged tokens like “_the”, “_and”, “_is” that don't appear in any dictionary. Missing these tokens fragments generation output completely.

The solution is an ASCII vocabulary scan — collecting all tokens whose surface forms use only ASCII characters. This brings coverage from 80K to 144K tokens, completely resolving quality issues. This establishes a practical lower bound of ~144K tokens for English deployment.
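A minimal sketch of how the token map M can be built once the ASCII scan has produced the set of retained token IDs. The function and parameter names are illustrative; the shipped lookup path appears in the Swift engine section below.

/// Build the token map M: original vocabulary index -> row of the compact
/// embedding matrix E'. Pruned tokens are redirected to the <unk> row.
func buildTokenMap(originalVocabSize: Int,
                   keptTokenIDs: [Int],
                   unknownTokenID: Int) -> [Int32] {
    // Assign each kept token a row in E', preserving the scan order.
    var newRow = [Int: Int32]()
    for (row, tokenID) in keptTokenIDs.enumerated() {
        newRow[tokenID] = Int32(row)
    }
    // Anything that was pruned falls back to <unk> so lookups never fail.
    let unkRow = newRow[unknownTokenID] ?? 0
    return (0..<originalVocabSize).map { newRow[$0] ?? unkRow }
}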

The Prime Factor Problem

SigLIP's intermediate dimension is 4304 = 16 × 269, where 269 is prime. This means the fc2 weight matrix cannot be directly quantized with MLX group sizes (32, 64, 128).

The solution: zero-pad to 4352 = 68 × 64, aligned to group size 64. This is mathematically lossless — appended zero neurons contribute nothing to dot products, preserving layer output exactly.
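A small pure-Swift illustration (toy sizes) of why the padding is lossless: appending zero columns to fc2 and matching zero activations leaves every output value unchanged. In the real pipeline the 1152×4304 fc2 weight is padded to 1152×4352 before 4-bit quantization.

/// Append `count` zero columns to each row of a weight matrix so that the
/// input dimension becomes a multiple of the quantization group size.
func padColumns(_ weight: [[Float]], by count: Int) -> [[Float]] {
    weight.map { $0 + [Float](repeating: 0, count: count) }
}

func matVec(_ w: [[Float]], _ x: [Float]) -> [Float] {
    w.map { zip($0, x).reduce(0) { $0 + $1.0 * $1.1 } }
}

// Toy check: 2 outputs, 5 inputs, padded up to a group boundary of 8.
let w: [[Float]] = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]]
let x: [Float] = [0.1, 0.2, 0.3, 0.4, 0.5]
let wPadded = padColumns(w, by: 3)
let xPadded = x + [0, 0, 0]
assert(matVec(w, x) == matVec(wPadded, xPadded))  // outputs are identical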

Vision Encoder Sensitivity

A surprising asymmetry: removing 3 text decoder layers (159 MB) causes only minor quality loss, while removing just 4 vision encoder layers (35 MB) causes complete image understanding failure — pizza misidentified as “skin texture”, hallucinated descriptions throughout.

Why Vision Layers Are Irreplaceable

  1. Contrastive pre-training: SigLIP creates tightly interdependent layers, unlike the autoregressive text decoder's more modular representations.
  2. Information bottleneck: Vision must compress 1.35M pixel values into just 144 tokens — every layer contributes critically.
  3. Layer count: 27 vision layers are already compact; the 34-layer text decoder has more built-in redundancy.

Resolution as Thermal Management

Reducing resolution from 896×896 to 672×672 pixels exploits quadratic attention scaling:

| Resolution | Patches | Attention Cost | Status |
| --- | --- | --- | --- |
| 896×896 (original) | 64×64 = 4096 | 1.0x | Baseline |
| 672×672 (selected) | 48×48 = 2304 | 0.32x | Good quality |
| 448×448 (rejected) | 32×32 = 1024 | 0.06x | Repetition loops |

A 25% linear reduction yields ~68% attention cost savings — more effective for mobile thermal management than layer reduction, which only scales linearly.
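The table values fall straight out of the quadratic scaling; a quick sanity check in Swift (14-pixel patches, costs relative to the 896px baseline):

/// Self-attention cost relative to the 896px baseline, assuming cost
/// scales with the square of the number of patch tokens.
func relativeAttentionCost(resolution: Int, baseline: Int = 896, patchSize: Int = 14) -> Double {
    let tokens = Double((resolution / patchSize) * (resolution / patchSize))
    let baselineTokens = Double((baseline / patchSize) * (baseline / patchSize))
    return (tokens / baselineTokens) * (tokens / baselineTokens)
}

print(relativeAttentionCost(resolution: 672))  // ≈ 0.32
print(relativeAttentionCost(resolution: 448))  // ≈ 0.06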

MLP Neuron Pruning

Activation profiling across 20 forward passes reveals that layers 14–30 have 60–100% dead neurons (mean activation < 0.5). Layer 29 is nearly 100% dead. Layers 0–13 maintain healthy activation patterns and are protected from pruning.

For each layer, we retain neurons with the highest activation magnitudes, subject to a maximum 25% reduction ratio and quantization group alignment (group_size=64). This removes 188 MB of dead weight.
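A sketch of the selection rule under those constraints, assuming per-neuron mean activation magnitudes have already been profiled (the function name and defaults are illustrative):

import Foundation

/// Pick which MLP neurons to keep in one layer: rank by mean activation
/// magnitude, drop at most 25% of neurons, and keep the retained count
/// aligned to the quantization group size.
func selectNeuronsToKeep(meanActivations: [Float],
                         maxReduction: Double = 0.25,
                         groupSize: Int = 64) -> [Int] {
    let total = meanActivations.count
    let minimumKept = Int(ceil(Double(total) * (1.0 - maxReduction)))
    // Round the kept count up to the next multiple of groupSize so the
    // remaining quantized weight rows stay group-aligned.
    let keptCount = min(total, ((minimumKept + groupSize - 1) / groupSize) * groupSize)
    // Keep the most active neurons, returned in index order for stable slicing.
    return meanActivations.indices
        .sorted { meanActivations[$0] > meanActivations[$1] }
        .prefix(keptCount)
        .sorted()
}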

Results

Text Generation Performance

| Model | Disk Size | Prompt (t/s) | Gen (t/s) | Peak Mem |
| --- | --- | --- | --- | --- |
| Original | 2.8 GB | 109 | 90 | 2,910 MB |
| Lite (Steps 1–4) | 2.3 GB | ~120 | ~110 | ~2,500 MB |
| Mobile (All Steps) | 2.1 GB | 120 | 110 | 2,231 MB |

Image Understanding Performance

| Model | Prompt (t/s) | Gen (t/s) | Peak Mem | Quality |
| --- | --- | --- | --- | --- |
| Original (896px) | 54 | 27 | ~5,500 MB | Excellent |
| Step 3 (896px) | 73 | 61 | 4,850 MB | Good |
| Mobile (672px) | 184 | 104 | 4,358 MB | Good |

Failed Experiments

448px resolution: insufficient visual information causes token repetition loops.
Vision layer removal (4 layers, 35 MB): complete image understanding failure with hallucinated descriptions.
80K vocabulary (dictionary-only): missing BPE-merged tokens cause generation quality to collapse.

Incremental Compression Analysis

| Step | Operation | Disk Size | Savings |
| --- | --- | --- | --- |
| Baseline | Original QAT 4-bit | 2.8 GB | |
| Step 1 | Vocabulary pruning | 2.54 GB | 170 MB |
| Step 2 | Vision fc2 quantization | 2.35 GB | 191 MB |
| Step 3 | Text layer pruning | 2.19 GB | 159 MB |
| Step 4 | Resolution reduction | 2.19 GB | ~1 MB |
| Step 5 | Neuron pruning | 2.00 GB | 188 MB |
| Step 6 | Weight splitting | 2.1 GB | 709 MB total |

Swift Inference Engine

The full inference pipeline is implemented in Swift using Apple's MLX framework with the MLXVLM library, requiring no Python dependencies.

Token Map Support

import MLX

// Reference box for the token map tensor; boxing keeps it from being
// picked up as a regular module parameter.
class MLXArrayBox {
    let value: MLXArray
    init(_ value: MLXArray) { self.value = value }
}

// Compact embedding lookup: embed(t) = E'[M[t]].
// `tokenMap` holds M (original token ID -> compact row); `compactEmbedding` holds E'.
func callAsFunction(_ x: MLXArray) -> MLXArray {
    // Remap original token IDs into rows of the pruned embedding table.
    let indices = tokenMap.value.gathered(x.flattened())
    // Look up the compact rows, then restore the input shape with the
    // embedding dimension appended.
    return compactEmbedding(indices)
        .reshaped(x.shape + [embeddingDimension])
}

Per-Layer Intermediate Sizes

// Construct the decoder layers, allowing each layer its own (possibly pruned) MLP width.
for i in 0..<config.numHiddenLayers {
    // Fall back to the uniform intermediate size when no per-layer value is present.
    let intermediateSize =
        config.perLayerIntermediateSizes?[i]
        ?? config.intermediateSize
    layers.append(TransformerLayer(config, intermediateSize))
}
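The only configuration change this requires is an optional per-layer size array. A hedged sketch of how such a field could be decoded; the JSON key names are assumptions about this project's export format, not guaranteed config keys:

// Sketch only: the struct and key names are assumptions, not the exact
// MLXVLM configuration types.
struct TextConfig: Codable {
    var numHiddenLayers: Int
    var intermediateSize: Int
    var perLayerIntermediateSizes: [Int]?

    enum CodingKeys: String, CodingKey {
        case numHiddenLayers = "num_hidden_layers"
        case intermediateSize = "intermediate_size"
        case perLayerIntermediateSizes = "per_layer_intermediate_sizes"
    }
}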

Quick Start

gemma-cli --model /path/to/compressed-model \
          --prompt "Describe this image" \
          --image photo.jpg \
          --max-tokens 200 \
          --temperature 0.0

Pre-built Models

Three model configurations are available, all on HuggingFace and GitHub:

Original (2.8 GB): baseline QAT 4-bit. 34 layers, 262K vocab, 896px input, 256 image tokens.
Lite (2.3 GB): Steps 1–4. 31 layers, 144K vocab, 672px input, 144 image tokens.
Mobile (2.1 GB): all steps, including per-layer MLP pruning. Split into 1.9 GB language + 231 MB vision weights.

Citation

@article{atomgradient2026gemmaprune,
  title={Gemma-Prune: A Multi-Stage Compression Pipeline
         for Deploying Gemma 3 4B Vision-Language Model
         on Mobile Devices},
  author={AtomGradient},
  year={2026},
  url={https://github.com/AtomGradient/swift-gemma-cli}
}