Abstract
llama.cpp achieves remarkably flat memory behavior on Apple Silicon through memory-mapped (mmap) weight loading and pre-allocated KV caches. We investigate whether these same techniques can benefit the MLX framework, which instead uses pread-based loading and dynamically grown KV caches.
We implement a zero-copy mmap loading path in MLX’s C++ core and a KV cache pre-allocation option in mlx-lm, and evaluate both across eight Qwen3 quantized model variants on an M1 Max (32 GB). Our results are mixed: mmap loading shows dramatic speedups for certain larger models (up to 20.65x) but performs worse for small models. KV pre-allocation flattens memory growth but adds 0.5–0.6 GB upfront cost with no throughput benefit. Overall, MLX’s existing memory management is already well-suited to its design goals.
Background
Apple Silicon’s Unified Memory Architecture (UMA) enables CPU and GPU to share the same physical memory pool with up to 800 GB/s bandwidth—eliminating PCIe bottlenecks and enabling zero-copy data sharing. This makes mmap an attractive loading strategy in theory.
MLX Approach
- pread()-based loading into allocated buffers
- Lazy evaluation, with a dynamic KV cache that grows in 256-token increments
- Safetensors weight format
llama.cpp Approach
- mmap for zero-copy weight access
- Static computation graph (ggml) with the KV cache pre-allocated at full context length
- GGUF format with guaranteed alignment
The key question: would MLX benefit from adopting llama.cpp’s memory strategies? We implemented both techniques to find out.
Implementation
Zero-Copy mmap Loading
We implemented an MmapReader class in MLX’s C++ core that memory-maps safetensors files and exposes offset views via Metal buffers:
MmapReader::MmapReader(std::string file_path) {
  int fd = open(file_path.c_str(), O_RDONLY);
  struct stat st;
  fstat(fd, &st);
  file_size_ = st.st_size;
  mmap_ptr_ = mmap(nullptr, file_size_,
                   PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  // the mapping survives fd closure
  // hint sequential access so the kernel can read ahead
  madvise(mmap_ptr_, file_size_, MADV_SEQUENTIAL);
}

The critical zero-copy path creates offset views into the mmap buffer with a careful alignment check. Safetensors does not guarantee element-aligned offsets, so we verify offset % itemsize == 0 before allowing zero-copy access; unaligned offsets fall back to memcpy.
if (reader_->is_mmap() && !swap_endianness_) {
  auto mmap_reader = dynamic_pointer_cast<MmapReader>(reader_);
  if (mmap_reader && (offset_ % out.itemsize() == 0)) {
    auto metal_buf = mmap_reader->get_metal_buffer();
    auto parent = array(metal_buf, {1}, uint8,
        [reader_ref](Buffer) { /* prevent dealloc */ });
    out.copy_shared_buffer(parent, strides, flags,
        out.size(), offset_ / out.itemsize());
    return;  // zero-copy success
  }
}
// Fallback: allocate + memcpy for unaligned offsets

KV Cache Pre-Allocation
We added an optional max_context_length parameter for upfront KV cache allocation. When specified, the key and value tensors are allocated for the full context window at startup, with mx.eval() forcing immediate physical allocation. If the sequence exceeds the pre-allocated length, the cache gracefully falls back to dynamic expansion.
class KVCache(_BaseCache):
    step = 256

    def __init__(self, n_kv_heads=0, head_dim=0,
                 max_context_length=0, dtype=mx.float16):
        self.offset = 0
        if max_context_length > 0 and n_kv_heads > 0:
            # Round capacity up to the 256-token step boundary.
            L = ((max_context_length + 255) // 256) * 256
            self.keys = mx.zeros(
                (1, n_kv_heads, L, head_dim), dtype=dtype)
            self.values = mx.zeros(
                (1, n_kv_heads, L, head_dim), dtype=dtype)
            mx.eval(self.keys, self.values)  # force physical allocation

Quantized dtype Bug Fix
During implementation, we discovered that QuantizedLinear.weight.dtype returns uint32 (the packed storage type) rather than the working precision. This caused the KV cache to be allocated in uint32 instead of float16. The fix uses scales.dtype instead, which stores dequantization factors in the correct working dtype.
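The fix is easiest to see in isolation. The sketch below uses a hypothetical kv_cache_dtype helper (not the actual patch); the attribute names follow MLX's QuantizedLinear, which exposes a packed weight tensor alongside dequantization scales:

```python
# Hypothetical helper illustrating the dtype fix. For a quantized layer,
# `weight` holds packed uint32 storage, so its dtype is NOT the working
# precision; `scales` carries the true working dtype (e.g. float16).
def kv_cache_dtype(layer):
    if hasattr(layer, "scales"):
        # Quantized layer: use the dequantization scales' dtype.
        return layer.scales.dtype
    # Unquantized layer: the weight dtype is already correct.
    return layer.weight.dtype
```

With this logic, the KV cache is allocated in float16 for quantized models instead of the packed uint32 storage type.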
Evaluation
Test Setup
- Hardware: Apple M1 Max, 32 GB unified memory
- Software: macOS, MLX 0.30.7, Python 3.11
- Models: 8 Qwen3 quantized variants (4B to 14B, 3-bit to 8-bit)
- Prompt: 34-token fixed physics question; 200 output tokens per run
Loading Speed: Standard vs. mmap
| Model | Standard (s) | Mmap (s) | Speedup |
|---|---|---|---|
| Qwen3-4B-4bit | 0.101 | 0.131 | 0.77x |
| Qwen3-4B-8bit | 0.184 | 0.246 | 0.75x |
| Qwen3-8B-3bit | 0.146 | 0.075 | 1.95x |
| Qwen3-8B-4bit | 0.352 | 0.328 | 1.07x |
| Qwen3-8B-6bit | 0.637 | 0.435 | 1.46x |
| Qwen3-8B-8bit | 2.572 | 0.125 | 20.65x |
| Qwen3-14B-4bit | 2.323 | 0.535 | 4.34x |
| Qwen3-14B-6bit | 3.701 | 5.388 | 0.69x |
Results are highly inconsistent. Three out of eight models load slower with mmap. The 20.65x speedup on Qwen3-8B-8bit is an outlier—likely due to the standard loader hitting a pathological pread pattern. For most models in the 2–6 GB range, MLX’s standard loading is already efficient enough.
Inference Impact
| Model | Gen t/s (Std) | Gen t/s (Mmap) | Mem Std (GB) | Mem Mmap (GB) |
|---|---|---|---|---|
| Qwen3-4B-4bit | 56.9 | 58.5 | 2.38 | 2.38 |
| Qwen3-8B-4bit | 30.1 | 30.1 | 4.72 | 8.82* |
| Qwen3-8B-8bit | 21.6 | 21.3 | 8.82 | 8.84 |
| Qwen3-14B-4bit | 16.2 | 16.5 | 8.84 | 11.09* |
| Qwen3-14B-6bit | 12.1 | 12.0 | 12.08 | 12.16 |
* Memory nearly doubled—mmap region coexists with materialized copies due to lazy evaluation.
Key Inference Findings
- Generation throughput is unchanged (<3% difference across all models). Once weights are in memory, the computation is identical.
- Mmap can increase peak memory. For 8B-4bit and 14B-4bit, peak memory nearly doubled—MLX materializes quantized tensors during lazy evaluation while mmap-backed originals remain pinned.
KV Cache Pre-Allocation
| Model | Mode | TTFT (ms) | Gen (t/s) | Peak (GB) | ΔMem (GB) |
|---|---|---|---|---|---|
| Qwen3-4B-4bit | Dynamic | 260.4 | 62.0 | 2.221 | — |
| | Pre-alloc 2k | 276.2 | 63.0 | 2.471 | +0.250 |
| | Pre-alloc 4k | 285.2 | 62.3 | 2.736 | +0.515 |
| Qwen3-8B-4bit | Dynamic | 419.2 | 31.3 | 4.395 | — |
| | Pre-alloc 2k | 428.0 | 30.3 | 4.633 | +0.238 |
| | Pre-alloc 4k | 436.1 | 29.7 | 4.908 | +0.513 |
| Qwen3-14B-4bit | Dynamic | 859.6 | 16.7 | 7.827 | — |
| | Pre-alloc 2k | 917.4 | 16.0 | 8.101 | +0.274 |
| | Pre-alloc 4k | 901.5 | 16.1 | 8.414 | +0.587 |
Pre-allocation does not improve throughput: generation speed stays within 5% of dynamic mode, and first token latency increases 5–7% due to the forced mx.eval() at startup. Memory overhead is predictable: ~0.25 GB for 2048 tokens, ~0.5 GB for 4096 tokens.
Why MLX’s Existing Design Works Well
pread is fast enough
For models under 4 GB (the majority of quantized models run locally), standard loading completes in under 200 ms. At this scale, mmap’s overhead—VMA creation, page table setup, TLB pressure—outweighs any benefit from avoiding copies.
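For intuition, the pread approach can be sketched in a few lines. This is a simplified illustration of chunked positional reads, not MLX's actual loader; the load_weights name and the 16 MB chunk size are assumptions:

```python
import os

def load_weights(path, chunk=16 * 1024 * 1024):
    """Read a file into a preallocated buffer with large pread calls,
    so each syscall amortizes kernel-entry cost over many pages."""
    size = os.stat(path).st_size
    buf = bytearray(size)
    view = memoryview(buf)
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while offset < size:
            data = os.pread(fd, min(chunk, size - offset), offset)
            if not data:
                break  # unexpected EOF
            view[offset:offset + len(data)] = data
            offset += len(data)
    finally:
        os.close(fd)
    return bytes(buf)
```

Each iteration moves megabytes per syscall, which is exactly the batching that makes buffered reads competitive with (or faster than) faulting pages in one at a time.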
Lazy evaluation conflicts with mmap
When quantized tensors loaded via mmap are later materialized (e.g., dequantization), MLX allocates new buffers while mmap-backed originals remain referenced. This can lead to memory doubling rather than saving—the exact opposite of the goal.
Dynamic KV growth is not a bottleneck
The 256-token step growth pattern does not cause meaningful overhead. Generation throughput is identical whether the cache is pre-allocated or not. MLX’s allocator handles periodic growth efficiently.
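The step-growth arithmetic is trivial to sketch (a minimal illustration of step-rounded capacity; grown_capacity is a hypothetical helper, not MLX's actual code):

```python
def grown_capacity(needed_tokens, step=256):
    """Round the required sequence length up to the next step boundary,
    so reallocations happen at most once every `step` tokens."""
    return ((needed_tokens + step - 1) // step) * step
```

Generating 200 tokens from a 34-token prompt therefore triggers only a single 256-slot allocation, which is why dynamic growth shows no measurable cost in our benchmarks.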
The 14B-6bit Anomaly
The largest model tested (11.18 GB) loaded slower with mmap (0.69x). At this size, the mmap approach must fault in an extremely large number of pages. The OS page fault handler becomes the bottleneck—each 16 KB page requires a kernel trap, page table update, and TLB insertion. The standard loader’s buffered reads, which batch these operations, prove more efficient at scale.
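A back-of-envelope calculation (using the numbers above) makes the scale concrete:

```python
# Page faults needed to touch an 11.18 GB mmap'd file with
# Apple Silicon's 16 KB pages.
model_bytes = 11.18 * 2**30      # Qwen3-14B-6bit weight file
page_bytes = 16 * 1024           # Apple Silicon page size
faults = model_bytes / page_bytes
print(f"{faults:,.0f} page faults to make the mapping fully resident")
# roughly 730,000 kernel traps before the weights are all in memory
```

At hundreds of thousands of traps, per-fault overhead dominates, while the standard loader covers the same bytes with a few hundred large reads.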
Recommendations
- Default: stick with MLX's standard loading and dynamic KV cache. It is fast, memory-efficient, and well-integrated with lazy evaluation.
- mmap loading: useful as an opt-in for specific large models where standard loading is slow, but it should not be the default.
- KV cache pre-allocation: worthwhile only for server deployments that value memory predictability, or for long-context scenarios (>8k tokens).
Conclusion
We explored whether llama.cpp’s memory management strategies—mmap loading and KV cache pre-allocation—could improve MLX’s performance on Apple Silicon. Our systematic benchmarks across eight models reveal inconsistent results: mmap helps for some larger models but hurts for small ones and can increase peak memory; KV pre-allocation achieves flat memory but offers no throughput benefit.
The conclusion: MLX’s existing memory management is already well-designed for its target use case. The design choices that differ from llama.cpp are not oversights but appropriate adaptations to MLX’s lazy evaluation model and the safetensors ecosystem. During this work, we also identified and fixed a quantized model dtype inference bug—arguably the most practically useful contribution.
Citation
@article{atomgradient2026optmlx,
title={Exploring Zero-Copy mmap Loading and KV Cache
Pre-Allocation for MLX on Apple Silicon},
author={AtomGradient},
year={2026},
url={https://github.com/AtomGradient/OptMLX}
}