**Liquid AI Unveils LFM2-8B-A1B: A Compact, High-Capacity Mixture-of-Experts for On-Device Use**
Liquid AI has introduced LFM2-8B-A1B, a sparse Mixture-of-Experts (MoE) model with 8.3B total parameters and roughly 1.5B active per token, designed to run efficiently on resource-constrained devices such as smartphones, laptops, and embedded systems. Unlike most MoE models, which are optimized for cloud batch serving, LFM2-8B-A1B is engineered to deliver high performance within tight memory, latency, and energy budgets.
**Model Architecture: Balancing Efficiency and Capacity**
LFM2-8B-A1B retains the LFM2 ‘fast backbone’ and adds sparse MoE feed-forward blocks to raise capacity without significantly increasing active compute. The backbone consists of 18 gated short-convolution blocks and 6 grouped-query attention (GQA) blocks. Every layer except the first two includes an MoE block; those first two layers remain dense for stability. Each MoE block holds 32 experts, from which a router selects the top 4 per token using a normalized-sigmoid gate with adaptive routing biases to balance load and stabilize training. With a context length of 32,768 tokens and a vocabulary size of 65,536, the model was pre-trained on approximately 12T tokens.
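For intuition, the sketch below shows a minimal sparse-MoE feed-forward block of the kind described above: 32 expert MLPs, a sigmoid-scored router whose adaptive bias term only influences which experts get picked, and top-4 selection with gate weights renormalized over the chosen experts before mixing their outputs. The dimensions, activation, and exact gating/bias formulation are illustrative assumptions, not Liquid AI's implementation.

```python
import torch
import torch.nn as nn


class SparseMoEBlock(nn.Module):
    """Illustrative sparse-MoE FFN: 32 experts, top-4 routing, sigmoid gate
    scores renormalized over the selected experts, and a routing bias used
    for load balancing. Sizes and details are assumptions, not the
    LFM2-8B-A1B internals."""

    def __init__(self, d_model=2048, d_ff=4096, num_experts=32, top_k=4):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Adaptive routing bias: nudged outside the gradient path to balance
        # expert load; it affects expert selection, not the gate weights.
        self.register_buffer("routing_bias", torch.zeros(num_experts))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        scores = torch.sigmoid(self.router(x))   # per-expert affinity in (0, 1)
        # Bias steers selection only; gate weights come from the raw scores.
        topk = torch.topk(scores + self.routing_bias, self.top_k, dim=-1)
        gates = torch.gather(scores, -1, topk.indices)
        gates = gates / gates.sum(dim=-1, keepdim=True)   # normalize over top-4
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk.indices[:, slot]
            for e in idx.unique().tolist():       # run each selected expert once
                mask = idx == e
                out[mask] += gates[mask][:, slot:slot + 1] * self.experts[e](x[mask])
        return out
```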
This design keeps per-token FLOPs and cache growth bounded by the active path (attention + four expert MLPs), allowing the model to specialize across domains such as multilingual knowledge, math, and code—areas where very small dense models typically struggle.
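As a rough back-of-envelope, the active-parameter count of a top-k-of-E MoE is the always-on backbone plus k/E of the expert pool. The split used below is a hypothetical placeholder chosen only to show the shape of the calculation; Liquid AI's published figures are 8.3B total and ~1.5B active.

```python
# Back-of-envelope: with top-4 of 32 experts, only 4/32 = 1/8 of the expert-FFN
# pool is active for any given token, on top of the always-on backbone.
# The backbone/expert split below is a HYPOTHETICAL assumption; only the
# 8.3B total and ~1.5B active figures are published.
total_params = 8.3e9                         # published total parameter count
expert_pool  = 7.6e9                         # assumed parameters in the 32-expert FFN pools
shared       = total_params - expert_pool    # conv/attention/embeddings, always active
top_k, num_experts = 4, 32

active = shared + expert_pool * top_k / num_experts
print(f"~{active / 1e9:.2f}B active parameters per token")   # ~1.65B with these placeholders
```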
**Performance: Speed and Quality**
Liquid AI reports that LFM2-8B-A1B decodes faster than Qwen3-1.7B in CPU tests using an internal XNNPACK-based stack and a custom CPU MoE kernel, while delivering quality comparable to 3-4B dense models with only ~1.5B active parameters. Performance is showcased with int4 quantization and int8 dynamic activations on an AMD Ryzen AI 9 HX370 and a Samsung Galaxy S24 Ultra, with claims framed as per-device comparisons against models of a similar active-parameter class.
On accuracy, LFM2-8B-A1B delivers competitive instruction-following and math performance within the small-model band, and improved knowledge capacity relative to LFM2-2.6B, consistent with its larger total parameter budget. The evaluation spans 16 benchmarks, including MMLU/MMLU-Pro/GPQA (knowledge), IFEval/IFBench/Multi-IF (instruction following), GSM8K/GSMPlus/MATH500/MATH-Lvl-5 (math), and MGSM/MMMLU (multilingual).
**Deployment and Tooling**
LFM2-8B-A1B ships with standard weights for Transformers/vLLM GPU inference and GGUF builds for llama.cpp. The official GGUF repo lists common quants from Q4_0 (~4.7 GB) up to F16 (~16.7 GB) for local runs. For CPU validation, Liquid AI uses Q4_0 with int8 dynamic activations on the AMD Ryzen AI 9 HX370 and Samsung Galaxy S24 Ultra, where LFM2-8B-A1B shows higher decode throughput than Qwen3-1.7B at a similar active-parameter class. ExecuTorch is referenced for mobile/embedded CPU deployment.
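For orientation, the snippet below sketches a standard Transformers text-generation call; the Hugging Face repo id, generation settings, and version requirements are assumptions for illustration rather than Liquid AI's documented recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id assumed for illustration; confirm the exact name on Liquid AI's
# Hugging Face page. A recent transformers release with LFM2 support is assumed.
model_id = "LiquidAI/LFM2-8B-A1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize the benefits of on-device MoE models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```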
**Key Takeaways**
– **Architecture & routing**: LFM2-8B-A1B combines an LFM2 fast backbone with per-layer sparse-MoE FFNs, using 32 experts with top-4 routing via normalized-sigmoid gating and adaptive biases, resulting in 8.3B total parameters and ~1.5B active per token.
– **On-device target**: The model is designed for phones, laptops, and embedded CPUs/GPUs; quantized variants fit comfortably on high-end consumer hardware for private, low-latency use.
– **Performance positioning**: Liquid AI reports LFM2-8B-A1B is significantly faster than Qwen3-1.7B in CPU tests and aims for 3-4B dense-class quality while keeping an ~1.5B active path.
**Editorial Comments**
LFM2-8B-A1B demonstrates the practicality of sparse MoE below the usual server-scale regime. By integrating an LFM2 conv-attention backbone with per-layer expert MLPs (except the first two layers), the model keeps token compute near 1.5B while lifting quality toward 3-4B dense classes. With standard and GGUF weights, llama.cpp/ExecuTorch/vLLM paths, and a permissive on-device posture, LFM2-8B-A1B is a concrete option for building low-latency, private assistants and application-embedded copilots on consumer and edge hardware.