Ever felt like you need a humongous VLM to get the job done? Alibaba’s Qwen team has news for you: enter the dense Qwen3-VL 4B/8B models (Instruct & Thinking), complete with FP8 checkpoints! These new models run smoothly on low-VRAM hardware, yet keep a 256K-token context window (expandable to 1M) and the full capability surface of their larger siblings. Here’s what you need to know:
What’s New?
– Models Galore: Qwen3-VL-4B and Qwen3-VL-8B, each in Instruct and Thinking flavors, plus FP8 versions of these checkpoints. They’re compact, dense, and ready for action.
– Same Capabilities, Smaller Size: These compact models still understand images and videos, perform OCR in 32 languages, handle spatial grounding, and control GUIs on desktop and mobile. No compromises here!
– Under the Hood: They’ve got Interleaved-MRoPE, DeepStack, and Text–Timestamp Alignment for robust video understanding. It’s all in the model cards.
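To make one of those architectural ideas concrete, here is a toy NumPy sketch of what "interleaved" multimodal rotary positions could look like: instead of giving the temporal, height, and width axes each a contiguous chunk of the rotary frequency spectrum, the axes take turns, so every axis sees both low and high frequencies. The function name and the exact round-robin pattern are illustrative assumptions, not Qwen's actual Interleaved-MRoPE implementation.

```python
def interleaved_mrope_freq_assignment(head_dim=128, axes=("t", "h", "w")):
    # Hypothetical sketch: assign each rotary frequency pair to one of the
    # three position axes (temporal, height, width) in a round-robin
    # pattern, so each axis covers the full low-to-high frequency range
    # rather than owning one contiguous band.
    num_pairs = head_dim // 2  # rotary embeddings act on pairs of dims
    return [axes[i % len(axes)] for i in range(num_pairs)]

assignment = interleaved_mrope_freq_assignment()
print(assignment[:6])  # ['t', 'h', 'w', 't', 'h', 'w']
```

The intuition for spreading each axis across the spectrum is robustness on long videos: the temporal axis keeps access to low-frequency (long-range) components instead of being confined to one band.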
FP8: The Game Changer
– Fine-Grained and Powerful: Fine-grained FP8 quantization with a block size of 128 keeps quality nearly on par with the BF16 checkpoints, and because Qwen ships the FP8 weights directly, there are no re-quantization headaches on your end!
– Tooling Up: Transformers doesn’t load these FP8 checkpoints directly yet, so serve them with vLLM or SGLang instead. As a bonus, vLLM’s FP8 support roughly halves weight memory versus BF16 on H100-class GPUs.
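To see why block-128 scaling matters, here is a toy NumPy simulation of fine-grained FP8 quantization: one scale per block of 128 values, with a crude stand-in for E4M3 rounding. This is purely illustrative; it is not Qwen's or vLLM's actual quantization kernel, and real FP8 checkpoints store true FP8 tensors plus their per-block scales.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_e4m3_round(x):
    # Crude simulation of FP8 E4M3 rounding: keep ~3 mantissa bits.
    # (Real FP8 also has a limited exponent range and subnormals, which
    # the per-block scaling below largely sidesteps.)
    m, e = np.frexp(x)                      # x = m * 2**e, with m in [0.5, 1)
    return np.ldexp(np.round(m * 16) / 16, e)

def blockwise_fp8_quantize(weights, block_size=128):
    """Quantize-then-dequantize with one scale per block of `block_size` values.

    Per-block scales keep an outlier in one block from crushing the
    precision of every other block, which is why fine-grained FP8 can
    sit so close to BF16 quality.
    """
    flat = weights.reshape(-1, block_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero on all-zero blocks
    q = fp8_e4m3_round(np.clip(flat / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX))
    return (q * scales).reshape(weights.shape), scales

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128)).astype(np.float32)
w_hat, scales = blockwise_fp8_quantize(w)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

With 3 mantissa bits, each value's relative error stays below about 1/16, and since the FP8 weights come pre-quantized in the checkpoint, an engine like vLLM only needs to load the tensors and scales, not redo this step.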
Why You Should Care
– Qwen’s new 4B/8B models come in both Instruct and Thinking variants, with FP8 checkpoints for low-VRAM deployment.
– They retain the full capability surface of the larger Qwen3-VL models, making them a strong fit for single-GPU and edge budgets.
Join the Qwen Community
– Check out the model on Hugging Face.
– Explore tutorials, code, and notebooks on our GitHub page.
– Follow us on Twitter, join our ML SubReddit, subscribe to our newsletter, and even join us on Telegram!