How to Run Qwen 3.6 Locally — 27B Dense, 35B MoE, and Coding Variants
Qwen 3.6 dropped on April 21 2026. Two main families: a 27B dense model that activates every parameter per token and a 35B MoE with 3B active per token. Both ship with vision, agentic coding, thinking-mode preservation, and a 256K context window. This guide covers everything you need to run them locally on consumer hardware.
If you only have time for the short version: install Locally Uncensored, open Model Manager, click Discover, search Qwen 3.6, hit the download arrow on the variant that fits your VRAM. The rest of this post is the long version.
Which Qwen 3.6 Variant Should You Pick?
The biggest decision is dense vs MoE. The second biggest is which quant. The third is whether you want the coding-specialised variant of the 35B MoE.
Dense vs MoE
The 27B dense activates all 27B parameters for every token. Slower per token, but every token gets the full model. Quality is consistent. This is the recommended default for general chat, reasoning, and most coding work.
The 35B MoE only activates 3B parameters per token via routing. Much faster per token (often 2-3x the throughput of the dense at similar quants). VRAM peak during inference is lower than the model size suggests. But routing introduces variance — some tokens get the wrong expert and quality dips. The MoE wins on coding benchmarks (SWE-bench specifically) when you pick the coding-specialised variant.
Quant Comparison Table
All sizes below are the disk footprint of the GGUF file. VRAM usage during inference is roughly file size + 1-2 GB of overhead.
| Variant | Quant | Disk | VRAM Target | Quality |
|---|---|---|---|---|
| 27B dense | UD-IQ2_XXS | 8.7 GB | 8 GB GPU | Good (low-VRAM lifesaver) |
| 27B dense | Q3_K_M | 13 GB | 12 GB GPU | Very good (RTX 3060 sweet spot) |
| 27B dense | Q4_K_M | 16 GB | 16 GB GPU | Recommended default |
| 27B dense | UD-Q4_K_XL | 16 GB | 16 GB GPU | Better quality per GB than Q4_K_M |
| 27B dense | Q5_K_M | 18 GB | 20 GB GPU | High |
| 27B dense | Q6_K | 21 GB | 24 GB GPU | Near-lossless |
| 27B dense | Q8_0 | 27 GB | 32 GB GPU | Effectively lossless |
| 35B MoE | Q4_K_M | 24 GB | 24 GB GPU | Recommended for MoE |
| 35B MoE | NVFP4 | 22 GB | 22 GB GPU (RTX 50+) | Smallest with full quality on Blackwell |
| 35B MoE coding NVFP4 | NVFP4 | 22 GB | 22 GB GPU (RTX 50+) | Best coding-bench-per-GB |
| 35B MoE | BF16 | 71 GB | 96 GB GPU | Reference quality |
| 35B MoE | MLX BF16 | 70 GB | Apple Silicon M3/M4 | MLX-optimised |
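The sizing rule above (VRAM ≈ GGUF file size plus 1-2 GB of overhead) is easy to encode. The helper below is a sketch using the disk figures from the table; the dictionary keys and function names are invented for illustration, not any real tool's API.

```python
# Rough VRAM planner for the quants listed above.
# Rule of thumb from this guide: VRAM ~ GGUF file size + 1-2 GB overhead.

QUANT_DISK_GB = {          # disk footprints taken from the table above
    "27b-ud-iq2_xxs": 8.7,
    "27b-q3_k_m": 13,
    "27b-q4_k_m": 16,
    "27b-q5_k_m": 18,
    "27b-q6_k": 21,
    "27b-q8_0": 27,
    "35b-moe-q4_k_m": 24,
}

def vram_needed_gb(quant: str, overhead_gb: float = 2.0) -> float:
    """Worst-case VRAM estimate: file size plus inference overhead."""
    return QUANT_DISK_GB[quant] + overhead_gb

def fits(quant: str, vram_gb: float) -> bool:
    """True if the quant should fit in the given VRAM budget."""
    return vram_needed_gb(quant) <= vram_gb
```

Using the worst-case 2 GB overhead keeps the estimate conservative; with a small context window the real peak often lands closer to the lower bound.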
Recommendation by Hardware
- 8 GB VRAM (RTX 3060 8GB, RTX 4060 8GB): 27B UD-IQ2_XXS — the only quant that fits
- 12 GB VRAM (RTX 3060 12GB, RTX 3080 Ti, RTX 4070): 27B Q3_K_M — sweet spot, ~15-25 tok/s
- 16 GB VRAM (RTX 4070 Ti Super, RTX 4080): 27B Q4_K_M or UD-Q4_K_XL — the recommended default
- 24 GB VRAM (RTX 3090, RTX 4090, RTX 5090): 27B Q6_K for max dense quality, OR 35B MoE Q4_K_M for coding
- Blackwell GPUs (RTX 5090, RTX PRO 6000): 35B MoE NVFP4 wins — smallest size with native quality
- Apple Silicon M3/M4: Qwen 3.6 35B MoE MLX BF16 via MLX runtime
- CPU only with 32 GB RAM: 27B Q4_K_M at 1-3 tok/s — usable for short tasks
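The bullet list above reduces to a simple decision tree. A minimal sketch, with a hypothetical `recommend_variant` helper that mirrors the recommendations (it does not detect hardware or talk to any runtime):

```python
# Map available VRAM (GB) to the variant recommended above.
# Hypothetical helper; return strings mirror the bullet list.

def recommend_variant(vram_gb: int, prefer_coding: bool = False) -> str:
    if vram_gb >= 24:
        # 24 GB cards can run either; MoE only pays off for coding work
        return "35B MoE Q4_K_M" if prefer_coding else "27B Q6_K"
    if vram_gb >= 16:
        return "27B Q4_K_M"       # the recommended default
    if vram_gb >= 12:
        return "27B Q3_K_M"       # RTX 3060 12GB sweet spot
    return "27B UD-IQ2_XXS"       # only quant that fits in 8 GB
```

Apple Silicon and Blackwell-specific choices (MLX BF16, NVFP4) are omitted here since they key on architecture rather than raw VRAM.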
Installation Path 1 — Ollama (CLI)
Ollama is the no-frills route if you don’t want a GUI.
- Install Ollama from ollama.com
- Pull the model: `ollama pull qwen3.6:27b` (dense Q4_K_M, 16 GB) or `ollama pull qwen3.6` (35B MoE Q4_K_M, 24 GB)
- For variants: `ollama pull qwen3.6:35b-a3b-coding-nvfp4` for the NVFP4 coding model
- Chat: `ollama run qwen3.6:27b`
Ollama’s default 4096-token context window is conservative for Qwen 3.6’s 256K capability. To enable long context, create a Modelfile:

```
FROM qwen3.6:27b
PARAMETER num_ctx 32768
```

Then run `ollama create qwen3.6-long -f Modelfile` and use that model. 32K is a sane starting point — the full 256K eats VRAM aggressively.
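If you would rather not bake the context size into a Modelfile, Ollama also accepts a per-request `num_ctx` via its REST API on the default port 11434. A sketch (the helper names are ours; the endpoint and `options.num_ctx` field are standard Ollama API):

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str, num_ctx: int = 32768) -> dict:
    """Build a non-streaming /api/generate payload with a context override."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context window
    }

def generate(payload: dict, host: str = "http://localhost:11434") -> str:
    """POST the payload to a running Ollama instance and return the text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The per-request override only applies to that call; the Modelfile route is better when every client should get long context by default.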
Installation Path 2 — Locally Uncensored (GUI)
If you want a one-click experience plus chat, agent mode, image generation, and A/B model comparison in the same window, use Locally Uncensored.
- Download the v2.4.0 installer for your OS from GitHub Releases
- Run the installer (Windows: signed NSIS .exe; Linux: deb / rpm / AppImage)
- The first-launch wizard auto-detects Ollama if you have it. If not, the wizard offers a one-click Ollama install.
- Open the Model Manager, switch to the Discover tab, sub-tab Text
- Search Qwen 3.6 — you’ll see all variants with size + hardware tags
- Click the download arrow on the variant matching your VRAM. The download badge in the header shows progress
- Once done, switch to Chat, the model picker shows qwen3.6:27b (or your variant). Type a prompt
The new v2.4.0 Settings > Model Storage section lets you redirect the GGUF folder to a custom path — useful for dual-boot setups or NAS-mounted model libraries.
Performance Numbers
Tested on RTX 3060 12 GB (Windows 11, CUDA 12.1) with Qwen 3.6 27B Q3_K_M, 4096-token context, fp16 KV cache:
| Workload | tok/s | Notes |
|---|---|---|
| Cold first response (model load) | ~3 | Includes load_duration |
| Warm chat (50-token answers) | 22-26 | Steady state |
| Long-form generation (1000 tokens) | 18-20 | Sustained |
| Thinking-mode enabled | 15-18 | Includes hidden chain-of-thought tokens |
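If you want to reproduce these tok/s figures yourself, Ollama's non-streaming responses report timing fields in nanoseconds; generation speed is `eval_count / eval_duration`. A small sketch:

```python
# Derive tok/s and load time from an Ollama response dict.
# eval_count, eval_duration, and load_duration are standard Ollama fields.

def tokens_per_second(response: dict) -> float:
    """Generation throughput: tokens emitted per second of eval time."""
    return response["eval_count"] / response["eval_duration"] * 1e9

def load_seconds(response: dict) -> float:
    """Model load time — the bulk of a slow cold first response."""
    return response.get("load_duration", 0) / 1e9
```

The cold-start row in the table is dominated by `load_duration`, which is why its effective tok/s is so low.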
For comparison, our broader benchmark of 2026 local models covers the same hardware against Llama 4, GPT-OSS, GLM 4.7, and DeepSeek R1.
Thinking Mode
Qwen 3.6 preserves the thinking-mode lineage from QwQ and DeepSeek R1: the model produces a hidden chain-of-thought before the visible answer. In Locally Uncensored, toggle the Think button in the chat input. With Ollama, set `"think": true` in the request body. Without thinking mode the model still answers but skips the deliberation step — faster but lower quality on hard reasoning prompts.
Note: thinking mode adds 1.5-3x to response latency depending on prompt complexity. Worth it for code review, math, multi-step planning. Skip it for casual chat.
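For the Ollama route, a request-body sketch for thinking mode over `/api/chat` (the `build_chat_request` helper is ours; the `think` flag is the request-body field described above):

```python
# Build a /api/chat payload with thinking mode enabled.

def build_chat_request(model: str, user_text: str, think: bool = True) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "think": think,   # enables the hidden chain-of-thought pass
        "stream": False,
    }
```

Set `think=False` for casual chat to skip the 1.5-3x latency penalty.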
Vision Support
Both 27B dense and 35B MoE accept image input. Drag-and-drop a screenshot, photo, or chart into the LU chat input. The model returns a description, transcribes text, identifies objects, or answers questions about the image.
VRAM cost for vision inference is +1-2 GB on top of the base model. So 27B Q3_K_M with vision needs ~14 GB — doesn’t fit on 12 GB GPUs. Use Q3_K_M for text only or step down to UD-IQ2_XXS for vision on 12 GB GPUs.
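Outside the GUI, you can send an image to the same model over Ollama's `/api/chat` endpoint: images travel inside the message as base64 strings. A minimal sketch (helper name is ours):

```python
import base64

def image_message(image_bytes: bytes, question: str) -> dict:
    """Wrap raw image bytes and a question into an Ollama chat message."""
    b64 = base64.b64encode(image_bytes).decode()
    return {"role": "user", "content": question, "images": [b64]}
```

Pass the returned dict in the `messages` list of a chat request; the model sees the image alongside the question.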
Coding Performance
The 35B MoE coding-specialised variants (qwen3.6:35b-a3b-coding-nvfp4, qwen3.6:35b-a3b-coding-mxfp8) are tuned on the SWE-bench training data. On the SWE-bench-verified benchmark the coding NVFP4 variant scores in the same ballpark as Claude 3.5 Sonnet at a fraction of the inference cost.
For day-to-day coding inside LU’s Codex agent, the 27B dense Q4_K_M is the better default — consistent quality, no MoE-routing variance. Switch to the 35B MoE coding variant for hard refactors or when SWE-bench-style multi-file changes are involved.
Comparison — Qwen 3.6 vs Qwen 3.5
| Feature | Qwen 3.5 | Qwen 3.6 |
|---|---|---|
| Vision | No | Yes (both 27B and 35B) |
| Context window | 128K | 256K |
| Thinking mode | QwQ-only | Preserved across variants |
| Coding-specific MoE | No | Yes (35B-a3b-coding) |
| NVFP4 quant | No | Yes (35B MoE) |
| MLX variant for Apple Silicon | No | Yes (35B MoE BF16 MLX) |
| Best dense size | 27B | 27B (higher benchmark scores at the same size) |
Troubleshooting
Out of memory on Q4_K_M
Step down one quant level. Q4_K_M → Q3_K_M (12 GB), Q3_K_M → UD-IQ2_XXS (8 GB). Or reduce context window size in the Modelfile.
"does not support thinking" HTTP 400 error
You’re on an Ollama version older than 0.3.10 with thinking enabled in LU. Update Ollama. v2.4.0 of LU also retries automatically without the thinking flag and warns in-app.
Slow first response, fast subsequent
Normal. The first prompt loads the model into VRAM (`load_duration` in the Ollama response). Keep the model warm by setting `OLLAMA_KEEP_ALIVE=24h` as an env var, or in LM Studio set "Keep model loaded for" to infinite.
HuggingFace download 404
If you’re downloading via LU’s search box (not the curated list) and hit a 404, you may have hit the pre-v2.4.0 filename heuristic bug. Update LU to v2.4.0 — that release fixes the doubled-quant-tag issue.
Related Reading
- Best uncensored AI models 2026 — Qwen 3.6 in the broader landscape
- Best local AI apps 2026 — how the Ollama / LM Studio / LU stack compares
- How to run uncensored AI locally — full setup walkthrough including abliterated variants
- v2.4.0 release notes — configurable HF download path, single-instance lock, more
Locally Uncensored is AGPL-3.0 licensed. Built by PurpleDoubleD. Bug reports and feature requests on GitHub Discussions or in the Discord.