April 23, 2026 · 10 min read

How to Run Qwen 3.6 Locally — 27B Dense, 35B MoE, and Coding Variants

Qwen 3.6 dropped on April 21, 2026. Two main families: a 27B dense model that activates every parameter per token and a 35B MoE with 3B active parameters per token. Both ship with vision, agentic coding, thinking-mode preservation, and a 256K context window. This guide covers everything you need to run them locally on consumer hardware.

If you only have time for the short version: install Locally Uncensored, open Model Manager, click Discover, search Qwen 3.6, hit the download arrow on the variant that fits your VRAM. The rest of this post is the long version.

Which Qwen 3.6 Variant Should You Pick?

The biggest decision is dense vs MoE. The second biggest is which quant. The third is whether you want the coding-specialised variant of the 35B MoE.

Dense vs MoE

The 27B dense activates all 27B parameters for every token. Slower per token, but every token gets the full model. Quality is consistent. This is the recommended default for general chat, reasoning, and most coding work.

The 35B MoE only activates 3B parameters per token via routing. Much faster per token (often 2-3x the throughput of the dense at similar quants). VRAM peak during inference is lower than the model size suggests. But routing introduces variance — some tokens get the wrong expert and quality dips. The MoE wins on coding benchmarks (SWE-bench specifically) when you pick the coding-specialised variant.

Quant Comparison Table

All sizes below are the disk footprint of the GGUF file. VRAM usage during inference is roughly file size + 1-2 GB of overhead.

| Variant | Quant | Disk | VRAM Target | Quality |
|---|---|---|---|---|
| 27B dense | UD-IQ2_XXS | 8.7 GB | 8 GB GPU | Good (low-VRAM lifesaver) |
| 27B dense | Q3_K_M | 13 GB | 12 GB GPU | Very good (RTX 3060 sweet spot) |
| 27B dense | Q4_K_M | 16 GB | 16 GB GPU | Recommended default |
| 27B dense | UD-Q4_K_XL | 16 GB | 16 GB GPU | Better quality per GB than Q4_K_M |
| 27B dense | Q5_K_M | 18 GB | 20 GB GPU | High |
| 27B dense | Q6_K | 21 GB | 24 GB GPU | Near-lossless |
| 27B dense | Q8_0 | 27 GB | 32 GB GPU | Effectively lossless |
| 35B MoE | Q4_K_M | 24 GB | 24 GB GPU | Recommended for MoE |
| 35B MoE | NVFP4 | 22 GB | 22 GB GPU (RTX 40+) | Smallest with full quality on Blackwell |
| 35B MoE coding | NVFP4 | 22 GB | 22 GB GPU (RTX 40+) | Best coding-bench-per-GB |
| 35B MoE | BF16 | 71 GB | 96 GB GPU | Reference quality |
| 35B MoE | MLX BF16 | 70 GB | Apple Silicon M3/M4 | MLX-optimised |

Recommendation by Hardware

Reading the table above by VRAM budget:

8 GB GPU: 27B dense UD-IQ2_XXS.
12 GB GPU (RTX 3060 class): 27B dense Q3_K_M.
16 GB GPU: 27B dense Q4_K_M, or UD-Q4_K_XL for better quality per GB.
24 GB GPU: 27B dense Q6_K for quality, or 35B MoE Q4_K_M for speed.
22 GB on RTX 40+: 35B MoE NVFP4, or the coding NVFP4 variant if coding is the main workload.
Apple Silicon M3/M4: 35B MoE MLX BF16.
96 GB GPU: 35B MoE BF16 for reference quality.

Installation Path 1 — Ollama (CLI)

Ollama is the no-frills route if you don’t want a GUI.

  1. Install Ollama from ollama.com
  2. Pull the model: ollama pull qwen3.6:27b (dense Q4_K_M, 16 GB) or ollama pull qwen3.6 (35B MoE Q4_K_M, 24 GB)
  3. For variants: ollama pull qwen3.6:35b-a3b-coding-nvfp4 for the NVFP4 coding model
  4. Chat: ollama run qwen3.6:27b

Ollama’s default 4096-token context window is conservative for Qwen 3.6’s 256K capability. To enable a longer context, create a Modelfile:

FROM qwen3.6:27b
PARAMETER num_ctx 32768

Then ollama create qwen3.6-long -f Modelfile and use that. 32K is a sane starting point — full 256K eats VRAM aggressively.
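If you’d rather not maintain a Modelfile, Ollama’s HTTP API also accepts a per-request context override via the options field. A minimal Python sketch, assuming a stock Ollama install listening on localhost:11434 (the model tag matches the pull commands above):

```python
import json
import urllib.request

OLLAMA_CHAT = "http://localhost:11434/api/chat"  # Ollama's default endpoint

def long_context_payload(prompt: str, num_ctx: int = 32768) -> dict:
    """Build a /api/chat payload that raises the context window per request
    instead of baking num_ctx into a Modelfile."""
    return {
        "model": "qwen3.6:27b",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"num_ctx": num_ctx},  # overrides Ollama's 4096 default
    }

payload = long_context_payload("Summarise the design doc pasted below: ...")

# Sending it requires a running Ollama server, so the call is left commented:
# req = urllib.request.Request(OLLAMA_CHAT, json.dumps(payload).encode(),
#                              {"Content-Type": "application/json"})
# reply = json.loads(urllib.request.urlopen(req).read())
```

Per-request overrides only affect that one call, so you can keep the 4096 default for quick chats and pay the long-context VRAM cost only when needed.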

Installation Path 2 — Locally Uncensored (GUI)

If you want a one-click experience plus chat, agent mode, image generation, and A/B model compare in the same window, use Locally Uncensored.

  1. Download the v2.4.0 installer for your OS from GitHub Releases
  2. Run the installer (Windows: signed NSIS .exe; Linux: deb / rpm / AppImage)
  3. The first-launch wizard auto-detects Ollama if you have it. If not, the wizard offers a one-click Ollama install.
  4. Open the Model Manager, switch to the Discover tab, sub-tab Text
  5. Search Qwen 3.6 — you’ll see all variants with size + hardware tags
  6. Click the download arrow on the variant matching your VRAM. The download badge in the header shows progress
  7. Once done, switch to Chat; the model picker shows qwen3.6:27b (or your variant). Type a prompt

The new v2.4.0 Settings > Model Storage section lets you redirect the GGUF folder to a custom path — useful for dual-boot setups or NAS-mounted model libraries.

Performance Numbers

Tested on RTX 3060 12 GB (Windows 11, CUDA 12.1) with Qwen 3.6 27B Q3_K_M, 4096-token context, fp16 KV cache:

| Workload | tok/s | Notes |
|---|---|---|
| Cold first response (model load) | ~3 | Includes load_duration |
| Warm chat (50-token answers) | 22-26 | Steady state |
| Long-form generation (1000 tokens) | 18-20 | Sustained |
| Thinking-mode enabled | 15-18 | Includes hidden chain-of-thought tokens |
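You can reproduce these tok/s numbers yourself from the timing fields Ollama includes in every non-streamed response. A small sketch (the sample figures are made up to resemble the warm-chat row):

```python
def tokens_per_second(response: dict) -> float:
    """Ollama reports eval_count (tokens generated) and eval_duration
    (nanoseconds spent generating); tok/s is their ratio."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)

# Illustrative figures resembling a warm 50-token answer from the table:
sample = {
    "eval_count": 50,                # tokens generated
    "eval_duration": 2_083_000_000,  # ~2.08 s spent generating
    "load_duration": 120_000_000,    # near-zero: model was already warm
}
print(round(tokens_per_second(sample), 1))  # 24.0 tok/s
```

On a cold first response, load_duration dominates instead, which is why the first row of the table looks so slow.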

For comparison, our broader benchmark of 2026 local models covers the same hardware against Llama 4, GPT-OSS, GLM 4.7, and DeepSeek R1.

Thinking Mode

Qwen 3.6 preserves the thinking-mode lineage from QwQ and DeepSeek R1. The model produces a hidden chain-of-thought before the visible answer. In Locally Uncensored, toggle the Think button in the chat input. With Ollama, set think: true in the request body. Without thinking, the model still answers but skips the deliberation step: faster, but lower quality on hard reasoning prompts.
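As a sketch of what "think: true in the request body" looks like, assuming a sufficiently recent Ollama (see the troubleshooting section for the version requirement):

```python
def chat_payload(prompt: str, think: bool = True) -> dict:
    """Build an Ollama /api/chat payload with thinking mode toggled.
    With think=True, supporting Ollama versions return the hidden
    chain-of-thought separately from the visible answer."""
    return {
        "model": "qwen3.6:27b",
        "messages": [{"role": "user", "content": prompt}],
        "think": think,  # older Ollama versions reject this flag with HTTP 400
        "stream": False,
    }

fast = chat_payload("What's the capital of France?", think=False)  # casual chat
slow = chat_payload("Review this diff for race conditions", think=True)
```

POST either payload to http://localhost:11434/api/chat as JSON; only the think flag differs between the two.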

Note: thinking mode increases response latency by 1.5-3x depending on prompt complexity. Worth it for code review, math, and multi-step planning. Skip it for casual chat.

Vision Support

Both 27B dense and 35B MoE accept image input. Drag-and-drop a screenshot, photo, or chart into the LU chat input. The model returns a description, transcribes text, identifies objects, or answers questions about the image.

VRAM cost for vision inference is +1-2 GB on top of the base model. So 27B Q3_K_M with vision needs ~14 GB — doesn’t fit on 12 GB GPUs. Use Q3_K_M for text only or step down to UD-IQ2_XXS for vision on 12 GB GPUs.
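Outside LU’s drag-and-drop flow, the same image input works over Ollama’s /api/generate endpoint, which takes base64-encoded images. A sketch (the file path is a placeholder):

```python
import base64

def vision_payload(prompt: str, image_path: str) -> dict:
    """Build an Ollama /api/generate payload with an attached image.
    Ollama expects raw base64 strings (no data: URI prefix) in the
    images list."""
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": "qwen3.6:27b",
        "prompt": prompt,
        "images": [encoded],
        "stream": False,
    }
```

POST the result to http://localhost:11434/api/generate as JSON; the description or transcription comes back in the response field.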

Coding Performance

The 35B MoE coding-specialised variants (qwen3.6:35b-a3b-coding-nvfp4, qwen3.6:35b-a3b-coding-mxfp8) are tuned on the SWE-bench training data. On the SWE-bench-verified benchmark the coding NVFP4 variant scores in the same ballpark as Claude 3.5 Sonnet at a fraction of the inference cost.

For day-to-day coding inside LU’s Codex agent, the 27B dense Q4_K_M is the better default — consistent quality, no MoE-routing variance. Switch to the 35B MoE coding variant for hard refactors or when SWE-bench-style multi-file changes are involved.

Comparison — Qwen 3.6 vs Qwen 3.5

| Feature | Qwen 3.5 | Qwen 3.6 |
|---|---|---|
| Vision | No | Yes (both 27B and 35B) |
| Context window | 128K | 256K |
| Thinking mode | QwQ-only | Preserved across variants |
| Coding-specific MoE | No | Yes (35B-a3b-coding) |
| NVFP4 quant | No | Yes (35B MoE) |
| MLX variant for Apple Silicon | No | Yes (35B MoE BF16 MLX) |
| Best dense size | 27B | 27B (denser per benchmark) |

Troubleshooting

Out of memory on Q4_K_M

Step down one quant level. Q4_K_M → Q3_K_M (12 GB), Q3_K_M → UD-IQ2_XXS (8 GB). Or reduce context window size in the Modelfile.

"does not support thinking" HTTP 400 error

You’re on an Ollama version older than 0.3.10 with thinking enabled in LU. Update Ollama. v2.4.0 of LU also retries automatically without the thinking flag and warns in-app.

Slow first response, fast subsequent

Normal. The first prompt loads the model into VRAM (load_duration in the Ollama response). Keep the model warm by setting OLLAMA_KEEP_ALIVE=24h as an env var, or in LM Studio set "Keep model loaded for" to infinite.
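If you’d rather not touch environment variables, Ollama also accepts a per-request keep_alive field in the API payload. A sketch of that alternative:

```python
def warm_payload(prompt: str, keep_alive: str = "24h") -> dict:
    """Per-request alternative to OLLAMA_KEEP_ALIVE: keep_alive tells
    Ollama how long to keep the model resident in VRAM after this call
    (a duration string like "24h"; -1 keeps it loaded indefinitely)."""
    return {
        "model": "qwen3.6:27b",
        "prompt": prompt,
        "keep_alive": keep_alive,
        "stream": False,
    }

p = warm_payload("warm-up ping")
```

Any client that sends this field keeps the model resident, so subsequent prompts skip the load_duration penalty entirely.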

HuggingFace download 404

If you’re downloading via LU’s search box (not the curated list) and hit a 404, you may have hit the pre-v2.4.0 filename heuristic bug. Update LU to v2.4.0 — that release fixes the doubled-quant-tag issue.

Related Reading


Locally Uncensored is AGPL-3.0 licensed. Built by PurpleDoubleD. Bug reports and feature requests on GitHub Discussions or in the Discord.

One-click Qwen 3.6 install plus 75+ other models in Locally Uncensored

Download Locally Uncensored