In r/LocalLLaMA, running newer local models is framed as a VRAM and context-window budgeting problem, with builders trading off quantization and model choice to fit long context while keeping throughput high.
Gemma is a massive memory hog though; the context takes so much that I had to drop to Q5 or Q4 at 31b on a 5090 to fit everything. Speed is pretty good though, 50-60 tok/sec right now, similar to Qwen.
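To make that budgeting concrete, here is a rough back-of-the-envelope sketch in Python. The layer count, KV-head count, head size, bits-per-weight figures, and the 32 GB figure for the 5090 are illustrative assumptions, not the commenter's exact configuration.

```python
# Rough VRAM budget: quantized weights + fp16 KV cache vs. a 32 GB card.
# All model dimensions below are illustrative assumptions, not exact specs.

GiB = 1024 ** 3

def weight_gib(n_params_b: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GiB."""
    return n_params_b * 1e9 * bits_per_weight / 8 / GiB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bytes_per_elem: float = 2.0) -> float:
    """K and V caches across all layers for n_ctx tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / GiB

VRAM = 32  # GiB on an RTX 5090, ignoring what the desktop already uses

for label, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    weights = weight_gib(31, bpw)                # ~31B parameters, assumed
    cache = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128,
                         n_ctx=32_768)           # fp16 cache, assumed dims
    total = weights + cache
    fits = "fits" if total < VRAM else "does not fit"
    print(f"{label}: weights {weights:.1f} GiB + KV {cache:.1f} GiB "
          f"= {total:.1f} GiB -> {fits} in {VRAM} GiB")
```

Under these assumptions Q8_0 overshoots the card while Q4/Q5 leave room for a 32K fp16 cache, which is the shape of the trade-off the commenter describes.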
One reason SWA feels so bad is that llama.cpp forced SWA layers to fp16; they changed that a few hours ago.
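For a sense of why the cache type of the SWA layers matters, here is a small sketch of KV-cache memory when those layers are held at fp16 versus quantized to roughly one byte per element. The layer split, head geometry, and the assumption that SWA layers cache the full context (the worst case) are illustrative guesses, not confirmed llama.cpp internals.

```python
# KV-cache memory with SWA layers held at fp16 vs. quantized.
# Layer split, head geometry, and full-context caching for the SWA layers
# are illustrative assumptions, not confirmed llama.cpp behavior.

GiB = 1024 ** 3

def kv_gib(n_layers: int, n_kv_heads: int, head_dim: int,
           cached_tokens: int, bytes_per_elem: float) -> float:
    """K + V cache for one group of layers, in GiB."""
    return (2 * n_layers * n_kv_heads * head_dim
            * cached_tokens * bytes_per_elem) / GiB

n_ctx = 32_768
n_kv_heads, head_dim = 8, 128
full_attn_layers, swa_layers = 8, 40   # assumed ~5:1 SWA-to-full interleaving

full_part = kv_gib(full_attn_layers, n_kv_heads, head_dim, n_ctx, 1.0)  # ~q8_0
for label, bpe in [("SWA layers forced to fp16", 2.0),
                   ("SWA layers quantized (~q8_0)", 1.0)]:
    swa_part = kv_gib(swa_layers, n_kv_heads, head_dim, n_ctx, bpe)
    print(f"{label}: {full_part + swa_part:.1f} GiB KV cache at {n_ctx} ctx")
```

The gap between the two printed totals is memory that comes straight out of the context budget when the SWA cache cannot be quantized.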
I’m able to have much larger context windows on my standard consumer hardware.