What's Changed
- Flash attention is now enabled by default for Qwen 3 and Qwen 3 Coder
- Fixed minor memory estimation issues when scheduling models on NVIDIA GPUs
- Fixed an issue where `keep_alive` in the API would accept different values for the `/api/chat` and `/api/generate` endpoints (see the keep_alive sketch after this list)
- Fixed tool calling rendering with `qwen3-coder` (see the tool-calling sketch after this list)
- More reliable and accurate VRAM detection
- `OLLAMA_FLASH_ATTENTION` can now be overridden to `0` for models that have flash attention enabled by default (see the flash attention sketch after this list)
- macOS 12 Monterey and macOS 13 Ventura are no longer supported
- Fixed a crash that occurred when templates were not correctly defined
- Fixed memory calculations on NVIDIA iGPUs
- AMD gfx900 and gfx906 (MI50, MI60, etc.) GPUs are no longer supported via ROCm. We're working to support these GPUs via Vulkan in a future release.
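
A minimal sketch of the `keep_alive` fix, assuming a local Ollama server at the default `http://localhost:11434` and an installed `qwen3` model (both assumptions, not part of the release notes). It passes the same `keep_alive` value to `/api/chat` and `/api/generate`, which should now be handled consistently:

```python
import requests

BASE = "http://localhost:11434"  # assumption: default local Ollama address

# Pass the same keep_alive value to both endpoints; after this fix the two
# endpoints are expected to accept and interpret it the same way.
generate = requests.post(f"{BASE}/api/generate", json={
    "model": "qwen3",              # assumption: any locally pulled model works
    "prompt": "Hello",
    "keep_alive": "5m",            # keep the model loaded for five minutes
    "stream": False,
})
chat = requests.post(f"{BASE}/api/chat", json={
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Hello"}],
    "keep_alive": "5m",
    "stream": False,
})
print(generate.status_code, chat.status_code)
```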
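A tool-calling sketch for `qwen3-coder`, again assuming a local server; the `get_weather` tool is hypothetical and only illustrates the `tools` field of `/api/chat`:

```python
import requests

BASE = "http://localhost:11434"  # assumption: default local Ollama address

# Hypothetical tool definition used purely for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(f"{BASE}/api/chat", json={
    "model": "qwen3-coder",
    "messages": [{"role": "user", "content": "What's the weather in Lisbon?"}],
    "tools": tools,
    "stream": False,
})
# If the model decides to call a tool, the call appears under message.tool_calls.
print(resp.json().get("message", {}).get("tool_calls"))
```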
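And a flash attention sketch showing the new override, assuming the `ollama` binary is on your PATH; setting `OLLAMA_FLASH_ATTENTION=0` before starting the server forces flash attention off even for models that now enable it by default:

```python
import os
import subprocess

# Start the Ollama server with flash attention forced off.
# Assumption: `ollama` is installed and available on PATH.
env = dict(os.environ, OLLAMA_FLASH_ATTENTION="0")
subprocess.run(["ollama", "serve"], env=env)
```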
New Contributors
- @Fachep made their first contribution in https://github.com/ollama/ollama/pull/12412
Full Changelog: https://github.com/ollama/ollama/compare/v0.12.3...v0.12.4-rc3