Ollama

v0.1.35 · 1y+

New models

  • Llama 3 ChatQA: A model from NVIDIA based on Llama 3 that excels at conversational question answering (QA) and retrieval-augmented generation (RAG).

What's Changed

  • Quantization: ollama create can now quantize models when importing them using the --quantize or -q flag:
ollama create -f Modelfile --quantize q4_0 mymodel

Note: --quantize works when importing float16 or float32 models:

  • From a binary GGUF file (e.g. FROM ./model.gguf)
  • From a library model (e.g. FROM llama3:8b-instruct-fp16)

  • Fixed issue where inference subprocesses wouldn't be cleaned up on shutdown.
  • Fixed a series of out-of-memory errors when loading models on multi-GPU systems
  • Ctrl+J characters will now properly add newlines in ollama run
  • Fixed issues when running ollama show for vision models
  • OPTIONS requests to the Ollama API will no longer result in errors
  • Fixed issue where partially downloaded files wouldn't be cleaned up
  • Added a new done_reason field in responses describing why generation stopped
  • Ollama will now more accurately estimate how much memory is available on multi-GPU systems, especially when running different models one after another
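
As a rough illustration of the new done_reason field, the sketch below reads it from a streamed response, assuming the API streams one JSON object per line and that the final object carries "done": true. The helper name final_done_reason and the sample payload are hypothetical and hand-written for illustration, not captured server output. The value "stop" is likewise an assumed example of what the field may contain.

```python
import json


def final_done_reason(ndjson_lines):
    """Return done_reason from the last chunk marked done, or None if absent."""
    reason = None
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        if chunk.get("done"):
            # Older servers may omit the field, so use .get() rather than indexing.
            reason = chunk.get("done_reason")
    return reason


# Hand-written sample stream (an assumption, not real server output).
sample_stream = [
    '{"model": "llama3", "response": "Hello", "done": false}',
    '{"model": "llama3", "response": "", "done": true, "done_reason": "stop"}',
]

print(final_done_reason(sample_stream))
```

Reading the field with .get() means the same client code still works against servers older than this release, which would simply yield None.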


Full Changelog: https://github.com/ollama/ollama/compare/v0.1.34...v0.1.35