Ollama

ollama/ollama last check 191 releases recent
Notes
Release notes
v0.5.0 · 1y+
view on github

<div align="center"> <img src="https://github.com/user-attachments/assets/1a70f8b1-b794-435a-8b7f-c9d4b64ba6db" width="512" /> </div>

New models

  • Llama 3.3: a new state of the art 70B model. Llama 3.3 70B offers similar performance compared to Llama 3.1 405B model.
  • Snowflake Arctic Embed 2: Snowflake's frontier embedding model. Arctic Embed 2.0 adds multilingual support without sacrificing English performance or scalability.

Structured outputs

Ollama now supports structured outputs, making it possible to constrain a model's output to a specific format defined by a JSON schema. The Ollama Python and JavaScript libraries have been updated to support structured outputs, together with Ollama's OpenAI-compatible API endpoints.

REST API

To use structured outputs in Ollama's generate or chat APIs, provide a JSON schema object in the format parameter:

curl -X POST http://localhost:11434/api/chat -H &quot;Content-Type: application/json&quot; -d &#39;{
  &quot;model&quot;: &quot;llama3.1&quot;,
  &quot;messages&quot;: [{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Tell me about Canada.&quot;}],
  &quot;stream&quot;: false,
  &quot;format&quot;: {
    &quot;type&quot;: &quot;object&quot;,
    &quot;properties&quot;: {
      &quot;name&quot;: {
        &quot;type&quot;: &quot;string&quot;
      },
      &quot;capital&quot;: {
        &quot;type&quot;: &quot;string&quot;
      },
      &quot;languages&quot;: {
        &quot;type&quot;: &quot;array&quot;,
        &quot;items&quot;: {
          &quot;type&quot;: &quot;string&quot;
        }
      }
    },
    &quot;required&quot;: [
      &quot;name&quot;,
      &quot;capital&quot;, 
      &quot;languages&quot;
    ]
  }
}&#39;

Python library

Using the Ollama Python library, pass in the schema as a JSON object to the format parameter as either dict or use Pydantic (recommended) to serialize the schema using model_json_schema().

from ollama import chat
from pydantic import BaseModel

class Country(BaseModel):
  name: str
  capital: str
  languages: list[str]


response = chat(
  messages=[
    {
      &#39;role&#39;: &#39;user&#39;,
      &#39;content&#39;: &#39;Tell me about Canada.&#39;,
    }
  ],
  model=&#39;llama3.1&#39;,
  format=Country.model_json_schema(),
)

country = Country.model_validate_json(response.message.content)
print(country)

JavaScript library

Using the Ollama JavaScript library, pass in the schema as a JSON object to the format parameter as either object or use Zod (recommended) to serialize the schema using zodToJsonSchema():

import ollama from &#39;ollama&#39;;
import { z } from &#39;zod&#39;;
import { zodToJsonSchema } from &#39;zod-to-json-schema&#39;;

const Country = z.object({
    name: z.string(),
    capital: z.string(), 
    languages: z.array(z.string()),
});

const response = await ollama.chat({
    model: &#39;llama3.1&#39;,
    messages: [{ role: &#39;user&#39;, content: &#39;Tell me about Canada.&#39; }],
    format: zodToJsonSchema(Country),
});

const country = Country.parse(JSON.parse(response.message.content));
console.log(country);

What's Changed

  • Fixed error importing model vocabulary files
  • Experimental: new flag to set KV cache quantization to 4-bit (q4_0), 8-bit (q8_0) or 16-bit (f16). This reduces VRAM requirements for longer context windows.
    • To enable for all models, use OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve
    • Note: in the future flash attention will be enabled by default where available, with kv cache quantization available on a per-model basis
    • Thank you @sammcj for the contribution in in https://github.com/ollama/ollama/pull/7926

New Contributors

Full Changelog: https://github.com/ollama/ollama/compare/v0.4.7...v0.5.0