Ollama
Mac
The best Ollama models for local coding in early 2026 are:
- Qwen2.5-Coder (7B–32B)
- DeepSeek-Coder-V2 (16B/236B)
- Codestral (22B)
Optimized for MLX on Apple Silicon (preview):
ollama run qwen3.5:35b-a3b-coding-nvfp4
Cloud models let you run high-parameter LLMs (e.g., 120B+ parameters) in the cloud rather than locally, so you can use them without a powerful local GPU.
ollama run kimi-k2.5:cloud
ollama run kimi-k2-thinking:cloud
Use Ollama in Cursor
- Go to Cursor Settings, add a custom model, and enter the exact name (e.g., qwen3.5:35b-a3b-coding-nvfp4)
- Expand Override OpenAI Base URL
- Change https://api.openai.com/v1 to https://ollama.dph.am/v1 (this needs a proxy to your localhost:11434)
- Disable other GPT models
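Cursor talks to the overridden base URL in the OpenAI chat-completions format, which Ollama serves under /v1. As a rough sketch of the kind of request that ends up at the proxy, the payload can be assembled with the Python standard library. The helper name `build_openai_chat_request` is made up for illustration, and the URL and model name are the ones from the steps above.

```python
import json
import urllib.request


def build_openai_chat_request(base_url, model, prompt, api_key=None):
    """Build (but do not send) an OpenAI-style chat completion request.

    base_url is the value entered in Cursor, e.g. https://ollama.dph.am/v1.
    This helper is illustrative; it is not part of Ollama or Cursor.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_openai_chat_request(
    "https://ollama.dph.am/v1",
    "qwen3.5:35b-a3b-coding-nvfp4",
    "Write a function that reverses a string.",
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` sends it. If the call fails with a model-not-found error, the name entered in Cursor most likely does not match the name shown by `ollama list`.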
API key
- generate API key
openssl rand -base64 32
# or
python3 -c "import secrets; print(secrets.token_urlsafe(32))"
- set variable
launchctl setenv OLLAMA_API_KEY "your_api_key"
OLLAMA_API_KEY=ZnrKFRFpTmaLFoLiGq1u9i4fP-gLNWKdKpurV3vJxMY launchctl setenv OLLAMA_API_KEY "ZnrKFRFpTmaLFoLiGq1u9i4fP-gLNWKdKpurV3vJxMY"
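The two steps above can also be scripted. The sketch below generates a key the same way as the python3 one-liner and formats the matching launchctl command; both function names are hypothetical, and the command is only printed, not executed.

```python
import secrets
import shlex


def generate_api_key(num_bytes=32):
    """Return a URL-safe random token from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


def launchctl_command(api_key):
    """Format the macOS launchctl command that exports the key."""
    return "launchctl setenv OLLAMA_API_KEY " + shlex.quote(api_key)


key = generate_api_key()
print(launchctl_command(key))
```

With 32 random bytes the token is 43 characters long, drawn from letters, digits, `-` and `_`, so it needs no shell escaping.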
Test:
# without the Authorization header
curl https://ollama.dph.am/api/generate \
-d '{
"model": "qwen3.5:35b-a3b-coding-nvfp4",
"prompt": "Why is the sky blue?",
"stream": false
}'
# with the Authorization header
curl https://ollama.dph.am/api/generate \
-H "Authorization: Bearer $OLLAMA_API_KEY" \
-d '{
"model": "qwen3.5:35b-a3b-coding-nvfp4",
"prompt": "Why is the sky blue?",
"stream": false
}'
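The same test can be run from Python. This is a minimal sketch that assumes the proxy in front of Ollama is what validates the bearer token; `build_generate_request` and `generate` are illustrative names, and only the request construction runs here, not the network call.

```python
import json
import os
import urllib.request


def build_generate_request(base_url, model, prompt, api_key=None):
    """Build a non-streaming POST to /api/generate, optionally with a bearer token."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def generate(base_url, model, prompt, api_key=None, timeout=120):
    """Send the request and return the generated text from the "response" field."""
    request = build_generate_request(base_url, model, prompt, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["response"]


# Reads the key exported earlier; None means the header is left out.
request = build_generate_request(
    "https://ollama.dph.am",
    "qwen3.5:35b-a3b-coding-nvfp4",
    "Why is the sky blue?",
    api_key=os.environ.get("OLLAMA_API_KEY"),
)
print(request.full_url)
```

Calling `generate(...)` with and without `api_key` mirrors the two curl commands. `urlopen` raises `urllib.error.HTTPError` for 4xx responses, so a rejected request surfaces as an exception rather than a return value.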
Open Web UI
source ~/.venv-scriptsync/bin/activate
# serve
open-webui serve --port 3010
# install / upgrade
pip install open-webui -U
# change password
cd ~/.venv-scriptsync/lib/python3.11/site-packages/open_webui
sqlite3 data/webui.db
# generate a bcrypt hash of the new password, e.g. with https://gchq.github.io/CyberChef/#recipe=Bcrypt(10)&input=Y2Fubm9u
UPDATE auth SET password='bcrypt hash from cyberchef' WHERE email='dominick.pham@gmail.com';
General
My models
- llama3.1:70b
- llama3.2:3b
- qwen2.5:14b
- qwen2.5-coder:1.5b
# to serve on all network interfaces (Windows cmd syntax)
set OLLAMA_HOST=0.0.0.0
ollama serve
# to run in chat mode
ollama run qwen2.5-coder:1.5b
# default models directory
C:\Users\domin\.ollama\models
# to change, set system environment variable OLLAMA_MODELS to new path
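A script that needs to locate the model files can mirror that lookup. The sketch below is an approximation of the rule described above, not Ollama's own code, and `models_dir` is a hypothetical helper name.

```python
import os
from pathlib import Path


def models_dir(env=None):
    """Return OLLAMA_MODELS when set, otherwise ~/.ollama/models."""
    env = os.environ if env is None else env
    override = env.get("OLLAMA_MODELS")
    return Path(override) if override else Path.home() / ".ollama" / "models"


print(models_dir())
```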
Autocomplete
curl http://ollama.dph.am/api/generate -d '{
"model": "qwen2.5-coder:1.5b",
"system": "You are an AI specialized in providing autocompletion suggestions. When given a partial sentence or phrase, suggest concise and relevant completions in a list format.",
"prompt": "Provide autocompletion suggestions for the following partial sentence: \"the quick brown fox\"",
"stream": false,
"format": "json",
"options": {
"temperature": 0.7,
"num_predict": 100,
"top_p": 1.0,
"frequency_penalty": 0.0,
"presence_penalty": 0.0
}
}'
Other calls
Reference: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion
POST /api/chat - chat completion
curl http://ollama.dph.am/api/chat -d '{
"model": "qwen2.5-coder:1.5b",
"stream": false,
"messages": [
{
"role": "system",
"content": "You respond in Vietnamese"
},
{
"role": "user",
"content": "why is the sky blue?"
},
{
"role": "assistant",
"content": "due to rayleigh scattering."
},
{
"role": "user",
"content": "how is that different than mie scattering?"
}
]
}'
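The chat endpoint keeps no conversation state between calls, so every request has to carry the earlier turns. A minimal sketch of assembling that payload follows; the system prompt is passed as a leading message with role "system", and `build_chat_payload` is an illustrative name.

```python
import json


def build_chat_payload(model, history, user_message, system_prompt=None):
    """Return an /api/chat payload that replays earlier turns plus a new one."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.extend(history)
    messages.append({"role": "user", "content": user_message})
    return {"model": model, "messages": messages, "stream": False}


history = [
    {"role": "user", "content": "why is the sky blue?"},
    {"role": "assistant", "content": "due to rayleigh scattering."},
]
payload = build_chat_payload(
    "qwen2.5-coder:1.5b",
    history,
    "how is that different than mie scattering?",
    system_prompt="You respond in Vietnamese",
)
print(json.dumps(payload, indent=2))
```

After each reply, appending both the user message and the assistant's answer to `history` keeps the next request in sync with the conversation.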
POST /api/generate - generate a completion
curl http://ollama.dph.am/api/generate -d '{
"model": "qwen2.5-coder:1.5b",
"system": "You are an AI specialized in providing autocompletion suggestions. When given a partial sentence or phrase, suggest concise and relevant completions in a list format.",
"prompt": "Provide autocompletion suggestions for the following partial sentence: \"the quick brown fox\""
}'
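The request above does not set "stream": false, so the reply arrives as a series of JSON objects, one per line, each carrying a fragment in "response" and ending with an object where "done" is true. As a sketch, the fragments can be stitched together like this; the sample lines are invented for illustration and `collect_stream` is a hypothetical helper.

```python
import json


def collect_stream(lines):
    """Join the "response" fragments from a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Invented sample data in the shape of a streamed reply.
sample = [
    '{"model": "qwen2.5-coder:1.5b", "response": "jumps over", "done": false}',
    '{"model": "qwen2.5-coder:1.5b", "response": " the lazy dog", "done": false}',
    '{"model": "qwen2.5-coder:1.5b", "response": "", "done": true}',
]
print(collect_stream(sample))
```

The function accepts any iterable of lines, so an open `urllib` response object can be passed in directly and read as it arrives.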
With "stream": false, the full text comes back in a single response object instead of a series of partial objects. When format is set to json, the output will always be a well-formed JSON object. It's important to also instruct the model to respond in JSON in the prompt.
curl http://ollama.dph.am/api/generate -d '{
"model": "qwen2.5-coder:1.5b",
"system": "talk like a 5 year old",
"prompt": "Why is the sky blue? Respond using JSON",
"format": "json",
"stream": false
}'
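One detail worth noting when consuming this reply: the model's JSON arrives as a string inside the "response" field, so it has to be decoded a second time. A small sketch with an invented sample body follows; `parse_json_mode_reply` is a hypothetical name.

```python
import json


def parse_json_mode_reply(body_text):
    """Decode a non-streaming reply whose "response" field holds JSON text."""
    body = json.loads(body_text)          # outer envelope from the API
    return json.loads(body["response"])   # inner JSON written by the model


# Invented sample in the shape of a format=json reply.
sample_body = json.dumps({
    "model": "qwen2.5-coder:1.5b",
    "response": "{\"answer\": \"the sky scatters blue light\"}",
    "done": True,
})
print(parse_json_mode_reply(sample_body)["answer"])
```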
Full options
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5-coder:1.5b",
"prompt": "Why is the sky blue?",
"stream": false,
"options": {
"num_keep": 5,
"seed": 42,
"num_predict": 100,
"top_k": 20,
"top_p": 0.9,
"min_p": 0.0,
"typical_p": 0.7,
"repeat_last_n": 33,
"temperature": 0.8,
"repeat_penalty": 1.2,
"presence_penalty": 1.5,
"frequency_penalty": 1.0,
"mirostat": 1,
"mirostat_tau": 0.8,
"mirostat_eta": 0.6,
"penalize_newline": true,
"stop": ["\n", "user:"],
"numa": false,
"num_ctx": 1024,
"num_batch": 2,
"num_gpu": 1,
"main_gpu": 0,
"low_vram": false,
"vocab_only": false,
"use_mmap": true,
"use_mlock": false,
"num_thread": 8
}
}'
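The options object uses Ollama's own parameter names, which only partly overlap with the OpenAI-style names many clients use. For example, the cap on generated tokens is num_predict rather than max_tokens. A hypothetical translation helper, covering only that one mapping, might look like this:

```python
import json

# Only the one mapping noted above; extend as needed.
OPENAI_TO_OLLAMA = {"max_tokens": "num_predict"}


def to_ollama_options(params):
    """Rename OpenAI-style keys to the names used in Ollama's options object."""
    return {OPENAI_TO_OLLAMA.get(name, name): value for name, value in params.items()}


options = to_ollama_options({"temperature": 0.7, "top_p": 1.0, "max_tokens": 100})
payload = {
    "model": "qwen2.5-coder:1.5b",
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": options,
}
print(json.dumps(payload, indent=2))
```

Keys that are already valid Ollama names pass through unchanged, so the helper is safe to apply to a mixed dictionary.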