INDEX
- Introduction
- Requirements
- What we are building
- Installing Ollama
- Choosing small models
- Valid SLM tasks
- Local classification service
- Measuring quality, cost and time
- Extras
- Full stack
1. Introduction
Local AI does not have to start with a huge GPU. The first goal is to find narrow, repeated, low-risk tasks where a small model can prepare work before a paid model is called.
2. Requirements
We need a machine that can run Ollama, real examples from our own work, candidate low-risk tasks, a way to measure accuracy, and a paid model to compare against.
3. What we are building
We install Ollama, download a small model, test it on classification, summarization, and extraction, wrap it in a local function, and escalate to a paid model only when needed.
4. Installing Ollama
Ollama serves a local API at http://localhost:11434/api and provides endpoints such as /api/generate and /api/chat.
ollama run llama3.2:1b
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2:1b",
"messages": [
{"role": "user", "content": "Classify this ticket: I cannot access VPN"}
],
"stream": false
}'
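With "stream": false, /api/chat returns a single JSON object whose reply text lives under message.content. A minimal sketch of pulling the label out (the sample payload below is illustrative):

```python
# Sketch: extract the assistant reply from a non-streaming /api/chat response.
def extract_reply(payload: dict) -> str:
    # The reply text sits under message.content; strip stray whitespace.
    return payload["message"]["content"].strip()

# Illustrative payload in the shape Ollama returns with "stream": false.
sample = {
    "model": "llama3.2:1b",
    "message": {"role": "assistant", "content": " vpn "},
    "done": True,
}
print(extract_reply(sample))  # -> vpn
```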
5. Choosing small models
The Llama 3.2 page on Ollama lists small 1B and 3B variants. Use 1B for simple classification, tags and short rewrites; use 3B for summaries, JSON extraction and slightly more nuanced classification.
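The 1B/3B split above can be encoded as a simple routing table. A minimal sketch, where the task names are assumptions for illustration:

```python
# Sketch: pick a model size from the task type, following the 1B/3B split.
# Task names are illustrative, not an Ollama concept; adapt to your queue.
SIMPLE_TASKS = {"classification", "tags", "short_rewrite"}
NUANCED_TASKS = {"summary", "json_extraction", "nuanced_classification"}

def choose_model(task: str) -> str:
    if task in SIMPLE_TASKS:
        return "llama3.2:1b"
    if task in NUANCED_TASKS:
        return "llama3.2:3b"
    # Anything unlisted has no local assignment yet.
    raise ValueError(f"no local model assigned for task: {task}")
```

Keeping this decision in one function makes it easy to promote a task from 1B to 3B once accuracy numbers come in.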
6. Valid SLM tasks
Good tasks: classify emails, summarize internal notes, extract dates, detect duplicates, prepare context and create drafts. Bad first tasks: critical decisions, final client writing and anything hard to review.
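The good/bad split above amounts to an allowlist: only pre-approved low-risk tasks run locally, everything else escalates. A minimal sketch, with task labels invented for illustration:

```python
# Sketch: gate which work goes to the local model. Task labels here are
# assumptions for illustration; adapt them to your own workflow.
LOCAL_OK = {
    "classify_email", "summarize_note", "extract_dates",
    "detect_duplicate", "prepare_context", "draft",
}

def route(task: str) -> str:
    # Low-risk, repeated tasks run locally; everything else goes to
    # the paid model so critical work stays reviewable.
    return "local" if task in LOCAL_OK else "paid"
```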
7. Local classification service
import requests

def classify_ticket(text):
    # Constrain the answer to a closed label set so it is easy to validate.
    prompt = f"Classify this ticket as vpn, email, hardware, software or other:\n{text}"
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3.2:1b",
            "stream": False,  # one JSON object instead of a token stream
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["message"]["content"].strip().lower()
8. Measuring quality, cost and time
Track accuracy against hand-labeled examples, paid-model tokens avoided, time spent on manual review, and the final routing decision. If the model fails only in specific categories, add escalation rules for those categories instead of abandoning the local model.
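Scoring the local model against labeled examples can be as small as this sketch; the (prediction, gold) pair shape is an assumption for illustration:

```python
from collections import Counter

# Sketch: score the local model on hand-labeled (prediction, gold) pairs.
def score(pairs):
    correct = sum(1 for pred, gold in pairs if pred == gold)
    return correct / len(pairs)

def failures_by_label(pairs):
    # Count misses per gold label to see whether failures cluster in
    # specific categories that deserve an escalation rule.
    return Counter(gold for pred, gold in pairs if pred != gold)

examples = [("vpn", "vpn"), ("email", "email"), ("other", "hardware"), ("vpn", "vpn")]
print(score(examples))  # -> 0.75
```

If failures_by_label shows all misses landing on one category, route only that category to the paid model.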
9. Extras
Use an Ollama Modelfile to specialize behavior, keep the examples the model failed on for later review, and do not judge the setup on speed alone.
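A Modelfile bakes the classification instructions into a named local model. A minimal sketch, assuming the llama3.2:1b base and the ticket labels used earlier (the system prompt and model name are our own choices):

```
# Modelfile: a specialized ticket classifier (illustrative system prompt)
FROM llama3.2:1b
PARAMETER temperature 0
SYSTEM "Classify the ticket as vpn, email, hardware, software or other. Reply with one word."
```

Build and run it with ollama create ticket-classifier -f Modelfile, then ollama run ticket-classifier.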
10. Full stack
Ollama, a small model, real examples, a test script, accuracy tracking and a fallback rule to paid AI.