Running AI Locally with Ollama: A Complete Guide
Learn how to run powerful AI models on your own machine with Ollama — zero cloud dependency, complete privacy, and surprisingly fast performance.
Why Run AI Locally?
Cloud AI services like ChatGPT, Claude, and Gemini are powerful — but they come with real trade-offs:
- **Privacy risk:** Every prompt you send travels to a remote server, gets processed, logged, and potentially used for training
- **Subscription costs:** ChatGPT Plus costs $20/month, Claude Pro costs $20/month — and you're still rate-limited
- **Internet dependency:** No WiFi? No AI. On a flight? On a train in a tunnel? Out of luck
- **Latency:** Network round trips add 500ms–3s of delay on every response
- **Data compliance:** If you work with HIPAA, GDPR, or SOC 2 regulated data, sending it to third-party servers may violate compliance requirements
Local AI eliminates all of these. Your data stays on your machine. Your AI runs without the internet. And after the one-time download, it costs exactly $0 forever.
What Is Ollama?
Ollama is an open-source tool (with over 130K stars on GitHub) that makes it trivially easy to download, run, and manage large language models on your own computer. Think of it as "Docker for LLMs" — one command to pull a model, one command to run it.
Before Ollama, running a local LLM required wrangling Python dependencies, CUDA configurations, model quantization formats, and custom inference scripts. Ollama abstracts all of that behind a simple CLI and REST API.
How Ollama Works Under the Hood
Ollama uses llama.cpp as its inference backend — the same battle-tested C/C++ library that powers most local AI tools. When you run a model:
1. Ollama downloads the GGUF-formatted model weights (pre-quantized for efficiency)
2. It loads the model into RAM (or VRAM if you have a GPU)
3. It exposes a local REST API on http://localhost:11434 that any application — including Cognito — can talk to
4. Responses are generated token-by-token on your hardware
The result? A fully functional AI chatbot running entirely on your machine, accessible through a clean API that's compatible with the OpenAI chat format.
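The token-by-token generation is visible in the API itself: a request to `/api/generate` streams back one JSON object per line as tokens are produced. Here is a minimal sketch of consuming that stream with standard Unix tools; it uses canned sample lines modeled on the stream format so it runs without a live server (the real request is shown in the comment):

```shell
#!/bin/sh
# A live request would stream JSON lines, roughly one per token:
#   curl -s http://localhost:11434/api/generate \
#     -d '{"model": "llama3", "prompt": "Say hello"}'
# Canned lines modeled on that stream format, so this runs offline:
stream='{"model":"llama3","response":"Hello","done":false}
{"model":"llama3","response":" there","done":false}
{"model":"llama3","response":"!","done":true}'

# Pull out each "response" fragment and join them into the full reply
printf '%s\n' "$stream" \
  | sed -n 's/.*"response":"\([^"]*\)".*/\1/p' \
  | tr -d '\n'
echo
```

Concatenating the `response` fragments yields the complete reply, which is exactly what chat frontends do as tokens arrive.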
System Requirements
Before getting started, here's what you need:
Minimum Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| RAM | 8 GB | 16 GB+ |
| Storage | 5 GB free | 20 GB+ (for multiple models) |
| OS | macOS 12+, Windows 10+, Linux | Latest stable release |
| CPU | Any modern x86_64 or ARM | Apple Silicon M1+ or recent AMD/Intel |
GPU Acceleration (Optional but Recommended)

- Apple Silicon (M1/M2/M3/M4): Automatic Metal acceleration — no configuration needed
- NVIDIA GPUs: CUDA support for GTX/RTX series (6GB+ VRAM recommended)
- AMD GPUs: ROCm support on Linux
GPU acceleration can make responses 3–10x faster compared to CPU-only inference. Apple Silicon Macs are particularly impressive — an M2 MacBook Air can generate 30+ tokens/second with Llama 3 8B.
Installation: Step by Step
macOS
The fastest method, if you use Homebrew:

```bash
brew install ollama
```

Or download the installer from ollama.com/download. (The `curl | sh` install script targets Linux, so on macOS use the app or Homebrew.)
On macOS, Ollama runs as a native app in your menu bar. It starts automatically at login and manages the local server in the background.
Windows
Download and run the installer from ollama.com/download. Ollama runs as a system tray application on Windows.
Linux
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
On Linux, Ollama installs as a systemd service that starts automatically.
Docker
```bash
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
For GPU passthrough with NVIDIA:

```bash
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```
Verify Installation
After installing, open a terminal and run:

```bash
ollama --version
```

You should see something like `ollama version 0.6.x`. If you see a version number, you're ready to go.
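Beyond the CLI check, the background server answers a version query over HTTP, which is handy for scripting health checks. A small sketch that parses a canned copy of the response so it works offline (the live call is in the comment):

```shell
#!/bin/sh
# Live check against the default port:
#   curl -s http://localhost:11434/api/version
# Canned sample of the JSON body it returns:
response='{"version":"0.6.5"}'

# Extract the version field with sed
printf '%s\n' "$response" | sed 's/.*"version":"\([^"]*\)".*/\1/'
```

If the live curl hangs or errors, the server is not running, even if the CLI binary is installed.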
Downloading Your First Model
Ollama's model library at ollama.com/library has hundreds of models. Here's how to get started:
```bash
# Download Llama 3 — Meta's best open model (4.7 GB)
ollama pull llama3

# Or download while running
ollama run llama3
```
The first run downloads the model weights. Subsequent runs start instantly since the model is cached locally.
Recommended Models for Different Use Cases
| Model | Size | Speed | Best For |
|-------|------|-------|----------|
| Llama 3.1 8B | 4.7 GB | Fast | General purpose, conversation, analysis |
| Llama 3.1 70B | 40 GB | Moderate | Complex reasoning, coding, research (needs 48GB+ RAM) |
| Mistral 7B | 4.1 GB | Fast | Coding, instruction following, structured output |
| Gemma 2 9B | 5.4 GB | Fast | Google's efficient model, strong at reasoning |
| Phi-3 Mini 3.8B | 2.3 GB | Very fast | Lightweight tasks, older hardware, quick answers |
| Qwen 2.5 7B | 4.4 GB | Fast | Multilingual support, strong at Chinese + English |
| CodeLlama 7B | 3.8 GB | Fast | Code generation, debugging, documentation |
| DeepSeek Coder V2 | 8.9 GB | Moderate | Advanced code generation and analysis |
Our recommendation for most users: Start with Llama 3.1 8B. It offers the best balance of quality, speed, and RAM usage. If you have a powerful machine with 48GB+ RAM, try the 70B variant for near-GPT-4 quality responses.
Managing Models
```bash
# List downloaded models
ollama list

# Show model details
ollama show llama3

# Remove a model to free space
ollama rm codellama

# Copy/rename a model
ollama cp llama3 my-custom-llama
```
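Since downloaded models add up quickly, a quick tally of disk usage can help decide what to `ollama rm`. A sketch using canned `ollama list` rows; the exact column layout is an assumption about recent Ollama versions, and the live pipeline is shown in the comment:

```shell
#!/bin/sh
# With a live install:
#   ollama list | tail -n +2 | awk '{sum += $3} END {printf "Total: %.1f GB\n", sum}'
# Canned sample rows (header already stripped); layout is an assumption:
sample='llama3:latest    365c0bd3c000    4.7 GB    2 days ago
mistral:latest   2ae6f6dd7a3d    4.1 GB    5 days ago'

# Column 3 holds the size in GB; sum it across rows
printf '%s\n' "$sample" | awk '{sum += $3} END {printf "Total: %.1f GB\n", sum}'
```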
Connecting Ollama to Cognito
This is where it gets exciting. Cognito has native Ollama integration, meaning you can use your local models directly in your browser sidebar — no terminal required.
Setup Steps
1. **Ensure Ollama is running** — It should be active in your menu bar (macOS) or system tray (Windows). Verify by visiting http://localhost:11434 in your browser — you should see "Ollama is running."
2. **Open Cognito Settings** — Click the Cognito extension icon → Settings (gear icon)
3. **Select Ollama as your provider** — In the model provider dropdown, choose "Ollama"
4. **Pick your model** — Cognito automatically detects all models you've downloaded with `ollama pull`. Select one from the dropdown.
5. **Start chatting** — That's it! Every conversation now runs entirely on your machine. The AI sidebar works on every webpage, just like it does with cloud providers — except nothing leaves your computer.
CORS Configuration (If Needed)
If Cognito can't connect to Ollama, you may need to set the CORS origin. On macOS/Linux:
```bash
OLLAMA_ORIGINS="*" ollama serve
```
On Windows, set the environment variable OLLAMA_ORIGINS=* in System → Environment Variables, then restart Ollama.
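On Linux, where Ollama runs as a systemd service, the environment variable belongs in a service override rather than your shell. One way to set it, following standard systemd drop-in conventions:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_ORIGINS=*"
```

Then reload and restart the service: `sudo systemctl daemon-reload && sudo systemctl restart ollama`.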
Performance Optimization Tips
Use Quantized Models

Models come in different quantization levels. The default Q4_K_M offers an excellent speed/quality tradeoff:

- Q8: Highest quality, most RAM, slowest
- Q4_K_M: Best balance (default for most models)
- Q4_K_S: Slightly smaller, minimal quality loss
- Q3_K_S: Low RAM usage, noticeable quality reduction
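The size differences follow directly from bits per weight: Q4 stores roughly half a byte per parameter, Q8 a full byte. A back-of-envelope sketch for an 8B-parameter model (real GGUF files run somewhat larger because of metadata and mixed-precision layers):

```shell
#!/bin/sh
# Rough weight sizes for an 8B-parameter model at two quantization levels:
# ~0.5 bytes/param at Q4, ~1 byte/param at Q8.
params=8   # billions of parameters
q4=$(echo "$params" | awk '{printf "%.1f", $1 * 0.5}')
q8=$(echo "$params" | awk '{printf "%.1f", $1 * 1.0}')
echo "Q4: ~${q4} GB   Q8: ~${q8} GB"
```

This is why the 8B downloads in the table above land near 4–5 GB: they ship at Q4 by default.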
Allocate GPU Layers

If you have a GPU, Ollama automatically offloads layers. For NVIDIA users, ensure CUDA drivers are up to date:

```bash
nvidia-smi  # Check GPU status and VRAM
```
Tune Context Length

The default context window is typically 2048 tokens. For longer conversations, raise the `num_ctx` parameter from inside a chat session:

```bash
ollama run llama3
>>> /set parameter num_ctx 4096
```

Note: Larger context windows use more RAM.
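To make a larger context the default rather than a per-session setting, one option is a small Modelfile that derives a new variant (the `my-llama3-8k` name here is just an example):

```
# Modelfile
FROM llama3
PARAMETER num_ctx 8192
```

Build it with `ollama create my-llama3-8k -f Modelfile` and run it like any other model.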
Close Heavy Applications

When running larger models (13B+), close memory-hungry apps like Chrome tabs, Docker containers, or IDEs to free up RAM for the model.
Monitor Performance

```bash
# Check which models are loaded and how much memory they use
ollama ps
```

For detailed server logs, check `~/.ollama/logs/server.log` on macOS, or run `journalctl -u ollama` on Linux.
Real-World Use Cases for Local AI
For Developers

- Explain proprietary code without sending it to the cloud
- Generate unit tests for internal codebases
- Debug errors locally while working on air-gapped networks
- Code reviews on classified or NDA-protected projects

For Legal and Medical Professionals

- Summarize client documents with complete confidentiality
- Draft correspondence without data leaving the firm's network
- Research case law with zero data exposure
- Analyze medical records in HIPAA-compliant environments

For Students and Researchers

- Process research data without institutional data-sharing concerns
- Generate literature review outlines from local document collections
- Practice coding exercises offline during exams or commutes
- Run experiments with different models without API costs

For Privacy-Conscious Users

- Browse and ask questions about sensitive topics privately
- Translate personal documents without cloud exposure
- Get AI assistance while traveling without reliable internet
- Maintain complete control over your AI interactions
Ollama vs OpenAI API: Quick Comparison
| Aspect | Ollama (Local) | OpenAI API (Cloud) |
|--------|----------------|--------------------|
| Privacy | Complete — data never leaves your machine | Partial — data sent to OpenAI servers |
| Cost | Free after model download | $0.002–$0.06 per 1K tokens |
| Speed | 15–60 tokens/sec (hardware dependent) | 30–80 tokens/sec |
| Quality | Good to excellent (model dependent) | Best-in-class (GPT-4/5) |
| Internet | Not required | Required |
| Model variety | 300+ open-source models | OpenAI models only |
| Setup | One-time install + download | API key + billing |
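Because Ollama also serves an OpenAI-compatible endpoint, switching an existing OpenAI client between the two can be as small as changing the base URL. A sketch that builds the shared chat request body; the commented curl shows where it would go against a running local server:

```shell
#!/bin/sh
# Build an OpenAI-style chat request body
model="llama3"
prompt="Hello"
payload=$(printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' \
  "$model" "$prompt")
echo "$payload"

# Against a running Ollama server, the same body posts to:
#   curl -s http://localhost:11434/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$payload"
```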
Troubleshooting Common Issues
"Ollama is not running"

- macOS: Check your menu bar for the Ollama icon. If absent, launch the Ollama app from Applications
- Windows: Check the system tray. Restart via the Start Menu
- Linux: Run `sudo systemctl start ollama`

"Model too slow"

- Switch to a smaller model (8B instead of 70B)
- Enable GPU acceleration (check if your GPU is detected)
- Close other applications to free RAM
- Use a more quantized version (Q4 instead of Q8)

"Out of memory"

- Use a smaller model variant
- Reduce the context length (set `num_ctx` back to 2048)
- Add swap space on Linux for more virtual memory
- Consider upgrading RAM if you regularly use 13B+ models

"CORS error in browser extension"

- Set the `OLLAMA_ORIGINS="*"` environment variable
- Restart the Ollama service after changing the variable
- Ensure no firewall is blocking localhost connections
What's Next?
Ollama and the local AI ecosystem are evolving fast:

- Multimodal models like LLaVA let you process images locally
- Function calling enables tool use and agent workflows
- Fine-tuning support lets you customize models on your own data
- Cluster mode allows distributing inference across multiple machines
The gap between local and cloud AI shrinks with every new model release. Today's 8B parameter models rival GPT-3.5 in quality, and the trajectory suggests local models will reach GPT-4 parity within a year.
Get Started in 5 Minutes
1. Install Ollama: `curl -fsSL https://ollama.com/install.sh | sh`
2. Pull a model: `ollama pull llama3`
3. Install Cognito from the Chrome Web Store
4. Set Ollama as provider in Cognito settings
5. Start chatting with local AI on any webpage
No accounts. No API keys. No subscriptions. Just you, your computer, and AI that respects your privacy.
---
Related Reading
- Privacy-First AI: Why It Matters
- Open Source AI Models Guide
- API Keys Explained for AI Tools
Resources
- Ollama Official Site
- Meta AI Llama


