Running AI Locally with Ollama: A Complete Guide

Learn how to run powerful AI models on your own machine with Ollama — zero cloud dependency, complete privacy, and surprisingly fast performance.

Why Run AI Locally?

Cloud AI services like ChatGPT, Claude, and Gemini are powerful — but they come with real trade-offs:

- Privacy risk: Every prompt you send travels to a remote server, where it is processed, logged, and potentially used for training
- Subscription costs: ChatGPT Plus costs $20/month, Claude Pro costs $20/month — and you're still rate-limited
- Internet dependency: No WiFi? No AI. On a flight? On a train in a tunnel? Out of luck
- Latency: Network round trips add 500ms–3s of delay on every response
- Data compliance: If you work with HIPAA, GDPR, or SOC 2 regulated data, sending it to third-party servers may violate compliance requirements

Local AI eliminates all of these. Your data stays on your machine. Your AI runs without the internet. And after the one-time download, it costs exactly $0 forever.

What Is Ollama?

Ollama is an open-source tool (with over 130K stars on GitHub) that makes it trivially easy to download, run, and manage large language models on your own computer. Think of it as "Docker for LLMs" — one command to pull a model, one command to run it.

Before Ollama, running a local LLM required wrangling Python dependencies, CUDA configurations, model quantization formats, and custom inference scripts. Ollama abstracts all of that behind a simple CLI and REST API.

How Ollama Works Under the Hood

Ollama uses llama.cpp as its inference backend — the same battle-tested C/C++ library that powers most local AI tools. When you run a model:

1. Ollama downloads the GGUF-formatted model weights (pre-quantized for efficiency)
2. It loads the model into RAM (or VRAM if you have a GPU)
3. It exposes a local REST API on http://localhost:11434 that any application — including Cognito — can talk to
4. Responses are generated token-by-token on your hardware

The result? A fully functional AI chatbot running entirely on your machine, accessible through a clean API that's compatible with the OpenAI chat format.
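You can exercise that API straight from the terminal. A minimal sketch, assuming `ollama serve` is running and `llama3` has been pulled; the JSON validation step runs offline, while the commented `curl` line needs the live server:

```bash
# Chat request for Ollama's OpenAI-compatible endpoint
payload='{"model": "llama3", "messages": [{"role": "user", "content": "Say hello in five words."}]}'

# Validate the JSON locally before sending
echo "$payload" | python3 -c 'import json, sys; json.load(sys.stdin); print("payload OK")'

# Send it to the local server (requires `ollama serve` to be running):
# curl -s http://localhost:11434/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```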

System Requirements

Before getting started, here's what you need:

Minimum Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| RAM | 8 GB | 16 GB+ |
| Storage | 5 GB free | 20 GB+ (for multiple models) |
| OS | macOS 12+, Windows 10+, Linux | Latest stable release |
| CPU | Any modern x86_64 or ARM | Apple Silicon M1+ or recent AMD/Intel |

GPU Acceleration (Optional but Recommended)

- Apple Silicon (M1/M2/M3/M4): Automatic Metal acceleration — no configuration needed
- NVIDIA GPUs: CUDA support for GTX/RTX series (6GB+ VRAM recommended)
- AMD GPUs: ROCm support on Linux

GPU acceleration can make responses 3–10x faster compared to CPU-only inference. Apple Silicon Macs are particularly impressive — an M2 MacBook Air can generate 30+ tokens/second with Llama 3 8B.
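To measure throughput on your own hardware, `ollama run --verbose` prints timing statistics after each response, including the eval rate in tokens per second (assumes `llama3` is already pulled):

```bash
# Prints timing stats (load time, prompt eval rate, eval rate) after the reply
ollama run llama3 --verbose "Summarize the water cycle in one sentence."
```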

Installation: Step by Step

macOS

The fastest method is Homebrew:

```bash
brew install ollama
```

Or download the DMG installer from ollama.com/download.

On macOS, Ollama runs as a native app in your menu bar. It starts automatically at login and manages the local server in the background.

Windows

Run in PowerShell:

```powershell
winget install Ollama.Ollama
```

Or download the installer from ollama.com/download. Ollama runs as a system tray application on Windows.

Linux

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

On Linux, Ollama installs as a systemd service that starts automatically.

Docker

```bash
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

For GPU passthrough with NVIDIA:

```bash
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

Verify Installation

After installing, open a terminal and run:

```bash
ollama --version
```

You should see something like ollama version 0.6.x. If you see a version number, you're ready to go.

Downloading Your First Model

Ollama's model library at ollama.com/library has hundreds of models. Here's how to get started:

```bash
# Download Llama 3 — Meta's best open model (4.7 GB)
ollama pull llama3

# Or download while running
ollama run llama3
```

The first run downloads the model weights. Subsequent runs start instantly since the model is cached locally.

Recommended Models for Different Use Cases

| Model | Size | Speed | Best For |
|-------|------|-------|----------|
| Llama 3.1 8B | 4.7 GB | Fast | General purpose, conversation, analysis |
| Llama 3.1 70B | 40 GB | Moderate | Complex reasoning, coding, research (needs 48GB+ RAM) |
| Mistral 7B | 4.1 GB | Fast | Coding, instruction following, structured output |
| Gemma 2 9B | 5.4 GB | Fast | Google's efficient model, strong at reasoning |
| Phi-3 Mini 3.8B | 2.3 GB | Very fast | Lightweight tasks, older hardware, quick answers |
| Qwen 2.5 7B | 4.4 GB | Fast | Multilingual support, strong at Chinese + English |
| CodeLlama 7B | 3.8 GB | Fast | Code generation, debugging, documentation |
| DeepSeek Coder V2 | 8.9 GB | Moderate | Advanced code generation and analysis |

Our recommendation for most users: Start with Llama 3.1 8B. It offers the best balance of quality, speed, and RAM usage. If you have a powerful machine with 48GB+ RAM, try the 70B variant for near-GPT-4 quality responses.
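If you're unsure whether a model will fit, a rough back-of-envelope estimate helps. This sketch assumes roughly 0.6 GB per billion parameters for Q4 weights plus about 1.5 GB of runtime overhead; both factors are rough assumptions, not Ollama specifications:

```bash
# Back-of-envelope RAM estimate for a Q4-quantized model:
# ~0.6 GB per billion parameters plus ~1.5 GB overhead (KV cache, runtime).
estimate_ram_gb() {
  python3 -c "print(round($1 * 0.6 + 1.5, 1))"
}

echo "Llama 3.1 8B:  ~$(estimate_ram_gb 8) GB"
echo "Llama 3.1 70B: ~$(estimate_ram_gb 70) GB"
```

By this estimate the 8B model fits comfortably in 8 GB of RAM, while the 70B variant needs a 48 GB+ machine, matching the table above.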

Managing Models

```bash
# List downloaded models
ollama list

# Show model details
ollama show llama3

# Remove a model to free space
ollama rm codellama

# Copy/rename a model
ollama cp llama3 my-custom-llama
```
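`ollama cp` only duplicates a model under a new name; to actually customize one, use a Modelfile. A minimal sketch (the name `my-custom-llama` is illustrative; `FROM`, `PARAMETER`, and `SYSTEM` are standard Modelfile directives):

```bash
# Write a Modelfile that bakes in a system prompt and a larger context
cat > Modelfile <<'EOF'
FROM llama3
PARAMETER num_ctx 4096
PARAMETER temperature 0.3
SYSTEM You are a concise technical assistant.
EOF

cat Modelfile

# Build the variant (requires the Ollama daemon):
# ollama create my-custom-llama -f Modelfile
```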

Connecting Ollama to Cognito

This is where it gets exciting. Cognito has native Ollama integration, meaning you can use your local models directly in your browser sidebar — no terminal required.

Setup Steps

1. Ensure Ollama is running — It should be active in your menu bar (macOS) or system tray (Windows). Verify by visiting http://localhost:11434 in your browser — you should see "Ollama is running."

2. Open Cognito Settings — Click the Cognito extension icon → Settings (gear icon).

3. Select Ollama as your provider — In the model provider dropdown, choose "Ollama".

4. Pick your model — Cognito automatically detects all models you've downloaded with ollama pull. Select one from the dropdown.

5. Start chatting — That's it! Every conversation now runs entirely on your machine. The AI sidebar works on every webpage, just like it does with cloud providers — except nothing leaves your computer.
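A quick pre-flight check before opening Cognito can save debugging time. This sketch probes `/api/tags`, Ollama's endpoint for listing local models:

```bash
# Is the server up and answering API requests?
if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
  echo "Ollama server: up"
else
  echo "Ollama server: down — start the Ollama app or run 'ollama serve'"
fi
```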

CORS Configuration (If Needed)

If Cognito can't connect to Ollama, you may need to set the CORS origin. On macOS/Linux:

```bash
OLLAMA_ORIGINS="*" ollama serve
```

On Windows, set the environment variable OLLAMA_ORIGINS=* in System → Environment Variables, then restart Ollama.
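On Linux, exporting the variable in a shell only affects that shell. To persist it for the systemd service, a drop-in file is the standard systemd approach (a sketch; requires root):

```bash
# Persist OLLAMA_ORIGINS for the systemd-managed service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/cors.conf <<'EOF'
[Service]
Environment="OLLAMA_ORIGINS=*"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```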

Performance Optimization Tips

1. Use Quantized Models

Models come in different quantization levels. The default Q4_K_M offers an excellent speed/quality tradeoff:

- Q8: Highest quality, most RAM, slowest
- Q4_K_M: Best balance (default for most models)
- Q4_K_S: Slightly smaller, minimal quality loss
- Q3_K_S: Low RAM usage, noticeable quality reduction
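Quantization variants are selected through the model tag when pulling. Exact tag names vary per model, so check the tag list on ollama.com/library; the tags below are examples of the common naming pattern:

```bash
# Pull a specific quantization by tag (verify tags on ollama.com/library)
ollama pull llama3:8b-instruct-q8_0    # higher quality, roughly double the RAM
ollama pull llama3:8b-instruct-q4_K_M  # the default balance
```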

2. Allocate GPU Layers

If you have a GPU, Ollama automatically offloads layers. For NVIDIA users, ensure CUDA drivers are up to date:

```bash
nvidia-smi  # Check GPU status and VRAM
```

3. Tune Context Length

The default context window is usually 2048 tokens. For longer conversations, raise num_ctx from inside the interactive session:

```bash
ollama run llama3
>>> /set parameter num_ctx 4096
```

Note: Larger context windows use more RAM.

4. Close Heavy Applications

When running larger models (13B+), close memory-hungry apps like Chrome tabs, Docker containers, or IDEs to free up RAM for the model.

5. Monitor Performance

```bash
# Show which models are loaded and how much memory they use
ollama ps

# View server logs (Linux; on macOS check ~/.ollama/logs/server.log)
journalctl -u ollama -e
```

Real-World Use Cases for Local AI

For Developers

- Explain proprietary code without sending it to the cloud
- Generate unit tests for internal codebases
- Debug errors locally while working on air-gapped networks
- Review code on classified or NDA-protected projects

For Legal and Medical Professionals

- Summarize client documents with complete confidentiality
- Draft correspondence without data leaving the firm's network
- Research case law with zero data exposure
- Analyze medical records in HIPAA-compliant environments

For Students and Researchers

- Process research data without institutional data-sharing concerns
- Generate literature review outlines from local document collections
- Practice coding exercises offline during commutes
- Run experiments with different models without API costs

For Privacy-Conscious Users

- Browse and ask questions about sensitive topics privately
- Translate personal documents without cloud exposure
- Get AI assistance while traveling without reliable internet
- Maintain complete control over your AI interactions

Ollama vs OpenAI API: Quick Comparison

| Aspect | Ollama (Local) | OpenAI API (Cloud) |
|--------|----------------|--------------------|
| Privacy | Complete — data never leaves your machine | Partial — data sent to OpenAI servers |
| Cost | Free after model download | $0.002–$0.06 per 1K tokens |
| Speed | 15–60 tokens/sec (hardware dependent) | 30–80 tokens/sec |
| Quality | Good to excellent (model dependent) | Best-in-class (GPT-4/5) |
| Internet | Not required | Required |
| Model variety | 300+ open-source models | OpenAI models only |
| Setup | One-time install + download | API key + billing |
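To make the cost row concrete, here is the arithmetic at an illustrative $0.01 per 1K tokens (a mid-range figure, not a quoted price) for a user generating 50K tokens a day:

```bash
# 50K tokens/day, 30 days, $0.01 per 1K tokens (illustrative numbers)
python3 -c "print(f'\${50_000 / 1000 * 0.01 * 30:.2f}/month')"
```

Local inference trades that recurring bill for a one-time hardware and download cost.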

Troubleshooting Common Issues

"Ollama is not running"

- macOS: Check your menu bar for the Ollama icon. If absent, launch the Ollama app from Applications
- Windows: Check the system tray. Restart via the Start Menu
- Linux: Run sudo systemctl start ollama

"Model too slow"

- Switch to a smaller model (8B instead of 70B)
- Enable GPU acceleration (check that your GPU is detected)
- Close other applications to free RAM
- Use a more aggressively quantized version (Q4 instead of Q8)

"Out of memory"

- Use a smaller model variant
- Reduce the context length (e.g. /set parameter num_ctx 2048 in the REPL)
- Add swap space on Linux for more virtual memory
- Consider upgrading RAM if you regularly use 13B+ models

"CORS error in browser extension"

- Set the OLLAMA_ORIGINS="*" environment variable
- Restart the Ollama service after changing the variable
- Ensure no firewall is blocking localhost connections
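To debug CORS from the command line, you can replay the extension's cross-origin request yourself. The extension ID below is a placeholder; a correctly configured server answers with an Access-Control-Allow-Origin header:

```bash
# Simulate the extension's cross-origin request (placeholder extension ID)
curl -si http://localhost:11434/api/tags \
  -H "Origin: chrome-extension://abcdefghijklmnop" | grep -i "access-control"
```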

What's Next?

Ollama and the local AI ecosystem are evolving fast:

- Multimodal models like LLaVA let you process images locally
- Function calling enables tool use and agent workflows
- Fine-tuning support lets you customize models on your own data
- Cluster mode allows distributing inference across multiple machines

The gap between local and cloud AI shrinks with every new model release. Today's 8B parameter models rival GPT-3.5 in quality, and the trajectory suggests local models will reach GPT-4 parity within a year.

Get Started in 5 Minutes

1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
2. Pull a model: ollama pull llama3
3. Install Cognito from the Chrome Web Store
4. Set Ollama as provider in Cognito settings
5. Start chatting with local AI on any webpage

No accounts. No API keys. No subscriptions. Just you, your computer, and AI that respects your privacy.

---

Related Reading

- Privacy-First AI: Why It Matters
- Open Source AI Models Guide
- API Keys Explained for AI Tools

Resources

- Ollama Official Site
- Meta AI Llama

Cognito Team · 8 min read · Mar 14, 2026
