Running AI Locally with Ollama: A Complete Guide

Learn how to run powerful AI models on your own machine with Ollama — zero cloud dependency, complete privacy, and surprisingly fast performance.

Why Run AI Locally?

Cloud AI services like ChatGPT, Claude, and Gemini are powerful — but they come with real trade-offs:

- Privacy risk: Every prompt you send travels to a remote server, where it is processed, logged, and potentially used for training
- Subscription costs: ChatGPT Plus costs $20/month, Claude Pro costs $20/month — and you're still rate-limited
- Internet dependency: No WiFi? No AI. On a flight? On a train in a tunnel? Out of luck
- Latency: Network round trips add 500ms–3s of delay on every response
- Data compliance: If you work with HIPAA, GDPR, or SOC 2 regulated data, sending it to third-party servers may violate compliance requirements

Local AI eliminates all of these. Your data stays on your machine. Your AI runs without the internet. And after the one-time download, it costs exactly $0 forever.

What Is Ollama?

Ollama is an open-source tool (with over 130K stars on GitHub) that makes it trivially easy to download, run, and manage large language models on your own computer. Think of it as "Docker for LLMs" — one command to pull a model, one command to run it.

Before Ollama, running a local LLM required wrangling Python dependencies, CUDA configurations, model quantization formats, and custom inference scripts. Ollama abstracts all of that behind a simple CLI and REST API.

How Ollama Works Under the Hood

Ollama uses llama.cpp as its inference backend — the same battle-tested C/C++ library that powers most local AI tools. When you run a model:

1. Ollama downloads the GGUF-formatted model weights (pre-quantized for efficiency)
2. It loads the model into RAM (or VRAM if you have a GPU)
3. It exposes a local REST API on http://localhost:11434 that any application — including Cognito — can talk to
4. Responses are generated token-by-token on your hardware

The result? A fully functional AI chatbot running entirely on your machine, accessible through a clean API that's compatible with the OpenAI chat format.
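You can exercise that API straight from the terminal. A minimal sketch, assuming `ollama serve` is running and `llama3` has been pulled; the JSON validation step runs offline, while the commented `curl` line needs the live server:

```bash
# Chat request for Ollama's OpenAI-compatible endpoint
payload='{"model": "llama3", "messages": [{"role": "user", "content": "Say hello in five words."}]}'

# Validate the JSON locally before sending
echo "$payload" | python3 -c 'import json, sys; json.load(sys.stdin); print("payload OK")'

# Send it to the local server (requires `ollama serve` to be running):
# curl -s http://localhost:11434/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
```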

System Requirements

Before getting started, here's what you need:

Minimum Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| RAM | 8 GB | 16 GB+ |
| Storage | 5 GB free | 20 GB+ (for multiple models) |
| OS | macOS 12+, Windows 10+, Linux | Latest stable release |
| CPU | Any modern x86_64 or ARM | Apple Silicon M1+ or recent AMD/Intel |

GPU Acceleration (Optional but Recommended)

- Apple Silicon (M1/M2/M3/M4): Automatic Metal acceleration — no configuration needed
- NVIDIA GPUs: CUDA support for GTX/RTX series (6GB+ VRAM recommended)
- AMD GPUs: ROCm support on Linux

GPU acceleration can make responses 3–10x faster compared to CPU-only inference. Apple Silicon Macs are particularly impressive — an M2 MacBook Air can generate 30+ tokens/second with Llama 3 8B.
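To measure throughput on your own hardware, `ollama run --verbose` prints timing statistics after each response, including the eval rate in tokens per second (assumes `llama3` is already pulled):

```bash
# Prints timing stats (load time, prompt eval rate, eval rate) after the reply
ollama run llama3 --verbose "Summarize the water cycle in one sentence."
```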

Installation: Step by Step

macOS

The fastest method is Homebrew:

```bash
brew install ollama
```

Or download the DMG installer from ollama.com/download.

On macOS, Ollama runs as a native app in your menu bar. It starts automatically at login and manages the local server in the background.

Windows

Run in PowerShell:

```powershell
winget install Ollama.Ollama
```

Or download the installer from ollama.com/download. Ollama runs as a system tray application on Windows.

Linux

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

On Linux, Ollama installs as a systemd service that starts automatically.

Docker

```bash
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

For GPU passthrough with NVIDIA:

```bash
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

Verify Installation

After installing, open a terminal and run:

```bash
ollama --version
```

You should see something like ollama version 0.6.x. If you see a version number, you're ready to go.

Downloading Your First Model

Ollama's model library at ollama.com/library has hundreds of models. Here's how to get started:

```bash
# Download Llama 3 — Meta's best open model (4.7 GB)
ollama pull llama3

# Or download while running
ollama run llama3
```

The first run downloads the model weights. Subsequent runs start instantly since the model is cached locally.

Recommended Models for Different Use Cases

| Model | Size | Speed | Best For |
|-------|------|-------|----------|
| Llama 3.1 8B | 4.7 GB | Fast | General purpose, conversation, analysis |
| Llama 3.1 70B | 40 GB | Moderate | Complex reasoning, coding, research (needs 48GB+ RAM) |
| Mistral 7B | 4.1 GB | Fast | Coding, instruction following, structured output |
| Gemma 2 9B | 5.4 GB | Fast | Google's efficient model, strong at reasoning |
| Phi-3 Mini 3.8B | 2.3 GB | Very fast | Lightweight tasks, older hardware, quick answers |
| Qwen 2.5 7B | 4.4 GB | Fast | Multilingual support, strong at Chinese + English |
| CodeLlama 7B | 3.8 GB | Fast | Code generation, debugging, documentation |
| DeepSeek Coder V2 | 8.9 GB | Moderate | Advanced code generation and analysis |

Our recommendation for most users: Start with Llama 3.1 8B. It offers the best balance of quality, speed, and RAM usage. If you have a powerful machine with 48GB+ RAM, try the 70B variant for near-GPT-4 quality responses.
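If you're unsure whether a model will fit, a rough back-of-envelope estimate helps. This sketch assumes roughly 0.6 GB per billion parameters for Q4 weights plus about 1.5 GB of runtime overhead; both factors are rough assumptions, not Ollama specifications:

```bash
# Back-of-envelope RAM estimate for a Q4-quantized model:
# ~0.6 GB per billion parameters plus ~1.5 GB overhead (KV cache, runtime).
estimate_ram_gb() {
  python3 -c "print(round($1 * 0.6 + 1.5, 1))"
}

echo "Llama 3.1 8B:  ~$(estimate_ram_gb 8) GB"
echo "Llama 3.1 70B: ~$(estimate_ram_gb 70) GB"
```

By this estimate the 8B model fits comfortably in 8 GB of RAM, while the 70B variant needs a 48 GB+ machine, matching the table above.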

Managing Models

```bash
# List downloaded models
ollama list

# Show model details
ollama show llama3

# Remove a model to free space
ollama rm codellama

# Copy/rename a model
ollama cp llama3 my-custom-llama
```
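`ollama cp` only duplicates a model under a new name; to actually customize one, use a Modelfile. A minimal sketch (the name `my-custom-llama` is illustrative; `FROM`, `PARAMETER`, and `SYSTEM` are standard Modelfile directives):

```bash
# Write a Modelfile that bakes in a system prompt and a larger context
cat > Modelfile <<'EOF'
FROM llama3
PARAMETER num_ctx 4096
PARAMETER temperature 0.3
SYSTEM You are a concise technical assistant.
EOF

cat Modelfile

# Build the variant (requires the Ollama daemon):
# ollama create my-custom-llama -f Modelfile
```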

Connecting Ollama to Cognito

This is where it gets exciting. Cognito has native Ollama integration, meaning you can use your local models directly in your browser sidebar — no terminal required.

Setup Steps

1. Ensure Ollama is running — It should be active in your menu bar (macOS) or system tray (Windows). Verify by visiting http://localhost:11434 in your browser — you should see "Ollama is running."

2. Open Cognito Settings — Click the Cognito extension icon → Settings (gear icon).

3. Select Ollama as your provider — In the model provider dropdown, choose "Ollama".

4. Pick your model — Cognito automatically detects all models you've downloaded with ollama pull. Select one from the dropdown.

5. Start chatting — That's it! Every conversation now runs entirely on your machine. The AI sidebar works on every webpage, just like it does with cloud providers — except nothing leaves your computer.
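A quick pre-flight check before opening Cognito can save debugging time. This sketch probes `/api/tags`, Ollama's endpoint for listing local models:

```bash
# Is the server up and answering API requests?
if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
  echo "Ollama server: up"
else
  echo "Ollama server: down — start the Ollama app or run 'ollama serve'"
fi
```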

CORS Configuration (If Needed)

If Cognito can't connect to Ollama, you may need to set the CORS origin. On macOS/Linux:

```bash
OLLAMA_ORIGINS="*" ollama serve
```

On Windows, set the environment variable OLLAMA_ORIGINS=* in System → Environment Variables, then restart Ollama.
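On Linux, exporting the variable in a shell only affects that shell. To persist it for the systemd service, a drop-in file is the standard systemd approach (a sketch; requires root):

```bash
# Persist OLLAMA_ORIGINS for the systemd-managed service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/cors.conf <<'EOF'
[Service]
Environment="OLLAMA_ORIGINS=*"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```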

Performance Optimization Tips

1. Use Quantized Models

Models come in different quantization levels. The default Q4_K_M offers an excellent speed/quality tradeoff:

- Q8: Highest quality, most RAM, slowest
- Q4_K_M: Best balance (default for most models)
- Q4_K_S: Slightly smaller, minimal quality loss
- Q3_K_S: Low RAM usage, noticeable quality reduction
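Quantization variants are selected through the model tag when pulling. Exact tag names vary per model, so check the tag list on ollama.com/library; the tags below are examples of the common naming pattern:

```bash
# Pull a specific quantization by tag (verify tags on ollama.com/library)
ollama pull llama3:8b-instruct-q8_0    # higher quality, roughly double the RAM
ollama pull llama3:8b-instruct-q4_K_M  # the default balance
```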

2. Allocate GPU Layers

If you have a GPU, Ollama automatically offloads layers. For NVIDIA users, ensure CUDA drivers are up to date:

```bash
nvidia-smi  # Check GPU status and VRAM
```

3. Tune Context Length

The default context window is usually 2048 tokens. For longer conversations, raise num_ctx from inside the interactive session:

```bash
ollama run llama3
>>> /set parameter num_ctx 4096
```

Note: Larger context windows use more RAM.

4. Close Heavy Applications

When running larger models (13B+), close memory-hungry apps like Chrome tabs, Docker containers, or IDEs to free up RAM for the model.

5. Monitor Performance

```bash
# Show which models are loaded and how much memory they use
ollama ps

# View server logs (Linux; on macOS check ~/.ollama/logs/server.log)
journalctl -u ollama -e
```

Real-World Use Cases for Local AI

For Developers

- Explain proprietary code without sending it to the cloud
- Generate unit tests for internal codebases
- Debug errors locally while working on air-gapped networks
- Review code on classified or NDA-protected projects

For Legal and Medical Professionals

- Summarize client documents with complete confidentiality
- Draft correspondence without data leaving the firm's network
- Research case law with zero data exposure
- Analyze medical records in HIPAA-compliant environments

For Students and Researchers

- Process research data without institutional data-sharing concerns
- Generate literature review outlines from local document collections
- Practice coding exercises offline during commutes
- Run experiments with different models without API costs

For Privacy-Conscious Users

- Browse and ask questions about sensitive topics privately
- Translate personal documents without cloud exposure
- Get AI assistance while traveling without reliable internet
- Maintain complete control over your AI interactions

Ollama vs OpenAI API: Quick Comparison

| Aspect | Ollama (Local) | OpenAI API (Cloud) |
|--------|----------------|--------------------|
| Privacy | Complete — data never leaves your machine | Partial — data sent to OpenAI servers |
| Cost | Free after model download | $0.002–$0.06 per 1K tokens |
| Speed | 15–60 tokens/sec (hardware dependent) | 30–80 tokens/sec |
| Quality | Good to excellent (model dependent) | Best-in-class (GPT-4/5) |
| Internet | Not required | Required |
| Model variety | 300+ open-source models | OpenAI models only |
| Setup | One-time install + download | API key + billing |
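To make the cost row concrete, here is the arithmetic at an illustrative $0.01 per 1K tokens (a mid-range figure, not a quoted price) for a user generating 50K tokens a day:

```bash
# 50K tokens/day, 30 days, $0.01 per 1K tokens (illustrative numbers)
python3 -c "print(f'\${50_000 / 1000 * 0.01 * 30:.2f}/month')"
```

Local inference trades that recurring bill for a one-time hardware and download cost.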

Troubleshooting Common Issues

"Ollama is not running"

- macOS: Check your menu bar for the Ollama icon. If absent, launch the Ollama app from Applications
- Windows: Check the system tray. Restart via the Start Menu
- Linux: Run sudo systemctl start ollama

"Model too slow"

- Switch to a smaller model (8B instead of 70B)
- Enable GPU acceleration (check that your GPU is detected)
- Close other applications to free RAM
- Use a more aggressively quantized version (Q4 instead of Q8)

"Out of memory"

- Use a smaller model variant
- Reduce the context length (e.g. /set parameter num_ctx 2048 in the REPL)
- Add swap space on Linux for more virtual memory
- Consider upgrading RAM if you regularly use 13B+ models

"CORS error in browser extension"

- Set the OLLAMA_ORIGINS="*" environment variable
- Restart the Ollama service after changing the variable
- Ensure no firewall is blocking localhost connections
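To debug CORS from the command line, you can replay the extension's cross-origin request yourself. The extension ID below is a placeholder; a correctly configured server answers with an Access-Control-Allow-Origin header:

```bash
# Simulate the extension's cross-origin request (placeholder extension ID)
curl -si http://localhost:11434/api/tags \
  -H "Origin: chrome-extension://abcdefghijklmnop" | grep -i "access-control"
```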

What's Next?

Ollama and the local AI ecosystem are evolving fast:

- Multimodal models like LLaVA let you process images locally
- Function calling enables tool use and agent workflows
- Fine-tuning support lets you customize models on your own data
- Cluster mode allows distributing inference across multiple machines

The gap between local and cloud AI shrinks with every new model release. Today's 8B parameter models rival GPT-3.5 in quality, and the trajectory suggests local models will reach GPT-4 parity within a year.

Get Started in 5 Minutes

1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
2. Pull a model: ollama pull llama3
3. Install Cognito from the Chrome Web Store
4. Set Ollama as provider in Cognito settings
5. Start chatting with local AI on any webpage

No accounts. No API keys. No subscriptions. Just you, your computer, and AI that respects your privacy.

---

Related Reading

- Privacy-First AI: Why It Matters
- Open Source AI Models Guide
- API Keys Explained for AI Tools

Resources

- Ollama Official Site
- Meta AI Llama

Cognito Team · 8 min read · Mar 14, 2026
