
Local LLM Setup Guide

Step-by-step guide to running local language models with Ollama and LM Studio — installation, model selection, VS Code integration, Python usage, and when to use local vs. cloud.

Tags: Ollama, LM Studio, Privacy
Last updated February 15, 2026

Why Run Models Locally?

Local models keep your data on your machine. No API calls, no usage logs, no third-party access. This matters when you’re working with student data, unpublished research, or sensitive institutional information.

For the full argument on why local matters — including privacy, institutional compliance, and defence use cases — read the companion blog post: Why I Run Language Models on My Own Machine.

This guide covers the practical how-to: installation, model selection, and connecting local models to your development tools.

Option 1: Ollama

Ollama is the fastest way to get a local model running from the command line. It handles model downloads, quantisation, and serving with a single command.

Install Ollama

  1. Download from ollama.com or install via Homebrew
  2. Verify the installation works
  3. Pull your first model
# macOS (Homebrew)
brew install ollama

# Verify installation
ollama --version

# Pull a model
ollama pull llama3.2

Run a Model

# Start a chat session
ollama run llama3.2

# Or use the API
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.2", "prompt": "Explain p-values in plain English"}'

Useful Ollama Commands

# List downloaded models
ollama list

# Pull a specific model
ollama pull mistral

# Remove a model to free disk space
ollama rm llama3.2

# Show model details (size, parameters, quantisation)
ollama show llama3.2

# Set a system prompt (start an interactive session, then use /set)
ollama run llama3.2
>>> /set system You are a helpful research assistant specialising in psychology.

Option 2: LM Studio

LM Studio provides a graphical interface for downloading and chatting with models. It’s a good choice if you prefer a GUI over the terminal.

  1. Download from lmstudio.ai
  2. Search for a model in the built-in model browser
  3. Download and load the model
  4. Start chatting in the built-in interface

LM Studio also provides a local API server that’s compatible with the OpenAI API format — useful for connecting to other tools and scripts.
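For example, once you start the server from LM Studio's local server panel, any OpenAI-style client can talk to it. Here's a minimal sketch using the OpenAI Python library (pip install openai, the same library used in the Python section below), assuming LM Studio's default port (1234) and a model already loaded in the app; the model name below is a placeholder.

# A minimal sketch of calling LM Studio's local server with the OpenAI client.
# Assumes the server is running on LM Studio's default port (1234) and a model
# is loaded; the model name below is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:1234/v1',
    api_key='lm-studio'  # required by the client, ignored by LM Studio
)

response = client.chat.completions.create(
    model='your-loaded-model',  # replace with the identifier shown in LM Studio
    messages=[{'role': 'user', 'content': 'Summarise the central limit theorem in two sentences.'}]
)
print(response.choices[0].message.content)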

Choosing a Model

The right model depends on your hardware and what you need it for. Here’s a practical guide:

General-Purpose Models

Model             | Size    | RAM Needed | Good For
Llama 3.2 (3B)    | ~2 GB   | 8 GB       | Quick tasks, summarisation, simple Q&A
Gemma 3 (4B)      | ~2.5 GB | 8 GB       | Compact but strong, great quality for size
Phi-4 Mini (3.8B) | ~2.3 GB | 8 GB       | Compact, fast inference, good reasoning
Mistral (7B)      | ~4 GB   | 16 GB      | Solid all-rounder, structured output
Qwen 3 (8B)       | ~5 GB   | 16 GB      | Strong multilingual, good at reasoning
Gemma 3 (12B)     | ~7.5 GB | 16 GB      | Excellent quality, multimodal (vision)
GPT-OSS (20B)     | ~12 GB  | 24 GB      | OpenAI’s open-source model, general-purpose

Coding-Focused Models

Model               | Size    | RAM Needed | Good For
Qwen 2.5 Coder (7B) | ~4.5 GB | 16 GB      | Code generation, refactoring
Qwen 3 Coder (30B)  | ~18 GB  | 48 GB      | State-of-the-art open-source coding

Larger Models (If You Have the RAM)

Model           | Size   | RAM Needed | Good For
Gemma 3 (27B)   | ~16 GB | 32 GB      | Beats much larger models on benchmarks
QwQ (32B)       | ~18 GB | 48 GB      | Strong reasoning, long context
Llama 3.3 (70B) | ~40 GB | 64 GB      | Near-cloud quality, versatile
GPT-OSS (120B)  | ~70 GB | 128 GB     | OpenAI’s large open-source model, frontier-level

Practical advice: Start with Gemma 3 (4B) or Qwen 3 (8B). They handle most everyday tasks well and run on any modern laptop. Move to a larger model only if you find the output quality insufficient for your specific use case.

To pull any of these in Ollama:

ollama pull gemma3
ollama pull qwen3
ollama pull phi4-mini
ollama pull gpt-oss
ollama pull mistral

Hardware Requirements

  • Minimum: 8 GB RAM for small models (3B parameters)
  • Recommended: 16 GB RAM for 7–8B models
  • Ideal: Apple Silicon Mac with 32 GB+ unified memory
  • Power user: 64 GB+ for 70B models or running multiple models
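If you're not sure which tier your machine falls into, a quick check of total RAM is enough to pick a starting point. A minimal sketch, assuming the psutil package is installed (pip install psutil); the thresholds simply mirror the tiers above.

# A minimal sketch that reports total RAM and suggests a starting model size.
# Assumes psutil is installed (pip install psutil); thresholds follow the tiers above.
import psutil

total_gb = psutil.virtual_memory().total / 1e9
if total_gb >= 64:
    suggestion = '70B-class models are realistic'
elif total_gb >= 32:
    suggestion = '12B-27B models should run comfortably'
elif total_gb >= 16:
    suggestion = '7-8B models are the sweet spot'
else:
    suggestion = 'stick to 3-4B models'
print(f'{total_gb:.0f} GB RAM detected: {suggestion}')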

Apple Silicon Notes

Apple Silicon Macs (M1/M2/M3/M4) are exceptionally well-suited for local LLMs because of unified memory — the CPU and GPU share the same memory pool, so the full RAM is available for model inference. A 32 GB M-series Mac handles 8B models with ease and can run larger quantised models too.

ARM-Based Mini PCs

The new generation of ARM-based mini PCs (like those from Ampere or Qualcomm-based systems) with unified memory shared between the CPU and GPU offer similar benefits to Apple Silicon at different price points. These are worth considering if you need a dedicated local inference machine.

Model Formats: GGUF vs. MLX

When you download a model, it comes in a specific format. The two you’ll encounter most for local use are GGUF and MLX.

GGUF (llama.cpp)

GGUF is the universal format for local LLMs. It runs on everything — Mac, Windows, Linux, CPU, GPU. Ollama uses GGUF under the hood.

  • Works on: Any hardware (CPU or GPU)
  • Best for: Most users, cross-platform compatibility, Ollama
  • Trade-off: Good performance everywhere, but not optimised for any one chip

MLX (Apple Silicon only)

MLX is Apple’s machine learning framework, optimised specifically for M-series chips. MLX models squeeze more speed out of Apple Silicon’s unified memory architecture.

  • Works on: Apple Silicon Macs only (M1/M2/M3/M4)
  • Best for: Mac users who want maximum inference speed
  • Trade-off: Faster on Apple Silicon, but Mac-only — not portable

If you’re on a Mac: Try MLX models via LM Studio (which supports both formats) for the best speed. Ollama also supports MLX for some models.

If you’re on anything else (or want simplicity): Stick with GGUF via Ollama. It just works.

Quantisation Levels

Models are compressed (“quantised”) to fit in less RAM. Common quantisation levels you’ll see:

Quantisation | Size vs. Full | Quality                | When to Use
Q4_K_M       | ~25–30%       | Good for most tasks    | Default choice — best balance of size and quality
Q5_K_M       | ~35%          | Slightly better        | When you have spare RAM and want a bit more quality
Q6_K         | ~45%          | Near-original          | When quality matters most and RAM isn’t tight
Q8_0         | ~50%          | Excellent              | Maximum quality quantised model
Q3_K_S       | ~20%          | Noticeable degradation | Only when RAM is very limited

Practical advice: Ollama picks a sensible default (usually Q4_K_M) when you pull a model. Unless you have a specific reason to change it, the default is fine.
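You can also check which quantisation you ended up with from Python, not just with ollama show. A minimal sketch using the ollama package (pip install ollama, introduced properly in the Python section below); the exact response fields can vary between library versions, so treat the key names as an assumption.

# A minimal sketch for inspecting a pulled model's details from Python.
# Assumes llama3.2 has been pulled; the response fields may differ slightly
# between versions of the ollama library, so check what your version returns.
import ollama

info = ollama.show('llama3.2')
print(info['details'])  # typically includes format, parameter size, and quantisation level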

Connecting Local Models to VS Code

Running a local model in the terminal is useful, but connecting it to your code editor makes it part of your daily workflow.

Continue Extension

Continue is an open-source AI coding assistant for VS Code that works with local models.

  1. Install the Continue extension from the VS Code marketplace
  2. Open Continue settings (click the gear icon in the Continue panel)
  3. Add Ollama as a provider:
{
  "models": [
    {
      "title": "Ollama - Llama 3.1",
      "provider": "ollama",
      "model": "llama3.1"
    }
  ]
}
  4. Make sure Ollama is running (ollama serve in a terminal)
  5. Start chatting or using inline completions

Continue gives you chat, inline editing, and code explanations — all powered by your local model, with no data leaving your machine.

GitHub Copilot with Local Models

GitHub Copilot doesn’t natively support local models, but you can use Ollama’s OpenAI-compatible API with tools that accept custom API endpoints. If you’re already using Copilot for cloud-based assistance, Continue is the best complement for local model access.

Other Options

  • Cody (Sourcegraph) — supports Ollama as a backend
  • Cline — VS Code extension that works with local Ollama models
  • Aider — terminal-based coding assistant with Ollama support

Using Local Models from Python

If you’re building tools, research pipelines, or just want to script your interactions, connecting to local models from Python is straightforward.

Using the Ollama Python Library

pip install ollama
import ollama

# Simple generation
response = ollama.generate(
    model='llama3.2',
    prompt='Summarise the key assumptions of linear regression in 3 bullet points.'
)
print(response['response'])
# Chat with message history
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a research methods expert.'},
        {'role': 'user', 'content': 'When should I use a mixed-effects model instead of a repeated-measures ANOVA?'}
    ]
)
print(response['message']['content'])

Using the OpenAI-Compatible API

Ollama serves an OpenAI-compatible API on localhost:11434. This means you can use the OpenAI Python library with local models — useful if you have existing code that uses the OpenAI API and want to switch to local.

pip install openai
from openai import OpenAI

# Point the client at Ollama's local server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required but not used
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Write a Python function to compute Cohen\'s d.'}
    ]
)
print(response.choices[0].message.content)

This approach lets you develop locally and swap in a cloud model for production by changing the base_url and api_key — no other code changes needed.
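One way to make that switch explicit is to read the endpoint, key, and model name from environment variables, so the same script runs against Ollama locally and a hosted endpoint in production. A minimal sketch; the variable names are just a convention I'm assuming here, not anything either API requires.

# A minimal sketch of switching between local and cloud with environment variables.
# The variable names (LLM_BASE_URL, LLM_API_KEY, LLM_MODEL) are an arbitrary
# convention, not required by Ollama or OpenAI; defaults point at local Ollama.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ.get('LLM_BASE_URL', 'http://localhost:11434/v1'),
    api_key=os.environ.get('LLM_API_KEY', 'ollama'),
)
model = os.environ.get('LLM_MODEL', 'llama3.2')

response = client.chat.completions.create(
    model=model,
    messages=[{'role': 'user', 'content': 'Give a one-sentence definition of a confidence interval.'}]
)
print(response.choices[0].message.content)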

Batch Processing Research Papers

Here’s a practical example — processing multiple paper abstracts:

import ollama
import json

abstracts = [
    "Abstract of paper 1...",
    "Abstract of paper 2...",
    "Abstract of paper 3...",
]

results = []
for i, abstract in enumerate(abstracts):
    response = ollama.generate(
        model='llama3.2',
        prompt=f"""Analyse this research abstract and extract:
- Research question
- Methodology (1 sentence)
- Key finding (1 sentence)
- Sample size

Abstract: {abstract}

Respond in JSON format."""
    )
    results.append({
        'paper': i + 1,
        'analysis': response['response']
    })

# Save results
with open('paper_analysis.json', 'w') as f:
    json.dump(results, f, indent=2)
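If you want the analysis to be machine-readable rather than free text, Ollama's generate call accepts a format parameter that constrains the output to valid JSON, which you can then parse directly. A short variation on the example above; which keys the model chooses still depends on the prompt, so treat the parsed result as untrusted input.

# A variation on the example above: ask for valid JSON (format='json') and parse it.
# The keys in the parsed dict depend on how the model interprets the prompt.
import json
import ollama

abstract = "Abstract of paper 1..."
response = ollama.generate(
    model='llama3.2',
    prompt=f"""Analyse this research abstract and extract the research question,
methodology (1 sentence), key finding (1 sentence), and sample size.

Abstract: {abstract}

Respond in JSON format.""",
    format='json'
)
analysis = json.loads(response['response'])
print(analysis)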

When to Use Local vs. Cloud

Not every task needs a local model, and not every task needs a cloud model. Here’s a practical decision framework:

Use Local When:

  • Privacy matters — student data, unpublished manuscripts, ethics-restricted information, peer review content
  • You’re iterating rapidly — testing prompts, debugging pipelines, running batch jobs where API costs add up
  • You’re offline or on restricted networks — travel, secure environments, unreliable internet
  • Cost is a concern — local models have zero marginal cost per request
  • You’re teaching — students can experiment freely without usage limits or account requirements

Use Cloud When:

  • You need maximum capability — complex reasoning, long-form writing, nuanced analysis that exceeds what 8B models can do
  • The task requires a large context window — processing very long documents where local models run out of context
  • Speed matters more than privacy — cloud models on dedicated hardware are typically faster than local inference
  • You need specific features — web search, image generation, tool use, or other capabilities not available locally

The Hybrid Approach

The most practical setup is both:

  1. Ollama running locally for everyday tasks, privacy-sensitive work, and development
  2. A cloud model (ChatGPT, Claude, Gemini) for tasks that need frontier capability

Develop locally, deploy to cloud when needed. Use local models as your default and escalate to cloud when the task demands it. This gives you the best of both worlds: privacy and zero cost for most tasks, maximum capability when you need it.
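If you script both paths, the escalation can live in one small helper: default to the local Ollama server and only build a cloud client when a task is flagged as needing frontier capability. A minimal sketch, assuming an OPENAI_API_KEY is set in the environment for the cloud path; the cloud model name is a placeholder.

# A minimal sketch of the hybrid approach: local Ollama by default, cloud on demand.
# Assumes OPENAI_API_KEY is set in the environment for the cloud path; the cloud
# model name below is a placeholder, swap in whichever hosted model you use.
from openai import OpenAI

def get_client_and_model(needs_frontier=False):
    if needs_frontier:
        return OpenAI(), 'gpt-4o'  # placeholder cloud model
    return OpenAI(base_url='http://localhost:11434/v1', api_key='ollama'), 'llama3.2'

client, model = get_client_and_model(needs_frontier=False)
response = client.chat.completions.create(
    model=model,
    messages=[{'role': 'user', 'content': 'Suggest a title for a methods workshop on local LLMs.'}]
)
print(response.choices[0].message.content)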

Quick Start Checklist

  1. Install Ollama: brew install ollama (macOS) or download from ollama.com
  2. Pull a model: ollama pull llama3.2
  3. Test it: ollama run llama3.2
  4. Install Continue in VS Code for editor integration
  5. Install the Python library: pip install ollama
  6. Try the OpenAI-compatible API for existing scripts

You can have a working local LLM setup in under 10 minutes. Start with a chat session, see how it fits your workflow, and build from there.