Loading...


Updated 4 Jun 2026 • 7 mins read

Running an LLM on your own machine is now realistic, but the best model depends entirely on your hardware. This guide explains how VRAM and quantization decide what you can run, which benchmarks matter, the best tools to run models locally, recommended models by hardware tier, and how the cost of local compares to the cloud.
Quick answer: The best local LLM for you is the highest-ranked model that fits in your memory while still running fast enough to be useful. As a rule of thumb, a 4-bit quantized model needs roughly half a gigabyte of memory per billion parameters, so an 8B model fits comfortably in about 6 GB of VRAM, a 14B model wants 10 GB or more, and a 70B model needs 40 GB or more. Pick the largest model your hardware can hold, then compare options on benchmarks for the tasks you care about (reasoning, coding, math) and on speed in tokens per second. Tools like Ollama and LM Studio make running them easy.
Running a capable language model on your own laptop or workstation used to be a fantasy. Now it is routine. Open models have caught up fast, and the tooling has gotten simple enough that you can have one running in a couple of minutes. The hard part is no longer whether you can run a local LLM, it is which one is actually best for your machine.
That answer is personal, because it is set by your hardware. The model that flies on a 24 GB GPU will not load on a laptop, and the model that fits a laptop will not match a frontier cloud model on the hardest tasks. The trick is matching model size, quantization, and benchmark performance to what you actually have.
This guide walks through how hardware decides what you can run, which benchmarks matter, the best tools to run models locally, recommended models by hardware tier, and how the cost of going local compares to using a cloud API.
A local LLM is a language model that runs entirely on your own hardware instead of calling a provider's cloud API. You download the model weights once and run inference on your machine, with no data leaving your device and no per-token bill. People run models locally for a few clear reasons:
The trade-off is that you are limited by your own hardware. The largest, most capable open models need serious memory, and even then a local model may trail the best cloud models on the hardest tasks. Picking well is about getting as close to that ceiling as your machine allows.
The single biggest constraint is memory: GPU video memory (VRAM) on a dedicated graphics card, or unified memory on an Apple Silicon Mac. The model has to fit, with room left over for the context window. Three factors decide the fit:
A simple rule of thumb: at 4-bit, budget roughly 0.5 to 0.6 GB of memory per billion parameters, plus extra for context. At 8-bit, closer to 1 GB per billion. Here is how that plays out:
| Model size | Approx VRAM at 4-bit | Approx VRAM at 8-bit | Typical hardware |
|---|---|---|---|
| 1B to 3B | ~1 to 3 GB | ~2 to 4 GB | Most laptops, even without a GPU |
| 7B to 8B | ~5 to 6 GB | ~9 to 10 GB | 8 GB+ GPU, or 16 GB Apple Silicon |
| 13B to 14B | ~9 to 10 GB | ~15 to 16 GB | 12 GB+ GPU, or 24 GB+ Mac |
| 27B to 34B | ~18 to 22 GB | ~32 to 36 GB | 24 GB GPU, or 32 GB+ Mac |
| 70B | ~40 to 48 GB | ~70 to 80 GB | Multi-GPU, or 64 GB+ unified memory |
When you download a model you will run into a few quantization formats and labels. A quick translation:
For most local users, a Q4 or Q5 GGUF build of the largest model that fits is the safe default. You only need to think harder about formats once you move to GPU serving at scale.
Once you know which models fit, ranking them is about matching benchmarks to your real tasks. A model that tops a general leaderboard may not be the best coder. The benchmarks worth knowing:
Quality is only half the ranking. Speed matters just as much for a local model, and it is measured in two ways: tokens per second (how fast it generates) and time to first token (how quickly it starts replying). A slightly weaker model that runs at a comfortable speed often beats a stronger one that crawls. Public leaderboards such as community Arena rankings and open evaluation leaderboards are the place to compare current scores, since the rankings shift with every new model release.
You do not run model weights directly, you run them through an inference tool. The main options, from easiest to most advanced:
For most people, Ollama or LM Studio is the right starting point. Move to llama.cpp or vLLM when you need more control or you are serving a real application.
Here is a practical, hardware-first view. Specific model versions change constantly, so these are model families and sizes to look for, ranked by what your hardware can hold. Check a current leaderboard for the latest version within each size class.
| Tool | What it is | Best for |
|---|---|---|
| Ollama | Simple local LLM runtime with one-command model downloads and an OpenAI-compatible API | Developers who want the easiest way to run models locally |
| LM Studio | Desktop application with a graphical interface for downloading, testing, and serving models | Non-technical users and rapid experimentation |
| Open WebUI | Self-hosted ChatGPT-style interface that connects to Ollama and other backends | Teams and users wanting a polished web interface |
| vLLM | High-performance inference server optimized for throughput and efficient GPU utilization | Production deployments and serving many requests |
| llama.cpp | Lightweight inference engine that runs quantized models on CPUs, GPUs, and Apple Silicon | Maximum hardware compatibility and efficiency |
| Text Generation WebUI | Feature-rich web interface supporting multiple inference backends and extensions | Power users who want extensive customization |
| Jan | Open-source desktop AI assistant focused on privacy and local execution | Users wanting an offline ChatGPT-like experience |
| KoboldCpp | Easy-to-run local inference tool built on llama.cpp with a simple UI | Hobbyists and lightweight local deployments |
Put it together in five steps:
Validate on your work. Benchmarks are a guide, not a verdict. Try your real prompts and pick the model that does your job best.
A few errors come up again and again when people pick their first local model. Avoiding them saves a lot of frustration:
Choosing local is also a cost decision, and it is the one most people get wrong. Local has no per-token fee, but it carries an upfront hardware cost, electricity, your setup time, and a lower ceiling on model quality. Cloud APIs charge per token but give instant access to the largest frontier models with no hardware to buy. For teams, this is a classic FinOps question rather than a purely technical one.
The rough decision rule:
The catch with hybrid is visibility. The moment some AI work runs locally and some runs in the cloud, the cloud spend still needs to be tracked, allocated, and optimized, or it quietly grows. Keeping that spend attributed to the right team and feature is exactly the kind of problem covered in AI vs manual cloud cost optimization
Running models locally cuts your token bill, but most teams still rely on the cloud for the hardest tasks, for scaling, and for the GPUs that train and serve larger models. That cloud and AI spend is where costs hide. opslyft is a FinOps platform built to bring it into view. With opslyft FinOps360, your cloud and AI spend sits in one allocated picture, so the local-versus-cloud decision is made with real numbers rather than guesses.
opslyft helps across the AI cost lifecycle:
Read more on the opslyft blog, or book a 20-minute demo to see how it fits your stack.
The best local LLM is the one your hardware can hold that still ranks well on the tasks you care about. Match size and quantization to your memory, then let benchmarks and your own testing break the tie.
Local cuts your token bill, but most teams stay hybrid, so keep the cloud side allocated and optimized. With AI, visibility is the real cost lever.
There is no single best one, because it depends on your hardware and tasks. The best choice is the highest-ranked model that fits in your memory and still runs at a usable speed. Pick the largest size your VRAM allows, then compare the current top models in that size class on the benchmarks that match your work.
At 4-bit quantization, budget roughly half a gigabyte per billion parameters plus context overhead. That means about 6 GB for an 8B model, 10 GB or more for a 14B model, and 40 GB or more for a 70B model. Smaller 1B to 3B models can run on almost any machine.
Yes. Tools like llama.cpp and Ollama can run smaller quantized models on a CPU, and modern small models are capable. It is slower than a GPU, but for 1B to 8B models on a reasonably modern machine it is perfectly usable.
It can be, for high and steady usage, since there is no per-token fee once you own the hardware. But you pay upfront for hardware, plus electricity and your time, and you give up access to the largest frontier models. Spiky or frontier-quality workloads often stay cheaper in the cloud.
Quantization stores model weights at lower precision to shrink memory use, for example 4-bit instead of 16-bit. A 4-bit model uses far less memory with only a small quality drop, which is why most local users run quantized models. Going too aggressive, below 4-bit, starts to noticeably hurt quality.
How many parameters do I need for good quality? For general use, models in the 7B to 14B range now handle most everyday tasks well, especially the newer ones. Quality keeps climbing up to 70B and beyond, but so do the hardware requirements. The honest answer is to run the largest model your memory allows and test whether it is good enough for your work before chasing a bigger one.