Ollama 使用教學:在自己電腦上跑本地 AI 模型入門

Ollama 把跑本地大型語言模型簡化成幾行指令,資料全留本機。這篇教你安裝、下載模型、用量化省記憶體、接 API。

Running a Cloud-Free AI Model on Your Laptop

Ollama solves a very practical problem - "I want to use a large language model, but I don't want to upload my data to someone else's cloud." It simplifies the process of running a local LLM, which was originally very cumbersome, into just a few commands: install, ollama pull to download the model, and ollama run to start the conversation. Under the hood, it uses llama.cpp, with models in GGUF format, and currently supports Mac's Metal, NVIDIA's CUDA, and Vulkan, all using the same file format. I've tried running it on my MacBook, and it's much easier than the process two years ago, which required compiling a lot of things.

What Ollama Can Do

  • One-line installation, automatic handling of model download, GPU detection, and API service
  • ollama pull and ollama run allow you to download and start a conversation, with a focus on CLI
  • Model library with over 100 varieties (Llama, Qwen, Gemma, DeepSeek, etc.)
  • Built-in REST API, compatible with OpenAI format, making it easy to integrate into your own programs
  • Use Modelfile to customize model parameters (like a Dockerfile for LLM)
  • Supports Docker, with all data stored locally and not leaked

How to Get Started (Steps)

  1. Download and install Ollama from the official website (available for macOS, Windows, and Linux)
  2. Open a terminal and input ollama pull qwen3:8b to download a model (8B size is suitable for most laptops)
  3. Input ollama run qwen3:8b, and after loading, you can start a conversation in the terminal
  4. To integrate it into a program, open the built-in API (default http://localhost:11434) and use the OpenAI-compatible format to call it
  5. To customize, write a Modelfile to set system prompts, temperature, and other parameters, then ollama create

Advanced Tips

  • If memory is insufficient, choose the quantized version: Q4_K_M (4-bit) saves about 75% of memory, and the 7B model can be reduced from ~16GB to ~4GB with minimal quality loss
  • Model size should match hardware: 8GB memory can run 3B-8B, while 16GB or more is needed to run 13B or larger models smoothly
  • GGUF files downloaded from Hugging Face can be imported using Modelfile, not limited to the official library
  • For a graphical interface, you can use LM Studio or a front-end UI, which is more user-friendly for those who are not comfortable with CLI
  • Connecting Ollama's OpenAI-compatible endpoint to your existing program requires almost no modification to the SDK

Things to Note

  • Local models usually can't match the capabilities of top-tier cloud models like GPT or Claude, so don't expect identical performance
  • It consumes hardware resources: without a dedicated graphics card or sufficient memory, large models may run slowly or not at all
  • The first model download can be several GB, so be mindful of network and disk space
  • Although data is not leaked, local security (who can access the machine, and whether the API is exposed) still needs to be taken care of

TheAI Academy Summary and Evaluation

Honestly, Ollama is a tool that I would directly recommend to two types of people: those who are concerned about privacy and those who want to learn how LLMs work. For scenarios where data cannot be transmitted externally, such as law, medicine, and research, local models are one of the few solutions. For developers, it reduces the experimental cost to almost zero - you can switch models with pull, and use the OpenAI-compatible endpoint to integrate it into your program with minimal modification. Its ceiling is your hardware and the capabilities of open-source models, which have made significant progress in the past two years. The 4-bit quantization also makes it possible to run on ordinary laptops. If you just want a convenient conversation, cloud-based ChatGPT or Claude may be more convenient. However, as soon as the "data cannot be leaked" requirement appears, Ollama is worth spending an afternoon to try. Further reading: AI Privacy and Security Practical Guide, How to Use LM Studio.

One-sentence evaluation: Ollama turns running a local LLM into just a few commands, most suitable for those who care about privacy and want to learn about LLMs; its capabilities are limited by your hardware and open-source models, but 4-bit quantization makes it possible to run on ordinary laptops.

Data Source

Compiled from official announcements and public data, with official information taken as accurate.

Frequently Asked Questions

Ollama 是什麼?

一個讓你在自己電腦上跑本地大型語言模型的工具,底層為 llama.cpp、模型用 GGUF 格式,一行指令安裝後用 ollama pull/run 即可下載與對話。

Ollama 怎麼用?

安裝後輸入 ollama pull 下載模型(如 qwen3:8b),再 ollama run 開始對話;要接程式可用內建的 OpenAI 相容 REST API(localhost:11434)。

跑 Ollama 需要什麼硬體?

看模型大小:8GB 記憶體可跑 3B~8B,16GB 以上較適合 13B 以上;用 Q4_K_M 4-bit 量化可省約 75% 記憶體,7B 模型約只要 4GB。

Ollama 和 ChatGPT 差在哪?

Ollama 在本機執行、資料不外流、可離線且免訂閱,但能力受硬體與開源模型限制;ChatGPT 是雲端頂規模型、更強更省事,但資料會上雲。

繁體中文版 →