by: D. Chisholm
Running Open-Source LLMs Locally: A Work-in-Progress
Our first attempt at running an LLM locally using quantized GGUF models and Python APIs
## Introduction
This document is a living record of our journey to run open-source Large Language Models (LLMs) locally, using quantized GGUF models and Python APIs. We hope it helps others avoid some of the roadblocks we hit—and encourages experimentation!
## Goals
- Download and run quantized LLMs (Phi-3 Mini, StarCoder2-7B) on local hardware (16GB RAM, 6GB VRAM).
- Expose both models via a simple FastAPI server for easy prompt/response cycles.
- Make the setup reproducible and scriptable.
## Roadblocks & Lessons Learned
1. Model Download Confusion
- Problem: Hugging Face model repos often contain many quantizations. Downloading all of them is slow and unnecessary.
- Solution: Use `huggingface-cli download ... --include <filename>` to fetch only the quantization you want (e.g., Q4_K_M).
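If you prefer to do this from Python, the `huggingface_hub` library can fetch a single file as well. The repo ID and filename below are examples rather than the exact ones we used; a minimal sketch:

```python
# Minimal sketch: download only one GGUF quantization instead of the whole repo.
# The repo_id and filename are examples -- substitute the model you actually want.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",  # example repo
    filename="Phi-3-mini-4k-instruct-q4.gguf",        # example Q4 quantization file
)
print(model_path)  # resolved path inside the local Hugging Face cache
```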
2. Python Package Naming
- Problem: Tried to install `llama_cpp` (doesn’t exist).
- Solution: The correct package is `llama-cpp-python`.
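One wrinkle worth noting: although the package installs as `llama-cpp-python`, it imports as `llama_cpp`. A quick sanity check, assuming the package is already installed:

```python
# Sanity check: the PyPI package is `llama-cpp-python`, but the import name is `llama_cpp`.
from importlib.metadata import version
from llama_cpp import Llama  # fails with ImportError if the wrong package was installed

print(version("llama-cpp-python"))  # installed package version
print(Llama)                        # the main model class
```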
3. Model Path Issues
- Problem: Model loading failed with `ValueError: Model path does not exist`.
- Solution: The `<snapshot_id>` in Hugging Face cache paths must be replaced with the actual folder name (a long hash).
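Rather than copying the hash by hand, you can resolve the snapshot folder programmatically. This sketch assumes the default Hugging Face cache layout (`~/.cache/huggingface/hub/models--<org>--<name>/snapshots/<hash>/`); the repo and file names are examples:

```python
# Sketch: locate a GGUF file in the Hugging Face cache without hard-coding the snapshot hash.
# Assumes the default cache layout; adjust the model names for your setup.
from pathlib import Path

cache_root = Path.home() / ".cache" / "huggingface" / "hub"
model_dir = cache_root / "models--microsoft--Phi-3-mini-4k-instruct-gguf"  # example repo

# The snapshot folder name is a commit hash, so glob for it instead of typing it out.
matches = list(model_dir.glob("snapshots/*/Phi-3-mini-4k-instruct-q4.gguf"))
if not matches:
    raise FileNotFoundError("Model file not found in cache -- download it first.")
model_path = matches[0]
print(model_path)
```

(Note that `hf_hub_download`, shown earlier, already returns this resolved path, so globbing is mainly useful if you downloaded via the CLI.)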
4. Prompt Formatting
- Problem: Short or ambiguous prompts led to odd or verbose answers.
- Solution: Structure prompts to match the model’s expected format (chat-style for Phi-3, code-style for StarCoder2).
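For illustration, here is roughly how we structure the two prompt styles. The `<|user|>`/`<|end|>`/`<|assistant|>` tags reflect our reading of the Phi-3 instruct format and should be double-checked against the model card; the StarCoder2 prompt mirrors the comment-style prompt from our test script.

```python
# Sketch of the two prompt styles; templates are approximate and should be
# verified against each model's documentation.

def phi3_chat_prompt(user_message: str) -> str:
    # Chat-style prompt for an instruct-tuned model (Phi-3 Mini).
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>\n"

def starcoder_code_prompt(question: str) -> str:
    # Code-completion style: pose the question as a comment and let the model continue.
    return f"# {question}\n# Answer:\n"

print(phi3_chat_prompt("What is 2 + 2?"))
print(starcoder_code_prompt("How do I add two numbers in Python?"))
```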
5. FastAPI Request Validation
- Problem: Got `422 Unprocessable Entity` errors from FastAPI.
- Solution: Use `embed=True` in endpoint signatures so FastAPI expects a JSON object like `{"prompt": "..."}`.
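Concretely, the `embed=True` flag is passed to FastAPI's `Body()`. A minimal sketch of one endpoint (the route name and response shape match our test script; the model call itself is stubbed out):

```python
# Sketch: a single-field JSON body in FastAPI.
# With embed=True the endpoint expects {"prompt": "..."} instead of a bare value.
from fastapi import Body, FastAPI

app = FastAPI()

@app.post("/generate/phi3/")
def generate_phi3(prompt: str = Body(..., embed=True)):
    # Model call omitted -- this only demonstrates request validation.
    return {"response": f"(echo) {prompt}"}
```

Without `embed=True`, FastAPI treats a lone `Body()` parameter as the entire request body, so a client sending `{"prompt": "..."}` triggers the 422.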
## Current Working Setup
- Model loading is handled in `model_startup.py`.
- API server is in `api_server.py`, with endpoints for each model.
- Testing is done via `api_test.py`, sending structured prompts to each endpoint.
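For reference, here is a minimal sketch of the model-loading side. File paths, context sizes, and `n_gpu_layers` values are placeholders (we tune them for the 6GB GPU), and this is not a verbatim copy of `model_startup.py`:

```python
# Sketch of model_startup.py: load both quantized GGUF models with llama-cpp-python.
# Paths and tuning values are placeholders -- adjust for your hardware.
from llama_cpp import Llama

PHI3_PATH = "/path/to/phi-3-mini-q4_k_m.gguf"           # placeholder path
STARCODER_PATH = "/path/to/starcoder2-7b-q4_k_m.gguf"   # placeholder path

def load_models():
    phi3 = Llama(
        model_path=PHI3_PATH,
        n_ctx=4096,        # context window
        n_gpu_layers=-1,   # offload as many layers as possible (-1 = all)
    )
    starcoder = Llama(
        model_path=STARCODER_PATH,
        n_ctx=4096,
        n_gpu_layers=20,   # partial offload for the larger model on a 6GB GPU
    )
    return phi3, starcoder
```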
Example test script:
```python
import requests

def query_model(endpoint, prompt, model_name):
    # Send a prompt to one of the local endpoints and print the result.
    response = requests.post(endpoint, json={"prompt": prompt})
    if response.ok:
        print(f"{model_name} Response: {response.json()['response']}")
    else:
        print(f"{model_name} Error: {response.text}")

if __name__ == "__main__":
    phi3_prompt = "User: What is 2+2?\nAssistant:"
    starcoder_prompt = "# How do I add two numbers in Python?\n# Answer:"
    query_model("http://localhost:8000/generate/phi3/", phi3_prompt, "Phi-3 Mini")
    query_model("http://localhost:8000/generate/starcoder/", starcoder_prompt, "StarCoder2")
```
## Observations on Model Responses
One thing we quickly noticed: the raw outputs from these models are not as polished or conversational as what you see in production LLM applications (like ChatGPT or Copilot). For example, when we sent the prompts:
- **"What is 2 + 2?"**
**Phi-3 Mini Response:**
`2 + 2 equals 4.`
- Support: Correct,
- **"How do I add two numbers in Python?"**
**Starcoder Response:**
Using the + operator print(1 + 2) # 3
These answers are technically correct, but they lack the natural, flowing style of a human conversation. This suggests that production-ready LLM apps do a significant amount of **pre-processing and post-processing**—rephrasing prompts, formatting outputs, and sometimes even filtering or re-ranking responses—to make the chat experience feel more natural and helpful.
**Takeaway:**
If you want your local LLM app to feel more like a real assistant, you'll need to invest in prompt engineering and output processing. This might include (a small sketch follows the list):
- Adding context or instructions to prompts.
- Stripping or reformatting model outputs.
- Chaining multiple prompts or using templates for chat history.
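As a starting point, here is a rough sketch of the kind of light pre- and post-processing we have in mind; the instruction text and cleanup rules are illustrative placeholders, not a fixed recipe:

```python
# Sketch: minimal prompt pre-processing and output post-processing.
# The instruction text and cleanup rules are illustrative placeholders.

SYSTEM_INSTRUCTION = "You are a concise, friendly assistant. Answer in complete sentences."

def preprocess(user_message: str) -> str:
    # Prepend context/instructions so the model has more to work with.
    return f"{SYSTEM_INSTRUCTION}\nUser: {user_message}\nAssistant:"

def postprocess(raw_output: str) -> str:
    # Strip whitespace and cut the reply off at the next turn marker, if any.
    reply = raw_output.strip()
    for stop in ("\nUser:", "\nAssistant:"):
        if stop in reply:
            reply = reply.split(stop)[0].rstrip()
    return reply

if __name__ == "__main__":
    print(preprocess("What is 2 + 2?"))
    print(postprocess("  2 + 2 equals 4.\nUser: thanks"))
```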
We'll continue to experiment with these techniques as we refine our setup!