
Goodbye Copilot: Running Llama 3 & Qwen-Coder Locally on an RTX 3090


Data privacy in software development isn't just a preference—for many, it's a legal requirement. While GitHub Copilot is convenient, it requires sending your proprietary logic to a third-party server.

With the massive 24GB VRAM of the NVIDIA RTX 3090, we no longer need to compromise. We can run state-of-the-art models like Llama 3 and the specialized Qwen2.5-Coder (which currently rivals GPT-4 in coding benchmarks) entirely on our own hardware.

Why the RTX 3090?

The RTX 3090 is the "sweet spot" for local LLMs. Its 24GB of GDDR6X VRAM allows you to fit:

  • Llama 3 (8B): Fits with plenty of headroom and generates tokens near-instantly.
  • Qwen2.5-Coder (32B): Using 4-bit or 8-bit quantization, providing a massive upgrade in logic and reasoning over smaller models.
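Some back-of-the-envelope arithmetic shows why this pairing works on 24GB. The sketch below assumes a flat ~20% runtime overhead on top of the raw weight size (the real overhead varies by backend) and ~4.8 bits per weight for a Q4_K_M quant; both figures are approximations, not exact Ollama numbers:

```python
# Rough VRAM needed to hold a model's weights at a given quantization.
# The 20% overhead factor is an assumption covering the CUDA context
# and runtime buffers, not an exact Ollama figure.

def weight_vram_gb(params_billions: float, bits_per_weight: float,
                   overhead: float = 0.2) -> float:
    """Approximate GiB of VRAM for the weights alone."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1024**3

print(f"Llama 3 8B, ~4.8 bpw (Q4): {weight_vram_gb(8, 4.8):.1f} GiB")
print(f"Qwen 32B,   fp16:          {weight_vram_gb(32, 16):.1f} GiB (does not fit)")
print(f"Qwen 32B,   ~4.8 bpw (Q4): {weight_vram_gb(32, 4.8):.1f} GiB (fits in 24)")
```

The 32B model only becomes viable on a 24GB card once quantized; at full fp16 precision it would need roughly three 3090s.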

Step 1: Setting up the Backend with Ollama

Ollama makes managing local models as easy as managing Docker containers. If you haven't installed it yet, head over to ollama.com.

Once installed, open your terminal and pull the three models this setup uses (the 32B for deep reasoning, the 7B for autocomplete, and Llama 3 for fast chat):

ollama pull qwen2.5-coder:32b
ollama pull qwen2.5-coder:7b
ollama pull llama3:8b

Step 2: Integrating with VS Code (OpenCode / Continue)

To get the full "Copilot experience," you need a bridge between your IDE and Ollama. While there are many extensions, OpenCode (for agentic tasks) and Continue (for inline completions) are the current top-tier choices.

1. Install the Extensions

Search the VS Code Marketplace for:

  • Continue: Best for the side-panel chat and "Apply to File" features.
  • OpenCode: Best for "Agent Mode" where the AI can actually run terminal commands and read your whole directory.

2. Configure for your RTX 3090

With 24GB of VRAM, we don't need to settle for the tiny models. We will use Llama 3 (8B) for lightning-fast "Tab-Autocomplete" and Qwen-Coder (32B) for deep architectural questions.

Open your config.json (usually found in ~/.continue/config.json) and paste this optimized configuration:

{
  "models": [
    {
      "title": "Qwen-Coder 32B (RTX 3090)",
      "provider": "ollama",
      "model": "qwen2.5-coder:32b",
      "contextLength": 32768
    },
    {
      "title": "Llama 3 8B",
      "provider": "ollama",
      "model": "llama3:8b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Tab Autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b" 
  },
  "allowAnonymousTelemetry": false
}
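Before restarting VS Code, it can save a debugging round-trip to sanity-check the file. The helper below is a hypothetical sketch (not part of Continue itself): it verifies each model entry has the fields Continue expects and warns if the heavyweight 32B model was accidentally set as the tab-autocomplete model:

```python
# Sanity-check a Continue config.json: every model entry needs a
# title, provider, and model tag, and tab-autocomplete should use a
# small model so completions stay fast. Sketch only, not Continue API.
import json
import pathlib

def check_config(cfg: dict) -> list[str]:
    problems = []
    for entry in cfg.get("models", []):
        for key in ("title", "provider", "model"):
            if key not in entry:
                problems.append(f"model entry missing '{key}': {entry}")
    tab = cfg.get("tabAutocompleteModel", {})
    if "32b" in tab.get("model", ""):
        problems.append("a 32B tab-autocomplete model will feel sluggish")
    return problems

# Point this at your real file:
path = pathlib.Path("~/.continue/config.json").expanduser()
if path.exists():
    for problem in check_config(json.loads(path.read_text())):
        print("!", problem)
```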

3. Why the 3090 + Qwen-Coder is a (Nuanced) Power Play

While many developers try to run smaller 7B or 14B models for speed, the RTX 3090 is arguably the minimum entry point for high-precision coding intelligence.

The Qwen2.5-Coder 32B model is the current "sweet spot" for 24GB VRAM cards, though it pushes the hardware to its absolute limit. Here is the technical reality of this combination:

  • VRAM Near-Saturation: Using a Q4_K_M quantization, the 32B model occupies roughly 19GB to 22GB of VRAM. While this technically fits, it leaves very little overhead for large context windows. You can manage a 32k context, but pushing much further often results in "Out of Memory" errors or severe slowdowns.
  • Functional Performance: On a 3090, you can expect generation speeds of ~35 to 45 tokens per second. This is snappy for chat, but for large-scale code generation, the "thinking" time becomes noticeable compared to cloud-based GPT-4.
  • Reasoning vs. Scale: The 32B version of Qwen-Coder is significantly more capable than the 8B variants at following complex logic. However, it still requires clear, modular instructions; even with 24GB of VRAM, the model can struggle with multi-file architectural changes if the prompt isn't tightly scoped.
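The 32k ceiling follows from simple KV-cache arithmetic. The sketch below assumes Qwen2.5-32B's published architecture (64 transformer layers, 8 KV heads of dimension 128 under grouped-query attention), an 8-bit quantized cache, and the low end of the Q4_K_M weight footprint quoted above; all of these are estimates, and the exact numbers depend on your runtime settings:

```python
# Rough KV-cache arithmetic behind the 32k-context ceiling on 24 GiB.
# Architecture figures assumed for Qwen2.5-32B: 64 layers, 8 KV heads,
# head dim 128 (grouped-query attention); cache quantized to 8-bit.

def kv_cache_gib(context_len: int, layers: int = 64, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 1) -> float:
    """GiB consumed by the K and V caches at a given context length."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K + V
    return per_token * context_len / 1024**3

weights_gib = 19.0  # low end of the Q4_K_M footprint quoted above
for ctx in (8_192, 32_768, 65_536):
    total = weights_gib + kv_cache_gib(ctx)
    print(f"{ctx:>6} tokens: ~{total:.1f} GiB total")
```

At 32k tokens the weights plus cache squeeze in just under 24 GiB; doubling the context adds another ~4 GiB of cache and tips the card into OOM territory, which matches the behavior described above.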

4. Running OpenCode with Ollama

If you want to move beyond simple chat and let the AI actually interact with your local files, you can use OpenCode. Instead of messing with environment variables or manual JSON edits, you can now use a single command to link your local models to the agent.

The Launch Command

In your terminal, simply run:

ollama launch opencode

This command starts a guided setup where you select the model you want to use. With the RTX 3090's 24GB of VRAM, I recommend selecting Qwen3-Coder or GLM-4.7-flash.

Put it to work

Once launched, you can give it a task that targets your local workspace:

  • Prompt: "Explain the tech stack of this project."
  • Prompt: "Check the /src/components folder for any outdated prop types. Refactor them to use the new TypeScript interfaces in types.ts and let me know if any imports are missing."

Because the RTX 3090 has 24GB of VRAM, the agent can hold a significant amount of your code in its context while it works. You’ll see it scan your files, propose the diff, and wait for your approval—all without your data ever hitting a cloud server.

Update: The Reality of Local Agents

Having a local "Copilot" is one thing, but can it actually function as an autonomous senior developer? After putting this setup through a week of heavy production use, I've discovered where the 3090 shines—and where local models still hit a hard "reasoning wall."

Read Part 2 here: The Reality Check: Why Local Agents Often Forget the Plan
