Ask HN: What Does Your Self-Hosted LLM Stack Look Like in 2025?

17 points | by anditherobot 1 day ago

7 comments

  • bluejay2387 1 day ago
    2x 3090s running Ollama and vLLM... Ollama for most stuff and vLLM for the few models that I need to test that don't run on Ollama. Open WebUI as my primary interface. I just moved to Devstral for coding using the Continue plugin in VSCode. I use Qwen 3 32B for creative stuff and Flux Dev for images. Gemma 3 27B for most everything else (slightly less smart than Qwen, but it's faster). Mixed Bread for embeddings (though apparently NV-Embed-v2 is better?). Pydantic as my main utility library.

    This is all for personal stuff. My stack at work is completely different and driven more by our Legal team than technical decisions.
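    For reference, the Ollama + Pydantic part of a stack like this takes only a few lines of glue. A minimal sketch, assuming Ollama's default local API on port 11434, a pulled gemma3:27b tag, and a made-up Movie schema (none of which are from the comment above):

        # Ask a local Ollama model for JSON and validate the reply with Pydantic.
        # Model tag, schema, and prompt are illustrative, not the commenter's actual setup.
        import requests
        from pydantic import BaseModel

        class Movie(BaseModel):
            title: str
            year: int

        prompt = 'Reply with JSON like {"title": ..., "year": ...} for the film Blade Runner.'
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "gemma3:27b", "prompt": prompt, "format": "json", "stream": False},
            timeout=120,
        )
        raw = resp.json()["response"]           # Ollama returns the generated text under "response"
        movie = Movie.model_validate_json(raw)  # Pydantic raises if the JSON is malformed or incomplete
        print(movie)

    The official ollama Python package wraps the same endpoint if you'd rather not hand-roll the HTTP call.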
  • fazlerocks 1 day ago
    Running Llama 3.1 70B on 2x 4090s with vLLM. Memory is a pain, but it works decently for most stuff.

    Tbh for coding I just use the smaller ones like CodeQwen 7B. Way faster and good enough for autocomplete. Only fire up the big model when I actually need it to think.

    The annoying part is keeping everything updated; a new model drops every week and half of them don't work with whatever you're already running.
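    On the client side, a vLLM server like that is just an OpenAI-compatible endpoint. A rough sketch, assuming the server was started with vllm serve on the default port 8000 and that the model name matches whatever the server actually loaded:

        # Query a local vLLM server through its OpenAI-compatible API.
        from openai import OpenAI

        client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is unused by default

        reply = client.chat.completions.create(
            model="meta-llama/Llama-3.1-70B-Instruct",  # must match the model the server is serving
            messages=[{"role": "user", "content": "Review this diff for off-by-one errors."}],
            max_tokens=256,
        )
        print(reply.choices[0].message.content)

    Pointing the same client at Ollama's OpenAI-compatible endpoint (port 11434, path /v1) makes it easy to swap between a small autocomplete model and the big one.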

  • runjake 1 day ago
    Ollama + M3 Max 36GB Mac. Usually with Python + SQLite3.

    The models vary depending on the task. A distilled DeepSeek model has been a favorite for the past several months.

    I use various smaller (~3B) models for simpler tasks.
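    The Python + SQLite3 side of a setup like this needs nothing beyond the standard library plus an HTTP client. A minimal sketch, assuming Ollama's default port; the table name and model tag are only examples:

        # Call a local Ollama model and keep a log of prompts/responses in SQLite.
        import sqlite3
        import requests

        db = sqlite3.connect("llm_log.db")
        db.execute("CREATE TABLE IF NOT EXISTS runs (model TEXT, prompt TEXT, response TEXT)")

        def ask(model: str, prompt: str) -> str:
            r = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=120,
            )
            text = r.json()["response"]
            db.execute("INSERT INTO runs VALUES (?, ?, ?)", (model, prompt, text))
            db.commit()
            return text

        print(ask("deepseek-r1:14b", "Give me three edge cases for a URL parser."))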

  • v5v3 15 hours ago
    Ollama on an M1 MacBook Pro, but will be moving to an Nvidia GPU setup.
  • gabriel_dev 1 day ago
    Ollama + Mac mini 24GB (inference)
  • xyc 1 day ago
    recurse.chat + M2 Max Mac