LLM Infrastructure
Self-hosted Large Language Model serving rig
Hardware Specifications
A purpose-built dual-GPU workstation designed for running large language models locally. The system features modded RTX 2080 Ti GPUs with 22 GB VRAM each, providing 44 GB of total GPU memory for loading large FP8-quantized models.
| Motherboard | ASUS X99-A II |
|---|---|
| CPU | Intel Xeon E5-2650 v3 (12-core, 2.3 GHz base / 3.3 GHz turbo) |
| RAM | 64 GB DDR4 ECC @ 2133 MHz |
| Storage | 512 GB NVMe SSD |
| Power Supply | NCIX C1200 (1200 W) |
| Case | Antec C8 |
GPU Configuration
Dual-GPU setup with modded VRAM limiters removed, boosting each card from 11 GB to 22 GB. This enables loading larger models that would otherwise not fit in standard VRAM.
| GPU 0 (Primary) | NVIDIA GeForce RTX 2080 Ti — 22 GB VRAM Modded |
|---|---|
| GPU 1 (Secondary) | NVIDIA GeForce RTX 2080 Ti — 22 GB VRAM Modded |
| Total VRAM | 44 GB (across 2 GPUs) |
Model & Runtime
Running Qwen3.6-27B in FP8 precision via vLLM for high-throughput inference. The FP8 quantization allows the 27B parameter model to fit within the 44 GB VRAM budget while maintaining strong quality.
| Model | Qwen/Qwen3.6-27B-FP8 FP8 |
|---|---|
| Parameters | 27 Billion |
| Context Window | 200,000 tokens |
| Inference Engine | vLLM with CUDA backend PagedAttention |
| OS | Ubuntu Linux (x86_64) |
Infrastructure Architecture
The full stack runs on homelab hardware with production-grade tooling for monitoring and access.
-
▸
vLLM — High-throughput LLM inference engine with PagedAttention, exposing an OpenAI-compatible API at
api.tobilab.org:8081 - ▸ Open WebUI — Web-based chat interface for interacting with the model, deployed at web.tobilab.org
- ▸ Prometheus — Metrics collection scraping vLLM endpoints every 15 seconds, with custom recording rules for token throughput tracking
-
▸
vLLM Metrics — Real-time observability of prompt/generation tokens, request latency, KV cache usage, and GPU utilization via
/metricsendpoint
Services
OpenAI-Compatible API
Full OpenAI API compatibility for chat completions, embeddings, and model management. Drop-in replacement for any OpenAI client.
api.tobilab.org:8081Open WebUI
Feature-rich web chat interface with file upload, multi-model support, and conversation management.
web.tobilab.orgPrometheus Monitoring
Continuous metrics scraping with recording rules for token throughput, latency percentiles, and GPU utilization tracking.
Internal — 15s scrape interval