LLM Infrastructure

Self-hosted Large Language Model serving rig

90-Day Token Throughput
Loading...
Prompt + Generation tokens processed

Hardware Specifications

A purpose-built dual-GPU workstation designed for running large language models locally. The system features modded RTX 2080 Ti GPUs with 22 GB VRAM each, providing 44 GB of total GPU memory for loading large FP8-quantized models.

MotherboardASUS X99-A II
CPUIntel Xeon E5-2650 v3 (12-core, 2.3 GHz base / 3.3 GHz turbo)
RAM64 GB DDR4 ECC @ 2133 MHz
Storage512 GB NVMe SSD
Power SupplyNCIX C1200 (1200 W)
CaseAntec C8

GPU Configuration

Dual-GPU setup with modded VRAM limiters removed, boosting each card from 11 GB to 22 GB. This enables loading larger models that would otherwise not fit in standard VRAM.

GPU 0 (Primary) NVIDIA GeForce RTX 2080 Ti — 22 GB VRAM Modded
GPU 1 (Secondary) NVIDIA GeForce RTX 2080 Ti — 22 GB VRAM Modded
Total VRAM 44 GB (across 2 GPUs)

Model & Runtime

Running Qwen3.6-27B in FP8 precision via vLLM for high-throughput inference. The FP8 quantization allows the 27B parameter model to fit within the 44 GB VRAM budget while maintaining strong quality.

ModelQwen/Qwen3.6-27B-FP8 FP8
Parameters27 Billion
Context Window200,000 tokens
Inference EnginevLLM with CUDA backend PagedAttention
OSUbuntu Linux (x86_64)

Infrastructure Architecture

The full stack runs on homelab hardware with production-grade tooling for monitoring and access.

Services

OpenAI-Compatible API

Full OpenAI API compatibility for chat completions, embeddings, and model management. Drop-in replacement for any OpenAI client.

api.tobilab.org:8081

Open WebUI

Feature-rich web chat interface with file upload, multi-model support, and conversation management.

web.tobilab.org

Prometheus Monitoring

Continuous metrics scraping with recording rules for token throughput, latency percentiles, and GPU utilization tracking.

Internal — 15s scrape interval
“Design is a funny word. Some people think design means how it looks. But of course, if you dig deeper, it's really how it works.”