LLM Infrastructure

Hardware Specifications

A purpose-built dual-GPU workstation designed for running large language models locally. The system features modded RTX 2080 Ti GPUs with 22 GB VRAM each, providing 44 GB of total GPU memory for loading large FP8-quantized models.

Motherboard	ASUS X99-A II
CPU	Intel Xeon E5-2650 v3 (12-core, 2.3 GHz base / 3.3 GHz turbo)
RAM	64 GB DDR4 ECC @ 2133 MHz
Storage	512 GB NVMe SSD
Power Supply	NCIX C1200 (1200 W)
Case	Antec C8

GPU Configuration

Dual-GPU setup with modded VRAM limiters removed, boosting each card from 11 GB to 22 GB. This enables loading larger models that would otherwise not fit in standard VRAM.

GPU 0 (Primary)	NVIDIA GeForce RTX 2080 Ti — 22 GB VRAM Modded
GPU 1 (Secondary)	NVIDIA GeForce RTX 2080 Ti — 22 GB VRAM Modded
Total VRAM	44 GB (across 2 GPUs)

Model & Runtime

Running Qwen3.6-27B in FP8 precision via vLLM for high-throughput inference. The FP8 quantization allows the 27B parameter model to fit within the 44 GB VRAM budget while maintaining strong quality.

Model	Qwen/Qwen3.6-27B-FP8 FP8
Parameters	27 Billion
Context Window	200,000 tokens
Inference Engine	vLLM with CUDA backend PagedAttention
OS	Ubuntu Linux (x86_64)

Infrastructure Architecture

The full stack runs on homelab hardware with production-grade tooling for monitoring and access.

▸ vLLM — High-throughput LLM inference engine with PagedAttention, exposing an OpenAI-compatible API at api.tobilab.org:8081
▸ Open WebUI — Web-based chat interface for interacting with the model, deployed at web.tobilab.org
▸ Prometheus — Metrics collection scraping vLLM endpoints every 15 seconds, with custom recording rules for token throughput tracking
▸ vLLM Metrics — Real-time observability of prompt/generation tokens, request latency, KV cache usage, and GPU utilization via /metrics endpoint

Services

OpenAI-Compatible API

Full OpenAI API compatibility for chat completions, embeddings, and model management. Drop-in replacement for any OpenAI client.

api.tobilab.org:8081

Open WebUI

Feature-rich web chat interface with file upload, multi-model support, and conversation management.

web.tobilab.org

Prometheus Monitoring

Continuous metrics scraping with recording rules for token throughput, latency percentiles, and GPU utilization tracking.

Internal — 15s scrape interval