
Homelab GitOps Multi-Cluster Repository

Welcome to your homelab's GitOps source of truth! This repository uses ArgoCD and Bitnami Sealed Secrets to manage two k3s clusters utilizing Longhorn storage:

  • Talyn (Dev)
  • Moya (Prod)

Directory Structure

gitops-homelab/
|── Talyn
├──── bootstrap/                   # Cluster bootstrap (ArgoCD & Sealed Secrets)
│     ├── apps/                   # Bootstraps Dev
├──── infrastructure/              # Infrastructure-level services
│     ├── bases/                   # Traefik, Authentik, PostgreSQL, Redis, Qdrant
│     └── overlays/                # Environment-specific configuration
└──── applications/                # Application-level workloads
      ├── bases/                   # Home Assistant, n8n, Media Stack (Arrs)
      └── overlays/                # Environment-specific overlays (ingress, hostNetwork)

Cluster Node Topology (Talyn - Dev)

Your Dev cluster sits on the 192.168.1.0/24 network, laid out as follows. Enceladus is configured as an untainted control plane, so it also runs workloads as a worker node alongside Callisto and Triton:

| Node Name | Node Type | IP Address | Exposed Roles |
| --- | --- | --- | --- |
| Enceladus | Control Plane / Worker | 192.168.1.33 | Master / API Server / Workload Execution |
| Callisto | Worker | 192.168.1.31 | Workload Execution / Storage |
| Triton | Worker | 192.168.1.32 | Workload Execution / Storage |

Enceladus Hardware Specifications

  • CPU: Intel(R) Core(TM) i5-9500T @ 2.20GHz (6 Cores / 6 Threads, max 3.7GHz, Low-Power 35W TDP)
  • System RAM: 15GiB (16GB)
  • Graphics Accelerator (iGPU): Intel(R) CoffeeLake-S GT2 [UHD Graphics 630] (supports Intel Quick Sync Video (QSV) for low-power media transcoding)
  • Exposed Role: Control Plane / Worker Node (hosting Dev API/scheduling controls and actively executing workloads)

Triton Hardware Specifications

  • CPU: Intel(R) Core(TM) i5-9500T @ 2.20GHz (6 Cores / 6 Threads, max 3.7GHz, Low-Power 35W TDP)
  • System RAM: 7.5GiB (8GB)
  • Graphics Accelerator (iGPU): Intel(R) CoffeeLake-S GT2 [UHD Graphics 630] (supports Intel Quick Sync Video (QSV) for low-power media transcoding)
  • Note: Triton shares the exact same 6-Core processor and iGPU as Enceladus, giving your Dev cluster dual QSV-capable nodes.

Callisto Hardware Specifications

  • CPU: Intel(R) Core(TM) i5-9500T @ 2.20GHz (6 Cores / 6 Threads, max 3.7GHz, Low-Power 35W TDP)
  • System RAM: 7.5GiB (8GB)
  • Graphics Accelerator (iGPU): Intel(R) CoffeeLake-S GT2 [UHD Graphics 630] (supports Intel Quick Sync Video (QSV) for low-power media transcoding)
  • Note: Callisto forms an exact hardware triplet with Triton and Enceladus, creating a 100% homogeneous CPU/GPU node pool for Talyn (Dev).

Cluster Node Topology (Moya - Prod)

Your Prod cluster sits on the 192.168.1.0/24 network with a 5-node setup. Titan is also untainted to maximize hardware utilization, so it runs workloads in addition to hosting the control plane:

| Node Name | Node Type | IP Address | Exposed Roles |
| --- | --- | --- | --- |
| Titan | Control Plane / Worker | 192.168.1.38 | Master / API Server / Workload Execution |
| Ganymede | Worker (GPU Accelerated) | 192.168.1.39 | Workloads (RTX A2000 + Quadro K2200) / Storage |
| Europa | Worker | 192.168.1.40 | Workload Execution / Storage |
| Phobos | Worker | 192.168.1.37 | Workload Execution / Storage |
| Deimos | Worker | 192.168.1.36 | Workload Execution / Storage |

Titan Hardware Specifications

  • CPU: Intel(R) Core(TM) i9-10900 @ 2.80GHz (10 Cores / 20 Threads, max 5.2GHz)
  • System RAM: 31GiB (32GB)
  • Graphics Accelerator (iGPU): Intel(R) CometLake-S GT2 [UHD Graphics 630] (supports Intel Quick Sync Video (QSV) for ultra-efficient video transcoding)
  • Exposed Role: Control Plane / Worker Node (actively executing API, scheduling, database, and workload pods)

Ganymede Hardware Specifications

  • CPU: Intel(R) Core(TM) i7-7700 @ 3.60GHz (4 Cores / 8 Threads, max 4.2GHz)
  • System RAM: 62GiB (64GB)
  • Graphics Accelerators (GPUs):
    1. NVIDIA RTX A2000 (6GB VRAM, PCIe Gen3 x16, Persistence Mode: Off)
    2. NVIDIA Quadro K2200 (4GB VRAM, PCIe Gen3 x16, Persistence Mode: Off)
  • Driver & Tooling: NVIDIA driver 550.163.01 (as reported by nvidia-smi), CUDA 12.4

Deimos Hardware Specifications

  • CPU: Intel(R) Core(TM) i7-7700T @ 2.90GHz (4 Cores / 8 Threads, max 3.8GHz, Low-Power 35W TDP)
  • System RAM: 15GiB (16GB)
  • Graphics Accelerator (iGPU): Intel(R) HD Graphics 630 (supports Intel Quick Sync Video (QSV) for low-power media transcoding)

Phobos Hardware Specifications

  • CPU: Intel(R) Core(TM) i7-7700T @ 2.90GHz (4 Cores / 8 Threads, max 3.8GHz, Low-Power 35W TDP)
  • System RAM: 15GiB (16GB)
  • Graphics Accelerator (iGPU): Intel(R) HD Graphics 630 (supports Intel Quick Sync Video (QSV) for low-power media transcoding)
  • Note: Phobos is a hardware twin of Deimos, making them a perfect pair for high-availability application replicas.

Europa Hardware Specifications

  • CPU: Intel(R) Core(TM) i7-7700T @ 2.90GHz (4 Cores / 8 Threads, max 3.8GHz, Low-Power 35W TDP)
  • System RAM: 15GiB (16GB)
  • Graphics Accelerator (iGPU): Intel(R) HD Graphics 630 (supports Intel Quick Sync Video (QSV) for low-power media transcoding)
  • Note: Europa completes a hardware triplet with Phobos and Deimos, forming an identical three-node pool on Moya.

GPU Workload Acceleration in k3s

Having dual NVIDIA GPUs on Ganymede makes it the perfect host for Plex/Tdarr hardware transcoding or running local LLMs (like Ollama) for your AI workflows (n8n/Qdrant).

To expose these GPUs to your cluster pods:

  1. Ensure the NVIDIA Container Toolkit is installed on the host OS of Ganymede.
  2. In k3s, configure the container runtime to use nvidia-container-runtime by default.
  3. Deploy the NVIDIA Device Plugin for Kubernetes inside the cluster. You can then request GPU capacity directly in your Deployments:
    resources:
      limits:
        nvidia.com/gpu: 1 # Requests one GPU (either RTX A2000 or K2200)
    
  4. Tip: Use a node selector (kubernetes.io/hostname: ganymede; the value is the node name shown by kubectl get nodes, which is lowercase) to pin your media containers to Ganymede, and optionally a node taint to keep other pods off it.
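
The scheduling tip above can be sketched as a small shell helper that renders a test Pod pinned to Ganymede with one GPU requested. This is a minimal sketch rather than a manifest from this repository: the helper name render_gpu_pod, the pod name gpu-smoke-test, the lowercase node name ganymede, and the nvidia/cuda image tag are all assumptions.

```shell
# Sketch only: render a throwaway Pod that requests one GPU and is pinned
# to a single node. render_gpu_pod and gpu-smoke-test are hypothetical names.
render_gpu_pod() {
  node="${1:-ganymede}"   # assumed node name; confirm with `kubectl get nodes`
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    kubernetes.io/hostname: ${node}
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Print the manifest; pipe it to `kubectl apply -f -` to try it on a cluster.
render_gpu_pod ganymede
```

If the runtime and device plugin are set up as described in steps 1 to 3, piping the output to kubectl --context=moya apply -f - and then reading the pod's logs should show the nvidia-smi table.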

Note

Taint Management: In lightweight distributions like k3s, control-plane nodes are untainted by default. If your control-plane nodes ever become tainted (preventing pods from scheduling), you can untaint them manually by running:

# Node names are lowercase; use the name shown by `kubectl get nodes`
kubectl taint nodes enceladus node-role.kubernetes.io/control-plane:NoSchedule- --context=talyn
kubectl taint nodes enceladus node-role.kubernetes.io/master:NoSchedule- --context=talyn

External Homelab Storage Inventory

Your external NAS assets serve crucial data layers, persistent backups, and snapshots:

| Server Type | OS / System | IP Address | Primary Role | Target Integration |
| --- | --- | --- | --- | --- |
| TrueNAS | TrueNAS Scale/Core | 192.168.1.30 | Shared Media / Pictures / Docs | Mounted directly inside media pods via NFS |
| DSM | Synology DSM | 192.168.1.34 | Production Backup Target | Configured as Moya's Longhorn backup target (NFS) |
| OMV | OpenMediaVault | 192.168.1.35 | Dev Storage / Snapshots / Backups | Configured as Talyn's Longhorn backup target (NFS) |

TrueNAS Server Hardware Specifications (Dell T330)

  • Hardware Chassis: Dell PowerEdge T330 Tower Server
  • CPU: Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz (4 Cores / 8 Threads, max 3.9GHz, Enterprise Server Grade)
  • System RAM: 62GiB (64GB ECC - highly optimized for high-speed ZFS caching via ARC)
  • Graphics Accelerator (GPU): NVIDIA Corporation GP106GL [Quadro P2000] (5GB VRAM, Pascal architecture; as a Quadro card it has no driver-imposed cap on concurrent NVENC transcode sessions)
  • ZFS Storage Pool Layout (Massive ~98TB Raw Pool):
    • Data Drives (7 x ~14TB mechanical drives): sda through sdg (each partition is 12.7TiB).
      • Homelab Architecture: Typically running as a RAID-Z2 vdev pool (survives 2 simultaneous drive failures). Five of the seven drives' worth of space holds data, giving ~70TB (about 63.5 TiB) of redundant capacity before ZFS overhead for Media, Pictures, and Documents.
    • System Boot Drive (240GB SSD): sdh (partitions sdh1, sdh2 bootloader, sdh3 TrueNAS OS core).
    • Application / Cache SSD (500GB): sdi (partition sdi1 @ 463.8GiB).
      • Homelab Utility: Serves as the high-speed pool for Plex metadata/index database containers, or as a dedicated ZFS read cache (L2ARC) or write log (SLOG) to accelerate spin-up access on the mechanical pool.
  • ZFS Storage Pools & Dataset Catalog:
    • AppPool (SSD Storage - ~450GB): Located on SSD sdi, currently utilizing 4.73 GiB for high-speed application runtime caches (e.g., Plex SQLite database, indexing metadata).
    • Pool1 (Redundant RAID-Z2 - ~50.7 TiB usable as reported by the pool): Grouping 7 x 14TB mechanical drives (sda - sdg), currently utilizing 6.62 TiB with 44.1 TiB available. It hosts:
      • .truenas_containers (internal TrueNAS containers)
      • Cloud (fully shared via SMB and NFS)
      • Media (mount path: /mnt/Pool1/Media), structured into dedicated datasets:
        • Books
        • Comics
        • CookingTV
        • HomeTV
        • Movies (~3.4 TiB used)
        • Software (~63 GiB used)
        • TV (~3.15 TiB used)
  • Active In-Cluster Mounting: Your Kubernetes workloads bind to this layout at /mnt/Pool1/Media through truenas-media-pvc. With your download directories (e.g., /mnt/Pool1/Media/downloads) and final libraries (e.g., /mnt/Pool1/Media/Movies and /mnt/Pool1/Media/TV) under this single /Media NFS export, your Arrs can use instant atomic moves and hardlinks instead of copies, provided source and destination sit on the same filesystem. Each ZFS dataset is a separate filesystem, so hardlinks cannot span child datasets.
  • Current Primary Workload: Hosting shared datasets (Media, Pictures, Documents) and running Plex Media Server natively as a container app using the Quadro P2000 for hardware-accelerated transcoding.
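
The capacity figures quoted above can be cross-checked with a little arithmetic. The sketch below assumes a single 7-wide RAID-Z2 vdev (the layout described as typical above) and ignores ZFS overhead; pool_estimate is a hypothetical helper name.

```shell
# Sketch: back-of-the-envelope capacity figures for the 7-drive pool.
# Inputs are the numbers quoted above; ZFS overhead is deliberately ignored.
pool_estimate() {
  awk -v drives="$1" -v parity="$2" -v tib_per_drive="$3" 'BEGIN {
    raw    = drives * tib_per_drive             # all drives, no redundancy
    usable = (drives - parity) * tib_per_drive  # RAID-Z2 keeps 2 drives of parity
    printf "raw=%.1f TiB usable=%.1f TiB\n", raw, usable
  }'
}

pool_estimate 7 2 12.7   # prints: raw=88.9 TiB usable=63.5 TiB

# Total reported by the pool itself: used + available.
awk 'BEGIN { printf "reported=%.2f TiB\n", 6.62 + 44.1 }'   # prints: reported=50.72 TiB
```

The pool-reported total can come in below the simple estimate, since ZFS accounting includes reserved space and allocation overhead.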

DNS & Ingress Routing (Cloudflare + Homelab LAN)

Since your domains are hosted on Cloudflare, and these clusters run in your local homelab, you have two elegant options to route traffic securely to Traefik:

Option A: Local LAN DNS (Internal Only)

To keep your dev/prod web traffic fully internal and secure:

  1. Keep your Cloudflare DNS public settings clean (no public records pointing to your private IPs).
  2. On your local LAN DNS server (e.g., your UniFi gateway or Pi-hole), set up a wildcard DNS Rewrite (or local A records):
    • *.davidcrilly.com -> Point to worker node IPs: 192.168.1.31 and 192.168.1.32
  3. Traefik's LoadBalancer Service (served by ServiceLB, formerly Klipper, which is built into k3s) listens on every node IP and routes each request to the correct pods based on the HTTP Host header.
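
Host-based routing can be checked against a single node IP before any DNS rewrite exists. The following is a minimal sketch, assuming curl is available on a LAN machine; check_ingress is a hypothetical helper name, and TLS verification is skipped purely for the test.

```shell
# Sketch: ask Traefik on one node IP for a given hostname without touching DNS.
# check_ingress is a hypothetical helper; -k skips TLS verification for the test.
check_ingress() {
  host="$1"
  node_ip="$2"
  curl -sk -o /dev/null -w '%{http_code}\n' \
    --resolve "${host}:443:${node_ip}" \
    "https://${host}/"
}

# Example (run from a machine on the LAN):
#   check_ingress grafana.davidcrilly.com 192.168.1.31
```

The helper prints the HTTP status code returned for that hostname, which makes it easy to compare the same request across 192.168.1.31 and 192.168.1.32.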

Option B: Cloudflare Tunnel (Cloudflared)

To securely expose select services (like Overseerr/Seerr) to the internet without opening ports on your router:

  1. Deploy a lightweight cloudflared Deployment in your cluster.
  2. The tunnel creates a secure outbound-only connection to Cloudflare's edge.
  3. You map subdomains in the Cloudflare Dashboard directly to the internal service name (e.g. http://overseerr.applications.svc.cluster.local:5055).

Prerequisites

  1. Make sure your local machine has kubectl configured with access to both clusters.
  2. Verify the context names match your configurations:
    kubectl config get-contexts
    # You should see contexts named: 'talyn' and 'moya'
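
The context check above can also be scripted so a bootstrap run fails early. This is a minimal sketch; require_contexts is a hypothetical helper name that reads context names, one per line, on standard input.

```shell
# Sketch: fail early if an expected kubeconfig context is missing.
# Reads context names (one per line) on stdin; require_contexts is a hypothetical name.
require_contexts() {
  have="$(cat)"
  for ctx in "$@"; do
    if ! printf '%s\n' "$have" | grep -qxF "$ctx"; then
      echo "missing context: $ctx" >&2
      return 1
    fi
  done
  echo "all contexts present: $*"
}

# Example:
#   kubectl config get-contexts -o name | require_contexts talyn moya
```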
    

Step-by-Step Bootstrap Guide

Choose the cluster you want to bootstrap first (e.g., Talyn for Dev):

Phase 1: Install ArgoCD & Sealed Secrets Controller

  1. Switch to the desired context:

    kubectl config use-context talyn
    
  2. Deploy the bootstrap layer (ArgoCD and Sealed Secrets):

    kubectl apply --server-side --force-conflicts -k bootstrap/talyn
    

    This creates the argocd and kube-system resources and spins up both controllers.

  3. Verify the pods are running:

    kubectl get pods -n argocd
    kubectl get pods -n kube-system -l app.kubernetes.io/name=sealed-secrets
    

Phase 2: Create Sealed Secrets for Databases & Credentials

Since credentials must not be stored in raw Git manifests, we use Bitnami Sealed Secrets.

1. Download the Public Encryption Key from your cluster:

Each cluster has its own distinct key pair. Fetch the active public key:

kubeseal --controller-name=sealed-secrets --controller-namespace=kube-system --fetch-cert > talyn-sealed-secrets-pub.pem

2. Generate a Secret and Seal it:

Let's create a database secret for PostgreSQL as an example.

Create a local raw secret file (do NOT commit this to Git!):

# Create database credentials
kubectl create secret generic postgres-secret \
  --namespace database \
  --from-literal=postgres-password="SuperSecurePassword123!" \
  --from-literal=password="UserSecurePassword123!" \
  --dry-run=client -o yaml > temp-postgres-secret.yaml

Encrypt (Seal) the secret using kubeseal and the cluster's public key:

kubeseal --format=yaml --cert=talyn-sealed-secrets-pub.pem < temp-postgres-secret.yaml > infrastructure/overlays/talyn/postgres-sealedsecret.yaml

Now you can safely delete temp-postgres-secret.yaml and talyn-sealed-secrets-pub.pem. The resulting postgres-sealedsecret.yaml is fully encrypted and safe to commit to your Git repository!
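
As an alternative to the temporary file, the two commands can be chained so the plaintext manifest is never written to disk. This is a minimal sketch using the same kubectl and kubeseal flags as above; seal_literal is a hypothetical helper name, and reading the password from a PG_PASSWORD environment variable is an assumption.

```shell
# Sketch: create and seal a Secret in a single pipeline so the plaintext
# manifest is never written to disk. seal_literal is a hypothetical helper.
seal_literal() {
  name="$1"; namespace="$2"; literal="$3"; cert="$4"
  kubectl create secret generic "$name" \
    --namespace "$namespace" \
    --from-literal="$literal" \
    --dry-run=client -o yaml |
    kubeseal --format=yaml --cert="$cert"
}

# Example (PG_PASSWORD is an assumed environment variable, used here to keep
# the password out of shell history):
#   seal_literal postgres-secret database "password=${PG_PASSWORD}" \
#     talyn-sealed-secrets-pub.pem > infrastructure/overlays/talyn/postgres-sealedsecret.yaml
```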

3. Update Kustomization:

Include the SealedSecret in your overlay kustomization.yaml:

resources:
  - ../../bases/postgresql
  - postgres-sealedsecret.yaml

4. Seal the Cloudflare API Token for cert-manager:

To enable cert-manager to automatically solve Let's Encrypt DNS-01 challenges, we must create a SealedSecret containing your Cloudflare API Token.

  1. Generate a Cloudflare API Token with Zone:DNS:Edit and Zone:Zone:Read permissions.
  2. Create a local raw secret in the cert-manager namespace (do NOT commit this to Git!):
    kubectl create secret generic cloudflare-api-token-secret \
      --namespace cert-manager \
      --from-literal=api-token="YOUR_CLOUDFLARE_API_TOKEN_HERE" \
      --dry-run=client -o yaml > temp-cloudflare-secret.yaml
    
  3. Seal the secret for the target cluster:
    kubeseal --format=yaml --cert=talyn-sealed-secrets-pub.pem < temp-cloudflare-secret.yaml > infrastructure/overlays/talyn/cloudflare-sealedsecret.yaml
    
    (For Prod, use the Moya context public key and output to the moya overlay directory)
  4. Include the SealedSecret in your overlay kustomization.yaml:
    resources:
      ...
      - cloudflare-sealedsecret.yaml
    

Homelab Storage Configurations & GitOps

1. Mounting TrueNAS NFS Media Share (192.168.1.30)

Inside applications/bases/media-stack/truenas-nfs.yaml, we have defined a PersistentVolume (PV) and PersistentVolumeClaim (PVC) specifically configured for mounting your TrueNAS NFS dataset:

  • PV: truenas-media-pv
  • PVC: truenas-media-pvc

To bind your media pods (Sonarr, Radarr, Overseerr, SABnzbd, qBittorrent) to TrueNAS directly, update your container deployments (e.g., sonarr.yaml, radarr.yaml) to mount the TrueNAS NFS PVC:

      volumes:
        - name: media
          persistentVolumeClaim:
            claimName: truenas-media-pvc

Tip: Standardizing on this single NFS mount for both download locations and library directories will allow atomic moves and hardlinks directly on your TrueNAS pools!
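
Whether hardlinks are possible between two directories can be checked quickly by comparing their device IDs. The sketch below is a necessary check rather than a guarantee; same_filesystem is a hypothetical helper name, stat -c assumes GNU coreutils, and the /media mount path in the example is an assumption.

```shell
# Sketch: hardlinks require both paths to live on the same filesystem,
# which shows up as an identical device ID.
# same_filesystem is a hypothetical helper; `stat -c %d` assumes GNU coreutils.
same_filesystem() {
  [ "$(stat -c %d "$1")" = "$(stat -c %d "$2")" ]
}

# Example, run inside a pod that mounts truenas-media-pvc at /media
# (the /media mount path is an assumption):
#   same_filesystem /media/downloads /media/Movies && echo "hardlinks possible"
```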

2. Automated Longhorn Backup Targets (OMV & DSM)

We have fully automated the configuration of Longhorn backup endpoints. When ArgoCD syncs the infrastructure layer, it applies a Longhorn Setting custom resource that directs all volume backups to your target network shares:

  • Talyn (Dev): Streams snapshots to OMV at nfs://192.168.1.35:/export/dev-backups
  • Moya (Prod): Streams snapshots to DSM at nfs://192.168.1.34:/volume1/prod-backups

To check these in the Longhorn GUI, navigate to Settings > General > Backup Target and you will see the paths already pre-filled.


High-Availability PostgreSQL with CloudNativePG (CNPG)

Rather than running single-replica databases, this repository deploys CloudNativePG (CNPG), a PostgreSQL operator that runs highly available, self-healing database clusters:

  • HA Design: Automatically provisions 3 instances (1 primary, 2 hot-standby replicas) utilizing Longhorn storage.
  • Self-Healing Failovers: If the primary node crashes (e.g., a worker node goes offline), the CNPG operator automatically promotes a healthy standby replica to primary within seconds.
  • Connection Routing: Applications connect using the read-write primary service name, eliminating connection configuration errors during failovers:
    • Hostname: postgresql-rw.database.svc.cluster.local (Write operations)
    • Port: 5432
    • Used by: Authentik and n8n.
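
The service naming behind the hostname above follows a fixed pattern. As a minimal sketch, the helper below prints the Service DNS names the operator creates for a cluster; cnpg_services is a hypothetical helper name, and the -ro (read-only replicas) and -r (any instance) services are standard CloudNativePG services that are not otherwise mentioned here.

```shell
# Sketch: print the Service DNS names CloudNativePG creates for a cluster.
# cnpg_services is a hypothetical helper; only the -rw service is used above.
cnpg_services() {
  cluster="$1"; namespace="$2"
  for suffix in rw ro r; do
    echo "${cluster}-${suffix}.${namespace}.svc.cluster.local:5432"
  done
}

cnpg_services postgresql database
```

Because the -rw name always points at the current primary, applications that use it need no configuration change after a failover.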

To monitor your HA cluster state, install the kubectl cnpg plugin and run:

kubectl cnpg status postgresql -n database --context=moya

Cluster Monitoring & Observability (kube-prometheus-stack)

To keep an eye on your cluster operations, this repository installs kube-prometheus-stack in the monitoring namespace. It bundles the Prometheus Operator, Prometheus Server, node-exporter, kube-state-metrics, and Grafana:

  • Grafana Dashboards: Automatically exposes a web UI secured with cert-manager HTTPS:
    • Talyn (Dev): grafana.davidcrilly.com
    • Moya (Prod): grafana.thecrillys.com
    • Default login credentials: admin / admin (change immediately on first login!).
  • Longhorn Observability: Automatically scrapes metrics from your Longhorn storage controllers. Import Grafana Dashboard ID 16524 or 13070 to view real-time storage bandwidth, read/write I/O operations, replica health, and volume capacity!
  • PostgreSQL HA Observability: Automatically discovers your CloudNativePG database nodes and metrics. Import Grafana Dashboard ID 14114 to monitor write throughput, transaction latency, and replication lag between your primary database node and replicas.
  • Resource Constraints (Homelab Friendly): Configured with local storage constraints (7d retention, 15Gi capacity) and strict limits (2Gi RAM ceiling) to prevent Prometheus from draining your worker node resources.

Private Cloud & Local AI Suite (Nextcloud, Paperless, Ollama, Open WebUI)

This repository deploys a high-end, self-hosted private cloud and local AI processing suite in the applications namespace:

1. Nextcloud Private Cloud

  • High-Performance DB & Cache: Connected to the CloudNativePG HA PostgreSQL primary (postgresql-rw) and the Redis base for session and transient caching.
  • NFS Mount Target: Nextcloud's core data directory binds directly to your TrueNAS Cloud dataset (/mnt/Pool1/Cloud at 192.168.1.30) via truenas-cloud-pvc. This keeps your files safely stored on your redundant ZFS array rather than filling up local node disks.

2. Paperless-ngx & Paperless-ai Document Management

  • Paperless-ngx: Indexes and OCRs your scanned documents using the integrated Tesseract engine. All system configs, databases, and media paths utilize high-availability Longhorn storage.
  • Paperless-ai Integration: A background analyzer that runs side-by-side with Paperless-ngx. It automatically watches your uploaded documents, connects to your local LLM engine (Ollama), and applies automated tags, document categories, and metadata using local artificial intelligence!

3. Ollama & Open WebUI Chat (GPU Accelerated)

  • Ollama LLM Engine: Configured to download and run local LLMs (like Llama3 or Mistral).
    • Dev Overlay (Talyn): Runs on CPU.
    • Prod Overlay (Moya): Patched with a nodeSelector to schedule exclusively on your GPU-accelerated worker node Ganymede, requesting one of its NVIDIA GPUs (RTX A2000 or Quadro K2200) through the nvidia.com/gpu: 1 resource limit. GPU inference is typically much faster than the CPU-only Dev setup.
  • Open WebUI Dashboard: Exposes a beautiful, private ChatGPT-like chat dashboard.
    • Exposed on Dev: https://openwebui.davidcrilly.com
    • Exposed on Prod: https://openwebui.thecrillys.com

Phase 3: Apply the Root Application

ArgoCD uses the "App of Apps" pattern. Applying the root-app.yaml config will cause ArgoCD to recursively scan the repository and manage your apps automatically.

  1. Crucial Step: Open bootstrap/talyn/root-app.yaml (and/or bootstrap/moya/root-app.yaml) and update the repoURL to point to your actual Git repository URL.

  2. Apply the Root Application:

    kubectl apply -f bootstrap/talyn/root-app.yaml --context=talyn
    
  3. Access the ArgoCD UI to watch it sync:

    • Retrieve the auto-generated admin password:
      kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
      
    • Port-forward to access the dashboard:
      kubectl port-forward svc/argocd-server -n argocd 8080:443
      
    • Open https://localhost:8080 in your browser, log in as admin, and watch the clusters bootstrap!