Self-Host AI Video Generation: Replace Runway and Sora With Your Own GPU

Runway charges $15/month minimum. Sora is shutting down. Open-source models like Wan 2.2 and LTX-2.3 now generate broadcast-quality video on a single consumer GPU — for free.

A darkened video editing suite with multiple monitors showing color-graded footage

Runway Gen-4.5 costs $0.15-$0.20 per second of generated video. A one-minute clip runs $9-$12. OpenAI confirmed Sora is shutting down in March 2026. Kling, Pika, and the rest charge $10-$30/month for limited credits that vanish fast if you’re doing real production work.

Meanwhile, open-source video models got good — genuinely, practically good. Wan 2.2 generates cinematic video from text prompts on an 8GB GPU. LTX-2.3 produces 4K video with synchronized audio under Apache 2.0. Both run entirely on your hardware, no API keys, no per-second billing, no uploading your prompts to someone else’s servers.

Here’s how to set up a complete local video generation studio this weekend.

What You’re Choosing Between

Three open-source models dominate local video generation right now. Each has a different sweet spot.

Wan 2.2 (Alibaba) — The quality leader. Uses a Mixture-of-Experts architecture with 27B total parameters but only 14B active per step, keeping VRAM reasonable. Two specialized experts handle different denoising stages: one for layout, one for detail. Supports text-to-video, image-to-video, and video editing. The 1.3B light version runs on 8GB VRAM.

LTX-2.3 (Lightricks) — The speed demon. Generates 4K video up to 20 seconds with native audio — sound effects, ambient noise, and dialogue generated in sync with video. A 5-second clip at 480p renders in about 4 seconds on an RTX 4090. Apache 2.0 licensed. The 22B parameter model is available in fp8 quantized form (~20GB download) for consumer GPUs.

HunyuanVideo 1.5 (Tencent) — The face specialist. Produces the most convincing human faces and expressions, making it the go-to for any content involving people. Runs on ComfyUI with as little as 8GB VRAM using offloading, though 24GB is recommended for comfortable generation.

Hardware Reality Check

Let’s be honest about what you need.

| Setup | What You Can Run | Quality |
| --- | --- | --- |
| 8GB VRAM (RTX 3060, RTX 4060) | Wan 2.2 1.3B at 480p, HunyuanVideo with heavy offloading | Usable for social media, drafts |
| 12GB VRAM (RTX 3060 12GB, RTX 4070) | Wan 2.2 14B at 480-720p with quantization | Good for most content |
| 16GB VRAM (RTX 4080, RTX 5080) | LTX-2.3 int8 at 720p with audio, Wan 2.2 at 720p | Broadcast-quality shorts |
| 24GB VRAM (RTX 3090, RTX 4090) | Everything at full quality, 1080p+ | Professional production |

System RAM matters too. Models offload to regular memory when VRAM runs short. 32GB RAM is the practical minimum; 64GB gives you headroom.
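To see where your machine falls in the table above, you can query GPU and system memory directly. Shown for Linux with NVIDIA drivers; `nvidia-smi` also ships with the Windows driver:

```shell
# Report GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

# Report total system RAM (Linux)
free -h | awk '/^Mem:/ {print "System RAM:", $2}'
```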

Option 1: ComfyUI (Most Flexible)

ComfyUI is the Swiss Army knife. It’s a node-based interface that supports all three models, lets you chain workflows, and has the largest community.

Install ComfyUI

Windows: Download the installer from comfy.org. It handles Python, CUDA, and dependencies automatically.

Linux:

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
python main.py

Mac (Apple Silicon):

git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
python main.py --force-fp16
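Whichever platform you're on, a quick sanity check before generating is to confirm PyTorch actually sees a GPU backend. Run this from the same Python environment ComfyUI uses:

```shell
# Prints True for the backend ComfyUI will use
# (CUDA on NVIDIA, MPS on Apple Silicon)
python -c "import torch; print('cuda:', torch.cuda.is_available(), '| mps:', torch.backends.mps.is_available())"
```

If both print False, ComfyUI will fall back to CPU and generation times become impractical.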

Download Wan 2.2 Models

Place models in the correct ComfyUI directories:

# Diffusion model (pick one based on your VRAM)
# 14B model (~28GB, needs 12GB+ VRAM):
ComfyUI/models/diffusion_models/wan2.2_t2v_14B_fp16.safetensors

# 1.3B model (~2.6GB, runs on 8GB VRAM):
ComfyUI/models/diffusion_models/wan2.1_t2v_1.3B_fp16.safetensors

# Text encoder (required, ~8GB):
ComfyUI/models/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors

# VAE (required, ~300MB):
ComfyUI/models/vae/wan_2.1_vae.safetensors

Download from Hugging Face. Use the fp16 versions for best quality, fp8 if you’re short on VRAM.
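If you'd rather script the downloads than click through the browser, the `huggingface_hub` CLI works too. A sketch only — the repo ID and file path below are illustrative assumptions, so confirm which repository actually hosts each file before running:

```shell
# pip install huggingface_hub first
# NOTE: repo ID and file path are illustrative -- verify on huggingface.co
huggingface-cli download Comfy-Org/Wan_2.2_ComfyUI_Repackaged \
  split_files/vae/wan_2.1_vae.safetensors --local-dir downloads
mv downloads/split_files/vae/wan_2.1_vae.safetensors ComfyUI/models/vae/
```

Repeat for the diffusion model and text encoder, moving each file into the directory listed above.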

Generate Your First Video

  1. Open ComfyUI at http://localhost:8188
  2. Go to Workflows → Workflow Templates and select the Wan 2.1 template
  3. In the CLIP Text Encoder node, type your prompt
  4. In the Load VAE node, confirm wan_2.1_vae.safetensors is selected
  5. Hit Ctrl+Enter to generate

A 5-second 480p clip takes about 4 minutes on an RTX 4090 with the 14B model. The 1.3B model is roughly 4x faster at lower quality.
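Once a workflow produces results you like, you don't have to keep clicking through the UI. ComfyUI exposes a small HTTP API that accepts workflows exported via Export (API); a sketch, assuming the server is running and `workflow_api.json` is whatever filename you chose on export:

```shell
# Queue a saved workflow through ComfyUI's HTTP API (server must be running)
# workflow_api.json is a placeholder for your exported API-format workflow
curl -s -X POST http://localhost:8188/prompt \
  -H 'Content-Type: application/json' \
  -d "{\"prompt\": $(cat workflow_api.json)}"
```

This is the basis for batch generation: loop over prompt variations, patch them into the JSON, and queue each one.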

Add LTX-2.3 for Audio+Video

Install the LTX ComfyUI extension:

cd ComfyUI/custom_nodes
git clone https://github.com/Lightricks/ComfyUI-LTXVideo.git

Download the LTX-2.3 checkpoint from Hugging Face. Place it in ComfyUI/models/checkpoints/. The fp8 version is ~20GB.

LTX-2.3’s killer feature is native audio. Your generated videos come with synchronized sound — footsteps match walking, explosions have impact, dialogue syncs with lip movement. No separate audio generation step needed.

Option 2: Wan2GP (Easiest Setup)

If ComfyUI’s node-based interface feels intimidating, Wan2GP is built specifically for making video generation simple. It describes itself as “a fast AI video generator for the GPU poor.”

Install

git clone https://github.com/deepbeepmeep/Wan2GP.git
cd Wan2GP
pip install -r requirements.txt
python app.py

That’s it. Wan2GP downloads models automatically on first run and provides a clean web interface.

What Makes It Different

Wan2GP uses aggressive memory management — it loads and unloads model components between VRAM and system RAM as needed. This means:

  • 6GB VRAM: Run HunyuanVideo at 540p
  • 8GB VRAM: Run Wan 2.2 14B at 480-720p
  • 10GB VRAM: Run LTX-Video at 768p for up to 60 seconds

It also supports Wan 2.2, HunyuanVideo, LTX-Video, and Flux — all from a single interface. No node wiring, no workflow files. Type a prompt, pick a model, click generate.

The trade-off: less flexibility than ComfyUI. No chaining workflows, no custom pipelines, no ControlNet guidance. But for straightforward text-to-video or image-to-video, it’s the fastest path from zero to generated video.

Option 3: LTX Desktop (Most Polished)

LTX Desktop is Lightricks’ standalone app for running LTX-2.3 locally. It’s the most user-friendly option — a native desktop app rather than a web interface.

Download from ltx.io, install, and it handles model downloads (the fp8 variant is ~20GB, full bf16 is ~42GB).

Currently Windows-only for local GPU generation. macOS works through API mode, which defeats the self-hosting purpose. Linux users should stick with ComfyUI.

The Cost Math

Let’s say you generate 50 one-minute videos per month — a moderate workload for a content creator or small marketing team.

Runway Gen-4.5: 50 minutes × $9-$12/min = $450-$600/month

Kling Pro: $30/month plan, but limited to roughly 10-15 minutes of generation. You’d need multiple accounts or the enterprise tier at $100+/month.

Self-hosted (Wan 2.2 + ComfyUI): Electricity costs only. An RTX 4090 draws about 450W under full load. At US average electricity rates ($0.17/kWh), running the GPU for 40 hours of generation per month costs about $3/month.
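That electricity figure is easy to sanity-check yourself with the same numbers — power draw in watts, hours of generation, and your local rate:

```shell
# Power draw (W) x generation hours x electricity rate ($/kWh)
# prints $3.06/month
awk 'BEGIN { watts = 450; hours = 40; rate = 0.17;
             printf "$%.2f/month\n", watts / 1000 * hours * rate }'
```

Swap in your own rate — even at Europe's higher prices (~$0.35/kWh), the monthly cost stays in single digits.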

The upfront GPU cost is real — an RTX 4090 runs about $1,600. But it pays for itself in under 3 months versus Runway, and you own the hardware for other work. An RTX 3060 12GB ($300) gets you started at lower quality.

Privacy Matters

Every cloud video service processes your prompts on their servers. This means:

  • Your creative ideas and scripts pass through third-party infrastructure
  • Generated content may be stored, logged, or used for training
  • Enterprise content involving proprietary products or unreleased designs creates IP exposure
  • Some services retain the right to use generated content

Self-hosted generation keeps everything local. Your prompts never leave your machine. Generated videos stay on your drives. No terms of service governing your output.

For businesses generating product videos, prototype visualizations, or internal training content, this isn’t a convenience — it’s a compliance requirement.

Tips for Better Results

Prompting matters more than model size. Describe camera movement, lighting, and scene composition explicitly. “A woman walking through a sunlit forest, medium tracking shot, golden hour lighting, shallow depth of field” beats “woman in forest” every time.

Start at 480p, refine at higher resolution. Generate quick drafts at low resolution to test prompts. Once you like the result, regenerate at 720p or 1080p. This saves hours of waiting on bad generations.

Use image-to-video for consistency. Generate or source a starting frame, then use I2V mode. This gives you far more control over the look and composition of your output than text-to-video alone.

Negative prompts work. “Blurry, distorted hands, text overlay, watermark, low quality” in the negative prompt noticeably improves output across all three models.

What You Can Do Today

  1. If you have an NVIDIA GPU with 8GB+ VRAM, install Wan2GP. It’s the fastest path to your first locally-generated video.

  2. If you want maximum flexibility and community support, set up ComfyUI with both Wan 2.2 and LTX-2.3. The learning curve pays off.

  3. If you’re buying hardware specifically for this, get an RTX 4090. The 24GB VRAM runs everything at full quality, and it holds its value well.

  4. If you’re on Apple Silicon, ComfyUI works but performance lags behind NVIDIA significantly. Consider a cloud GPU rental for heavy generation while keeping the workflow local.

The gap between cloud and local AI video closed faster than anyone expected. Twelve months ago, local generation was a novelty. Today it’s a production tool. The pricing of cloud services hasn’t caught up with that reality — and that’s your opportunity.