Video Super-Resolution Pipeline

As a Machine Learning Engineer intern at GMI Cloud, I built a video super-resolution pipeline that upscales AI-generated video from 360p to 720p using an ESRGAN (Enhanced Super-Resolution Generative Adversarial Network) model. The pipeline was designed as a post-processing stage for an internal text-to-video generation system. Rather than retraining the generative model at higher resolution (which would require significantly more compute and data), I inserted a dedicated upscaling step that takes the low-resolution output and raises it to deliverable quality.

What makes this more than a "run a model on frames" project: I reimplemented the ESRGAN network architecture from scratch in PyTorch, built an FFmpeg-based frame extraction/reassembly pipeline, wrote an automated PSNR/SSIM evaluation system, and wrapped everything in an orchestration script that runs the complete workflow in a single command.

Avg PSNR (dB): 31.45
Avg SSIM: 0.916
FPS on GPU: 7.83
Resolution Uplift: 360p→720p
Per 81-Frame Clip: ~10s

PyTorch · ESRGAN / RRDBNet · FFmpeg / FFprobe · PIL · PSNR / SSIM · scikit-image · torch.no_grad() · GPU Inference

Network Architecture — Implemented from Scratch

Rather than using a pre-built library, I reimplemented the RRDBNet architecture directly in PyTorch for fine-grained control over the upsampling path. The architecture has three levels of hierarchy:

Residual Dense Block

5 conv layers, DenseNet-style concatenation, LeakyReLU (slope 0.2), residual scale 0.2

↓ × 3 chained

RRDB Block

3 RDBs in sequence + 0.2-scaled residual skip (residual-in-residual)

↓ × 23 blocks

Full RRDBNet

Init conv → 23 RRDB blocks → body conv → 2× upsample → 2× upsample → HR output
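The three levels above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the pipeline's actual code: the class and layer names, the 64/32 channel widths, the nearest-neighbor upsampling, and the global skip around the body are assumptions borrowed from the commonly published ESRGAN layout. Only the block counts, the LeakyReLU slope, and the 0.2 residual scales come from the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualDenseBlock(nn.Module):
    """Five 3x3 convs with dense concatenation and a 0.2-scaled residual."""

    def __init__(self, num_feat: int = 64, growth: int = 32):
        super().__init__()
        # Conv i sees the block input plus the outputs of all earlier convs.
        self.convs = nn.ModuleList([
            nn.Conv2d(num_feat + i * growth, growth if i < 4 else num_feat, 3, 1, 1)
            for i in range(5)
        ])
        self.lrelu = nn.LeakyReLU(negative_slope=0.2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            if i < 4:  # the last conv has no activation
                feats.append(self.lrelu(out))
        return x + 0.2 * out


class RRDB(nn.Module):
    """Three chained RDBs wrapped in a second 0.2-scaled residual skip."""

    def __init__(self, num_feat: int = 64, growth: int = 32):
        super().__init__()
        self.rdbs = nn.Sequential(
            *[ResidualDenseBlock(num_feat, growth) for _ in range(3)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 0.2 * self.rdbs(x)


class RRDBNet(nn.Module):
    """Init conv -> RRDB body -> body conv -> two 2x upsamples -> HR output."""

    def __init__(self, in_ch: int = 3, out_ch: int = 3, num_feat: int = 64,
                 num_block: int = 23, growth: int = 32):
        super().__init__()
        self.conv_first = nn.Conv2d(in_ch, num_feat, 3, 1, 1)
        self.body = nn.Sequential(*[RRDB(num_feat, growth) for _ in range(num_block)])
        self.conv_body = nn.Conv2d(num_feat, num_feat, 3, 1, 1)
        self.conv_up1 = nn.Conv2d(num_feat, num_feat, 3, 1, 1)
        self.conv_up2 = nn.Conv2d(num_feat, num_feat, 3, 1, 1)
        self.conv_hr = nn.Conv2d(num_feat, num_feat, 3, 1, 1)
        self.conv_last = nn.Conv2d(num_feat, out_ch, 3, 1, 1)
        self.lrelu = nn.LeakyReLU(negative_slope=0.2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.conv_first(x)
        feat = feat + self.conv_body(self.body(feat))  # assumed global skip
        # Each stage doubles height and width, giving 4x overall.
        feat = self.lrelu(self.conv_up1(F.interpolate(feat, scale_factor=2, mode="nearest")))
        feat = self.lrelu(self.conv_up2(F.interpolate(feat, scale_factor=2, mode="nearest")))
        return self.conv_last(self.lrelu(self.conv_hr(feat)))
```

With the defaults, a 1×3×360×640 input would produce a 1×3×1440×2560 output; inference would normally be wrapped in `torch.no_grad()`.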

Why reimplement? The native model does 4× upsampling (360p → 1440p), but I only needed 2× (360p → 720p). By implementing it myself, I could run the full 4× path and crop the output tensor to 2× dimensions — which produces sharper results than architecturally limiting the model. The checkpoint loader also handles three weight formats (model_ema, params_ema, raw state dict) for compatibility across official and fine-tuned variants.
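The three-format checkpoint handling can be illustrated with a small helper. This is a sketch under assumptions: the function name `extract_state_dict` and the order in which the wrapper keys are tried are hypothetical; only the key names `model_ema` and `params_ema` come from the text above.

```python
from typing import Any, Mapping

# Wrapper keys named in the text above; the lookup order is an assumption.
WRAPPER_KEYS = ("params_ema", "model_ema")


def extract_state_dict(checkpoint: Mapping[str, Any]) -> Mapping[str, Any]:
    """Return the weight mapping from a loaded checkpoint object.

    Handles a checkpoint wrapped under one of WRAPPER_KEYS, and falls back
    to treating the checkpoint itself as a raw state dict.
    """
    for key in WRAPPER_KEYS:
        if key in checkpoint and isinstance(checkpoint[key], Mapping):
            return checkpoint[key]
    return checkpoint  # already a raw state dict
```

In practice the input would typically come from `torch.load(path, map_location="cpu")` and the result would be passed to `model.load_state_dict(...)`.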

Processing Pipeline

FFprobe metadata

→

FFmpeg → PNG frames

→

PIL load + pad to mod 4

→

ESRGAN inference (no_grad)

→

Float→uint8 conversion

→

FFmpeg reassemble MP4

PNG frames (not JPEG): avoids introducing compression artifacts before the upscaler.
PIL loading over OpenCV: more reliable for edge-case color space handling.
Pad to the nearest multiple of 4: required by the architecture's strided convolutions.
libx264 / yuv420p / crf=23: reassembly at the original framerate, using x264's default quality level (CRF 23).
Auto-cleanup: temporary frame directories are removed after completion, leaving no disk residue on shared cluster storage.
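The FFprobe/FFmpeg steps and the padding rule above could look roughly like this. The function names and the `frame_%06d.png` naming pattern are hypothetical; the FFmpeg options shown (`-framerate`, `-c:v libx264`, `-pix_fmt yuv420p`, `-crf 23`) mirror the settings listed above, and the exact commands used in the pipeline may differ.

```python
from fractions import Fraction
from pathlib import Path

FRAME_PATTERN = "frame_%06d.png"  # hypothetical naming scheme


def probe_cmd(video: Path) -> list[str]:
    """ffprobe call that prints width, height and frame rate as JSON."""
    return ["ffprobe", "-v", "error", "-select_streams", "v:0",
            "-show_entries", "stream=width,height,r_frame_rate",
            "-of", "json", str(video)]


def extract_cmd(video: Path, frame_dir: Path) -> list[str]:
    """ffmpeg call that dumps every frame as a lossless PNG."""
    return ["ffmpeg", "-y", "-i", str(video), str(frame_dir / FRAME_PATTERN)]


def reassemble_cmd(frame_dir: Path, fps: Fraction, out: Path) -> list[str]:
    """ffmpeg call that encodes the upscaled PNGs back into an MP4."""
    return ["ffmpeg", "-y", "-framerate", str(fps),
            "-i", str(frame_dir / FRAME_PATTERN),
            "-c:v", "libx264", "-pix_fmt", "yuv420p", "-crf", "23", str(out)]


def parse_frame_rate(r_frame_rate: str) -> Fraction:
    """Convert ffprobe's rational string (e.g. '30000/1001') to a Fraction."""
    return Fraction(r_frame_rate)


def pad_amounts(height: int, width: int, multiple: int = 4) -> tuple[int, int]:
    """Rows and columns to add so both dimensions are multiples of `multiple`."""
    return (-height) % multiple, (-width) % multiple
```

Each command list would be handed to `subprocess.run(cmd, check=True)`, and creating the frame directory with `tempfile.TemporaryDirectory()` is one way to get the automatic cleanup described above.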

Quality Evaluation System

I built a dedicated evaluation script computing frame-by-frame PSNR and SSIM between the upscaled output and the 720p reference — not on a sample, but on every single frame pair.

PSNR — measured in dB on full RGB and on the Y (luminance) channel per ITU-R BT.601. Y-channel PSNR is more perceptually meaningful; human vision is more sensitive to luminance than chrominance.
SSIM — compares luminance, contrast, and structural patterns. Values above 0.90 = high perceptual fidelity.
Shape mismatch handling — cubic interpolation resizes if reference and output dimensions differ slightly.
CSV output — all frame metrics written for downstream analysis.
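For reference, PSNR is 10·log10(MAX² / MSE), where MAX is 255 for 8-bit frames. A minimal NumPy sketch of the RGB and Y-channel variants follows. It assumes full-range BT.601 luma weights (0.299, 0.587, 0.114); the evaluation script may use a different range convention, and the function and column names here are hypothetical.

```python
import csv

import numpy as np

# Full-range BT.601 luma weights for R, G, B (an assumption, see above).
BT601_WEIGHTS = np.array([0.299, 0.587, 0.114])


def psnr(ref: np.ndarray, out: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    diff = ref.astype(np.float64) - out.astype(np.float64)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def luma_bt601(rgb: np.ndarray) -> np.ndarray:
    """Collapse an (H, W, 3) RGB frame to an (H, W) luminance plane."""
    return rgb.astype(np.float64) @ BT601_WEIGHTS


def frame_metrics(ref_rgb: np.ndarray, out_rgb: np.ndarray) -> dict:
    """PSNR on full RGB and on the Y channel for one frame pair."""
    return {
        "psnr_rgb": psnr(ref_rgb, out_rgb),
        "psnr_y": psnr(luma_bt601(ref_rgb), luma_bt601(out_rgb)),
    }


def write_csv(rows: list, fh) -> None:
    """Write one row per frame to an open text file handle."""
    writer = csv.DictWriter(fh, fieldnames=["frame", "psnr_rgb", "psnr_y"])
    writer.writeheader()
    writer.writerows(rows)
```

SSIM could be added to each row with scikit-image's `skimage.metrics.structural_similarity(ref, out, channel_axis=-1, data_range=255)`.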

Benchmark Results

Measured on 5 AI-generated videos (81 frames each, 1280×720 target resolution):

Video	PSNR (dB)	SSIM	Processing Time	Speed (fps)
Video A	31.06	0.927	10.25s	7.90
Video B	31.19	0.891	10.52s	7.70
Video C	31.73	0.926	10.36s	7.82
Video D	32.25 ★	0.924	10.24s	7.91 ★
Video E	31.04	0.910	10.34s	7.83
Average	31.45	0.916	10.34s	7.83

PSNR above 30 dB on natural-scene video generally indicates good reconstruction, and Video D exceeds 32 dB. All five videos reach SSIM > 0.89, and four of the five clear the 0.90 mark (Video B is the exception at 0.891). At ~7.8 fps, an 81-frame clip of about 5 seconds processes in roughly 10 seconds on a single GPU, which is practical for production post-processing.
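The speed column and the averages can be re-derived from the table itself: speed is simply 81 frames divided by the processing time. The short check below uses only the numbers listed above; the variable and function names are illustrative.

```python
from statistics import mean

FRAMES_PER_CLIP = 81

# (video, PSNR in dB, SSIM, processing time in seconds), copied from the table.
RESULTS = [
    ("Video A", 31.06, 0.927, 10.25),
    ("Video B", 31.19, 0.891, 10.52),
    ("Video C", 31.73, 0.926, 10.36),
    ("Video D", 32.25, 0.924, 10.24),
    ("Video E", 31.04, 0.910, 10.34),
]


def speed_fps(seconds: float, frames: int = FRAMES_PER_CLIP) -> float:
    """Throughput in frames per second for one clip."""
    return frames / seconds


def summarize(results: list) -> dict:
    """Average each metric across clips, rounded as in the table."""
    return {
        "psnr": round(mean(r[1] for r in results), 2),
        "ssim": round(mean(r[2] for r in results), 3),
        "seconds": round(mean(r[3] for r in results), 2),
        "fps": round(mean(speed_fps(r[3]) for r in results), 2),
    }


if __name__ == "__main__":
    for name, _, _, seconds in RESULTS:
        print(f"{name}: {speed_fps(seconds):.2f} fps")
    print(summarize(RESULTS))
```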

Pipeline Orchestration

The full workflow runs as a single command with three independently skippable stages:

Model download — fetches pre-trained ESRGAN weights from a remote URL if not cached locally.
Batch upscaling — processes all videos in the input directory, writing results to an output directory.
Metrics evaluation — computes PSNR/SSIM against the reference directory; generates a summary report identifying best and worst-performing clips.

Each stage is subprocess-isolated, so a failure gives a clear error without corrupting state for others.
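One way to get this kind of isolation is to launch each stage with `subprocess.run` and inspect its exit code. The sketch below is illustrative: the function names and the stop-at-first-failure policy are assumptions, and the stage commands are supplied by the caller rather than taken from the actual script.

```python
import subprocess
import sys


def run_stage(name: str, cmd: list[str]) -> bool:
    """Run one stage in its own process and report failure without raising."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"[{name}] failed with exit code {result.returncode}:\n{result.stderr}",
              file=sys.stderr)
        return False
    print(f"[{name}] ok")
    return True


def run_pipeline(stages: list, skip: frozenset = frozenset()) -> dict:
    """Run (name, command) stages in order.

    Stages named in `skip` are not launched. The run stops at the first
    failure (an assumed policy), so later stages never see partial output.
    """
    status = {}
    for name, cmd in stages:
        if name in skip:
            status[name] = "skipped"
            continue
        ok = run_stage(name, cmd)
        status[name] = "ok" if ok else "failed"
        if not ok:
            break
    return status
```

A caller might pass stages such as `("upscale", [sys.executable, "upscale.py", ...])`, where the script names are placeholders, and map command-line skip flags onto the `skip` set.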

Impact

This pipeline saved the team from retraining the upstream text-to-video model at 720p, which would have required significantly more GPU hours, training data, and iteration time. Instead, we got 720p output (average PSNR 31.45 dB, average SSIM 0.916) from a lightweight post-processing step that takes ~10 seconds per clip on a single GPU and runs on every AI-generated video passing through the internal content pipeline.


Yupeng Tang