Video Super-Resolution Pipeline

As a Machine Learning Engineer intern at GMI Cloud, I built a video super-resolution pipeline that upscales AI-generated video from 360p to 720p using an ESRGAN (Enhanced Super-Resolution Generative Adversarial Network) model. The pipeline was designed as a post-processing stage for an internal text-to-video generation system. Rather than retraining the generative model at higher resolution (which would require significantly more compute and data), I inserted a dedicated upscaling step that takes the low-resolution output and raises it to deliverable quality.

What makes this more than a "run a model on frames" project: I reimplemented the ESRGAN network architecture from scratch in PyTorch, built an FFmpeg-based frame extraction/reassembly pipeline, wrote an automated PSNR/SSIM evaluation system, and wrapped everything in an orchestration script that runs the complete workflow in a single command.

  • Avg PSNR: 31.45 dB
  • Avg SSIM: 0.916
  • Throughput: 7.83 FPS on GPU
  • Resolution uplift: 360p → 720p
  • Processing time: ~10s per 81-frame clip

Tech stack: PyTorch, ESRGAN / RRDBNet, FFmpeg / FFprobe, PIL, PSNR / SSIM, scikit-image, torch.no_grad(), GPU inference

Network Architecture — Implemented from Scratch

Rather than using a pre-built library, I reimplemented the RRDBNet architecture directly in PyTorch for fine-grained control over the upsampling path. The architecture has three levels of hierarchy:

  1. Residual Dense Block (RDB): 5 conv layers, DenseNet-style concatenation, LeakyReLU (slope 0.2), residual scale 0.2.
  2. RRDB Block: 3 RDBs chained in sequence + 0.2-scaled residual skip (residual-in-residual).
  3. Full RRDBNet: Init conv → 23 RRDB blocks → body conv → 2× upsample → 2× upsample → HR output.
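
The three levels can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the original code: the channel widths (64 feature channels, 32 growth channels), the nearest-neighbour upsampling, and every class and attribute name are assumptions based on the common open-source ESRGAN layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualDenseBlock(nn.Module):
    """5 convs with dense concatenation and a 0.2-scaled residual."""

    def __init__(self, nf: int = 64, gc: int = 32):
        super().__init__()
        # Conv i sees the block input plus the outputs of all earlier convs.
        self.convs = nn.ModuleList(
            [nn.Conv2d(nf + i * gc, gc if i < 4 else nf, 3, padding=1) for i in range(5)]
        )
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            if i < 4:  # the last conv has no activation
                feats.append(self.act(out))
        return x + 0.2 * out


class RRDB(nn.Module):
    """3 RDBs in sequence wrapped in another 0.2-scaled residual skip."""

    def __init__(self, nf: int = 64, gc: int = 32):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualDenseBlock(nf, gc) for _ in range(3)])

    def forward(self, x):
        return x + 0.2 * self.blocks(x)


class RRDBNet(nn.Module):
    """Init conv -> RRDB trunk -> body conv -> 2x upsample -> 2x upsample -> HR output."""

    def __init__(self, in_ch=3, out_ch=3, nf=64, gc=32, num_blocks=23):
        super().__init__()
        self.conv_first = nn.Conv2d(in_ch, nf, 3, padding=1)
        self.body = nn.Sequential(*[RRDB(nf, gc) for _ in range(num_blocks)])
        self.conv_body = nn.Conv2d(nf, nf, 3, padding=1)
        self.conv_up1 = nn.Conv2d(nf, nf, 3, padding=1)
        self.conv_up2 = nn.Conv2d(nf, nf, 3, padding=1)
        self.conv_hr = nn.Conv2d(nf, nf, 3, padding=1)
        self.conv_last = nn.Conv2d(nf, out_ch, 3, padding=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        feat = self.conv_first(x)
        feat = feat + self.conv_body(self.body(feat))
        # Two nearest-neighbour 2x steps give 4x overall (an assumed upsampling mode).
        feat = self.act(self.conv_up1(F.interpolate(feat, scale_factor=2, mode="nearest")))
        feat = self.act(self.conv_up2(F.interpolate(feat, scale_factor=2, mode="nearest")))
        return self.conv_last(self.act(self.conv_hr(feat)))
```

For inference, the network would typically be put in eval() mode and called inside a torch.no_grad() block so that no autograd graph is kept per frame.
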
Why reimplement? The native model does 4× upsampling (360p → 1440p), but I only needed 2× (360p → 720p). By implementing it myself, I could run the full 4× path and crop the output tensor to 2× dimensions — which produces sharper results than architecturally limiting the model. The checkpoint loader also handles three weight formats (model_ema, params_ema, raw state dict) for compatibility across official and fine-tuned variants.
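
The three-format checkpoint handling could look something like the sketch below. The function name and the order in which the keys are tried are assumptions; only the key names model_ema and params_ema come from the description above.

```python
from collections.abc import Mapping
from typing import Any


def extract_state_dict(checkpoint: Mapping[str, Any]) -> Mapping[str, Any]:
    """Return the weight mapping from a loaded checkpoint (hypothetical helper).

    Handles checkpoints that nest the weights under 'params_ema' or
    'model_ema', and falls back to treating the checkpoint itself as a
    raw state dict.
    """
    for key in ("params_ema", "model_ema"):  # preference order is an assumption
        nested = checkpoint.get(key)
        if isinstance(nested, Mapping):
            return nested
    return checkpoint


# Typical use (not executed here):
#   state = extract_state_dict(torch.load(path, map_location="cpu"))
#   model.load_state_dict(state, strict=True)
```
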

Processing Pipeline

  1. FFprobe metadata
  2. FFmpeg → PNG frames
  3. PIL load + pad to mod 4
  4. ESRGAN inference (no_grad)
  5. Float→uint8 conversion
  6. FFmpeg reassemble MP4

  • PNG frames (not JPEG) — avoids introducing compression artifacts before the upscaler.
  • PIL loading over OpenCV — more reliable for edge-case color space handling.
  • Pad to nearest mod 4 — required by the architecture's strided convolutions.
  • libx264 / yuv420p / crf=23: widely compatible H.264 output at the libx264 default quality setting, reassembled at the original framerate.
  • Auto-cleanup — temporary frame directories cleaned after completion; no disk residue on shared cluster storage.
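
The non-model steps above can be sketched as a few small helpers. Everything here is illustrative: the function names, the frame_%06d.png naming pattern, the reflect padding mode, and the assumption that model output lies in [0, 1] are mine, not taken from the original code. The FFmpeg commands are only built as argument lists, not executed.

```python
from fractions import Fraction

import numpy as np


def parse_frame_rate(r_frame_rate: str) -> Fraction:
    """Turn an FFprobe rate string such as '30000/1001' into an exact fraction."""
    return Fraction(r_frame_rate.strip())


def extract_frames_cmd(video: str, frame_dir: str) -> list[str]:
    """FFmpeg command that dumps every frame as a lossless PNG."""
    return ["ffmpeg", "-y", "-i", video, f"{frame_dir}/frame_%06d.png"]


def reassemble_cmd(frame_dir: str, fps: Fraction, output: str) -> list[str]:
    """FFmpeg command that re-encodes PNG frames with libx264 / yuv420p / crf=23."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", f"{frame_dir}/frame_%06d.png",
        "-c:v", "libx264", "-pix_fmt", "yuv420p", "-crf", "23",
        output,
    ]


def pad_to_multiple(img: np.ndarray, multiple: int = 4):
    """Pad an HxWxC image on the bottom/right so H and W are multiples of `multiple`.

    Returns the padded image and the original (H, W), so the upscaled
    result can later be trimmed back to (H * scale, W * scale).
    """
    h, w = img.shape[:2]
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    padded = np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)), mode="reflect")
    return padded, (h, w)


def float_to_uint8(img: np.ndarray) -> np.ndarray:
    """Clamp a float image assumed to be in [0, 1] and convert it to 8-bit."""
    return np.round(np.clip(img, 0.0, 1.0) * 255.0).astype(np.uint8)
```

In a real run, each command list would be passed to subprocess.run(..., check=True), and the frame directory would be created with tempfile so that cleanup happens even when a stage fails.
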

Quality Evaluation System

I built a dedicated evaluation script computing frame-by-frame PSNR and SSIM between the upscaled output and the 720p reference — not on a sample, but on every single frame pair.

  • PSNR — measured in dB on full RGB and on the Y (luminance) channel per ITU-R BT.601. Y-channel PSNR is more perceptually meaningful; human vision is more sensitive to luminance than chrominance.
  • SSIM — compares luminance, contrast, and structural patterns. Values above 0.90 = high perceptual fidelity.
  • Shape mismatch handling — cubic interpolation resizes if reference and output dimensions differ slightly.
  • CSV output — all frame metrics written for downstream analysis.
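
As a rough illustration of the PSNR side of the evaluation, the sketch below computes PSNR on full RGB and on a BT.601 luma channel with NumPy. The function names are hypothetical, and the full-range luma weights (0.299, 0.587, 0.114) are an assumption: some super-resolution evaluations use the studio-range BT.601 conversion instead, which gives slightly different numbers.

```python
import numpy as np

# BT.601 luma weights for R, G, B (full-range variant; an assumption here).
BT601_WEIGHTS = np.array([0.299, 0.587, 0.114])


def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_val ** 2 / mse))


def luma(rgb: np.ndarray) -> np.ndarray:
    """Collapse an HxWx3 RGB image to a single Y (luminance) channel."""
    return rgb.astype(np.float64) @ BT601_WEIGHTS


def psnr_y(a: np.ndarray, b: np.ndarray) -> float:
    """PSNR measured on the luminance channel only."""
    return psnr(luma(a), luma(b))
```

SSIM for the same frame pair could come from scikit-image's structural_similarity function, and one row per frame (index, PSNR, Y-PSNR, SSIM) could then be written with the csv module.
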

Benchmark Results

Measured on 5 AI-generated videos (81 frames each, 1280×720 target resolution):

Video     PSNR (dB)   SSIM    Processing Time   Speed (fps)
Video A   31.06       0.927   10.25s            7.90
Video B   31.19       0.891   10.52s            7.70
Video C   31.73       0.926   10.36s            7.82
Video D   32.25 ★     0.924   10.24s            7.91 ★
Video E   31.04       0.910   10.34s            7.83
Average   31.45       0.916   10.34s            7.83

PSNR above 30 dB on natural-scene video = good reconstruction; above 32 dB (Video D) = excellent. Four of the five videos exceed SSIM 0.90, and Video B (0.891) falls just short of that threshold. At ~7.8 fps, an 81-frame clip (about 5 seconds of video) processes in about 10 seconds on a single GPU, which is practical for production post-processing.
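
The speed column and the averages follow directly from the per-clip numbers, which the short check below reproduces. The variable names are illustrative; the values are copied from the benchmark table.

```python
FRAMES_PER_CLIP = 81

# (PSNR in dB, SSIM, processing time in seconds) per clip, from the table.
results = {
    "Video A": (31.06, 0.927, 10.25),
    "Video B": (31.19, 0.891, 10.52),
    "Video C": (31.73, 0.926, 10.36),
    "Video D": (32.25, 0.924, 10.24),
    "Video E": (31.04, 0.910, 10.34),
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Throughput is simply frames divided by wall-clock time.
fps = {name: round(FRAMES_PER_CLIP / t, 2) for name, (_, _, t) in results.items()}

avg_psnr = round(mean(p for p, _, _ in results.values()), 2)
avg_ssim = round(mean(s for _, s, _ in results.values()), 3)
avg_time = round(mean(t for _, _, t in results.values()), 2)

print(fps)
print(avg_psnr, avg_ssim, avg_time)
```
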

Pipeline Orchestration

The full workflow runs as a single command with three independently-skippable stages:

  1. Model download — fetches pre-trained ESRGAN weights from a remote URL if not cached locally.
  2. Batch upscaling — processes all videos in the input directory, writing results to an output directory.
  3. Metrics evaluation — computes PSNR/SSIM against the reference directory; generates a summary report identifying best and worst-performing clips.

Each stage is subprocess-isolated, so a failure gives a clear error without corrupting state for others.
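
A minimal sketch of how such an orchestrator could be structured is shown below. The script names, the --skip-* flag names, and the helper functions are hypothetical; the source only states that the three stages exist, can be skipped independently, and run in separate subprocesses.

```python
import argparse
import subprocess
import sys

# Stage name -> script to run (file names are placeholders, not the real ones).
STAGES = {
    "download": "download_model.py",
    "upscale": "upscale_videos.py",
    "metrics": "evaluate_metrics.py",
}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Video super-resolution pipeline")
    for name in STAGES:
        parser.add_argument(f"--skip-{name}", action="store_true",
                            help=f"do not run the {name} stage")
    return parser.parse_args(argv)


def selected_stages(args):
    """Stages to run, in order, after applying the --skip-* flags."""
    return [name for name in STAGES if not getattr(args, f"skip_{name}")]


def run_stage(name, cmd):
    """Run one stage in its own process; report failure without raising."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print(f"[{name}] failed with exit code {result.returncode}", file=sys.stderr)
        return False
    return True


def run_pipeline(args, commands=None):
    """Run the selected stages; return the name of the first failing stage, or None."""
    if commands is None:
        commands = {name: [sys.executable, script] for name, script in STAGES.items()}
    for name in selected_stages(args):
        if not run_stage(name, commands[name]):
            return name  # stop here; later stages are not started
    return None


# Typical entry point (not executed here):
#   failed = run_pipeline(parse_args())
#   sys.exit(1 if failed else 0)
```

Because each stage is a separate process, a crash in one stage surfaces as a non-zero exit code rather than an exception that takes down the orchestrator.
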


Impact

This pipeline saved the team from retraining the upstream text-to-video model at 720p, a process that would have required significantly more GPU hours, training data, and iteration time. Instead, we got 720p output (average PSNR above 31 dB, average SSIM above 0.91) by adding a lightweight post-processing step that takes ~10 seconds per clip on a single GPU and handles every AI-generated video coming through the internal content pipeline.