Video Super-Resolution Pipeline
As a Machine Learning Engineer intern at GMI Cloud, I built a video super-resolution pipeline that upscales AI-generated video from 360p to 720p using an ESRGAN (Enhanced Super-Resolution Generative Adversarial Network) model. The pipeline was designed as a post-processing stage for an internal text-to-video generation system. Rather than retraining the generative model at a higher resolution, which would require significantly more compute and data, I inserted a dedicated upscaling step that takes the low-resolution output and raises it to deliverable quality.
What makes this more than a "run a model on frames" project: I reimplemented the ESRGAN network architecture from scratch in PyTorch, built an FFmpeg-based frame extraction/reassembly pipeline, wrote an automated PSNR/SSIM evaluation system, and wrapped everything in an orchestration script that runs the complete workflow in a single command.
Network Architecture — Implemented from Scratch
Rather than using a pre-built library, I reimplemented the RRDBNet architecture directly in PyTorch for fine-grained control over the upsampling path. The architecture has three levels of hierarchy:
Checkpoint loading handles several layouts (model_ema, params_ema, raw state dict) for compatibility across official and fine-tuned variants.
Processing Pipeline
- PNG frames (not JPEG): avoids introducing compression artifacts before the upscaler.
- PIL loading over OpenCV: more reliable for edge-case color space handling.
- Pad to the nearest multiple of 4: required by the architecture's strided convolutions.
- libx264 / yuv420p / crf=23: H.264 reassembly at the original framerate. crf=23 is the libx264 default, a good quality/size trade-off rather than visually lossless (that threshold is usually placed around crf 17 to 18).
- Auto-cleanup: temporary frame directories are removed after completion, leaving no disk residue on shared cluster storage.
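The frame handling choices above could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the `frame_%06d.png` naming pattern, and the file paths are assumptions made for the example. Only the PNG frame format, the codec settings (libx264, yuv420p, crf 23), and the pad-to-multiple-of-4 rule come from the text. The FFmpeg commands are built as argument lists so they can be handed to `subprocess.run`.

```python
def extract_cmd(video: str, frames_dir: str) -> list[str]:
    """FFmpeg arguments that dump every frame of `video` as a PNG file."""
    pattern = f"{frames_dir}/frame_%06d.png"  # hypothetical naming scheme
    return ["ffmpeg", "-y", "-i", video, pattern]


def reassemble_cmd(frames_dir: str, fps: float, output: str) -> list[str]:
    """FFmpeg arguments that encode the PNG sequence back into an H.264 video."""
    pattern = f"{frames_dir}/frame_%06d.png"
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),   # reuse the source framerate
        "-i", pattern,
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",    # widely compatible pixel format
        "-crf", "23",
        output,
    ]


def pad_amounts(height: int, width: int, multiple: int = 4) -> tuple[int, int]:
    """Extra rows and columns needed to reach the next multiple (0 if aligned)."""
    return (-height) % multiple, (-width) % multiple
```

For a 360p frame of 640×360, `pad_amounts(360, 640)` returns `(0, 0)`, so no padding is needed. For odd sizes, the padded border would presumably be cropped off again after inference, scaled by the upscale factor.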
Quality Evaluation System
I built a dedicated evaluation script computing frame-by-frame PSNR and SSIM between the upscaled output and the 720p reference — not on a sample, but on every single frame pair.
- PSNR: measured in dB on full RGB and on the Y (luminance) channel per ITU-R BT.601. Y-channel PSNR is more perceptually meaningful because human vision is more sensitive to luminance than to chrominance.
- SSIM: compares luminance, contrast, and structural patterns. Values above 0.90 are generally taken to indicate high perceptual fidelity.
- Shape mismatch handling: frames are resized with cubic interpolation when the reference and output dimensions differ slightly.
- CSV output: all per-frame metrics are written out for downstream analysis.
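To make the PSNR part of the evaluation concrete, here is a small dependency-free sketch. It is illustrative only: the function names are invented for the example, frames are represented as plain lists of (R, G, B) tuples rather than image arrays, and it assumes the full-range BT.601 luma weights (0.299, 0.587, 0.114), whereas the actual script may use the studio-range variant. SSIM is left out because it needs windowed local statistics; a library implementation such as scikit-image's `structural_similarity` would be one option there.

```python
import math

# Full-range BT.601 luma weights (an assumption; see the note above).
BT601_WEIGHTS = (0.299, 0.587, 0.114)


def to_luma(pixels):
    """Map a sequence of (R, G, B) tuples in 0..255 to BT.601 luma values."""
    wr, wg, wb = BT601_WEIGHTS
    return [wr * r + wg * g + wb * b for r, g, b in pixels]


def psnr(reference, output, max_value=255.0):
    """PSNR in dB between two equal-length, non-empty sequences of samples."""
    if not reference or len(reference) != len(output):
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, output)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


def flatten_rgb(pixels):
    """Turn [(R, G, B), ...] into one flat list of channel samples."""
    return [channel for pixel in pixels for channel in pixel]


def frame_metrics(ref_pixels, out_pixels):
    """PSNR over all RGB samples and over the luma channel for one frame pair."""
    return {
        "psnr_rgb": psnr(flatten_rgb(ref_pixels), flatten_rgb(out_pixels)),
        "psnr_y": psnr(to_luma(ref_pixels), to_luma(out_pixels)),
    }
```

Calling `frame_metrics` once per frame pair and collecting the dictionaries gives rows that map directly onto a per-frame CSV, for example via `csv.DictWriter`.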
Benchmark Results
Measured on 5 AI-generated videos (81 frames each, 1280×720 target resolution):
| Video | PSNR (dB) | SSIM | Processing Time | Speed (fps) |
|---|---|---|---|---|
| Video A | 31.06 | 0.927 ★ | 10.25s | 7.90 |
| Video B | 31.19 | 0.891 | 10.52s | 7.70 |
| Video C | 31.73 | 0.926 | 10.36s | 7.82 |
| Video D | 32.25 ★ | 0.924 | 10.24s ★ | 7.91 ★ |
| Video E | 31.04 | 0.910 | 10.34s | 7.83 |
| Average | 31.45 | 0.916 | 10.34s | 7.83 |
★ marks the best value in each column.
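The Speed column and the Average row follow directly from the other figures: speed is the 81 frames per clip divided by the processing time, and each average is a plain arithmetic mean. The short check below recomputes them from the table values. The variable names are chosen for the example and are not taken from the evaluation script.

```python
FRAMES_PER_CLIP = 81

# (name, PSNR in dB, SSIM, processing time in seconds), copied from the table
results = [
    ("Video A", 31.06, 0.927, 10.25),
    ("Video B", 31.19, 0.891, 10.52),
    ("Video C", 31.73, 0.926, 10.36),
    ("Video D", 32.25, 0.924, 10.24),
    ("Video E", 31.04, 0.910, 10.34),
]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Frames per second for each clip, rounded as in the table
speeds = {name: round(FRAMES_PER_CLIP / secs, 2) for name, _, _, secs in results}

avg_psnr = round(mean(r[1] for r in results), 2)
avg_ssim = round(mean(r[2] for r in results), 3)
avg_time = round(mean(r[3] for r in results), 2)

print(speeds)
print(avg_psnr, avg_ssim, avg_time)
```

The recomputed speeds and averages agree with the table to the precision shown.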
Pipeline Orchestration
The full workflow runs as a single command with three independently skippable stages:
- Model download — fetches pre-trained ESRGAN weights from a remote URL if not cached locally.
- Batch upscaling — processes all videos in the input directory, writing results to an output directory.
- Metrics evaluation — computes PSNR/SSIM against the reference directory; generates a summary report identifying best and worst-performing clips.
Each stage runs in its own subprocess, so a failure produces a clear error without corrupting the state of the other stages.
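One way to structure subprocess-isolated, skippable stages is sketched below. It is an assumption-laden outline rather than the actual orchestration script: the script names in `STAGES`, their flags, and the directory paths are hypothetical, and a real entry point would likely populate the skip set from command-line options.

```python
import subprocess
import sys


def run_stage(name, cmd):
    """Run one stage in its own process; return True if it exited cleanly."""
    print(f"[{name}] starting")
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print(f"[{name}] failed with exit code {result.returncode}", file=sys.stderr)
        return False
    return True


def run_pipeline(stages, skip=()):
    """Run (name, cmd) stages in order, skipping any whose name is in `skip`.

    Stops at the first failing stage and returns the names that completed.
    """
    completed = []
    for name, cmd in stages:
        if name in skip:
            print(f"[{name}] skipped")
            continue
        if not run_stage(name, cmd):
            break
        completed.append(name)
    return completed


# Hypothetical script names and flags, shown only to illustrate the shape.
STAGES = [
    ("download", [sys.executable, "download_model.py"]),
    ("upscale", [sys.executable, "upscale_videos.py", "--input", "input/", "--output", "output/"]),
    ("evaluate", [sys.executable, "evaluate.py", "--reference", "reference/"]),
]
```

With this shape, `run_pipeline(STAGES, skip={"download"})` would reuse cached weights and go straight to upscaling. Because each stage is a separate process, a crash in one cannot leave half-initialized Python state behind for the next.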
Impact
This pipeline saved the team from retraining the upstream text-to-video model at 720p, which would have required significantly more GPU hours, training data, and iteration time. Instead, a lightweight post-processing step delivered 720p output (PSNR above 31 dB on every test clip, average SSIM 0.916) in about 10 seconds per clip on a single GPU. It processes every AI-generated video that passes through the internal content pipeline.
