Large-Scale Multimodal Inference Optimization
Overview
Optimized Flux-Schnell (a 12B-parameter Diffusion Transformer) multimodal inference on H100 clusters, achieving a 10-15× throughput improvement over baseline through GPU memory optimization and distributed computing.
Key Technologies
- Deep Learning: Flux-Schnell, Diffusion Transformers, Real-ESRGAN
- Optimization: NCCL, ONNX, TensorRT, GPU memory persistence
- Infrastructure: H100 GPUs, GCS, FastAPI, ComfyUI
Performance Achievements
- Reached ~30 images/min/GPU with 1-2s latency (10-15× faster than baseline)
- Achieved linear throughput scaling across distributed multi-GPU nodes
- Reduced video super-resolution runtime by ~65% (284s → 100s) for 5s@24fps clips
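Linear throughput scaling means the aggregate rate is simply the per-GPU rate times the GPU count. A toy sanity check of that claim (the efficiency factor and GPU counts below are illustrative examples, not project measurements):

```python
# Illustrative arithmetic for the linear-scaling claim above.
# Only the ~30 images/min/GPU figure comes from the benchmark;
# the GPU counts and efficiency factor are made-up examples.

PER_GPU_IMAGES_PER_MIN = 30  # measured single-GPU rate from above

def aggregate_throughput(num_gpus, efficiency=1.0):
    """Cluster images/min under (near-)linear scaling.

    `efficiency` < 1.0 models communication overhead that erodes
    perfect linearity on real multi-node clusters.
    """
    return num_gpus * PER_GPU_IMAGES_PER_MIN * efficiency

print(aggregate_throughput(8))         # one 8-GPU H100 node -> 240.0
print(aggregate_throughput(32, 0.95))  # 4 nodes at 95% efficiency -> 912.0
```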
Production Systems
- Video Super-Resolution: Real-ESRGAN pipeline with PSNR/SSIM evaluation, integrated with Wan2.2 text-to-video
- E-commerce Try-On: AI-powered service using Flux-Kontext + Segformer, delivering outfit changes and style transfer in under 5 s per image
- Infrastructure: Production-ready system with queuing, heartbeat monitoring, structured logging, and content moderation
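The queuing and heartbeat-monitoring pattern above can be sketched with the standard library alone. Everything here (the `Worker` class, `submit`/`drain` helpers, and the timeout value) is an illustrative stand-in, not the production implementation:

```python
import queue
import time

# Minimal sketch of the queuing + heartbeat pattern: the API layer
# enqueues jobs and returns immediately; GPU workers dequeue and
# report liveness via timestamps. All names are hypothetical.

HEARTBEAT_TIMEOUT = 30.0  # seconds of silence before a worker is presumed dead

class Worker:
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        # Called periodically by the worker, e.g. between inference steps.
        self.last_heartbeat = time.monotonic()

    def is_alive(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_heartbeat) < HEARTBEAT_TIMEOUT

jobs = queue.Queue()  # shared work queue between API and workers

def submit(prompt):
    # Request handler enqueues and returns; heavy work happens off-thread.
    jobs.put({"prompt": prompt, "submitted_at": time.monotonic()})

def drain(worker):
    # Worker loop: pull jobs while healthy. A supervisor checking
    # is_alive() would requeue jobs from workers whose heartbeats lapse.
    done = []
    while worker.is_alive() and not jobs.empty():
        job = jobs.get()
        worker.heartbeat()
        done.append(job)
    return done
```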
Technical Innovations
- GPU memory persistence and offload strategies for large model inference
- NCCL all-reduce for distributed communication
- ONNX → TensorRT optimization pipeline
- Secure RESTful APIs for real-time image processing
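The first bullet's persistence-and-offload idea amounts to keeping model weights resident between requests and demoting them to host RAM, rather than discarding them, under memory pressure. A minimal sketch of that pattern, where `load_pipeline` and the device strings are illustrative stand-ins (no real GPU calls):

```python
# Sketch of GPU memory persistence: load a pipeline once, cache it
# process-wide, and "offload" (move to CPU) instead of unloading.
# All names and the dict-based pipeline stub are hypothetical.

_PIPELINES = {}  # process-wide cache, keyed by model name

def load_pipeline(name):
    # Stand-in for an expensive cold load (weights from disk -> GPU).
    return {"name": name, "device": "cuda"}

def get_pipeline(name):
    """Return a cached pipeline, paying the load cost only on first use."""
    if name not in _PIPELINES:
        _PIPELINES[name] = load_pipeline(name)
    return _PIPELINES[name]

def offload(name):
    # Under memory pressure, demote weights to host RAM rather than
    # discarding them; re-promotion is far cheaper than a cold load.
    pipe = _PIPELINES.get(name)
    if pipe is not None:
        pipe["device"] = "cpu"
```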
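NCCL implements all-reduce with a ring algorithm: a reduce-scatter phase in which each rank forwards one chunk per step to its neighbor for accumulation, followed by an all-gather phase that circulates the fully reduced chunks. A pure-Python simulation of that algorithm (a didactic model of the communication pattern, not NCCL itself):

```python
def ring_allreduce(vectors):
    """Simulate a ring all-reduce (sum) over equal-length buffers.

    vectors[r] plays the role of rank r's local GPU buffer; the buffer
    is split into one chunk per rank. Total data sent per rank is
    ~2 * (n-1)/n * buffer_size, which is why the ring is bandwidth-optimal.
    """
    n = len(vectors)
    length = len(vectors[0])
    assert length % n == 0, "buffer must split evenly into one chunk per rank"
    chunk = length // n
    data = [list(v) for v in vectors]  # copy; data[r] is rank r's buffer

    def seg(c):  # index range covered by chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. Each step, rank r sends chunk (r - step) % n
    # to rank (r + 1) % n, which accumulates it. After n-1 steps, rank r
    # holds the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n
            dst = (r + 1) % n
            for i in seg(c):
                data[dst][i] += data[r][i]

    # Phase 2: all-gather. Each step, rank r forwards its newest complete
    # chunk (r + 1 - step) % n; the receiver overwrites its stale copy.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            dst = (r + 1) % n
            for i in seg(c):
                data[dst][i] = data[r][i]

    return data  # every rank now holds the elementwise sum
```

With 4 "ranks" each holding 4 values, every rank ends up with the elementwise sum of all four buffers after 2·(n-1) = 6 communication steps.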
