Large-Scale Multimodal Inference Optimization
Overview
Optimized Flux-Schnell (a 12B-parameter Diffusion Transformer) multimodal inference on H100 clusters, achieving a 10-15× throughput improvement over baseline through GPU memory optimization and distributed computing.
Key Technologies
- Deep Learning: Flux-Schnell, Diffusion Transformers, Real-ESRGAN
- Optimization: NCCL, ONNX, TensorRT, GPU memory persistence
- Infrastructure: H100 GPUs, GCS, FastAPI, ComfyUI
Performance Achievements
- Reached ~30 images/min/GPU with 1-2s latency (10-15× faster than baseline)
- Achieved linear throughput scaling across distributed multi-GPU nodes
- Reduced video super-resolution runtime by ~65% (284s → 100s) for 5s@24fps clips
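Linear throughput scaling means the aggregate rate is simply the per-GPU rate times the GPU count. A toy sanity check of that claim (the efficiency factor and GPU counts below are illustrative examples, not project measurements):

```python
# Illustrative arithmetic for the linear-scaling claim above.
# Only the ~30 images/min/GPU figure comes from the benchmark;
# the GPU counts and efficiency factor are made-up examples.

PER_GPU_IMAGES_PER_MIN = 30  # measured single-GPU rate from above

def aggregate_throughput(num_gpus, efficiency=1.0):
    """Cluster images/min under (near-)linear scaling.

    `efficiency` < 1.0 models communication overhead that erodes
    perfect linearity on real multi-node clusters.
    """
    return num_gpus * PER_GPU_IMAGES_PER_MIN * efficiency

print(aggregate_throughput(8))         # one 8-GPU H100 node -> 240.0
print(aggregate_throughput(32, 0.95))  # 4 nodes at 95% efficiency -> 912.0
```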
Production Systems
- Video Super-Resolution: Real-ESRGAN pipeline with PSNR/SSIM evaluation, integrated with Wan2.2 text-to-video
- E-commerce Try-On: AI-powered service using Flux-Kontext + Segformer, delivering outfit changes and style transfer in under 5 s per image
- Infrastructure: Production-ready system with queuing, heartbeat monitoring, structured logging, and content moderation
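The queuing and heartbeat-monitoring pattern above can be sketched with the standard library alone. Everything here (the `Worker` class, `submit`/`drain` helpers, and the timeout value) is an illustrative stand-in, not the production implementation:

```python
import queue
import time

# Minimal sketch of the queuing + heartbeat pattern: the API layer
# enqueues jobs and returns immediately; GPU workers dequeue and
# report liveness via timestamps. All names are hypothetical.

HEARTBEAT_TIMEOUT = 30.0  # seconds of silence before a worker is presumed dead

class Worker:
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        # Called periodically by the worker, e.g. between inference steps.
        self.last_heartbeat = time.monotonic()

    def is_alive(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_heartbeat) < HEARTBEAT_TIMEOUT

jobs = queue.Queue()  # shared work queue between API and workers

def submit(prompt):
    # Request handler enqueues and returns; heavy work happens off-thread.
    jobs.put({"prompt": prompt, "submitted_at": time.monotonic()})

def drain(worker):
    # Worker loop: pull jobs while healthy. A supervisor checking
    # is_alive() would requeue jobs from workers whose heartbeats lapse.
    done = []
    while worker.is_alive() and not jobs.empty():
        job = jobs.get()
        worker.heartbeat()
        done.append(job)
    return done
```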
Technical Innovations
- GPU memory persistence and offload strategies for large model inference
- NCCL all-reduce for distributed communication
- ONNX → TensorRT optimization pipeline
- Secure RESTful APIs for real-time image processing
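The first bullet's persistence-and-offload idea amounts to keeping model weights resident between requests and demoting them to host RAM, rather than discarding them, under memory pressure. A minimal sketch of that pattern, where `load_pipeline` and the device strings are illustrative stand-ins (no real GPU calls):

```python
# Sketch of GPU memory persistence: load a pipeline once, cache it
# process-wide, and "offload" (move to CPU) instead of unloading.
# All names and the dict-based pipeline stub are hypothetical.

_PIPELINES = {}  # process-wide cache, keyed by model name

def load_pipeline(name):
    # Stand-in for an expensive cold load (weights from disk -> GPU).
    return {"name": name, "device": "cuda"}

def get_pipeline(name):
    """Return a cached pipeline, paying the load cost only on first use."""
    if name not in _PIPELINES:
        _PIPELINES[name] = load_pipeline(name)
    return _PIPELINES[name]

def offload(name):
    # Under memory pressure, demote weights to host RAM rather than
    # discarding them; re-promotion is far cheaper than a cold load.
    pipe = _PIPELINES.get(name)
    if pipe is not None:
        pipe["device"] = "cpu"
```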
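NCCL implements all-reduce with a ring algorithm: a reduce-scatter phase in which each rank forwards one chunk per step to its neighbor for accumulation, followed by an all-gather phase that circulates the fully reduced chunks. A pure-Python simulation of that algorithm (a didactic model of the communication pattern, not NCCL itself):

```python
def ring_allreduce(vectors):
    """Simulate a ring all-reduce (sum) over equal-length buffers.

    vectors[r] plays the role of rank r's local GPU buffer; the buffer
    is split into one chunk per rank. Total data sent per rank is
    ~2 * (n-1)/n * buffer_size, which is why the ring is bandwidth-optimal.
    """
    n = len(vectors)
    length = len(vectors[0])
    assert length % n == 0, "buffer must split evenly into one chunk per rank"
    chunk = length // n
    data = [list(v) for v in vectors]  # copy; data[r] is rank r's buffer

    def seg(c):  # index range covered by chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. Each step, rank r sends chunk (r - step) % n
    # to rank (r + 1) % n, which accumulates it. After n-1 steps, rank r
    # holds the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n
            dst = (r + 1) % n
            for i in seg(c):
                data[dst][i] += data[r][i]

    # Phase 2: all-gather. Each step, rank r forwards its newest complete
    # chunk (r + 1 - step) % n; the receiver overwrites its stale copy.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            dst = (r + 1) % n
            for i in seg(c):
                data[dst][i] = data[r][i]

    return data  # every rank now holds the elementwise sum
```

With 4 "ranks" each holding 4 values, every rank ends up with the elementwise sum of all four buffers after 2·(n-1) = 6 communication steps.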
