AI Virtual Try-On System

As a Machine Learning Engineer intern at GMI Cloud, I developed an automated AI virtual try-on pipeline that enables photorealistic garment swaps on e-commerce model images using a diffusion-based image generation API. The system takes a model photograph and a garment image and produces a result where the model is wearing the target garment, with natural fabric draping, body-consistent proportions, and preserved facial identity.

This is not a single API call wrapped in a script. It's a batch orchestration system built to handle real production workloads: 2 model images × 9 garment items = 18 combinations per run, with asynchronous task tracking, empirically tuned polling intervals, multi-format image handling, retry logic, and fault isolation so that one API timeout doesn't kill the remaining 17 jobs.

18
Combinations / Run

~27s
Typical Completion

~4
Avg Polls to Done

3
Retry Attempts

0
Disk Writes at Submission

Diffusion Inpainting API · Async Task Orchestration · Base64 Encoding · Chunked Download · HMAC-SHA256 (SigV4) · Python SDK · Batch Processing

How Diffusion-Based Try-On Works

The underlying model is a diffusion-based virtual dressing system (v2) operating in three stages:

1. Body Segmentation → 2. Conditioned Inpainting → 3. Post-processing & Blend

The model automatically segments the input image to identify body regions (head, torso, arms, hands, legs, feet), masks the target region, and fills it with the garment — conditioned on the reference garment image. The diffusion model learns to respect fabric texture, color, draping physics, and body pose. Unmasked regions (face, background) are seamlessly blended back.

Generation Parameters I Tuned

✓ Super-resolution enabled

Output upsampled after generation for sharper fabric texture detail.

✓ Head preservation on

Model's face and hair locked — eliminates identity drift between input and output.

✓ Hand regeneration allowed

Hands re-generated to fit naturally with new sleeve length/style — not forced to original positions.

✓ Loose mask boundary

Relaxed segmentation mask → natural transitions at collars and hemlines; avoids visible seams.

Garment type is configurable per-job: "upper", "lower", or "full". I built separate test workflows for each mode before running full-outfit batch sweeps.
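A minimal sketch of how these per-job settings could be captured and validated before submission. The class and every field name here (`TryOnOptions`, `super_resolution`, `keep_head`, `keep_hands`, `loose_mask`) are hypothetical, chosen for illustration. The real API's parameter names are not given here and may differ.

```python
from dataclasses import asdict, dataclass

# The three garment modes named in the text.
VALID_GARMENT_TYPES = {"upper", "lower", "full"}


@dataclass(frozen=True)
class TryOnOptions:
    """Per-job generation settings (hypothetical field names)."""

    garment_type: str = "upper"
    super_resolution: bool = True  # upsample the output after generation
    keep_head: bool = True         # lock face and hair to the input image
    keep_hands: bool = False       # False lets the model regenerate hands
    loose_mask: bool = True        # relaxed segmentation mask boundary

    def __post_init__(self) -> None:
        # Fail fast on a typo instead of wasting a 20-30 second generation.
        if self.garment_type not in VALID_GARMENT_TYPES:
            raise ValueError(
                f"garment_type must be one of {sorted(VALID_GARMENT_TYPES)}, "
                f"got {self.garment_type!r}"
            )

    def to_request_fields(self) -> dict:
        """Return the options as a plain dict, ready to merge into a request."""
        return asdict(self)
```

Validating the garment type locally means a bad value surfaces before any task is queued, which matters when each job costs tens of seconds.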

Async Task Architecture

Because the diffusion model takes 20–30 seconds per image, the API uses an asynchronous task model (submit → task_id → poll → result) rather than synchronous request-response.

Task Submission

Model and garment images are base64-encoded in memory and sent as a single JSON payload. The server queues the diffusion job and immediately returns a task_id. No image data touches disk during this step — everything stays in-process until the result comes back.
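The encoding step can be sketched as follows. This is an illustration under assumptions, not the production code: the payload keys (`model_image`, `garment_image`, `garment_type`) and the function names are hypothetical, and the real API's schema may differ.

```python
import base64
import json


def encode_image(image_bytes: bytes) -> str:
    """Base64-encode raw image bytes into an ASCII string, entirely in memory."""
    return base64.b64encode(image_bytes).decode("ascii")


def build_submit_payload(
    model_bytes: bytes, garment_bytes: bytes, garment_type: str = "upper"
) -> str:
    """Build one JSON request body holding both images (hypothetical keys)."""
    payload = {
        "model_image": encode_image(model_bytes),
        "garment_image": encode_image(garment_bytes),
        "garment_type": garment_type,
    }
    return json.dumps(payload)
```

Because the encoded strings live only in the request body, nothing is written to a temporary file between reading the source images and sending the request.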

Smart Polling Strategy

My first version polled every 5 seconds uniformly — ~7 wasted polls per task. The calibrated version cuts that to ~4 average polls to completion:

Poll	Wait Before	Cumulative	Notes
1	5s	~5s	Early check — unlikely done
2	8s	~13s	Still too early
3	10s	~23s	Approaching typical completion
4	8s	~31s	★ Usually catches completion here (~27s typical)
5+	10s each	—	Fallback cadence until 300s timeout

Each poll returns one of five states, which fall into three groups: "in_queue" / "generating" (continue) · "done" (extract URLs) · "not_found" / "expired" (log and skip).
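The schedule and state handling above can be sketched as a small loop. The wait values and the 300-second timeout come from the table. The `get_status` callable and the `"state"` key in its return value are assumptions standing in for whatever the real client returns.

```python
import time
from typing import Callable, Iterator, Optional

INITIAL_WAITS = (5, 8, 10, 8)  # seconds before polls 1 to 4
FALLBACK_WAIT = 10             # seconds before every later poll
TIMEOUT_SECONDS = 300
TERMINAL_FAILURES = ("not_found", "expired")


def poll_waits() -> Iterator[int]:
    """Yield the calibrated waits, then the fallback cadence indefinitely."""
    yield from INITIAL_WAITS
    while True:
        yield FALLBACK_WAIT


def wait_for_result(
    task_id: str,
    get_status: Callable[[str], dict],
    sleep: Callable[[float], None] = time.sleep,
    timeout: int = TIMEOUT_SECONDS,
) -> Optional[dict]:
    """Poll until the task is done; return its status dict, or None on failure.

    `get_status` is a hypothetical client call returning a dict with a
    "state" key. Only time spent waiting counts toward the timeout here;
    the duration of the status requests themselves is ignored.
    """
    waited = 0
    for wait in poll_waits():
        if waited + wait > timeout:
            return None  # timed out: caller logs and skips
        sleep(wait)
        waited += wait
        status = get_status(task_id)
        state = status.get("state")
        if state == "done":
            return status
        if state in TERMINAL_FAILURES:
            return None
        # "in_queue" or "generating": keep polling
```

Passing `sleep` as a parameter lets the schedule be tested without real waiting, for example by recording the requested delays in a list.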

Result Download with Retry

Chunked streaming (8 KB chunks) — avoids loading large images entirely into memory
30-second timeout per download — handles slow CDN responses
3 retries with delay — transient network errors
Descriptive filenames — encode model ID, garment index, result number for full QA traceability
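A sketch of the download step using only the standard library. Several details are assumptions: the filename pattern and `.png` extension are hypothetical, "3 retries" is read as three attempts in total, the 2-second retry delay is a placeholder, and only `OSError`-family failures (which include `urllib` URL errors and socket timeouts) are retried.

```python
import time
import urllib.request
from pathlib import Path

CHUNK_SIZE = 8 * 1024   # 8 KB per read
DOWNLOAD_TIMEOUT = 30   # seconds
MAX_ATTEMPTS = 3


def result_filename(model_id: str, garment_index: int, result_number: int) -> str:
    """Encode model, garment, and result number in the name (hypothetical format)."""
    return f"{model_id}_garment{garment_index:02d}_result{result_number}.png"


def download_with_retry(
    url: str,
    dest: Path,
    opener=urllib.request.urlopen,
    retry_delay: float = 2.0,
    sleep=time.sleep,
) -> bool:
    """Stream `url` to `dest` in chunks; return True on success, False otherwise."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with opener(url, timeout=DOWNLOAD_TIMEOUT) as response, open(dest, "wb") as out:
                while True:
                    chunk = response.read(CHUNK_SIZE)
                    if not chunk:
                        break  # end of stream
                    out.write(chunk)
            return True
        except OSError:
            # Transient network error: wait, then try again if attempts remain.
            if attempt < MAX_ATTEMPTS:
                sleep(retry_delay)
    dest.unlink(missing_ok=True)  # do not leave a partial file behind
    return False
```

Reading fixed-size chunks keeps memory use flat regardless of image size, which is the point of streaming a super-resolved output rather than buffering it whole.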

Batch Orchestration

The batch processor iterates the full cross-product of model images × garment images:

Submit task → get task_id → Smart poll → wait for done → Download results → Log success / failure → 2s rate-limit delay → Next combination

Fault tolerance: If a single task fails — API error, timeout, download failure — the system logs and moves to the next combination. The final summary reports total successes vs. failures so I can immediately identify which combinations need retrying without parsing through logs.
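The loop and its fault isolation can be sketched like this. `process_one` is a hypothetical callable standing in for the whole submit, poll, and download sequence for one pair, and the shape of the summary dict is an assumption.

```python
import itertools
import time
from typing import Callable, Sequence

RATE_LIMIT_DELAY = 2  # seconds between combinations


def run_batch(
    models: Sequence[str],
    garments: Sequence[str],
    process_one: Callable[[str, str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Run every model x garment pair, isolating failures to a single pair."""
    succeeded, failed = [], []
    for model, garment in itertools.product(models, garments):
        try:
            process_one(model, garment)  # submit -> poll -> download
            succeeded.append((model, garment))
        except Exception as exc:
            # One bad pair must not stop the rest of the sweep.
            failed.append((model, garment, repr(exc)))
        sleep(RATE_LIMIT_DELAY)
    return {"succeeded": succeeded, "failed": failed}
```

With 2 models and 9 garments this visits 18 pairs, and the `failed` list names exactly which pairs to resubmit.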

Authentication & Security

I implemented two authentication approaches:

Signature-based (SigV4) — low-level API; each request cryptographically signed with HMAC-SHA256 over request method, host, path, query parameters, headers, and body. Same protocol as AWS. Proved valuable when debugging a signing bug that only appeared with certain garment image sizes (content hash computation sensitive to base64 padding).
SDK-based — production batch orchestrator uses the platform's official SDK, which handles signing internally and provides a higher-level interface (submit_task / get_result). I implemented both: low-level first to understand the auth mechanism, SDK second for production reliability.
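For context on the signature-based path, here is a sketch of two SigV4 building blocks, written against AWS's published scheme. Whether this platform uses the same `"AWS4"` key prefix and `"aws4_request"` terminator is an assumption, and the function names are illustrative. The last lines show why the payload hash is sensitive to base64 padding: the hash covers the exact bytes sent, so dropping a trailing `=` changes it.

```python
import base64
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive the per-day signing key the way AWS SigV4 does.

    The "AWS4" prefix and "aws4_request" terminator are AWS's constants;
    another platform implementing the same protocol may use its own.
    """
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def payload_hash(body: bytes) -> str:
    """Hex SHA-256 of the exact request body bytes."""
    return hashlib.sha256(body).hexdigest()


# Padding sensitivity: the same image data with and without "=" padding
# hashes differently, so the signature only verifies if the hashed bytes
# are byte-for-byte the bytes that are sent.
padded = base64.b64encode(b"abcd")   # b"YWJjZA=="
stripped = padded.rstrip(b"=")       # b"YWJjZA"
hashes_differ = payload_hash(padded) != payload_hash(stripped)
```

The practical rule is to compute the hash from the final serialized body, after all encoding, rather than from an intermediate representation.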

Results & Impact

The system successfully processed all 18 model × garment combinations in a single automated run, producing photorealistic try-on images with consistent facial identity and natural garment appearance. The orchestration eliminated what would otherwise be a tedious 2-hour manual process of submitting images one-by-one through a web interface.

The pipeline was later adapted for internal A/B testing of garment presentation quality — the ability to quickly generate try-on images across model-garment pairings let the team evaluate visual consistency at scale rather than one-off spot checks.

What I Learned

The most interesting engineering challenge wasn't the ML model — it was designing robust orchestration around an asynchronous API I didn't control. Unlike the inference server where I owned the entire stack, here the model was behind someone else's API with its own queue, its own rate limits, and its own failure modes. Learning to build reliable orchestration on top of a service with variable latency and no SLA guarantees was a skill I hadn't developed in any academic project.

Yupeng Tang