Inside the Pipeline: Building Models for Scene-Level Video Compression
The NVIDIA architecture behind Qencode’s ML video compression models.
- Most video platforms spend six times more on CDN delivery than on encoding. Every oversized file compounds that gap across every future view.
- A sports broadcast and a conference recording have almost nothing in common at the bit-allocation level, yet most encoding pipelines treat them identically. These inefficiencies can produce files that are an average of 60% larger than they need to be.
- 56% of our GPU compute runs production inference on real customer content. Across training and inference, our compute footprint has quadrupled in the last four months.
H.264, H.265, and AV1 can all produce excellent output. The key decision is which settings to use for each scene of each video to maximize video quality while minimizing file size.
Traditional encoders operate at the codec level, applying fixed bitrates and static quality targets uniformly across an entire file. They struggle to distinguish mostly static scenes from those full of complex motion and texture. The result is systematic over-allocation where bits are not needed and under-allocation where they are.
The output is a file that is consistently larger than it needs to be. Since load times and costs scale directly with file size, any inefficiency multiplies across every viewer, every replay, and every asset in your library.
To get the best possible bitrate allocation, the encoder needs to understand what is actually in the scene being encoded. Doing that at production scale, across millions of minutes per month, requires machine learning and GPU infrastructure capable of running it.
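To make the contrast concrete, here is a minimal sketch of per-scene quality selection. The scoring, thresholds, and CRF values are hypothetical illustrations, not Qencode's actual model: the point is that each scene gets its own quality target instead of one fixed setting for the whole file.

```python
# Minimal sketch (hypothetical mapping, not Qencode's production model):
# derive a per-scene H.264 CRF from a normalized complexity score,
# rather than applying one static quality target to the entire file.

def crf_for_scene(complexity: float) -> int:
    """Map a scene-complexity score in [0, 1] to an H.264 CRF.

    Lower CRF means higher quality and more bits. A near-static scene
    tolerates a higher CRF; a high-motion, high-texture scene needs a
    lower one to avoid visible artifacts.
    """
    base_crf = 30      # quality floor for near-static content (assumed)
    max_boost = 10     # extra quality budget for the busiest scenes (assumed)
    return round(base_crf - max_boost * complexity)

# e.g. slide deck, talking head, sports action
scenes = [0.05, 0.40, 0.95]
targets = [crf_for_scene(c) for c in scenes]
print(targets)  # [30, 26, 20] -- each scene gets its own quality target
```

A uniform encoder would effectively pick one number for all three scenes, overspending on the slide deck or starving the sports action.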
We designed this architecture around NVIDIA hardware.
What the Intelligence Layer Does Frame by Frame
Before Qencode makes a single encoding decision, our ML models analyze each scene across five specific dimensions to choose the encoding settings that maximize the compression rate:
- Motion estimation between frames. How much has changed from one frame to the next, and how complex is that change? A slow camera pan across a landscape is fundamentally different from a fast break in a basketball game, even if both register as “motion” to a basic scene detector.
- Spatial texture complexity. How much fine detail exists within each frame? A close-up of grass or fabric requires more bits to render faithfully than a clean whiteboard or solid-color background.
- Perceptual saliency. Where will the human eye actually look? Bits allocated to regions outside the viewer’s attention have diminishing perceptual return. Our models identify focal areas and weight allocation accordingly.
- Color gradient transitions. Smooth gradients (like skies, studio lighting, and skin tones) are prone to banding artifacts when under-allocated. The model detects these regions and protects them.
- Temporal redundancy. How much of the frame has not meaningfully changed since the last one? In a conference recording, this can be 60% or more of every frame. In a sports broadcast, it is far less. Identifying this accurately prevents wasting bits on information the viewer already has.
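The last dimension is the easiest to illustrate. The sketch below is a toy measure of our own construction, not Qencode's model: it counts the fraction of pixels whose luma barely changed between consecutive frames, which is exactly the information an encoder can skip or predict cheaply.

```python
# Toy temporal-redundancy measure (illustrative only, not Qencode's model):
# the fraction of pixels whose luma changed by no more than a small
# tolerance between two consecutive frames.
import numpy as np

def temporal_redundancy(prev: np.ndarray, curr: np.ndarray, tol: int = 4) -> float:
    """Fraction of pixels that changed by at most `tol` luma levels."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return float(np.mean(diff <= tol))

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(720, 1280), dtype=np.uint8)

# Simulate a mostly static "conference recording" frame: only a
# 200x200 corner region (a speaker's webcam tile, say) changes.
next_frame = frame.copy()
next_frame[:200, :200] = rng.integers(0, 256, size=(200, 200), dtype=np.uint8)

print(f"redundancy: {temporal_redundancy(frame, next_frame):.2f}")
```

A high score says most of the frame is already known to the viewer; a sports broadcast would score far lower on the same measure.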

The outcome is an average 60% reduction in file size with no perceptual quality loss. Smaller files mean proportionally lower CDN costs, no workflow changes, no new codecs, and no player modifications required.
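Because delivery cost scales linearly with bytes served, the savings arithmetic is simple. The dollar figure below is an assumed baseline for illustration; the 60% reduction is the average cited above.

```python
# Back-of-envelope illustration (hypothetical baseline spend): CDN cost
# scales linearly with bytes delivered, so a 60% average file-size
# reduction cuts delivery spend by the same fraction across every view.

monthly_cdn_cost = 10_000.0   # assumed baseline delivery spend, $/month
size_reduction = 0.60         # average file-size reduction cited above

new_cost = monthly_cdn_cost * (1 - size_reduction)
savings = monthly_cdn_cost - new_cost
print(new_cost, savings)  # 4000.0 6000.0
```

The same multiplier applies to every viewer, replay, and asset, which is why oversized files compound across a library.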
Built Using NVIDIA Hardware
Running five ML inference passes per frame, per video, at scale is not a workload you can easily approximate using CPUs. Each pass requires dense matrix operations, so running them in parallel fast enough to be economically viable in a production pipeline requires the kind of hardware NVIDIA builds.
Qencode’s infrastructure reflects this in the following three-tier architecture:
- NVIDIA A100 (37% of compute): model training
- NVIDIA L40S (56% of compute): production inference
- NVIDIA T4 (7% of compute): hardware transcoding
The 56% inference allocation is the number that tells the real story: more than half of Qencode’s GPU compute now serves real customer workloads, and that share keeps growing.
Qencode is self-funded, yet our GPU compute footprint has quadrupled in four months. The team is fully focused on scaling to meet customer demand.
The NVIDIA Inception Program has been a huge part of Qencode’s success, providing technical resources that let a capital-efficient team like ours build on GPU-intensive infrastructure that would otherwise be out of reach. The program’s value cannot be overstated. It has been one of the key drivers that has allowed us to drastically reduce the time between “we have a great idea” and “we shipped an amazing new production release.”
Try It with Your Content
Upload a sample of your content at Qencode.com and compare file size and quality output side by side with your current pipeline. No workflow changes are required to run the benchmark.
Try it here: Qencode
Learn more about our per-title encoding here: qencode.com/per-title-encoding
Qencode is a member of the NVIDIA Inception Program.
