---
title: "How Video Compression Actually Works (In Plain English)"
author: "Cutsio Team"
date: "2026-05-14"
lastmod: "2026-05-14"
category: "Video Technology"
excerpt: "Video compression achieves 100x to 1,000x size reduction by exploiting how human vision works, removing redundant data, and using mathematical transforms that prioritize what our eyes actually notice. Here is how the technology behind FFmpeg and modern codecs makes video streaming possible."
tags: ["Video Compression", "Codecs", "FFmpeg", "H.264", "AV1", "Video Technology"]
---

## How does video compression achieve 1,000x size reduction?

Video compression achieves 1,000x size reduction by exploiting three fundamental principles: spatial redundancy (neighboring pixels are often similar), temporal redundancy (adjacent frames are often nearly identical), and perceptual redundancy (details invisible to the human eye can be discarded without affecting perceived quality).

Uncompressed 4K video at 30 frames per second requires roughly 700 megabytes of data every second, which works out to about 2.5 terabytes per hour. Streaming that over an average internet connection would be impossible. The same video encoded with a modern codec like AV1 might use 5 to 10 megabits per second, a reduction of roughly 99.9%.
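
If you want to sanity-check those numbers, the arithmetic is short. A minimal calculation, assuming 8-bit RGB (3 bytes per pixel; professional capture formats use different pixel layouts, and the exact figure also depends on whether you count decimal or binary megabytes, but it lands in the ballpark quoted above):

```python
# Rough data rate for uncompressed 4K video, assuming 8-bit RGB (3 bytes per pixel).
width, height, fps = 3840, 2160, 30
bytes_per_pixel = 3

bytes_per_frame = width * height * bytes_per_pixel        # ~24.9 MB per frame
bytes_per_second = bytes_per_frame * fps                   # ~746 MB/s (about 700 MiB/s)
terabytes_per_hour = bytes_per_second * 3600 / 1e12        # ~2.7 TB per hour

print(f"{bytes_per_second / 1e6:.0f} MB/s")
print(f"{terabytes_per_hour:.1f} TB per hour")
```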

This compression is not lossless. Unlike a ZIP file where the decompressed data is identical to the original, video compression discards information. The goal is to discard only the information that humans will not notice is missing. "Compression is not like a ZIP," as the FFmpeg developers explain. "With a ZIP, you have data in, you get data out, and you try to arrive at the mathematical limit. Here we are degrading the signal. We need to degrade in the best way possible."

The entire field of video compression is built on the intersection of signal processing, human perception, and computational efficiency. Every codec represents a different set of trade-offs between these three factors, and each generation of codec finds new ways to push the boundaries of what can be discarded without the viewer noticing.

## What is the difference between a container and a codec?

A container is the file format that holds video, audio, subtitles, and metadata together, while a codec is the algorithm that compresses and decompresses the actual video data — and confusing the two is one of the most common mistakes in video.

The most common confusion is between MP4 and H.264. MP4 is a container format. H.264 is a video codec. When people say "an MP4 file," they usually mean a file that uses an MP4 container with H.264 video and AAC audio. But technically, an MP4 container can hold any codec, and H.264 video can be stored in many containers including MOV, MKV, and even AVI.

The confusion is perpetuated by the naming conventions of the standards bodies. H.264 is technically called MPEG-4 Part 10, which sounds like it should be related to MP4. But MPEG-4 is a meta-specification that covers multiple parts, including the MP4 container format (Part 14) and multiple video codecs (Part 2 and Part 10). "It is completely the fault of the industry to make things difficult to understand," the FFmpeg developers acknowledge.

There is a practical consequence to this confusion. A file ending in .MP4 may contain video encoded with H.264, H.265, VP9, AV1, or any other codec. The file extension is a hint, not a guarantee. VLC and FFmpeg do not trust file extensions at all — they probe the actual byte content of the file to determine what format and codec they are dealing with. This philosophy of "never trust your input" is why VLC can play files that other players reject.
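
You can do the same probing yourself with ffprobe, FFmpeg's inspection tool. A minimal sketch (the file name is a placeholder; the JSON fields shown are part of ffprobe's standard output):

```python
import json
import subprocess

# Ask ffprobe to identify the container and the codec of every stream.
# It probes the file's bytes, not its extension. "input.mp4" is a placeholder.
result = subprocess.run(
    ["ffprobe", "-v", "error", "-print_format", "json",
     "-show_format", "-show_streams", "input.mp4"],
    capture_output=True, text=True, check=True,
)
info = json.loads(result.stdout)

print("container:", info["format"]["format_name"])            # e.g. "mov,mp4,m4a,3gp,3g2,mj2"
for stream in info["streams"]:
    print(stream["codec_type"], "->", stream["codec_name"])   # e.g. "video -> h264"
```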

| Component | What It Does | Examples |
|---|---|---|
| Container | Wraps video, audio, subtitle, and metadata streams | MP4, MOV, MKV, AVI, WebM |
| Video codec | Compresses and decompresses the pixel data | H.264, H.265, AV1, VP9, VP8 |
| Audio codec | Compresses and decompresses the sound data | AAC, Opus, MP3, Vorbis |
| Muxer | Combines streams into a container | Part of the encoding pipeline |
| Demuxer | Separates streams from a container | Part of the decoding pipeline |

## How does spatial compression remove redundancy within a single frame?

Spatial compression removes redundancy by dividing each frame into blocks, predicting the content of each block based on its neighbors, transforming the prediction error into the frequency domain, and quantizing the high-frequency components that human vision is less sensitive to.

The process starts by converting the image from RGB to YUV color space. Human vision is much more sensitive to brightness (luminance) than to color (chrominance). The YUV representation separates these, allowing the encoder to store the color information at a quarter of the resolution of the brightness (half in each dimension, the common 4:2:0 sampling) with little visible quality loss. This single trick, called chroma subsampling, reduces the data by about 50%.
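
Here is a small numpy sketch of the idea, using BT.601 luma and chroma weights and plain 2x2 averaging for the subsampling. Real encoders use standardized matrices and better filters, so treat this purely as an illustration:

```python
import numpy as np

def rgb_to_yuv420(rgb):
    """Split an RGB image into a full-resolution luma plane (Y) and
    two quarter-resolution chroma planes (U, V), i.e. 4:2:0 sampling."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # BT.601 weights (illustrative; HD video usually uses BT.709).
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.169 * r - 0.331 * g + 0.500 * b
    v = 0.500 * r - 0.419 * g - 0.081 * b
    # Keep full resolution for brightness, average chroma over 2x2 blocks.
    h, w = y.shape
    u_sub = u.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    v_sub = v.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, u_sub, v_sub

rgb = np.random.rand(4, 4, 3)                # tiny stand-in for a frame
y, u, v = rgb_to_yuv420(rgb)
samples_before = rgb.size                     # 3 values per pixel
samples_after = y.size + u.size + v.size      # 1.5 values per pixel
print(samples_before, "->", samples_after)    # a 50% reduction before any real compression
```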

The frame is then divided into blocks, typically 16x16 pixels for older codecs and as large as 128x128 for modern ones. Each block is predicted from the pixels above and to the left that have already been encoded. The encoder subtracts the prediction from the actual block, leaving a residual that is much smaller than the original data.

The residual is transformed using a Discrete Cosine Transform, converting the pixel values from the spatial domain to the frequency domain. Low-frequency components (smooth gradients, large features) end up in one set of coefficients. High-frequency components (sharp edges, fine details, noise) end up in another. The encoder then quantizes these coefficients, dividing them by a value that determines how much precision to keep. Higher quantization means more compression and more quality loss.

The critical insight is that the quantization is guided by human perception. The encoder uses different quantization levels for different frequencies, matching the contrast sensitivity function of the human visual system. High-frequency detail that the eye cannot perceive anyway gets heavy quantization. Low-frequency information that defines the structure of the image gets lighter quantization.
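
The sketch below pushes one 8x8 residual block through that pipeline with scipy's DCT and a JPEG-style luminance quantization table. The table is borrowed only because its shape is easy to read (small divisors for low frequencies, large divisors for high frequencies); actual codecs define their own matrices:

```python
import numpy as np
from scipy.fft import dctn, idctn

# JPEG-style luminance quantization table: low-frequency coefficients (top-left)
# are divided by small values, high-frequency coefficients (bottom-right) by large
# ones. Illustrative only; each codec defines and scales its own matrices.
Q = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
])

rng = np.random.default_rng(0)
residual = rng.normal(0, 10, (8, 8))             # stand-in for a prediction residual

coeffs = dctn(residual, norm="ortho")            # spatial domain -> frequency domain
quantized = np.round(coeffs / Q)                 # most high-frequency coefficients become 0
reconstructed = idctn(quantized * Q, norm="ortho")

print("non-zero coefficients:", np.count_nonzero(quantized), "of 64")
print("mean absolute error:", np.abs(residual - reconstructed).mean().round(2))
```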

## How does temporal compression achieve even greater savings?

Temporal compression achieves greater savings by identifying areas of the frame that have not changed since the previous frame and only encoding the differences, which is why a static background compresses far better than fast-moving action.

Most video contains enormous amounts of temporal redundancy. A news anchor sitting at a desk: the background does not change for minutes at a time. A landscape shot in a documentary: the clouds move slowly, but the mountains are static. Even in fast-paced content, large areas of each frame remain similar to the previous one.

Video codecs exploit this with motion-compensated prediction. The encoder divides the frame into blocks and searches the previously decoded frame for a block that matches each one. If the camera panned to the right, the encoder does not need to re-encode the entire background — it just stores a motion vector saying "this block came from 47 pixels to the left." The decoder then reconstructs the frame by moving the previous frame's blocks according to these vectors.
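
A toy version of that motion search, using exhaustive block matching over a small window. Real encoders use hierarchical and predictive search strategies, so this is only meant to show the principle:

```python
import numpy as np

def find_motion_vector(prev_frame, curr_frame, top, left, block=16, search=8):
    """Exhaustive block matching: find where a block in the current frame
    came from in the previous frame, within a +/- search-pixel window."""
    target = curr_frame[top:top + block, left:left + block]
    best_cost, best_mv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > prev_frame.shape[0] or x + block > prev_frame.shape[1]:
                continue
            candidate = prev_frame[y:y + block, x:x + block]
            cost = np.abs(target - candidate).sum()   # sum of absolute differences
            if cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    return best_mv, best_cost

# Simulate a camera pan: the current frame is the previous one shifted 5 pixels right.
prev = np.random.default_rng(1).random((64, 64))
curr = np.roll(prev, 5, axis=1)
mv, cost = find_motion_vector(prev, curr, top=16, left=16)
print("motion vector:", mv, "residual cost:", round(float(cost), 3))  # (0, -5), ~0 residual
```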

The efficiency of temporal compression depends on the complexity of the motion. A simple panning shot requires almost nothing beyond motion vectors. A scene with complex motion, multiple moving objects, and camera shake requires more residual data to correct the prediction.

Video codecs organize frames into three types. I-frames (intra frames) are complete frames encoded without reference to any other frame — they are the largest but allow random access. P-frames (predicted frames) reference previous frames. B-frames (bi-directional frames) reference both previous and future frames, providing the highest compression efficiency. A typical encoding pattern might have one I-frame every two seconds, with P and B frames filling the gaps.
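
You can steer this structure directly when encoding. A sketch using FFmpeg's libx264 encoder, where `-g` sets the distance between I-frames and `-bf` caps the number of consecutive B-frames (file names and values are placeholders to adapt):

```python
import subprocess

# Encode with an I-frame every 60 frames (two seconds at 30 fps) and up to
# two consecutive B-frames between reference frames.
subprocess.run([
    "ffmpeg", "-i", "input.mov",
    "-c:v", "libx264",
    "-g", "60",          # GOP length: one I-frame every 60 frames
    "-bf", "2",          # allow up to 2 consecutive B-frames
    "-crf", "20",        # constant-quality mode
    "-c:a", "aac",
    "output.mp4",
], check=True)
```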

## Why do modern codecs use more computational power for encoding?

Modern codecs use dramatically more computational power for encoding because they are designed as collections of tools that can be selectively applied depending on the content — each tool adds compression efficiency but requires additional computation to evaluate.

Early codecs like MPEG-2 used a relatively small set of tools. The encoder made a limited number of decisions. Modern codecs like H.264, H.265, AV1, and VVC have dozens of coding tools, each providing a few percent improvement in compression efficiency. The encoder must decide which tools to use for each block of each frame, and the wrong decision can mean significantly worse compression.

This turns encoding into an optimization problem with an enormous search space. Should this block use intra prediction mode 3 or mode 17? Should it use a 4x4 transform or a 32x32 transform? Should it reference the previous frame or the frame before that? Should it use the sample adaptive offset filter? Each decision affects the others.

The result is that encoding is orders of magnitude more expensive than decoding. "Each generation of codec is 30% better compression but an order of magnitude more compression power," as Kieran Kunhya explains. A real-time software encoder for AV1 requires a powerful multi-core processor or GPU hardware acceleration. A decoder for the same stream can run on a mobile phone.
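
The knob that exposes this trade-off in practice is the encoder's speed setting. A sketch assuming an FFmpeg build with the libaom-av1 encoder enabled, whose `-cpu-used` option runs from 0 (slowest, most thorough search) to 8 (fastest); file names are placeholders:

```python
import subprocess
import time

# Same source, same quality target, two very different speed settings.
# The slower run searches more of the tool space and compresses better.
for speed in ("2", "8"):
    start = time.time()
    subprocess.run([
        "ffmpeg", "-y", "-i", "input.mov",
        "-c:v", "libaom-av1", "-crf", "30", "-b:v", "0",
        "-cpu-used", speed,
        f"av1_speed{speed}.mkv",
    ], check=True)
    print(f"-cpu-used {speed}: {time.time() - start:.0f} s")
```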

| Codec | Year | Compression vs MPEG-2 | Relative Encode Complexity |
|---|---|---|---|
| MPEG-2 | 1995 | Baseline | 1x |
| H.264 / AVC | 2003 | 2x to 3x better | 10x |
| H.265 / HEVC | 2013 | 4x to 5x better | 100x |
| AV1 | 2018 | 5x to 7x better | 500x to 1,000x |
| VVC / H.266 | 2020 | 7x to 8x better | 1,000x to 2,000x |

## What determines perceived video quality in compression?

Perceived video quality is determined by how well the compression artifacts match the characteristics of human vision — a codec that removes the wrong information will look worse at the same bitrate than one that targets the perceptual weaknesses of the eye.

Human vision is not a uniform sensor. The eye has high resolution only in a small central region (the fovea). Peripheral vision has much lower resolution but is highly sensitive to motion. The visual system is also highly adaptive — it adjusts to brightness levels, ignores constant stimuli, and fills in missing details based on context.

Video codecs exploit these properties in multiple ways. Chroma subsampling reduces color resolution because the eye is less sensitive to color detail. Quantization matrices apply more compression to high frequencies that the eye cannot resolve. Adaptive quantization allocates more bits to regions where artifacts are visible (smooth gradients, skin tones) and fewer bits to regions where they are not (textures, fast motion).

The most visible artifacts in compressed video reflect the limits of the codec. Blocking artifacts appear as visible squares at block boundaries when quantization is too aggressive. Ringing artifacts appear as ghost edges around sharp transitions. Banding artifacts appear as visible steps in smooth gradients. Each codec generation targets these specific artifacts with new tools.

AV1 introduced tools like the CDEF filter (Constrained Directional Enhancement Filter) specifically to reduce ringing artifacts without blurring genuine edges. The sample adaptive offset filter in H.265 reduces banding in smooth areas. These tools add decoding complexity but significantly improve perceived quality at the same bitrate.

## How does understanding compression help video editors?

Understanding compression helps video editors make better decisions about recording formats, editing codecs, export settings, and delivery parameters — decisions that directly affect the quality and efficiency of their work.

When recording, choosing the right codec and bitrate determines how much latitude you have in post-production. A highly compressed recording from a consumer camera will show artifacts when color graded. A high-bitrate recording in a professional codec like ProRes or DNxHD preserves more information for editing.

When editing, understanding the difference between intraframe codecs (where every frame is an I-frame, like ProRes) and interframe codecs (where most frames depend on neighbors, like H.264) explains why some formats are dramatically faster to edit. Interframe codecs force the NLE to decode every frame from the nearest preceding I-frame up to the one you want before it can display it, while intraframe codecs give instant access to any frame.
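
You can see this structure for yourself by asking ffprobe to list frame types. A sketch (the file name is a placeholder):

```python
import subprocess

# List the frame types (I, P, B) of the first video stream.
# An intraframe codec like ProRes reports only I; H.264 typically shows a mix.
result = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "v:0",
     "-show_entries", "frame=pict_type", "-of", "csv=p=0",
     "input.mp4"],
    capture_output=True, text=True, check=True,
)
frame_types = result.stdout.split()
print("".join(frame_types))   # e.g. "IBBPBBPBBP..." for a typical H.264 file
```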

When exporting, the choice of codec, bitrate, encoding speed preset, and resolution determines the balance between file size and quality. A slower encoding preset allows the encoder to search more thoroughly for compression opportunities, producing better quality at the same bitrate at the cost of longer export times.
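
As a concrete example, here is a hedged export sketch with FFmpeg and libx264 that touches each of those knobs. The values are illustrative starting points, not recommendations for any particular platform:

```python
import subprocess

# A delivery-oriented export: H.264 video, AAC audio, a slower preset for
# better compression at the same quality level. File names are placeholders.
subprocess.run([
    "ffmpeg", "-i", "master.mov",
    "-c:v", "libx264",
    "-preset", "slow",        # slower search, better quality per bit
    "-crf", "18",             # constant quality; lower means better and larger
    "-vf", "scale=1920:-2",   # resize to 1080p width, keep aspect ratio
    "-pix_fmt", "yuv420p",    # broad player compatibility
    "-c:a", "aac", "-b:a", "192k",
    "delivery.mp4",
], check=True)
```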

When delivering, understanding the constraints of the playback platform helps you choose the right format. YouTube prefers VP9 or AV1. Broadcast requires specific bitrates and color spaces. Social media platforms have their own optimized parameters. Choosing the wrong settings means your carefully edited video will not look its best on the viewer's screen.

<div class="not-prose blog-large-cta">
  <div class="max-w-3xl mx-auto text-center">
    <h3>
      Know your formats. Edit faster. Deliver better.
    </h3>
    <p>
      Understanding video compression helps you make smarter choices at every stage of your workflow. Cutsio complements that knowledge by handling the tedious pre-processing: upload your footage, remove silences with AI, generate transcripts, and export clean XML to your NLE.
    </p>
    <ul>
      <li>
        <svg class="h-6 w-6 text-emerald-400 shrink-0 mt-0.5" xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><polyline points="20 6 9 17 4 12"/></svg>
        <span>AI-powered silence removal and rough-cut assembly</span>
      </li>
      <li>
        <svg class="h-6 w-6 text-emerald-400 shrink-0 mt-0.5" xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><polyline points="20 6 9 17 4 12"/></svg>
        <span>Visual Intelligence search — find any frame by describing what you see</span>
      </li>
      <li>
        <svg class="h-6 w-6 text-emerald-400 shrink-0 mt-0.5" xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><polyline points="20 6 9 17 4 12"/></svg>
        <span>Clean XML/EDL exports to DaVinci Resolve, Final Cut Pro, or Premiere Pro</span>
      </li>
    </ul>
    <div class="flex flex-col sm:flex-row items-center justify-center gap-4">
      <a href="https://studio.cutsio.com" target="_blank" rel="noopener noreferrer"
         class="no-underline inline-flex items-center justify-center rounded-full bg-indigo-600 px-8 py-3.5 text-sm font-semibold text-white hover:bg-indigo-700 dark:bg-white dark:text-slate-900 dark:hover:bg-neutral-100 transition-colors shadow-sm">
        Try Cutsio Free
        <svg class="ml-2 h-4 w-4" xmlns="http://www.w3.org/2000/svg" width="24" height="24" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M5 12h14"/><path d="m12 5 7 7-7 7"/></svg>
      </a>
      <button type="button" onclick="window.dispatchEvent(new CustomEvent('open-contact-modal'))"
              class="inline-flex items-center justify-center rounded-full border border-white/20 px-8 py-3.5 text-sm font-medium text-white hover:bg-white/10 transition-colors">
        Book a demo
      </button>
    </div>
    <p class="mt-4 text-xs text-slate-500">No credit card required. 60 minutes of free processing.</p>
  </div>
</div>

## FAQ

**What is the difference between lossy and lossless video compression?**
Lossy compression discards information permanently to achieve smaller file sizes. Lossless compression preserves every pixel exactly. Video delivery always uses lossy compression. Professional editing uses lossless or visually lossless intermediate codecs.

**Why does 4K video from a streaming service look worse than a local 4K file?**
Streaming services use aggressive compression to deliver video over limited bandwidth. A local 4K file might use 50 Mbps, while a streaming service delivers the same resolution at 15 Mbps. The compression artifacts are more visible despite the same pixel count.

**Do newer codecs like AV1 make older codecs obsolete?**
Newer codecs gradually replace older ones as hardware support becomes widespread. H.264 remains the most universally compatible codec. AV1 provides better compression but requires more recent hardware for decoding.

**Why do my export settings matter so much for final quality?**
Export settings determine how much compression is applied to your final video. A low bitrate, high-speed encoding preset, or incorrect color space settings can introduce visible artifacts that degrade your carefully edited footage.

**What is the best codec for archiving video?**
For archiving, use a visually lossless or near-lossless codec like ProRes 4444, DNxHR HQ, or FFV1. These preserve maximum quality for future re-encoding while providing significant compression over uncompressed video.
