APR
02

The previous three posts describe the properties of the transport stream being interpreted. The first thing to note is that there is no problem with the MPEG transport stream coming from the capture device; if a player properly interprets the presentation times, the video will play correctly. However, very few tools I use seem to properly handle these timestamps.

To “fix” the video, I’ll need to alter it in such a way that the resulting stream will play correctly when every frame is played, and the frames are played at a fixed rate, rather than by timestamp. Remember, timestamps mean nothing to the software in question, so the only adjustments that matter are adding or dropping whole frames. Given that, let’s take a look at fixing the stream.

The First Problem - Start Time

To start in sync, the audio and video streams should start with frames from (close to) the same presentation time. The tool can either synthesize frames to extend the shorter stream, or drop frames to trim the longer stream. Deleting is always easier than creating, so I’ll go for that. An elementary (i.e. audio or video) packet filter is perfect for this.

using MattBlagden.Mpeg;
using MattBlagden.Mpeg.ElementaryStream;
using MattBlagden.Mpeg.TransportStream;

namespace MattBlagden.FrameSniper
{
    internal sealed class TimeLimiter : ElementaryPacketFilter
    {
        private Timestamp minimumTime;

        public TimeLimiter(ITransportPacketReader transportPacketReader,
            short transportStreamId, Timestamp minimumTime)
            : base(transportPacketReader, transportStreamId)
        {
            this.minimumTime = minimumTime;
        }

        protected override bool FilterElementaryPacket(byte elementaryStreamId,
            Timestamp presentationTime, byte[] content)
        {
            return presentationTime >= this.minimumTime;
        }
    }
}

Using this, any packets before a certain time will be dropped. So long as the specified time is at or after the beginning of both streams, both streams will begin with frames from (very close to) the same time.

The Second Problem - Frame Rate

In the video file from the previous post, the video source generated frames slightly too quickly. This resulted in a recording that contained slightly more frames than should be present for the frame rate. This problem can be fixed by dropping frames as well. This filter counts frames and calculates when they would be displayed if all frames are being displayed sequentially at a fixed rate. If the “sequential” display time is too much later than the correct display time, the frame is dropped.

The other consideration for video is which type of frame to drop. Not all frames are equal; way back in the second post of this series, I showed that some frames (B-frames and P-frames) depend on other frames (P-frames and I-frames). To prevent any video corruption, only frames with no dependents should be dropped. Videos have a fixed pattern of frame types (a Group of Pictures), so this filter also accepts the index of a position in the GoP that’s acceptable to drop.

using MattBlagden.Mpeg;
using MattBlagden.Mpeg.ElementaryStream;
using MattBlagden.Mpeg.TransportStream;
using MattBlagden.Mpeg.Video;

namespace MattBlagden.FrameSniper
{
    internal sealed class VideoFrameRateLimiter : ElementaryPacketFilter
    {
        private const int MillisecondsPerKilosecond = 1000000;

        private readonly Timestamp startTime;
        private readonly short sequenceNumberToDrop;
        private readonly int maximumSkew;
        private readonly int framesPerKilosecond;

        private long frameCount = 0;

        public VideoFrameRateLimiter(
            ITransportPacketReader transportPacketReader,
            short transportStreamId, Timestamp startTime,
            short sequenceNumberToDrop, int framesPerKilosecond,
            int maximumSkew)
            : base(transportPacketReader, transportStreamId)
        {
            this.startTime = startTime;
            this.sequenceNumberToDrop = sequenceNumberToDrop;
            this.framesPerKilosecond = framesPerKilosecond;
            this.maximumSkew = maximumSkew;
        }

        protected override bool FilterElementaryPacket(byte elementaryStreamId,
            Timestamp presentationTime, byte[] content)
        {
            VideoFrameProperties frameProperties =
                VideoFrame.GetFrameProperties(content);

            // When this frame would be shown if every kept frame so far
            // were played back-to-back at the nominal frame rate
            long sequentialPresentationTime =
                (frameCount * MillisecondsPerKilosecond) /
                this.framesPerKilosecond;

            // When this frame should be shown according to its timestamp
            long correctPresentationTime =
                (long)(presentationTime - startTime).TotalMilliseconds;

            // Positive skew: fixed-rate playback is running late
            long skew = sequentialPresentationTime - correctPresentationTime;

            if (skew >= maximumSkew &&
                frameProperties.SequenceNumber == this.sequenceNumberToDrop)
            {
                return false;
            }

            // Only kept frames count towards the fixed-rate timeline
            frameCount++;
            return true;
        }
    }
}
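To see the drop rule in isolation, here is a minimal sketch that needs none of the MPEG classes. It is not the tool's code: `SkewSimulation`, `CountDrops` and all of its parameters are names made up for illustration, and it assumes that frames arrive at a constant (slightly wrong) spacing and that the GoP has a fixed length.

```csharp
using System;

// Hypothetical, self-contained model of the drop rule described above.
public static class SkewSimulation
{
    private const long MillisecondsPerKilosecond = 1000000;

    // Returns how many frames get dropped from a stream of totalFrames frames
    // whose real spacing is actualFrameMicroseconds (assumed constant).
    public static int CountDrops(int totalFrames, long actualFrameMicroseconds,
        int framesPerKilosecond, int gopLength, int sequenceNumberToDrop,
        int maximumSkew)
    {
        long keptFrames = 0;
        int dropped = 0;

        for (int i = 0; i < totalFrames; i++)
        {
            // When the frame would be shown if kept frames play at the nominal rate
            long sequentialTime =
                (keptFrames * MillisecondsPerKilosecond) / framesPerKilosecond;

            // When the frame should be shown according to its (simulated) timestamp
            long correctTime = (i * actualFrameMicroseconds) / 1000;

            long skew = sequentialTime - correctTime;
            bool droppable = (i % gopLength) == sequenceNumberToDrop;

            if (skew >= maximumSkew && droppable)
            {
                dropped++;
                continue;
            }

            keptFrames++;
        }

        return dropped;
    }
}
```

With a spacing of 33,280 microseconds (about 0.26% fast, roughly what the sample recording shows), `SkewSimulation.CountDrops(10000, 33280, 29970, 12, 10, 50)` ends up dropping one frame every few hundred frames, always at GoP position 10. With a spacing of 33,367 microseconds, which is marginally slower than nominal, nothing is dropped.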

Although it wasn’t a problem in the sample video, it’s possible that audio frames exhibit similar behavior and may need to be dropped to maintain synchronization. A similar filter can selectively drop audio frames. Audio frames are independent, so any one can be dropped when out of sync.

using MattBlagden.Mpeg;
using MattBlagden.Mpeg.Audio;
using MattBlagden.Mpeg.ElementaryStream;
using MattBlagden.Mpeg.TransportStream;

namespace MattBlagden.FrameSniper
{
    internal sealed class AudioFrameRateLimiter : ElementaryPacketFilter
    {
        private const int SamplesPerFrame = 1152;
        private const int MillisecondsPerSecond = 1000;

        private readonly Timestamp startTime;
        private readonly int maximumSkew;

        private int frameCount = 0;

        public AudioFrameRateLimiter(
            ITransportPacketReader transportPacketReader,
            short transportStreamId, Timestamp startTime, int maximumSkew)
            : base(transportPacketReader, transportStreamId)
        {
            this.startTime = startTime;
            this.maximumSkew = maximumSkew;
        }

        protected override bool FilterElementaryPacket(byte elementaryStreamId,
            Timestamp presentationTime, byte[] content)
        {
            AudioProperties audioProperties =
                AudioFrame.GetAudioProperties(content);

            long sequentialPresentationTime = (long)frameCount * SamplesPerFrame *
                MillisecondsPerSecond / audioProperties.SamplesPerSecond;

            long correctPresentationTime =
                (long)(presentationTime - startTime).TotalMilliseconds;

            long skew = sequentialPresentationTime - correctPresentationTime;

            if (skew > maximumSkew)
            {
                return false;
            }

            frameCount++;
            return true;
        }
    }
}

Putting It All Together

The tool has a section of argument parsing and validation, followed by some DelegateElementaryPacketFilters to gather audio/video properties, just like in the previous blog post. There is one new DelegateElementaryPacketFilter to gather the GoP pattern:

bool doneGroupOfPictures = false;

Dictionary<short, VideoFrameType> groupOfPicturesFrameTypes =
    new Dictionary<short, VideoFrameType>();

transportPacketReader = new DelegateElementaryPacketFilter(
    transportPacketReader, videoTransportStreamId,
    (transportStreamId, presentationTime, content) =>
    {
        VideoFrameProperties frameProperties =
            VideoFrame.GetFrameProperties(content);

        short sequenceNumber = frameProperties.SequenceNumber;
        if (groupOfPicturesFrameTypes.ContainsKey(sequenceNumber))
        {
            doneGroupOfPictures = true;
        }

        groupOfPicturesFrameTypes[sequenceNumber] = frameProperties.FrameType;
        return true;
    });

After this comes the core functionality of the tool: dropping packets.

Timestamp startTime = (videoStartTime.Value > audioStartTime.Value) ?
    videoStartTime.Value : audioStartTime.Value;

short sequenceNumberToDrop = groupOfPicturesFrameTypes
    .Where(x => x.Value == VideoFrameType.BidirectionallyPredicted)
    .Max(x => x.Key);

Console.WriteLine("Video:");
Console.WriteLine("  Dimensions: {0}x{1}",
    videoProperties.Width, videoProperties.Height);
Console.WriteLine("  Aspect Ratio: {0}", videoProperties.AspectRatio);
Console.WriteLine("  Frame Rate: {0}.{1} frames/second",
    videoProperties.FramesPerKilosecond / 1000,
    videoProperties.FramesPerKilosecond % 1000);
Console.WriteLine();

Console.WriteLine("Audio:");
Console.WriteLine("  Channel Mode: {0}", audioProperties.ChannelMode);
Console.WriteLine("  Sample Rate: {0}.{1} KHz",
    audioProperties.SamplesPerSecond / 1000,
    audioProperties.SamplesPerSecond % 1000);
Console.WriteLine();

Console.WriteLine("Dropping:");
Console.WriteLine("  All frames before {0}",
    startTime.ToTimeSpan().ToString(@"hh\:mm\:ss\.fff"));
Console.Write("  Frame #{0} of GOP (", sequenceNumberToDrop);
groupOfPicturesFrameTypes.OrderBy(frame => frame.Key).ToList()
    .ForEach(frame => Console.Write(frame.Value.ToString().Substring(0, 1)));
Console.WriteLine(") when more than {0}ms out of sync", skewTolerance);

input.Position = 0;
transportPacketReader = new ContinuityCounterVerifier(transportStreamReader);

transportPacketReader = new TimeLimiter(transportPacketReader,
    videoTransportStreamId, startTime);
transportPacketReader = new TimeLimiter(transportPacketReader,
    audioTransportStreamId, startTime);

transportPacketReader = new VideoFrameRateLimiter(transportPacketReader,
    videoTransportStreamId, startTime, sequenceNumberToDrop,
    videoProperties.FramesPerKilosecond, skewTolerance);
transportPacketReader = new AudioFrameRateLimiter(transportPacketReader,
    audioTransportStreamId, startTime, skewTolerance);

transportPacketReader = new ContinuityCounterAssigner(transportPacketReader);

TransportStreamWriter transportStreamWriter = new TransportStreamWriter(output);
while (transportPacketReader.TryReadPacket(out transportPacket))
{
    transportStreamWriter.Write(transportPacket);
}

The code starts off by selecting a common start time, which is simply the later of the audio start time and video start time. Next, the group-of-pictures pattern is inspected to select the most appropriate part to drop. The selected frame is the last B-frame of the GoP. A B-frame is chosen as it is not a dependency for any other frames.
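The B-frame selection can be tried on its own with a plain string standing in for the gathered GoP pattern. This is only a sketch: `GopSketch` and `LastBFrameIndex` are hypothetical names, and the pattern is assumed to be one character per frame in display order, as in the tool's console output.

```csharp
using System;
using System.Linq;

// Hypothetical helper: picks the drop position from a GoP pattern string.
public static class GopSketch
{
    // Returns the highest display-order index that holds a 'B',
    // or -1 if the pattern contains no B-frames at all.
    public static int LastBFrameIndex(string gopPattern)
    {
        return gopPattern
            .Select((frameType, index) => new { frameType, index })
            .Where(x => x.frameType == 'B')
            .Select(x => x.index)
            .DefaultIfEmpty(-1)
            .Max();
    }
}
```

For the pattern `BBIBBPBBPBBP`, `GopSketch.LastBFrameIndex` returns 10. The `DefaultIfEmpty(-1)` is a choice made for this sketch; calling `Max` directly on an empty sequence of a non-nullable value type throws an `InvalidOperationException`, so a stream with no B-frames would need some handling of its own.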

A chain of readers is then created to:

  • Validate the integrity of the incoming stream before it’s edited (by checking that the continuity counters on each stream are, in fact, continuous)
  • Drop any video frames before the common start time
  • Drop any audio frames before the common start time
  • Drop the last B-frame of a GoP if the video is out of sync
  • Drop audio frames if the audio is out of sync
  • Reassign continuity counter values, as they are now discontinuous

Any transport packets that successfully pass through the filters are written to the output file.

The only command-line configuration option (other than the file paths) is the loss-of-synchronization tolerance (or “skew tolerance”). At 29.97 frames per second, one frame lasts approximately 33 milliseconds. The tool should wait until the streams are at least 16 milliseconds out of sync before dropping a video frame, as anything less than that would only worsen the synchronization issues (e.g. if a frame was dropped when the video is just one millisecond behind the audio, the video would then be 32 milliseconds ahead of the audio, which is worse than if nothing was done at all). In fact, the stream may go slightly out of sync and regain sync within a few frames, so a bit more room than 16 milliseconds is best to ensure the tool only drops frames when the video is actually losing sync.
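The half-a-frame reasoning is easy to check numerically. The sketch below is illustrative only (`SkewToleranceSketch` and its methods are made-up names) and assumes that dropping one frame shifts the video by exactly one nominal frame duration.

```csharp
using System;

// Hypothetical helper for reasoning about the skew tolerance.
public static class SkewToleranceSketch
{
    // Duration of one frame in milliseconds at the given nominal rate.
    public static double FrameDurationMs(int framesPerKilosecond)
    {
        return 1000000.0 / framesPerKilosecond;
    }

    // Absolute sync error (ms) that remains after dropping one frame
    // while the video is skewMs behind.
    public static double ErrorAfterDrop(double skewMs, int framesPerKilosecond)
    {
        return Math.Abs(skewMs - FrameDurationMs(framesPerKilosecond));
    }

    // True only when the drop leaves the streams closer together than before.
    public static bool DropImprovesSync(double skewMs, int framesPerKilosecond)
    {
        return ErrorAfterDrop(skewMs, framesPerKilosecond) < Math.Abs(skewMs);
    }
}
```

At 29.97 fps, a drop at 1 ms of skew leaves about 32 ms of error, so `DropImprovesSync(1, 29970)` is false. A drop at 50 ms leaves about 17 ms, so `DropImprovesSync(50, 29970)` is true. The break-even point is half a frame, a little under 17 ms.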

For 29.97 fps NTSC video I use a tolerance of 50 milliseconds. 50 milliseconds is small enough that it’s not a noticeable loss of synchronization, but large enough to be more than just slight wobble in the stream (it’s more than a whole video frame out of sync).

Video:
  Dimensions: 720x480
  Aspect Ratio: FourThree
  Frame Rate: 29.970 frames/second

Audio:
  Channel Mode: Stereo
  Sample Rate: 48.0 KHz

Dropping:
  All frames before 00:00:38.690
  Frame #10 of GOP (BBIBBPBBPBBP) when more than 50ms out of sync

Finally, even in slow-mo, my audio and video are just right.

MAR
31

The previous two posts described MPEG encoding at a high level. The important thing to note is that MPEG transport streams differ from many other formats in the way that frames are timed. Many video formats are simply a sequence of frames (or blocks of frames) that are presented at a regular interval (the “frame rate”). MPEG transport streams, however, have a “presentation time” assigned to each frame. In theory, a decoder could properly display an MPEG transport stream without any notion of “frame rates” or “sample rates”; the decoder would simply need to display each frame at the labeled time, and the output would be correct. That said, the particular stream I have is recorded from a device that produces frames at regular intervals.

Whenever I transcode one of these videos (or even play them in certain players), the audio and video are out of sync. The video is out of sync by a small amount right from the beginning, and the later sections of the video are even more out of sync. My guess is that these applications are processing frames according to the frame-rate, rather than handling each frame at its labeled presentation time. Time for some investigation! Be warned: this is just throwaway code to examine a video, so it might be a bit on the ugly side (and will certainly be lacking proper error-handling).

First up is a simple helper method. Recall that a single MPEG transport stream can carry many programs (e.g. TV channels), and each program can have multiple streams (e.g. video, English audio, French audio, closed captions, etc.). Thus, the first thing I need to do is get the IDs of the streams I care about.

private static void GetProgramStreams(
    ITransportPacketReader transportPacketReader, int programNumber,
    out short videoTransportStreamId, out short audioTransportStreamId)
{
    // Get the list of programs
    ProgramListReader programListReader =
        new ProgramListReader(transportPacketReader);
    ProgramMapStreamId[] programMapStreamIds;
    programListReader.TryReadProgramList(out programMapStreamIds);

    // Get the map for the desired program
    IEnumerable<ProgramMapStreamId> mapStream =
        programMapStreamIds.Where(x => x.ProgramNumber == programNumber);
    ProgramMapReader programMapReader =
        new ProgramMapReader(transportPacketReader, programMapStreamIds);
    ProgramMap programMap;
    programMapReader.TryReadProgramMap(out programMap);

    // Get the first video and audio streams for the program
    videoTransportStreamId = programMap.Entries
        .Where(x => x.StreamType == ProgramStreamType.Mpeg2Video)
        .First().TransportStreamId;
    audioTransportStreamId = programMap.Entries
        .Where(x => x.StreamType == ProgramStreamType.Mpeg2Audio)
        .First().TransportStreamId;
}

static void Main(string[] args)
{
    string filePath = args[0];
    FileStream fileStream = File.OpenRead(filePath);
    ITransportPacketReader reader = new TransportStreamReader(fileStream);

    short videoTransportStreamId;
    short audioTransportStreamId;
    GetProgramStreams(reader, 1, out videoTransportStreamId,
        out audioTransportStreamId);
}

The ElementaryPacketFilter class accumulates transport packets from a specified stream until a full elementary packet can be formed (e.g. a video frame or audio frame). The elementary packet is then either dropped or passed on to lower layers, depending on the return value from a filter method. The DelegateElementaryPacketFilter is an implementation that calls a delegate to perform the filtering. Using this, I can gather some information: the start time, duration, frame count, and frame properties for each of the video and audio streams.

// Start from the beginning
fileStream.Position = 0;

// Gather video stream statistics
VideoProperties videoProperties = null;
int videoFrameCount = 0;
Timestamp videoStartTime = Timestamp.Maximum;
Timestamp videoEndTime = Timestamp.Zero;

reader = new DelegateElementaryPacketFilter(reader, videoTransportStreamId,
    (elementaryStreamId, presentationTime, content) =>
    {
        videoFrameCount++;

        if (presentationTime < videoStartTime)
        {
            videoStartTime = presentationTime;
        }

        if (presentationTime > videoEndTime)
        {
            videoEndTime = presentationTime;
        }

        if (videoProperties == null)
        {
            if (!VideoFrame.TryGetVideoProperties(content, out videoProperties))
            {
                videoProperties = null;
            }
        }

        return false;
    });

// Gather audio stream statistics
AudioProperties audioProperties = null;
int audioFrameCount = 0;
Timestamp audioStartTime = Timestamp.Maximum;
Timestamp audioEndTime = Timestamp.Zero;

reader = new DelegateElementaryPacketFilter(reader, audioTransportStreamId,
    (elementaryStreamId, presentationTime, content) =>
    {
        audioFrameCount++;

        if (presentationTime < audioStartTime)
        {
            audioStartTime = presentationTime;
        }

        if (presentationTime > audioEndTime)
        {
            audioEndTime = presentationTime;
        }

        if (audioProperties == null)
        {
            audioProperties = AudioFrame.GetAudioProperties(content);
        }

        return false;
    });

// Process the entire stream
TransportPacket transportPacket;
while (reader.TryReadPacket(out transportPacket));

fileStream.Close();

// Display the results!
Console.WriteLine("Video:");
Console.WriteLine("  Start: {0}", videoStartTime.ToTimeSpan());
Console.WriteLine("  Duration: {0}", videoEndTime - videoStartTime);
Console.WriteLine("  Frames: {0}", videoFrameCount);
Console.WriteLine("  Frame rate: {0}.{1} frames/second",
    videoProperties.FramesPerKilosecond / 1000,
    videoProperties.FramesPerKilosecond % 1000);

Console.WriteLine();

Console.WriteLine("Audio: ");
Console.WriteLine("  Start: {0}", audioStartTime.ToTimeSpan());
Console.WriteLine("  Duration: {0}", audioEndTime - audioStartTime);
Console.WriteLine("  Frames: {0}", audioFrameCount);
Console.WriteLine("  Sample rate: {0}.{1} KHz",
    audioProperties.SamplesPerSecond / 1000,
    audioProperties.SamplesPerSecond % 1000);

And the output:

Video:
  Start: 00:00:38.2514111
  Duration: 01:04:51.2928666
  Frames: 116927
  Frame rate: 29.970 frames/second

Audio:
  Start: 00:00:37.3577555
  Duration: 01:04:51.7094111
  Frames: 162155
  Sample rate: 48.0 KHz

This output confirms my suspicions above.

The First Problem

As soon as the video begins, the audio plays about a second too late. The output above supports this: the audio stream starts at 37.36 seconds and the video at 38.25 seconds, so there are roughly 0.9 seconds of audio frames before the first video frame. When the two streams are simply played simultaneously at a fixed rate, without regard for MPEG presentation timestamps, the audio will be from about a second earlier than the video. This gives the viewer the impression that the audio is “late”: when the video from 00:00:40 is on screen, audio from 00:00:39 will be played. The audio for 00:00:40 won’t play until the video has progressed to 00:00:41.
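The size of that head start can be worked out directly from the two start times. The helper below is a sketch with made-up names (`StartOffsetSketch`, `Offset`, `LeadingAudioFrames`); it uses plain `TimeSpan` values rather than the `Timestamp` type from the listings, and assumes 1152 samples per audio frame.

```csharp
using System;

// Hypothetical helper: how much audio precedes the first video frame.
public static class StartOffsetSketch
{
    private const int SamplesPerFrame = 1152;

    // How far the audio start precedes the video start.
    public static TimeSpan Offset(TimeSpan videoStart, TimeSpan audioStart)
    {
        return videoStart - audioStart;
    }

    // Number of whole audio frames that fit inside the offset.
    public static int LeadingAudioFrames(TimeSpan offset, int samplesPerSecond)
    {
        double frameSeconds = (double)SamplesPerFrame / samplesPerSecond;
        return (int)Math.Floor(offset.TotalSeconds / frameSeconds);
    }
}
```

Feeding in the start times from the output above (00:00:38.2514111 for video, 00:00:37.3577555 for audio) gives an offset of 0.8936556 seconds, which at 48 kHz is 37 whole audio frames of 24 ms each.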

The Second Problem

Later portions of the video have even worse synchronization issues. Sampling several sections of the video revealed that the audio is about a second late at the beginning, but synchronization slowly improves. Some time into the video, the audio and video are actually in sync for a while. As the video progresses, the audio continues to play earlier and earlier, and the streams are out of sync in the opposite direction (the audio plays far too soon). This implies that (at least) one of the two streams is being played at an incorrect rate.

The timestamps indicate that the audio should be played over a span of 1:04:51.7. Let’s see what happens if the data is just played sequentially, at the rates indicated in the stream properties.

There are 162,155 audio frames full of samples, being played back at 48,000 samples per second. The frames in this stream hold a fixed 1152 samples each (the frame size for MPEG audio Layer II; Layer I frames hold only 384).

162155 frames × 1152 samples/frame ÷ 48000 samples/second = 3891.7 seconds = 01:04:51.7

Perfect! Playing the audio at a fixed rate will play it over the same duration as playing it according to the timestamps. How about the video?

116927 frames ÷ 29.970 frames/second = 3901.5 seconds = 01:05:01.5

If played at a fixed rate of 29.970 frames per second, the last frame would end up being played a full ten seconds later than it should be. This explains the shift over time; the longer the video plays at an incorrect rate, the further it will drift from the proper presentation time.
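The two calculations are simple enough to script, which helps when checking other recordings. This is a sketch rather than part of the investigation code; `FixedRateSketch` and its method names are invented for illustration.

```csharp
using System;

// Hypothetical helper: how long a stream lasts if played at its nominal rate.
public static class FixedRateSketch
{
    // Audio: frames * samples per frame / samples per second.
    public static TimeSpan AudioDuration(long frameCount, int samplesPerFrame,
        int samplesPerSecond)
    {
        return TimeSpan.FromSeconds(
            (double)frameCount * samplesPerFrame / samplesPerSecond);
    }

    // Video: frames / (frames per kilosecond / 1000).
    public static TimeSpan VideoDuration(long frameCount, int framesPerKilosecond)
    {
        return TimeSpan.FromSeconds(frameCount * 1000.0 / framesPerKilosecond);
    }
}
```

`FixedRateSketch.AudioDuration(162155, 1152, 48000)` comes to 3891.72 seconds, matching the timestamps, while `FixedRateSketch.VideoDuration(116927, 29970)` comes to roughly 3901.47 seconds, about ten seconds longer than the 3891.29 seconds the video timestamps span.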

Why?

For the first problem: audio and video are captured independently, so the capture device simply started recording them at slightly different times (perhaps from waiting to fill a buffer before emitting data, or doing different amounts of preprocessing). This is fine, because all of the data is stamped with the capture time, so even if it’s actually stored some time later, a (properly-implemented) decoder can still present everything at the appropriate time.

As for the second problem, I’m guessing that the video frames were generated at a rate that differed ever so slightly from the correct NTSC frame rate. If you read the previous couple of posts, you’ll know this is determined by the video source, rather than the capture device. The video source device in question is from a couple of decades ago, so I’m not too surprised that the timing is ever so slightly off (a ten second deviation over a span of 3900 seconds is still over 99.7% accurate).

This doesn’t matter when being displayed on a TV, because the frames are traced out whenever the video source emits a new-frame signal, not at a perfectly fixed rate. This also doesn’t matter for the capture device, as it simply records a frame whenever the video source signals a new frame is beginning, and stores the time of that event so it can be played back at the proper time. The only thing that causes problems is the software that ignores these timestamps and assumes the video source generated frames at exactly 29.970 frames per second. The next post will show what can be done about it.

NOV
07

As of the last post, we had captured and digitized both the audio and video streams. Even though this is a “standard definition” stream, there’s still tons of data. Before the data goes anywhere (even over USB from the capture device to the PC), it needs to be compressed.

Both the audio and video streams are split into frames. The most straightforward way to reduce the data size is to compress the contents of individual frames. This type of compression is performed on both the audio and video streams, and is sufficient for audio, so we’ll just pay attention to video.

Compressing individual frames in isolation still results in a gigantic video stream, so video needs some additional processing. Video frames are typically very similar to adjacent frames (a video consisting of completely dissimilar frames would be quite disorienting), so compression can take advantage of these similarities and compress frames relative to each other.

For example, if frame Y is a darker version of frame X, it’s much shorter to say “take frame X and make it 10% darker”, instead of storing the entire slightly-darker frame. Similarly, if the video is panning, the encoder could say “take frame Y and move it a bit to the left, then fill in the now-empty right-hand column with this new data” instead of storing the entire slightly-shifted image. The actual encoding is a bit less specific than this, but has the same effect.

The encoder collects several frames into a “Group of Pictures” (or GoP), and then compresses the frames relative to each other, using three types of dependencies:

  • Intra-frame (I-Frame) - a standalone frame that contains an entire image. In MPEG-2 it’s referred to as an I-Frame, but across a variety of encodings this type of frame is called a “standalone” frame or “key” frame.
  • Predicted Frame (P-Frame) - a frame that is predicted from the previous I-Frame or P-Frame.
  • Bi-directionally Predicted Frame (B-Frame) - a frame that is predicted from two other frames: the last I-Frame or P-Frame before it, and the next I-Frame or P-Frame after it.

MPEG 3

What’s interesting about this ordering is that the quality of (or damage to) I-frames and P-frames will affect other frames that depend on the I-frame or P-frame (e.g. taking a corrupt image and making it 10% darker or shifting it to the left produces another corrupt image). Subsequent frames may then depend on these now-corrupted frames and also become corrupted. A change to a single frame may cause a chain reaction, damaging many frames that follow. B-frames, however, do not affect any other frames; none of the frame types may depend on the contents of B-frames. This will come in handy later.

Because of the new dependencies, these frames can’t be decoded in their current order; some frames depend on the contents of subsequent frames. To address this, each frame is assigned a sequence number indicating the display order, then rearranged so that they can be decoded in the order they are received.

MPEG 4

Each frame is moved to appear after its dependencies, so the decoder has everything it requires to decode a frame as soon as the frame arrives. The display will, of course, have to reorder the decoded frames according to the display order.
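A rough sketch of that rearrangement, working on a string of frame types in display order, might look like the following. The names `ReorderSketch` and `DecodeOrder` are invented for illustration. It is also a simplification: it treats one run of frames in isolation and ignores the fact that B-frames at the start of a group can also depend on the last reference frame of the previous group.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: turns display order into decode order.
public static class ReorderSketch
{
    // Takes frame types in display order (e.g. "BBIBBP") and returns the
    // display-order indices in the order a decoder needs to receive them:
    // each I- or P-frame goes before the B-frames that precede it on screen.
    public static List<int> DecodeOrder(string displayOrderTypes)
    {
        List<int> decodeOrder = new List<int>();
        List<int> pendingBFrames = new List<int>();

        for (int i = 0; i < displayOrderTypes.Length; i++)
        {
            if (displayOrderTypes[i] == 'B')
            {
                // A B-frame has to wait for the next reference frame
                pendingBFrames.Add(i);
            }
            else
            {
                // Emit the reference frame, then the B-frames waiting on it
                decodeOrder.Add(i);
                decodeOrder.AddRange(pendingBFrames);
                pendingBFrames.Clear();
            }
        }

        // Trailing B-frames would really wait for the next group's I-frame;
        // this sketch just appends them.
        decodeOrder.AddRange(pendingBFrames);
        return decodeOrder;
    }
}
```

For a display-order pattern such as `BBIBBPBBPBBP`, the result is 2, 0, 1, 5, 3, 4, 8, 6, 7, 11, 9, 10: every B-frame arrives after both of the frames it is predicted from.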

The last thing the encoder does is wrap everything in a transport stream. The transport stream can contain multiple “programs” (think of TV channels), and each program contains several streams of data (perhaps a video stream, an English audio stream, Spanish audio stream, closed-caption data stream, and so on). To differentiate all these various streams, each stream is assigned a unique ID. Stream 0 is a list that indicates the stream ID of each program’s “map”. The program map is a list that indicates the stream ID of each component belonging to that program.

MPEG 5
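The two-level lookup described above (stream 0 lists the program maps, and each map lists the component streams) can be modeled with a pair of dictionaries. This is a toy model only: `ProgramLookupSketch`, `FindStreamId`, the string component labels, and any stream IDs used with it are assumptions for illustration, not the transport stream's binary layout.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy model of the program list and program maps.
public static class ProgramLookupSketch
{
    // programList: program number -> stream ID of that program's map
    //              (the contents of stream 0).
    // programMaps: map stream ID -> (component label -> component stream ID).
    public static int FindStreamId(
        Dictionary<int, int> programList,
        Dictionary<int, Dictionary<string, int>> programMaps,
        int programNumber, string component)
    {
        // First hop: stream 0 says where the program's map lives
        int mapStreamId = programList[programNumber];

        // Second hop: the map says where the component lives
        return programMaps[mapStreamId][component];
    }
}
```

If program 1 had its map on stream 32, and that map listed video on stream 33 and audio on stream 34 (all made-up numbers), then looking up program 1's "video" component would return 33.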

As a receiver may begin receiving the stream at any point in time, the metadata (the data in stream 0, and program map streams) is repeated periodically. This stream is the final data that is stored from the capture device. The next post will detail the problems that arose when using this stream, and the final post will dig into the strategy and code to fix these problems.

OCT
24

Recently I have been capturing and processing MPEG-2 videos, and have run into some video processing trouble. The only software I can find that successfully fixes the problems costs about fifty bucks, and has a really ugly UI. That’s about all it takes for me to want to create my own solution, so it’s time to write an MPEG-2 processing tool!

I haven’t written any MPEG-related tools before this one, so the first step is to understand what’s actually going on. This post - the first in a series of four - focuses on how audio and video data is acquired. Let’s take a look at capturing an analog stream.

Audio and video come from an analog source and arrive at the capture device. The voltage level on the audio connection traces out the audio waveform over time. The MPEG-2 file will store (an approximation of) this waveform, so the capture device simply samples the voltage frequently, recording a value representing the current level each time. Sequential sets of samples are grouped together, and then stored in a frame labeled with the capture time.

MPEG 1

Of course, this single waveform only represents a single channel of audio. If the source provided stereo audio, there would be two signals. 5.1 audio would have 6 signals, and so on.

Next up is video, which also arrives in analog form (as a waveform over time). MPEG-2 stores video as frames, not just a waveform, so the capture device will have to do more than just record level samples. The video source emits a start-of-frame marker (shown as a spike below), then adjusts the signal level over time to indicate the color of each pixel in the frame. Whenever the capture device sees this sequence, it accumulates the pixel values over time, and then emits an assembled frame along with the time at which it was captured.

MPEG 2

Audio is one-dimensional and continuous; the signal indicates the speaker’s position at any given time. Hence, the capture device can simply examine the current audio level at any instant, and receive a standalone sample of audio data.

Although the video signal is continuous, it actually represents discrete frames that are traced out over time. The capture device can’t take an instantaneous sample and receive the current frame; it must wait for a start-of-frame trigger, and then accumulate the frame’s contents over time (until the next start-of-frame marker).

In other words, audio sampling can be driven by the capture device, sampling at whatever frequency and interval the capture device pleases. The timing of video capture, however, is dictated entirely by the video source.

There’s a lot more information in the video signal, all of which is completely irrelevant to this discussion, so I’ve left it out. If you care about these extra bits, such as start-of-line markers, odd/even field identifiers for interlaced video, color bursts, and so on, feel free to read some documentation on your favorite analog video signalling standard. Otherwise, just check back later for the next post, which will give a high-level view of MPEG-2 video compression.