Hardware Video Encoding with VideoToolbox: Zero to HEVC in 1ms

VideoToolbox is Apple's framework for hardware-accelerated video encoding and decoding. On Apple Silicon, the dedicated Media Engine can encode 4K HEVC in under 1 millisecond — making real-time screen streaming not just possible, but trivial from a computational perspective.

This post covers configuring VideoToolbox for the lowest possible latency in a remote desktop streaming context.

Why Hardware Encoding Matters

| Approach | Encode Time (1080p60) | CPU Usage | Power |
|----------|----------------------|-----------|-------|
| x264 (software) | 15-30ms | 100%+ | High |
| VideoToolbox H.264 | ~2ms | <5% | Low |
| VideoToolbox HEVC | ~2ms | <5% | Low |
| VideoToolbox HEVC (M4) | <1ms | <2% | Minimal |

Astropad Workbench uses VideoToolbox exclusively (VideoToolbox.framework is linked in the binary); its internal module liquid_codec::video_encoder::apple wraps the VTCompressionSession API.

The Encoding Pipeline

CVPixelBuffer (from ScreenCaptureKit)
    → VTCompressionSession (hardware encoder)
    → CMSampleBuffer (encoded NAL units)
    → Extract H.264/HEVC bitstream
    → Network transport

Step 1: Create the Compression Session

The session-level output callback is a C function pointer, so it cannot capture Swift context; the encoder object travels through the refcon pointer instead (the VideoEncoder class name below stands in for whatever type owns the session):

var session: VTCompressionSession?

// C-convention callback: no captures allowed, so `self` arrives via refcon.
// `VideoEncoder` is an illustrative name for the class that owns this session.
let outputCallback: VTCompressionOutputCallback = { refcon, _, status, _, sampleBuffer in
    guard status == noErr, let sampleBuffer = sampleBuffer, let refcon = refcon else { return }
    let encoder = Unmanaged<VideoEncoder>.fromOpaque(refcon).takeUnretainedValue()
    encoder.handleEncodedFrame(sampleBuffer)
}

let status = VTCompressionSessionCreate(
    allocator: kCFAllocatorDefault,
    width: Int32(width),
    height: Int32(height),
    codecType: kCMVideoCodecType_HEVC,  // or kCMVideoCodecType_H264
    encoderSpecification: nil,          // nil = use the default (hardware) encoder
    imageBufferAttributes: nil,
    compressedDataAllocator: nil,
    outputCallback: outputCallback,
    refcon: Unmanaged.passUnretained(self).toOpaque(),
    compressionSessionOut: &session
)

guard status == noErr, let session = session else {
    fatalError("Failed to create compression session: \(status)")
}

Step 2: Configure for Real-Time Streaming

This is where most developers go wrong. The default settings optimize for file recording, not streaming.

// CRITICAL: Enable real-time encoding
VTSessionSetProperty(session, key: kVTCompressionPropertyKey_RealTime, value: kCFBooleanTrue)

// Disable B-frames (they add latency due to reordering)
VTSessionSetProperty(session, key: kVTCompressionPropertyKey_AllowFrameReordering, value: kCFBooleanFalse)

// Keyframe interval — every 2 seconds for recovery
VTSessionSetProperty(session, key: kVTCompressionPropertyKey_MaxKeyFrameIntervalDuration, value: 2.0 as CFNumber)

// Target bitrate — 20 Mbps for Retina quality
VTSessionSetProperty(session, key: kVTCompressionPropertyKey_AverageBitRate, value: 20_000_000 as CFNumber)

// Expected frame rate — helps the rate controller
VTSessionSetProperty(session, key: kVTCompressionPropertyKey_ExpectedFrameRate, value: 60 as CFNumber)

// Profile — Main for broad compatibility
VTSessionSetProperty(session, key: kVTCompressionPropertyKey_ProfileLevel,
                     value: kVTProfileLevel_HEVC_Main_AutoLevel)

// Allow open GOP — slightly better compression
VTSessionSetProperty(session, key: kVTCompressionPropertyKey_AllowOpenGOP, value: kCFBooleanTrue)

// Prepare the session
VTCompressionSessionPrepareToEncodeFrames(session)
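
Not every key is supported by every encoder, so it helps to check the status that each VTSessionSetProperty call returns. A minimal sketch (the setOrWarn helper is illustrative, not part of VideoToolbox):

import VideoToolbox

// Illustrative helper: apply a property and log when the encoder rejects it.
func setOrWarn(_ session: VTCompressionSession, _ key: CFString, _ value: CFTypeRef) {
    let status = VTSessionSetProperty(session, key: key, value: value)
    if status == kVTPropertyNotSupportedErr {
        print("Property \(key) is not supported by this encoder")
    } else if status != noErr {
        print("Failed to set \(key): \(status)")
    }
}

setOrWarn(session, kVTCompressionPropertyKey_RealTime, kCFBooleanTrue)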

Key insight from Lumen (Sunshine fork): The upstream Sunshine had a bug on Apple Silicon where it set max_ref_frames=1 for H.264, forcing every frame to be a keyframe. This tripled bandwidth. The fix: only set max_ref_frames=1 for HEVC, not H.264. Let the encoder use P-frames.

Step 3: Encode Frames

func encode(pixelBuffer: CVPixelBuffer, presentationTime: CMTime) {
    let duration = CMTime(value: 1, timescale: 60) // 60fps

    VTCompressionSessionEncodeFrame(
        session,
        imageBuffer: pixelBuffer,
        presentationTimeStamp: presentationTime,
        duration: duration,
        frameProperties: nil,
        sourceFrameRefcon: nil,
        infoFlagsOut: nil   // encoded output arrives in the session-level callback
    )
}

The CVPixelBuffer from ScreenCaptureKit is IOSurface-backed, so VideoToolbox processes it directly on the GPU — zero CPU copy.
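
If you want to confirm that a captured frame really is IOSurface-backed before handing it to the encoder, a quick check (a sketch, not something the pipeline requires) looks like this:

import CoreVideo

// Sketch: true when the pixel buffer is backed by an IOSurface (zero-copy path).
func isIOSurfaceBacked(_ pixelBuffer: CVPixelBuffer) -> Bool {
    return CVPixelBufferGetIOSurface(pixelBuffer) != nil
}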

Step 4: Extract the Bitstream

The output handler receives CMSampleBuffer containing encoded data. Extract it:

func handleEncodedFrame(_ sampleBuffer: CMSampleBuffer) {
    // Check if this is a keyframe
    let attachments = CMSampleBufferGetSampleAttachmentsArray(sampleBuffer, createIfNecessary: false)
    let isKeyframe: Bool
    if let dict = (attachments as? [[CFString: Any]])?.first {
        isKeyframe = !(dict[kCMSampleAttachmentKey_NotSync] as? Bool ?? false)
    } else {
        isKeyframe = true
    }

    // For keyframes, extract parameter sets (SPS/PPS for H.264, VPS/SPS/PPS for HEVC)
    if isKeyframe {
        extractParameterSets(from: sampleBuffer)
    }

    // Get the encoded data
    guard let dataBuffer = CMSampleBufferGetDataBuffer(sampleBuffer) else { return }
    var length: Int = 0
    var dataPointer: UnsafeMutablePointer<Int8>?
    CMBlockBufferGetDataPointer(dataBuffer, atOffset: 0, lengthAtOffsetOut: nil,
                                totalLengthOut: &length, dataPointerOut: &dataPointer)

    guard let pointer = dataPointer else { return }
    let data = Data(bytes: pointer, count: length)

    // data now contains AVCC/HVCC formatted NAL units
    // Convert to Annex B format for network transport if needed
    let pts = CMSampleBufferGetPresentationTimeStamp(sampleBuffer)
    sendToNetwork(data: data, isKeyframe: isKeyframe, pts: pts)
}
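
VideoToolbox emits length-prefixed (AVCC/HVCC) NAL units: each unit is preceded by a big-endian length field rather than an Annex B start code. If your transport or decoder expects Annex B, a minimal conversion sketch (assuming 4-byte length prefixes; confirm via nalUnitHeaderLengthOut in Step 5):

import Foundation

// Sketch: convert length-prefixed (AVCC/HVCC) NAL units to Annex B start-code format.
// Assumes 4-byte length prefixes.
func annexB(from lengthPrefixed: Data) -> Data {
    let startCode: [UInt8] = [0x00, 0x00, 0x00, 0x01]
    var output = Data()
    var offset = 0
    while offset + 4 <= lengthPrefixed.count {
        // Read the 4-byte big-endian NAL unit length
        let nalLength = lengthPrefixed.subdata(in: offset..<offset + 4)
            .reduce(0) { ($0 << 8) | Int($1) }
        offset += 4
        guard offset + nalLength <= lengthPrefixed.count else { break }
        output.append(contentsOf: startCode)
        output.append(lengthPrefixed.subdata(in: offset..<offset + nalLength))
        offset += nalLength
    }
    return output
}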

Step 5: Extract Parameter Sets

The decoder needs SPS/PPS (H.264) or VPS/SPS/PPS (HEVC) to initialize:

func extractParameterSets(from sampleBuffer: CMSampleBuffer) {
    guard let formatDesc = CMSampleBufferGetFormatDescription(sampleBuffer) else { return }

    // For HEVC
    var parameterSetCount: Int = 0
    CMVideoFormatDescriptionGetHEVCParameterSetAtIndex(
        formatDesc, parameterSetIndex: 0, parameterSetPointerOut: nil,
        parameterSetSizeOut: nil, parameterSetCountOut: &parameterSetCount,
        nalUnitHeaderLengthOut: nil
    )

    var parameterSets: [Data] = []
    for i in 0..<parameterSetCount {
        var pointer: UnsafePointer<UInt8>?
        var size: Int = 0
        CMVideoFormatDescriptionGetHEVCParameterSetAtIndex(
            formatDesc, parameterSetIndex: i,
            parameterSetPointerOut: &pointer, parameterSetSizeOut: &size,
            parameterSetCountOut: nil, nalUnitHeaderLengthOut: nil
        )
        if let pointer = pointer {
            parameterSets.append(Data(bytes: pointer, count: size))
        }
    }

    // Send parameter sets to decoder via reliable control channel
    sendParameterSets(parameterSets)
}
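
For H.264 the flow is identical; only the accessor changes (and there are just two parameter sets, SPS and PPS). A sketch of the equivalent:

import CoreMedia
import Foundation

// Sketch: H.264 equivalent of the HEVC extraction above.
func extractH264ParameterSets(from formatDesc: CMFormatDescription) -> [Data] {
    var count = 0
    CMVideoFormatDescriptionGetH264ParameterSetAtIndex(
        formatDesc, parameterSetIndex: 0, parameterSetPointerOut: nil,
        parameterSetSizeOut: nil, parameterSetCountOut: &count,
        nalUnitHeaderLengthOut: nil
    )

    var parameterSets: [Data] = []
    for i in 0..<count {
        var pointer: UnsafePointer<UInt8>?
        var size = 0
        CMVideoFormatDescriptionGetH264ParameterSetAtIndex(
            formatDesc, parameterSetIndex: i,
            parameterSetPointerOut: &pointer, parameterSetSizeOut: &size,
            parameterSetCountOut: nil, nalUnitHeaderLengthOut: nil
        )
        if let pointer = pointer {
            parameterSets.append(Data(bytes: pointer, count: size))
        }
    }
    return parameterSets
}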

HEVC vs H.264: Which to Choose?

| Aspect | H.264 | HEVC (H.265) |
|--------|-------|--------------|
| Encode speed (Apple Silicon) | ~2ms | ~2ms |
| Compression ratio | Baseline | 30-40% better |
| Hardware support | All Macs | Apple Silicon + some Intel |
| Decode compatibility | Universal | Apple Silicon preferred |
| Latency | Same | Same |

Recommendation: Use HEVC as primary, fall back to H.264 for Intel Macs. Astropad does exactly this; its binary shows h264_support and hevc_support capability flags exchanged during the handshake.
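
A minimal version of that selection (peerSupportsHEVC is an illustrative flag standing in for whatever the handshake negotiates):

import CoreMedia

// Sketch: choose the codec from negotiated capability flags.
func selectCodec(peerSupportsHEVC: Bool) -> CMVideoCodecType {
    // Prefer HEVC when both ends support it; otherwise fall back to H.264.
    return peerSupportsHEVC ? kCMVideoCodecType_HEVC : kCMVideoCodecType_H264
}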

Bitrate Control for Streaming

For streaming, you want Constant Bitrate (CBR) or constrained VBR:

// Average bitrate
VTSessionSetProperty(session, key: kVTCompressionPropertyKey_AverageBitRate,
                     value: targetBitrate as CFNumber)

// Data rate limits (for constrained VBR)
// Format: [bytes_per_second, time_window_in_seconds]
let limits: [Int] = [targetBitrate / 8, 1]  // Allow burst within 1-second window
VTSessionSetProperty(session, key: kVTCompressionPropertyKey_DataRateLimits,
                     value: limits as CFArray)

Astropad uses adaptive bitrate via their liquid_rate_control module with a Kalman filter for bandwidth estimation. For a simpler approach, adjust bitrate based on round-trip time:

func adjustBitrate(currentRTT: TimeInterval) {
    let targetBitrate: Int
    switch currentRTT {
    case ..<0.01:  targetBitrate = 30_000_000  // <10ms RTT: 30 Mbps
    case ..<0.03:  targetBitrate = 20_000_000  // <30ms: 20 Mbps
    case ..<0.1:   targetBitrate = 10_000_000  // <100ms: 10 Mbps
    default:       targetBitrate = 5_000_000   // >100ms: 5 Mbps
    }
    VTSessionSetProperty(session, key: kVTCompressionPropertyKey_AverageBitRate,
                         value: targetBitrate as CFNumber)
}

Requesting Keyframes on Demand

When a new client connects or packet loss is detected, force an immediate keyframe:

let frameProperties: [CFString: Any] = [
    kVTEncodeFrameOptionKey_ForceKeyFrame: true
]

VTCompressionSessionEncodeFrame(
    session, imageBuffer: pixelBuffer,
    presentationTimeStamp: pts, duration: duration,
    frameProperties: frameProperties as CFDictionary,
    sourceFrameRefcon: nil, infoFlagsOut: nil
)

Astropad tracks codec.keyframes.regular and codec.keyframes.recovery separately — recovery keyframes are requested when the decoder detects corruption.
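
One way to mirror that bookkeeping on the host side (the type and field names here are illustrative, not Astropad's):

// Sketch: tag each forced keyframe with a reason and count the two kinds separately.
enum KeyframeReason {
    case regular    // periodic keyframe from the GOP interval
    case recovery   // requested after the decoder reports corruption or packet loss
}

struct KeyframeStats {
    private(set) var regular = 0
    private(set) var recovery = 0

    mutating func record(_ reason: KeyframeReason) {
        switch reason {
        case .regular:  regular += 1
        case .recovery: recovery += 1
        }
    }
}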

Measuring Encode Latency

let startTime = mach_absolute_time()
VTCompressionSessionEncodeFrame(session, ...)   // stash startTime (e.g. keyed by PTS) so the output handler can read it

// In the output handler:
let endTime = mach_absolute_time()
let elapsedNs = machTimeToNanoseconds(endTime - startTime)   // conversion helper sketched below
print("Encode latency: \(elapsedNs / 1_000_000)ms")

On an M1 Pro, expect ~2ms for 1080p60 HEVC. On M4, under 1ms.
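
machTimeToNanoseconds is not a system call; one way to define it, using mach_timebase_info to convert Mach ticks to nanoseconds:

import Darwin

// Sketch: convert a mach_absolute_time delta to nanoseconds.
func machTimeToNanoseconds(_ machTime: UInt64) -> UInt64 {
    var timebase = mach_timebase_info_data_t()
    mach_timebase_info(&timebase)
    return machTime * UInt64(timebase.numer) / UInt64(timebase.denom)
}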


Part 3 of the "Building a Remote Desktop from Scratch" series. Based on reverse engineering analysis of Astropad Workbench 1.1.0.
