Zero-Copy Video Pipeline on Apple Silicon: IOSurface to Metal to Screen

The difference between a 50ms and a 5ms remote desktop is not the codec or the network — it is how many times pixel data gets copied between CPU and GPU memory. On Apple Silicon, with unified memory architecture, you can build a pipeline where pixels never leave the GPU from capture to display.

This post documents the zero-copy pipeline that makes sub-16ms remote desktop streaming possible.

The Problem: Memory Copies Kill Latency

A naive pipeline copies pixels at every stage:

Screen → CPU buffer (copy 1)
CPU buffer → Encoder input (copy 2)
Encoder output → CPU buffer (copy 3)
CPU buffer → Network (copy 4)
Network → CPU buffer (copy 5)
CPU buffer → Decoder input (copy 6)
Decoder output → CPU buffer (copy 7)
CPU buffer → GPU texture (copy 8)
GPU texture → Display (copy 9)

Each copy of a 4K BGRA frame (3840 × 2160 × 4 bytes ≈ 33 MB) takes roughly 2-5ms on the CPU. Nine copies add up to 18-45ms of pure memcpy overhead.
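To make that arithmetic concrete, here is a quick back-of-the-envelope sketch in plain Swift (the 2-5ms per-copy figure is a rough range, not a measured constant):

```swift
// Back-of-the-envelope cost of CPU copies for one 4K BGRA frame.
let width = 3840, height = 2160, bytesPerPixel = 4
let frameBytes = width * height * bytesPerPixel   // 33_177_600 bytes, about 33 MB

let copies = 9
let msPerCopyLow = 2.0, msPerCopyHigh = 5.0       // rough per-copy cost range
let overheadLow = Double(copies) * msPerCopyLow   // 18 ms
let overheadHigh = Double(copies) * msPerCopyHigh // 45 ms

print("frame: \(frameBytes) bytes, copy overhead: \(overheadLow)-\(overheadHigh) ms")
```

At 60fps the frame budget is 16.7ms, so the copy overhead alone can exceed the entire budget.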

The Solution: IOSurface + Unified Memory

On Apple Silicon, CPU and GPU share the same physical memory. IOSurface is the kernel primitive that makes GPU memory accessible to multiple frameworks without copying:

Screen → IOSurface (GPU capture, zero-copy)
IOSurface → Metal texture (texture view, zero-copy)
Metal texture → Tile diff shader (GPU compute, zero-copy)
IOSurface → VideoToolbox encoder (hardware encoder reads directly, zero-copy)
---network---
NAL units → VideoToolbox decoder (hardware decoder, zero-copy)
Decoder output → CVPixelBuffer/IOSurface (zero-copy)
IOSurface → Metal texture (texture view, zero-copy)
Metal texture → Display (GPU render, zero-copy)

Zero CPU pixel copies in the entire pipeline.

Host Side: Capture to Encode

Step 1: ScreenCaptureKit delivers IOSurface-backed buffers

func stream(_ stream: SCStream, didOutputSampleBuffer sampleBuffer: CMSampleBuffer,
            of type: SCStreamOutputType) {
    guard type == .screen else { return }
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    
    // This CVPixelBuffer is backed by an IOSurface
    // Verify:
    let surface = CVPixelBufferGetIOSurface(pixelBuffer)
    assert(surface != nil, "ScreenCaptureKit should always provide IOSurface-backed buffers")
    
    // Pass directly to encoder — no copy needed
    processFrame(pixelBuffer: pixelBuffer)
}

Step 2: Create Metal texture from IOSurface (zero-copy)

let device = MTLCreateSystemDefaultDevice()!

func createMetalTexture(from pixelBuffer: CVPixelBuffer) -> MTLTexture? {
    let ioSurface = CVPixelBufferGetIOSurface(pixelBuffer)!.takeUnretainedValue()
    
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .bgra8Unorm,
        width: CVPixelBufferGetWidth(pixelBuffer),
        height: CVPixelBufferGetHeight(pixelBuffer),
        mipmapped: false
    )
    descriptor.usage = [.shaderRead]
    descriptor.storageMode = .shared  // Critical for unified memory
    
    // This creates a texture VIEW of the IOSurface — no data copy
    return device.makeTexture(descriptor: descriptor, iosurface: ioSurface, plane: 0)
}

Step 3: GPU tile diffing on the same texture

func runTileDiff(current: MTLTexture, previous: MTLTexture) -> MTLBuffer {
    let commandBuffer = commandQueue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    
    encoder.setComputePipelineState(tileDiffPipeline)
    encoder.setTexture(previous, index: 0)
    encoder.setTexture(current, index: 1)
    encoder.setBuffer(changedTilesBuffer, offset: 0, index: 0)
    
    // 8x8 threads per group; each thread compares an 8x8 pixel block,
    // so one threadgroup covers one 64x64 tile
    let threadgroupSize = MTLSize(width: 8, height: 8, depth: 1)
    let threadgroupCount = MTLSize(
        width: (current.width + 63) / 64,   // one threadgroup per 64px tile column
        height: (current.height + 63) / 64, // one threadgroup per 64px tile row
        depth: 1
    )
    encoder.dispatchThreadgroups(threadgroupCount, threadsPerThreadgroup: threadgroupSize)
    encoder.endEncoding()
    
    commandBuffer.commit()
    // Blocking wait keeps the example simple; a real pipeline would use
    // addCompletedHandler to avoid stalling the capture thread.
    commandBuffer.waitUntilCompleted()
    
    return changedTilesBuffer
}
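The compute kernel behind `tileDiffPipeline` isn't shown above. As a reference for what it computes, here is a plain-Swift CPU sketch of the same per-tile comparison (64x64 tiles, one dirty flag per tile; the function name and tile size are my assumptions, not Astropad's actual kernel):

```swift
// CPU reference for the GPU tile diff: mark each 64x64 tile whose pixels
// differ between the previous and current frame. Frames are tightly packed
// BGRA byte arrays (4 bytes per pixel, row-major).
func diffTiles(previous: [UInt8], current: [UInt8],
               width: Int, height: Int, tile: Int = 64) -> [Bool] {
    let tilesX = (width + tile - 1) / tile
    let tilesY = (height + tile - 1) / tile
    var changed = [Bool](repeating: false, count: tilesX * tilesY)
    for ty in 0..<tilesY {
        for tx in 0..<tilesX {
            // Stop scanning a tile as soon as one differing pixel is found.
            outer: for y in (ty * tile)..<min((ty + 1) * tile, height) {
                for x in (tx * tile)..<min((tx + 1) * tile, width) {
                    let i = (y * width + x) * 4
                    if previous[i..<i+4] != current[i..<i+4] {
                        changed[ty * tilesX + tx] = true
                        break outer
                    }
                }
            }
        }
    }
    return changed
}
```

The GPU version does the same comparison with one threadgroup per tile, writing flags into `changedTilesBuffer` instead of returning an array.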

Step 4: Feed same CVPixelBuffer to VideoToolbox

// The SAME pixelBuffer from ScreenCaptureKit goes directly to the encoder
// VideoToolbox reads the IOSurface on the GPU — zero copy
// (Encoded NAL units arrive via the output callback registered at session creation.)
VTCompressionSessionEncodeFrame(
    session,
    imageBuffer: pixelBuffer,
    presentationTimeStamp: pts,
    duration: duration,
    frameProperties: nil,
    sourceFrameRefcon: nil,
    infoFlagsOut: nil
)

Client Side: Decode to Display

Step 5: CVMetalTextureCache for decoder output

var textureCache: CVMetalTextureCache?

// Create once at init
CVMetalTextureCacheCreate(
    kCFAllocatorDefault,
    nil,
    device,
    nil,
    &textureCache
)

func textureFromDecodedFrame(_ pixelBuffer: CVPixelBuffer) -> MTLTexture? {
    let width = CVPixelBufferGetWidth(pixelBuffer)
    let height = CVPixelBufferGetHeight(pixelBuffer)
    
    var cvTexture: CVMetalTexture?
    let status = CVMetalTextureCacheCreateTextureFromImage(
        kCFAllocatorDefault,
        textureCache!,
        pixelBuffer,
        nil,
        .bgra8Unorm,
        width,
        height,
        0,  // plane index
        &cvTexture
    )
    
    guard status == kCVReturnSuccess, let cvTexture = cvTexture else { return nil }
    // Keep cvTexture alive until the GPU has finished reading the returned
    // MTLTexture; releasing it early lets the cache recycle the backing surface.
    return CVMetalTextureGetTexture(cvTexture)
}

Step 6: Render with MTKView

class MetalRenderer: NSObject, MTKViewDelegate {
    var latestTexture: MTLTexture?
    
    func draw(in view: MTKView) {
        guard let texture = latestTexture,
              let drawable = view.currentDrawable,
              let commandBuffer = commandQueue.makeCommandBuffer(),
              let renderPassDesc = view.currentRenderPassDescriptor else { return }
        
        let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDesc)!
        encoder.setRenderPipelineState(renderPipeline)
        encoder.setFragmentTexture(texture, index: 0)
        
        // Draw fullscreen quad
        encoder.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: 4)
        encoder.endEncoding()
        
        commandBuffer.present(drawable)
        commandBuffer.commit()
    }
}

Measuring the Improvement

| Stage | With Copies | Zero-Copy |
|-------|------------|-----------|
| Capture → GPU | 3-5ms | <0.1ms |
| GPU → Encoder | 2-3ms | <0.1ms |
| Decoder → GPU | 2-3ms | <0.1ms |
| GPU → Display | 1-2ms | <0.1ms |
| Total overhead | 8-13ms | <0.5ms |

That 8-13ms savings is the difference between "fast enough" and "feels instant."

What Astropad Does

From our binary analysis, Astropad's pipeline is built on the same principles:

  • liquid_screencap::io_surface — IOSurface buffer management
  • liquid_image_processing::wgpu::resources::mapped_buffer — GPU↔CPU buffer mapping (they use wgpu instead of Metal directly)
  • liquid_image_processing::pool — Buffer pool to avoid allocation overhead
  • The entire liquid_codec::encoder pipeline operates on IOSurface-backed textures

Their Rust wgpu approach adds a translation layer (wgpu → Metal backend), but the underlying principle is the same: keep data on the GPU, avoid CPU copies.

Key Gotchas

  1. storageMode must be .shared — this is the default on Apple Silicon (unified memory) but matters on Intel Macs where GPU/CPU memory is separate

  2. Don't call CVPixelBufferLockBaseAddress — this forces a CPU-accessible mapping and may trigger a copy. Only use it if you actually need CPU access (e.g., for software-based tile compression)

  3. Texture format must match — ScreenCaptureKit outputs BGRA, so use .bgra8Unorm everywhere. Mismatched formats trigger an implicit conversion copy

  4. Buffer pooling — create a pool of MTLTexture objects and cycle through them, rather than creating/destroying textures per frame. Astropad does this via their liquid_image_processing::pool module

  5. CVMetalTextureCache flush — call CVMetalTextureCacheFlush(textureCache, 0) periodically to release old texture references
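For gotcha 4, the pool can be as simple as a lock-protected free list. This generic Swift sketch shows the shape (the names are mine, not Astropad's); in the real pipeline the factory closure would create MTLTextures from a shared descriptor:

```swift
import Foundation

// Minimal object pool: reuse buffers instead of allocating one per frame.
final class BufferPool<T> {
    private var free: [T] = []
    private let lock = NSLock()
    private let make: () -> T

    init(makeBuffer: @escaping () -> T) { self.make = makeBuffer }

    // Reuse a free buffer if available, otherwise allocate a new one.
    func acquire() -> T {
        lock.lock(); defer { lock.unlock() }
        return free.popLast() ?? make()
    }

    // Return a buffer once the GPU is done with it (e.g. from a
    // command buffer completion handler).
    func release(_ buffer: T) {
        lock.lock(); defer { lock.unlock() }
        free.append(buffer)
    }
}
```

The important property is that steady-state operation performs zero allocations: after the first few frames, every acquire is satisfied from the free list.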


Part 5 of the "Building a Remote Desktop from Scratch" series. Based on reverse engineering analysis of Astropad Workbench 1.1.0.
