Zero-Copy Video Pipeline on Apple Silicon: IOSurface to Metal to Screen
The difference between a 50ms and a 5ms remote desktop is not the codec or the network — it is how many times pixel data gets copied between CPU and GPU memory. On Apple Silicon, with unified memory architecture, you can build a pipeline where pixels never leave the GPU from capture to display.
This post documents the zero-copy pipeline that makes sub-16ms remote desktop streaming possible.
The Problem: Memory Copies Kill Latency
A naive pipeline copies pixels at every stage:
Screen → CPU buffer (copy 1)
CPU buffer → Encoder input (copy 2)
Encoder output → CPU buffer (copy 3)
CPU buffer → Network (copy 4)
Network → CPU buffer (copy 5)
CPU buffer → Decoder input (copy 6)
Decoder output → CPU buffer (copy 7)
CPU buffer → GPU texture (copy 8)
GPU texture → Display (copy 9)
Each copy of a 4K BGRA frame (3840 × 2160 × 4 bytes ≈ 33MB) takes roughly 2-5ms of CPU time. Nine copies add 18-45ms of pure memcpy overhead.
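The arithmetic is easy to sanity-check. A back-of-the-envelope sketch in Swift, assuming an illustrative ~10 GB/s effective memcpy bandwidth (the real figure varies by chip and access pattern):

```swift
// Back-of-the-envelope copy cost for one 4K BGRA frame.
let width = 3840, height = 2160, bytesPerPixel = 4
let frameBytes = width * height * bytesPerPixel      // 33_177_600 bytes ≈ 33 MB
let memcpyBandwidth = 10.0e9                         // ~10 GB/s, assumed effective rate
let perCopyMs = Double(frameBytes) / memcpyBandwidth * 1000
print(String(format: "one copy ≈ %.1f ms, nine copies ≈ %.1f ms",
             perCopyMs, perCopyMs * 9))
```

At that assumed bandwidth a single copy lands around 3.3ms, squarely inside the 2-5ms range above.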
The Solution: IOSurface + Unified Memory
On Apple Silicon, CPU and GPU share the same physical memory. IOSurface is the kernel primitive that makes GPU memory accessible to multiple frameworks without copying:
Screen → IOSurface (GPU capture, zero-copy)
IOSurface → Metal texture (pointer cast, zero-copy)
Metal texture → Tile diff shader (GPU compute, zero-copy)
IOSurface → VideoToolbox encoder (hardware encoder reads directly, zero-copy)
---network---
NAL units → VideoToolbox decoder (hardware decoder, zero-copy)
Decoder output → CVPixelBuffer/IOSurface (zero-copy)
IOSurface → Metal texture (pointer cast, zero-copy)
Metal texture → Display (GPU render, zero-copy)
Zero CPU pixel copies in the entire pipeline.
Host Side: Capture to Encode
Step 1: ScreenCaptureKit delivers IOSurface-backed buffers
func stream(_ stream: SCStream, didOutputSampleBuffer sampleBuffer: CMSampleBuffer,
            of type: SCStreamOutputType) {
    guard type == .screen else { return }
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
    // This CVPixelBuffer is backed by an IOSurface
    // Verify:
    let surface = CVPixelBufferGetIOSurface(pixelBuffer)
    assert(surface != nil, "ScreenCaptureKit should always provide IOSurface-backed buffers")
    // Pass directly to encoder — no copy needed
    processFrame(pixelBuffer: pixelBuffer)
}
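Capture configuration matters for keeping this path zero-copy: the pixel format should match what the rest of the pipeline expects. A sketch of the setup, where `filter` and `streamOutput` are assumed to be defined elsewhere:

```swift
import ScreenCaptureKit
import CoreMedia

// Sketch: configure capture so buffers match the downstream pipeline.
let config = SCStreamConfiguration()
config.width = 3840
config.height = 2160
config.pixelFormat = kCVPixelFormatType_32BGRA  // matches .bgra8Unorm downstream
config.minimumFrameInterval = CMTime(value: 1, timescale: 60)  // cap at 60 fps
config.queueDepth = 3  // small pool of in-flight surfaces

let stream = SCStream(filter: filter, configuration: config, delegate: nil)
try stream.addStreamOutput(streamOutput, type: .screen,
                           sampleHandlerQueue: DispatchQueue(label: "capture"))
try await stream.startCapture()
```

Keeping `queueDepth` small bounds how many IOSurfaces ScreenCaptureKit keeps in flight, which limits memory use at the cost of dropped frames under load.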
Step 2: Create Metal texture from IOSurface (zero-copy)
let device = MTLCreateSystemDefaultDevice()!

func createMetalTexture(from pixelBuffer: CVPixelBuffer) -> MTLTexture? {
    let ioSurface = CVPixelBufferGetIOSurface(pixelBuffer)!.takeUnretainedValue()
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .bgra8Unorm,
        width: CVPixelBufferGetWidth(pixelBuffer),
        height: CVPixelBufferGetHeight(pixelBuffer),
        mipmapped: false
    )
    descriptor.usage = [.shaderRead]
    descriptor.storageMode = .shared // Critical for unified memory
    // This creates a texture VIEW of the IOSurface — no data copy
    return device.makeTexture(descriptor: descriptor, iosurface: ioSurface, plane: 0)
}
Step 3: GPU tile diffing on the same texture
func runTileDiff(current: MTLTexture, previous: MTLTexture) -> MTLBuffer {
    let commandBuffer = commandQueue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(tileDiffPipeline)
    encoder.setTexture(previous, index: 0)
    encoder.setTexture(current, index: 1)
    encoder.setBuffer(changedTilesBuffer, offset: 0, index: 0)
    // 8x8 threads per threadgroup, one threadgroup per 64x64 tile:
    // each thread compares an 8x8 block of pixels
    let threadgroupSize = MTLSize(width: 8, height: 8, depth: 1)
    let threadgroupCount = MTLSize(
        width: (current.width + 63) / 64,   // number of tile columns
        height: (current.height + 63) / 64, // number of tile rows
        depth: 1
    )
    encoder.dispatchThreadgroups(threadgroupCount, threadsPerThreadgroup: threadgroupSize)
    encoder.endEncoding()
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
    return changedTilesBuffer
}
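The dispatch above assumes a kernel behind `tileDiffPipeline` that was not shown. A hypothetical sketch of what that shader could look like, with each 8×8 threadgroup covering one 64×64 tile and each thread comparing an 8×8 pixel block:

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical tile-diff kernel (sketch, not the article's exact shader):
// one threadgroup (8x8 threads) per 64x64 tile.
kernel void tileDiff(texture2d<float, access::read> previous [[texture(0)]],
                     texture2d<float, access::read> current  [[texture(1)]],
                     device atomic_uint *changedTiles        [[buffer(0)]],
                     uint2 tile  [[threadgroup_position_in_grid]],
                     uint2 local [[thread_position_in_threadgroup]]) {
    uint2 base = tile * 64 + local * 8;
    bool changed = false;
    for (uint y = 0; y < 8 && !changed; y++) {
        for (uint x = 0; x < 8 && !changed; x++) {
            uint2 p = base + uint2(x, y);
            if (p.x >= current.get_width() || p.y >= current.get_height()) continue;
            if (any(current.read(p) != previous.read(p))) changed = true;
        }
    }
    if (changed) {
        // Mark the whole tile dirty; a production kernel might reduce within
        // the threadgroup first to cut atomic traffic.
        uint tilesPerRow = (current.get_width() + 63) / 64;
        atomic_store_explicit(&changedTiles[tile.y * tilesPerRow + tile.x],
                              1u, memory_order_relaxed);
    }
}
```

Both textures stay resident on the GPU throughout; only the small `changedTiles` buffer is ever read back.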
Step 4: Feed same CVPixelBuffer to VideoToolbox
// The SAME pixelBuffer from ScreenCaptureKit goes directly to the encoder
// VideoToolbox reads the IOSurface on the GPU — zero copy
VTCompressionSessionEncodeFrame(
    session,
    imageBuffer: pixelBuffer,
    presentationTimeStamp: pts,
    duration: duration,
    frameProperties: nil,
    sourceFrameRefcon: nil,
    infoFlagsOut: nil
)
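For completeness, a sketch of how the session feeding that call might be created so frames stay on the hardware encoder. The property choices and the `compressionOutputCallback` name are assumptions for illustration, not a confirmed configuration:

```swift
import VideoToolbox

var session: VTCompressionSession?
VTCompressionSessionCreate(
    allocator: kCFAllocatorDefault,
    width: 3840, height: 2160,
    codecType: kCMVideoCodecType_HEVC,
    encoderSpecification: [
        // Refuse to silently fall back to the software encoder
        kVTVideoEncoderSpecification_RequireHardwareAcceleratedVideoEncoder: true
    ] as CFDictionary,
    imageBufferAttributes: nil,
    compressedDataAllocator: nil,
    outputCallback: compressionOutputCallback,  // assumed C callback defined elsewhere
    refcon: nil,
    compressionSessionOut: &session
)
// Low-latency knobs: real-time rate control, no B-frame reordering
VTSessionSetProperty(session!, key: kVTCompressionPropertyKey_RealTime,
                     value: kCFBooleanTrue)
VTSessionSetProperty(session!, key: kVTCompressionPropertyKey_AllowFrameReordering,
                     value: kCFBooleanFalse)
```

Disabling frame reordering matters for interactive streaming: B-frames buy compression at the cost of buffering future frames, which directly adds latency.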
Client Side: Decode to Display
Step 5: CVMetalTextureCache for decoder output
var textureCache: CVMetalTextureCache?

// Create once at init
CVMetalTextureCacheCreate(
    kCFAllocatorDefault,
    nil,    // cache attributes
    device,
    nil,    // texture attributes
    &textureCache
)
func textureFromDecodedFrame(_ pixelBuffer: CVPixelBuffer) -> MTLTexture? {
    let width = CVPixelBufferGetWidth(pixelBuffer)
    let height = CVPixelBufferGetHeight(pixelBuffer)
    var cvTexture: CVMetalTexture?
    let status = CVMetalTextureCacheCreateTextureFromImage(
        kCFAllocatorDefault,
        textureCache!,
        pixelBuffer,
        nil,
        .bgra8Unorm,
        width,
        height,
        0, // plane index
        &cvTexture
    )
    guard status == kCVReturnSuccess, let cvTexture = cvTexture else { return nil }
    // Keep a strong reference to cvTexture until the GPU has finished reading
    // the returned MTLTexture — otherwise the backing memory can be recycled
    return CVMetalTextureGetTexture(cvTexture)
}
Step 6: Render with MTKView
class MetalRenderer: NSObject, MTKViewDelegate {
    var latestTexture: MTLTexture?

    func mtkView(_ view: MTKView, drawableSizeWillChange size: CGSize) {}

    func draw(in view: MTKView) {
        guard let texture = latestTexture,
              let drawable = view.currentDrawable,
              let commandBuffer = commandQueue.makeCommandBuffer(),
              let renderPassDesc = view.currentRenderPassDescriptor else { return }
        let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDesc)!
        encoder.setRenderPipelineState(renderPipeline)
        encoder.setFragmentTexture(texture, index: 0)
        // Draw fullscreen quad
        encoder.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: 4)
        encoder.endEncoding()
        commandBuffer.present(drawable)
        commandBuffer.commit()
    }
}
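The `renderPipeline` above only needs a trivial shader pair. A sketch of the fullscreen-quad vertex/fragment functions assumed by the four-vertex triangle-strip draw (names here are illustrative):

```metal
#include <metal_stdlib>
using namespace metal;

struct QuadOut {
    float4 position [[position]];
    float2 uv;
};

// Generates a fullscreen quad from the 4 vertex IDs of a triangle strip —
// no vertex buffer needed.
vertex QuadOut fullscreenQuad(uint vid [[vertex_id]]) {
    float2 pos[4] = { {-1, -1}, {1, -1}, {-1, 1}, {1, 1} };
    QuadOut out;
    out.position = float4(pos[vid], 0, 1);
    // Map clip space to texture space, flipping Y
    out.uv = float2((pos[vid].x + 1) * 0.5, (1 - pos[vid].y) * 0.5);
    return out;
}

fragment float4 sampleFrame(QuadOut in [[stage_in]],
                            texture2d<float> frame [[texture(0)]]) {
    constexpr sampler s(filter::linear, address::clamp_to_edge);
    return frame.sample(s, in.uv);
}
```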
Measuring the Improvement
| Stage | With Copies | Zero-Copy |
|-------|-------------|-----------|
| Capture → GPU | 3-5ms | <0.1ms |
| GPU → Encoder | 2-3ms | <0.1ms |
| Decoder → GPU | 2-3ms | <0.1ms |
| GPU → Display | 1-2ms | <0.1ms |
| Total overhead | 8-13ms | <0.5ms |
That 8-13ms savings is the difference between "fast enough" and "feels instant."
What Astropad Does
From our binary analysis, Astropad's pipeline is built on the same principles:
- `liquid_screencap::io_surface` — IOSurface buffer management
- `liquid_image_processing::wgpu::resources::mapped_buffer` — GPU↔CPU buffer mapping (they use wgpu instead of Metal directly)
- `liquid_image_processing::pool` — buffer pool to avoid allocation overhead
- The entire `liquid_codec::encoder` pipeline operates on IOSurface-backed textures
Their Rust wgpu approach adds a translation layer (wgpu → Metal backend), but the underlying principle is the same: keep data on the GPU, avoid CPU copies.
Key Gotchas
- `storageMode` must be `.shared` — this is the default on Apple Silicon (unified memory) but matters on Intel Macs where GPU/CPU memory is separate
- Don't call `CVPixelBufferLockBaseAddress` — this forces a CPU-accessible mapping and may trigger a copy. Only use it if you actually need CPU access (e.g., for software-based tile compression)
- Texture format must match — ScreenCaptureKit outputs BGRA, so use `.bgra8Unorm` everywhere. Mismatched formats trigger an implicit conversion copy
- Buffer pooling — create a pool of `MTLTexture` objects and cycle through them, rather than creating/destroying textures per frame. Astropad does this via their `liquid_image_processing::pool` module
- `CVMetalTextureCache` flush — call `CVMetalTextureCacheFlush(textureCache, 0)` periodically to release old texture references
Part 5 of the "Building a Remote Desktop from Scratch" series. Based on reverse engineering analysis of Astropad Workbench 1.1.0.