Tile-Based Screen Streaming: How Astropad Achieves 16ms Latency
Most remote desktop apps encode the entire screen as a video stream using H.264 or HEVC. Astropad does something different: their LIQUID protocol divides the screen into a grid of independent tiles and only transmits the ones that changed. This is how they achieve ≤16ms end-to-end latency.
The Core Insight
When you're editing code, drawing in Photoshop, or browsing the web, less than 10% of the screen changes between frames. A traditional video codec doesn't care — it encodes the entire frame, wasting bandwidth and latency on unchanged regions.
LIQUID's tile-based approach:
Frame N:   [████████████████████████]   (full screen)
Frame N+1: [██][  ][  ][██][  ][  ]     (only cursor moved + one widget updated)
            ↑           ↑
            changed tiles (2 of 60) = 3.3% of screen
Instead of encoding a full 4K frame (3840x2160x4 ≈ 33MB of raw pixels), you compress and send two tiles (64x64x4 = 16KB each, ~8KB compressed). That's a roughly 2000x reduction in data.
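To sanity-check those numbers, here's a quick back-of-the-envelope calculation (illustrative only, not from the binary):

```swift
// Back-of-the-envelope math for the numbers above (illustrative only)
let fullFrame = 3_840 * 2_160 * 4   // raw 4K BGRA ≈ 33 MB
let tile = 64 * 64 * 4              // one tile = 16 KB
let payload = 2 * tile / 2          // 2 changed tiles at ~2x compression ≈ 16 KB
print(fullFrame / payload)          // ≈ 2025, i.e. the ~2000x reduction
```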
How It Works (From Binary Analysis)
Our reverse engineering of Astropad Workbench revealed the exact architecture:
Dual Codec System
Astropad's binary contains two distinct codec modes:
TileCodecOnly — Raw compressed tiles (for static/drawing content)
H26xCodecOnly — Full-frame H.264/HEVC (for video playback/motion)
The system switches automatically based on how much of the screen is changing.
GPU Tile Diffing
The liquid_image_processing module uses wgpu compute shaders with two strategies:
all_pixels_diffing — Compare every pixel between frames
checkboard_diffing — Compare only a checkerboard pattern (50% of pixels, catches virtually all real changes)
The Pipeline
ScreenCaptureKit capture
  → IOSurface (GPU memory)
  → Metal/wgpu compute shader: compare with previous frame
  → Output: changed_tiles[] bitmap
  → For each changed tile:
      → Extract 64x64 pixel region
      → Compress with fast algorithm (LZ4-like)
      → Packetize for network
  → If too many tiles changed (>60%):
      → Switch to H.264/HEVC full-frame encoding instead
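The "extract 64x64 pixel region" step can be sketched with MTLTexture.getBytes. This is our illustration, not Astropad's code: extractTileBytes, the row-major tile indexing, and the BGRA8 assumption are ours.

```swift
import Metal

/// Copies one tile's pixels out of a BGRA8 texture.
/// Sketch only: assumes a CPU-accessible (non-.private) texture and the
/// row-major tile indexing used throughout this post, not Astropad's code.
func extractTileBytes(from texture: MTLTexture, tileIndex: Int, tileSize: Int = 64) -> [UInt8] {
    let tilesPerRow = (texture.width + tileSize - 1) / tileSize
    let originX = (tileIndex % tilesPerRow) * tileSize
    let originY = (tileIndex / tilesPerRow) * tileSize
    // Tiles on the right/bottom edges may be partial
    let width = min(tileSize, texture.width - originX)
    let height = min(tileSize, texture.height - originY)
    var bytes = [UInt8](repeating: 0, count: width * height * 4)
    bytes.withUnsafeMutableBytes { buffer in
        texture.getBytes(buffer.baseAddress!,
                         bytesPerRow: width * 4,
                         from: MTLRegionMake2D(originX, originY, width, height),
                         mipmapLevel: 0)
    }
    return bytes
}
```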
Building It: Metal Compute Shader
Here's a Metal compute shader for tile change detection:
#include <metal_stdlib>
using namespace metal;

constant uint TILE_SIZE = 64;
constant float THRESHOLD = 0.01; // Minimum per-pixel difference to count as "changed"

kernel void tileDiff(
    texture2d<float, access::read> prevFrame [[texture(0)]],
    texture2d<float, access::read> currFrame [[texture(1)]],
    device atomic_uint* changedTiles [[buffer(0)]],
    uint2 gid [[thread_position_in_grid]],
    uint2 gridSize [[threads_per_grid]]
) {
    // Bounds check
    if (gid.x >= prevFrame.get_width() || gid.y >= prevFrame.get_height()) return;

    // Which tile does this pixel belong to?
    uint tileX = gid.x / TILE_SIZE;
    uint tileY = gid.y / TILE_SIZE;
    uint tilesPerRow = (gridSize.x + TILE_SIZE - 1) / TILE_SIZE;
    uint tileIdx = tileY * tilesPerRow + tileX;

    // Compare pixels (sum of absolute per-channel differences)
    float4 prev = prevFrame.read(gid);
    float4 curr = currFrame.read(gid);
    float diff = abs(prev.r - curr.r) + abs(prev.g - curr.g) + abs(prev.b - curr.b);

    // If any pixel in the tile changed, mark the whole tile
    if (diff > THRESHOLD) {
        atomic_fetch_or_explicit(&changedTiles[tileIdx], 1u, memory_order_relaxed);
    }
}
Checkerboard Optimization
Only check every other pixel in a checkerboard pattern — saves 50% GPU work:
kernel void tileDiffCheckerboard(
    texture2d<float, access::read> prevFrame [[texture(0)]],
    texture2d<float, access::read> currFrame [[texture(1)]],
    device atomic_uint* changedTiles [[buffer(0)]],
    uint2 gid [[thread_position_in_grid]]
) {
    // Skip alternate pixels in a checkerboard pattern
    if ((gid.x + gid.y) % 2 != 0) return;
    // ... same comparison logic as above
}
Swift Orchestration
import Foundation
import Metal

// ChangedTile, allTiles(), and extractTile(from:tileIndex:) are app-specific
// helpers (the latter wraps the tile-extraction sketch shown earlier).
class TileEncoder {
    let tileSize = 64
    let commandQueue: MTLCommandQueue
    let changedTilesBuffer: MTLBuffer
    let tileDiffPipeline: MTLComputePipelineState
    var previousTexture: MTLTexture?
    var tilesPerRow = 0
    var tilesPerColumn = 0

    init(device: MTLDevice, tileDiffPipeline: MTLComputePipelineState, maxTiles: Int) {
        self.commandQueue = device.makeCommandQueue()!
        self.changedTilesBuffer = device.makeBuffer(
            length: maxTiles * MemoryLayout<UInt32>.stride,
            options: .storageModeShared)!
        self.tileDiffPipeline = tileDiffPipeline
    }

    func processFrame(_ currentTexture: MTLTexture) -> [ChangedTile] {
        guard let previous = previousTexture else {
            previousTexture = currentTexture
            return allTiles() // First frame: send everything
        }
        tilesPerRow = (currentTexture.width + tileSize - 1) / tileSize
        tilesPerColumn = (currentTexture.height + tileSize - 1) / tileSize

        // Reset the changed-tiles buffer
        memset(changedTilesBuffer.contents(), 0, changedTilesBuffer.length)

        // Run the GPU diff
        let commandBuffer = commandQueue.makeCommandBuffer()!
        let encoder = commandBuffer.makeComputeCommandEncoder()!
        encoder.setComputePipelineState(tileDiffPipeline)
        encoder.setTexture(previous, index: 0)
        encoder.setTexture(currentTexture, index: 1)
        encoder.setBuffer(changedTilesBuffer, offset: 0, index: 0)

        let threadgroupSize = MTLSize(width: 8, height: 8, depth: 1)
        let gridSize = MTLSize(width: currentTexture.width, height: currentTexture.height, depth: 1)
        encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadgroupSize)
        encoder.endEncoding()
        commandBuffer.commit()
        commandBuffer.waitUntilCompleted()

        // Read back the per-tile change flags
        let tileCount = tilesPerRow * tilesPerColumn
        let tileData = changedTilesBuffer.contents().bindMemory(to: UInt32.self, capacity: tileCount)
        var changedTiles: [ChangedTile] = []
        for i in 0..<tileCount where tileData[i] != 0 {
            changedTiles.append(extractTile(from: currentTexture, tileIndex: i))
        }
        previousTexture = currentTexture
        return changedTiles
    }
}
Adaptive Codec Switching
The magic is in knowing when to switch between tiles and H.264/HEVC:
class AdaptiveCodecController {
    enum CodecMode { case tiles, h26x, idle }

    var consecutiveHighMotionFrames = 0
    var consecutiveIdleFrames = 0

    func selectMode(changedRatio: Float) -> CodecMode {
        if changedRatio > 0.6 {
            consecutiveHighMotionFrames += 1
            consecutiveIdleFrames = 0
            if consecutiveHighMotionFrames >= 3 {
                return .h26x // Video playback, scrolling — use full-frame codec
            }
        } else if changedRatio < 0.01 {
            consecutiveIdleFrames += 1
            consecutiveHighMotionFrames = 0
            if consecutiveIdleFrames >= 5 {
                return .idle // Nothing changed — don't send anything
            }
        } else {
            consecutiveHighMotionFrames = 0
            consecutiveIdleFrames = 0
        }
        return .tiles // Default: send only changed tiles
    }
}
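Tying the diff results to the controller is a few lines of glue. This is hypothetical: tileEncoder, codecController, totalTileCount, send, and encodeFullFrame are stand-ins, not symbols from the binary.

```swift
// Hypothetical per-frame driver; the names here are stand-ins.
let changed = tileEncoder.processFrame(currentTexture)
let ratio = Float(changed.count) / Float(totalTileCount)

switch codecController.selectMode(changedRatio: ratio) {
case .tiles: send(tiles: changed)            // normal path: changed tiles only
case .h26x:  encodeFullFrame(currentTexture) // sustained motion: full-frame codec
case .idle:  break                           // static screen: send nothing
}
```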
Tile Compression
For the tile codec path, each 64x64 BGRA tile is 16KB uncompressed. Options:
| Algorithm | Compress Speed | Decompress Speed | Ratio | Best For |
|-----------|----------------|------------------|-------|----------|
| Raw (no compression) | ∞ | ∞ | 1:1 | Gigabit LAN |
| LZ4 | 780 MB/s | 4970 MB/s | 2-3x | LAN streaming |
| zstd (level 1) | 510 MB/s | 1380 MB/s | 3-4x | WiFi streaming |
| zstd (level 3) | 200 MB/s | 1380 MB/s | 4-5x | WAN streaming |
LZ4 is the clear winner for LAN: decompression runs at multiple GB/s, approaching memcpy speed, and the 2-3x compression ratio cuts bandwidth to a half or a third.
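On Apple platforms the system Compression framework includes LZ4, so a per-tile compress step could look like the sketch below. Astropad's actual codec is only described in the binary as "LZ4-like", so treat this as an approximation, not their implementation.

```swift
import Compression
import Foundation

/// Compresses one raw tile with LZ4. Sketch only; the shipping codec
/// is described as "LZ4-like", not necessarily this API.
func compressTile(_ raw: [UInt8]) -> Data? {
    let dstCapacity = raw.count + 64 // LZ4 can slightly expand incompressible input
    let dst = UnsafeMutablePointer<UInt8>.allocate(capacity: dstCapacity)
    defer { dst.deallocate() }
    let written = raw.withUnsafeBufferPointer { src in
        compression_encode_buffer(dst, dstCapacity,
                                  src.baseAddress!, raw.count,
                                  nil, COMPRESSION_LZ4)
    }
    // Incompressible tile: return nil so the caller can send it raw instead
    guard written > 0, written < raw.count else { return nil }
    return Data(bytes: dst, count: written)
}
```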
Wire Format
Each tile packet:
[4B] frame_id
[1B] codec_mode (0=tile, 1=h26x)
[2B] tile_x, tile_y (grid coordinates)
[2B] compressed_size
[NB] compressed_tile_data (LZ4)
The client maintains a tile grid and updates individual tiles as they arrive. Even if some tiles are lost (UDP/QUIC datagrams), the rest of the screen remains correct — only the lost tiles show stale content until the next update.
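Here's a minimal Swift serializer for that header. Little-endian byte order and one byte each for tile_x/tile_y are our assumptions; the analysis doesn't pin down either.

```swift
import Foundation

/// Serializes one tile packet per the layout above.
/// Sketch: little-endian byte order and the split of the 2 bytes
/// between tile_x and tile_y are assumptions, not from the binary.
struct TilePacket {
    var frameID: UInt32
    var codecMode: UInt8 // 0 = tile, 1 = h26x
    var tileX: UInt8
    var tileY: UInt8
    var payload: Data    // LZ4-compressed tile bytes (fits in UInt16)

    func encoded() -> Data {
        var out = Data()
        withUnsafeBytes(of: frameID.littleEndian) { out.append(contentsOf: $0) }
        out.append(codecMode)
        out.append(tileX)
        out.append(tileY)
        withUnsafeBytes(of: UInt16(payload.count).littleEndian) { out.append(contentsOf: $0) }
        out.append(payload)
        return out
    }
}
```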
Latency Breakdown
| Stage | Full-Frame H.264 | Tile-Based |
|-------|------------------|------------|
| Capture | 1ms | 1ms |
| Diff/Encode | 2ms (full encode) | 0.5ms (GPU diff + LZ4) |
| Network (LAN) | 1ms (large frame) | 0.2ms (small tiles) |
| Decode/Render | 2ms (full decode) | 0.3ms (tile blit) |
| Total | ~6ms | ~2ms |
The tile approach wins because less data means less encoding time, less network time, and less decoding time. The GPU diff adds negligible overhead.
Part 6 of the "Building a Remote Desktop from Scratch" series. Based on reverse engineering analysis of Astropad Workbench 1.1.0.