Tile-Based Screen Streaming: How Astropad Achieves 16ms Latency

Most remote desktop apps encode the entire screen as a video stream using H.264 or HEVC. Astropad does something different: their LIQUID protocol divides the screen into a grid of independent tiles and only transmits the ones that changed. This is how they achieve ≤16ms end-to-end latency.

The Core Insight

When you're editing code, drawing in Photoshop, or browsing the web, less than 10% of the screen changes between frames. A traditional video codec doesn't care — it encodes the entire frame, wasting bandwidth and latency on unchanged regions.

LIQUID's tile-based approach:

Frame N:     [████████████████████████████]  (full screen)
Frame N+1:   [  ██  ][    ][    ][    ][  ]  (only cursor moved + one widget updated)
                ↑                        ↑
            Changed tiles (2 of 60) = 3.3% of screen

Instead of encoding 33MB of 4K pixels, you compress and send 2 tiles (64x64x4 = 16KB each, ~8KB compressed). That's a 2000x reduction in data.
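The arithmetic is easy to sanity-check (a quick sketch; the full-frame figure assumes a 3840x2160 BGRA framebuffer, and the ~2x compressed tile size is an assumption):

```swift
// Uncompressed 4K BGRA frame: 3840 * 2160 pixels * 4 bytes
let frameBytes = 3840 * 2160 * 4        // 33,177,600 bytes ≈ 33 MB

// One 64x64 BGRA tile, and a typical ~2x LZ4-compressed size
let tileBytes = 64 * 64 * 4             // 16,384 bytes = 16 KB
let compressedTile = tileBytes / 2      // ~8 KB

// Two changed tiles vs. the whole frame
let sent = 2 * compressedTile           // ~16 KB on the wire
let reduction = frameBytes / sent       // on the order of 2000x
```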

How It Works (From Binary Analysis)

Our reverse engineering of Astropad Workbench revealed the exact architecture:

Dual Codec System

Astropad's binary contains two distinct codec modes:

TileCodecOnly   — Raw compressed tiles (for static/drawing content)
H26xCodecOnly   — Full-frame H.264/HEVC (for video playback/motion)

The system switches automatically based on how much of the screen is changing.

GPU Tile Diffing

The liquid_image_processing module uses wgpu compute shaders with two strategies:

  • all_pixels_diffing — Compare every pixel between frames
  • checkboard_diffing — Compare only a checkerboard pattern (50% of pixels, catches virtually all real changes)

The Pipeline

ScreenCaptureKit capture
    → IOSurface (GPU memory)
    → Metal/wgpu compute shader: compare with previous frame
    → Output: changed_tiles[] bitmap
    → For each changed tile:
        → Extract 64x64 pixel region
        → Compress with fast algorithm (LZ4-like)
        → Packetize for network
    → If too many tiles changed (>60%):
        → Switch to H.264/HEVC full-frame encoding instead

Building It: Metal Compute Shader

Here's a Metal compute shader for tile change detection:

#include <metal_stdlib>
using namespace metal;

constant uint TILE_SIZE = 64;
constant float THRESHOLD = 0.01;  // Summed RGB difference above which a pixel counts as changed

kernel void tileDiff(
    texture2d<float, access::read> prevFrame [[texture(0)]],
    texture2d<float, access::read> currFrame [[texture(1)]],
    device atomic_uint* changedTiles [[buffer(0)]],
    uint2 gid [[thread_position_in_grid]],
    uint2 gridSize [[threads_per_grid]]
) {
    // Which tile does this pixel belong to?
    uint tileX = gid.x / TILE_SIZE;
    uint tileY = gid.y / TILE_SIZE;
    uint tilesPerRow = (gridSize.x + TILE_SIZE - 1) / TILE_SIZE;
    uint tileIdx = tileY * tilesPerRow + tileX;
    
    // Bounds check
    if (gid.x >= prevFrame.get_width() || gid.y >= prevFrame.get_height()) return;
    
    // Compare pixels
    float4 prev = prevFrame.read(gid);
    float4 curr = currFrame.read(gid);
    float diff = abs(prev.r - curr.r) + abs(prev.g - curr.g) + abs(prev.b - curr.b);
    
    // If any pixel in the tile changed, mark the tile
    if (diff > THRESHOLD) {
        atomic_fetch_or_explicit(&changedTiles[tileIdx], 1, memory_order_relaxed);
    }
}

Checkerboard Optimization

Only check every other pixel in a checkerboard pattern — saves 50% GPU work:

kernel void tileDiffCheckerboard(
    texture2d<float, access::read> prevFrame [[texture(0)]],
    texture2d<float, access::read> currFrame [[texture(1)]],
    device atomic_uint* changedTiles [[buffer(0)]],
    uint2 gid [[thread_position_in_grid]]
) {
    // Skip alternate pixels in checkerboard pattern
    if ((gid.x + gid.y) % 2 != 0) return;
    
    // ... same comparison logic as above
}
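A CPU reference implementation of the same tile diff is handy for unit-testing the shader output (a sketch; frames are flat RGBA float arrays, and the function names are ours, not Astropad's):

```swift
// CPU reference for the GPU tile diff: returns one Bool per tile,
// true if any compared pixel's summed RGB difference exceeds `threshold`.
func tileDiffCPU(prev: [Float], curr: [Float],
                 width: Int, height: Int,
                 tileSize: Int = 64, threshold: Float = 0.01,
                 checkerboard: Bool = false) -> [Bool] {
    let tilesPerRow = (width + tileSize - 1) / tileSize
    let tilesPerCol = (height + tileSize - 1) / tileSize
    var changed = [Bool](repeating: false, count: tilesPerRow * tilesPerCol)
    for y in 0..<height {
        for x in 0..<width {
            // Checkerboard mode: skip alternate pixels, mirroring the shader
            if checkerboard && (x + y) % 2 != 0 { continue }
            let p = (y * width + x) * 4  // RGBA stride
            let diff = abs(prev[p]     - curr[p])
                     + abs(prev[p + 1] - curr[p + 1])
                     + abs(prev[p + 2] - curr[p + 2])
            if diff > threshold {
                changed[(y / tileSize) * tilesPerRow + (x / tileSize)] = true
            }
        }
    }
    return changed
}
```

Running both kernels against this reference on a few synthetic frames catches indexing bugs (tile row math, RGBA stride) before they show up as visual artifacts.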

Swift Orchestration

import Foundation
import Metal

class TileEncoder {
    let tileSize = 64
    let commandQueue: MTLCommandQueue
    let tileDiffPipeline: MTLComputePipelineState
    let changedTilesBuffer: MTLBuffer
    var previousTexture: MTLTexture?
    var tilesPerRow = 0
    var tilesPerColumn = 0

    // ChangedTile, allTiles(), and extractTile(from:tileIndex:) are defined elsewhere.
    func processFrame(_ currentTexture: MTLTexture) -> [ChangedTile] {
        tilesPerRow = (currentTexture.width + tileSize - 1) / tileSize
        tilesPerColumn = (currentTexture.height + tileSize - 1) / tileSize

        guard let previous = previousTexture else {
            previousTexture = currentTexture
            return allTiles()  // First frame: send everything
        }

        // Reset the changed-tiles buffer
        memset(changedTilesBuffer.contents(), 0, changedTilesBuffer.length)

        // Run the GPU diff
        let commandBuffer = commandQueue.makeCommandBuffer()!
        let encoder = commandBuffer.makeComputeCommandEncoder()!
        encoder.setComputePipelineState(tileDiffPipeline)
        encoder.setTexture(previous, index: 0)
        encoder.setTexture(currentTexture, index: 1)
        encoder.setBuffer(changedTilesBuffer, offset: 0, index: 0)

        let threadgroupSize = MTLSize(width: 8, height: 8, depth: 1)
        let gridSize = MTLSize(width: currentTexture.width, height: currentTexture.height, depth: 1)
        encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadgroupSize)
        encoder.endEncoding()
        commandBuffer.commit()
        commandBuffer.waitUntilCompleted()  // Blocking is simplest; production code would pipeline frames

        // Read back one UInt32 flag per tile
        let tileCount = tilesPerRow * tilesPerColumn
        let tileData = changedTilesBuffer.contents().bindMemory(to: UInt32.self, capacity: tileCount)

        var changedTiles: [ChangedTile] = []
        for i in 0..<tileCount where tileData[i] != 0 {
            changedTiles.append(extractTile(from: currentTexture, tileIndex: i))
        }

        previousTexture = currentTexture
        return changedTiles
    }
}

Adaptive Codec Switching

The magic is in knowing when to switch between tiles and H.264/HEVC:

class AdaptiveCodecController {
    var consecutiveHighMotionFrames = 0
    var consecutiveIdleFrames = 0
    
    enum CodecMode { case tiles, h26x, idle }
    
    func selectMode(changedRatio: Float) -> CodecMode {
        if changedRatio > 0.6 {
            consecutiveHighMotionFrames += 1
            consecutiveIdleFrames = 0
            if consecutiveHighMotionFrames >= 3 {
                return .h26x  // Video playback, scrolling — use full-frame codec
            }
        } else if changedRatio < 0.01 {
            consecutiveIdleFrames += 1
            consecutiveHighMotionFrames = 0
            if consecutiveIdleFrames >= 5 {
                return .idle  // Nothing changed — don't send anything
            }
        } else {
            consecutiveHighMotionFrames = 0
            consecutiveIdleFrames = 0
        }
        return .tiles  // Default: send only changed tiles
    }
}
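The `changedRatio` input comes straight from the GPU readback. A sketch, assuming the one-UInt32-per-tile buffer layout used by the shader above:

```swift
// Fraction of tiles marked changed in the readback buffer
// (one UInt32 flag per tile; nonzero means changed).
func changedRatio(of tileFlags: [UInt32]) -> Float {
    guard !tileFlags.isEmpty else { return 0 }
    let changed = tileFlags.filter { $0 != 0 }.count
    return Float(changed) / Float(tileFlags.count)
}
```

The hysteresis counters matter: requiring 3 consecutive high-motion frames before switching to H.26x avoids thrashing between codecs when a single busy frame (e.g. a window animation) spikes the ratio.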

Tile Compression

For the tile codec path, each 64x64 BGRA tile is 16KB uncompressed. Options:

| Algorithm | Compress Speed | Decompress Speed | Ratio | Best For |
|-----------|----------------|------------------|-------|----------|
| Raw (no compression) | ∞ | ∞ | 1:1 | Gigabit LAN |
| LZ4 | 780 MB/s | 4970 MB/s | 2-3x | LAN streaming |
| zstd (level 1) | 510 MB/s | 1380 MB/s | 3-4x | WiFi streaming |
| zstd (level 3) | 200 MB/s | 1380 MB/s | 4-5x | WAN streaming |

LZ4 is the clear winner for LAN: decompression runs at several GB/s, approaching memory-copy speed, and the 2-3x compression ratio still cuts bandwidth by more than half.
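On Apple platforms the system `Compression` framework provides LZ4 without a third-party dependency. A round-trip sketch for one tile (note that `COMPRESSION_LZ4` is Apple's framed LZ4 variant, not the raw LZ4 block format, so the wire peer must agree on framing):

```swift
import Compression

// Compress one tile with LZ4 via Apple's Compression framework.
func lz4Compress(_ input: [UInt8]) -> [UInt8] {
    // Worst case: incompressible data plus framing overhead
    var dst = [UInt8](repeating: 0, count: input.count + 1024)
    let written = compression_encode_buffer(&dst, dst.count,
                                            input, input.count,
                                            nil, COMPRESSION_LZ4)
    return Array(dst.prefix(written))
}

// Decompress back into a buffer of known original size (16 KB per tile).
func lz4Decompress(_ input: [UInt8], originalSize: Int) -> [UInt8] {
    var dst = [UInt8](repeating: 0, count: originalSize)
    let written = compression_decode_buffer(&dst, dst.count,
                                            input, input.count,
                                            nil, COMPRESSION_LZ4)
    return Array(dst.prefix(written))
}
```

Because the receiver always knows a tile decompresses to exactly 64x64x4 bytes, the destination buffer can be preallocated once and reused per tile.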

Wire Format

Each tile packet:

[4B] frame_id
[1B] codec_mode (0=tile, 1=h26x)
[2B] tile_x, tile_y (grid coordinates)
[2B] compressed_size
[NB] compressed_tile_data (LZ4)

The client maintains a tile grid and updates individual tiles as they arrive. Even if some tiles are lost (UDP/QUIC datagrams), the rest of the screen remains correct — only the lost tiles show stale content until the next update.
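A serializer for the header above might look like this (a sketch: field order as listed, big-endian integers, and one byte per grid coordinate, which is our assumption about how the 2-byte coordinate field splits):

```swift
// Build one tile packet: 9-byte header followed by the LZ4 payload.
func makeTilePacket(frameID: UInt32, codecMode: UInt8,
                    tileX: UInt8, tileY: UInt8,
                    payload: [UInt8]) -> [UInt8] {
    var packet = [UInt8]()
    packet.reserveCapacity(9 + payload.count)
    // [4B] frame_id, big-endian
    withUnsafeBytes(of: frameID.bigEndian) { packet.append(contentsOf: $0) }
    // [1B] codec_mode (0 = tile, 1 = h26x)
    packet.append(codecMode)
    // [2B] tile grid coordinates, one byte each (assumption)
    packet.append(tileX)
    packet.append(tileY)
    // [2B] compressed_size, big-endian
    withUnsafeBytes(of: UInt16(payload.count).bigEndian) { packet.append(contentsOf: $0) }
    // [NB] compressed tile data
    packet.append(contentsOf: payload)
    return packet
}
```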

Latency Breakdown

| Stage | Full-Frame H.264 | Tile-Based |
|-------|------------------|------------|
| Capture | 1ms | 1ms |
| Diff/Encode | 2ms (full encode) | 0.5ms (GPU diff + LZ4) |
| Network (LAN) | 1ms (large frame) | 0.2ms (small tiles) |
| Decode/Render | 2ms (full decode) | 0.3ms (tile blit) |
| Total | ~6ms | ~2ms |

The tile approach wins because less data means less encoding time, less network time, and less decoding time. The GPU diff adds negligible overhead.


Part 6 of the "Building a Remote Desktop from Scratch" series. Based on reverse engineering analysis of Astropad Workbench 1.1.0.
