Blog

How we rebuilt our Electron recording engine in Swift

Arthur Guiot

Our desktop app captures meetings without a bot and streams them to the cloud. For months, the recording engine was the hardest part of the product to make reliable. We'd fix one class of edge case, ship it, and a new one would surface the next week. Different root causes, same pattern.

The engine ran in the render process of our Electron app. We tried the obvious fixes: tighter lifecycle management, moving work off the main thread, isolating it from React's render cycle. Each change helped at the margin, but none addressed the real issue. A render process is the wrong place to do realtime audio and video capture. A capture engine can't tolerate GC pauses, throttling, or any of the other things a browser runtime does to stay responsive.

So we went native: ScreenCaptureKit on macOS, libobs on Windows, and a shared Swift layer tying it together.

Atomic: our Combine-to-Jotai bridge

Bridging a native runtime to React usually means writing native addon bindings by hand. You serialize every value that crosses the boundary, route events through stringly-typed names, and update three files whenever you add a property: the Swift class, the C++ binding, and the TypeScript wrapper. It works, but it's out of sync the moment anyone forgets a step.

What if every @Published property in Swift automatically became a Jotai atom in React? Fully reactive, type-safe, no glue code. That's what our internal tool Atomic does.

// Swift
@NodeExport
public final class AudioPlayer {
    @Published public var isPlaying: Bool = false
    @Published public var volume: Float = 1.0

    public func play() { isPlaying = true }
    public func pause() { isPlaying = false }
}

#AtomicExport(AudioPlayer.self)

// TypeScript
const player = new AudioPlayer();
const volumeAtom = atomWithNativeState<number>(player.volume);

store.set(volumeAtom, 0.5);     // Flows into Swift.
player.play();                  // Updates flow back into React.

From React's perspective, these atoms are indistinguishable from any other Jotai atom. The fact that the data lives in a Swift runtime on a different thread is invisible.

The @NodeExport macro generates the entire bridge at compile time. Types map automatically (Int → number, String? → string | null). Value changes in Swift schedule callbacks on Node's event loop. Every new property we add on the Swift side is instantly available in React. And because Atomic is built on Swift, not Apple frameworks (we use OpenCombine on Windows), the same bridge runs on both platforms.
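The shape of the generated wrapper can be sketched in plain TypeScript. This is a hypothetical reconstruction, not Atomic's actual source: NativeProperty stands in for whatever handle the compiled binding exposes for a @Published property, with the real implementation marshalling values across the Swift boundary instead of holding them in memory.

```typescript
// Hypothetical handle for one @Published property. The real bridge calls
// into Swift; here an in-memory value makes the pattern visible.
type Listener<T> = (value: T) => void;

class NativeProperty<T> {
  private listeners = new Set<Listener<T>>();
  constructor(private value: T) {}
  get(): T { return this.value; }
  set(next: T): void {                     // would marshal the value into Swift
    this.value = next;
    this.listeners.forEach((l) => l(next));
  }
  subscribe(l: Listener<T>): () => void {  // Swift -> JS change events land here
    this.listeners.add(l);
    return () => this.listeners.delete(l);
  }
}

// An atom over a native property is then a thin read/write/subscribe wrapper;
// wiring it into a real Jotai store is omitted here for brevity.
function atomWithNativeState<T>(prop: NativeProperty<T>) {
  return {
    get: () => prop.get(),
    set: (v: T) => prop.set(v),
    subscribe: (l: Listener<T>) => prop.subscribe(l),
  };
}

const volume = new NativeProperty<number>(1.0);
const volumeAtom = atomWithNativeState(volume);
volumeAtom.set(0.5);   // flows "into Swift"; subscribers see the change
```

The key property of the design is that subscription, not polling, carries changes in both directions, which is what makes the atoms behave like any other Jotai atom.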

Two capture engines, one interface

On macOS, ScreenCaptureKit gives us hardware-accelerated capture and the native content picker. On Windows, we use libobs through a Swift wrapper we call OBSKit. The two engines have fundamentally different architectures.

On macOS, we receive raw sample buffers from three independent sources and assemble the file ourselves. On Windows, capture, mixing, encoding, and muxing run as a single graph. We configure the graph once, and a file monitor streams newly written bytes to our upload session.
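The Windows file monitor reduces to tailing the output file: track the last offset you shipped, and whenever the file grows, hand the new byte range to the upload session. A minimal sketch, with onBytes standing in for the real upload call:

```typescript
import { openSync, readSync, statSync, closeSync } from "node:fs";

// Read any bytes written past `lastOffset` and pass them to `onBytes`.
// Returns the new offset; call periodically or from an fs.watch callback.
function drainNewBytes(
  path: string,
  lastOffset: number,
  onBytes: (chunk: Buffer) => void
): number {
  const size = statSync(path).size;
  if (size <= lastOffset) return lastOffset; // nothing new yet
  const fd = openSync(path, "r");
  try {
    const buf = Buffer.alloc(size - lastOffset);
    readSync(fd, buf, 0, buf.length, lastOffset); // positional read of the tail
    onBytes(buf);                                 // stream this range upward
    return size;
  } finally {
    closeSync(fd);
  }
}
```

Because the offset is the only state, the monitor is trivially restartable: persist it, and an interrupted stream picks up exactly where it left off.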

Windows capture has its own challenges. We use Windows Graphics Capture (WGC) as the primary method, and if it doesn't deliver frames in time, we fall back to BitBlt. We also detect all-black frames (common with some emulated windows or games) and switch methods mid-recording.
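The fallback policy is simple to state: count consecutive missed frames, switch methods when the count crosses a budget, and switch immediately on an all-black frame. A sketch under stated assumptions — the class names, the 30-frame budget, and the luma threshold are illustrative, and the real engine talks to WGC and GDI rather than these stubs:

```typescript
type CaptureMethod = "wgc" | "bitblt";

interface Frame { pixels: Uint8Array; }

// Near-zero values across the whole frame count as "black" — some emulated
// windows render into a surface the primary method can't see.
function isAllBlack(frame: Frame, threshold = 8): boolean {
  return frame.pixels.every((p) => p < threshold);
}

class CaptureSelector {
  method: CaptureMethod = "wgc";
  private missedFrames = 0;

  // Called once per expected frame interval; `frame` is null when the
  // current method delivered nothing in time.
  onTick(frame: Frame | null, maxMisses = 30): CaptureMethod {
    if (frame === null) {
      if (++this.missedFrames >= maxMisses) this.switchTo("bitblt");
    } else if (isAllBlack(frame)) {
      this.switchTo(this.method === "wgc" ? "bitblt" : "wgc");
    } else {
      this.missedFrames = 0; // healthy frame resets the stall counter
    }
    return this.method;
  }

  private switchTo(next: CaptureMethod) {
    this.method = next;
    this.missedFrames = 0;
  }
}
```

Keeping the decision per-tick is what makes mid-recording switches possible: no restart, just a different source feeding the same pipeline on the next frame.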

When clocks disagree

This is where the macOS engine earns its complexity. Three capture sources, three hardware clocks, three different ideas of what time it is.

Both audio sources get timestamped and converted to a global frame index. The mixer drains both queues in lockstep, only producing output when both have enough data. If one source stalls (muted mic, frozen virtual device), the mixer detects it after 500ms and switches to single-source mode until it resumes.
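In miniature, the lockstep drain looks like the sketch below. It is illustrative, not the engine's code: samples are mono numbers instead of interleaved buffers, the caller supplies how long a queue has been stalled, and mixing is a plain average.

```typescript
class LockstepMixer {
  private mic: number[] = [];
  private system: number[] = [];
  singleSource = false;

  pushMic(samples: number[]) { this.mic.push(...samples); }
  pushSystem(samples: number[]) { this.system.push(...samples); }

  // Produce output only for the span both queues cover. If one queue has
  // been empty past the stall budget, pass the live source through alone.
  drain(stalledForMs: number, stallBudgetMs = 500): number[] {
    const n = Math.min(this.mic.length, this.system.length);
    if (n === 0 && stalledForMs >= stallBudgetMs) {
      this.singleSource = true;
      const live = this.mic.length > 0 ? this.mic : this.system;
      return live.splice(0, live.length);
    }
    this.singleSource = false;
    const out: number[] = [];
    for (let i = 0; i < n; i++) {
      out.push((this.mic[i] + this.system[i]) / 2); // naive 2-source mix
    }
    this.mic.splice(0, n);
    this.system.splice(0, n);
    return out;
  }
}
```

The important invariant is that output length is bounded by the shorter queue, so a stalled source can never force the mixer to invent samples; single-source mode is an explicit, detectable state rather than a silent degradation.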

Then there's a subtler problem: audio drivers that lie about their sample rate. Some virtual drivers report 48kHz but deliver buffers at 44.1kHz. Over a 30-minute meeting, this drift becomes audible. Our fix is confidence-based correction: we measure actual buffer cadence, and if it consistently disagrees with the reported format across three consecutive buffers, we reinterpret the stream at the correct rate with a crossfade to avoid clicks.
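The detection half of that fix can be sketched as follows. The 2% tolerance and the candidate-rate list are our illustrative choices here, not the shipped values; the crossfaded reinterpretation is omitted.

```typescript
// Rates we are willing to snap to when the reported format loses our trust.
const CANDIDATE_RATES = [44100, 48000];

class RateDetector {
  private disagreements = 0;
  correctedRate: number | null = null;

  // framesPerBuffer / elapsedSeconds is the rate the device actually runs at,
  // regardless of what the driver claims in its stream description.
  onBuffer(framesPerBuffer: number, elapsedSeconds: number, reportedRate: number) {
    const measured = framesPerBuffer / elapsedSeconds;
    const matchesReported =
      Math.abs(measured - reportedRate) / reportedRate < 0.02;
    if (matchesReported) {
      this.disagreements = 0; // one good buffer resets confidence
      return;
    }
    if (++this.disagreements >= 3) {
      // Snap to the nearest known-good rate rather than the noisy measurement.
      this.correctedRate = CANDIDATE_RATES.reduce((best, r) =>
        Math.abs(r - measured) < Math.abs(best - measured) ? r : best
      );
    }
  }
}
```

Requiring three consecutive disagreements is what makes the correction confidence-based: a single late buffer from a scheduling hiccup looks exactly like a lying driver for one measurement, but not for three in a row.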

On Windows, most of this complexity is abstracted away by the capture engine's internal mixer. The tradeoff is control: on macOS we detect and fix edge cases like lying drivers ourselves; on Windows we trade that granularity for simplicity.

Recordings that survive crashes

A regular MP4 writes its metadata at the end of the file. Crash before that, and the recording is gone. On both platforms, we use fragmented MP4 instead.

Each segment is self-contained. A crash at minute 30 loses at most the last second. Segments go to both local storage and the cloud simultaneously. If the network drops, segments persist locally and the upload resumes automatically when connectivity returns.
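The dual-sink discipline reduces to one rule: a segment is persisted locally before it ever enters the upload queue, and it leaves the queue only after the upload succeeds. A sketch, with persist and upload as stand-ins for the real disk and network sinks:

```typescript
interface Segment { index: number; bytes: Uint8Array; }

class SegmentPipeline {
  private pending: Segment[] = [];

  constructor(
    private persist: (s: Segment) => void,
    private upload: (s: Segment) => Promise<void>
  ) {}

  async onSegment(seg: Segment): Promise<void> {
    this.persist(seg);      // local copy first: a crash can't lose this segment
    this.pending.push(seg);
    await this.flush();
  }

  // Works through the queue in order; call again when connectivity returns.
  async flush(): Promise<void> {
    while (this.pending.length > 0) {
      try {
        await this.upload(this.pending[0]);
        this.pending.shift(); // drop a segment only once its upload succeeded
      } catch {
        return;               // network down: keep the queue, retry later
      }
    }
  }
}
```

Because the queue's head only advances on success, resuming after a network drop is just calling flush again; no segment is ever in-flight-only.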

Desktop recording used to be one of our most common sources of support tickets. Now it's a boring part of the app that just works. The entire rewrite shipped in two months, and Atomic is why: once the bridge existed, adding a feature meant writing Swift and watching the UI update in real time.

Not everything belongs in a render process. Sometimes you need to go native.

If taking on problems like this sounds interesting to you, consider joining us.
