GPU-Powered Video Editor for Storybeat

When I joined Storybeat, we were a tiny team of two developers. The app existed as a basic MVP built on top of FFmpeg, but it was slow. Rendering a 15-second video could take over a minute. Users would tap “Export” and wait, staring at a progress bar, hoping their phone wouldn’t go to sleep before it finished.

We knew this wasn’t sustainable. As we planned new features—transitions, filters, stickers, text overlays—the FFmpeg approach would only get slower. We needed a different foundation.

I started looking into OpenGL as an alternative. The idea was straightforward: if video frames already live on the GPU, why move them to the CPU just to move them back? It turned out to be a much deeper rabbit hole than expected.

Why OpenGL?

Using the regular Android UI framework for an editor means decoding frames to bitmaps, manipulating them on the CPU, then encoding them back. This works, but it’s slow and memory-hungry. Every frame makes a round trip: GPU → CPU → GPU.

OpenGL lets you stay on the GPU. Decode a video frame? It lands in a texture. Apply a filter? That’s a shader. Composite multiple layers? More shaders. Encode the result? The encoder reads directly from the GPU surface.

No round trips. No bitmap allocations. No waiting.
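To make the “a filter is just a shader” point concrete, here’s roughly what a grayscale filter looks like when the decoded frame lives in an external OES texture (a minimal sketch, not code from our pipeline):

// Minimal sketch, not from the Storybeat codebase. A grayscale filter is a
// one-line fragment shader over the decoded frame (a GL_TEXTURE_EXTERNAL_OES texture).
object GrayscaleFilterShader {
    const val FRAGMENT = """
        #extension GL_OES_EGL_image_external : require
        precision mediump float;
        varying vec2 vTextureCoord;          // interpolated texture coordinate
        uniform samplerExternalOES uTexture; // the decoded video frame

        void main() {
            vec4 color = texture2D(uTexture, vTextureCoord);
            float gray = dot(color.rgb, vec3(0.299, 0.587, 0.114)); // luminance weights
            gl_FragColor = vec4(vec3(gray), color.a);
        }
    """
}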

But OpenGL comes with complexity. You’re managing EGL contexts, shader programs, texture lifecycles, and frame synchronization. Everything is positioned, scaled, and rotated through matrix transformations. It’s a steep learning curve for what seems like a simple goal: make videos render faster.
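Even something as simple as placing a sticker, a couple of translation and scale properties in the View world, becomes a model matrix you assemble by hand. A rough sketch (not our actual transform code):

import android.opengl.Matrix

// Rough sketch: build a model matrix that moves a layer to (x, y),
// rotates it around the Z axis and scales it before drawing.
fun buildModelMatrix(x: Float, y: Float, scale: Float, rotationDegrees: Float): FloatArray {
    val model = FloatArray(16)
    Matrix.setIdentityM(model, 0)
    Matrix.translateM(model, 0, x, y, 0f)                 // move into place
    Matrix.rotateM(model, 0, rotationDegrees, 0f, 0f, 1f) // rotate around Z
    Matrix.scaleM(model, 0, scale, scale, 1f)             // resize
    return model
}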

It took me a while to feel comfortable in that world, but the investment started paying off once the basic pipeline was in place.

How the Architecture Evolved

After months of iteration, two patterns emerged that shaped the rest of the project. We built on top of ChillingVan’s android-openGL-canvas library, which provided an excellent Canvas-like API for OpenGL operations. But the real discovery was figuring out how to extend it for shared contexts between different rendering surfaces.

Idea #1: One Canvas, Many Surfaces

Every video editor faces the same problem: you need to render the same content to different places.

The naive approach is to write separate rendering code for each. We tried this initially and quickly ran into the classic problem: “why does my export look different from the preview?”

The solution we landed on was abstracting the rendering target behind a common interface. We called it ICanvasGL:

interface ICanvasGL {
    val width: Int
    val height: Int

    fun drawBitmap(bitmap: Bitmap, x: Int, y: Int, w: Int, h: Int, filter: TextureFilter)
    fun drawRect(rect: Rect, paint: GLPaint)
    fun save()
    fun restore()
    fun rotate(degrees: Float, px: Float, py: Float)
    fun clearBuffer(color: Int)
}

This looks like Android’s regular Canvas API, but it runs on OpenGL. Any code that draws to ICanvasGL works identically whether the target is a TextureView on screen, an off-screen buffer for recording, or a small surface for thumbnail generation.
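For example, a hypothetical watermark renderer (not one of our actual classes) would draw itself the same way on every target:

class WatermarkRenderer(private val watermark: Bitmap, private val filter: TextureFilter) {

    // Draws the watermark in the bottom-right corner of whatever surface the
    // canvas represents (screen, recording buffer, or thumbnail).
    fun drawOn(canvas: ICanvasGL) {
        val size = canvas.width / 6
        val margin = size / 4
        canvas.drawBitmap(
            watermark,
            canvas.width - size - margin,
            canvas.height - size - margin,
            size,
            size,
            filter
        )
    }
}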

We built a family of renderers around this interface — each one responsible for a different content type (photos, video placeholders, animated stickers, transitions), but all of them drawing through ICanvasGL without knowing what’s on the other side:

flowchart TB
    classDef green fill:#c0f8d066,stroke:#c0f8d0,stroke-width:2px,color:black
    classDef blue fill:#c7dcfc66,stroke:#c7dcfc,stroke-width:2px,color:black
    classDef yellow fill:#f7f8c066,stroke:#f7f8c0,stroke-width:2px,color:black
    classDef red fill:#ffc0cb66,stroke:#ffc0cb,stroke-width:2px,color:black
    linkStyle default stroke:black,stroke-width:2px,color:black

    subgraph Input["Input Sources"]
        Video["Video Files"]
        Photos["Photos"]
        Stickers["Stickers/GIFs"]
    end

    subgraph Renderers["Renderer Family"]
        direction TB
        PlaceholderR["PlaceholderRenderer<br/>Video/Photo placeholders"]
        SlideshowR["SlideshowRenderer<br/>Photo transitions"]
        OverlayR["OverlayRenderer<br/>Stickers, GIFs, Text"]
    end

    subgraph Canvas["ICanvasGL Interface"]
        DrawBitmap["drawBitmap()"]
        DrawRect["drawRect()"]
        DrawTexture["drawTexture()"]
    end

    subgraph Targets["Rendering Targets"]
        Screen["StoryRendererView<br/>(On-screen preview)"]
        OffScreen["OffScreenCanvas<br/>(Video recording)"]
        Snapshot["SnapshotCanvas<br/>(Thumbnails)"]
    end

    Input:::green --> Renderers
    Renderers:::blue --> Canvas
    Canvas:::yellow --> Targets
    Targets:::red
    
    class Input green
    class Renderers blue
    class Canvas yellow
    class Targets red

The key property: no renderer knows or cares where it’s drawing. Screen? Video file? Thumbnail? Same code, same result.

Where this matters most is time-sensitive content. Take animated GIFs — each sticker needs to show the right frame for a given moment in the video:

class OverlayRenderer(
    private val layers: List<Layer>,
    private val gifFrameProvider: GifFrameProvider,
    private val bitmapProvider: BitmapProvider
) {

    fun drawOn(canvas: ICanvasGL, presentationTimeMs: Milliseconds) {
        layers.forEach { layer ->
            // Only render if within layer's time span
            if (presentationTimeMs in layer.startAt..layer.stopAt) {
                when (layer) {
                    is Layer.Sticker -> {
                        // Get the correct GIF frame for this exact timestamp
                        val frame = gifFrameProvider.getNextFrame(presentationTimeMs - layer.startAt)
                        drawTexture(canvas, frame, layer)
                    }
                    else -> drawBitmap(canvas, layer.data, layer)
                }
            }
        }
    }
}

During editing, presentationTimeMs comes from the system clock. During recording, it comes from the video encoder’s frame counter. The renderer doesn’t care — it draws the right frame for the given timestamp. This small detail turned out to be surprisingly important: it meant that GIF stickers, transition effects, and every other time-dependent element stayed perfectly synchronized between preview and export, without any extra logic on our side.
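Conceptually, the only thing that changes between the two modes is the clock. Something like this (names here are illustrative, not our exact code):

// Illustrative sketch: the renderer just receives a timestamp; only its source differs.

// Editing: the preview loop derives the timestamp from wall-clock time.
fun previewTimestampMs(playbackStartedAtMs: Long): Long =
    android.os.SystemClock.elapsedRealtime() - playbackStartedAtMs

// Recording: the export loop derives it from the frame counter and target FPS,
// so frame N always lands at exactly N * (1000 / fps) milliseconds.
fun recordingTimestampMs(frameIndex: Long, fps: Int): Long =
    frameIndex * 1000L / fps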

With this abstraction working, we had a clean rendering layer. But there was still a problem: how do you share GPU resources between the editing surface and the recording surface?

Idea #2: Discovering Shared OpenGL Contexts

This part took me longer to figure out than I’d like to admit.

When you create an OpenGL context, you get your own isolated world of textures, shaders, and buffers. If you create a second context for recording, it can’t see anything from the first. You’d have to copy textures between them—exactly the kind of GPU → CPU → GPU round trip we were trying to avoid.

I spent a few days stuck on this until I found that OpenGL has a feature called context sharing. Two contexts can share the same resources if you set them up correctly. Once I understood how it worked, a lot of the architecture fell into place.
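Stripped of all the library wrappers, the whole feature comes down to one parameter of eglCreateContext. A rough sketch with EGL14 (our code goes through the library’s EGL10-based GLThread instead):

import android.opengl.EGL14
import android.opengl.EGLConfig
import android.opengl.EGLContext
import android.opengl.EGLDisplay

// The third argument of eglCreateContext is the context to share resources with.
// Pass EGL_NO_CONTEXT and you get an isolated context; pass an existing context
// and textures, shaders and buffers become visible to both.
fun createSharedContext(
    display: EGLDisplay,
    config: EGLConfig,
    shareWith: EGLContext // e.g. the editing context
): EGLContext {
    val attribs = intArrayOf(
        EGL14.EGL_CONTEXT_CLIENT_VERSION, 2, // OpenGL ES 2.0
        EGL14.EGL_NONE
    )
    return EGL14.eglCreateContext(display, config, shareWith, attribs, 0)
}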

flowchart TB
    classDef green fill:#c0f8d066,stroke:#c0f8d0,stroke-width:2px,color:black
    classDef blue fill:#c7dcfc66,stroke:#c7dcfc,stroke-width:2px,color:black
    classDef yellow fill:#f7f8c066,stroke:#f7f8c0,stroke-width:2px,color:black
    classDef red fill:#ffc0cb66,stroke:#ffc0cb,stroke-width:2px,color:black
    linkStyle default stroke:black,stroke-width:2px,color:black

    subgraph EGL["EGL Layer"]
        EGLDisplay["EGLDisplay"]
        EGLConfig["EGLConfig"]
    end

    subgraph SharedResources["Shared OpenGL Resources"]
        direction LR
        Textures["Video Textures"]
        Shaders["Compiled Shaders"]
        Filters["Filter Programs"]
        FBOs["Framebuffer Objects"]
    end

    subgraph EditContext["Editing Context"]
        EditSurface["EGLSurface<br/>(Window Surface)"]
        EditThread["GL Thread #1"]
        TextView["TextureView<br/>(On-screen)"]
    end

    subgraph RecordContext["Recording Context"]
        RecordSurface["EGLSurface<br/>(PBuffer Surface)"]
        RecordThread["GL Thread #2"]
        Encoder["MediaCodec<br/>Encoder Surface"]
    end

    EGLDisplay --> EditContext
    EGLDisplay --> RecordContext
    EGLConfig --> EditContext
    EGLConfig --> RecordContext

    SharedResources <--> EditContext
    SharedResources <--> RecordContext

    EditSurface --> TextView
    RecordSurface --> Encoder

    style SharedResources fill:#c7dcfc66,stroke:#c7dcfc,stroke-width:2px,color:black
    style EditContext fill:#f7f8c066,stroke:#f7f8c0,stroke-width:2px,color:black
    style RecordContext fill:#ffc0cb66,stroke:#ffc0cb,stroke-width:2px,color:black

When a video decoder produces a frame, it writes to a texture. That texture is visible to both contexts. During editing, we read it and composite it onto the screen. During recording, we read the same texture and composite it into the export.

No copying. No duplication. One texture, two readers.
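The decoder-to-texture handoff itself is standard Android plumbing, roughly this (a simplified sketch, not our exact code):

import android.graphics.SurfaceTexture
import android.media.MediaCodec
import android.media.MediaFormat
import android.opengl.GLES11Ext
import android.opengl.GLES20
import android.view.Surface

// Sketch: the decoder writes straight into a GL texture via a SurfaceTexture,
// so the frame never leaves the GPU. Must run on a thread with a current GL context.
fun attachDecoderToGlTexture(decoder: MediaCodec, format: MediaFormat): SurfaceTexture {
    // 1. Create an external OES texture in the (shared) GL context
    val ids = IntArray(1)
    GLES20.glGenTextures(1, ids, 0)
    GLES20.glBindTexture(GLES11Ext.GL_TEXTURE_EXTERNAL_OES, ids[0])

    // 2. Wrap it in a SurfaceTexture and hand its Surface to the decoder
    val surfaceTexture = SurfaceTexture(ids[0])
    decoder.configure(format, Surface(surfaceTexture), null, 0)

    // 3. After each decoded frame, updateTexImage() latches it into the texture
    //    (typically triggered from setOnFrameAvailableListener, on the GL thread)
    return surfaceTexture
}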

Here’s how we set up the shared context:

class MultiTexOffScreenCanvas(
    var width: Int = 0,
    var height: Int = 0,
    private val initialTexCount: Int = 0,
    renderMode: Int = GLThread.RENDERMODE_WHEN_DIRTY,
    sharedEglContext: EglContextWrapper = EglContextWrapper.EGL_NO_CONTEXT_WRAPPER
) : GLViewRenderer {

    private val mGLThread: GLThread

    init {
        val builder = GLThread.Builder()
            .setRenderMode(renderMode)
            .setSharedEglContext(sharedEglContext)  // Share with editing context
            .setRenderer(this)

        mGLThread = builder.createGLThread()
        mGLThread.start()
    }
}

The recording surface receives the editing context’s EglContextWrapper. From that point on, they share resources: textures created in one context are visible in the other, shaders compiled once are used everywhere, and video frames stay on the GPU throughout the entire pipeline.
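In practice the hand-off is small: the editing view exposes the EglContextWrapper it was created with, and we pass it into the off-screen canvas when recording starts. Roughly like this (the listener name is from memory and may differ in the current version of the library):

// Sketch: callback and property names are approximate.
storyRendererView.setOnCreateGLContextListener { eglContext ->
    // Keep the editing context around; recording will share it later.
    sharedEglContext = eglContext
}

// Later, when the user taps Export:
val recordingCanvas = MultiTexOffScreenCanvas(
    width = exportWidth,
    height = exportHeight,
    initialTexCount = videos.size,
    sharedEglContext = sharedEglContext // resources now shared with the preview
)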

The initialization sequence looks like this — the important moment is when the recording context is created with a reference to the editing context:

sequenceDiagram
    participant Edit as StoryRendererView
    participant EGL as EGL System
    participant Record as OffScreenCanvas
    participant Encoder as MediaCodec

    rect rgba(192, 248, 208, 0.4)
    Note over Edit,Encoder: Initialization Phase

    Edit->>EGL: eglCreateContext(display, config, EGL_NO_CONTEXT)
    EGL-->>Edit: editContext
    Edit->>EGL: eglCreateWindowSurface(display, config, textureView)
    EGL-->>Edit: editSurface
    end

    rect rgba(199, 220, 252, 0.4)
    Note over Edit,Encoder: Recording Phase

    Edit->>Record: Create OffScreenCanvas(sharedEglContext)
    Record->>EGL: eglCreateContext(display, config, editContext)
    Note right of EGL: Pass editContext as share_context
    EGL-->>Record: recordContext (shares resources)

    Record->>EGL: eglCreatePbufferSurface(width, height)
    Note right of EGL: Off-screen, no window needed
    EGL-->>Record: recordSurface

    Record->>Encoder: Create encoder input surface
    end

    rect rgba(247, 248, 192, 0.4)
    Note over Edit,Encoder: Shared Resource Access

    Edit->>EGL: Upload video texture
    Note over EGL: Texture visible in BOTH contexts
    Record->>EGL: Read same video texture (no copy)
    Record->>Encoder: Render to encoder surface
    end

There’s a subtle detail in that sequence worth highlighting. The recording context doesn’t render to a window — there’s nothing to display on screen. Instead, it uses an EGL PBuffer surface: an off-screen, in-memory rendering target. This is how we render frames that go directly to the video encoder without ever touching the display:

private inner class SurfaceFactory : GLThread.EGLWindowSurfaceFactory {
    override fun createWindowSurface(
        egl: EGL10,
        display: EGLDisplay,
        config: EGLConfig,
        nativeWindow: Any?
    ): EGLSurface {
        val attributes = intArrayOf(
            EGL10.EGL_WIDTH, width,
            EGL10.EGL_HEIGHT, height,
            EGL10.EGL_NONE
        )
        // PBuffer = off-screen, in-memory surface (no window required)
        return egl.eglCreatePbufferSurface(display, config, attributes)
    }
}

How Recording Works

With those two patterns in place, the recording side turned out to be simpler than I initially expected.

The StoryRendererView — our main editing surface — implements a CanvasFrameRecorder interface, making it both the preview renderer and the recording source. The core idea is that both paths converge on the same drawing function:

// Pseudocode — simplified from actual implementation
class StoryRendererView : GLMultiTexProducerView, CanvasFrameRecorder {

    // During editing (60 FPS): canvas targets the screen
    override fun onGLDraw(canvas: ICanvasGL, textures: List<GLTexture>) {
        drawFrame(canvas, textures, elapsedTime)
    }

    // During recording (30 FPS): canvas targets the encoder
    override fun drawFrameForRecording(canvas: ICanvasGL, textures: List<GLTexture>, timeMs: Milliseconds) {
        drawFrame(canvas, textures, timeMs)
        overlayRenderer.drawOn(canvas, timeMs)
    }

    // Same renderers, same logic, different target
    private fun drawFrame(canvas: ICanvasGL, textures: List<GLTexture>, timeMs: Milliseconds) {
        canvas.clearBuffer(backgroundColor)
        layers.forEach { layer ->
            layer.renderer.drawOn(canvas, layer, timeMs)
        }
    }
}

Both onGLDraw and drawFrameForRecording call the same drawFrame method. The only difference is where the canvas points and where the timestamp comes from. This is where the ICanvasGL abstraction and the shared context come together — the renderers don’t need to know anything about the recording pipeline.

The recording pipeline then orchestrates the rest — setting up the encoder, connecting the decoder outputs to shared textures, and driving the frame-by-frame loop:

internal class CanvasRecorder(
    private val mediaMuxer: EncodeMediaMuxer,
    private val fileManager: FileManager
) {
    private lateinit var encoder: Encoder
    private lateinit var outputSurface: VideoRenderOutputSurface
    private lateinit var offscreenCanvas: OffScreenCanvas
    private lateinit var videoDecoders: List<VideoDecoder>

    fun init(width: Int, height: Int, videos: List<Video>, onDrawListener: OnCanvasDrawCallback?) {
        // 1. Create H.264 encoder
        encoder = DualMediaCodec()
        encoder.init(getOutputMediaFormat(width, height), MediaCodecActionType.ENCODE)

        // 2. Wrap encoder's input surface for EGL rendering
        outputSurface = VideoRenderOutputSurface(encoder.createInputSurface())

        // 3. Create off-screen canvas with SHARED context
        offscreenCanvas = OffScreenCanvas(width, height, videos.size)
        offscreenCanvas.onDrawListener = ::onGLDraw

        // 4. Video decoders output directly to canvas textures
        videoDecoders = videos.mapIndexed { index, video ->
            VideoDecoder(fileManager).apply {
                init(video, Surface(offscreenCanvas.producedTextureList[index].surfaceTexture))
            }
        }
    }

    fun processNextFrame(): Int {
        // Decode video frame → shared texture
        val drawResult = videoDecoders.first().processNextFrame()

        if (drawResult == RESULT_FRAME_PROCESSED) {
            // Composite all layers via the same renderers used for editing
            offscreenCanvas.drawFrame()

            // Send to encoder with precise timestamp
            outputSurface.setPresentationTime(elapsedTimeUs * 1000)
            outputSurface.swapBuffers()
        }

        return writeEncodedOutputFrame(durationUs, encoder)
    }
}
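The caller then just drives this loop until the encoder reports end of stream, conceptually something like this (simplified; the RESULT_EOS constant here is illustrative):

// Simplified driver: constant and callback names are illustrative.
fun record(recorder: CanvasRecorder, onFrameWritten: (Int) -> Unit) {
    var frames = 0
    var result: Int
    do {
        result = recorder.processNextFrame()
        onFrameWritten(++frames)
    } while (result != RESULT_EOS) // stop once the encoder has drained
}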

Looking at the complete picture, every frame follows the same path — from decoding through shared textures to either the screen or the encoder. There’s no branching logic, no special cases for “recording mode”. The architecture handles it:

flowchart TB
    classDef green fill:#c0f8d066,stroke:#c0f8d0,stroke-width:2px,color:black
    classDef blue fill:#c7dcfc66,stroke:#c7dcfc,stroke-width:2px,color:black
    classDef yellow fill:#f7f8c066,stroke:#f7f8c0,stroke-width:2px,color:black
    classDef red fill:#ffc0cb66,stroke:#ffc0cb,stroke-width:2px,color:black
    linkStyle default stroke:black,stroke-width:2px,color:black

    subgraph Decode["Video Decoding"]
        MediaExtractor["MediaExtractor"]
        VideoDecoder["VideoDecoder<br/>(MediaCodec)"]
        SurfaceTexture["SurfaceTexture"]
    end

    subgraph SharedGL["Shared GL Context"]
        VideoTex["Video Texture<br/>(GL_TEXTURE_EXTERNAL_OES)"]
        Shaders["Transition Shaders<br/>Filter Shaders"]
    end

    subgraph EditPath["Editing Path"]
        EditCanvas["StoryRendererView<br/>ICanvasGL"]
        Screen["Screen Display"]
    end

    subgraph RecordPath["Recording Path"]
        OffCanvas["OffScreenCanvas<br/>ICanvasGL"]
        EncoderSurface["Encoder Input Surface"]
    end

    subgraph Encode["Video Encoding"]
        VideoEncoder["VideoEncoder<br/>(MediaCodec)"]
        AudioEncoder["AudioEncoder"]
        Muxer["MediaMuxer"]
        MP4["Output MP4"]
    end

    MediaExtractor --> VideoDecoder
    VideoDecoder --> SurfaceTexture
    SurfaceTexture --> VideoTex

    VideoTex --> EditCanvas
    VideoTex --> OffCanvas
    Shaders --> EditCanvas
    Shaders --> OffCanvas

    EditCanvas --> Screen
    OffCanvas --> EncoderSurface
    EncoderSurface --> VideoEncoder
    VideoEncoder --> Muxer
    AudioEncoder --> Muxer
    Muxer --> MP4

    style SharedGL fill:#c7dcfc66,stroke:#c7dcfc,stroke-width:2px,color:black
    style EditPath fill:#f7f8c066,stroke:#f7f8c0,stroke-width:2px,color:black
    style RecordPath fill:#ffc0cb66,stroke:#ffc0cb,stroke-width:2px,color:black

The same renderers that drew to the screen now draw to the recording buffer. The same textures that displayed the preview now feed the export. The same shaders that applied filters live now apply them in the video file.

What you see is what you get — not because we tested exhaustively, but because the architecture makes it hard to get a different result.

The Results

This architecture powered Storybeat through three years of development. Over time we added features that would have been much harder with the old approach: GL transitions (fade, zoom, warp, morph running as fragment shaders), real-time filters with identical preview and export, animated GIF stickers composited at the correct frame, and multi-video templates with synchronized decoding.

Export time dropped from 60+ seconds to near-realtime for simple stories. Memory during export fell by roughly 40% thanks to texture sharing — no more duplicating video frames between contexts.

flowchart LR
    subgraph Without["Without Context Sharing"]
        direction TB
        V1["Video Texture<br/>(Edit Context)<br/>~50MB"]
        V2["Video Texture Copy<br/>(Record Context)<br/>~50MB"]
        S1["Shaders<br/>(Edit Context)<br/>~5MB"]
        S2["Shaders Copy<br/>(Record Context)<br/>~5MB"]
        Total1["Total: ~110MB"]

        V1 -.->|"CPU Copy"| V2
        S1 -.->|"Recompile"| S2
    end

    subgraph With["With Context Sharing"]
        direction TB
        VS["Video Texture<br/>(Shared)<br/>~50MB"]
        SS["Shaders<br/>(Shared)<br/>~5MB"]
        Total2["Total: ~55MB"]

        Edit2["Edit Context"] --> VS
        Record2["Record Context"] --> VS
        Edit2 --> SS
        Record2 --> SS
    end

    Without ---|"~50% Memory<br/>Reduction"| With

    style Without fill:#ffc0cb66,stroke:#ffc0cb,stroke-width:2px,color:black
    style With fill:#c0f8d066,stroke:#c0f8d0,stroke-width:2px,color:black

The app eventually reached over 10 million downloads with a 4.5-star rating from more than 300,000 reviews. Users created millions of stories without ever knowing about the OpenGL pipeline underneath — which is probably the best compliment a rendering engine can get.

Conclusions

The ICanvasGL abstraction paid for itself many times over. We created it early because we were tired of preview-vs-export inconsistencies, not because we foresaw how useful it would become. When new rendering targets came up months later, they just worked. Sometimes the best architectural decisions come from solving an immediate pain rather than planning ahead.

The weeks spent learning EGL contexts felt slow at the time. I remember wondering if we were over-investing in low-level details. But that knowledge became the foundation for everything else — every new feature built on top of it without requiring architectural changes.

Keeping data on the GPU matters more than it seems. Every time you move data between GPU and CPU, you pay in latency and memory. We learned this the hard way with the original FFmpeg pipeline. The entire OpenGL rewrite was essentially one idea: stop moving video frames back and forth.

Acknowledgments

We built on top of ChillingVan’s android-openGL-canvas library, which provided an excellent Canvas-like API for OpenGL. We extended it for multi-surface rendering and context sharing, but the foundation saved us months of work.