02 foundation

From Nothing to a Triangle

A minimal modern rendering pipeline in C# and Vulkan. Device, swapchain, shaders, handles, descriptors, and the first render command — explained from the ground up.

What it actually takes to put a single triangle on screen with Vulkan — and why every piece exists.


What Does “Nothing” Look Like?

You open a blank window. The operating system gave you a rectangle of pixels. The GPU is right there — powerful, parallel, waiting. And yet the distance between that blank window and a single colored triangle is surprisingly vast.

Most rendering tutorials treat the setup as boilerplate — something to copy-paste and move past. I want to do the opposite. Every piece of the pipeline exists for a reason, and understanding why each piece is there matters more than memorizing the API calls. If you understand the purpose behind each concept, switching between Vulkan, DirectX, Metal, or WebGPU becomes a matter of translation, not relearning.

This post walks through the minimal set of concepts needed to render one triangle with a modern GPU API. No engine code, no abstractions — just the raw pipeline, explained from the ground up. The code examples are pseudocode: close enough to C# and Vulkan to be concrete, loose enough to focus on the ideas rather than the syntax.

Let’s start with the first question: how do you even talk to the GPU?


Asking the GPU to Introduce Itself

Before rendering anything, you need to establish a connection. This happens in layers, and each layer exists because GPUs are shared resources — your application isn’t the only thing using the hardware.

Instance — Your application’s entry point into the graphics API. Think of it as announcing “I exist and I want to use Vulkan.” You tell it which API version you need and which extensions you’ll use (like the ability to present to a window).

Physical Device — The actual GPU hardware. A machine might have multiple: a discrete GPU, an integrated one, maybe a software renderer. You query what’s available and pick the one that fits your needs. Typically, you prefer discrete over integrated — more power, dedicated memory.

Logical Device — Your application’s private view of the GPU. Multiple applications share the same physical device, but each gets its own logical device with its own queues and resources. This is where isolation happens.

Queues — The GPU’s work intake. You submit commands to queues, and the GPU processes them. Different queue families support different operations — graphics, compute, transfer. For a triangle, you need a graphics queue (to draw) and a present queue (to show the result on screen). Often they’re the same physical queue, but the API treats them as separate concepts because they don’t have to be.

[Figure: GPU device hierarchy, from application to queues]

None of this draws anything. This is the handshake — your application and the GPU agreeing on the terms of their collaboration. From a functional perspective, this is pure setup with no recurring state: you query capabilities, make choices, and receive handles. The decisions are deterministic given the hardware.
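
The selection logic above can be sketched without any graphics binding. The types below (GpuKind, QueueFamily, Gpu, DeviceChoice, DevicePicker) are hypothetical stand-ins for what a Vulkan binding would report about each device; they are not a real API, and a real picker would usually check extensions and features as well.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the data a Vulkan binding reports per device.
public enum GpuKind { Software, Integrated, Discrete }

public sealed record QueueFamily(int Index, bool Graphics, bool Present);

public sealed record Gpu(string Name, GpuKind Kind, IReadOnlyList<QueueFamily> Families);

public sealed record DeviceChoice(Gpu Gpu, int GraphicsFamily, int PresentFamily);

public static class DevicePicker
{
    // Returns the most preferred usable GPU, or null if none can both draw and present.
    public static DeviceChoice? Pick(IEnumerable<Gpu> gpus)
    {
        return gpus
            .OrderByDescending(g => (int)g.Kind)   // Discrete > Integrated > Software
            .Select(g => TryChoose(g))
            .FirstOrDefault(choice => choice is not null);
    }

    private static DeviceChoice? TryChoose(Gpu gpu)
    {
        // Prefer one family that does both jobs; otherwise take one of each.
        var both = gpu.Families.FirstOrDefault(f => f.Graphics && f.Present);
        if (both is not null) return new DeviceChoice(gpu, both.Index, both.Index);

        var graphics = gpu.Families.FirstOrDefault(f => f.Graphics);
        var present = gpu.Families.FirstOrDefault(f => f.Present);
        if (graphics is null || present is null) return null;
        return new DeviceChoice(gpu, graphics.Index, present.Index);
    }
}
```

Note that the function only reads capabilities and returns a choice. Given the same list of devices it always picks the same one, which is the "deterministic given the hardware" property in miniature.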


The Swapchain: Where the Window Meets the GPU

The window is an OS concept. The GPU renders to images. The swapchain bridges these two worlds — it’s a set of images that the GPU renders into and the display system presents to the screen.

Why a chain of images? Because the GPU and the display run at different speeds. While the monitor shows image A, the GPU can be rendering into image B. When the GPU finishes, they swap. With two or three images in the chain, the GPU rarely has to wait for the display, and the display never shows a half-rendered frame.

Setting up the swapchain means making decisions:

  • Format: How are pixel colors stored? B8G8R8A8_SRGB is typical: blue, green, red, alpha, 8 bits each, in sRGB color space.
  • Present mode: How does the swap happen? FIFO waits for vertical sync (no tearing, but higher latency) and is the one mode every Vulkan implementation must support. Mailbox replaces the waiting image with the newest one (lower latency, but the GPU keeps rendering frames that are never shown).
  • Extent: The resolution. Usually matches the window size.
  • Image count: How many images in the chain. More images give the GPU more room to run ahead, at the cost of extra latency and memory. Minimum + 1 is a common choice.
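
Three of those decisions are small pure functions of what the surface reports. Here is a minimal sketch, assuming hypothetical PresentMode and SurfaceCaps types that mirror the values Vulkan exposes. Two details come from Vulkan itself: a reported maximum image count of 0 means "no upper limit", and FIFO is always available as a fallback. The sketch ignores the case where the surface dictates the extent outright.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical mirror of the surface capabilities a Vulkan binding would report.
public enum PresentMode { Fifo, Mailbox, Immediate }

public sealed record SurfaceCaps(
    uint MinImageCount, uint MaxImageCount,   // MaxImageCount == 0 means "no upper limit"
    uint MinWidth, uint MinHeight,
    uint MaxWidth, uint MaxHeight);

public static class SwapchainChoices
{
    // Minimum + 1, clamped to the maximum when one is reported.
    public static uint ImageCount(SurfaceCaps caps)
    {
        uint wanted = caps.MinImageCount + 1;
        return caps.MaxImageCount != 0 ? Math.Min(wanted, caps.MaxImageCount) : wanted;
    }

    // Mailbox when offered, otherwise FIFO, which every surface must support.
    public static PresentMode Mode(IEnumerable<PresentMode> available) =>
        available.Contains(PresentMode.Mailbox) ? PresentMode.Mailbox : PresentMode.Fifo;

    // The window size, clamped into the range the surface allows.
    public static (uint Width, uint Height) Extent(SurfaceCaps caps, uint windowWidth, uint windowHeight) =>
        (Math.Clamp(windowWidth, caps.MinWidth, caps.MaxWidth),
         Math.Clamp(windowHeight, caps.MinHeight, caps.MaxHeight));
}
```

Preferring Mailbox is a choice, not a rule. On a battery-powered device you might pick FIFO on purpose to avoid rendering frames nobody sees.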

For each swapchain image, you also create an image view — a description of how to interpret the image’s memory (which channels, which mip levels, which array layers). The raw image is just a block of GPU memory. The view gives it meaning.

[Figure: Swapchain structure, from window to image views]


Telling the GPU What You Want to Draw

We have a connection and a place to draw. Now we need to describe what to draw and how to draw it. This is where the pipeline lives.

Shaders: Your Code on the GPU

Shaders are programs that run on the GPU. For a triangle, you need two:

Vertex shader — Runs once per vertex. Takes a position and outputs where that vertex lands on screen. For a simple triangle, this might do nothing more than pass the position through.

Fragment shader — Runs once per pixel that the triangle covers. Outputs the final color. For our triangle, it could output a solid color or interpolate colors across the surface.

// Vertex shader (pseudocode)
input:  position (vec2), color (vec3)
output: fragColor (vec3)

main:
    screenPosition = vec4(position, 0.0, 1.0)
    fragColor = color

// Fragment shader (pseudocode)
input:  fragColor (vec3)
output: pixelColor (vec4)

main:
    pixelColor = vec4(fragColor, 1.0)

Shaders are written in a high-level language (GLSL, HLSL) and compiled to an intermediate representation (SPIR-V) before the GPU sees them. This compilation usually happens offline: you ship the compiled bytecode, not the source. The GPU driver then compiles SPIR-V into its own machine code, typically when the pipeline is created.
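
Since what you ship is bytecode, a cheap sanity check at load time can catch a wrong or truncated file before the driver does. A minimal sketch: the SpirV class and its method name are my own, while the magic number and the five-word header come from the SPIR-V format. It assumes the file is stored little-endian, which is what compilers normally emit on desktop platforms.

```csharp
using System;
using System.Buffers.Binary;

public static class SpirV
{
    // First 32-bit word of every SPIR-V module.
    public const uint Magic = 0x07230203;

    // The header is five 32-bit words: magic, version, generator, bound, schema.
    private const int HeaderBytes = 5 * 4;

    // Sanity check only: a whole number of 32-bit words, at least a header,
    // starting with the magic number. It does not validate the module.
    public static bool LooksLikeModule(ReadOnlySpan<byte> bytes)
    {
        if (bytes.Length < HeaderBytes || bytes.Length % 4 != 0) return false;
        return BinaryPrimitives.ReadUInt32LittleEndian(bytes) == Magic;
    }
}
```

Passing this check says nothing about whether the shader is correct. It only tells you the bytes are plausibly SPIR-V rather than, say, GLSL source loaded by mistake.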

Vertex Layout: How the GPU Reads Your Data

You need to tell the GPU the shape of your vertex data. A vertex for our triangle has a 2D position and an RGB color:

Vertex:
    position: vec2 (8 bytes)
    color:    vec3 (12 bytes)
    total:    20 bytes per vertex

The binding description says “read vertices from buffer slot 0, each one is 20 bytes, advance per vertex.” The attribute descriptions say “position is at offset 0, color is at offset 8, and here’s the format for each.” This is metadata — a pure description of memory layout, not a command.
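
On the C# side, the numbers in that description do not have to be typed by hand. A minimal sketch, assuming small Vec2/Vec3 structs of my own rather than any particular math library: with sequential layout, the runtime can report the stride and the field offsets that the binding and attribute descriptions need.

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct Vec2 { public float X, Y; }

[StructLayout(LayoutKind.Sequential)]
public struct Vec3 { public float X, Y, Z; }

// Sequential layout keeps the fields in declaration order.
[StructLayout(LayoutKind.Sequential)]
public struct Vertex
{
    public Vec2 Position;   // offset 0, 8 bytes
    public Vec3 Color;      // offset 8, 12 bytes
}

public static class VertexLayout
{
    // Stride for the binding description: bytes to advance per vertex.
    public static int Stride => Marshal.SizeOf<Vertex>();

    // Byte offset of a field, for the attribute descriptions.
    public static int OffsetOf(string field) => (int)Marshal.OffsetOf<Vertex>(field);
}
```

Here VertexLayout.Stride is 20, VertexLayout.OffsetOf("Position") is 0 and VertexLayout.OffsetOf("Color") is 8, matching the table above. Deriving them from the struct means the description cannot drift when a field is added.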

The Render Pass: What Happens to the Attachments

A render pass describes the structure of a rendering operation: which images (attachments) are involved, how they start, and how they end.

For our triangle, the render pass has a single color attachment — the swapchain image we’re drawing into. It says:

  • Load op: Clear — fill with a background color before drawing
  • Store op: Store — keep the result after drawing
  • Initial layout: Undefined — we don’t care what state the image was in before
  • Final layout: PresentSrc — when we’re done, this image should be ready for the display

This is a declaration of intent: “I will write to this attachment, I want it cleared first, and I want the result preserved for presentation.” The GPU uses this to optimize memory access and layout transitions.

The Graphics Pipeline: Putting It All Together

The graphics pipeline is the fully assembled description of how rendering will happen. It combines everything:

Graphics Pipeline:
    ├── Shader stages (vertex + fragment)
    ├── Vertex layout (binding + attributes)
    ├── Input assembly (triangle list)
    ├── Viewport and scissor (render area)
    ├── Rasterizer (fill mode, culling, winding order)
    ├── Multisampling (disabled for now)
    ├── Color blending (disabled — opaque output)
    └── Render pass (which attachments, which operations)

In modern APIs, the pipeline is created upfront — not configured piece by piece at draw time. This lets the GPU and driver optimize the entire configuration as a unit. The tradeoff is verbosity: even for a single triangle, you need to specify every stage, even the ones you’re not using.

Completely off-topic: agentic coding is converging on a similar tradeoff, with verbose PRDs and architecture guides written upfront so the agent can optimize the process of writing code instead of figuring it out sprint by sprint.

Notice what this is: a complete, immutable description of a rendering operation. It doesn’t draw anything. It’s data that describes how to draw. The pipeline object is inert until you bind it inside a command buffer and issue a draw call. This is the first hint of the pattern — describe the work as data, execute it separately.
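
In C#, "immutable description" maps naturally onto a record (C# 9 or later). The sketch below is hypothetical: PipelineDesc, its fields, and the shader file names are made up for illustration and cover only a fraction of what a real pipeline specifies.

```csharp
using System;

public enum Topology { TriangleList, TriangleStrip }
public enum CullMode { None, Front, Back }

// Hypothetical description type: plain data, no GPU object behind it.
public sealed record PipelineDesc(
    string VertexShader,
    string FragmentShader,
    Topology Primitive = Topology.TriangleList,
    CullMode Culling = CullMode.Back,
    bool Blending = false,
    int Samples = 1);

public static class PipelineExamples
{
    // The shader paths are placeholders, not files this post ships.
    public static PipelineDesc Triangle() =>
        new PipelineDesc("triangle.vert.spv", "triangle.frag.spv");

    // "Changing" a description produces a new value; the original is untouched.
    public static PipelineDesc WithoutCulling(PipelineDesc desc) =>
        desc with { Culling = CullMode.None };
}
```

Two descriptions with the same fields compare equal, which makes them usable as cache keys: build the expensive GPU pipeline once per distinct description and look it up afterwards.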


Getting Geometry to the GPU

We have a pipeline but no triangle. The vertices live in CPU memory, and the GPU can’t read that directly. We need to upload them.

A buffer is a block of GPU-visible memory. You allocate it, map it to a CPU-accessible pointer, copy your vertex data in, and unmap it. After that, the GPU can read from it during rendering.

vertices = [
    { position: (-0.5, -0.5), color: (1, 0, 0) },  // top-left, red (Vulkan's +Y points down)
    { position: ( 0.5, -0.5), color: (0, 1, 0) },  // top-right, green
    { position: ( 0.0,  0.5), color: (0, 0, 1) },  // bottom-center, blue
]

vertexBuffer = gpu.createBuffer(
    size:  3 * sizeof(Vertex),
    usage: VertexBuffer,
    memory: HostVisible | HostCoherent
)

gpu.upload(vertexBuffer, vertices)

The buffer is tagged with its usage (vertex data) and its memory type (visible to both CPU and GPU). For a simple triangle, host-visible memory works fine. For production workloads, you’d use a staging buffer on the CPU side and a device-local buffer on the GPU side, with a transfer in between. That optimization matters for millions of vertices. For three, it doesn’t.
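
The copy itself is just bytes. In the sketch below, a plain byte span stands in for the mapped pointer, so nothing here touches a GPU. The Vertex struct is flattened to five floats for brevity, and Upload and CopyVertices are names I made up for illustration.

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
public struct Vertex
{
    public float X, Y;      // position
    public float R, G, B;   // color

    public Vertex(float x, float y, float r, float g, float b)
    { X = x; Y = y; R = r; G = g; B = b; }
}

public static class Upload
{
    public static readonly Vertex[] Triangle =
    {
        new Vertex(-0.5f, -0.5f, 1, 0, 0),  // red
        new Vertex( 0.5f, -0.5f, 0, 1, 0),  // green
        new Vertex( 0.0f,  0.5f, 0, 0, 1),  // blue
    };

    // 'mapped' stands in for the CPU-visible view of the buffer's memory.
    // Returns the number of bytes written.
    public static int CopyVertices(ReadOnlySpan<Vertex> vertices, Span<byte> mapped)
    {
        // Reinterpret the vertices as raw bytes without copying them first.
        ReadOnlySpan<byte> raw = MemoryMarshal.AsBytes(vertices);
        raw.CopyTo(mapped);   // throws if the mapped range is too small
        return raw.Length;
    }
}
```

With a real binding, the destination span would wrap the pointer returned by mapping the buffer’s memory. The three vertices occupy 60 bytes, which is the size the buffer was created with.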


The Frame Loop: Recording and Submitting Work

Everything so far has been setup. The pipeline, the buffers, the swapchain — all created once (or once per resize). The frame loop is where rendering actually happens, and it follows the same cycle every frame.

Step 1: Acquire an Image

Ask the swapchain for the next available image. This might block if all images are in use. You get back an index — which image in the chain you’ll render into.

Step 2: Record Commands

This is the core of the frame. You don’t call the GPU directly — you record commands into a command buffer, and then submit the whole buffer at once.

cmd = beginCommandBuffer()

    beginRenderPass(cmd, renderPass, framebuffer, clearColor: darkGray)
        
        bindPipeline(cmd, graphicsPipeline)
        bindVertexBuffer(cmd, vertexBuffer)
        setViewport(cmd, width, height)
        setScissor(cmd, width, height)
        
        draw(cmd, vertexCount: 3)

    endRenderPass(cmd)

endCommandBuffer(cmd)

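Read that from top to bottom. It’s a sequence of instructions: start a render pass, bind the pipeline, bind the vertex data, set the viewport and scissor, draw three vertices, end the render pass. The framebuffer passed to beginRenderPass is the one piece not introduced earlier: it ties the render pass’s color attachment to a concrete image view, so there is one per swapchain image. No rendering happens yet; you’re writing a list of instructions.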

This is a critical concept. The command buffer is a recording of what you want the GPU to do. It’s built on the CPU, entirely deterministic given the same inputs. You could record it, throw it away, and record it again — you’d get the same result. The side effect (actual rendering) only happens when you submit it.
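
To make "record it twice, get the same result" concrete, here is a toy model in which commands are plain values. Everything in it is hypothetical: a real command buffer is an opaque driver object, not a C# list, and the integer IDs stand in for real handles.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical command types: each one is a plain value describing work.
public abstract record Command;
public sealed record BeginRenderPass(int Framebuffer, float ClearGray) : Command;
public sealed record BindPipeline(int Pipeline) : Command;
public sealed record BindVertexBuffer(int Buffer) : Command;
public sealed record SetViewport(int Width, int Height) : Command;
public sealed record SetScissor(int Width, int Height) : Command;
public sealed record Draw(int VertexCount) : Command;
public sealed record EndRenderPass : Command;

public static class Recorder
{
    // Pure function: same inputs, same list of commands. Nothing is executed here.
    public static IReadOnlyList<Command> RecordTriangle(
        int framebuffer, int pipeline, int vertexBuffer, int width, int height)
    {
        return new Command[]
        {
            new BeginRenderPass(framebuffer, ClearGray: 0.1f),
            new BindPipeline(pipeline),
            new BindVertexBuffer(vertexBuffer),
            new SetViewport(width, height),
            new SetScissor(width, height),
            new Draw(VertexCount: 3),
            new EndRenderPass(),
        };
    }
}
```

Because records compare by value, two recordings made with the same arguments are element-for-element equal, and a recording made for a different window size is not. That makes the recording step easy to unit test without a GPU.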

Step 3: Submit and Present

Submit the command buffer to the graphics queue. The GPU begins executing your recorded commands. When it’s done, present the rendered image to the swapchain, which hands it to the display.

submit(graphicsQueue, cmd,
    waitFor:    imageAvailable,     // don't start until the image is ready
    signal:     renderFinished)     // tell us when rendering is done

present(presentQueue, swapchain, imageIndex,
    waitFor:    renderFinished)     // don't present until rendering is done

Synchronization: The Hidden Complexity

The frame loop looks simple, but there’s a timing problem. The CPU can record commands faster than the GPU can execute them. If you reuse a frame’s command buffer or semaphores while the GPU is still working through them, you’ll stomp on resources that are in use.

The solution is double-buffering with synchronization primitives:

  • Semaphores — GPU-to-GPU synchronization. “Don’t start the render until the image is acquired.” “Don’t present until the render is finished.” The CPU never waits on these.
  • Fences — GPU-to-CPU synchronization. “Don’t start recording frame N until the GPU is done with frame N-2.” This is where the CPU blocks if it’s too far ahead.

[Figure: Frame ring buffer, two frames in flight with fence synchronization]

Two frames in flight means you have two sets of every per-frame resource: command buffers, semaphores, fences. The CPU cycles between them. Before reusing frame 0’s resources, it waits on frame 0’s fence to confirm the GPU is done with them.

This is the most operationally complex part of the minimal pipeline. Not because the concept is hard, but because getting it wrong produces subtle bugs — flickering, corruption, crashes that only happen under load. The synchronization is pure bookkeeping: given the frame index, deterministically pick the right set of resources and wait on the right fence. No ambiguity, no state that drifts over time.
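
The bookkeeping can be sketched in a few lines. FrameRing is a hypothetical helper, and the "fence" here is only a remembered frame number; a real fence is a GPU object the CPU blocks on.

```csharp
using System;

// Hypothetical bookkeeping for N frames in flight.
public sealed class FrameRing
{
    private readonly long[] _lastFrameInSlot;

    public FrameRing(int framesInFlight)
    {
        _lastFrameInSlot = new long[framesInFlight];
        Array.Fill(_lastFrameInSlot, -1L);   // -1: slot never used, nothing to wait for
    }

    public int FramesInFlight => _lastFrameInSlot.Length;

    // Which set of per-frame resources frame number 'frame' uses.
    public int SlotFor(long frame) => (int)(frame % FramesInFlight);

    // The earlier frame whose fence must be waited on before 'frame' may reuse
    // its slot, or -1 if the slot has not been used yet.
    public long MustWaitFor(long frame) => _lastFrameInSlot[SlotFor(frame)];

    // Record that 'frame' was submitted using its slot.
    public void MarkSubmitted(long frame) => _lastFrameInSlot[SlotFor(frame)] = frame;
}
```

With two frames in flight, frames 0 and 1 wait for nothing, frame 2 waits for frame 0, and frame 3 waits for frame 1. That is the "frame N waits for frame N-2" rule, derived from nothing but the frame index.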


The Functional Angle: Even Here, It Matters

It might seem premature to think about architecture when all you have is a triangle. But the patterns you establish at the foundation determine what’s easy or hard later.

Look at what we built:

  • Handles are inert data. A BufferHandle is an index and a generation counter. Creating one doesn’t allocate GPU memory. Passing one around doesn’t cause side effects. It’s a reference you can store, copy, and compare — the GPU equivalent of a pointer that can’t dereference itself.

  • The pipeline is an immutable description. Once created, it never changes. It describes how to render, not when or what. Multiple command buffers can reference the same pipeline simultaneously.

  • Command recording is deterministic. Given the same pipeline, buffer, and viewport, the recorded command buffer is identical every time. The commands describe work without performing it. This is the same principle as the XR lazy-follow from the previous post — separate the what from the do.

  • GpuState is the only mutable thing. The current frame index, the swapchain, the synchronization state — all of this lives in one explicit, passed-around structure. There’s no hidden global that silently changes between frames. If the state changes, it’s because someone changed it, visibly.
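
The first bullet is the easiest to show in code. Here is a minimal sketch of an index-plus-generation handle, assuming C# 10 for record struct. BufferTable is a hypothetical owner in which a string stands in for the real GPU buffer, and slot reuse is left out to keep it short.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// A handle is two integers. Copying or comparing it touches no GPU state.
public readonly record struct BufferHandle(int Index, int Generation);

// Hypothetical table that owns the real resources.
public sealed class BufferTable
{
    private readonly List<(int Generation, string? Resource)> _slots = new();

    public BufferHandle Add(string resource)
    {
        _slots.Add((0, resource));
        return new BufferHandle(_slots.Count - 1, 0);
    }

    // Freeing bumps the generation, so every outstanding copy of the handle goes stale.
    public void Remove(BufferHandle handle)
    {
        if (!TryGet(handle, out _)) return;
        _slots[handle.Index] = (handle.Generation + 1, null);
    }

    // Only the table can turn a handle back into a resource.
    public bool TryGet(BufferHandle handle, out string? resource)
    {
        resource = null;
        if (handle.Index < 0 || handle.Index >= _slots.Count) return false;
        var slot = _slots[handle.Index];
        if (slot.Generation != handle.Generation || slot.Resource is null) return false;
        resource = slot.Resource;
        return true;
    }
}
```

The handle cannot dereference itself: you need the table to resolve it, and a stale handle fails the lookup instead of silently pointing at whatever now lives in that slot.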

None of this required a framework or a special library. It’s a design choice — the same design choice you’d make in any codebase where you want to reason about correctness: keep the pure parts pure, make the impure parts explicit, and draw a clear line between them.

For a triangle, this might feel like overkill. But the triangle is just the proof that the connection works. The next post adds a second pass, and then a third. When you have multiple passes reading and writing shared resources, the question stops being “can I render something” and becomes “can I describe a frame as a composition of independent transformations.” That question is easier to answer when your foundation already separates description from execution.


What We Have, What’s Next

Let’s take stock. To render a single triangle, we needed:

  1. A GPU connection — instance, physical device, logical device, queues
  2. A swapchain — images to render into, image views to interpret them
  3. Shaders — vertex and fragment programs, compiled to SPIR-V
  4. A vertex layout — how the GPU reads vertex data from memory
  5. A render pass — what happens to attachments (clear, render, present)
  6. A graphics pipeline — the complete, immutable rendering configuration
  7. A vertex buffer — triangle data uploaded to GPU-visible memory
  8. A frame loop — acquire, record, submit, present, synchronize

Eight concepts, just for a triangle. If that feels like a lot, it is. Modern GPU APIs are verbose because they’re explicit. Every decision that older APIs made silently behind your back, you now make yourself. The payoff is control: nothing happens unless you ask for it, and you can reason about exactly what the GPU is doing at every point.

The triangle is proof that the system works — that the CPU and GPU can collaborate. It’s not interesting by itself. What’s interesting is what happens when you add a second pass that reads what the first one wrote, and then a third pass that composites the result. That’s where the pipeline stops being a straight line and becomes a graph — and where describing that graph as data starts to pay off.

That’s the next post: splitting the frame into a geometry pass and a lighting pass, writing to intermediate buffers, and composing the result. The deferred rendering pipeline.