Graphics Drivers

Gah. So if you saw the last post I made about OSX, you may remember it was running at 1 FPS.

I spent a lot of time thinking about this issue and quite a bit of time trying to code solutions. Despite OpenGL being a ‘cross platform’ library, at this point I’m pretty sure the code for each platform that uses it is going to have to be tailored to that platform’s specific graphics drivers.

Here’s my debugging method. (This is going to sound elegant as I type this out, but there was a lot of stumbling and double and triple checking things…)

One Frame Per Second

So I’m sitting there looking at the game chug along at 1FPS, and thinking: the loading screens run fast, but the title screen runs miserably. The loading screens have 1-3 draw calls per frame, whereas the title screen has hundreds, if not thousands. Something per draw call must be going slow.

Sure enough, if I don’t make any draw calls, things run fast, but this is mostly useless, since I can’t see anything.

A few thoughts enter my mind.

Hypothesis

  1. The graphics driver is defaulting to software rendering or software transformations.
  2. I’m doing something that’s not OpenGL 3.2 compliant, or doing something causing OpenGL errors.
  3. The GPU is waiting on the CPU (or vice versa) for something.

The first idea just shouldn’t be possible, as I selected a pixel format (an OpenGL thing that specifies what kind of rendering you’ll be doing) on OSX requiring hardware acceleration and no software fall back. But I’ll double check.

The second idea is somewhat likely, but I worked very hard to make the Windows renderer OpenGL 3.2 compliant and it doesn’t show any errors. But I’ll check anyway since it’s a different driver and different GPU using the same code.

Third idea? Let’s hope it’s not that.

Testing

How do you check something like this? There are some sorta-ok GPU debugging tools available for OSX, so I downloaded them and started them up. After a little documentation reading, I got them working. You can set some OpenGL break points which will stop the program and give a bit of information if there’s an error or if you encounter software rendering.
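If you don't have the profiler handy, a rough in-code equivalent of the error check is to drain the OpenGL error queue after suspicious calls. This is not what the post used, just a minimal sketch: `DrainErrors` is a made-up helper, and the error query is passed in as a callable (with a live context that would be `glGetError`, compared against `GL_NO_ERROR`) so the logic can be exercised without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: collects every pending error code by calling
// `getError` until it reports `noError`. With a live context you would
// pass glGetError, and GL_NO_ERROR (which is 0) for `noError`.
// `maxErrors` caps the loop in case the query never reports `noError`,
// which can happen when no context is current.
inline std::vector<unsigned int> DrainErrors(const std::function<unsigned int()>& getError,
                                             unsigned int noError = 0,
                                             std::size_t maxErrors = 64)
{
    std::vector<unsigned int> errors;
    while (errors.size() < maxErrors)
    {
        const unsigned int code = getError();
        if (code == noError)
            break;
        errors.push_back(code);
    }
    return errors;
}
```

An empty result after a frame's worth of calls is the in-code version of "no OpenGL errors". It says nothing about software fallback, which still needs the profiler.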

[Screenshot: BreakPointsSet]

Of course nothing is easy. No OpenGL errors, no software rendering. This immediately discounted ideas #1 and #2. So it’s probably #3. Something is syncing the CPU and GPU. Blah.

Next I looked at what OpenGL calls were being made and how long they were taking.

[Screenshot: DrawCallsSlow]

Ah ha! You’ll notice the highlighted lines (which are draw calls), and that OpenGL calls are taking up a crazy 98% of the frame.

Looking closely at individual calls, the huge time differences can be seen between glDraw calls and other API calls…

[Screenshot: SingleCallSlow]

Having written low level code for consoles that don’t really have a driver has given me a good understanding of what sort of things go on when the CPU sends commands to the GPU, and what can cause a stall. Generally this happens either when the CPU wants to update a dynamic resource that the GPU is currently using, or when the CPU is waiting for the GPU to finish some rendering so it can access a rendered or computed result.

I only have 3 places in code that might cause this. The first one I looked at is updating vertex and index data used for dynamic rendering – which is used for particle systems, UI, and other things that change frame to frame.

The (abbreviated) code looks like this:

    GLbitfield flags = GL_MAP_WRITE_BIT;
    if (_currentOffset + bytes > _bufferBytes)
    {
        // at the end of the buffer, invalidate it and start writing at the beginning...
        flags |= GL_MAP_INVALIDATE_BUFFER_BIT;
        _currentOffset = 0;
    }
    else
    {
        // there's still room, write past what the GPU is using and notify that there's no
        // need to stall on this write.
        flags |= GL_MAP_UNSYNCHRONIZED_BIT;
    }
        
    glBindBuffer(GL_ARRAY_BUFFER, _objectId);
    void* data = glMapBufferRange(GL_ARRAY_BUFFER, _currentOffset, bytes, flags);    

    // write some data ....

    glUnmapBuffer(GL_ARRAY_BUFFER);

    // draw some stuff with the data at _currentOffset.

    _currentOffset += bytes;

It’s set up so that generally you’re just writing more data while the GPU can use data earlier in the buffer as it’s needed. Occasionally when you run out of room you let the driver know you’re going to overwrite the buffer. (This can be better with multiple buffers, but I didn’t want to overcomplicate this example code.)
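The offset bookkeeping in that snippet can be pulled out and checked on its own, without any OpenGL calls. The sketch below is only an illustration: `StreamCursor` and `StreamWrite` are made-up names, and it assumes a single write never exceeds the buffer size.

```cpp
#include <cassert>
#include <cstddef>

// Result of reserving space for one dynamic write.
struct StreamWrite
{
    std::size_t offset;  // byte offset to pass to glMapBufferRange
    bool invalidate;     // true  -> map with GL_MAP_INVALIDATE_BUFFER_BIT
                         // false -> map with GL_MAP_UNSYNCHRONIZED_BIT
};

// Hypothetical helper mirroring the map-flag decision above.
class StreamCursor
{
public:
    explicit StreamCursor(std::size_t bufferBytes)
        : _bufferBytes(bufferBytes), _currentOffset(0) {}

    // Reserve `bytes` for the next write. Assumes bytes <= bufferBytes.
    StreamWrite Reserve(std::size_t bytes)
    {
        assert(bytes <= _bufferBytes);
        StreamWrite result;
        if (_currentOffset + bytes > _bufferBytes)
        {
            // Out of room: start over at the beginning and orphan the old contents.
            result.invalidate = true;
            _currentOffset = 0;
        }
        else
        {
            // Still room: write past the region the GPU may be reading.
            result.invalidate = false;
        }
        result.offset = _currentOffset;
        _currentOffset += bytes;
        return result;
    }

private:
    std::size_t _bufferBytes;
    std::size_t _currentOffset;
};
```

With a 100 byte buffer, three 40 byte reservations land at offsets 0, 40, and then 0 again with the invalidate flag set.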

This didn’t seem to be the problem as nearly every draw call was slow. Drawing that used fully static data was slow too. Static data is set up with code that looks like this.

    glGenBuffers(1, &_objectId);
    glBindBuffer(GL_ARRAY_BUFFER, _objectId);
    glBufferData(GL_ARRAY_BUFFER, bytes, data, GL_STATIC_DRAW);       

That data isn’t ever touched again, and hopefully the GPU takes the hint that it can reside in GPU memory so no problem there.

But then I noticed that not every draw call was slow. Using the OpenGL Profiler trace I could see that sequential draw calls without any changes to any render state in-between did not stall.

[Screenshot: FastDrawCalls]

Hmmmm….

What’s the most common thing that changes between draw calls? If it’s not the material on the object, it’s the location where that object is drawn. Its transformation – position and orientation. Transformations are generally stored in a very fast (and fairly small) section of GPU memory meant just for this purpose. It’s also where the camera location, object color, and other variable properties are stored. We call this data ‘uniforms’. Or in my engine ‘constants’.

In OpenGL 3.2 I used uniform buffer objects, since it most closely matches my engine architecture and that of DX10/11. DX9 fits the concept as well, since you can specify the location of all uniforms. Seems like a good fit.

After some pre-configuration, sending uniforms to the GPU for vertex and pixel programs to use is really easy. It looks like this:

    void ConstantBuffer::Bind(Context& context, void* data, int32 offsetBytes, int32 bytes)
    {
        glBindBuffer(GL_UNIFORM_BUFFER, _objectId);
        glBufferSubData(GL_UNIFORM_BUFFER, offsetBytes, bytes, data);
    }

To my knowledge this should be crazy fast. On some hardware (way down at the command stream level) this data is part of the command buffer and it updates constants just before the vertex and pixel shaders are invoked. Worst case if it’s actually a separate buffer the GPU uses, and/or the driver supports getting this data back to the CPU, it needs to copy it off somewhere until the GPU needs it and the last set values can be read back by the CPU without any stall…

But you never know….

I read the OpenGL docs again, and sure enough glBufferSubData can cause a stall: the update has to wait until previously issued commands have consumed the previous values.

“Consider using multiple buffer objects to avoid stalling the rendering pipeline during data store updates. If any rendering in the pipeline makes reference to data in the buffer object being updated by glBufferSubData, especially from the specific region being updated, that rendering must drain from the pipeline before the data store can be updated.”
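As a rough picture of what "multiple buffer objects" means in that quote: keep several buffers and hand out a different one for each update, so a write never targets the buffer the GPU was just told to read. The sketch below only models the rotation. `BufferRing` is a made-up name, the ids stand in for names returned by `glGenBuffers`, and whether this avoids the stall is up to the driver (the experiments further down report that a buffer per draw call was slower here).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical round-robin over a fixed set of buffer object ids.
class BufferRing
{
public:
    explicit BufferRing(std::vector<unsigned int> ids)
        : _ids(std::move(ids)), _next(0)
    {
        assert(!_ids.empty());
    }

    // Returns the id to bind and update next, then advances to the following one.
    unsigned int Acquire()
    {
        const unsigned int id = _ids[_next];
        _next = (_next + 1) % _ids.size();
        return id;
    }

private:
    std::vector<unsigned int> _ids;
    std::size_t _next;
};
```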

Really? Why? Setting uniforms HAS to be fast. You do it almost as often as issuing draw commands!!! This has been true since vertex shader 1.0. (Yeah I know, this doesn’t have to be quite true for some of the newest GPUs and APIs)

So for kicks, since there’s more than one way to modify buffer data in OpenGL, I changed the ConstantBuffer update to:

    void ConstantBuffer::Bind(Context& context, void* data, int32 offsetBytes, int32 bytes)
    {
        glBindBuffer(GL_UNIFORM_BUFFER, _objectId);
        void* destData = glMapBufferRange(GL_UNIFORM_BUFFER, offsetBytes, bytes, GL_MAP_WRITE_BIT);
        memcpy(destData, data, bytes);
        glUnmapBuffer(GL_UNIFORM_BUFFER);
    }

And while in my mind there really shouldn’t be any difference, the statistics on OpenGL commands change to this:

[Screenshot: MapBufferSlow]

Huh, there’s all that wait time again, but it’s moved to setting uniforms. Now I’m getting somewhere. I figure I’m just not using the API correctly when setting uniforms.

Experimentation

So I tried a bunch of different things.

I tried having a single large uniform buffer using the GL_MAP_INVALIDATE_BUFFER_BIT / GL_MAP_UNSYNCHRONIZED_BIT and glBindBufferRange() so that no constants were overwritten. This was slower. And yes, you can get slower than 1 FPS.

I tried having a uniform buffer per draw call so they were never overwritten, except between frames. This was slower, using either glMapBuffer or glBufferSubData.

I tried changing the buffer creation flags. No change.

I read about other coders running through their entire scene, collecting uniforms, updating a uniform buffer once at the beginning of the frame, and then running through the scene again just to make draw calls. This is stupid and slow.
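For reference, the collect-then-draw approach looks roughly like this on the CPU side: the first pass appends each draw call's constants to one staging block and remembers the offset, then the block is uploaded once and the second pass binds each range (for example with glBindBufferRange). This is a sketch with made-up names (`UniformStaging`, `AlignUp`). The alignment is an assumption too: the real value comes from querying `GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT`, and 256 is only used here as an example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Round `value` up to the next multiple of `alignment` (alignment > 0).
inline std::size_t AlignUp(std::size_t value, std::size_t alignment)
{
    return (value + alignment - 1) / alignment * alignment;
}

// Hypothetical CPU-side staging area for one frame's worth of constants.
class UniformStaging
{
public:
    explicit UniformStaging(std::size_t offsetAlignment)
        : _alignment(offsetAlignment) {}

    // Copies one draw call's constants in and returns the byte offset
    // where they start. Padding between entries is zero-filled.
    std::size_t Append(const void* data, std::size_t bytes)
    {
        const std::size_t offset = AlignUp(_bytes.size(), _alignment);
        _bytes.resize(offset + bytes);
        if (bytes > 0)
            std::memcpy(_bytes.data() + offset, data, bytes);
        return offset;
    }

    // The block to upload once per frame before issuing draw calls.
    const std::vector<unsigned char>& Bytes() const { return _bytes; }

    // Call at the start of the next frame.
    void Reset() { _bytes.clear(); }

private:
    std::size_t _alignment;
    std::vector<unsigned char> _bytes;
};
```

The cost is the extra pass over the scene and the padding between entries: a 64 byte matrix followed by a 16 byte vector occupies 272 bytes at 256 byte alignment.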

I wished I could use a newer version of OpenGL to try some other options, but I’m using 3.2 for maximum compatibility.

Eureka!

Then I got a sinking feeling in my stomach. I knew the answer (actually was pretty sure…) but I didn’t want to code it. Ugh.

Back before OpenGL 3.0 / DirectX 10, there weren’t any uniform buffers. Uniforms were just loose data that you set one at a time using functions like glUniformMatrix4fv and glUniform4fv.

What isn’t great about the old way is that every time you change vertex and pixel programs, you need to reapply all the changed uniforms that the next GPU program uses. OpenGL 3.2 doesn’t let the shader pick where uniforms go in memory, so you always have to look it up, and the location of each uniform variable can change shader to shader.

With uniform buffers, if you set some values once and it doesn’t change the entire frame there’s nothing else to do.

So I went about changing the engine to use the old old way.

  1. First I had to change all the shaders to not use uniform buffers. Luckily I have the shader compiler so this was a few lines of code instead of hand editing hundreds of shaders.
  2. Then I sat around for a few minutes for all the shaders to regenerate and recompile.
  3. Next I had to record the per vertex/pixel program combination of which uniforms were used and where they needed to be uploaded to. This was a non-trivial amount of code to write.
  4. Then any time a shader changed, I had to change the code to dirty all uniforms so they’d be reapplied.
  5. Then I had to write a new uniform binding function.
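Step 4 is the easiest one to picture in isolation. Here is a minimal sketch of that kind of dirty tracking, assuming one flag per constant block. `UniformDirtyTracker` is a made-up name and is not the engine's actual class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tracker: one dirty flag per constant block.
class UniformDirtyTracker
{
public:
    explicit UniformDirtyTracker(std::size_t blockCount)
        : _dirty(blockCount, true), _program(0) {}

    // Call when binding a GPU program. Switching to a different program
    // marks every block dirty so its uniforms get reapplied.
    void UseProgram(unsigned int program)
    {
        if (program != _program)
        {
            _program = program;
            _dirty.assign(_dirty.size(), true);
        }
    }

    // Call when the CPU-side values of a block change.
    void MarkChanged(std::size_t block) { _dirty[block] = true; }

    // Returns true if the block needs its glUniform* calls before the
    // next draw, and clears the flag.
    bool ConsumeDirty(std::size_t block)
    {
        const bool wasDirty = _dirty[block];
        _dirty[block] = false;
        return wasDirty;
    }

private:
    std::vector<bool> _dirty;
    unsigned int _program;
};
```

Before each draw, the renderer would ask `ConsumeDirty` for every block the current program uses and only issue uploads for the ones that return true.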

Here’s the new constant binding function. Pretty messy memory wise, and many more calls to the GL API per frame.

    void ConstantBuffer::Bind(Context& context, void* data, int32 offsetBytes, int32 /*bytes*/)
    {
        _Assert(offsetBytes == 0, "can't upload with non-zero offset");

        const VideoProgram* program = context.GetVideoProgram();
        const Collection::Array& upload = program->GetUploadInfo(context.GetDetailLevel(), _ordinal);

        for (int32 i = 0; i < upload.GetSize(); ++i)
        {
            const VideoProgram::UploadInfo& uploadInfo = upload[i];
            switch (uploadInfo._type)
            {
                case GL_FLOAT_MAT4:
                    glUniformMatrix4fv(uploadInfo._index, uploadInfo._size,
                                       false, (float*)data + (uploadInfo._offset * 4));
                    break;
                case GL_FLOAT_VEC4:
                    glUniform4fv(uploadInfo._index, uploadInfo._size,
                                 (float*)data + uploadInfo._offset * 4);
                    break;
            }
        }
    }
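The `_offset * 4` in that function suggests offsets are counted in vec4 slots (4 floats each) from the start of the constant data. Under that assumption, and further assuming the constants are tightly packed with a mat4 taking 4 slots per array element, the per-program upload table could be built along these lines. Every name here is hypothetical and the real engine's layout rules may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class ConstantType { Vec4, Mat4 };

// One constant as declared in a block; `count` is the array length (1 for non-arrays).
struct ConstantDesc
{
    ConstantType type;
    std::int32_t count;
};

// Where one constant's data starts, measured in vec4 slots.
struct UploadSlot
{
    ConstantType type;
    std::int32_t count;
    std::int32_t offsetVec4;
};

inline std::int32_t Vec4sPerElement(ConstantType type)
{
    return type == ConstantType::Mat4 ? 4 : 1;
}

// Assigns tightly packed vec4 offsets in declaration order.
inline std::vector<UploadSlot> BuildUploadSlots(const std::vector<ConstantDesc>& constants)
{
    std::vector<UploadSlot> slots;
    std::int32_t offset = 0;
    for (const ConstantDesc& c : constants)
    {
        slots.push_back({c.type, c.count, offset});
        offset += Vec4sPerElement(c.type) * c.count;
    }
    return slots;
}
```

For a block holding one mat4, a vec4 array of length 2, and a mat4 array of length 3, the slots start at 0, 4, and 6, which are float offsets 0, 16, and 24.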

Success

Finally I watched the game run at 60 FPS. So now the statistics are nicer. And only 5% CPU time spent in OpenGL. Woot.

[Screenshot: FixedIssue]

Graphics Drivers

Ok, so the driver is optimized to set loose constants very quickly, but when presented as a block it just stalls waiting for the GPU to finish? I don't get it. The Windows drivers seem to handle uniform buffers properly. I understand writing the driver to the OpenGL spec - but geez, this makes uniform buffers mostly useless. It's known to be a uniform buffer, the calling code is updating it, it's marked as DYNAMIC_WRITE, so why isn't it doing exactly the same things as what my manual setting of each uniform value is doing???? Arhghghghg.

I'm sure someone has a good answer as to how to update uniform buffers on Mac OSX, but I couldn't find it. Or maybe the answer is upgrading, or not using them? But these were debugging hours I didn't need to spend. Actually I take that back. Tracking down issues like this is pretty satisfying...

So I can just keep the code the way that works on Mac, but uniform buffers are so much more elegant. Plus what if Linux runs faster with uniform buffers instead of loose uniforms? Or if Windows does? Then I have to generate two different OpenGL shaders, and have different code per platform to get the same data to the GPU. Now I'm not so worried that the Windows OpenGL implementation was slightly different from OSX, because I can see the implementations are going to be driver dependent anyway...

OpenGL is cross platform? Sorta. Yikes.

33 Comments

    Eric1212
    December 13, 2015 1:45 pm

    I would be curious to see if Linux works like Windows or like Mac OSX in this case, seems like outdated hardware or driver for Mac… I suspect Linux will work better like Windows with uniform buffers… #ImNotaGameProgrammer

    Jon
    December 13, 2015 1:59 pm

    “Arhghghghg.” sounds like a correct statement for your situation. Don’t we just love these cross-platform ‘standards’? 😛

    Talgorn
    December 13, 2015 2:23 pm

    Very interesting post. Great debug use case. I’ve learned something important here. Thanx!

    pastuh
    December 13, 2015 2:33 pm

    Nice post :)

    Ali
    December 13, 2015 3:41 pm

    exceptional post as always mate, and pretty informational as well. TIL. Thanks

    Keaton
    December 13, 2015 3:50 pm

    So this means he is really close to releasing banished on Mac, correct?

    Jason
    December 13, 2015 9:18 pm

    As a computer science student, I find it really cool to read about the processes you are going through to debug this kind of stuff!

    Ben
    December 14, 2015 12:09 am

    It’s quite interesting reading these posts without any background in computer science or coding. Though none of the technical aspects make any sense to me, you do a pretty good job of adding gravitas to your breakthrough and ‘eureka’ moments haha. Even just reading about the debugging leaves a satisfying feeling.

    Keep it up man!

    vaen
    December 14, 2015 3:15 am

    as always greatly appreciate these in-depth looks into the workings of debugging. thanks!

    Pete Wildsmith
    December 14, 2015 4:26 am

    Would it be practical to use a texture instead of a buffer uniform as described in the link below? It’s a way to get around the lack of buffer uniforms in GLES 2.0 / WebGL, and might suit a situation like this where they don’t work as you expect.

    https://developer.apple.com/library/ios/documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/BestPracticesforShaders/BestPracticesforShaders.html#//apple_ref/doc/uid/TP40008793-CH7-SW26

    Gary
    December 14, 2015 5:50 am

    Very interesting read :)

    Riku
    December 14, 2015 7:11 am

    What are the typical size, offset and alignment you are using to update the uniform buffers? Bad alignment may cause pipeline stalls and have a huge effect on the performance.

    Offset and size should be aligned to at least 256 bytes or preferably a page size (4k, 64k or 2M).

    GL_MAP_PERSISTENT_BIT and explicit synchronization could solve the issue but it seems like you can’t rely on GL_ARB_buffer_storage.

    Tommy
    December 14, 2015 7:13 am

    When your game is out on Steam for OS X, can you see how many of the players run the OS X binary?

    hannesp
    December 14, 2015 7:27 am

    Uh, poor suffering colleague :( The performance of UBOs seems to be very poor. There are dozens of threads where people try to tame this beast and in the end return to simple uniforms again. For example here’s a nice thread https://www.opengl.org/discussion_boards/showthread.php/178326-Uniform-Buffer-Object-Performance . I tried it some time ago by myself and I got no performance gain (and no penalty) with UBOs, even for very large amounts of data.. but since they tend to contain uniform data from different sources I find them more difficult to use, so the implementation ended up unused. You know you can “cache” the uniform location per shader program, right?
    If you had not been stuck on old OpenGL versions, I would have suggested taking a look at persistent mapped buffers, which I recently used successfully. Double or triple buffering could help you too maybe, because you can avoid mapping/unmapping a buffer the pipeline wants to use in the current frame. But my unprofessional advice for you is to use simple uniforms, because they are fast and easy.

    Scott
    December 14, 2015 7:53 am

    > To my knowledge this should be crazy fast. On some hardware (way down at the command stream level) this data is part of the command buffer and it updates constants just before the vertex and pixel shaders are invoked.

    Wait, that doesn’t sound right, are you using BufferSubData correctly? I can’t quite tell from your code samples.

    My understanding is that glBufferData creates an explicit memory buffer on the GPU at essentially a fixed address (though this is OpenGL, and NVIDIA probably has code in their driver). glBufferSubData more or less translates to a (potentially delayed) memcpy of data into that address. Uniforms are not inlined into the command-stream, instead the command stream just updates the shader cores to point at a different memory address and caching hardware takes care of the rest.

    So this means if you glBufferSubData into the same address between every drawcall, the opengl driver is forced to execute the command then sync before copying the new data over the old data, which is really slow.

    What you are meant to do instead is create a buffer with glBufferData that is large enough to hold at least one frame worth of uniforms. Then use glBufferSubData with a **different offset** for each drawcall’s uniforms, so they end up next to each other in GPU memory. Treat it as a circular buffer.

    Then you point each drawcall to the correct offset within the uniform buffer and you end up with a command stream that just changes the shader cores to point at a different uniform buffer between each drawcall (instead of needing to sync with the driver)

    Scott
    December 14, 2015 8:03 am

    Actually on second thought, I’m probably confused too. I think I was thinking of glBufferStorage.

    Getting OpenGL to do what you want is a huge pain. I usually just stay away from code which deals with buffers and let someone else maintain that.

    Funtime Ben
    December 14, 2015 8:36 am

    If I hadn’t already bought Banished, I would buy it to support your efforts. That is a ton of sleuthing. People who play on Macs certainly appreciate it (although I recently built a PC for my games).

    Semi Essessi
    December 14, 2015 8:58 am

    I’m not that surprised by this… it’s a shame that implementation details can make some drivers considerably better than others, but it happens especially with newer features and features that are off of ‘the hot path’.

    i’ve never used uniform buffers before because i didn’t see any benefit over what i already had. i would guess many commercial game engines have done similarly and don’t change their approach because there is no serious performance benefit to be had… its entirely possible this has just not been addressed because nobody has hit a problem with it. you should report it to the driver manufacturer imo…

    well done for pointing out that OpenGL is /not/ cross platform. i’ve had more discussions with people who have no idea what they are talking about regarding this than i would like.

    Andrew Haining
    December 14, 2015 9:31 am

    Everyone who works with apple platforms stumbles into this problem at some point. It’s not really fair to blame Open GL. Apple have known their implementation is terrible and do nothing to fix it. They treat Open GL like an internal library, implementing their own extensions and refusing to implement arb extensions etc. as well as breaking and refusing to fix core functionality like this. The rule of thumb with apple is to use the plainest, oldest version of open gl you can, don’t use any second order open gl calls (for example using glBufferData is faster than reusing a buffer with glBufferSubData or glMapBufferRange).

    Moving to vulkan will *hopefully* fix a lot of these problems one way or the other (either apple refuse to implement it and break compatibility explicitly or they implement it and leave themselves much less wiggle room to break things)

    Andrew Haining
    December 14, 2015 9:49 am

    @Pete Wildsmith
    This is actually a good idea, you will suffer a performance penalty in a few places but likely orders of magnitude lower than a fence stall. Unfortunately it’s not exactly trivial to implement in cross platform production code and is only useful on apple platforms. As someone who works extensively on apple platforms it’s not worth the investment to implement, which means it isn’t for any other dev either, I’d expect adoption of this technique to be very low.

    evandrix
    December 14, 2015 10:49 am

    What’s the tool on OSX that you used to debug? the one in your screencaptures/screenshots

    JonnyH
    December 14, 2015 11:16 am

    I don’t know exactly the internals of the hw this is running on, but generally GPUs like batching stuff together as much as possible.

    Simply, draw_250_triangles is faster than draw_50_triangles 5 times. ‘draw 250 triangles, 50 with UBO A, 50 with UBO B’ etc. is also likely possible in this ‘fast path’.

    In addition, in lots of hardware there are DMA engines that run asynchronously to the 3d pipeline.

    Add in the fact that a buffer object often maps directly to a chunk of memory on the GPU – and by re-using the same chunk you’re possibly causing stalls in the pipeline as it has to wait for one batch of triangles to be completely drawn and flushed until it can then start the DMA copy for the new UBO data, then wait on that to start the next bunch of triangles.

    I suspect it’s likely that the ‘fast’ drivers are internally round-robin through a number of UBO allocations that all map to the same ID, avoiding this stall at the cost of slightly more memory usage (and possibly more copies if you don’t re-write the whole UBO every time). I suspect this optimization will benefit all remotely-modern hardware, but makes this highly dependent on how the driver decided to implement things.

    This is exactly why we’re moving to explicit APIs – it’s not going to be ‘easy’ to write apps, but at least you know exactly what is happening, and can solve it yourself.

    DP
    December 14, 2015 11:56 am

    It might help if you used a more recent version of OS X. From your screen shots it looks like you are running 10.5 or 10.6?

    Lars Viklund
    December 14, 2015 1:31 pm

    DP: That won’t help the users that are running on the lower end of supported OS versions, but sure, might mitigate it for personal use.

    sol_hsa
    December 14, 2015 2:55 pm

    My guess would be that they had driver that does thing A, then got spec B, and making B fast would require big changes, so they did a “bullet point” implementation instead.

    As far as I know, there’s no “performance conformance test” for opengl. Maybe there should be.

    anonymouse
    December 14, 2015 5:30 pm

    OpenGL on OS X is special: on Windows and Linux, the vendor driver gets to own the whole OpenGL stack, though post-Vista, the Windows OS does some of the buffer management itself. But whatever OpenGL library you get owns pretty much everything between the API entrypoint and writing the commands to the command buffer. On MacOS, though, there’s also an OS-supplied OpenGL library, which owns all the entry points and which helpfully does some of the work before calling into the vendor driver, which then sometimes has to do more work to undo some of the “helpfulness”. It might be this Apple OpenGL shim getting in the way and inserting waits to preserve its notion of correct semantics or something.

    Levi
    December 14, 2015 6:29 pm

    I’m just excited for an OS X version!

    Todd
    December 15, 2015 1:04 am

    It’s exactly for this reason that people are abandoning OpenGL and D3D 11 type APIs in favor of things like D3D 12, Metal, Vulkan, and Mantle. If you’re going to have to write driver specific code anyway, it may be worth looking into APIs that aren’t as hard for vendors to support.

    Also, super stoked for the OS X version. Couldn’t be more pleased.

    cappie
    December 15, 2015 4:17 am

    I love these blog posts… I could watch someone solve these puzzles for days.. quality writing :)

    Drew
    December 15, 2015 9:16 pm

    Fascinating af, thank u

    gpdev
    December 27, 2015 10:39 am

    “Down to the Metal: what Apple’s new graphics API means for Mac apps and games”

    http://www.t3.com/features/down-to-the-metal

    Krys
    January 1, 2016 1:43 pm

    I’m going to be one of the first in line to buy it when it’s up for Mac!

    Not Ian
    January 5, 2016 1:57 pm

    I bought it with the expectation that OSX would be sorted soonventually.

    Here’s hoping :)