Adventures in Optimization…

Shortly after my last post here, I added wandering nomads to the game. They occasionally come to the town, and the player can either allow or deny them citizenship. There might be consequences of turning them away too many times… but that’s a feature for another day.

After adding the nomads I was in a graphics sort of mood, and so I started looking at adding some atmospheric effects to the game – mist, fog, rain, snow, and changing lighting conditions throughout the year. I was looking at the shader programs that make the game pretty, and started wondering about graphics performance – thinking I should look at optimizing things before adding more GPU overhead. Which is very dangerous for me, being a graphics programmer. I can tinker forever trying this and that to make the same scene display at a higher framerate.

So I took a long road down optimization lane, and with it came some serious coding and learning a new graphics API. This is probably too long and has too much information about graphics programming, so I understand if you feel like saying TL;DR, finish the game already. :)

Let’s look at my test scenes. One is a view over a town in winter, and the other is the same town seen from high above.

Test Scene 1: the first scene has 2409 objects submitted to the GPU, and consists of ~817,000 triangles.

Test Scene 2: the second scene has 4617 objects submitted to the GPU, and consists of ~1,243,000 triangles.

From early on my graphics engine has been able to render the same object in multiple locations with a single draw call. This is called instancing. So instead of having to call Draw() 2409 times for the first scene, the engine can batch all similar trees together, or all the rocks together, or all the houses together and submit them in 411 calls to Draw(). This is good since each draw call requires some state change for the GPU, and it (usually) saves CPU time as well.
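As a rough illustration of how batching shrinks the number of draw calls, here is a minimal sketch. It is not the engine's code: `countInstancedDrawCalls`, the integer mesh ids, and the `maxPerBatch` parameter are all hypothetical names invented for this example, and `maxPerBatch` is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: counts the draw calls needed when objects that share
// a mesh are batched together, with at most maxPerBatch instances per call.
std::size_t countInstancedDrawCalls(const std::vector<int>& meshIds,
                                    std::size_t maxPerBatch)
{
    std::map<int, std::size_t> perMesh;  // mesh id -> number of objects using it
    for (int id : meshIds)
        ++perMesh[id];

    std::size_t calls = 0;
    for (const auto& entry : perMesh)
        calls += (entry.second + maxPerBatch - 1) / maxPerBatch;  // round up
    return calls;
}
```

Six objects spread over three meshes need three calls instead of six, and 60 copies of one mesh with a limit of 52 per batch need two.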

Up until last week, my engine used DirectX 9.0c and shader model 2. I initially chose this for the widest possible support of video cards and older computers. If you know anything about shader model 2, it doesn’t support instancing. So if you want it, you have to fake it. This is done using data repetition. Each mesh is repeated in memory some number of times, but each repetition has a different ‘id’ encoded with each vertex. This id is used to look up into a table of transforms. (A transform for those non-graphics people is just a location, orientation, and scale…) This is very much like hardware skinning – the deformation that takes place to bend a model around its skeletal structure. Due to limitations on the number of constants that can be set for shader model 2, the engine can draw up to 52 objects at once.

To draw multiple copies of an object the graphics code does something like:

device->SetVertexShaderConstantF(transformListIndex, transformData, 
	transformCount * 4);
device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, startVertex, 0, 
	numVertex * instanceCount, startIndex, primitiveCount * instanceCount);

The vertex shader then does something like this

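uniform float4x4 tc_transforms[52];

struct input
{
	float4 position : POSITION;	// xyz = position, w = instance id
};

// look up this copy's transform using the id stored in position.w
float4x4 localToClip = tc_transforms[input.position.w];
float4 localPosition = mul(localToClip, float4(input.position.xyz, 1.0));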

This method works great, and is very fast. However, the mesh vertex and index data are repeated 52 times! Think about the memory requirements for this in the scope of all the meshes in the game… For dynamic vertices, like particle systems, drawing instances is worse, since the engine has to manually copy the data N times and set the vertex id value. memcpy() starts to show up on the profiler for heavy scenes.
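The data repetition described above can be sketched on the CPU side roughly like this. It is only an illustration under assumptions: the `Vertex` layout, the `ReplicatedMesh` struct, and `replicateMesh` are hypothetical, and 32-bit indices are used here simply so the offsets cannot overflow.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex { float x, y, z, w; };  // w carries the instance id

struct ReplicatedMesh {
    std::vector<Vertex> vertices;
    std::vector<std::uint32_t> indices;
};

// Hypothetical helper: repeats a mesh `copies` times, tagging every vertex of
// copy c with id c and offsetting its indices into that copy's vertex range.
ReplicatedMesh replicateMesh(const std::vector<Vertex>& verts,
                             const std::vector<std::uint32_t>& inds,
                             std::size_t copies)
{
    ReplicatedMesh out;
    out.vertices.reserve(verts.size() * copies);
    out.indices.reserve(inds.size() * copies);
    for (std::size_t c = 0; c < copies; ++c) {
        const std::size_t base = c * verts.size();
        for (Vertex v : verts) {
            v.w = static_cast<float>(c);  // id used to index the transform table
            out.vertices.push_back(v);
        }
        for (std::uint32_t i : inds)
            out.indices.push_back(static_cast<std::uint32_t>(base + i));
    }
    return out;
}
```

Drawing N instances then means drawing the first N copies in one call, which is why both the vertex and index counts are multiplied by the instance count.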

To alleviate this memory requirement (which is getting huge, btw), I decided to try the instancing method available in shader model 3. In shader model 3, you can mark a vertex stream as instanced. This was pretty quick to add to the engine. I’ve got all the graphics code isolated to a few files, so in a few hours I had this new method working. Instead of shoving transforms into vertex constants, they instead get copied into a vertex buffer. At draw time, you do something like:

// stream 0 holds the mesh and is drawn instanceCount times;
// stream 1 holds the transforms and advances once per instance
device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | instanceCount);
device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1);
device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, startVertex, 0, 
	numVertex, startIndex, primitiveCount);

You then change the vertex declaration to include the ‘instance’ inputs, and the shader then loads the transform like so:

struct input
{
	float4 position : POSITION;
	float4 w0 : TEXCOORD0;	// transform row 1
	float4 w1 : TEXCOORD1;	// transform row 2
	float4 w2 : TEXCOORD2;	// transform row 3
	float4 w3 : TEXCOORD3;	// transform row 4
};

float4x4 localToClip = transpose(float4x4(input.w0, input.w1, input.w2, input.w3));
float4 localPosition = mul(localToClip, float4(input.position.xyz, 1.0));
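For the CPU side of this method, copying the transforms into a buffer for the instanced stream might look something like the sketch below. `Float4x4` and `packInstanceTransforms` are hypothetical names, and the real engine would write into a locked Direct3D vertex buffer rather than a `std::vector`.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct Float4x4 { float m[16]; };  // 64 bytes: four float4 rows

// Hypothetical helper: packs per-instance transforms into one contiguous
// byte buffer, the layout an instanced vertex stream would read from.
std::vector<unsigned char> packInstanceTransforms(const std::vector<Float4x4>& transforms)
{
    std::vector<unsigned char> buffer(transforms.size() * sizeof(Float4x4));
    if (!transforms.empty())
        std::memcpy(buffer.data(), transforms.data(), buffer.size());
    return buffer;
}
```

Each instance costs 64 bytes in this layout, which matches the four `float4` inputs the shader declares.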

After changing all the vertex layouts and vertex programs to do this, I took some frame rate results. The scene was tested at 1600×900, 2xMSAA, 2x anisotropic filtering, 2048 shadow map, and a 5 tap PCF kernel for sampling the shadow map. The test machine is an i7 920 @ 2.67GHz with an NVIDIA GeForce GTX 280.

           Shader Model 2    Shader Model 3
Scene 1    103 FPS           85 FPS
Scene 2    63 FPS            55 FPS

What!?! Why is the shader model 3 method 10-20% slower? This sort of thing frustrates me. I make a change that’s supposed to be either faster or the same speed, and it’s actually slower. After making sure I wasn’t doing something bad when filling the transform buffer, I pulled out a GPU profiler to see where the difference was.

You’ve probably heard talk about games being either vertex bound or pixel bound. This just means that the GPU is spending more time working on the vertices of triangles, or more time on filling pixels on screen. The truth is that most games are both pixel and vertex bound at different times. If a game uses shadow maps, it’s probably vertex bound while rendering the shadow map. Modern GPUs are crazy fast at filling the pixels that are used for shadow maps. So fast, in fact, that the GPU spends most of its time loading vertex data and running vertex shaders, and that isn’t fast enough to generate enough work for the pixel pipeline, even with load balancing.

My game is bound like this. It’s also vertex bound on characters and animals – they’re small on screen, but the GPU is spending more time deforming the meshes than shading pixels. This probably means I need lower poly meshes and LODs, but that’s another task for another day.

However for other objects, when the game is rendering buildings and terrain, more time is spent on pixels than on vertices.

What’s happening to make the shader model 3 version slower is that the areas that are vertex bound become more vertex bound because of the load of additional data per vertex. When drawing an object for a shadow, the shader model 2 version only has to load 8 bytes of data. The shader model 3 version has to load 72 bytes. This is a lot more memory traffic, and the only explanation I have for the slow-down.
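To put those two figures side by side, here is a small sketch of the arithmetic. The 8 and 72 byte figures come from the post; the breakdown of the extra 64 bytes as four `float4` transform rows is an assumption, and `vertexFetchBytes` is a hypothetical helper.

```cpp
#include <cstddef>

// Per-vertex size quoted in the post for a shader model 2 shadow pass.
constexpr std::size_t kSm2BytesPerVertex = 8;
// Assumption: the extra shader model 3 cost is the four float4 transform rows.
constexpr std::size_t kTransformBytes = 4 * 4 * sizeof(float);
constexpr std::size_t kSm3BytesPerVertex = kSm2BytesPerVertex + kTransformBytes;

// Total bytes loaded for vertexCount vertices at the given per-vertex size.
constexpr std::size_t vertexFetchBytes(std::size_t vertexCount,
                                       std::size_t bytesPerVertex)
{
    return vertexCount * bytesPerVertex;
}
```

Under that assumption the shader model 3 path loads nine times as many bytes per vertex in the shadow pass.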

While the shader model 3 version is slower, the memory consumption of the application drops from 270Meg to 172Meg. That’s nearly 100 megabytes of repeated mesh data!

At this point, I have an idea to have both low memory usage and faster render time, but I know it’s going to take a few days to implement. Really, the game runs fine, I should focus on gameplay features, and I don’t need to write a DirectX 11 render path but….

I don’t think DX10/11 is very well accepted, or used widely as of yet. If I have a question about something in DX9, I can google for it and have an answer in seconds. The same can’t be said of DX11. There are very few examples. And while the documentation is there, there are a lot of hidden issues that you only find by reading the debug output from DX11 once you start using it. I’d hate to be someone who never used DX9 and jumped right into DX11.

I had a very slow start getting DX11 up and running. Apparently automatic updates installed Internet Explorer 10 on my development machine, which breaks the DX11 debug output. It also breaks PIX – a tool that lets you capture a frame of a game and examine all the GPU calls and state. I use it all the time to take the guesswork out of rendering errors. It’s like a debugger for graphics.

The fix for this is to use Windows SDK 8.0. At the same time I figured I’d update to Visual Studio 2012. Once compiling with the new SDK, I discover that XAudio 2.8, which is all that ships with SDK 8.0, isn’t available for Windows 7. So I hack things up to use the old SDK to get XAudio 2.7 while still using the new DirectX 11.1. This all finally works, but PIX is still broken.

Finally I just uninstall IE10 and related updates since I don’t use it anyway. Now PIX works, and the DX11 debug layer works. And back to Visual Studio 2010. On with coding….

The DirectX graphics interface for my engine is only about 60K of code, so writing the same small bit for DX11 was pretty quick. I spent more time writing a shader generation system so that I didn’t have to write different vertex and pixel programs for shader model 2/3 vs 4. Texture sampling and shader inputs and outputs are significantly different between the different shader models. I also spent a fair amount of time debugging and making sure I wasn’t doing anything to cause the GPU to stall.

In shader model 4, there is this great input called SV_INSTANCEID. It gives you the index of the instance that the GPU is working on. This is exactly what my initial implementation did, but as I don’t have to supply the index myself there is no need for data repetition.

The draw call becomes

// the first argument is an index count, so three per triangle
context->DrawIndexedInstanced(primitiveCount * 3, instanceCount, 
	startIndex, startVertex, 0);

The shader looks like:

cbuffer tc
{
	float4x4 tc_transforms[128];
};

struct input
{
	float4 position : POSITION;
	uint instanceid : SV_INSTANCEID;
};

float4x4 localToClip = tc_transforms[input.instanceid];
float4 localPosition = mul(localToClip, float4(input.position.xyz, 1.0));

This is fantastic. DX11 also uses the least memory while running the game. 94Meg for the test scene.

While writing the DX11 implementation and reworking the vertex and pixel programs for shader model 4, I found a bunch of items such as unneeded input assembler loads, floating point exceptions, and render state that was making things slower needlessly. Because of those fixes, my shader model 2 implementation runs at 130FPS instead of the original 103FPS. Also fantastic.

Here are my final frame rates for my two systems. All results are GPU limited. CPU time is under 4ms in all cases.

The first scene has 2409 objects, and has ~817,000 triangles.
The second scene has 4617 objects, and has ~1,243,000 triangles.

Test System 1
i7 920 @ 2.67GHz, NVIDIA GeForce GTX 280
1600×900, 2xMSAA, 2x Anisotropic, 2048 Shadow Map, 5 Tap PCF shadow kernel

           Shader Model 2     Shader Model 3      Shader Model 4
Scene 1    130 FPS (7.7ms)    100 FPS (10.0ms)    118 FPS (8.5ms)
Scene 2    77 FPS (13ms)      61 FPS (16ms)       71 FPS (14ms)

Test System 2
i5 M480 @ 2.67GHz, NVIDIA GeForce 610M
1280×720, No MSAA, Trilinear, 1024 Shadow Map, 5 Tap PCF shadow kernel

           Shader Model 2     Shader Model 3     Shader Model 4
Scene 1    33 FPS (30.3ms)    26 FPS (38.5ms)    48 FPS (20.8ms)
Scene 2    21 FPS (47.6ms)    17 FPS (58.8ms)    28 FPS (35.7ms)

What does this tell me? It tells me I still possibly have something wonky in my DX11 implementation, since on the GTX 280 it’s still 0.8ms slower than the DX9 shader model 2 version. However, on the laptop GPU the results are phenomenal. A decrease of nearly 10ms in the first test scene and 12ms in the second scene is pretty amazing just for an API change.
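The millisecond figures quoted here are just the reciprocal of the frame rate. A minimal sketch of the conversion, with hypothetical helper names:

```cpp
// Converts a frame rate to the time spent on one frame, in milliseconds.
double frameTimeMs(double fps)
{
    return 1000.0 / fps;
}

// Milliseconds saved per frame when going from oldFps to newFps.
double frameTimeSavedMs(double oldFps, double newFps)
{
    return frameTimeMs(oldFps) - frameTimeMs(newFps);
}
```

Comparing frame times rather than FPS is the fairer measure, because the same FPS gain means very different amounts of saved time at 30 FPS and at 130 FPS.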

It also tells me that I probably won’t ship the shader model 3 version. While it does use less memory, I’d prefer a better gameplay experience for those with older systems and video cards, and I can tweak the memory used for each model. Trees and rocks can have the full 52 copies of the mesh data, but buildings and other things that will never reach 52 on screen at once can have only 2-5 copies. This will bring down memory consumption to reasonable levels, although it does require a per-asset tweak.
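A quick way to see what the per-asset tweak buys is to estimate the memory for a mesh at different copy counts. This is only a sketch: the `MeshSize` struct, `replicatedBytes`, and any numbers fed into it are hypothetical and not taken from the game.

```cpp
#include <cstddef>

// Hypothetical description of one mesh; none of these fields come from the post.
struct MeshSize {
    std::size_t vertexCount;
    std::size_t vertexStride;  // bytes per vertex
    std::size_t indexCount;
    std::size_t indexSize;     // bytes per index (2 or 4)
};

// Bytes needed to store `copies` repetitions of the mesh's vertex and index data.
std::size_t replicatedBytes(const MeshSize& mesh, std::size_t copies)
{
    const std::size_t oneCopy = mesh.vertexCount * mesh.vertexStride
                              + mesh.indexCount * mesh.indexSize;
    return oneCopy * copies;
}
```

For a made-up building with 1000 vertices of 32 bytes and 3000 16-bit indices, 52 copies take 1,976,000 bytes while 4 copies take 152,000.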

The DX11 memory usage is really good, and I could probably get the DX9 version down even further by not using the D3DPOOL_MANAGED flag on resources, but then alt-tabbing away and back to the application becomes annoying since I have to manually load all graphics resources from disk again. I’d much rather have the switch be immediate.

Was this week and a half of trying different instancing methods worth it? For sure. The original implementation now runs 2ms faster (103 to 130 FPS), and those with DX10 level video cards will get a performance boost on some systems. While writing the DX11 code, I treated it as a different platform. This makes me more confident about porting the game to other systems (like ones that use OpenGL), as the functionality is now there for making ports and dealing with different data per platform.

Now back to that mist, fog, rain, snow, and changing lighting conditions….


    May 29, 2013 9:24 am

    VERY NICE!!! More updates! I like it!

    May 29, 2013 9:42 am

    Godspeed good sir! Show DirectX who’s boss =)

    May 29, 2013 9:45 am

    Hey! Newbie here, great work, when will it be released?

    May 29, 2013 9:46 am

    As technical as it was, I really enjoyed this update. It helped me to understand some things going on here at work. We’re building a 3D simulation for a client. Keep up the good work!

    May 29, 2013 9:47 am

    Loving it, can’t wait to get my hands on this. These technical updates are awesome.

    May 29, 2013 9:51 am

    Wow. I’m still getting my feet wet when it comes to programming and game design, and believe it or not this update actually taught me a few things!

    The game is looking great, and I can’t wait to play it.

    May 29, 2013 9:53 am

    Wonderful! I love reading your devdiaries. While I have no programming knowledge besides Matlab, this makes for a wonderful read because you are describing things so accurately while not boring us non programmer folk with details.

    Well done! I admit I was hoping for some rain, fog and other shots.

    May 29, 2013 10:00 am

    It’s really great that you’re working so hard on optimization. There are so many games out there that lack the finesse and could be written much better.

    I have been following this game for a while and just keep looking forward to it more and more.

    May 29, 2013 10:16 am

    SOOO looking forward to this.

    You sir are truly amazing!

    May 29, 2013 10:28 am

    Awesome dude i want this game so bad…

    May 29, 2013 10:32 am

    Your attention to detail is what sets you apart from other developers or coders! I love that you want the best experience for all of your gamers.

    With updates like this I don’t want you to rush the game, I know it will be the best when it is released. Keep up the great work.

    May 29, 2013 10:53 am

    Whoa holy shit, I cannot wait for this to be shipped! shutupadntakemymoney.jpeg

    May 29, 2013 11:12 am

    This was a great read!

    May 29, 2013 11:14 am

    Take your time SIR! We don’t want a rushed game like S*****y.

    May 29, 2013 11:15 am

    When you were using IE10 on Win7, and PIX stopped working, did you happen to try VS 2012 PIX instead? Was wondering if it worked for you or not. (I have a vested interest in the VS PIX, I work on it)

    May 29, 2013 12:01 pm

    @Rich: I haven’t tried the VS2012 PIX. I had only downloaded VS2012 Express to get the Windows 8 SDK installed and try out the new compiler. My understanding is that VS2012 Express doesn’t include the graphics debugger. I’ll most likely purchase VS2012 Pro at some point and try it out. I was more frustrated that what previously was working was broken after a seemingly unrelated piece of software was installed. (It was actually KB2670838. Uninstalling it caused the removal of IE10 as well…)

    May 29, 2013 12:12 pm

    Properly good post, thanks a lot for taking the time to write it!

    May 29, 2013 12:13 pm

    I’d love any feedback you might have if you do decide to get it (although I’d wait until the build conference in June before buying anything).

    Please keep up the blog. Love reading your adventures. Especially this type of post. This one was excellent.

    May 29, 2013 12:24 pm

    I can’t wait to throw my money at you.

    May 29, 2013 1:56 pm

    Ummm, I read this whole post. So interesting seeing a small piece of how a game is really made. Fascinating: Mind blown. I am not going to try and pretend like I know what ANY of that meant though as it went WAY over my head. Keep up the good work!

    May 29, 2013 2:23 pm

    Reading your update is as good as playing any other game. Since SimCity 5 let me down, you are my savior for 2013. Keep up the good work, and when you’re done I will buy the game.

    May 29, 2013 4:19 pm

    I really like reading about stuff like this – I know next to nothing about graphics programming and it is nice to have this information simply stated like you have here. Thanks!

    May 29, 2013 6:09 pm

    Awesome update!
    What sort of data visualisation do you plan on implementing? Something like the new simcity has would be awesome. Playing tropico can be frustrating because although it’ll say there is surplus food, it may all be on one side of the island. Figuring that out with just graphs and numbers can be difficult.
    Just food for thought, I look forward to more updates. Keep up the good work! :)

    Andy D
    May 29, 2013 7:09 pm

    Love the ideas!

    Don Scopel
    May 29, 2013 8:58 pm

    Man, great job!
    I’m a frustrated game developer. Please write a post for those who want to start developing a game!

    May 29, 2013 11:42 pm

    Will you have random flybys with birds? Both flocks and individually? Thanks!

    May 30, 2013 9:29 am

    I am impressed by the way you write these dev posts. You convey the important bits of very complicated things in an easily understandable fashion.

    May 30, 2013 5:15 pm

    Keep up the good work!

    May 30, 2013 11:28 pm

    A lot of folks I’ve talked to, have gone directly from DX9 to OpenGL… I can’t speak to the differences, but it just seems to be a trend.

    Anyhow, very interesting read! 😀

    May 31, 2013 2:25 am

    I loved reading this :3

    I might not have understood every detail, but the way you wrote it I got a good feeling for what it meant. please do more!

    June 1, 2013 11:59 pm

    Wow, I think I was the only one who read through everything but the coding. I love your project and hope you figure out a price for the game soon. I can’t wait!

    Tom Duhamel
    June 2, 2013 11:33 am

    Congratulations, you just found out why DX11 hasn’t really caught on yet. There was no problem with your code, your results are consistent with what was published by other developers before.

    DX9.0c has been around since Win98 (as an update) and was included in the very first release of WinXP. It has shown its maturity and has always worked fine. To this day, it is still being used in commercial games.

    Forget about DX10, it was abandoned by MS very early, before it was even completed. No idea what went wrong, but they apparently didn’t like it.

    DX11 adds some more modern graphic features, in order to make scenes look better, but this comes at the expense of slower processing. To my eyes though, these improvements are too subtle to be worthwhile.

    On a side note, one important reason that game developers probably consider, is the fact that DX11 isn’t available for XP, and as you probably know a lot of hardcore gamers have stuck to XP, so requiring DX11 in a game would leave out a significant fraction of their market.

    Shader v3 has been out and in common use for quite a while though, I’m surprised by your results. Never checked any benchmark before, so I’m not sure if your results are normal or not.

    Your test machine #2, a laptop I reckon, is quite outdated. Despite the fact, it shows really good results. So at this point I’d say it’s ok to try and test small optimizations, but it’s just not worth spending weeks on that.

    I’m thrilled that you are confident you can easily port your game to another API. It means that it might be quite easy for you to port your game to the Mac, and possibly more. By leaving out a few details and some quality, maybe even tablets. And who knows, maybe my grandmother’s toaster?

    Please, pretty please, do not release this game before the end of the summer. It would make me waste beautiful sunny days on my computer.

    June 4, 2013 12:49 pm

    @ Tom Duhamel:
    “… as you probably know a lot of hardcore gamers have stuck to XP, so requiring DX11 in a game would leave out a significant fraction of their market”

    That was perhaps the case in the Vista days since Vista really underperformed compared to XP, but starting with Windows 7 most gamers moved on (thank god).
    Look at for hardware/software stats of all Steam gamers. This gives a very accurate description of the current PC gaming community.

    What gives:
    Only 7.65% (32 bits) + 0.38% (64 bits) still use Windows XP.
    Most have moved on to Windows 7 53.47% (64 bits), 13.38% (32 bits) or Windows 8 11.83% (64 bits).

    Tom Duhamel
    June 4, 2013 3:42 pm


    Thanks for your input. While I do agree with your opinion, I do believe that 8% is still a significant number. Would you mind if I were to cut your revenue by 8%? If I were a game developer (I am not) I would certainly take that into account when taking decisions.

    Also, for more casual games like this one, I am under the impression that the share of older machines would probably be higher than this. While hardcore gamers will want the latest and the greatest for the big action hits, I think people looking at more modest games such as this one are probably less willing to upgrade. (I think I should have left out the word “hardcore” in the bit you quoted; I suppose that was probably the bit where I was wrong.)

    Besides, XP was and still is a good OS. I am myself on 7 and wouldn’t go back, though I am still using XP at work and cannot complain. MS dropped support for XP, but so far we haven’t reached a point where it is difficult to get drivers or anything for XP. For many people, there just isn’t a reason to upgrade, yet.

    In any case, this is just my opinion, and should only be taken as such. Until I figure out how to become a millionaire, my opinion has no weight at all lol.

    June 4, 2013 7:05 pm

    Hello, mind if I ask why not OpenGL from the start? You could have spared yourself playing with DX and then porting to OpenGL if you had used OpenGL from the start.

    June 4, 2013 7:47 pm

    @Grzegorz: I knew DX9 better when I started my engine so I used it. I can pretty much use the DX9 API without looking at the documentation – and my initial target platform was Windows PCs so it made sense. To port my graphics engine to another platform only takes 60K of code – not a big deal to implement other APIs when the time comes.

    June 5, 2013 5:54 am

    I’m not a game developer, but I heard OpenGL (even though it needs more code) is way faster than DX. Will Windows users get the OpenGL option too? :)
    Anyway, really great concept 😉 really looking forward release!