Adventures in Debugging

I’ve just fixed one of the worst bugs that has been plaguing Banished. It’s potentially the cause of nearly all my pending crash to desktop bugs. It took me so long to track down that I thought I’d share the debugging process. Normally fixing a bug is an hour task. Reproduce the issue, find a solution, implement the fix, and then test the fix. Other times debugging is an all day and all night process, or worse.

The most important thing is to be able to reproduce the issue. If a developer can’t make a bug happen reliably, there is little chance it can be fixed. And finding reproduction steps can be hard.

So more than a few months ago I’d noticed an issue when I’d build test cities by cheating. I’d pause the passage of time in game, turn on debug mode so I could place buildings for free, and place a whole lot of buildings – basically lay out an entire town. This let me test features and the simulation of large towns, without spending hours to make it properly. Once in a while, when the passage of time was resumed, the game would immediately crash.

Large towns run much better when the game isn’t built for debugging, so I’d often be in release mode for a high framerate, and be unable to properly use the debugger to see why the game crashed. Even worse, the autosave feature was properly saving towns. If the crash occurred and the save was reloaded, the game could be unpaused without issue!

Because the save game was intact, I usually lost less than a minute of work due to the autosave process. And since the bug was so infrequent and it occured while I was cheating, I half assumed I had something wrong with the code that let me build buildings for free. So I ignored the issue for a while.

But lately I’ve had testers having this exact situation occur, but without cheating. They lay out where buildings should go with the passage of time paused, and on resume it crashes. Testers have also been having other random crashes that I can’t quite figure out, and no one can reproduce them or give any hint as to what they were doing when the game crashed. Save games always seem to be intact and reload without issue or crashing… Hmm…

About a month ago I added functionality to output core dumps and call stacks from crashes and asserts in game so I’d at least have an idea of where in the code the game was crashing for testers. A core dump gives me the state of the program when the crash or assert occurs, and a call stack is just the list of function calls that were made up to the current point of execution.

While the core dump isn’t as useful as live debugging, it gives me ideas about where to start looking and how I might reproduce the bug. A save game, core dump, and call stack usually make fixing bugs easy. But not today.

Looking at a single core dump didn’t give me much idea about what was going on. I knew memory was either being accessed incorrectly, or it was being overwritten, but that was it. But looking at over 20 dumps with similar call stacks got me to thinking. While the call stacks all had crashes in different locations, they all came from a root function called ComponentEntity::Update(). A small percentage of them even occurred directly in that function.

This is the lower part of the call stack of the crashes:

... lots of junk here....
void Game::ComponentEntity::Update(System::Time* dT)  <--- top of stack
void Game::EntityList::Update(System::Time* dT)
bool Game::Game::Update(System::Time* dT)
bool Engine::MainLoop::Update(System::Time* dT)
bool Engine::MainLoop::Frame(bool disableUpdate)
void Engine::MainLoop::Run(Engine::UpdateLoop* _firstLoop)
int RunGame(void* systemData, char* commandLine)
int WinMain(HINSTANCE__* instance, HINSTANCE__* __formal, char* commandLine, int __formal)

Again, this didn't give me much - a ComponentEntity is a generic object in the game. It could be a tree, a person, a deer, a pile of rocks, or a building. About the only thing in the game not using ComponentEntity is the terrain. A large map can have over 30,000 ComponentEntity objects, so it really could be anything making the game crash.

I started trying to reproduce this issue manually. I was successful only once in maybe 75 tries and only in a non-debug build. Not being able to reproduce the issue in debug mode was a hint too. It told me that uninitialized or overwritten memory was being used. In debug, the engines memory allocator clears memory to known patterns on allocation and deletion. This practice is good in many ways, but can hide other issues. I started thinking this was one of those issues.

Having reproduced the crash once is an annoying thing - I knew I could do it, but I didn't know how I had done it that one time. And when testers are reporting crashes daily, it's got to be fairly common. What was I missing? I tried over and over to get it to happen again, but I always failed.

I gave up trying to reproduce the bug, and instead gave over to trying to look for ways that an invalid entity could be in use. I was basically looking for logic errors in the code regarding the handling of objects memory and pointers to them. This is where despair sets in with the knowledge that I'm probably not going to be able to find this issue. As the saying goes, it's like looking for a needle in a haystack.

But because this crash seemed to deal with invalid memory, I was pretty sure the bug could only happen if the player placed an object and then removed it with the game paused. On a whim, I placed a breakpoint in the function that destroys a ComponentEntity. (A breakpoint just stops execution of the game at some particular line of code.)

I assumed I'd go through the objects I removed when the game was unpaused and see if I could see any pitfalls that might be causing the issue. When a player removes an object in game, say a building, it's scheduled for removal at a later time. All deletes are deferred until the end of an game update, and with time paused, final ComponentEntity removal and deletion doesn't actually happen.

To my surprise, the breakpoint was hit with the time paused! The callstack now looked this.

void Game::EntityList::DestroyEntity(Object::InternalPtr entity)
void Game::EntityList::Persist(IO::Stream* stream)
void Game::Game::Persist(IO::Stream* stream)
void Game::Game::SaveRestore(bool save, Save::SavedState* gameState)
void Save::SaveLoad::UpdateSave()
void Save::SaveLoad::Update(System::Time* dt)
bool Game::Game::Update(System::Time* dT)
bool Engine::MainLoop::Update(System::Time* dT)
bool Engine::MainLoop::Frame(bool disableUpdate)
void Engine::MainLoop::Run(Engine::UpdateLoop* _firstLoop)
int RunGame(void* systemData, char* commandLine)
int WinMain(HINSTANCE__* instance, HINSTANCE__* __formal, char* commandLine, int __formal)

The important thing to notice here is the mysterious execution path through

Save::SaveLoad::UpdateSave()        <--- this saves the game!

My thought process went like something this. "What's this? How I'm deleting objects while saving?!?!? Wait, wait, why am I saving the game? Oooohhh. Autosave went off. Why in the world is saving the game calling DestroyEntity! It's supposed to save! Oh yeah, it's an object scheduled for deletion. No need to save it. How long ago did I write that code? I wonder if that's part of the issue..."

And then a light bulb went on in my head, and it led me down the right path.

Not only do objects get scheduled for deletion, they are scheduled for addition to a list of objects needing updates. When the game is paused, if a player creates and then removes a building, it's scheduled for both addition and removal.

In the normal path of removing an object, if something is both being added and removed in the same update it's fine. The code handles this gracefully. But in the case of a save game, it directly deletes the object and it remains scheduled for addition!! The object would be deleted, but a reference to it would remain in the update list. (And this is why lots of programmers like garbage collection...)

Now I had proper reproduction steps. Pause time, place a building, delete the building, wait for autosave or save manually, place another building that will be allocated in the same memory as the previous building, then unpause time. Observe the crash. I could see the code path that would cause the issue and groaned audibly when I realized my mistake.

It was here I realized that when I was trying to reproduce the bug, I couldn't do it because I was generally moving to fast and autosave never occurred.

This is certainly one of the most convoluted bug reproduction steps I've seen in a while. But this is also a very common operation in game. As a player adds a new section of town, they pause the game so they don't have to manage the town while deciding on layout. They might mis-click, or decide on a better layout, remove some things they just placed, and place new buildings instead.

Knowing all this, I still couldn't reproduce the crash reliably. Using invalid memory does strange things. When the scheduled addition was made to the list of entities, if the memory pointed to by the now invalid pointer at a certain offset had a certain bit set, that object wouldn't be added to the list. However if it wasn't set the now deleted and invalid entity was added to the list, and the game would crash at some unknown time in the future, depending on the contents of that memory.

It always amazes me when really long debugging sessions result in a single line fix to the code. One extra function call properly cleaned up the list of scheduled additions when the save game came along to do it's thing, and a ton of random crash bugs just went away.

Phew. It's nap time.

45 Comments

    RMuldoun
    November 15, 2013 9:35 am

    It’s almost the end of the year friend :(

    Sean
    November 15, 2013 9:44 am

    Thank you for sharing this process. I found it to be a fascinating read. I can’t wait for this to release, keep up the good work!

    Nick
    November 15, 2013 9:47 am

    Thanks for bug fixing. So many games are plagued with bugs.

    daomingsi
    November 15, 2013 9:48 am

    Keep up the good work! Hope to hear more soon :)

    Dan
    November 15, 2013 9:50 am

    Seriously. With the force of a thousand suns. My money. Your pocket.

    Kbolt
    November 15, 2013 9:52 am

    Keep it up. This is possibly my most anticipated game of the year (or next year if needs be). How I’ve longed for an Age Of Empires alternative that isn’t so combat heavy. My eternal appreciation to you for your effort.

    Really curious to see where you go after this project as I have a strong feeling this will do very well at the box office. Care to share your thoughts on this in an update?

    Alin
    November 15, 2013 9:52 am

    I must assume that this was the big bug that postponed the game launch, so here is me hoping to not have to wait more, before being able to throw my money at you good sir !

    llnk
    November 15, 2013 9:55 am

    Why in the world do you use manual memory allocation over RAII protocol ?

    http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization

    shared_ptr
    auto_ptr

    :)

    ПутуИщч
    November 15, 2013 9:59 am

    Thanks you for yet another great read. Glad you were able to fix the bug. Keep up the good work!

    Tortex
    November 15, 2013 9:59 am

    Quick idea. Everyone that tests the game also records it. Realistically, not everyone might be able, their PC might not be up to the task of doing both things. But doing this could help with seeing what actions the player takes when a bug occurs.

    Another idea, if you do record testing runs, don’t use Fraps, it’s the worst thing. OBS is free and quite good at recording/streaming.

    Tony
    November 15, 2013 10:01 am

    After you release the game, please keep these blog posts coming. It’s very interesting to read your thought process and learn about the inner-workings of the game.

    Jacob
    November 15, 2013 10:02 am

    Thanks for sharing this with us. This is the kind of blog post i really like from indie developers. It gives such an insight into the development that you simply can’t get other places. If i wasn’t already sure i would be your game, i would consider it just based on the quality of your blogs. Keep up the good work sir.

    Jon
    November 15, 2013 10:04 am

    This is probably really annoying but also a huge victory moment! I must say I love reading this. Take a huge nap, this is a good feeling knowing you’ve finally fixed it.

    Mat
    November 15, 2013 10:06 am

    Thanks for the good reading you offer to us.

    Keep going your good work !

    Jacob
    November 15, 2013 10:06 am

    Edit: Dammit, i dont want to ‘be’ your game, i want to ‘buy’ it..

    My first comment on your site and i make a typo – I will stand in this corner with my cone of shame for a while now.

    dada
    November 15, 2013 10:08 am

    no sleep ’till release!
    bread and water!

    Drew
    November 15, 2013 10:14 am

    It’s reasons like this that I want you to take your time with the game. If you had a deadline and you released the game with this unknown crash in place a lot of people would be put off with the ‘final’ product. I want to plan Banished the way that YOU intend. If that means that I have to wait till March or April I am fine with that. Keep up the amazing work my fiend. I can’t wait to play your game when you finish all the polish :D

    Nero
    November 15, 2013 10:15 am

    I am glad you found your Bug.

    For a couple of days now I’m also searching a Bug in my Program (Memory Access Violation).

    I also can’t reproduce this error, it occures randomly with the same Dataset (Radar-Data from BMW Vehicle Trace Files)(I’m currently working on my Master Thesis at BMW Research and Development at Munich).

    It is very frustrating. But hearing you finding your Bug gives me hope^^.

    After fixing this, the second part of my Thesis is to write a Visualization for Android with OpenGL ES, I’m already looking forward for this becaus then I have Garbage Collection ^^.

    Mathew
    November 15, 2013 10:28 am

    A very interesting read. Debugging is always a headache, which makes that Aha! moment all the sweeter.

    Ie never done anything near as complex as you, but can sympathize (And now celebrate) with you

    Templar_X
    November 15, 2013 10:29 am

    A painful tale with a happy ending :)
    Best of Luck

    Daxamar
    November 15, 2013 11:07 am

    Good work. Love reading these updates.

    Guess we are one step further to release? ;)

    Sarayakat
    November 15, 2013 11:11 am

    I always learn something interesting from your Devlog entries, very cool.

    There are only 2 games on my list to buy for the rest of 2013 and Banished is one of them.

    Keep up the good work! :)

    SomeOtherGeek
    November 15, 2013 12:00 pm

    Priceless!

    I remember days of debug hell too. My usual ratio is 1 hour of programming takes 4 hours to debug. It seems unrealistic, but over time, it usually adds up.

    I remember the pops/pushes with Assembly that brute force worked even when you did bot want them to. But the rewards of finding them are amazing.

    Keep up the good work.

    Mark
    November 15, 2013 12:03 pm

    I love these devlogs. They give us great insight into not only what you are battling with at present but also how you’ve done.

    Hope you don’t mind but I’ve copied the Banished image from the front page and used it to create a Google+ community. (If this is a problem let me know and I’ll pull it immediately.)

    The link here is for anyone using G+
    http://goo.gl/zf4ROu

    Ben Doerr
    November 15, 2013 12:10 pm

    While, I am not a game developer I am a software developer and feel your pain as I read this.

    Kang
    November 15, 2013 12:31 pm

    Good thing you squashed it! I haven’t programmed seriously for a long while, but I remember debug hell! It’s always the littlest things that causes the biggest bugs.

    I love reading your blog posts. It’s so rare to find anyone ever talk about the nuts and bolts of crafting games.

    Michael
    November 15, 2013 12:36 pm

    I love how you described the debugging process. Some people don’t realize that you can spend hours/days on trying to solve one bug, and the end result is changing one line in code.

    Gix
    November 15, 2013 1:41 pm

    I love updates.

    Eliza
    November 15, 2013 1:46 pm

    As a (former) professional QA lead– now a professional developer– that sounds like a nightmare. Good job grabbing that horrible thing!

    Chuck
    November 15, 2013 2:02 pm

    I have to say reading these updates is something I look forward to very much. You’re writing is excellent and you explain these complicated matters (and the detective work that goes into finding them) in a very easy to understand straight forward fashion.

    You really should write a book about all this when you are done. It would be of great help to young programmers that are often confused about so many of these things like memory allocation and debugging!

    Keep up the amazing work!

    Rod
    November 15, 2013 2:43 pm

    I know what it’s like to finally track down an infuriatingly tricky bug. Such a satisfying feeling after all those hours or days of frustration. Really good news and hopefully this puts you on track to launch very early in December and capture a piece of that Christmas market.

    Lukio
    November 15, 2013 3:05 pm

    Even as a non programming type, sharing these kind of tid-bits really makes this interesting, I guess for some people these dev logs are like a “lessons learned” collection.
    MY SO and me are really looking forward to this little gem of yours, already *wishlisted* and ready to go!

    Canth
    November 15, 2013 4:12 pm

    This reminds me of the problem MWO had with one of the bugs in CryEngine they found. It took them freaking FOREVER to fix it, but what they described was one of the most convoluted bugs I’d ever heard of. Here’s the link:

    [url]http://mwomercs.com/forums/topic/117769-hud-bug-brief/[/url]

    Katankerus
    November 15, 2013 8:15 pm

    Your style of technical portrayal is very palatable for me, love reading your updates.

    ScorpyX
    November 15, 2013 8:46 pm

    Сongratulations

    Poolboy
    November 16, 2013 6:01 am

    The game is still on course for this year, right? Please say yes.

    mrbisonm
    November 16, 2013 8:19 am

    technical, technical…..issues I have no idea of, but I am glad you found a solid bug. It surely must feel good. ;) Btw, don’t bother with the “impatient” commenters, we all can wait for YOU to be ready.

    mrb

    SpaceChimp
    November 16, 2013 9:14 am

    I really enjoy these posts, they’re very interesting, thank you for taking the time to make them :) And the game is looking awesome, keep it up, and release it whenever YOU feel it’s done not when you feel like you have to (seems to be what you’re doing so it’s awesome :) )

    nelsonhaha
    November 16, 2013 10:04 am

    shut up and take my money

    Amir
    November 16, 2013 10:54 am

    i don’t know any thing about programming and stuff like thatm but when i read you devlog i feel progress and it makes me want this game even more. love your peace of work and i’m sure it will be great.

    Ryan
    November 16, 2013 11:46 am

    Thank you for this post. Really well-written, interesting story. Keep up the good work. Looking forward to release, late 2013 or beyond. I would much rather have the product you are happy with over something pushed out to meet an artificial deadline.

    Gerbrand
    November 16, 2013 11:59 am

    I really want this game 0.o I have basically been checking this website for the past month ;)

    James
    November 16, 2013 2:08 pm

    This looks great. Can’t wait. Pretty interesting to read through this process.

    Jonas
    November 16, 2013 11:09 pm

    send it to totalbiscuit :D

    Navoan
    November 18, 2013 4:01 am

    Cheers for another great DevLog, really nice incite into debugging!