Solving the Stupid Bug

A few weeks back I wrote about a very unusual bug that returned 500 errors for routes that should have been valid. Recently my team and I, to our great relief, discovered the cause of this bug. Of course, that wasn't before we ended up yelling "WTF?!" a great many times.

A set of letter cubes from the game Boggle!, spelling out

A great many times.

A New Failing Route

Most of the our collective WTFs occurred when we discovered the original problem. As I wrote about in the earlier post, we have an ASP.NET Web API application in which the routes /auth and /photos had been returning 500 errors whenever we tried to hit them. The odd thing was, the 500 errors always got returned immediately, even when there should have been a compilation delay. (Go read the whole post if you want the complete backstory).

We implemented a temporary fix by changing the routes for /auth and /photos to /authentication and /pictures, but that obviously is not a good long-term solution for something that should never have happened in the first place. (Not that I have a great track record with things that should have never happened in the first place, but hey, I'm an optimist.)

And then my coworker Diya noticed something that stunned us: now a new route was failing. The route /phones was also giving us the dreaded no-compilation-delay 500 errors.

Diya spent a day proving that /phones had the same issue that the other routes did, just like Nate and I had done a few weeks ago. And, sure enough, she came to the same conclusion: the route never responded for any request made to it.

But this didn't make sense to me. How could a route, which had been previously working, suddenly starting returning 500 errors? WTF?

Determining the Root Cause

I'd had just about enough of this. If our API's routes were going to randomly start failing for no damn reason, I needed to know why. I pestered our server team (which, admittedly, usually has bigger fish to fry than tracking down some fiddly route problem on a dev server) until they took a look at the server which hosted our API.

When they did finally peruse through the server, the told me that several other web applications existed there. They also sent me this screenshot:

A screenshot of the IIS application our API was running in.

In case this wasn't clear enough, let me make the exact problem a bit more obvious:

The same screenshot from IIS, with the offending three apps highlighted.

The server which hosted our API had empty web apps on it that corresponded exactly to the failing routes! So, whenever we tried to hit /auth or /phones or /photos, IIS would redirect that request to the corresponding app, and because that app wouldn't actually work, we'd get back 500 errors.

Hooray! We'd found the bug, and now we could finally kill it! Except...

Here's where the real fun begins. From our perspective, the route /phones (the route that Diya was working on) started failing after /auth and /photos (the routes that Nate and I discovered didn't work a few weeks ago). That, to me, suggests that the Phones application was created at a later time. But the server team swears up and down that all three apps were created at the same time.

How Did This Happen?

So now there are three possibilities. Either:

  1. The Phones web app was in fact created later, even though there's no record of that happening OR
  2. Something in our Continuous Integration/Continuous Deployment process is automatically creating these apps (which is extraordinarily unlikely) OR
  3. We never ran any code against the /phones route in the three weeks of testing we've been doing (or, if we did, didn't notice that it failed).

Occam's Razor suggests that the answer is (3). I personally don't see how that's possible with the amount of testing we've been doing over the last several weeks, but (1) and (2) are so much more unlikely that it's almost ludicrous to decide that either was the culprit. Logic suggests that we simply didn't test the /phones route, and in other news, I now hate logic.


We finally figured out why our routes were failing: there were empty applications on the server that matched the route names. Since those applications have been removed, we've been able to reset the routes to the names we wanted in the first place (/auth, /photos, and /phones). But now we're left with a worrying possibility: was an entire route (/phones) failing for three weeks and we didn't notice?

But, hey, at least it is fixed. That's good, I think. Right?

Happy Coding!

A Stupid Bug and a Plea for Help

UPDATE (17 Apr 2017): We've fixed the stupid bug described in this post! Check it out!

Bug hunts can take on many forms. Some are simple cleanup jobs, some take hours and end up being something stupid, and some look rather easy and turn out to be impossible. My group recently encountered a fourth kind, a bug where the solution is found but makes no sense, and we could use some help figuring out why this bug exists.

Birth of a (Stupid) Bug

Let me set the problem up. I had an ASP.NET Web API controller that looked something like this:

public class AuthController : ApiController  
    public IHttpActionResult Login(LoginUserRequest request)

    public IHttpActionResult TestOracle()

Naturally, this results in URLs that look like these:

  • /auth/login
  • /auth/testOracle

The bug was that any time I attempted to execute the /auth/login action, I got a response of 500 Internal Server Error. For those of you who might not work in the web world, that status message is about as helpful as a tent in a hurricane; it's a generic error message used when we don't know what actually happened (which is to say, it's used almost all the time). Worse, it didn't matter what inputs I POSTed, I always got the 500 error.

Now, it's not really a secret around here that I hate ORACLE, though sometimes unjustly. So when this bug started rearing its ugly head, I just assumed (apparently not learning from my mistakes) that ORACLE had something to do with it. I extracted the ORACLE part of the /auth/login method into the new /auth/testOracle method, and when that failed too, I thought I'd found the problem.

If you've read this blog for a while, you might see where this is going. You're right, and you're also wrong, just as I was.

You're right because I was making bad assumptions. Those assumptions caused me to waste two weeks blaming the problem on the ORACLE team. When we have problems like this, often it actually is ORACLE causing the problem, but not always. I was succumbing to lazy thinking by not investigating sooner than I did.

You are wrong (and, to be fair, so was I) because the actual problem was not something we could have predicted.

Tracking the Monster

My coworker Nate and I sat down early last week to try to figure this out once and for all. This bug appeared to only happen on our dev environment, not locally, so we would make incremental changes to the API, deploy that via our new CI/CD process, hit it with a known-working call to test that and make the app compile, and then hit the known-bad method. We did this over and over, for an entire morning.

Normally, when we hit this app, it has to compile before it can respond to requests (as do many ASP.NET applications). That means there's a delay of about 5-10 seconds before we get a response to the first request, and no such delay for any subsequent requests.

In my frustration, I made a deployment to the dev server and immediately tried to hit the known-bad method in my browser. The route to that method looks like this:

  • /auth/testOracle

While I was turned away from my monitor in disgust, Nate noticed that the method had responded immediately. There had been no compilation delay, which should not happen, but that wasn't even the worst of it. On a whim, we tried several more method calls in the same route:

  • /auth/login
  • /auth/getuserinfo
  • /auth/changepassword

Each of them had the same problem: they would respond immediately after a new deployment, with no compilation delay. Something very strange was going on here.

We conferred with the server admin group, and they swore up and down that there was nothing wrong with the configuration on this server. We inspected the network traffic, checked error logs, and found nothing. This thing completely stumped us.

As is normal when we go bug-hunting, we immediately tried to determine the commonality between all these failing methods. The only thing we could come up with was the route prefix /auth, which we changed to:

  • /authentication/login
  • /authentication/getuserinfo
  • /authentication/changepassword
  • /authentication/testOracle

And, to our considerable surprise, it worked! All our methods would now execute properly.

But WTF? Why would this only happen on the dev server, not locally? I figured it must be that "auth" was some kind of keyword on this server, and IIS was automatically redirecting any requests beginning with "/auth" to another location it couldn't find.

So, change "auth" to "authentication" and all's well, right? Not quite.

We had a lot of route prefixes in this app. Here's just a few of them:

  • /auth
  • /messages
  • /equipment
  • /photos
  • /userdata

All of them worked, except for /photos. The calls to the methods in that route had the exact same issue as the calls to the /auth route; they responded immediately with a 500 error.

A Plea for Help

So now we have two failing routes, /auth and /photos, and no idea how or why they are failing. Changing them to /authentication and /pictures makes them work, but the problem is still basically magic and I am loathe to leave it alone.

So, I'm asking for help from you, dear readers. Has anyone seen an issue like this before? Do you have any links or documents we could read to figure out why /auth and /photos, of all our 20+ route prefixes, are failing? This issue won't prevent us from finishing this project, but we'd truly be grateful for any assistance.

If you have any ideas, even ones you think we might have already tried, let us know in the comments.

Thanks, and Happy Coding!

How Do You Fix An Impossible Bug?

Within the span of an hour, it had all gone to hell.

The first deployment went rather smoothly. It was a fix to an existing web service, and went out with no problems, or so we thought. Within ten minutes of the deployment, the users started complaining of a minor bug, one that was seemingly omnipresent but didn't really stop them from doing meaningful work. The team which had sent out the deployment immediately set to work figuring out what was going on.

Unrelated to that deployment and forty minutes later, my team was launching a major change to a web site that was consuming that other team's web service. When our change went out, the bug that the users had been complaining about from the web service deployment vanished, replaced by a major bug that caused them to be unable to do any work at all. Naturally we were a little concerned.

The major bug caused an error message that included this statement:

"Unknown server tag 'asp:ScriptManager'."

Now usually when I see an error message like that, I think the project is missing a reference. But no, the references the project needed did, in fact, exist. The second thing I think is that it was using the incorrect version of .NET (in this case, 2.0). Wrong again, it's using the proper version. So now I'm a bit stumped; I pull one of my developers off his project to work on this, and he and I go to our manager and the web service team to try to hash this out.

It took the five of us about an hour to work out where exactly the problems were. As so often happens with major problems like this, it wasn't one problem but several, all intersecting at one time. They snowballed like so:

  • The previous day, my group attempted to roll out the major change to the web site. The roll out didn't work, and another manager (who had previously owned the project, and unbeknownst to me) immediately copied the application to a different location on the same server. He figured this would solve the problem, as it had before with a different app; it didn't.
  • Before the web service change went out, the users had already been notified to use the new location. Consequently they started complaining about the major error.
  • When the web service change was deployed, a different set of users within the same group complained about the minor bug, as word had not reached them to use the new location.
  • When our web site change went out (to the original location), the users of that site noticed and now complained of a "new" bug at the old location, despite the fact that it was the same bug as the bug at the new location.
  • All of this taken together meaning that the fact that our web site and their web service were related was a coincidence. The fact that the two deployments went out so near to each other had nothing to do with what the actual problem was. Coincidences like this are the worst possible thing that can happen when trying to find a bug.

Got all that?

Ultimately we worked out the problem. Well, really, we stumbled onto it. How we got there was such blind luck that I'm not convinced we actually solved the problem so much as lucked into a solution.

A bit of googling for the error message revealed this StackOverflow answer which states that the reason we get the above error is that a piece of the web.config is missing. Here's what it should be.

        <add tagPrefix="asp" namespace="System.Web.UI" assembly="System.Web.Extensions, Version=, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>

We had previously confirmed that in our application configuration that line did, in fact, exist. Obviously this could not be the problem. (I apparently forgot about what happened the last time I assumed something was true.) Later, when we started getting more desperate to find the source problem, I had our server team give me a copy of the app running on the production servers. This is what I found:

        <!--<add tagPrefix="asp" namespace="System.Web.UI" assembly="System.Web.Extensions, Version=, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>-->

Yes, you read that right. For some damn reason, the offending line of configuration code had been commented out.

And we had no idea how this happened.

We tried examining the config transforms; there was nothing that would result in that line being commented out. We tried looking at the server logs, the build logs, the source control history, anything that might give us a scrap of information as to how the hell this happened. We found nothing.

As you might imagine, this was a little frightening. This line got commented out, and we couldn't reproduce how. How can you fix a bug that should have never occurred in the first place? When we deployed the corrected configuration file, it worked, of course. But in the meantime we had wasted nearly an entire day looking for something that should have been impossible.

But was it impossible, or did we miss something? I'm inclined to believe the latter. One of the things I stress to my team when they come to me with bug reports is the important question of what changed? What changed between when the system worked and when it didn't? Was it business rules, data sources, the build process? If we can determine what changed, the time needed to pinpoint the actual bug shrinks dramatically. In this case, either we simply didn't know what had changed (the most likely scenario) or nothing had changed (the far scarier scenario). Either way, something was off and we couldn't determine what it was.

What was even more worrisome was that there had been a minor bug reported before the major bug showed up, the one that was annoying but not work-stopping. That minor bug was not reproducible now, so it's entirely possible it's still lurking out there, waiting for some unsuspecting user to click the wrong button or enter the wrong date and then blow up the whole system. We don't think it will be that serious, but we also don't really know that it won't. We're stuck in bug-induced purgatory.

That's a terrible feeling. Something went wrong, and you can't figure out how. You know what the solution is, but you don't know why.

I suppose I should just be happy we figured out the problem. I am, sort of. And yet I am forced to conclude that there exists a bug which caused a critical part of our application configuration to be commented out for no discernible reason. A part of me still wonders: how can I find and fix that bug, a bug that should have never existed in the first place?

And what if it happens again?

What about you, dear readers? What are some bugs you found that you couldn't source? How did you fix these "impossible" bugs? Share in the comments!

Happy Coding Debugging!

Post image is 2008 broken computer monitor from Wikimedia, used under license.

Ancient ORACLE, Modern IIS, and a Failure to RTFM

My coworkers Rajit and Dave and I have been spending an inordinate amount of time over the last few days looking for a really annoying error in one of our apps. The exception message reads like this:

Could not load file or assembly 'Oracle.DataAccess' or one of its dependencies. An attempt was made to load a program with an incorrect format.

Yep, that's right, we're using ORACLE as our database. I assure you, this is not by choice. We've known for a while that our version of ORACLE is so old that it can remember when it played with two sticks and a rock (and was happy, dammit). Problem was, we assumed that it had to be causing this issue, though we ultimately determined that it wasn't responsible. Even better, in the process of squashing this bug we discovered that someone had probably found the fix for this issue already and utterly failed to inform us about it.

Sun Temple With Ancient Artwork, used under license

Here's the situation: the app we're building is a rewrite of a much-loved internal app for our company, using a more modern tech stack: MVC5, Dapper, etc, all of which connect to an ORACLE database which, for reasons too convoluted to discuss here, we cannot refactor or remove. On our local machines and on our "Beta" server (which is where we have our QA department do their tests), the app works fine. It connects to the ORACLE database, runs queries, returns data, the works. But when we move the app to our staging server (which runs IIS 7 and connects to production databases to allow us to do sanity-checks before deployment) the app completely loses its mind and throws the above error.

Rajit, Dave, and I figured that the issue had to be either something to do with the app pool settings or some stupid coding mistake having to do with the ancient ORACLE DLLs. So we and the server admins started poring over the app's code and the IIS setup, scanning for some simple clue as to how to drive out the insanity that ancient ORACLE and modern IIS had inflicted upon us.

And we found nothing. At least, nothing obvious.

So, in desperation, we resorted to turning things off to see if that fixed the problem. That lead to us ultimately finding, after a day and half of searching, the very setting that had been causing all of the strife we'd been experiencing this project. This bug was so subtle, so innocuous, that none of us thought to look there. In fact, all of us presumed that it couldn't even be the issue. We prematurely concluded that the newest, shiniest thing (IIS 7) was less likely to be the cause of the problem than primitive, creaky ORACLE.

See where this is going?

After much frustration, we eventually located the cause of the bug; it was a setting in the IIS app pool called "Enable 32-bit Applications". In other words, the source of all of our grief for 36 hours was this thing:

A screenshot of an app pool's settings dialog in IIS, with

Here, let me make that a little more obvious.

A screenshot of an app pool's settings dialog in IIS, with

Perhaps I should explain.

ORACLE cares about the bitness of the machine it is running on, meaning whether said machine runs a 32-bit or 64-bit operating system. The 32-bit DLLs for ORACLE do not work on a 64-bit OS, and vice-versa. We figured that we were in the clear, though, since the production server's OS that was running our app is 64-bit and the app itself is using the 64-bit ORACLE DLLs.

(As a side note, ORACLE caring about bitness was the reason I wrote Using 64-bit IIS Express in Visual Studio, since everyone on my team needed to flip that switch in order to be able to debug the app locally.)

Did you catch what the description of that "Enable 32-bit Applications" setting actually said, though?

If set to true for an application pool on a 64-bit operating system, the worker process(es) serving the application pool will be in WOW64 (Windows on Windows64) mode. Processes in WOW64 mode are 32-bit processes that load only 32-bit applications.

Well, crap. Maybe if we'd slowed down a bit and RTFM we'd have saved ourselves some headache.

This was, obviously, reproducible in both the "Beta" and production environments. The reason that our app didn't immediately fail on our "Beta" server was that, on said server, the application was in its own app pool with "Enable 32-bit Applications" disabled. Apparently some idiot already ran into this issue months ago and forgot to tell us about it, seeing as we couldn't come up with another reason why the app pool was set up this way (and, y'know, the server logs told us such). Regardless, once we set up the application on production in its own app pool with the setting disabled, it worked like a charm.

So, yeah, I hate this stupid-ancient version of ORACLE we're forced to us (even though it wasn't the cause of the problem). It's old, it's slow, it requires us to care about whether the server it's running on has a 32-bit or 64-bit architecture. It's tacky and I hate it.

But mostly, I hate that there is someone out there, in our own company, who took the time to read the friggin manual, figure this nightmare out, and then proceed to not leave a note to inform us devs that this dumb little IIS setting would be the root cause of this infuriating bug. If you've solved a problem that others might encounter, and the solution was non-obvious, can't you leave a comment, a sign, a carrier pigeon, something to tell the people that come after you about it? Is it really that difficult to do?

That guy? Yeah, I hate that guy the most.

Of course, that guy is me. So say the server logs. Take that, me!