Ancient ORACLE, Modern IIS, and a Failure to RTFM

My coworkers Rajit and Dave and I have been spending an inordinate amount of time over the last few days tracking down a really annoying error in one of our apps. The exception message reads like this:

Could not load file or assembly 'Oracle.DataAccess' or one of its dependencies. An attempt was made to load a program with an incorrect format.

Yep, that's right, we're using ORACLE as our database. I assure you, this is not by choice. We've known for a while that our version of ORACLE is so old that it can remember when it played with two sticks and a rock (and was happy, dammit). Problem was, we assumed that it had to be causing this issue, though we ultimately determined that it wasn't responsible. Even better, in the process of squashing this bug we discovered that someone had probably found the fix for this issue already and utterly failed to inform us about it.

Sun Temple With Ancient Artwork, used under license

Here's the situation: the app we're building is a rewrite of a much-loved internal app for our company, using a more modern tech stack: MVC5, Dapper, etc., all of which connect to an ORACLE database that, for reasons too convoluted to discuss here, we cannot refactor or remove. On our local machines and on our "Beta" server (which is where we have our QA department do their tests), the app works fine. It connects to the ORACLE database, runs queries, returns data, the works. But when we move the app to our staging server (which runs IIS 7 and connects to production databases to allow us to do sanity checks before deployment), the app completely loses its mind and throws the above error.

Rajit, Dave, and I figured that the issue had to be either something to do with the app pool settings or some stupid coding mistake having to do with the ancient ORACLE DLLs. So we and the server admins started poring over the app's code and the IIS setup, scanning for some simple clue as to how to drive out the insanity that ancient ORACLE and modern IIS had inflicted upon us.

And we found nothing. At least, nothing obvious.

So, in desperation, we resorted to turning things off to see if that fixed the problem. That led to us ultimately finding, after a day and a half of searching, the very setting that had been causing all of the strife we'd been experiencing on this project. This bug was so subtle, so innocuous, that none of us thought to look there. In fact, all of us presumed that it couldn't even be the issue. We prematurely concluded that the newest, shiniest thing (IIS 7) was less likely to be the cause of the problem than primitive, creaky ORACLE.

See where this is going?

After much frustration, we eventually located the cause of the bug; it was a setting in the IIS app pool called "Enable 32-bit Applications". In other words, the source of all of our grief for 36 hours was this thing:

A screenshot of an app pool's settings dialog in IIS, with the "Enable 32-bit Applications" setting visible

Here, let me make that a little more obvious.

A screenshot of an app pool's settings dialog in IIS, with the "Enable 32-bit Applications" setting highlighted

Perhaps I should explain.

ORACLE cares about the bitness of the machine it is running on, meaning whether said machine runs a 32-bit or 64-bit operating system. The 32-bit DLLs for ORACLE do not work on a 64-bit OS, and vice versa. We figured that we were in the clear, though, since the OS on the production server running our app is 64-bit and the app itself uses the 64-bit ORACLE DLLs.
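Incidentally, .NET will happily tell you which mode a process actually ended up in. Here's a minimal diagnostic sketch using the standard Environment APIs; the class and method names are my own, and where you surface the output (a log, a debug page) is up to you:

using System;

public static class BitnessCheck
{
    // If the OS is 64-bit but the process is not, you're running under WOW64,
    // and the 64-bit ORACLE DLLs will fail to load with the dreaded
    // "incorrect format" error.
    public static string Report()
    {
        return string.Format(
            "OS is 64-bit: {0}; process is 64-bit: {1}",
            Environment.Is64BitOperatingSystem,
            Environment.Is64BitProcess);
    }
}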

(As a side note, ORACLE caring about bitness was the reason I wrote Using 64-bit IIS Express in Visual Studio, since everyone on my team needed to flip that switch in order to be able to debug the app locally.)

Did you catch what the description of that "Enable 32-bit Applications" setting actually said, though?

If set to true for an application pool on a 64-bit operating system, the worker process(es) serving the application pool will be in WOW64 (Windows on Windows64) mode. Processes in WOW64 mode are 32-bit processes that load only 32-bit applications.

Well, crap. Maybe if we'd slowed down a bit and RTFM'd, we'd have saved ourselves some headache.

This setting exists, obviously, in both the "Beta" and production environments. The reason that our app didn't immediately fail on our "Beta" server was that, on said server, the application was in its own app pool with "Enable 32-bit Applications" disabled. Apparently some idiot already ran into this issue months ago and forgot to tell us about it, seeing as we couldn't come up with another reason why the app pool was set up this way (and, y'know, the server logs told us as much). Regardless, once we set up the application on production in its own app pool with the setting disabled, it worked like a charm.
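For what it's worth, the fix itself is one property. You can flip it in the IIS Manager UI, or script it; here's a rough sketch using the Microsoft.Web.Administration API (the app pool name is a placeholder, and this needs to run elevated on the server):

using Microsoft.Web.Administration; // reference Microsoft.Web.Administration.dll

class FixAppPool
{
    static void Main()
    {
        using (var serverManager = new ServerManager())
        {
            // "MyAppPool" is a placeholder; substitute your app pool's real name.
            var pool = serverManager.ApplicationPools["MyAppPool"];
            pool.Enable32BitAppOnWin64 = false; // the one property behind 36 hours of grief
            serverManager.CommitChanges();
        }
    }
}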

So, yeah, I hate this stupid-ancient version of ORACLE we're forced to use (even though it wasn't the cause of the problem). It's old, it's slow, and it requires us to care about whether the server it's running on has a 32-bit or 64-bit architecture. It's tacky and I hate it.

But mostly, I hate that there is someone out there, in our own company, who took the time to read the friggin manual, figure this nightmare out, and then proceed to not leave a note to inform us devs that this dumb little IIS setting would be the root cause of this infuriating bug. If you've solved a problem that others might encounter, and the solution was non-obvious, can't you leave a comment, a sign, a carrier pigeon, something to tell the people that come after you about it? Is it really that difficult to do?

That guy? Yeah, I hate that guy the most.

Of course, that guy is me. So say the server logs. Take that, me!

The Bug Is In Your Code

Have you ever been hunting a bug and been absolutely sure that it was in someone else's code, only to find out that, nope, it was in yours all along? I sure have. Come along with me as we explore my latest minor failure and remind ourselves that, most of the time, the bug is in your code.

An Emperor Gum Moth, a large, brown, hairy insect, seen in flight. Emperor Gum Moth from Wikimedia, used under license

Well, hopefully that bug isn't in your code, but you never know.

Bad Assumptions, Bad Actions

We inherited a new application, a chat system built using SignalR and .NET 4.5, which we needed to deploy to a brand-new IIS8 server box. The system consists of two pieces: a hub, which acts as the "server" for the broadcast system, and the client web application, which receives broadcasts. At the time we inherited the system, the hub had already been deployed; we just needed to get the client app onto the correct server and we'd be good to go.

The deployment itself went off without a hitch; the web app was accessible to the right people, let us log in, basically worked. Or so it seemed.

As part of sanity-checking the deployment, we programmers logged in and tried to create conversations and send messages through the chat system. We could create conversations just fine, but after the first message was sent, no further messages could be transmitted. A quick debugging session showed that the SignalR piece on the web client was failing at some point while trying to reach the hub.

We checked and double-checked the web client settings and couldn't find anything that might cause this error. That's when we went to the server team, pointed out our problem, and asked if they could help us figure out what exactly was going on. Clearly this was a server or connectivity issue; the app was deployed onto a clean IIS8 machine which may not have gotten the correct settings applied, and SignalR was failing to connect. That means network problems, right?

Except, the server team couldn't find anything wrong. They did whatever server magic they do (I'm SO not a server guy) and didn't detect any problems. Yet our debugging suggested that the error was in the network; possibly permissions or access-related. For a few days we went back-and-forth over this, arguing and haggling and posturing until we were so bloody tired of even talking to the other side.

I went back to try to debug this again, still flustered because I was convinced that this was a network problem, but the network team couldn't find any explanations for why we might be getting this error. Come on, network guys, it's obviously some kind of network issue, and you can't even find it?! This can't be my fault! Bewildered (and kinda angry) at their lack of insight, I stepped through the web client code once more.

It CAN Be Your Fault

It was during this bug hunting session that I decided to re-read the actual text of the connection error message I was seeing:

SignalR: chathub.[MethodName] failed to execute. Error: the underlying provider failed on Open.

Hmmm, I thought, that error only appears after initiating a hub request; seems to take about a minute to appear. So it clearly is hitting the hub, it's just not getting any response back. Now that I'm thinking about it, I've often seen that "underlying provider failed on Open" error when something fails to connect to a database... Crap.

It was like a lightbulb came on in my head.

The Centennial Light, which has been on continuously since 1901

I quickly started rummaging through the hub settings (remember that it was deployed before we took ownership of this application). Sure enough, the profile of the hub that had been deployed to production was using a connection string that connected it to our staging database server, not the production one. With the wrong connection string, the hub couldn't connect to the database (since in our company, production servers cannot contact staging ones, and vice-versa), hence it was throwing that error. We could create conversations because that happened in the web app, which had the correct production connection string, but sending messages after creation of the conversation required the hub to work.
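To make the failure concrete, here's a stripped-down sketch of what a hub like ours might look like. All of these names are hypothetical, and I'm assuming SignalR 2 with Entity Framework, since "the underlying provider failed on Open" is EF's way of saying it couldn't open a database connection:

using System.Data.Entity;
using Microsoft.AspNet.SignalR;

public class Message
{
    public int Id { get; set; }
    public int ConversationId { get; set; }
    public string Text { get; set; }
}

public class ChatContext : DbContext
{
    // "ChatDb" is a hypothetical connection string name read from web.config;
    // this is the value that was pointing at staging instead of production.
    public ChatContext() : base("name=ChatDb") { }
    public DbSet<Message> Messages { get; set; }
}

public class ChatHub : Hub
{
    public void SendMessage(int conversationId, string text)
    {
        using (var db = new ChatContext())
        {
            db.Messages.Add(new Message { ConversationId = conversationId, Text = text });

            // If the connection string points at an unreachable server, EF throws
            // "The underlying provider failed on Open" here, once the connection
            // attempt times out; that's the minute-long delay we kept seeing.
            db.SaveChanges();
        }

        // Broadcast to connected clients (a dynamic SignalR client callback).
        Clients.All.addMessage(conversationId, text);
    }
}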

All of this, of course, meant that the error was not the server team's fault. It was mine, as my team (including me as the lead developer) was responsible for this app now. I didn't deploy the hub, and I didn't write it, but this bug was my responsibility, and I failed to research it enough.

This is what happens when we assume the solution and don't take the time to do our own thorough research. I was so certain that the hub had been deployed correctly and that our settings were correct that the only remaining possible issue, to my mind, was the network settings. After reaching that erroneous conclusion, I became blinded to the possibility that the bug was still in my code.

Of course, it was in my code, and of course, I blamed other people for what was ultimately my responsibility. Not exactly my finest moment, I'll admit.

This is yet more proof that we're all (especially me) not as smart as we think we are. Most of the time, the bug is in our code. Not in a library, not in some other person's service, not in the network, not "somewhere else". In our code.

Jeff Atwood put this another way: he wrote that the first rule of programming is that it is always your fault. If you assume that the problem exists in code you are responsible for, you are more likely to actually locate the bug (because most of the time, it genuinely is there). Even if you don't locate the bug, you've learned enough about the system in the process that you should have a much better idea of where it might exist.

Assume that the bug is in your code! Only after you've proven that it isn't can you start to explore other possibilities.

We've all seen how my forgetting to check my code thoroughly ended up making me look like an idiot. Has this ever happened to you, or even some "friend" that you know (wink wink)? Share in the comments!

The Bug Hunt From Hell (Or, Why You Shouldn't Forget About Caching)

We have an in-house media server that stores internal images, videos, etc. This project has many components, but the major ones are:

  • A database, where information about the media items is stored.
  • A service layer, which allows multiple apps to use the server and database.
  • A private site (which talks to the database through the service layer) that allows certain individuals to upload media items.

With lots of pieces, something was bound to break eventually. This week it did, and the resulting bug hunt almost had me convinced that I'd gone insane.

A sign reading "Insanity". Occupy Sign by ERAGON, used under license

The Media Server Setup

For most of the past three years, the Media Server has been running along very smoothly, with no really big issues. Lately, though, we've been getting some complaints about this application, specifically that deleting items from the server didn't actually delete the items; they were still visible to the user that had uploaded them.

I assigned my coworker Diana to investigate this issue. Because I was the most recent person to actually work on the Media Server, I told her to come to me with any questions she had in the process of tracking down this bug.

In regular usage of the Media Server's private upload site, the flow goes like this: when you first log into the site, you are presented with a list of media items that you uploaded ("My Uploaded Items"). You can then apply filters (like videos only, or images only) and sorting, and each of these invokes a specialized query for that scenario.

Here's what Diana did to try to find the bug.

  1. First, she ran the query showing all of her uploaded items. This did not show the video she was testing with (which is what we expected, since she had not uploaded it yet).
  2. Next, she uploaded the test video.
  3. The uploaded video did not appear on her "My Uploaded Items" view. She switched the filter to "only videos" and the newly-uploaded video now appeared.
  4. She deleted the video.
  5. When she looked at the "My Uploaded Items" view the now-deleted video did not appear (which, by itself, is the correct behavior).
  6. But when she switched the filter to "only videos" the deleted video was still being shown.

So, was the video deleted or not? We spent a little over an hour trying to work that out.

The Bug Hunt From Hell

We started out doing the normal debugging things: double-checking settings, testing the connection strings, generally figuring out if we'd missed anything obvious. When that failed to locate the bug, we moved on to increasingly desperate ideas. Eventually, one of us figured out that this particular bug only made sense if it existed in the service layer; the database layer had no issues when we tested it separately from the rest of the stack, and the private site couldn't logically cause the bug since it only read the data it received from the service, so the only thing left was to test the service itself.

Sure enough, our testing revealed that on Diana's machine, the bug was entirely reproducible. The video that she had deleted would still appear in her "only videos" filtered query when running the private upload site locally. Being the good testers we are, we pulled up WCFTestClient and started hitting the service layer that way.

When we tested the "only videos" query through WCFTestClient, the result set did not include the deleted video. So if we ran Diana's instance locally, we saw the deleted video (which is a bug), but if we executed the same service call through WCFTestClient, we did not see the video (the proper execution). But it was the same damn method, with identical inputs. We were executing the same call twice and getting different results; isn't that the definition of insanity?

As I am not (as far as I know) actually insane, I was flabbergasted. Desperate for some insight, I started combing through the code, line by line. Eventually, we came to the service class (which is an older ASMX service), where we encountered this lovely little snippet:

[WebMethod(CacheDuration = 86400)]
public List<Items> MethodName(...)
{ }

For those of you who may have never seen that attribute before, CacheDuration in this case is setting the results of this call to be cached for 86400 seconds, or 24 hours.

Dissecting The Bug

Here was our problem. Since the caching was set to such a long duration, the first time Diana ran the "My Uploaded Items" query it did not show the testing video, since the video did not exist yet. Then, when she uploaded said video, "My Uploaded Items" would still not show the new item because the results of that query were cached, and those results did not include the new video.

Further, the "only videos" query was run after the video was uploaded, but before the video was deleted. Consequently, the video continued to show in that query because the results were cached.

A check of the deployment logs showed a recent deployment for the media service, which is when the complaints about videos not deleting started showing up. Sure enough, that deployment included this cache setting. The reason the cache was set to last for so long was that we have a lot of reads from other, external services that are hitting that method, and for performance reasons we didn't want to run that query every time we needed the same set of videos. The real solution here is to change the architecture so that the upload site isn't using the same methods and caching rules as the external services, but for now setting the cache down to 15 minutes seems to be a reasonable compromise.
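In code, that compromise is a one-line change to the attribute we found earlier (signature elided as before; 900 seconds is 15 minutes):

[WebMethod(CacheDuration = 900)] // down from 86400 (24 hours)
public List<Items> MethodName(...)
{ }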

Anyway, that's my bug hunt from hell story of the week. Now I'll have to remember to check for crazy cache settings as part of my debugging process. What kinds of bugs have you run into? Did any of them result in a crazy hunt? Share in the comments!

The Two-Step "Do I Need To Fix This Right Now?" Questionnaire

Almost every day one of my teammates will come to me with a "bug" or problem they discovered and want to know if we should fix it as part of the next deployment, or wait for the QA department (whom we jokingly refer to as the bug hunting team) to do some more thorough testing and research. Each time this happens, I invariably ask them the same two questions:

  1. What's the problem?
  2. Why should we care?

These two questions have served me faithfully for a long time. They encapsulate everything I need to know before I can make a decision as to whether or not the discovered bug needs to be fixed right now.

What's the Problem?

Seems like a simple enough question, right? What's the issue here? This first question's value is easy to pin down: we have to know what the problem is before we can hope to fix it.

There's a trick to the question, though. The trick is that most of the time I already know (or can guess) the answer and just want my teammate to explain it out loud as a way of working through the problem for themselves. I find that reasoning through problems out loud helps to work out what possible solutions exist for that particular problem (I've heard this called rubber-duck debugging). In my experience, it often helps the person doing the talking reach a solution or, at least, eliminate some solutions that won't actually fix the problem.

Mostly, I ask the question and then just listen, because often they'll work out the problem themselves. But even if they have a solution, I still need to be convinced that I should even care about this bug, and that's where the second question comes into play.

Why Should We Care?

The second question serves a whole host of wonderful purposes at the same time, because it is very general. By asking this question, I'm implicitly making a demand: Make me care enough to devote time and resources to fixing this bug. If they can't find a reason to make me care, we postpone the fix (or possibly skip it entirely).

Oftentimes I, as the "lead" developer, don't get presented with bugs like these unless the person who found them already has a workable solution, at least in theory. If that solution is feasible, relatively easy, and has a large impact, well then we just might end up doing that. Most of the time, though, the decision is not that simple.

For example, say a junior programmer on my team finds a "bug" that affects an extremely small number of users. This particular "bug" is in a system that has been deployed for two years, and in that timeframe we've received no complaints and no issues regarding that particular section of the code. In this case, I may not care enough to fix the bug; it would take too much time for too little benefit.

For another example, say a teammate finds a bug in a feature that is about to be deployed. This feature will only be used by a few people, but the bug will probably impact their ability to do their jobs should we let it get past us. In this situation, I'd be hard-pressed to let this bug slide, since the impact it is going to have is measurable and not trivially small.

Much of the time, the decision falls in between those two extremes. Perhaps we know what the bug is, but don't know enough to be able to determine what the impact of fixing it (or of leaving it alone) would be. Perhaps we think there's a bug, but we can't prove it without devoting extra time to pinning it down. There are a lot of possibilities.

The point of all this is that if we're going to divert time and resources that we could use to actually deliver a product into fixing a bug instead, it had better be worth fixing. We all have a limited amount of time, one way or another, and justifying how we spend it is why I need this two-step process. The decision to divert resources is rarely a simple one, and these two questions help me narrow down what actions my group and I can take.

Next time you find yourself presented with a bug that might need to be fixed, or maybe you've found a bug that someone else has to make the call on, try asking: "What's the problem, and why should I care?"