Within the span of an hour, it had all gone to hell.
The first deployment went rather smoothly. It was a fix to an existing web service, and it went out with no problems, or so we thought. Within ten minutes of the deployment, the users started complaining of a minor bug, one that was seemingly omnipresent but didn't really stop them from doing meaningful work. The team that had sent out the deployment immediately set to work figuring out what was going on.
Unrelated to that deployment and forty minutes later, my team was launching a major change to a web site that was consuming that other team's web service. When our change went out, the bug that the users had been complaining about from the web service deployment vanished, replaced by a major bug that caused them to be unable to do any work at all. Naturally we were a little concerned.
The major bug caused an error message that included this statement:
"Unknown server tag 'asp:ScriptManager'."
Now usually when I see an error message like that, I think the project is missing a reference. But no, the references the project needed did, in fact, exist. The second thing I think is that it's using the incorrect version of .NET (in this case, 2.0). Wrong again: it was using the proper version. So now I'm a bit stumped; I pull one of my developers off his project to work on this, and he and I go to our manager and the web service team to try to hash this out.
It took the five of us about an hour to work out where exactly the problems were. As so often happens with major problems like this, it wasn't one problem but several, all intersecting at one time. They snowballed like so:
- The previous day, my group attempted to roll out the major change to the web site. The rollout didn't work, and another manager, who had previously owned the project, immediately (and without my knowledge) copied the application to a different location on the same server. He figured this would solve the problem, as it had before with a different app; it didn't.
- Before the web service change went out, the users had already been notified to use the new location. Consequently they started complaining about the major error.
- When the web service change was deployed, a different set of users within the same group complained about the minor bug, as word had not reached them to use the new location.
- When our web site change went out (to the original location), the users of that site noticed and now complained of a "new" bug at the old location, despite the fact that it was the same bug as the bug at the new location.
- Taken together, all of this meant that the relationship between our web site and their web service was a coincidence. The fact that the two deployments went out so close to each other had nothing to do with what the actual problem was. Coincidences like this are the worst possible thing that can happen when trying to find a bug.
Got all that?
Ultimately we worked out the problem. Well, really, we stumbled onto it. How we got there was such blind luck that I'm not convinced we actually solved the problem so much as lucked into a solution.
A bit of googling for the error message revealed this Stack Overflow answer, which states that we get the above error when a piece of the web.config is missing. Here's what it should be:
<pages>
  ...
  <controls>
    <add tagPrefix="asp" namespace="System.Web.UI" assembly="System.Web.Extensions, Version=18.104.22.168, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>
  </controls>
</pages>
We had previously confirmed that in our application configuration that line did, in fact, exist. Obviously this could not be the problem. (I apparently forgot about what happened the last time I assumed something was true.) Later, when we started getting more desperate to find the source problem, I had our server team give me a copy of the app running on the production servers. This is what I found:
<pages>
  ...
  <controls>
    <!--<add tagPrefix="asp" namespace="System.Web.UI" assembly="System.Web.Extensions, Version=22.214.171.124, Culture=neutral, PublicKeyToken=31bf3856ad364e35"/>-->
  </controls>
</pages>
Yes, you read that right. For some damn reason, the offending line of configuration code had been commented out.
And we had no idea how this happened.
We tried examining the config transforms; there was nothing that would result in that line being commented out. We tried looking at the server logs, the build logs, the source control history, anything that might give us a scrap of information as to how the hell this happened. We found nothing.
As you might imagine, this was a little frightening. This line got commented out, and we couldn't reproduce how. How can you fix a bug that should have never occurred in the first place? When we deployed the corrected configuration file, it worked, of course. But in the meantime we had wasted nearly an entire day looking for something that should have been impossible.
But was it impossible, or did we miss something? I'm inclined to believe the latter. One of the things I stress to my team when they come to me with bug reports is one important question: what changed? What changed between when the system worked and when it didn't? Was it business rules, data sources, the build process? If we can determine what changed, the time needed to pinpoint the actual bug shrinks dramatically. In this case, either we simply didn't know what had changed (the most likely scenario) or nothing had changed (the far scarier scenario). Either way, something was off and we couldn't determine what it was.
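For a configuration file, one cheap way to ask "what changed?" is to diff the copy you expected to deploy against the copy actually running on the server. What follows is only a minimal sketch in Python, not something we ran at the time. The function name `config_diff` and the `expected/` and `deployed/` labels are made up for illustration, and it assumes you can get both files as text.

```python
import difflib


def config_diff(expected_text, deployed_text):
    """Return unified-diff lines between the expected and deployed config text.

    An empty list means the two texts are line-for-line identical.
    The file labels below are illustrative only.
    """
    return list(difflib.unified_diff(
        expected_text.splitlines(),
        deployed_text.splitlines(),
        fromfile="expected/web.config",
        tofile="deployed/web.config",
        lineterm="",
    ))


if __name__ == "__main__":
    # Trimmed, hypothetical samples standing in for the real files.
    expected = '<controls>\n  <add tagPrefix="asp"/>\n</controls>'
    deployed = '<controls>\n  <!--<add tagPrefix="asp"/>-->\n</controls>'
    for line in config_diff(expected, deployed):
        print(line)
```

A diff like this won't say who made the change or why, but it narrows the search to the lines that actually differ.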
What was even more worrisome was that there had been a minor bug reported before the major bug showed up, the one that was annoying but not work-stopping. That minor bug was not reproducible now, so it's entirely possible it's still lurking out there, waiting for some unsuspecting user to click the wrong button or enter the wrong date and then blow up the whole system. We don't think it will be that serious, but we also don't really know that it won't. We're stuck in bug-induced purgatory.
That's a terrible feeling. Something went wrong, and you can't figure out how. You know what the solution is, but you don't know why.
I suppose I should just be happy we figured out the problem. I am, sort of. And yet I am forced to conclude that there exists a bug which caused a critical part of our application configuration to be commented out for no discernible reason. A part of me still wonders: how can I find and fix that bug, a bug that should have never existed in the first place?
And what if it happens again?
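One way to at least catch a repeat is a small post-deployment check that confirms the control registration is present and not commented out. The sketch below is in Python and is an assumption on my part, not part of our actual process; `has_active_registration` and the sample strings are hypothetical, and the sample assembly attribute is trimmed. It relies on the fact that Python's standard `xml.etree.ElementTree` parser discards comments by default, so a commented-out `<add>` simply doesn't show up.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def has_active_registration(config_xml, tag_prefix="asp",
                            assembly_name="System.Web.Extensions"):
    """Return True if an uncommented <add> under a <controls> element
    registers tag_prefix from the named assembly."""
    root = ET.fromstring(config_xml)  # comments are dropped while parsing
    for element in root.iter():
        if local_name(element.tag) != "controls":
            continue
        for child in element:
            if local_name(child.tag) != "add":
                continue
            # The assembly attribute starts with the simple assembly name.
            assembly = child.get("assembly", "").split(",")[0].strip()
            if child.get("tagPrefix") == tag_prefix and assembly == assembly_name:
                return True
    return False


if __name__ == "__main__":
    # Hypothetical, trimmed config with the registration commented out.
    broken = (
        "<configuration><system.web><pages><controls>"
        '<!--<add tagPrefix="asp" namespace="System.Web.UI" '
        'assembly="System.Web.Extensions"/>-->'
        "</controls></pages></system.web></configuration>"
    )
    if not has_active_registration(broken):
        print("asp tag prefix is not registered in the deployed config")
```

It wouldn't explain how the line got commented out, but it would turn a day of hunting into a failed check right after the deployment.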
What about you, dear readers? What are some bugs you found that you couldn't source? How did you fix these "impossible" bugs? Share in the comments!