We have an in-house media server that stores internal images, videos, etc. This project has many components, but the major ones are:

  • A database, where information about the media items is stored.
  • A service layer, which allows multiple apps to use the server and database.
  • A private site (which talks to the database through the service layer) that allows certain individuals to upload media items.

With lots of pieces, something was bound to break eventually. This week it did, and the resulting bug hunt almost had me convinced that I'd gone insane.

A sign reading Insanity Occupy Sign by ERAGON, used under license

The Media Server Setup

For most of the past three years, the Media Server has been running along very smoothly, with no really big issues. Lately, though, we've been getting some complaints about this application, specifically that deleting items from the server didn't actually delete the items; they were still
visible to the user that had uploaded them.

I assigned a task to my coworker Diana to investigate this issue. Because I was the most recent person to actually work on Media Server, I told her to come to me with any questions she had in the process of tracking down this bug.

In regular usage of the Media Server's private upload site, the flow goes like this: when you first log into the site, you are presented with a list of media items that you uploaded ("My Uploaded Items"). You can then apply filters (like videos only, or images only) and sorting, and each of these invokes a specialized query for that scenario.

Here's what Diana did to try to find the bug.

  1. First, she ran the query showing all of her uploaded items. This did not show the video she was testing with (which is what we expected, since she had not uploaded it yet).
  2. Next, she uploaded the test video.
  3. The uploaded video did not appear on her "My Uploaded Items" view. She switched the filter to "only videos" and the newly-uploaded video now appeared.
  4. She deleted the video.
  5. When she looked at the "My Uploaded Items" view the now-deleted video did not appear (which, by itself, is the correct behavior).
  6. But when she switched the filter to "only videos" the deleted video was still being shown.

So, was the video deleted or not? We spent a little over an hour trying to work that out.

The Bug Hunt From Hell

We started out doing the normal debugging things: double-checking settings, testing the connection strings, generally figuring out if we'd missed anything obvious. When that failed to locate the bug, we moved to increasingly desperate ideas. Eventually, one of us figured out that this particular bug only made sense if it existed in service layer; the database layer had no issues when we tested it separately from the entire stack, and the private site couldn't logically cause the bug since it only read the data it received from the service, so the only thing left was to test the service itself.

Sure enough, our testing revealed that on Diana's machine, the bug was entirely reproducible. The video that she had deleted would still appear in her "only videos" filtered query when running the Admin app locally. Being the good testers we are, we pulled up WCFTestClient and started hitting the service layer that way.

When we tested the "only videos" query through WCFTestClient, the result set did not include the deleted video. So if we ran Diana's instance locally, we saw the deleted video (which is a bug), but if we executed the same service call through WCFTestClient, we did not see the video (the proper execution). But it was the same damn method, with identical inputs. We were executing the same call twice and getting different results; isn't that the definition of insanity?

As I am not (as far as I know) actually insane, I was flabbergasted. Desperate for some insight, I started combing through the code, line by line. Eventually, we came to the service class (which is an older ASMX service), where we encountered this lovely little snippet:

[WebMethod(CacheDuration=86400)]
public List<Items> MethodName(...)  
{ }

For those of you who may have never seen that attribute before, CacheDuration in this case is setting the results of this call to be cached for 86400 seconds, or 24 hours.

Dissecting The Bug

Here was our problem. Since the caching was set to such a long duration, the first time Diana ran the "My Uploaded Items" query it did not show the testing video, since it did not exist. Then, when she uploaded said video, "My Uploaded Items" would still not show the new item because the results of that query were cached, and those results did not include the new video.

Further, the "only videos" query was run after the video was uploaded, but before the video was deleted. Consequently, the video continued to show in that query because the results were cached.

A check of the deployment logs showed a recent deployment for the media service, which is when the complaints about videos not deleting started showing up. Sure enough, that deployment included this cache setting. The reason the cache was set to last for so long was that we have a lot of reads from other, external services that are hitting that method, and for performance reasons we didn't want to run that query every time we needed the same set of videos. The real solution here is to change the architecture so that the upload site isn't using the same methods and caching rules as the external services, but for now setting the cache down to 15 minutes seems to be a reasonable compromise.

Anyway, that's my bug hunt from hell story of the week. Now I'll have to remember to check for crazy cache settings as part of my debugging process. What kinds of bugs have you run into? Did any of them result in a crazy hunt? Share in the comments!