Integers vs GUIDs - The Great Primary Key Debate

One of the first steps in creating any database table is deciding what kind of data will uniquely identify a given row in said table; we call this a primary key. In modern databases, if we want to create a unique key from "thin air" and not use already-existing data (this situation is known as using a surrogate key) we have two commonly-accepted options: using integers, or using Globally-Unique Identifiers (GUIDs).

Both integers and GUIDs can be generated automatically, either by the database itself or some other system. But which is better? Let's discuss the pros and cons of using integers and GUIDs as primary keys, and see if we can discover which works better for our databases and applications.

A set of two keys, dating to the Roman Empire. Roman keys by MatthiasKabel, used under license

Using Integers

The primary argument in favor of using integers for primary keys is due to their simplicity. After all, they're just numbers. If you printed out a table that used integers as primary keys, the data is easy to read and understand:

ID Name
1 John
2 Kelly
3 Luis

If we're using integers as your primary key type, it becomes easy for the user to guess what the ID of a certain thing is (which may or may not be what you want). Let's imagine that a user encounters a URL similar to this one


If the user already knew that the ID of the location s/he wants to see is 120, all they have to do is change the URL like so:


In this way, the user can gain some extra knowledge about the system and start becoming able to make predictions as to what information exists where. Overall, it makes the user experience of our applications nicer.

Finally, integers are tiny; they use only 4 bytes of storage. This means that things like indexing and querying, in general, will be quicker when using integers as your primary key type. However, you probably won't see any difference until/unless you've got a massive application in place.

In short, use Integers when:

  • You want easily-understood IDs.
  • You want your URLs to be "hackable" by your end-users.
  • You are concerned about performance (in very large applications).

Using GUIDs

GUID stands for Globally-Unique Identifier, and is comprised of 32 hexadecimal characters in a format that looks like this:


A GUID is a specific form of the Universally-Unique Identifier idea, which was formally laid out by the Internet Engineering Task Force (IETF) as RFC 4122. The term GUID and its format were originally developed and used by Microsoft.

The defining characteristic of a GUID is that it is, comparatively speaking, huge; it uses 16 bytes of storage, versus only 4 bytes for an integer. But why would anybody need such large entropy in a data type? Let's do some math to discover that answer.

A GUID contains 2128 possible combinations, which means 3.4x1038 possible unique values can be generated from this structure. When written out, this number looks approximately like this:


That is an astoundingly massive number. To put this in perspective, according to this blog there are somewhere around 5.6X1021 grains of sand on Earth's beaches. Every single one could use 6.07x1016 GUIDs without any repeats. Unless you have data sets that are impossibly large, you will never see duplicate GUIDs.

Because we are virtually guaranteed to not ever get the same GUID twice, collecting data from many disparate sources becomes much easier. If you are merging data from a database that uses GUIDs as primary keys into your database, you are so unlikely to encounter collisions that you probably won't even need to check for them. The entropy invoked by GUIDs allows us developers to not have to consider handling collisions in the vast majority of circumstances.

However, because GUIDs are so big, performance can theoretically suffer with sufficiently large data sets. Indexing, in particular, can suffer due to the size of the data being stored. Opinions vary on just how much of a performance hit you'll take, though.

Another con to using GUIDs is that they are not easily remembered by users, so you won't get the "hackable" URLs that you would when using Integers. Depending on your system, this may be acceptable.

In short, use GUIDs when:

  • You want the data to be uniquely-identified, no matter where it came from.
  • You need to be able to combine data from difference sources with little-to-no chance of duplicate GUIDs.
  • You don't want or don't care about the users needing to remember an ID themselves.

Which Is Better?

As always, this kind of decision comes down to what kind of system you are building. But IMO you should only favor GUIDs over integers if you have a very large, distributed system where you are likely to need global uniqueness.

The performance argument against GUIDs has largely been nullified by the passage of time and improvements in technology. In modern systems you will not see any discernable performance difference between the two data types.

Further, the "hackable" argument in favor of using integers is entirely wrong-headed; why should the user have to remember anything? It's nice for the users to be able to do this, but not necessary or vital by a long shot.

Despite both of those points, I don't see a reason to invoke the extra complexity necessitated by GUIDs unless you have a correspondingly complex system. In my mind, the argument boils down to clarity vs guaranteed uniqueness, and given that the best code is no code I'll take the simplicity of integers every day of the week.

But that's just my opinion. I've laid out the best arguments I've found for using both integers and GUIDs as primary keys so that you (my readers) can go make this decision for yourself. Let me know what you decide, and why, in the comments!

Happy Coding!

Designing a Workflow Engine Database Part 8: Complete Schema and Shortcomings

This is the final part of an eight-part series describing how to design a database for a Workflow Engine. Click here for Part 1

We've now completed the database design for our Workflow Engine, so let's take a look at the final database diagram (click here for a larger view):

The complete database diagram for our Workflow Engine

That's a lot of tables! But we're trying to design a generic engine, one that can be used for many disparate processes, and as such the complexity of this design reflects the complexity of the requirements.

Shortcomings of This Design

There are a few shortcomings of this design:

  • Currently, there's no way to say "Stay in this state for X days; after X days, move to another state." We cannot cause Requests to move along in the Process due to time elapsed.
  • We don't have a procedure in place to handle a State or Transition ceasing to exist (meaning, even though the Process Admins are supposed to be able to change the Process itself, what happens when they actually do that?)
  • The RequestFiles table stores the actual files in the table itself; a more performant way to do this would be to store the files on a filestore somewhere and just keep references to them in this table.
  • We don't currently have a way to designate that a specific individual needs to perform an Action or receive an Activity.

That said, most of these shortcomings are fairly easy to implement; we actually did solve a few of them for our company's workflow engine.

Remember: this is only a database schema. The design doesn't account for how this service would actually be implemented in code.

Whew! That was a lot of steps! But what we ended up with was a generic, extensible, reusable workflow engine database design that can handle a huge variety of processes.

Thanks for reading! If you found this series helpful (or just need to yell at me for something I did incorrectly) let me know in the comments.

Happy Coding!

Designing a Workflow Engine Database Part 7: Request Actions

This is Part 7 of an eight-part series describing how to design a database for a Workflow Engine. Click here for Part 1

All of the infrastructure we've built so far has lead to this moment. At long last, we can build the final piece of our schema: the Request Actions table.

Request Actions

So we now have Actions that Users can perform to invoke Transitions. Obviously we can't allow for any action to be performed against the Request; only the actions we need should be allowed.

The Table

Let's look at the schema of the RequestActions table first, and then show how it would actually be used:

Those last two columns (IsActive and IsComplete) are very important to the actual execution of this table.

Here's how we're going to use this table.

  1. When a Request enters a State, we get all outgoing Transitions from that state. For each Action in those Transitions, we add an entry in RequestAction, with each entry having IsActive = 1 and IsCompleted = 0.
  2. A User may submit an Action at any time. Each submitted Action consists of an ActionType, a RequestID, and a UserID.
  3. When an Action is submitted, we check the RequestActions for the specified Request. If the submitted Action matches one of the active RequestActions (where IsActive = 1), we set that entry's IsActive = 0 and IsCompleted = 1.
  4. After marking the submitted Action as completed, we check all Actions for that Transition in that Request. If all RequestActions are marked as Completed, then we disable all remaining actions (by setting IsActive = 0, e.g. all actions for Transitions that were not matched).

Sample Walkthrough

Let's see if we can see how this works by introducing some sample data.

USERS: Jane (ID 1), Tom (ID 2), Gary (ID 3)

GROUPS: Executives (ID 1), includes Tom and Gary

STATES: A (Type: Start), B (Type: Normal), C (Type: Denied)

TRANSITIONS: A -> B (ID 1), A -> C (ID 2), B -> C (ID 3)

TRANSITION ACTIONS: A -> B: Approved by Requester (ID 1) AND Approved by Executives (ID 2)
A -> C: Denied by Executives (ID 3)
B -> C: Denied by Requester (ID 4)

Let's say Jane creates a Request, which immediately is placed into State A.

At this point, the system looks for all outgoing transitions from State A and finds two of them, Transitions 1 and 2. It then loads the following data into RequestActions:


Now the Request just sits in its current state, waiting for an Action to be submitted.

Say Jane submits this action:

User ID: 1
ActionType: Approve
Request ID: 1
(We read that as "User 1 approves Request 1")

Since that action matches the first RequestAction in the table, it is marked as completed:


At this point, we have not matched all Actions for a Transition, so nothing happens to the Request; it remains in State A.

Now say Tom submits this action:

User ID: 2
ActionType: Approve
Request ID: 1
(User 2 approves Request 1)

After that action gets matched, the table of Request Actions now looks like this:


Notice that because both Actions for Transition 1 are complete, we now must follow Transition 1 and move the Request to the next State, which is State B. After we move to State B, we load the Actions for the Transitions from that State and disable any old Actions; our RequestActions table looks like this:


In this manner, we can keep track of all actions that have been performed, that could have been performed, and that are still waiting to be performed, in the same table.

What did we accomplish?

In this post, we implemented the last piece of our structure: the Request Actions table. This table stores all Actions that can be made against a particular Request, and is the driving force behind how this engine actually works.

There's still one last part we will talk about, and that's actually building and demoing a process that runs on this engine. The final part of our saga is Part 8, where we will discuss the Complete Schema and Shortcomings of our Workflow Engine database design.

Designing a Workflow Engine Database Part 6: Groups and Targets

This is Part 6 of an eight-part series describing how to design a database for a Workflow Engine. Click here for Part 1

We've now got most of the Process tables defined, but we are still missing a few things.

One of those things is a definition for exactly who can perform Actions or receive Activities; we're going to call this Targets.

The second piece we need is Groups, or collections of people that perform a similar or related job in this process. In our design, we want to allow for a Group to also be a Target. Let's see how we can design these tables!


A Group is a collection of Users that perform related functions.

For example, we might have a group called Programmers that actually write code, or a group called Executives that approve projects for development. Because we are building a generic Workflow Engine, we want to make sure that each Process can have its own Groups.

The Group table is pretty simple:

The Group table, showing GroupID, ProcessID, and Name

We also need a table to represent which Users are in a given group, which will be another many-to-many table. This table is called GroupMember:

The design and relationships of Group, GroupMember, and User


Remember that this system is people-focused; only people can action the Request. We still need a way to associate which persons (or Groups) can perform Actions and receive Activities.

The problem is that we may need to specify that only the Requester of a given Request can send that request to his/her supervisor, or that all of this set of people need to receive an email when a Request reaches a certain State. How can we design this in such a manner that it is flexible in who can perform the Action or receive the Activity, but still implements a few rules that the engine must follow?

We can accomplish this by creating Targets. A Target is a set of standardized representations of a person who have specific roles relative to a Request or Process. We use the following targets:

  • Request Creator (Requester)
  • Request Stakeholders
  • Group Members
  • Process Admins

Our Target table looks like this:

The Target table, showing TargetID, Name, and Description

Because this is another static table (like StateType, ActionType, and ActivityType), we don't expect the data in it to ever change. Here's the data we will be using for this design:

The data in the Target table, showing Requester, Stakeholders, Group Members, and Process Admins

Action Targets and Activity Targets

The Targets table does us no good unless we can relate it to other tables that can actually use the Targets. We want to use Targets in two scenarios:
* As people who can perform Actions * As people who can receive Activities

Let's design Action Targets first. For every Action that can be submitted to this engine, we need to define who can actually submit that action for it to be considered valid. After all, we don't want janitor Kevin approving the construction of a new grocery store, since that's not his responsibility. With this in mind, our table design looks like this:

The ActionTarget table, showing ActionTargetID, ActionID, TargetID, and optional GroupID

We needed to include Group ID because if our Target is a Group, we have to specify which Group can perform the Action.

Now let's discuss Activity Targets. Depending on the kind of activity, the Targets for that activity could receive an email, or be added to the stakeholders list for a Request, etc. Because we've defined a central list of Targets, the ActivityTarget table looks remarkably similar to the ActionTarget table:

The ActivityTarget table, showing ActivityTargetID, ActivityID, TargetID, and optional GroupID

IMPORTANT NOTE: When using a Group as a Target, the way the system interprets this relationship is different.

  • If the Group is an Action Target, then any member of the Group can perform the action for it to be valid.
  • If the Group is an Activity Target, then all members of the Group receive the Activity (e.g. everyone in the group gets an email).

What did we accomplish?

Our design for Actions, Activities, Groups, and Targets now looks like this:

In this part, we nailed down exactly who could perform Actions and receive Activities by creating Targets, and we gave our engine a bit more flexibility by creating Groups of Users who could each do the same thing as the others.

In the next part of this series, we get to the real meat of the system. We'll show how individual Requests can track which Actions can be performed against them, and we'll see how we can use that list to determine which Transition the Request needs to follow. Next up is Part 7 of this series, Request Actions.