Archive for the ‘Database’ Category
If you’re following along, I’ve recently named this effort “Parallax” for long and windy reasons. You can read all about it here.
I’ve copied the project up to CodePlex, at http://parallax.codeplex.com. Feel free take a look.
There’s not much there yet, just a few database tables, but it can get us started talking about the approach we’re taking to the database.
Here’s what we have so far:
There are a few things going on here:
Singular Table Names
I’ve used singular table names for all tables. This is a long and well-worn debate, whether to use singular or pluralized table names.
For years, I was a strong proponent of the pluralized table names. At the time, it was probably mostly because that’s what I had used before, and like most people I tend to come up with rational arguments to support my subjective preferences, because cognitive dissonance is yucky.
Plural Table Names Are Good
One of the main reasons for plural table names, at least for me, is that from one perspective it matches what the table actually represents. It is a container that has a list of work items, or a list of users, or a list of whatever. If we’re going to store a list of work items in a container, by golly we should call that thing “WorkItems”. It’s the same logic that is applied to object models; if your Person object contains a list of addresses, you’re going to call the property something like “Addresses” or “AddressList”, not “Address”. The idea of describing a whole list of customers as just “Customer” just rubbed me the wrong way.
Singular Table Names Are More Gooder
However, there are also several benefits to using singular table names. First, it avoids a lot of confusion about the proper way to pluralize a name.
Should the table that contains a list of person records be called “People” or “Persons”? “People” is a more common name, but “Persons” has the benefit of simplicity, because it is just appending an “s” to the end. I’ve heard this argument go back and forth, and it doesn’t matter who’s right or wrong because the whole fact that you’re discussing it just a damn waste of time. You are introducing artificial and unnecessary complexity to your project, which you and every other developer will have to worry about for the life of the project. Hidden productive costs like this add up significantly over time, and they suck the life out of your developers, of often times without them even realizing why they hate their job.
Also, assuming you follow some sort of standardized naming convention for primary and foreign keys, this can improve the readability and writability of your code, because you will not be mixing singular and plural names. To me, this:
SELECT * FROM WorkItem
INNER JOIN WorkItemType
ON WorkItem.WorkItemTypeID = WorkItemType.ID
SELECT * FROM WorkItem
Flows a little better this:
SELECT * FROM WorkItems
INNER JOIN WorkItemTypes
ON WorkItems.WorkItemTypeID = WorkItemTypes.ID
SELECT * FROM WorkItems
Lastly, if you ever want to have any scripts for performing automated maintenance or generate code, it’s a lot easier to only have to deal with a single tense to deal with and not have to worry about pluralized names. We’re going to be doing at least a little of that in Parallax.
Keep It Consistent
So like in most things, there are valid reasons for either approach, and in the end it probably doesn’t matter too much which you choose. However, what does matter is that you do choose and stick to it. If you start switching between plural and singular names, it is one of the first steps towards chaos. Your developers will have to continuously waste a unacceptable of time to stop and think “was that table called Person, or Persons, or People, or Peoples, or Peeps, or…?”
Primary and Foreign Keys
When I create a table, the first thing I do is created a clustered integer primary key named “ID”. 99% of the time, I do it every time.
Could I include the table name in the primary key column, e.g. “WorkItemID” instead of “ID”? Sure, but I chose not to, again for subjective reasons. I feel that the primary key is a really important field on the table, and should jump out in a query as the single most important and identifying field in the table. When you look at a query and are trying to visual exactly what it is doing, I feel that it makes it a just a little bit clearer to see the primary keys first and work out from there. Of course, one shortcoming is that it can get a little unclear when you start using vague subquery aliases, like “subq.ID”, but I think that is just another good reason to not use vague aliases in the first place.
For foreign keys, I just add “ID” to the referenced table name, such as “WorkItemID”. This way, when you’re writing a query, you never really have to stop and think what the field name is, or worse yet, go look it up in schema. Now there are cases that this doesn’t work, like when you have multiple independent foreign keys in table referencing the same table. For example, the WorkItem table references the AlmUser table for two different fields, the reporting user and the assigned to user, and we’ll be adding several more shortly. In this case, we can’t call them both “AlmUserID”, so we’ll add a prefix to them to identify their purpose, “ReportingAlmUserID” and “AssignedToAlmUserID”. This can also be used in cases where the reason for the key is not immediately evident, but this starts down a slippery slope of naming things certain ways just because you feel like it on a certain day, which should be avoided as much as possible.
What does the clustered part mean? A table can have a single clustered index, and that means that every row in the table will actually be physically stored in that order, which means that retrieving records by that index faster than a standard index. The tradeoff is that inserting new records into the middle of the index can be slow because it affects the physical layout of the table. However, this works especially well for auto-incrementing integer fields, because the new records always go at the end of the table and the end of the index.
This extra, auto-generated ID column approach is called a surrogate key, which means that it is not derived from any application data. So why reserve a whole extra field for this? Another approach is to use a “natural” or “intelligent” key, which is something a naturally occurring piece of data that is sure to be unique across all the rows. One of the more common example you see is using Social Security Number as the key for an Employee table.
There are several reasons not to use natural keys, but the first and most important to me, is: Get Your Damn Business Rules Out Of My Schema. I’m designing a database here, and my primary concern is that the data can be reliably and efficiently stored and retrieved. You may say today that every employee has a Social Security Number and that it’s guaranteed to be unique, and then in 6 months we need to change the application because we need to enter employee records before we have their SSN, or somehow a dead employee’s SSN got reissued to a new employee, or a new government regulation restricts from storing SSNs in the database, or about a million other things that you won’t think of because you’re really not nearly as smart as you think you are. If you’ve built your entire system around the assumption that a piece of data will be unique, you’ve really painted yourself into a corner when that assumption is inevitably proven to be wrong. Wouldn’t you rather just start with a single ID field that you can always rely on to be unique? It’s only purpose in life is to make sure that your data can be reliably identified, and that makes me feel warm and fuzzy.
And for a very real case of this, just this past fall I had to change SportsCommander to allow users to change their username. Why would you ever allow users to do this? Because they needed to. The original implementation auto-generated cryptic usernames, and this was only done to match a legacy system it had replaced, but the users could never remember their username. So despite how much it drove me nuts to change this after the fact, we accepted that was critically important to our user’s satisfaction with the site to allow them to select a new username. In the end, the technology gets less important, and you either have fantastically happy users or you’re selling flowers by the highway. If would have been easy to say up front that “users can never change their username, why would we ever allow that?”, and if we had used the username as a primary key, we’d really be screwed. Luckily, we already knew that we were simpleminded fools who could not predict the future, and did our best to protected ourselves against that.
Of course, in the original version of SportsCommander, we could have also used the USA Volleyball ID number to identify users, because every user in the system is a USAV member with their own number. However, it turned out that some users had their USAV number change from one season to the next. Also, we subsequently expanded SportsCommander to support organizations beyond USA Volleyball, so the identifier further ceased to be the grandly unique identifier it originally seemed to be.
Also, it’s slow. Sure, if it’s indexed, it can be really fast, but a lookup with a string field will never be as fast as an integer field. It also undermines any gain achieved by using clustered indexes, because your data is no longer sequential. the difference may be negligible a lot of the time, and I definitely don’t want to start doing premature optimization or complicating the code to squeeze our a tiny bit of performance. However, this is a small thing that can scale significantly depending on the amount of data and tables you have, because you are are doing constantly doing these lookups on against almost every row you touch.
Of course, natural keys aren’t all bad. One positive benefit of natural keys is that they can make bulk importing or synchronizing of data easier. For example, imagine that you are importing a list of employee records and also a list of their timesheet hours. You’ll make two passes through the data, once for the employee records themselves, and then a second pass for the timesheet hours. During the second pass, you’ll need to correlate the hours with the employee records that you’ve already inserted, and it’s a lot easier and quicker if you can use a value you already have (like the SSN), than to have to use that value to lookup another arbitrary value (like an ID field). However, in the end, I really don’t care about this all that much. It’s a small subset of the usage of the database, and I feel it pales in comparison to the other factors. Sure, it might make a DBA’s life a little more difficult for some data import they have to run in the background from time to time, but honestly I really don’t give a damn. Life is full of compromises, and the total system is almost always better off, in terms of stability and total development time, with a more consistent and reliable surrogate key.
So why not use Globally Unique Identifiers (GUIDs) for keys instead of numeric IDs? GUIDs definitely have their place, but it really depends on what you’re trying to do.
GUIDs have a lot of benefits due to the fact that they are always globally unique, while your cute little identity (i.e. auto-incrementing) ID is only unique within your table. The identity isn’t even unique throughout the database, because each table maintains its own list (this of course assumes SQL Server, not Oracle, I don’t know how MySQL handles this because I really don’t care).
Having globally unique keys can be useful if you are doing any replication from one database to another, or if you need to store the data in several different databases for other reasons, because you can assign a GUID ID to a record and be assured that it won’t conflict with ID from another database. For example, if you had a database that stored a master list of users, and those users are represented in several satellite databases, maybe for several separate web applications at the same company, GUID keys would make sense because the data by its very nature should be unique beyond the database.
Another benefit is that because they are non-sequential, you don’t have to worry about people guessing adjacent values. If you are going to include any IDs in a query string, it’s usually important to use values that will not allow the user to guess additional values. If user is on a page called “EditRecord.aspx?ID=12”, you know for sure that some user is going to try typing in “EditRecord.aspx?ID=13” just to see what would happen. Ideally in these cases you application will restrict the user from getting access to records that he/she/it/heshe should not have access to, but it’s a good idea to not encourage your users to go looking for these types of vulnerabilities. If the user sees “EditRecord.aspx?ID=ceb143ce-b4e9-4d59-bedb-bb07c5f0eb41”, good luck trying to guess the another value from that table.
However, GUIDs are slow. Again, if it’s indexed properly, it should perform sufficiently, but much like string fields GUIDs will never be as fast as integers.
Also, GUIDs make debugging a lot harder. You can’t yell over the cube wall “hey Barack, can you take a look at record 49bebb82-a96c-4bfc-b222-0b9df372ec2a, it looks like you screwed something up again!”. Even when you are just running a few queries to track down a problem in the middle of the night, it goes a lot quicker and burns fewer brain cycles if you can reference an easily-remembered number, like 8675309, instead of a 36-character identifier.
So in the end, I usually just use integer keys for everything. That way I get the best performance, the easily debugging, and it fits my needs 95% of the time. There still are always cases that you could use GUID identifier, so I just put an extra GUID column on those tables. That gives us the fast and easy integer key lookups for most queries, and the GUID uniqueness added-on when we need it. We’ve done that here for the AlmUser table, because there are a lot of cases that you don’t want to have users guessing the user ID number.
Often times when you are creating a schema, you use table or field names that are reserved by SQL. You may name a table “User” or name a field “Description”. You may not think anything of it at the time, and SQL will let you do it without complaint, but then when you try to write queries against it you’ll get syntax errors or weird behavior. A lot of developers know the frustration of typing that last character of the field name and watching the font color turn blue.
Of course, SQL provides a way around this. By wrapping your identifiers in brackets (like [User].[Description]), you can help SQL avoid this confusion. Any database layer that automatically generates SQL code should ALWAYS do this, because you never know what you might encounter.
However, typing in the brackets is a pain, and it hampers readability, and people often forget to do it, and it’s one of the last things I want to have to worry about while I’m writing code.
Most other programming platforms, like C+ or C#, prevent you from even using reserved words for your own purposes, because really, who needs the headache of keeping them straight? They are called “reserved” for a very good reason, and I hate that SQL doesn’t actually “reserve” these words.
Anyhow, just because SQL doesn’t enforce the rule doesn’t mean that we can’t enforce it ourselves. That being the case, stay away from this list of SQL Server 2008 reserved words. You can use them as any part of a name, such as “AlmUser” or “ItemDescription”, but just don’t use them by themselves.
Just about every database I’ve designed that drives a web application includes some sort of a User table. It may store authentication information, or personal information, or may just be an intersection point for everything else.
In this case we will store some basic identifying information, like FirstName and LastName. We’ll expand it over time to include additional fields. There is also a GUID column, in case we need to pass a User ID down to the browser, that is the version we’d like to use so that people can’t guess the other user’s IDs as well.
However, you may notice that it does not include the necessary fields for authentication, like Password. We’re going to outsource that for now using the ASP.NET Membership Provider interface. The default ASP.NET Membership Provider just stores the information in some database tables, but you can also have providers to connect to Active Directory, Open ID, or some any number of other third-party authentication systems. For now, we’ll use the default provider, and we may store the tables in the Parallax database just to keep everything together, but there will be NO connection or references between those membership tables and the rest of the Parallax schema. At any point in time, we will want to be able to pull that out and replace it with another provider.
Granted, this approach may be overkill, and it is usually not necessary for most applications to keep their membership information completely abstracted. However, as I’ve mentioned, I’m not pretending that we building a practical application here with realistic design constraints; this is a sandbox where we will explore several technologies, so membership providers is one of those topics I’d like to explore.
Every table ends in 4 fields, indicating who created the record and when, and who last updated the record and when. Granted, this only displays two dimensions of the history, rather than a full history of who changed what when, but I’ve found it comes in handy when trying to figure out when the data went awry.
You’ll notice that we use the user GUIDs to store the user. I’ve seen implementations where they used actual foreign keys to the user table, and it caused enormous problems. The checking of the key constraint can slow the performance of inserts and updates, albeit slightly, but that’s still a big no-no. Your auditing code should have no detrimental impact on the database. Also, what happens if for some reason the ID value is wrong? You really don’t want to reject the change for this reason, because these are just audit fields, not real transactional information. It’s important that the value be right, but it’s a lot less important than the rest of the record. Lastly, and the most crippling, is that by the time the database had grown to hundreds of tables, it was virtually impossible to delete a user record. Anytime we tried, it would run through every key referencing the table (2 per table) to determine if there were any other records referencing the deleted record. Due to the number of relationships and the volume of data, this would timeout after a few minutes.
So clearly, foreign keys are out. We could still use the ID from the table without a key, but I see a few issues with that. First, every other place in the database that stores an ID reference field has a foreign key. That’s a core rule of the database, and I don’t want to give anyone, especially myself, the opening to start getting creative with whether to enforce foreign keys. Also, the value is not very intuitive. Lastly, I don’t want anyone to see that field, think that someone else forgot a key constraint, and go ahead and start adding them. If you don’t make it clear in the code how your code is supposed to be used, it’s your own fault when it is inevitably misused.
So in the past I’ve learned towards putting the username in there, and it worked great. It was straightforward, unique, succinct, and human-readable. However, as we mentioned above when discussing natural keys, even usernames can change.
So we’re trying the user GUID this time around. It’s not nearly as friendly as the username, but it certainly is unique. And it avoids the usage confusion that you get with ID fields. I’m still not crazy about this one, but we’ll see how it goes.
NVARCHAR vs. VARCHAR
For a long time I always used VARCHAR, because I didn’t have a reason to use NVARCHAR. Then, I ran into a few apps that needed to have NVARCHARs here and there because we had some international customers, and it was a pain keeping track of which fields were which. So I just started using NVARCHAR for everything, but in hindsight I think that was a mistake. Doug Blystone (co-founder of MMDB Solutions and a guy who knows more about SQL databases that I could ever hope to) has finally convinced me to stop using them unless absolutely necessary. I usually shrug off performance issues for the sake of simplicity until they are a demonstrable problem, but this is one of the cases where it can lead to a hidden and pervasive performance drag that won’t jump out at you, but will slow you down just a little bit every step of the way. These are the types of performance issues that add a second or two to the page load here and there, which makes you users subconsciously hate your guts, even if they don’t really know why. It takes up twice as much space, it makes comparisons of string values take longer because there are twice as many bytes to compare, and it’s just not necessary unless you’re dealing with Asian languages (or anything language that may have thousands of characters). We are years removed from the day that we can afford to hire someone to translate our applications to Japanese, and I’m guessing we’ll never do business in China because I hate Commies.
That being said, we’re using VARCHAR. We’ll also be using UNIQUEIDENTIFIER, DATETIME, DATE, TIME, BIT, INT, FLOAT, TEXT, and maybe VARBINARY. And that’s it. No smallint or real or money or any of that crap. We are going to use the minimum number of logically distinct data types that meet our needs. I have no interest in getting into data-type arms war, trying to save a byte here or there to create some sort of idealistically pure representation of the data’s inner beauty. I, nor any other developer, will waste a minute of time worrying about if we can cast this bigint into an integer or that real into a float without data loss. We’re going to keep this simple, so help me God, and so you and your sql_variant friends better get the hell off my lawn before I get my shotgun.
I actually have used the new DATE and TIME fields yet, but I’ve long felt the need for them. A DATETIME field is really two different things, which usually works out OK, but when you need to store just a date or just a time, it really gets clunky. Then when you start having to worry about time zones, it can get just plain weird. We have a few cases in SportsCommander now where we have to do a little code gymnastics to store just a date or just a time in a DATETIME field, and have the inactive part properly ignored without interfering with the active part. This will be my first foray into using these data types, so we’ll see how well they work.
So that’s a start for the database stuff. It ran a little long for 3 tables, but you need to state your ground rules up front, even if it’s just for your own benefit, lest you start getting lazy and breaking them. So here we go.