At the very least, it’s probably not nearly as good as it should be.  As a pathological job-hopper who sees one company after another wasting hidden hours and days on insufficient database change control strategies, I’m sick of it.

So how does my company handle database changes?  I’ve asked this of potential employers many times, and I usually just get blank stares, or a vague answer along the line of “uh… with scripts?”  Sometimes, if they are honest, they just reply “poorly.”  There’s not even an explicit Joel Test for it.  Considering how bad the situation is at most companies, the real test is not whether their database change management is any good, but just whether they are willing to recognize how problematic it really is.

Given how much thought and effort goes into source code control and change management at many of these same companies, it is confusing and a little unsettling that so much less progress has been made on the database change management front.  Many developers can give you a 15 minute explanation of their source code strategy, why they are doing certain things and referencing books and blog posts to support their approach, but when it comes to database changes it is usually just an ad-hoc system that has evolved over time and everyone is a little bit ashamed of it. 

Hopefully we can get to the bottom of why this is and come up with a few ways to make it better.  It’s not like the database is important, is it?

 
I like you, but not in that way

So why don’t developers think about databases the same way that they think about source code? 

The most obvious reason is that database changes are fundamentally different in many ways.  While source code is usually just a collection of files that recompiled, versioned, and released at any given time, databases are much more temperamental.  They have existing data, and they have history.  Sure, source code has history, which you can review for reference purposes, but in databases the historical lineage of a table is actually very important.  In C#, if you added a field to a class, and then a week later someone else changed the type, and then a month later someone else changed the name, it usually doesn’t really matter too much when or in what order those changes happened, all that matters is the current state of the code and that everything builds together and works in its latest state.  However, if you were to do the same thing to a new field in a table, it definitely makes a difference, because there are data implications at every step of the way.  This alone scares a lot of developers away from maintaining the database.

To many developers, there is something fundamentally uncontrollable about databases.  They don’t fit into the safe model that we’re use to.  Managing those changes is definitely introduces new challenge, and many developers just don’t want to be bothered.

 

Doing Business As

Another major reason for the difference is just cultural.  Developers want the database to work well, and they may even like tinkering around with some stored procedures from time to time, but at the end of the day they like to be able to wash their hands of anything that slightly resembles being a “DBA”.

When you compare the DBAs (database administrators) and SCMs (stupid code monkeys), often times there is just a different way of looking at the world.

Developers often see themselves as the general of a utilitarian army of code that they build and train and then order off to carry out their mission, and they usually won’t hesitate to trash it all and replace it if it proves itself unable to complete the mission.  DBAs on the other hand are used to dealing with gentle lion that could easily lose it temper and kill everyone in the room if it’s not treated with kindness and respect.  Developers often have the option to wipe the slate clean and start over, and usually want to, especially when they are dealing with someone else’s code.  DBAs however are stuck with the original version and they need to keep it alive and make it work until the point that the data can be migrated to a better system, and we all secretly know that is never going to happen.

 

Code source control is an over-ripe banana

As developers, we are awash with tools and techniques for managing our source code changes.  We have a seemingly endless number of source control platforms to choose from, each with more features than we could every possibly need or even understand, and some people build entire careers around defining complicated branching and merging strategies for massive codebases.  I figured for a while that the ideas of practical source control where pretty well nailed down, but then you read Eric’s Sink’s posts on Distributed Version Control  Systems, and get an idea of how quickly the field is continuing to evolve.

At some internal IT department, a group of developers is toiling away on a set of dull yet fabulously enterprise-y software that five people will use.  Jimmy checks his code changes into his company’s version control system of choice, where it is automatically held at the gates until it is code-reviewed by a senior developer, and then it is checked into the team’s current release branch.  Meanwhile the continuous integration build server will download, compile, package, and unit test the code to make sure that Jimmy hasn’t broken anything, and that nobody else’s changes broke Jimmy.  The package code is then migrated through an integration environment, QA environment, UAT environment, and staging environment on its way to production.  All the while, as the changes are validated from one environment to another, version numbers are automatically assigned to allow anyone to trace back the exact revision for a given build, and the corresponding code changes slowly work their way towards the code-release promised land, the Main Trunk.  Those branches can get pretty damned complicated, even when everything is going smoothly, and it never goes smoothly.

Code branchs can get  

Hopefully this is a process that Jimmy’s company evolved out of mixture of necessity and forethought over the years.  The other, less attractive, and far more common scenario is that the company hired an astronaut trapped in a developer’s body, bored with his work and not bound by any sense of urgency, who assured the company that he was going to implement a lot of big complicated processes because that’s just want professional companies do. 

In the end, a whole lot of people are paying a whole lot of attention to managing the source code.  Hopefully at your company, you are paying attention to this to.

 

Databases, the redheaded stepchild of source control

Now ask yourself, how does Jimmy handle database changes? 

For example, say that all database changes need to be checked into an isolated directory in source control; after all we’re not savages.  However, since they don’t really “build” a database, that a directory which is ignored by their continuous integration server.  This in turn breaks the unit tests that are pointing to the integration database server, so Jimmy then needs to run those scripts manually in the integration environment. 

In this process, Jimmy sees other database scripts that were checked in recently in the same source control directory, but he has no way to know which scripts have already been applied to the integration server.  For the briefest moment, Jimmy considers applying those scripts as well, just to make sure that the integration server is fully up-to-date, but then he realizes that he can’t be sure which scripts have already been run without manually comparing the schema and scripts to see which have been applied, and this would make Jimmy the defacto owner for any issue that arise because of it.  With his own tasks and deadlines to worry about, Jimmy doesn’t have the time or patience for this silliness, so he just deploys his scripts, forgets about the others, and hopes for the best. 

It’s worth noting here that this is the kind of things that can silently kill software quality.  A motivated developer just tried to make things a little better, but the process was so frustratingly inadequate that it was impractical for him to do so.  Software companies depend on their developers taking the initiative to improve things, and when they are discouraged from doing so, either by person or by process, the company will slowly slide backwards into mediocrity, and it will drag every developer’s morale with them.

 

Now once Jimmy makes the changes to the integration server database, that also breaks some other developers that have been using that server for their development.  Those developers now need to stop and download the latest code changes to get back in sync, cursing Jimmy’s name the whole way.

Anyhow, during the next deployment to QA, someone needs to remember that these specific changes need to get deployed.  Since there is no defined strategy for tying database changes to code changes, every time code is deployed there is a little bit of confusion around exactly which database changes need to be released, which were already released, and what order the scripts need to be run in.  Jimmy is getting upset.

Another darker possibility is that instead Jimmy needs to submit his changes to the database review board, a collection of detached idealists, college professors without the college, who will criticize every aspect of the scripts in order to justify their existence, but will not really offer any true value because they don’t understand the business problem that needs to be solved, nor do they appreciate the application considerations beyond the database that need to be satisfied.

One of the long term impacts of this is that Jimmy will look for any possible way to accomplish what he is trying to do without making database changes, because, in his colorful youthful vernacular, “making changes to the database is a sodding bitch.”  And if he does indeed need to change the database, he’ll try to accomplish it just by changing stored procedures, because changing table schemas are even worse.  In the end, he’s definitely not trying to find the appropriate solution to the problem; instead he backed into a situation of being forced to find a “good-enough” solution that will minimize his hassle, regardless of the downstream impacts. 

From now on, he’ll look for any way he can accomplish it by only changing stored procedures and not changing the underlying schema.  If he’s lucky (sort of), he’ll find a way that he can just kludge the stored procedures to work around the problem for now, and let it be someone else’s problem to solve later.  He has long since given up trying to find the “right” solution, because it is so exhausting the current state of things is so busted up that it’s not even worth trying anymore.

Further complicating the issue, some developers and DBAs make one-off changes in the QA and production environments without going through source control.  Either they need to make an emergency fix in one of the environments and forget to go back and add the scripts to source control, or worse they just don’t believe that databases should be under source control (I’ve seen this attitude far too often from some DBAs, because they can’t stand the idea of being forced to use a source control system managed by some other developers, just so that they can make changes to their own database).  Pretty soon, every environment is a perverted branch of the one true database, and trying to identify why a test works in one environment and fails in another quickly becomes a nightmare.

 

Some day, things will be different!

So what’s the solution?  Well, in my humble opinion, we need something like this:

  • The database must be versioned, so that it is easy to tell which changes have applied and when they were applied.
  • All database changes must be checked into source control.
  • All database changes can tied to the code changes that they affect, ideally checked into source control as a part of the same changeset transaction.
  • The database changes are built along with the code changes.
  • The database changes are deployed along with the code changes.
  • The continuous integration server must be able to build and update its own copy of the database, so that it can run automated tests of code and scripts that are checked in at the same time.

I’ve seen some companies that have had home-grown utilities that come close to accomplishing this, but in the end they all fell just a little bit short, which is not too bad.  However, the vast majority of companies I’ve seen were not even in the ball park.

Some of you are probably asking, “doesn’t Visual Studio Team System” do this?  Yeah, I think so.  Probably, but who knows.  Honestly I tried working with it a few times, and it caused me nothing but problems.  Sure, I could spend a lot of time mastering all the quirks, but I’m looking for something a little bit more accessible here.  The underlying concepts are hard enough; we need an approach that simplifies it, and I just don’t think that VSTS accomplishes that.  More importantly, and also along the lines of accessibility, VSTS costs a fortune, and so most developers will never have access to it, so I’d like to fine something that can work for the other 95% of developers out there that are stuck using reasonably-priced tools.

What about Red Gate’s SQL Compare and SQL Data Compare products?  Absolutely, they tools are indeed awesome products that can prove invaluable in many cases.  However, they are often leveraged once an unknown number of database changes have already been made by an unknown number of people.  However, I think the better solution, whenever possible, is to step back and track the changes as they come in, proactively considering the changes, just like we do for regular source code, which allows for far more robust change management and accountability for the database.

So that’s the idea.  In the next few weeks I’ll have some more posts on this topic, getting into some more specifics about how to solve this.  Hopefully we can find a simple, easy to manage solution, that developers can adapt to quickly, and which promotes database changes to the first-class citizen of the change management world that it deserves to be. 

If you’re following along, I’ve recently named this effort “Parallax” for long and windy reasons.  You can read all about it here.

I’ve copied the project up to CodePlex, at http://parallax.codeplex.com.  Feel free take a look.

There’s not much there yet, just a few database tables, but it can get us started talking about the approach we’re taking to the database.

 

Here’s what we have so far:

ParallaxDB_1

 

There are a few things going on here:

Singular Table Names

I’ve used singular table names for all tables.  This is a long and well-worn debate, whether to use singular or pluralized table names.

For years, I was a strong proponent of the pluralized table names.  At the time, it was probably mostly because that’s what I had used before, and like most people I tend to come up with rational arguments to support my subjective preferences, because cognitive dissonance is yucky.

Plural Table Names Are Good

One of the main reasons for plural table names, at least for me, is that from one perspective it matches what the table actually represents.  It is a container that has a list of work items, or a list of users, or a list of whatever.  If we’re going to store a list of work items in a container, by golly we should call that thing “WorkItems”.  It’s the same logic that is applied to object models; if your Person object contains a list of addresses, you’re going to call the property something like “Addresses” or “AddressList”, not “Address”.  The idea of describing a whole list of customers as just “Customer” just rubbed me the wrong way.

Singular Table Names Are More Gooder

However, there are also several benefits to using singular table names.  First, it avoids a lot of confusion about the proper way to pluralize a name. 

Should the table that contains a list of person records be called “People” or “Persons”?  “People” is a more common name, but “Persons” has the benefit of simplicity, because it is just appending an “s” to the end.  I’ve heard this argument go back and forth, and it doesn’t matter who’s right or wrong because the whole fact that you’re discussing it just a damn waste of time.  You are introducing artificial and unnecessary complexity to your project, which you and every other developer will have to worry about for the life of the project.  Hidden productive costs like this add up significantly over time, and they suck the life out of your developers, of often times without them even realizing why they hate their job.

 

Also, assuming you follow some sort of standardized naming convention for primary and foreign keys, this can improve the readability and writability of your code, because you will not be mixing singular and plural names.  To me, this:


SELECT * FROM WorkItem
INNER JOIN WorkItemType
ON WorkItem.WorkItemTypeID = WorkItemType.ID

 

Flows a little better this:


SELECT * FROM WorkItems
INNER JOIN WorkItemTypes
ON WorkItems.WorkItemTypeID = WorkItemTypes.ID

 

Lastly, if you ever want to have any scripts for performing automated maintenance or generate code, it’s a lot easier to only have to deal with a single tense to deal with and not have to worry about pluralized names.  We’re going to be doing at least a little of that in Parallax.

Keep It Consistent

So like in most things, there are valid reasons for either approach, and in the end it probably doesn’t matter too much which you choose.  However, what does matter is that you do choose and stick to it.  If you start switching between plural and singular names, it is one of the first steps towards chaos.  Your developers will have to continuously waste a unacceptable of time to stop and think “was that table called Person, or Persons, or People, or Peoples, or Peeps, or…?”

 

Primary and Foreign Keys

When I create a table, the first thing I do is created a clustered integer primary key named “ID”.  99% of the time, I do it every time.

Naming Conventions

Could I include the table name in the primary key column, e.g. “WorkItemID” instead of “ID”?  Sure, but I chose not to, again for subjective reasons.  I feel that the primary key is a really important field on the table, and should jump out in a query as the single most important and identifying field in the table.  When you look at a query and are trying to visual exactly what it is doing, I feel that it makes it a just a little bit clearer to see the primary keys first and work out from there.  Of course, one shortcoming is that it can get a little unclear when you start using vague subquery aliases, like “subq.ID”, but I think that is just another good reason to not use vague aliases in the first place.

For foreign keys, I just add “ID” to the referenced table name, such as “WorkItemID”. This way, when you’re writing a query, you never really have to stop and think what the field name is, or worse yet, go look it up in schema.  Now there are cases that this doesn’t work, like when you have multiple independent foreign keys in table referencing the same table.  For example, the WorkItem table references the AlmUser table for two different fields, the reporting user and the assigned to user, and we’ll be adding several more shortly.  In this case, we can’t call them both “AlmUserID”, so we’ll add a prefix to them to identify their purpose, “ReportingAlmUserID” and “AssignedToAlmUserID”.  This can also be used in cases where the reason for the key is not immediately evident, but this starts down a slippery slope of naming things certain ways just because you feel like it on a certain day, which should be avoided as much as possible.

Clustered Index

What does the clustered part mean?  A table can have a single clustered index, and that means that every row in the table will actually be physically stored in that order, which means that retrieving records by that index faster than a standard index.  The tradeoff is that inserting new records into the middle of the index can be slow because it affects the physical layout of the table.  However, this works especially well for auto-incrementing integer fields, because the new records always go at the end of the table and the end of the index. 

Natural/Intelligent Keys

This extra, auto-generated ID column approach is called a surrogate key, which means that it is not derived from any application data.  So why reserve a whole extra field for this?  Another approach is to use a “natural” or “intelligent” key, which is something a naturally occurring piece of data that is sure to be unique across all the rows.  One of the more common example you see is using Social Security Number as the key for an Employee table. 

There are several reasons not to use natural keys, but the first and most important to me, is: Get Your Damn Business Rules Out Of My Schema.  I’m designing a database here, and my primary concern is that the data can be reliably and efficiently stored and retrieved.  You may say today that every employee has a Social Security Number and that it’s guaranteed to be unique, and then in 6 months we need to change the application because we need to enter employee records before we have their SSN, or somehow a dead employee’s SSN got reissued to a new employee, or a new government regulation restricts from storing SSNs in the database, or about a million other things that you won’t think of because you’re really not nearly as smart as you think you are.  If you’ve built your entire system around the assumption that a piece of data will be unique, you’ve really painted yourself into a corner when that assumption is inevitably proven to be wrong.  Wouldn’t you rather just start with a single ID field that you can always rely on to be unique?  It’s only purpose in life is to make sure that your data can be reliably identified, and that makes me feel warm and fuzzy.

And for a very real case of this, just this past fall I had to change SportsCommander to allow users to change their username.  Why would you ever allow users to do this?  Because they needed to.  The original implementation auto-generated cryptic usernames, and this was only done to match a legacy system it had replaced, but the users could never remember their username.  So despite how much it drove me nuts to change this after the fact, we accepted that was critically important to our user’s satisfaction with the site to allow them to select a new username.  In the end, the technology gets less important, and you either have fantastically happy users or you’re selling flowers by the highway.  If would have been easy to say up front that “users can never change their username, why would we ever allow that?”, and if we had used the username as a primary key, we’d really be screwed.  Luckily, we already knew that we were simpleminded fools who could not predict the future, and did our best to protected ourselves against that. 

Of course, in the original version of SportsCommander, we could have also used the USA Volleyball ID number to identify users, because every user in the system is a USAV member with their own number.  However, it turned out that some users had their USAV number change from one season to the next.  Also, we subsequently expanded SportsCommander to support organizations beyond USA Volleyball, so the identifier further ceased to be the grandly unique identifier it originally seemed to be.

 

Also, it’s slow.  Sure, if it’s indexed, it can be really fast, but a lookup with a string field will never be as fast as an integer field.  It also undermines any gain achieved by using clustered indexes, because your data is no longer sequential.  the difference may be negligible a lot of the time, and I definitely don’t want to start doing premature optimization or complicating the code to squeeze our a tiny bit of performance.  However, this is a small thing that can scale significantly depending on the amount of data and tables you have, because you are are doing constantly doing these lookups on against almost every row you touch. 

Of course, natural keys aren’t all bad.  One positive benefit of natural keys is that they can make bulk importing or synchronizing of data easier.  For example, imagine that you are importing a list of employee records and also a list of their timesheet hours.  You’ll make two passes through the data, once for the employee records themselves, and then a second pass for the timesheet hours.  During the second pass, you’ll need to correlate the hours with the employee records that you’ve already inserted, and it’s a lot easier and quicker if you can use a value you already have (like the SSN), than to have to use that value to lookup another arbitrary value (like an ID field).  However, in the end, I really don’t care about this all that much.  It’s a small subset of the usage of the database, and I feel it pales in comparison to the other factors.  Sure, it might make a DBA’s life a little more difficult for some data import they have to run in the background from time to time, but honestly I really don’t give a damn.  Life is full of compromises, and the total system is almost always better off, in terms of stability and total development time, with a more consistent and reliable surrogate key.

GUID Keys

So why not use Globally Unique Identifiers (GUIDs) for keys instead of numeric IDs?  GUIDs definitely have their place, but it really depends on what you’re trying to do.

GUIDs have a lot of benefits due to the fact that they are always globally unique, while your cute little identity (i.e. auto-incrementing)  ID is only unique within your table.  The identity isn’t even unique throughout the database, because each table maintains its own list (this of course assumes SQL Server, not Oracle, I don’t know how MySQL handles this because I really don’t care). 

Having globally unique keys can be useful if you are doing any replication from one database to another, or if you need to store the data in several different databases for other reasons, because you can assign a GUID ID to a record and be assured that it won’t conflict with ID from another database.  For example, if you had a database that stored a master list of users, and those users are represented in several satellite databases, maybe for several separate web applications at the same company, GUID keys would make sense because the data by its very nature should be unique beyond the database. 

Another benefit is that because they are non-sequential, you don’t have to worry about people guessing adjacent values.  If you are going to include any IDs in a query string, it’s usually important to use values that will not allow the user to guess additional values.  If user is on a page called “EditRecord.aspx?ID=12”, you know for sure that some user is going to try typing in “EditRecord.aspx?ID=13” just to see what would happen.  Ideally in these cases you application will restrict the user from getting access to records that he/she/it/heshe should not have access to, but it’s a good idea to not encourage your users to go looking for these types of vulnerabilities.  If the user sees “EditRecord.aspx?ID=ceb143ce-b4e9-4d59-bedb-bb07c5f0eb41”, good luck trying to guess the another value from that table.

However, GUIDs are slow.  Again, if it’s indexed properly, it should perform sufficiently, but much like string fields GUIDs will never be as fast as integers.

Also, GUIDs make debugging a lot harder.  You can’t yell over the cube wall “hey Barack, can you take a look at record 49bebb82-a96c-4bfc-b222-0b9df372ec2a, it looks like you screwed something up again!”.  Even when you are just running a few queries to track down a problem in the middle of the night, it goes a lot quicker and burns fewer brain cycles if you can reference an easily-remembered number, like 8675309, instead of a 36-character identifier.

So in the end, I usually just use integer keys for everything.  That way I get the best performance, the easily debugging, and it fits my needs 95% of the time.  There still are always cases that you could use GUID identifier, so I just put an extra GUID column on those tables.  That gives us the fast and easy integer key lookups for most queries, and the GUID uniqueness added-on when we need it.  We’ve done that here for the AlmUser table, because there are a lot of cases that you don’t want to have users guessing the user ID number.

 

Reserved Words

Often times when you are creating a schema, you use table or field names that are reserved by SQL.  You may name a table “User” or name a field “Description”.  You may not think anything of it at the time, and SQL will let you do it without complaint, but then when you try to write queries against it you’ll get syntax errors or weird behavior.  A lot of developers know the frustration of typing that last character of the field name and watching the font color turn blue.

Of course, SQL provides a way around this.  By wrapping your identifiers in brackets (like [User].[Description]), you can help SQL avoid this confusion.  Any database layer that automatically generates SQL code should ALWAYS do this, because you never know what you might encounter.

However, typing in the brackets is a pain, and it hampers readability, and people often forget to do it, and it’s one of the last things I want to have to worry about while I’m writing code.

Most other programming platforms, like C+ or C#, prevent you from even using reserved words for your own purposes, because really, who needs the headache of keeping them straight?  They are called “reserved” for a very good reason, and I hate that SQL doesn’t actually “reserve” these words.

Anyhow, just because SQL doesn’t enforce the rule doesn’t mean that we can’t enforce it ourselves.  That being the case, stay away from this list of SQL Server 2008 reserved words.  You can use them as any part of a name, such as “AlmUser” or “ItemDescription”, but just don’t use them by themselves.

 

AlmUser Table

Just about every database I’ve designed that drives a web application includes some sort of a User table.  It may store authentication information, or personal information, or may just be an intersection point for everything else.

In this case we will store some basic identifying information, like FirstName and LastName.  We’ll expand it over time to include additional fields.  There is also a GUID column, in case we need to pass a User ID down to the browser, that is the version we’d like to use so that people can’t guess the other user’s IDs as well.

However, you may notice that it does not include the necessary fields for authentication, like Password.  We’re going to outsource that for now using the ASP.NET Membership Provider interface.  The default ASP.NET Membership Provider just stores the information in some database tables, but you can also have providers to connect to Active Directory, Open ID, or some any number of other third-party authentication systems.  For now, we’ll use the default provider, and we may store the tables in the Parallax database just to keep everything together, but there will be NO connection or references between those membership tables and the rest of the Parallax schema.  At any point in time, we will want to be able to pull that out and replace it with another provider.

Granted, this approach may be overkill, and it is usually not necessary for most applications to keep their membership information completely abstracted.  However, as I’ve mentioned, I’m not pretending that we building a practical application here with realistic design constraints; this is a sandbox where we will explore several technologies, so membership providers is one of those topics I’d like to explore.

 

Audit Columns

Every table ends in 4 fields, indicating who created the record and when, and who last updated the record and when.  Granted, this only displays two dimensions of the history, rather than a full history of who changed what when, but I’ve found it comes in handy when trying to figure out when the data went awry.

You’ll notice that we use the user GUIDs to store the user.  I’ve seen implementations where they used actual foreign keys to the user table, and it caused enormous problems.  The checking of the key constraint can slow the performance of inserts and updates, albeit slightly, but that’s still a big no-no.  Your auditing code should have no detrimental impact on the database.  Also, what happens if for some reason the ID value is wrong?  You really don’t want to reject the change for this reason, because these are just audit fields, not real transactional information.  It’s important that the value be right, but it’s a lot less important than the rest of the record.  Lastly, and the most crippling, is that by the time the database had grown to hundreds of tables, it was virtually impossible to delete a user record.  Anytime we tried, it would run through every key referencing the table (2 per table) to determine if there were any other records referencing the deleted record.  Due to the number of relationships and the volume of data, this would timeout after a few minutes.

So clearly, foreign keys are out.  We could still use the ID from the table without a key, but I see a few issues with that.  First, every other place in the database that stores an ID reference field has a foreign key.  That’s a core rule of the database, and I don’t want to give anyone, especially myself, the opening to start getting creative with whether to enforce foreign keys.  Also, the value is not very intuitive.  Lastly, I don’t want anyone to see that field, think that someone else forgot a key constraint, and go ahead and start adding them.  If you don’t make it clear in the code how your code is supposed to be used, it’s your own fault when it is inevitably misused.

So in the past I’ve learned towards putting the username in there, and it worked great.  It was straightforward, unique, succinct, and human-readable.  However, as we mentioned above when discussing natural keys, even usernames can change.

So we’re trying the user GUID this time around.  It’s not nearly as friendly as the username, but it certainly is unique.  And it avoids the usage confusion that you get with ID fields.  I’m still not crazy about this one, but we’ll see how it goes.

 

Data Types

NVARCHAR vs. VARCHAR

For a long time I always used VARCHAR, because I didn’t have a reason to use NVARCHAR.  Then, I ran into a few apps that needed to have NVARCHARs here and there because we had some international customers, and it was a pain keeping track of which fields were which.  So I just started using NVARCHAR for everything, but in hindsight I think that was a mistake.  Doug Blystone (co-founder of MMDB Solutions and a guy who knows more about SQL databases that I could ever hope to) has finally convinced me to stop using them unless absolutely necessary.  I usually shrug off performance issues for the sake of simplicity until they are a demonstrable problem, but this is one of the cases where it can lead to a hidden and pervasive performance drag that won’t jump out at you, but will slow you down just a little bit every step of the way.  These are the types of performance issues that add a second or two to the page load here and there, which makes you users subconsciously hate your guts, even if they don’t really know why.  It takes up twice as much space, it makes comparisons of string values take longer because there are twice as many bytes to compare, and it’s just not necessary unless you’re dealing with Asian languages (or anything language that may have thousands of characters).  We are years removed from the day that we can afford to hire someone to translate our applications to Japanese, and I’m guessing we’ll never do business in China because I hate Commies.

Simple Types

That being said, we’re using VARCHAR.  We’ll also be using UNIQUEIDENTIFIER, DATETIME, DATE, TIME, BIT, INT, FLOAT, TEXT, and maybe VARBINARY.  And that’s it.  No smallint or real or money or any of that crap.  We are going to use the minimum number of logically distinct data types that meet our needs.  I have no interest in getting into data-type arms war, trying to save a byte here or there to create some sort of idealistically pure representation of the data’s inner beauty.  I, nor any other developer, will waste a minute of time worrying about if we can cast this bigint into an integer or that real into a float without data loss.  We’re going to keep this simple, so help me God, and so you and your sql_variant friends better get the hell off my lawn before I get my shotgun.

I actually have used the new DATE and TIME fields yet, but I’ve long felt the need for them.  A DATETIME field is really two different things, which usually works out OK, but when you need to store just a date or just a time, it really gets clunky.  Then when you start having to worry about time zones, it can get just plain weird.  We have a few cases in SportsCommander now where we have to do a little code gymnastics to store just a date or just a time in a  DATETIME field, and have the inactive part properly ignored without interfering with the active part.  This will be my first foray into using these data types, so we’ll see how well they work.

 

So that’s a start for the database stuff.  It ran a little long for 3 tables, but you need to state your ground rules up front, even if it’s just for your own benefit, lest you start getting lazy and breaking them.  So here we go.