A while ago I had some posts about how to set up simple backups of SQL Azure to make up for a few holes in the tooling (here and here).  I recently ran into the same issue with RavenDB, and it required stringing a few pieces together, so I figured I’d write up the steps.

TL;DR

Yet again I started out to make a quick how-to, and ended up going into a lot of detail.  Anyhow, here’s the short version:

  1. Download s3.exe from s3.codeplex.com
  2. Run this:
Raven.Smuggler.exe out http://[ServerName]:[Port] [DatabaseName].dump --database=[DatabaseName]
s3 auth /nogui [AccessKeyID] [SecretAccessKey]
s3 put /nogui [S3BucketName]/[TargetS3Folder]/ [DatabaseName].dump

 

What is RavenDB?

RavenDB is a flat out awesome document database written in .NET.  It’s sort of like MongoDB or CouchDB, but very Windows- and Microsoft-friendly, and it has much better ACID and LINQ query support than your average document database.

Whether a document database is right for your project is a complicated question, well beyond the scope of this post, but if you decide you need one, and you’re doing .NET on Windows, RavenDB should be the first one you check out.

While document databases are not ideal for every situation, I’ve found them to be very good for message-based applications that pump data messages from one queue to another.  I’ve used RavenDB as the default primary data store for SrirachaDeploy and MMDB.DataService, and besides a few bumps in the road, it’s worked great.

Types of backups in RavenDB

RavenDB offers two flavors of backups.  One is the official “Backup and Restore” feature, which is very similar to a SQL Server backup/restore, including advanced options like incremental backups.  This does a low-level backup of the ESENT files, including index data.  Restores are all-or-nothing, so you can’t restore a backup into an existing database without overwriting its data in the process.

The other type is the “Smuggler” feature, which is more of a data import/export utility.  This generates an extract file that contains all of the documents, index definitions, and attachments (controllable by command line parameters).  It’s worth noting, though, that while Smuggler will back up the index definitions, it does not back up the index data, so after you import via Smuggler you may have to wait a few minutes for your indexes to rebuild, depending on your data size.  Since it’s just a data import, you can import into an existing database without deleting your existing records; it will just append the new records, and overwrite existing records if there is a matching ID.

The simplest way to get started with either option is to try them out in the RavenDB user interface.  The RavenDB UI is continually evolving, but as of this writing, under the Tasks section there are Import/Export Database options that use Smuggler, and a Backup Database option as well.

[Screenshot: the Import/Export Database and Backup Database options under the Tasks section of the RavenDB UI]

Personally, I prefer Smuggler.  It’s very easy to use, the defaults do what I want them to do most of the time, and it can do a low-impact data import into an existing database without blowing away existing test data.  Also, because the backup/restore feature uses some OS-level ESENT logic, it has some OS version portability limitations.  In the end, I usually don’t want anything too fancy or even incremental.  The first and foremost backup I want to get in place is “export ALL of my data in the most portable format possible on a regular schedule, so I can always get another server running if this one fails, and I can restore it on my machine to recreate a production issue”, and Smuggler has fit that bill nicely.

RavenDB Periodic Backup

RavenDB does have a very cool feature called “Periodic Backup”.  This option actually uses the Smuggler data export functionality, and runs incremental backups and uploads them to your hosting provider of choice (File system, Amazon Glacier, Amazon S3, or Azure storage).

[Screenshot: the Periodic Backup settings in the RavenDB UI]

The cool thing with this feature is that it’s easy to set up, without too many confusing options.  My problem is that it doesn’t quite have enough options for me, or rather the defaults are not what I really want.  Rather than doing incremental backups on a schedule, I want to be able to do a full backup any time I want.  Unfortunately it doesn’t (yet?) offer the option to force a full backup, nor to force a backup on demand or at a specific time of day.  I’m guessing that these features will continue to improve over time, but in the meantime this is not really what I’m looking for.

Smuggler Export 101

So how to get started with Smuggler?  Of course, visit the documentation here, but here’s the short version for everything I usually need to do.

First, open a command line. 

Yes, a command line.  What, you don’t like using the command line?  Oh well, deal with it.  I know, I hated the command line through much of my career, and I fought against it and complained about it.  Then I gave up and embraced it.  And guess what, it’s not that bad.  There are plenty of things that are just plain easier to do in a command line and don’t always need a pointy-clicky interface.  So please, just stop complaining and get over it; it’s all a part of being a developer these days.  If you refuse to use a command line, you are tying a hand behind your back and refusing to use some of the most powerful tools at your disposal.  Plus we are going to be scripting this to run every night, so guess what, that works a lot better with a command line.  I’ll throw in some basic command line tips as we go.

Anyhow, in your command line, go to the Smuggler folder under your RavenDB installation (usually C:\RavenDB\Smuggler on my machines).

Tip: You don’t have to type the whole name.  Type part of a folder name and hit TAB, and it will autocomplete with the first match.  Hit TAB a few more times and it will cycle through all of the matches.  You can even use a wildcard (like *.exe) with TAB and it will autocomplete the file name.

Type Raven.Smuggler.exe (or Raven + TAB a few times, or *.exe + TAB) to run Smuggler without any parameters, and you’ll get some detailed instructions.

[Screenshot: Raven.Smuggler.exe usage instructions]

The most common thing you’ll want to do here is back up a whole database to a file.  You do this with the command “Raven.Smuggler.exe out [ServerUrl] [OutputFileName]”.

Note: the instructions here will dump the System database (if you use http://localhost:8080/ or something similar as your URL), which is almost certainly not what you want.  It’s not entirely clear in the documentation, but the way to export a specific database instance is to use a URL like “http://[servername]:[port]/databases/[databasename]”, or to use the --database parameter at the end.  For example, to back up my local demo database, I would use the command:

Raven.Smuggler.exe out http://localhost:8080/databases/demodb demodb.dump

or

Raven.Smuggler.exe out http://localhost:8080 demodb.dump --database=demodb

 

And off it goes:

[Screenshot: Smuggler exporting the database]

Depending on the size of your database, this may take a few seconds to a few minutes.  Generally it’s pretty fast, but if you have a lot of attachments, that seems to slow it down quite a bit.  Once it’s done, you can see your output file in the same directory:

[Screenshot: the exported .dump file in the Smuggler directory]

Tip: Are you in a command line directory and really wish you had an Explorer window open in that location?  Run “explorer %cd%” to launch a new Explorer window pointed at your current directory.  Note: this doesn’t always work, like if you’re running the command line window in administrator mode.

Yes, that’s not a very big file, but it’s a pretty small database to start with.  Obviously they can get much bigger, and I usually see backups getting up to a few hundred MB or a few GB.  You could try to compress it with your favorite command line compression tool (I really like 7-Zip), but it’s not going to get you much.  RavenDB already does a good job of compressing the content while it’s extracting it via Smuggler.

Amazon S3

Next, you have to put it somewhere, preferably as far away from this server as possible.  A different machine is a must; a different data center or even a different hosting company is even better.  For me, one of the cheapest/easiest/fastest places to put it is Amazon S3.

There are a few ways to get the file up to S3.  The first option is to upload it straight from Amazon’s S3 website, although that can require installing Java, and you may not be into that kind of thing.  Also, that’s not really scriptable.

Or you could use S3 Browser, which is an awesome tool.  For end user integration with S3, it’s great; I’ve gladly paid the nominal charge for a professional license, and I’ve recommended that all of my S3-enabled clients do the same.  However, while it’s a great UI tool for S3, it is not very scripting-friendly.  It stores your S3 connection information in your Windows user profile, which means that if you want to script it, you need to log in as that user first, set up the S3 profile in S3 Browser, and then make sure you run the tool under that same user account.  That’s a lot of headache I don’t really want to worry about setting up, much less remembering in 6 months when I need to change something.

One great S3 backup tool is CloudBerry.  It’s not free, but it’s relatively inexpensive, and it’s really good for scheduling backups of specific files or folders to S3.  Depending on your wallet tolerance, this may be the best option for you.

But you may want a free option, and you’re probably asking, “why is it so hard to just push a file to S3?  Isn’t that just a few lines of AWSSDK code?”  Well yeah, it is.  Actually it can be quite a few lines, but yeah, it’s not rocket science.  Luckily there is a great tool on CodePlex that lets you do this: http://s3.codeplex.com/.  It’s a simple command line tool with one-line commands like “put” and “auth” to do the simplest tasks.  To push your new file to S3, it would just be:

s3 auth /nogui [AccessKeyID] [SecretAccessKey]
s3 put /nogui [S3BucketName]/[TargetS3Folder]/ [FileToUpload]

 

So if we wanted to upload our new demodb.dump file to S3, it would look something like this:

[Screenshot: s3.exe uploading the dump file]

And hey, there’s our backup on S3. 

[Screenshot: the backup file in the S3 bucket]

 

So we now have a 3-line database backup script:

Raven.Smuggler.exe out http://[ServerName]:[Port] [DatabaseName].dump --database=[DatabaseName]
s3 auth /nogui [AccessKeyID] [SecretAccessKey]
s3 put /nogui [S3BucketName]/[TargetS3Folder]/ [DatabaseName].dump

 

Just put that in a batch file and set up a Windows Scheduled Task to run it whenever you want.  Simple enough, eh?
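For example, a bare-bones backup.cmd might look something like this (the server URL, database name, bucket, and folder are just placeholders pulled from the examples above; adjust them for your environment):

@echo off
rem Nightly RavenDB export and upload to S3
cd /d C:\RavenDB\Smuggler

rem Export the whole database to a dump file
Raven.Smuggler.exe out http://localhost:8080 demodb.dump --database=demodb

rem Push the dump file up to S3
s3 auth /nogui [AccessKeyID] [SecretAccessKey]
s3 put /nogui [S3BucketName]/[TargetS3Folder]/ demodb.dump

Then register it in the Task Scheduler UI, or from the command line with something along the lines of schtasks /create /tn "RavenDB Backup" /tr C:\Scripts\backup.cmd /sc daily /st 02:00 (that path is just an example, use wherever you put the batch file).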

Restoring via Smuggler

So now that you have your RavenDB database backed up, what do you do with it?  You’re going to test your restore process regularly, right?  Right?

First you need to get a copy of the file to your machine.  You could easily write a script using the S3 tool to download the file, and I’ll leave that as an exercise for the reader.  I usually just pull it down with S3 Browser whenever I need it.

So once you have it downloaded, you just need to call Smuggler again to import it.  It’s the same call as the export, just change “out” to “in”.  For example, to import our demodb back into our local server, into a new DemoDBRestore database instance, we would say:

Raven.Smuggler.exe in http://localhost:8080 demodb.dump --database=DemoDBRestore

And we would see:

[Screenshot: Smuggler importing the dump file]

And then we have our restored database up and running in RavenDB:

[Screenshot: the restored DemoDBRestore database in the RavenDB UI]

Conclusion

Now, I’m not a backup wizard.  I’m sure there are better ways to do this, with incremental backups and 3-way offsite backups and regular automated restores to a disaster recovery site and all sorts of fancy stuff like that.  The bigger and more critical your application becomes, the more important it is to have those solutions in place.  But on day 1, starting out on your project, you need to have something in place, and hopefully this helps you get started.

If you’re following along, I’ve recently named this effort “Parallax” for long and windy reasons.  You can read all about it here.

I’ve copied the project up to CodePlex, at http://parallax.codeplex.com.  Feel free to take a look.

There’s not much there yet, just a few database tables, but it can get us started talking about the approach we’re taking to the database.

 

Here’s what we have so far:

[Diagram: ParallaxDB_1, the database schema so far]

 

There are a few things going on here:

Singular Table Names

I’ve used singular table names for all tables.  Whether to use singular or pluralized table names is a long and well-worn debate.

For years, I was a strong proponent of the pluralized table names.  At the time, it was probably mostly because that’s what I had used before, and like most people I tend to come up with rational arguments to support my subjective preferences, because cognitive dissonance is yucky.

Plural Table Names Are Good

One of the main reasons for plural table names, at least for me, is that from one perspective it matches what the table actually represents.  It is a container that holds a list of work items, or a list of users, or a list of whatever.  If we’re going to store a list of work items in a container, by golly we should call that thing “WorkItems”.  It’s the same logic that is applied to object models; if your Person object contains a list of addresses, you’re going to call the property something like “Addresses” or “AddressList”, not “Address”.  The idea of describing a whole list of customers as just “Customer” rubbed me the wrong way.

Singular Table Names Are More Gooder

However, there are also several benefits to using singular table names.  First, it avoids a lot of confusion about the proper way to pluralize a name. 

Should the table that contains a list of person records be called “People” or “Persons”?  “People” is the more common name, but “Persons” has the benefit of simplicity, because it is just appending an “s” to the end.  I’ve heard this argument go back and forth, and it doesn’t matter who’s right or wrong, because the fact that you’re discussing it at all is just a damn waste of time.  You are introducing artificial and unnecessary complexity to your project, which you and every other developer will have to worry about for the life of the project.  Hidden productivity costs like this add up significantly over time, and they suck the life out of your developers, oftentimes without them even realizing why they hate their job.

 

Also, assuming you follow some sort of standardized naming convention for primary and foreign keys, this can improve the readability and writability of your code, because you will not be mixing singular and plural names.  To me, this:


SELECT * FROM WorkItem
INNER JOIN WorkItemType
ON WorkItem.WorkItemTypeID = WorkItemType.ID

 

Flows a little better than this:


SELECT * FROM WorkItems
INNER JOIN WorkItemTypes
ON WorkItems.WorkItemTypeID = WorkItemTypes.ID

 

Lastly, if you ever want to have scripts for performing automated maintenance or generating code, it’s a lot easier to deal with a single form of each name and not have to worry about pluralization.  We’re going to be doing at least a little of that in Parallax.
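As a quick illustration of what I mean (this is just a throwaway sketch, not anything from the actual Parallax code), a script that generates names and stubs from the table list never has to guess at singularizing or pluralizing anything when every table name is already singular:

-- Generate a stub procedure name and class name for every table in the database;
-- with singular table names there is no pluralization logic to get wrong
SELECT
	t.name AS TableName,
	'Get' + t.name + 'ByID' AS ProcedureName,
	t.name + 'Record' AS ClassName
FROM sys.tables t
ORDER BY t.name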

Keep It Consistent

So, like most things, there are valid reasons for either approach, and in the end it probably doesn’t matter too much which one you choose.  However, what does matter is that you do choose one and stick to it.  Switching between plural and singular names is one of the first steps towards chaos.  Your developers will have to continuously waste an unacceptable amount of time stopping to think “was that table called Person, or Persons, or People, or Peoples, or Peeps, or…?”

 

Primary and Foreign Keys

When I create a table, the first thing I do is create a clustered integer primary key named “ID”.  99% of the time, I do it every time.

Naming Conventions

Could I include the table name in the primary key column, e.g. “WorkItemID” instead of “ID”?  Sure, but I chose not to, again for subjective reasons.  I feel that the primary key is a really important field on the table, and it should jump out in a query as the single most important and identifying field in the table.  When you look at a query and are trying to visualize exactly what it is doing, I feel that it makes it just a little bit clearer to see the primary keys first and work out from there.  Of course, one shortcoming is that it can get a little unclear when you start using vague subquery aliases, like “subq.ID”, but I think that is just another good reason not to use vague aliases in the first place.

For foreign keys, I just add “ID” to the referenced table name, such as “WorkItemID”.  This way, when you’re writing a query, you never really have to stop and think what the field name is, or worse yet, go look it up in the schema.  Now there are cases where this doesn’t work, like when you have multiple independent foreign keys in a table referencing the same table.  For example, the WorkItem table references the AlmUser table for two different fields, the reporting user and the assigned-to user, and we’ll be adding several more shortly.  In this case, we can’t call them both “AlmUserID”, so we’ll add a prefix to identify their purpose: “ReportingAlmUserID” and “AssignedToAlmUserID”.  This can also be used in cases where the reason for the key is not immediately evident, but that starts down a slippery slope of naming things certain ways just because you feel like it on a certain day, which should be avoided as much as possible.
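To make that concrete, here’s a rough sketch of what the WorkItem table’s keys might look like under these conventions (the column list is trimmed down and the constraint names are my own picks for illustration, not copied from the actual Parallax schema):

CREATE TABLE WorkItem
(
	-- Surrogate primary key, clustered on the auto-incrementing ID
	ID INT IDENTITY(1,1) NOT NULL,
	-- Foreign key named after the referenced table plus "ID"
	WorkItemTypeID INT NOT NULL,
	-- Two independent references to AlmUser, so each gets a prefix describing its purpose
	ReportingAlmUserID INT NOT NULL,
	AssignedToAlmUserID INT NULL,
	CONSTRAINT PK_WorkItem PRIMARY KEY CLUSTERED (ID),
	CONSTRAINT FK_WorkItem_WorkItemType FOREIGN KEY (WorkItemTypeID) REFERENCES WorkItemType (ID),
	CONSTRAINT FK_WorkItem_ReportingAlmUser FOREIGN KEY (ReportingAlmUserID) REFERENCES AlmUser (ID),
	CONSTRAINT FK_WorkItem_AssignedToAlmUser FOREIGN KEY (AssignedToAlmUserID) REFERENCES AlmUser (ID)
)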

Clustered Index

What does the clustered part mean?  A table can have a single clustered index, which means that every row in the table will actually be physically stored in that order, so retrieving records by that index is faster than with a standard index.  The tradeoff is that inserting new records into the middle of the index can be slow, because it affects the physical layout of the table.  However, this works especially well for auto-incrementing integer fields, because new records always go at the end of the table and the end of the index.

Natural/Intelligent Keys

This extra, auto-generated ID column approach is called a surrogate key, which means that it is not derived from any application data.  So why reserve a whole extra field for this?  Another approach is to use a “natural” or “intelligent” key, which is a naturally occurring piece of data that is sure to be unique across all the rows.  One of the more common examples you see is using Social Security Number as the key for an Employee table.

There are several reasons not to use natural keys, but the first and most important to me is: Get Your Damn Business Rules Out Of My Schema.  I’m designing a database here, and my primary concern is that the data can be reliably and efficiently stored and retrieved.  You may say today that every employee has a Social Security Number and that it’s guaranteed to be unique, and then in 6 months we need to change the application because we need to enter employee records before we have their SSN, or somehow a dead employee’s SSN got reissued to a new employee, or a new government regulation restricts you from storing SSNs in the database, or about a million other things that you won’t think of because you’re really not nearly as smart as you think you are.  If you’ve built your entire system around the assumption that a piece of data will be unique, you’ve really painted yourself into a corner when that assumption is inevitably proven wrong.  Wouldn’t you rather just start with a single ID field that you can always rely on to be unique?  Its only purpose in life is to make sure that your data can be reliably identified, and that makes me feel warm and fuzzy.

And for a very real case of this: just this past fall I had to change SportsCommander to allow users to change their username.  Why would you ever allow users to do this?  Because they needed to.  The original implementation auto-generated cryptic usernames, which was only done to match a legacy system it had replaced, but the users could never remember their username.  So despite how much it drove me nuts to change this after the fact, we accepted that it was critically important to our users’ satisfaction with the site to allow them to select a new username.  In the end, the technology is less important, and you either have fantastically happy users or you’re selling flowers by the highway.  It would have been easy to say up front “users can never change their username, why would we ever allow that?”, and if we had used the username as a primary key, we’d really be screwed.  Luckily, we already knew that we were simpleminded fools who could not predict the future, and did our best to protect ourselves against that.

Of course, in the original version of SportsCommander, we could have also used the USA Volleyball ID number to identify users, because every user in the system is a USAV member with their own number.  However, it turned out that some users had their USAV number change from one season to the next.  Also, we subsequently expanded SportsCommander to support organizations beyond USA Volleyball, so the identifier further ceased to be the grandly unique identifier it originally seemed to be.

 

Also, it’s slow.  Sure, if it’s indexed, it can be really fast, but a lookup on a string field will never be as fast as on an integer field.  It also undermines any gain achieved by using clustered indexes, because your data is no longer sequential.  The difference may be negligible a lot of the time, and I definitely don’t want to start doing premature optimization or complicating the code to squeeze out a tiny bit of performance.  However, this is a small thing that can scale significantly depending on the amount of data and tables you have, because you are constantly doing these lookups against almost every row you touch.

Of course, natural keys aren’t all bad.  One benefit of natural keys is that they can make bulk importing or synchronizing of data easier.  For example, imagine that you are importing a list of employee records and also a list of their timesheet hours.  You’ll make two passes through the data, once for the employee records themselves, and then a second pass for the timesheet hours.  During the second pass, you’ll need to correlate the hours with the employee records that you’ve already inserted, and it’s a lot easier and quicker if you can use a value you already have (like the SSN) than to have to use that value to look up another arbitrary value (like an ID field).  However, in the end, I really don’t care about this all that much.  It’s a small subset of the usage of the database, and I feel it pales in comparison to the other factors.  Sure, it might make a DBA’s life a little more difficult for some data import they have to run in the background from time to time, but honestly I really don’t give a damn.  Life is full of compromises, and the total system is almost always better off, in terms of stability and total development time, with a more consistent and reliable surrogate key.

GUID Keys

So why not use Globally Unique Identifiers (GUIDs) for keys instead of numeric IDs?  GUIDs definitely have their place, but it really depends on what you’re trying to do.

GUIDs have a lot of benefits due to the fact that they are always globally unique, while your cute little identity (i.e. auto-incrementing) ID is only unique within your table.  The identity isn’t even unique throughout the database, because each table maintains its own counter (this of course assumes SQL Server, not Oracle; I don’t know how MySQL handles this because I really don’t care).

Having globally unique keys can be useful if you are doing any replication from one database to another, or if you need to store the data in several different databases for other reasons, because you can assign a GUID ID to a record and be assured that it won’t conflict with an ID from another database.  For example, if you had a database that stored a master list of users, and those users were represented in several satellite databases, maybe for several separate web applications at the same company, GUID keys would make sense because the data by its very nature should be unique beyond the database.

Another benefit is that because they are non-sequential, you don’t have to worry about people guessing adjacent values.  If you are going to include any IDs in a query string, it’s usually important to use values that will not allow the user to guess additional values.  If a user is on a page called “EditRecord.aspx?ID=12”, you know for sure that some user is going to try typing in “EditRecord.aspx?ID=13” just to see what would happen.  Ideally in these cases your application will restrict the user from getting access to records that he/she/it/heshe should not have access to, but it’s a good idea not to encourage your users to go looking for these types of vulnerabilities.  If the user sees “EditRecord.aspx?ID=ceb143ce-b4e9-4d59-bedb-bb07c5f0eb41”, good luck trying to guess another value from that table.

However, GUIDs are slow.  Again, if it’s indexed properly, it should perform sufficiently, but much like string fields GUIDs will never be as fast as integers.

Also, GUIDs make debugging a lot harder.  You can’t yell over the cube wall “hey Barack, can you take a look at record 49bebb82-a96c-4bfc-b222-0b9df372ec2a, it looks like you screwed something up again!”.  Even when you are just running a few queries to track down a problem in the middle of the night, it goes a lot quicker and burns fewer brain cycles if you can reference an easily-remembered number, like 8675309, instead of a 36-character identifier.

So in the end, I usually just use integer keys for everything.  That way I get the best performance and the easiest debugging, and it fits my needs 95% of the time.  There are still always cases where you could use a GUID identifier, so I just put an extra GUID column on those tables.  That gives us the fast and easy integer key lookups for most queries, and the GUID uniqueness added on when we need it.  We’ve done that here for the AlmUser table, because there are a lot of cases where you don’t want to have users guessing the user ID number.
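As a minimal sketch of that combination (the GUID column name here is just my placeholder, not necessarily what ends up in the actual schema):

CREATE TABLE AlmUser
(
	-- Fast integer surrogate key for joins and lookups
	ID INT IDENTITY(1,1) NOT NULL,
	-- Non-guessable identifier for anything exposed to the browser
	UserGuid UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
	FirstName VARCHAR(100) NULL,
	LastName VARCHAR(100) NULL,
	CONSTRAINT PK_AlmUser PRIMARY KEY CLUSTERED (ID),
	CONSTRAINT UQ_AlmUser_UserGuid UNIQUE (UserGuid)
)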

 

Reserved Words

Oftentimes when you are creating a schema, you use table or field names that are reserved by SQL.  You may name a table “User” or name a field “Description”.  You may not think anything of it at the time, and SQL will let you do it without complaint, but then when you try to write queries against it you’ll get syntax errors or weird behavior.  A lot of developers know the frustration of typing that last character of a field name and watching the font color turn blue.

Of course, SQL provides a way around this.  By wrapping your identifiers in brackets (like [User].[Description]), you can help SQL avoid this confusion.  Any database layer that automatically generates SQL code should ALWAYS do this, because you never know what you might encounter.
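For example, a query against that hypothetical “User” table (with its hypothetical “Description” field) parses cleanly once everything is bracketed:

SELECT [User].[ID], [User].[Description]
FROM [User]
WHERE [User].[Description] LIKE '%admin%'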

However, typing in the brackets is a pain, and it hampers readability, and people often forget to do it, and it’s one of the last things I want to have to worry about while I’m writing code.

Most other programming platforms, like C++ or C#, prevent you from even using reserved words for your own purposes, because really, who needs the headache of keeping them straight?  They are called “reserved” for a very good reason, and I hate that SQL doesn’t actually “reserve” these words.

Anyhow, just because SQL doesn’t enforce the rule doesn’t mean that we can’t enforce it ourselves.  That being the case, stay away from this list of SQL Server 2008 reserved words.  You can use them as any part of a name, such as “AlmUser” or “ItemDescription”, but just don’t use them by themselves.

 

AlmUser Table

Just about every database I’ve designed that drives a web application includes some sort of a User table.  It may store authentication information, or personal information, or may just be an intersection point for everything else.

In this case we will store some basic identifying information, like FirstName and LastName.  We’ll expand it over time to include additional fields.  There is also a GUID column; if we need to pass a user ID down to the browser, that is the version we’d like to use, so that people can’t guess other users’ IDs.

However, you may notice that it does not include the necessary fields for authentication, like Password.  We’re going to outsource that for now using the ASP.NET Membership Provider interface.  The default ASP.NET Membership Provider just stores the information in some database tables, but you can also have providers that connect to Active Directory, OpenID, or any number of other third-party authentication systems.  For now, we’ll use the default provider, and we may store the tables in the Parallax database just to keep everything together, but there will be NO connections or references between those membership tables and the rest of the Parallax schema.  At any point in time, we want to be able to pull that out and replace it with another provider.

Granted, this approach may be overkill, and it is usually not necessary for most applications to keep their membership information completely abstracted.  However, as I’ve mentioned, I’m not pretending that we’re building a practical application here with realistic design constraints; this is a sandbox where we will explore several technologies, and membership providers are one of those topics I’d like to explore.

 

Audit Columns

Every table ends in 4 fields, indicating who created the record and when, and who last updated the record and when.  Granted, this only captures the creation and the most recent update, rather than a full history of who changed what when, but I’ve found it comes in handy when trying to figure out when the data went awry.

You’ll notice that we use the user GUIDs to store the user.  I’ve seen implementations that used actual foreign keys to the user table, and it caused enormous problems.  The checking of the key constraint can slow the performance of inserts and updates, albeit slightly, but that’s still a big no-no; your auditing code should have no detrimental impact on the database.  Also, what happens if for some reason the ID value is wrong?  You really don’t want to reject the change for that reason, because these are just audit fields, not real transactional information.  It’s important that the value be right, but it’s a lot less important than the rest of the record.  Lastly, and most crippling, by the time the database had grown to hundreds of tables, it was virtually impossible to delete a user record.  Anytime we tried, it would run through every key referencing the table (2 per table) to determine whether there were any other records referencing the deleted record.  Due to the number of relationships and the volume of data, this would time out after a few minutes.

So clearly, foreign keys are out.  We could still use the ID from the table without a key, but I see a few issues with that.  First, every other place in the database that stores an ID reference field has a foreign key.  That’s a core rule of the database, and I don’t want to give anyone, especially myself, the opening to start getting creative with whether to enforce foreign keys.  Also, the value is not very intuitive.  Lastly, I don’t want anyone to see that field, think that someone else forgot a key constraint, and go ahead and start adding them.  If you don’t make it clear in the code how your code is supposed to be used, it’s your own fault when it is inevitably misused.

So in the past I’ve leaned towards putting the username in there, and it worked great.  It was straightforward, unique, succinct, and human-readable.  However, as we mentioned above when discussing natural keys, even usernames can change.

So we’re trying the user GUID this time around.  It’s not nearly as friendly as the username, but it certainly is unique.  And it avoids the usage confusion that you get with ID fields.  I’m still not crazy about this one, but we’ll see how it goes.
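As a sketch, continuing the hypothetical WorkItem example from above, the audit columns end up looking something like this (the exact column names are just my placeholders):

-- Audit columns appended to every table; the GUIDs point at AlmUser.UserGuid,
-- deliberately without a foreign key constraint
ALTER TABLE WorkItem ADD
	CreatedByUserGuid UNIQUEIDENTIFIER NOT NULL,
	CreatedDateTimeUtc DATETIME NOT NULL,
	UpdatedByUserGuid UNIQUEIDENTIFIER NOT NULL,
	UpdatedDateTimeUtc DATETIME NOT NULL

(Adding NOT NULL columns like this only works on a brand-new, empty table; an existing table with data would need default values.)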

 

Data Types

NVARCHAR vs. VARCHAR

For a long time I always used VARCHAR, because I didn’t have a reason to use NVARCHAR.  Then I ran into a few apps that needed to have NVARCHARs here and there because we had some international customers, and it was a pain keeping track of which fields were which.  So I just started using NVARCHAR for everything, but in hindsight I think that was a mistake.  Doug Blystone (co-founder of MMDB Solutions and a guy who knows more about SQL databases than I could ever hope to) has finally convinced me to stop using them unless absolutely necessary.  I usually shrug off performance issues for the sake of simplicity until they are a demonstrable problem, but this is one of the cases where it can lead to a hidden and pervasive performance drag that won’t jump out at you, but will slow you down just a little bit every step of the way.  These are the types of performance issues that add a second or two to the page load here and there, which makes your users subconsciously hate your guts, even if they don’t really know why.  NVARCHAR takes up twice as much space, it makes comparisons of string values take longer because there are twice as many bytes to compare, and it’s just not necessary unless you’re dealing with Asian languages (or any language that may have thousands of characters).  We are years removed from the day that we can afford to hire someone to translate our applications to Japanese, and I’m guessing we’ll never do business in China because I hate Commies.
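If you want to see the size difference for yourself, a quick check in SQL Server shows the doubled byte count:

-- Returns 5 bytes for the VARCHAR value and 10 bytes for the NVARCHAR value
SELECT
	DATALENGTH(CAST('Hello' AS VARCHAR(50))) AS VarcharBytes,
	DATALENGTH(CAST('Hello' AS NVARCHAR(50))) AS NvarcharBytes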

Simple Types

That being said, we’re using VARCHAR.  We’ll also be using UNIQUEIDENTIFIER, DATETIME, DATE, TIME, BIT, INT, FLOAT, TEXT, and maybe VARBINARY.  And that’s it.  No smallint or real or money or any of that crap.  We are going to use the minimum number of logically distinct data types that meet our needs.  I have no interest in getting into a data-type arms war, trying to save a byte here or there to create some sort of idealistically pure representation of the data’s inner beauty.  Neither I nor any other developer will waste a minute of time worrying about whether we can cast this bigint into an integer or that real into a float without data loss.  We’re going to keep this simple, so help me God, so you and your sql_variant friends had better get the hell off my lawn before I get my shotgun.

I actually haven’t used the new DATE and TIME fields yet, but I’ve long felt the need for them.  A DATETIME field is really two different things, which usually works out OK, but when you need to store just a date or just a time, it really gets clunky.  Then when you start having to worry about time zones, it can get just plain weird.  We have a few cases in SportsCommander now where we have to do a little code gymnastics to store just a date or just a time in a DATETIME field, and have the inactive part properly ignored without interfering with the active part.  This will be my first foray into using these data types, so we’ll see how well they work.
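For example (the table and column names here are made up for illustration), the new types let the schema say exactly what it means, instead of stuffing a date and a time into two DATETIME fields and ignoring half of each:

-- An event that happens on a particular date with a daily start time,
-- stored in the types that actually match the data
CREATE TABLE EventSchedule
(
	ID INT IDENTITY(1,1) NOT NULL,
	EventDate DATE NOT NULL,
	DailyStartTime TIME NOT NULL,
	CONSTRAINT PK_EventSchedule PRIMARY KEY CLUSTERED (ID)
)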

 

So that’s a start for the database stuff.  It ran a little long for 3 tables, but you need to state your ground rules up front, even if it’s just for your own benefit, lest you start getting lazy and breaking them.  So here we go.