So a while ago I wrote about my adventures in SQL Azure backups.  At the time, there was very little offered by either Microsoft or tool vendors to provide an easy solution for scheduling SQL Azure backups.  So in the end, I cobbled together a solution involving batch files, Task Scheduler, and most importantly Red Gate SQL Compare and SQL Data Compare.

But much has changed in the past year.  Red Gate released their new SQL Azure Backup product, whose functionality looks freakishly similar to the less polished solutions that people had written about.  The cool part is that while the SQL Compare solution I originally proposed required a purchased copy of the Red Gate SQL tools, Red Gate has been nice enough to release their Azure backup tool for free.

Also, Microsoft has released a CTP version of their SQL Import/Export Service.  This service allows you to back up and restore your database using Azure Blob storage instead of having to download it to a local database server, which is actually what most of us really wanted in the first place anyway.  The latest versions of Red Gate’s Azure Backup also support this functionality, which gives you a lot of options.

So just to close the loop on this, here’s the updated batch script file we’re using for SportsCommander now for doing regular production backups of our database.  We’re opting to use the Import/Export functionality as our primary backup strategy:

SET SqlAzureServerName=[censored]
SET SqlAzureUserName=[censored]
SET SqlAzurePassword=[censored]
SET SqlAzureDatabaseName=[censored]

SET AzureStorageAccount=[censored]
SET AzureStorageKey=[censored]
SET AzureStorageContainer=[censored]

for /f "tokens=1-4 delims=/- " %%a in (‘date /t’) do set XDate=%%d_%%b_%%c
for /f "tokens=1-2 delims=: " %%a in (‘time /t’) do set XTime=%%a_%%b

SET BackupName=SportsCommander_Backup_%XDate%_%XTime%


C:\SQLBackups\RedGate.SQLAzureBackupCommandLine.exe /AzureServer:%SqlAzureServerName% /AzureDatabase:%SqlAzureDatabaseName% /AzureUserName:%SqlAzureUserName% /AzurePassword:%SqlAzurePassword% /CreateCopy /StorageAccount:%AzureStorageAccount% /AccessKey:%AzureStorageKey% /Container:%AzureStorageContainer% /Filename:%BackupName%.bacpac

 

A few notes:

- This runs the same Import/Export functionality you can get through the Azure portal.  If you have any problems with the parameters here, you can experiment in the Azure portal.

- The AzureStorageAccount parameter is the account name of your storage account.  So if your blob storage URL is http://myawesomeapp.blob.core.windows.net, you would want to use “myawesomeapp” in this parameter.

- The /CreateCopy parameter will use SQL Azure’s CREATE DATABASE AS COPY OF method to create a snapshot first and then back that up, instead of just backing up the live database.  This takes a little extra time, but it is important to ensure that you are getting a transactionally consistent backup (see the sketch after this list).
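
For the curious, the copy that /CreateCopy kicks off is roughly what you could do yourself against the master database on your Azure server.  This is just a minimal sketch to show the idea; the database names here are placeholders, and the real tool also waits for the copy to finish before exporting it:

CREATE DATABASE MyDatabase_Copy AS COPY OF MyDatabase
GO
-- The copy runs asynchronously, so check its state before trying to export it
SELECT name, state_desc FROM sys.databases WHERE name = 'MyDatabase_Copy'
GO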

 

Of course, if you still want to copy down a local instance of the database like we did in the previous post, you can easily do that too:

SET SqlAzureServerName=[censored]
SET SqlAzureUserName=[censored]
SET SqlAzurePassword=[censored]
SET SqlAzureDatabaseName=[censored]

SET LocalSqlServerName=[censored]
SET LocalSqlUserName=[censored]
SET LocalSqlPassword=[censored]

for /f "tokens=1-4 delims=/- " %%a in (‘date /t’) do set XDate=%%d_%%b_%%c
for /f "tokens=1-2 delims=: " %%a in (‘time /t’) do set XTime=%%a_%%b

SET BackupName=SportsCommander_Backup_%XDate%_%XTime%

C:\SQLBackups\RedGate.SQLAzureBackupCommandLine.exe /AzureServer:%SqlAzureServerName% /AzureDatabase:%SqlAzureDatabaseName% /AzureUserName:%SqlAzureUserName% /AzurePassword:%SqlAzurePassword% /CreateCopy /LocalServer:%LocalSqlServerName% /LocalDatabase:%BackupName% /LocalUserName:%LocalSqlUserName% /LocalPassword:%LocalSqlPassword% /DropLocal

 

Good luck.

The Error

So if you are working in SQL Azure, you’ve probably learned the hard way that you can’t just script out your DDL changes in your local SQL Management Studio and run it against your Azure database.  It throws in a whole bunch of extra fancy-pants DBA-y stuff that SQL Azure just doesn’t let you use. 

For example, say I throw together a simple table in my local database.  Depending on your SQL source control approach (you have one, right?), you might script it out in SQL Management Studio and get something like this:

CREATE TABLE MyWonderfulSampleAzureTable (
[ID] [int] IDENTITY(1,1) NOT NULL,
[GoEagles] [varchar](50) NOT NULL,
[BeatThemGiants] [varchar](50) NOT NULL,
[TheCowboysAreAwful] [bit] NOT NULL,
[AndyReidForPresident] [varchar](50) NULL,
CONSTRAINT [PK_MyWonderfulSampleAzureTable] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
)
ON [PRIMARY]
GO

 

Pretty enough, no?  Sure, it’s full of a lot of gibberish you don’t really care about, like PAD_INDEX=OFF, but hey if it runs, what’s the problem?

Now, let’s run that against our SQL Azure database:

Msg 40517, Level 16, State 1, Line 10
Keyword or statement option 'pad_index' is not supported in this version of SQL Server.

 

Oops.  This is a pain to fix when you’re deploying a single script.  However, when you’re running a whole development cycle’s worth of these against your production database at 3 AM and it chokes on one of these scripts, this is absolutely brutal.

Censorship Brings Peace

So why does this happen?  Why can’t SQL Azure handle these types of cool features?  Mostly because they just don’t want to.  Sure, some of the features missing from SQL Azure are because they just haven’t been implemented yet, but some of them are deliberately disabled to prevent unmitigated chaos.

While you may have a DBA managing your on-premise database who is working in your best interest (or at least your company’s interest), SQL Azure has a much bigger problem to solve.  They need to provide a shared SQL environment that does not let any one consumer hose up everyone else.  If you’ve ever hosted a SQL database in a high-traffic shared hosting environment, you’ve probably felt the pain of some joker going cuckoo-bananas with the database resources.

In other words, what you do in the privacy of your own home is all well and good, but if you are going to go play bocce in the public park, you’re certainly going to have to watch your language and act like a civilized person.

And a lot of these features you don’t really have to care about anyway.  No doubt, you are really really smart and know when your indexes should be recompiled, but the reality is that much of the time whatever algorithm the folks on the SQL team came up with is going to be a little bit smarter than you, Mr. SmartyPants.

Anyhow, for your edification, here’s a wealth of information about the stuff you can’t do.

The Manual Workaround

So how do we get our script to run?  My general rule of thumb is to rip out all of the WITH stuff and all of the file group references:

CREATE TABLE MyWonderfulSampleAzureTable (
[ID] [int] IDENTITY(1,1) NOT NULL,
[GoEagles] [varchar](50) NOT NULL,
[BeatThemGiants] [varchar](50) NOT NULL,
[TheCowboysAreAwful] [bit] NOT NULL,
[AndyReidForPresident] [varchar](50) NULL,
CONSTRAINT [PK_MyWonderfulSampleAzureTable] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
)
GO

See, I had a general rule of thumb for this, because I encountered it a lot.  On just about every DDL script I had to generate.  And I missed a lot of them.  Quite the pain in the neck.

 

The Much Better, Not-So-Manual Workaround

So I was at a Philly.NET user group meeting last night and Bill Emmert from Microsoft was walking through the options for migrating SQL Server databases, and he showed us this setting that I wish I had known about a year ago:

[Screenshot: SQLAzureSettings]

If you change this to “SQL Azure Database”, it will change your environment settings to always generate SQL scripts that are compatible with SQL Azure.  No more manual script trimming!  Unless, of course, you are into that kind of thing, in which case you really wasted the last 10 minutes reading this.

Good Luck.

Why?

Last year we launched a new version of SportsCommander.com, which offered volleyball organizations across the country the ability to promote their tournaments and accept registrations for a negligible fee.  Having grown out of our previous hosting company, we tried hosting the platform on Windows Azure, and for the most part it’s been great.  Also, the price was right.

We are also hosting our data in SQL Azure, which again for the most part has been great.  It has performed well enough for our needs, and it abstracts away a lot of the IT/DBA maintenance issues that we would really rather not worry about.

Of course, nothing is perfect.  We’ve had a few snags with Azure, all of which we were able to work around, but it was a headache. 

One of the biggest issues for us was the ability to run regular backups of our data, for both disaster recovery and testing purposes.  SQL Azure does a great job of abstracting away the maintenance details, but one of the things you lose is direct access to the SQL backup and restore functionality.  This was almost a deal-breaker for us.

Microsoft’s response to this issue is that they handle all of the backups and restores for you, so that if something went wrong with the data center, they would handle getting everything up and running again.  Obviously this only solves part of the problem, because many companies want to have their own archive copies of their databases, and personally I think doing a backup before a code deployment should be an absolute requirement.  Their answer has been “if you need your own backups, you need to build your own solution.”

Microsoft is aware of this need, and it has been the top-voted issue on their Azure UserVoice site for a while. 

In poking around the interwebs, I saw some general discussion of how to work around this, but very little concrete detail.  After hacking around for a while, I came up with a solution that has worked serviceably well for us, so I figured I’d share it with y’all.

 

What?

In order to address these concerns, Microsoft introduced the ability to copy a database in SQL Azure.  So, as a limited backup option, you can create a quick copy of your database before a deployment, and quickly restore it back if something fails.  However, this does not allow for archiving or exporting the data from SQL Azure, so all of the data is still trapped in the Azure universe.

Apparently another option is SSIS.  Since you can connect to Azure through a standard SQL connection, theoretically you could export the data this way.  Now I am no SSIS ninja, so I was just never able to get this working with Azure, and I was spending far too much time on something that I shouldn’t need to be spending much time on.

I’ve heard rumblings Microsoft’s Sync Framework could address the issue, but, uh, see the previous point.  Who’s got time for that?

So of course, Red Gate to the rescue.  Generally speaking, their SQL Compare and SQL Data Compare solve this type of problem beautifully; they are excellent at copying SQL content from one server to another to keep them in sync.  The latest formal release of their products (v8.5 as of this writing) does not support SQL Azure.  However, they do have beta versions of their new v9.0 products, which do support SQL Azure.  Right now you can get time-locked beta versions for free, so get yourself over to http://www.red-gate.com/Azure and see if they are still available.  If you’re reading this after the beta program has expired, just pony up the cash and buy them, they are beyond worth it.

 

How?

OK, so how do we set this all up?  Basically, we create a scheduled task that creates a copy of the database on SQL Azure, downloads the copy to a local SQL Server database, and then creates a zipped backup of that database.

First, you need a SQL Server database server.  And go install the Azure-enabled versions of SQL Compare and SQL Data Compare.

Also, go get a copy of 7-Zip, if you have any interest in zipping the backups.

The scheduled task will execute a batch file.  Here’s that batch file:

SET SqlAzureServerName=[censored]
SET SqlAzureUserName=[censored]
SET SqlAzurePassword=[censored]

SET LocalSqlServerName=[censored]
SET LocalSqlUserName=[censored]
SET LocalSqlPassword=[censored]

echo Creating backup on Azure server
sqlcmd -U %SqlAzureUserName%@%SqlAzureServerName% -P %SqlAzurePassword% -S %SqlAzureServerName% -d master -i C:\SQLBackups\DropAndRecreateAzureDatabase.sql

echo Backup on Azure server complete

echo Create local database SportsCommander_NightlyBackup
sqlcmd -U %LocalSqlUserName% -P %LocalSqlPassword% -S %LocalSqlServerName% -d master -i C:\SQLBackups\DropAndRecreateLocalDatabase.sql

echo Synchronizing schema
"C:\Program Files (x86)\Red Gate\SQL Compare 9\SQLCompare.exe" /s1:%SqlAzureServerName% /db1:SportsCommanderBackup /u1:%SqlAzureUserName% /p1:%SqlAzurePassword% /s2:%LocalSqlServerName% /db2:SportsCommander_NightlyBackup /u2:%LocalSqlUserName% /p2:%LocalSqlPassword% /sync

echo Synchronizing data
"C:\Program Files (x86)\Red Gate\SQL Data Compare 9\SQLDataCompare.exe" /s1:%SqlAzureServerName% /db1:SportsCommanderBackup /u1:%SqlAzureUserName% /p1:%SqlAzurePassword% /s2:%LocalSqlServerName% /db2:SportsCommander_NightlyBackup /u2:%LocalSqlUserName% /p2:%LocalSqlPassword% /sync

echo Backup Local Database
for /f "tokens=1-4 delims=/- " %%a in (‘date /t’) do set XDate=%%d_%%b_%%c
for /f "tokens=1-2 delims=: " %%a in (‘time /t’) do set XTime=%%a_%%b
SET BackupName=SportsCommander_Backup_%XDate%_%XTime%
sqlcmd -U %LocalSqlUserName% -P %LocalSqlPassword% -S %LocalSqlServerName% -d master -Q "BACKUP DATABASE SportsCommander_NightlyBackup TO DISK = ‘C:\SQLBackups\%BackupName%.bak’"

"C:\Program Files\7-Zip\7z.exe" a "C:\SQLBackups\%BackupName%.zip" "C:\SQLBackups\%BackupName%.bak"

del /F /Q  "C:\SQLBackups\%BackupName%.bak"

echo Anonymize Database For Test Usage
sqlcmd -U %LocalSqlUserName% -P %LocalSqlPassword% -S %LocalSqlServerName% -d SportsCommander_NightlyBackup -i "C:\SQLBackups\AnonymizeDatabase.sql"

 

The first thing this does is run a SQL script against the SQL Azure server (DropAndRecreateAzureDatabase.sql).  This script will create a backup copy of the database on Azure, using their new copy-database functionality.  Here’s that script:

DROP DATABASE SportsCommanderBackup
GO
CREATE DATABASE SportsCommanderBackup AS COPY OF SportsCommander
GO
DECLARE @intSanityCheck INT
SET @intSanityCheck = 0
WHILE(@intSanityCheck < 100 AND (SELECT state_desc FROM sys.databases WHERE name='SportsCommanderBackup') = 'COPYING')
BEGIN
-- wait for 10 seconds
WAITFOR DELAY '00:00:10'
SET @intSanityCheck = @intSanityCheck + 1
END
GO
DECLARE @vchState VARCHAR(200)
SET @vchState = (SELECT state_desc FROM sys.databases WHERE name='SportsCommanderBackup')
IF(@vchState != 'ONLINE')
BEGIN
DECLARE @vchError VARCHAR(200)
SET @vchError = 'Failed to copy database, state = ''' + @vchState + ''''
RAISERROR (@vchError, 16, 1)
END
GO

 

A few notes here:

  • We are always overwriting the last copy of the backup.  This is not an archive; that will be on the local server.  Instead, this is always the latest copy.  Besides, extra Azure databases are expensive.
  • For some reason SQL Azure won’t let you run a DROP DATABASE command in a batch with other commands, even though SQL 2008 allows it.  As a result, we can’t wrap the DROP DATABASE in an “IF(EXISTS(“ clause.  So, we need to always just drop the database, which means you’ll have to create an initial copy of the database by hand before the first time you run the script, so that the DROP has something to drop.
  • The CREATE DATABASE … AS COPY OF command will return almost immediately, and the database will be created, but it is not done copying.  That is actually still running in the background, and it could take a minute or two to complete depending on the size of the database.  Because of that, we sit in a loop and wait for the copy to finish before continuing.  We put a sanity check in there to throw an exception just in case it runs forever.

 

Once that is complete, we create a local database and copy the Azure database down into that.  There are several ways to do this, but we chose to keep a single most-recent version on the server, and then zipped backups as an archive.  This gives a good balance of being able to look at and test against the most recent data, and having access to archived history if we really need it, while using up as little disk space as possible.

In order to create the local database, we run a very similar script (DropAndRecreateLocalDatabase.sql):

IF(EXISTS(SELECT * FROM sys.databases WHERE Name='SportsCommander_NightlyBackup'))
BEGIN
DROP DATABASE SportsCommander_NightlyBackup
END
CREATE DATABASE SportsCommander_NightlyBackup

 

In this case, we actually can wrap the DROP DATABASE command in an “IF(EXISTS(“, which makes me feel all warm and fuzzy.

After that, it’s a matter of calling the SQL Compare command line to copy the schema down to the new database, and then calling SQL Data Compare to copy the data down into the schema.  At this point we have a complete copy of the database exported from SQL Azure.

As some general maintenance, we then call sqlcmd to back the database up to a time-stamped file on the drive, and then call 7-Zip to compress it.  You might want to consider dumping this out to a DropBox folder, and boom-goes-the-dynamite, you’ve got some seriously backed-up databii.

Lastly, we run an AnonymizeDatabase.sql script to clear out and reset all of the email addresses, so that we can use the database in a test environment without fear of accidentally sending bogus test emails out to our users, which I’ve done before and it never reflected well on us.
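
The anonymize script itself is nothing fancy.  Ours is specific to our schema, but a minimal sketch of the idea looks something like this (the Users table and the Email/UserID columns here are made up for illustration, not our actual schema):

-- Overwrite every real email address so test runs can't mail actual users
UPDATE Users
SET Email = 'testuser' + CAST(UserID AS VARCHAR(20)) + '@example.com'
GO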

Run that batch file anytime you want to get a backup, or create a scheduled task in Windows to run it every night.

Anyhoo, that’s about it.  It’s quick, it’s dirty, but it worked for us in a pinch.  Microsoft is just getting rolling on Azure and adding more stuff every month, so I’m sure they will provide a more elegant solution sooner or later, but this will get us by for now.

Have you had a similar experience?  How are you handling SQL Azure backups?

 

This post is part 2 of a 382-part series on how to manage database changes, primarily for SQL Server, starting here.

I figured this would be a good week to discuss ways that you can make your world a better place by making small changes to things you do in your everyday work.  No, this post is not about inconvenient truths or sustainability or hybrids or manbearpig.  This post is about the importance of local development databases. 


The situation you see all too often is that a development team has a single database server in their development environment that they share, and everyone is developing application code locally while simultaneously making changes to a shared database instance.  Bad, bad, bad.

Captain Obvious

These days, most everyone develops their code locally.  That’s just what you do.  Many developers have learned the hard way that this is important, and won’t tolerate any divergence.  And for the less experienced developers who are doing it just because they are told to, eventually they will make the same mistakes and learn from them too.  This is such an easy lesson to learn that you don’t see too many people violate it intentionally.

Even if you HAVE to develop on a server environment, you’ll probably at least find a way to isolate yourself.  For example, SharePoint developers don’t tend to install the SharePoint platform on their local machines, mostly because it requires a server OS, but also because SharePoint is pure, unadulterated evil that will steal the very soul of any machine it comes into contact with.  Nonetheless, in those cases where a local machine is not practical, the developer will install the SharePoint software onto a virtual machine so that they can still work in isolation.

This local development approach is critically important to any form of version control or change management.  For all practical purposes, developers must have a workable environment that they can fully control and work in peace.  From there, developers check their code into source control, and hopefully it gets built from source control before being deployed to another server.  This gives each developer a degree of control over how much the other developers can screw them up, and more importantly it ensures that every change is traceable back to a date and time and person responsible.

This approach is so ingrained in so many developers that often we take it for granted.  Just try to remind yourself regularly how awful it was that time everyone was working directly on the same development server, and nobody could keep track of who changed what when.  Or better yet, how fired up everyone got the last time somebody sneaked directly into the production server and started mucking around.

The database is like the village bicycle

I am consistently surprised how often developers go to so much trouble to isolate their local application development environment, and then point their local application code to a shared development database server that the whole team is working on.

If you never need to make database changes, and nobody on your team needs to make database changes, this can certainly work.  In that case, the database behaves like a third party service, rather than an actively developed part of the system.

However, if you are ever making database changes, you need to isolate your database for the exact same reasons that you need to isolate your application code.

Imagine that you are working on a project that involves several layers of DLLs communicating with each other.  Because you are in active development, you and your team are constantly making changes that affect the interfaces between those DLLs.  The result is that you continually need to check in your changes in whole batches; you can’t just check in a few files here and there because you will be breaking the interfaces for anyone else working in that code.

The same rules must apply to the database as well, for all of the same reasons.  At any given point in time, anyone should be able to pull the code that is in source control, build it, and run it.  However, if I’m making a series of changes to my local code and the shared development database, my crazy C# changes are isolated on my local machine, but my coworkers are getting my database changes as they happen, so their systems will stop working all of a sudden, and they won’t even know why, or worse yet they will know exactly why and I’ll be the guy “who busted everything up.”

Better yet, after a few days of wasting time on a bad design, I give up on it, and with one or two clicks I can undo all of my code changes and roll back to the main development code stream.  However, there is no one-click rollback for the database schema, and so now those changes need to be manually backed out.  Hopefully I kept a good list of the changes so I can do this without missing anything, but we all know that a few things will get missed, and now the development database becomes even more of a mutant branch of the true database schema, full of changes that nobody remembers or owns, and it is all going to blow up and make us all look like fools when we are rushing to deploy it into QA next month.

DVCS isn’t all bad

Distributed Version Control Systems like Git and Mercurial are the fun new fad in version control, and everyone seems to think that they are so much cooler than more traditional and linear systems like Vault.  To me, it seems to grossly overcomplicate an already difficult issue by exacerbating the most problematic concepts, namely branching and merging.  But I’m a crusty old conservative who scoffs at anything new, so maybe (and even hopefully) I’m wrong.  I was quick to dismiss it as a new toy of bored astronauts, but some people a lot smarter than me have done the same and seem to be coming around to it, if not embracing it completely, so I will continue to believe that I am right for now, even though I know I’m probably wrong and will change my mind eventually.

But there is one idea in DVCS systems that I can get on board with, and that’s the idea that everyone is working in their own branch.  As we’ve discussed, you simply cannot be working in the same sandbox as everyone else, or you will have intractable chaos.  You should stay plugged into what everyone else is doing on a regular basis, usually through version control, but you must also isolate yourself, and you must do so thoroughly.

And here’s the thing (and this may very well be the idea that eventually opens my path to DVCS enlightenment): your local machine is a branch.  Granted, it is not a very robust branch, because it only has two states (your current state and the latest thing in source control), but you are still essentially branched until you check in, at which point you will have to merge.  It might be a really small merge, because the changes were small or backwards compatible, or because you were not branched off locally for that long, or you are the only one working on a feature, or because you’ve been communicating with the rest of your team, or because you are the only person who actually does any work, but you are merging nonetheless.

What does this have to do with databases?  Branching is all about isolation.  You must isolate your development environment, and you must do so thoroughly.  If you think of your machine as simply a branch of the source code, it crystallizes the idea that everything you are doing locally is a full stream of code, that it must contain everything needed to run that code, and that it must represent all of the changes in that code, including the database.  In a broader view, if you were to branch your code to represent a release or a patch or a feature, you obviously should be branching your database code at the same time (assuming of course that your database is under version control).  If that is the case, and if the code on your local machine is nothing more than a primitive branch of what is in source control, then your local machine should also have its own copy of the database.

Database scripting is hard, let’s go shopping!

I know.  This makes pushing your code to the development server more difficult, because you have to script everything out, and honestly the last thing I really want to do is complicate the lives of developers.  In fact, I think the primary purpose of most development processes should be to reduce friction, and to make a developer’s life as worry-free as possible, so that they can focus on the real complicated business problems they are paid to solve, not the silly process crap that people invent to make themselves appear smart and organized.

That being the case, it may make your life a little harder to develop locally, and then write up all of the scripts necessary to push those changes to the dev server, but it is definitely worth it.  This is not a theoretical improvement that will hopefully save you time in the distant future, when design patterns rule and everybody’s tasks are on index cards and you’ve achieved 100% code coverage in your unit tests.  No, this is a real, tangible, and immediate benefit, because you will save yourself effort when you deploy it to the next stage, namely QA or production.  At that point, you’ll already have everything organized and listed out, and you did so while you were still working on the code and everything was still fresh in your mind.  In my humble opinion, this is a much more maintainable process than everyone just thrashing around in a wild-west development database, and then after it all spending days trying to figure out which schema differences need to be included to release which features, because the chances of getting that right consistently are almost non-existent.

And if this is really too much work for you to do well, maybe we can find you a ball to bounce with instead.  Or maybe some UML diagrams and crayons to draw with.  Either way, get the hell out of my code.

Beating a dead horse

Hopefully I’ve convinced you that you should have your own copy of the database for your local development.  I could go on forever giving reasons for this.  Or, I could give specific examples, like a recent client that did not follow this pattern, where our team was constantly breaking each other because the code and database were out of sync, even though we were just a small team of 3 experienced and practical developers sitting in a single office right next to each other, working together well and communicating all day; the lack of database isolation made the issues unavoidable anyway.

So yeah, I could go on and on.  But I won’t, because it’s getting boring.  If you’re agreeing with me by now, feel free to go read something else.

I can’t run SQL Server locally, my machine sucks!

I’ve heard a lot of people say that they can’t run SQL Server locally, and sometimes they are right, but I think a lot of the time it is an excuse.

Maybe you don’t have a license to SQL Server.  That’s usually fine, SQL Server Express Edition is free.  Sure, it has some limitations, like the lack of a SQL profiler, but there are great free tools out there like the one from AnjLab.  And if you still need a full-featured copy, the developer edition costs less than $50.  Can you or your company not spare $50 for something like that?  Really?

Or maybe your machine doesn’t have enough memory.  It’s true, SQL will eat up memory like nobody’s business, and if you have Visual Studio 2008 and Outlook 2007 running, it can be pretty heavy.  But I’ve found that as long as you have 3 or 4 GB of RAM, it works pretty well, and doesn’t everyone have that these days?  Sure, a lot of you are stuck with crappy old machines that your employer gave you because he considers you to be a high-priced janitor, and he can’t justify in his mind spending a few hundred extra to help you be more productive, but in that case you have bigger problems than anything we’re going to solve here.  I would say, if possible, you should even shell out a few hundred and get some more memory for your machine, even if it’s a work machine and they won’t reimburse you for it.  I know plenty of people who would be opposed to this just out of principle, but those people and their principles can go have their own little pity party and see who comes; in the meantime I’d rather solve the problem and move on.

Too Beaucoup?

However there is certainly one potential problem that can be difficult to overcome.  What if your existing database is just too damn big to run locally?

One recent client had a production database which was used for a million unrelated purposes, and it was 30GB.  Another recent client had the vast majority of their business data spread across two databases that were each about 300 GB.  Sometimes, the database is just too big to copy down to your local machine.  There are a few ways to deal with the problem.

Sometimes the best option is to separate the schema and the data.  Strip down the data, get rid of the 300 GB, and get the minimum amount of sample data necessary to run your applications.  Maybe clear it out entirely, and have some scripts or automated tests that generate a batch of sample data.  Often times this will require a lot of analysis to determine what is necessary, and what all of the data is being used for, but that’s not an entirely bad thing.  If you get a good and portable development database out of it, while also getting a better understanding of how the data is being used, then that has a lot of benefits.  Granted, this is not easy by any stretch, but it may be doable.  It all depends on your situation.

Another option is to set up a single high-powered development database server, and give each developer their own instance of the database on that server.  This approach can have its own problems as well, such as people getting confused about which database instance belongs to whom, and having enough disk space to store the inevitable terabytes of data.

So if you have any of these problems, my sympathies to you, and I hope that you can find a workable solution. 

The End

So hopefully you are now inspired to change your process a little bit, or at least were entertained for a little while.

Or, if you’ve been trying to do the same thing but aren’t getting the buy-in you need, you have a little more reference material for saying “I read an article about this on the interwebs, and some guy said…”.  It’s one reason I make these things so long, as a favor to you, because the person you are arguing with isn’t actually going to read all of this crap, so maybe they will just give up and let you do whatever you want.  The person that wins the argument usually isn’t the person who is right, it’s usually the person who is willing to waste the most time arguing about it.

Good luck!

 

At the very least, it’s probably not nearly as good as it should be.  As a pathological job-hopper who sees one company after another wasting hidden hours and days on insufficient database change control strategies, I’m sick of it.

So how does your company handle database changes?  I’ve asked this of potential employers many times, and I usually just get blank stares, or a vague answer along the lines of “uh… with scripts?”  Sometimes, if they are honest, they just reply “poorly.”  There’s not even an explicit Joel Test for it.  Considering how bad the situation is at most companies, the real test is not whether their database change management is any good, but just whether they are willing to recognize how problematic it really is.

Given how much thought and effort goes into source code control and change management at many of these same companies, it is confusing and a little unsettling that so much less progress has been made on the database change management front.  Many developers can give you a 15 minute explanation of their source code strategy, why they are doing certain things and referencing books and blog posts to support their approach, but when it comes to database changes it is usually just an ad-hoc system that has evolved over time and everyone is a little bit ashamed of it. 

Hopefully we can get to the bottom of why this is and come up with a few ways to make it better.  It’s not like the database is important, is it?

 
I like you, but not in that way

So why don’t developers think about databases the same way that they think about source code? 

The most obvious reason is that database changes are fundamentally different in many ways.  While source code is usually just a collection of files that can be recompiled, versioned, and released at any given time, databases are much more temperamental.  They have existing data, and they have history.  Sure, source code has history, which you can review for reference purposes, but in databases the historical lineage of a table is actually very important.  In C#, if you added a field to a class, and then a week later someone else changed the type, and then a month later someone else changed the name, it usually doesn’t really matter too much when or in what order those changes happened; all that matters is the current state of the code and that everything builds together and works in its latest state.  However, if you were to do the same thing to a new field in a table, it definitely makes a difference, because there are data implications at every step of the way.  This alone scares a lot of developers away from maintaining the database.

To many developers, there is something fundamentally uncontrollable about databases.  They don’t fit into the safe model that we’re used to.  Managing those changes definitely introduces new challenges, and many developers just don’t want to be bothered.

 

Doing Business As

Another major reason for the difference is just cultural.  Developers want the database to work well, and they may even like tinkering around with some stored procedures from time to time, but at the end of the day they like to be able to wash their hands of anything that slightly resembles being a “DBA”.

When you compare the DBAs (database administrators) and SCMs (stupid code monkeys), often times there is just a different way of looking at the world.

Developers often see themselves as the general of a utilitarian army of code that they build and train and then order off to carry out their mission, and they usually won’t hesitate to trash it all and replace it if it proves itself unable to complete the mission.  DBAs, on the other hand, are used to dealing with a gentle lion that could easily lose its temper and kill everyone in the room if it’s not treated with kindness and respect.  Developers often have the option to wipe the slate clean and start over, and usually want to, especially when they are dealing with someone else’s code.  DBAs, however, are stuck with the original version, and they need to keep it alive and make it work until the point that the data can be migrated to a better system, and we all secretly know that is never going to happen.

 

Code source control is an over-ripe banana

As developers, we are awash with tools and techniques for managing our source code changes.  We have a seemingly endless number of source control platforms to choose from, each with more features than we could ever possibly need or even understand, and some people build entire careers around defining complicated branching and merging strategies for massive codebases.  I figured for a while that the ideas of practical source control were pretty well nailed down, but then you read Eric Sink’s posts on Distributed Version Control Systems, and get an idea of how quickly the field is continuing to evolve.

At some internal IT department, a group of developers is toiling away on a set of dull yet fabulously enterprise-y software that five people will use.  Jimmy checks his code changes into his company’s version control system of choice, where they are automatically held at the gates until they are code-reviewed by a senior developer, and then they are checked into the team’s current release branch.  Meanwhile the continuous integration build server will download, compile, package, and unit test the code to make sure that Jimmy hasn’t broken anything, and that nobody else’s changes broke Jimmy.  The packaged code is then migrated through an integration environment, QA environment, UAT environment, and staging environment on its way to production.  All the while, as the changes are validated from one environment to another, version numbers are automatically assigned to allow anyone to trace back the exact revision for a given build, and the corresponding code changes slowly work their way towards the code-release promised land, the Main Trunk.  Those branches can get pretty damned complicated, even when everything is going smoothly, and it never goes smoothly.


Hopefully this is a process that Jimmy’s company evolved out of a mixture of necessity and forethought over the years.  The other, less attractive, and far more common scenario is that the company hired an astronaut trapped in a developer’s body, bored with his work and not bound by any sense of urgency, who assured the company that he was going to implement a lot of big complicated processes because that’s just what professional companies do.

In the end, a whole lot of people are paying a whole lot of attention to managing the source code.  Hopefully at your company, you are paying attention to this too.

 

Databases, the redheaded stepchild of source control

Now ask yourself, how does Jimmy handle database changes? 

For example, say that all database changes need to be checked into an isolated directory in source control; after all, we’re not savages.  However, since they don’t really “build” a database, that directory is ignored by their continuous integration server.  This in turn breaks the unit tests that are pointing to the integration database server, so Jimmy then needs to run those scripts manually in the integration environment.

In this process, Jimmy sees other database scripts that were checked in recently in the same source control directory, but he has no way to know which scripts have already been applied to the integration server.  For the briefest moment, Jimmy considers applying those scripts as well, just to make sure that the integration server is fully up-to-date, but then he realizes that he can’t be sure which scripts have already been run without manually comparing the schema and scripts to see which have been applied, and this would make Jimmy the de facto owner of any issues that arise because of it.  With his own tasks and deadlines to worry about, Jimmy doesn’t have the time or patience for this silliness, so he just deploys his scripts, forgets about the others, and hopes for the best.

It’s worth noting here that this is the kind of thing that can silently kill software quality.  A motivated developer just tried to make things a little better, but the process was so frustratingly inadequate that it was impractical for him to do so.  Software companies depend on their developers taking the initiative to improve things, and when they are discouraged from doing so, either by person or by process, the company will slowly slide backwards into mediocrity, and it will drag every developer’s morale with it.

 

Now once Jimmy makes the changes to the integration server database, that also breaks some other developers that have been using that server for their development.  Those developers now need to stop and download the latest code changes to get back in sync, cursing Jimmy’s name the whole way.

Anyhow, during the next deployment to QA, someone needs to remember that these specific changes need to get deployed.  Since there is no defined strategy for tying database changes to code changes, every time code is deployed there is a little bit of confusion around exactly which database changes need to be released, which were already released, and what order the scripts need to be run in.  Jimmy is getting upset.

Another darker possibility is that instead Jimmy needs to submit his changes to the database review board, a collection of detached idealists, college professors without the college, who will criticize every aspect of the scripts in order to justify their existence, but will not really offer any true value because they don’t understand the business problem that needs to be solved, nor do they appreciate the application considerations beyond the database that need to be satisfied.

One of the long term impacts of this is that Jimmy will look for any possible way to accomplish what he is trying to do without making database changes, because, in his colorful youthful vernacular, “making changes to the database is a sodding bitch.”  And if he does indeed need to change the database, he’ll try to accomplish it just by changing stored procedures, because changing table schemas is even worse.  In the end, he’s definitely not trying to find the appropriate solution to the problem; instead he has backed into a situation of being forced to find a “good-enough” solution that will minimize his hassle, regardless of the downstream impacts.

From now on, he’ll look for any way he can accomplish it by only changing stored procedures and not changing the underlying schema.  If he’s lucky (sort of), he’ll find a way that he can just kludge the stored procedures to work around the problem for now, and let it be someone else’s problem to solve later.  He has long since given up trying to find the “right” solution, because it is so exhausting, and the current state of things is so busted up, that it’s not even worth trying anymore.

Further complicating the issue, some developers and DBAs make one-off changes in the QA and production environments without going through source control.  Either they need to make an emergency fix in one of the environments and forget to go back and add the scripts to source control, or worse they just don’t believe that databases should be under source control (I’ve seen this attitude far too often from some DBAs, because they can’t stand the idea of being forced to use a source control system managed by some other developers, just so that they can make changes to their own database).  Pretty soon, every environment is a perverted branch of the one true database, and trying to identify why a test works in one environment and fails in another quickly becomes a nightmare.

 

Some day, things will be different!

So what’s the solution?  Well, in my humble opinion, we need something like this:

  • The database must be versioned, so that it is easy to tell which changes have been applied and when they were applied (see the sketch after this list).
  • All database changes must be checked into source control.
  • All database changes must be tied to the code changes that they affect, ideally checked into source control as a part of the same changeset transaction.
  • The database changes are built along with the code changes.
  • The database changes are deployed along with the code changes.
  • The continuous integration server must be able to build and update its own copy of the database, so that it can run automated tests of code and scripts that are checked in at the same time.
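
To make that first bullet a little more concrete, even a dirt-simple version table that every change script stamps on its way through gets you most of the way there.  Here is a rough sketch of the idea; the table, column, and script names are made up for illustration, not a prescription:

-- A simple log of which change scripts have been applied, and when
CREATE TABLE SchemaChangeLog (
ChangeID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
ScriptName VARCHAR(255) NOT NULL,
AppliedDate DATETIME NOT NULL DEFAULT GETDATE()
)
GO

-- Each change script checks the log before running, and records itself afterwards
IF NOT EXISTS (SELECT * FROM SchemaChangeLog WHERE ScriptName = '0042_AddWidgetTable.sql')
BEGIN
-- ...the actual schema change goes here...
INSERT INTO SchemaChangeLog (ScriptName) VALUES ('0042_AddWidgetTable.sql')
END
GO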

I’ve seen some companies that have had home-grown utilities that come close to accomplishing this, but in the end they all fell just a little bit short, which is not too bad.  However, the vast majority of companies I’ve seen were not even in the ball park.

Some of you are probably asking, “doesn’t Visual Studio Team System do this?”  Yeah, I think so.  Probably, but who knows.  Honestly I tried working with it a few times, and it caused me nothing but problems.  Sure, I could spend a lot of time mastering all the quirks, but I’m looking for something a little bit more accessible here.  The underlying concepts are hard enough; we need an approach that simplifies them, and I just don’t think that VSTS accomplishes that.  More importantly, and also along the lines of accessibility, VSTS costs a fortune, and so most developers will never have access to it, so I’d like to find something that can work for the other 95% of developers out there who are stuck using reasonably-priced tools.

What about Red Gate’s SQL Compare and SQL Data Compare products?  Absolutely, these tools are awesome products that can prove invaluable in many cases.  However, they are often leveraged only once an unknown number of database changes have already been made by an unknown number of people.  I think the better solution, whenever possible, is to step back and track the changes as they come in, proactively considering each change, just like we do for regular source code, which allows for far more robust change management and accountability for the database.

So that’s the idea.  In the next few weeks I’ll have some more posts on this topic, getting into some more specifics about how to solve this.  Hopefully we can find a simple, easy-to-manage solution that developers can adapt to quickly, and which promotes database changes to the first-class citizens of the change management world that they deserve to be.