About Mike Mooney

Mike Mooney has been a software developer for 15 years. He lives in Chester Springs, PA with his wife and three sons, and is the co-founder of MMDB Solutions.
Why?

Last year we launched a new version of SportsCommander.com, which offered volleyball organizations across the country the ability to promote their tournaments and accept registrations for a negligible fee.  Having grown out of our previous hosting company, we tried hosting the platform on Windows Azure, and for the most part it’s been great.  Also, the price was right.

We are also hosting our data in SQL Azure, which again for the most part has been great.  It has performed well enough for our needs, and it abstracts away a lot of the IT/DBA maintenance issues that we would really rather not worry about.

Of course, nothing is perfect.  We’ve had a few snags with Azure, all of which we were able to work around, but it was a headache. 

One of the biggest issues for us was the ability to run regular backups of our data, for both disaster recovery and testing purposes.  SQL Azure does a great job of abstracting away the maintenance details, but one of the things you lose is direct access to the SQL backup and restore functionality.  This was almost a deal-breaker for us.

Microsoft’s response to this issue is that they handle all of the backups and restores for you, so that if something went wrong with the data center, they would handle getting everything up and running again.  Obviously this only solves part of the problem, because many companies want to have their own archive copies of their databases, and personally I think doing a backup before a code deployment should be an absolute requirement.  Their answer has been “if you need your own backups, you need to build your own solution.”

Microsoft is aware of this need, and it has been the top-voted issue on their Azure UserVoice site for a while. 

In poking around the interwebs, I saw some general discussion of how to work around this, but very little concrete detail.  After hacking around for a while, I came up with a solution that has worked serviceably well for us, so I figured I’d share it with y’all.

 

What?

In order to address these concerns, Microsoft introduced the ability to copy a database in SQL Azure.  So, as a limited backup option, you can create a quick copy of your database before a deployment, and quickly restore it back if something fails.  However, this does not allow for archiving or exporting the data from SQL Azure, so all of the data is still trapped in the Azure universe.

Apparently another option is SSIS.  Since you can connect to Azure through a standard SQL connection, theoretically you could export the data this way.  Now I am no SSIS ninja, so I was just never able to get this working with Azure, and I was spending far too much time on something that I shouldn’t need to be spending much time on.

I’ve heard rumblings Microsoft’s Sync Framework could address the issue, but, uh, see the previous point.  Who’s got time for that?

So of course, Red Gate to the rescue.  Generally speaking, their SQL Compare and SQL Data Compare solve this type of problem beautifully; they are excellent at copying SQL content from one server to another to keep them in sync.  The latest formal release of their products (v8.5 as of this writing) does not support SQL Azure.  However, they do have beta versions of their new v9.0 products, which do support SQL Azure.  Right now you can get time-locked beta versions for free, so get yourself over to http://www.red-gate.com/Azure and see if they are still available.  If you’re reading this after the beta program has expired, just pony up the cash and buy them, they are beyond worth it.

 

How?

OK, so how do we set this all up?  Basically, we create a scheduled task that creates a copy of the database on SQL Azure, downloads the copy to a local SQL Server database, and then creates a zipped backup of that database.

First, you need a SQL Server database server.  On that server, install the Azure-enabled versions of SQL Compare and SQL Data Compare.

Also, go get a copy of 7-Zip, if you have any interest in zipping the backups.

The scheduled task will execute a batch file.  Here’s that batch file:

SET SqlAzureServerName=[censored]
SET SqlAzureUserName=[censored]
SET SqlAzurePassword=[censored]

SET LocalSqlServerName=[censored]
SET LocalSqlUserName=[censored]
SET LocalSqlPassword=[censored]

echo Creating backup on Azure server
sqlcmd -U %SqlAzureUserName%@%SqlAzureServerName% -P %SqlAzurePassword% -S %SqlAzureServerName% -d master -i C:\SQLBackups\DropAndRecreateAzureDatabase.sql

echo Backup on Azure server complete

echo Create local database SportsCommander_NightlyBackup
sqlcmd -U %LocalSqlUserName% -P %LocalSqlPassword% -S %LocalSqlServerName% -d master -i C:\SQLBackups\DropAndRecreateLocalDatabase.sql

echo Synchronizing schema
"C:\Program Files (x86)\Red Gate\SQL Compare 9\SQLCompare.exe" /s1:%SqlAzureServerName% /db1:SportsCommanderBackup /u1:%SqlAzureUserName% /p1:%SqlAzurePassword% /s2:%LocalSqlServerName% /db2:SportsCommander_NightlyBackup /u2:%LocalSqlUserName% /p2:%LocalSqlPassword% /sync

echo Synchronizing data
"C:\Program Files (x86)\Red Gate\SQL Data Compare 9\SQLDataCompare.exe" /s1:%SqlAzureServerName% /db1:SportsCommanderBackup /u1:%SqlAzureUserName% /p1:%SqlAzurePassword% /s2:%LocalSqlServerName% /db2:SportsCommander_NightlyBackup /u2:%LocalSqlUserName% /p2:%LocalSqlPassword% /sync

echo Backup Local Database
for /f "tokens=1-4 delims=/- " %%a in (‘date /t’) do set XDate=%%d_%%b_%%c
for /f "tokens=1-2 delims=: " %%a in (‘time /t’) do set XTime=%%a_%%b
SET BackupName=SportsCommander_Backup_%XDate%_%XTime%
sqlcmd -U %LocalSqlUserName% -P %LocalSqlPassword% -S %LocalSqlServerName% -d master -Q "BACKUP DATABASE SportsCommander_NightlyBackup TO DISK = 'C:\SQLBackups\%BackupName%.bak'"

"C:\Program Files\7-Zip\7z.exe" a "C:\SQLBackups\%BackupName%.zip" "C:\SQLBackups\%BackupName%.bak"

del /F /Q  "C:\SQLBackups\%BackupName%.bak"

echo Anonymize Database For Test Usage
sqlcmd -U %LocalSqlUserName% -P %LocalSqlPassword% -S %LocalSqlServerName% -d SportsCommander_NightlyBackup -i "C:\SQLBackups\AnonymizeDatabase.sql"

 

The first thing this does is run a SQL script against the SQL Azure server (DropAndRecreateAzureDatabase.sql).  This script will create a backup copy of the database on Azure, using their new copy-database functionality.  Here’s that script:

DROP DATABASE SportsCommanderBackup
GO
CREATE DATABASE SportsCommanderBackup AS COPY OF SportsCommander
GO
DECLARE @intSanityCheck INT
SET @intSanityCheck = 0
WHILE(@intSanityCheck < 100 AND (SELECT state_desc FROM sys.databases WHERE name='SportsCommanderBackup') = 'COPYING')
BEGIN
-- wait for 10 seconds
WAITFOR DELAY '00:00:10'
SET @intSanityCheck = @intSanityCheck+1
END
GO
DECLARE @vchState VARCHAR(200)
SET @vchState = (SELECT state_desc FROM sys.databases WHERE name='SportsCommanderBackup')
IF(@vchState != 'ONLINE')
BEGIN
DECLARE @vchError VARCHAR(200)
SET @vchError = 'Failed to copy database, state = ''' + @vchState + ''''
RAISERROR (@vchError, 16, 1)
END
GO

 

A few notes here:

  • We are always overwriting the last copy of the backup.  This is not an archive; that will be on the local server.  Instead, this is always just the latest copy.  Besides, extra Azure databases are expensive.
  • For some reason SQL Azure won’t let you run a DROP DATABASE command in a batch with other commands, even though SQL 2008 allows it.  As a result, we can’t wrap the DROP DATABASE in an “IF(EXISTS(“ clause.  So we always just drop the database, which means the first time you run the script you’ll need to have created an initial copy of the database for it to drop.
  • The CREATE DATABASE … AS COPY OF statement will return almost immediately, and the database will be created, but it is not done copying yet.  The copy is actually still running in the background, and it could take a minute or two to complete depending on the size of the database.  Because of that, we sit in a loop and wait for the copy to finish before continuing.  We put a sanity check in there to throw an exception just in case it runs forever.

 

Once that is complete, we create a local database and copy the Azure database down into that.  There are several ways to do this, but we chose to keep a single most-recent version on the server, and then zipped backups as an archive.  This gives a good balance of being able to look at and test against the most recent data, and having access to archived history if we really need it, while using up as little disk space as possible.

In order to create the local database, we run a very similar script (DropAndRecreateLocalDatabase.sql):

IF(EXISTS(SELECT * FROM sys.databases WHERE Name='SportsCommander_NightlyBackup'))
BEGIN
DROP DATABASE SportsCommander_NightlyBackup
END
CREATE DATABASE SportsCommander_NightlyBackup

 

In this case, we actually can wrap the DROP DATABASE command in a “IF(EXISTS”, which makes me feel all warm and fuzzy.

After that, it’s a matter of calling the SQL Compare command line to copy the schema down to the new database, and then calling SQL Data Compare to copy the data down into the schema.  At this point we have a complete copy of the database exported from SQL Azure.

As some general maintenance, we then call sqlcmd to back up the database to a time-stamped file on the drive, and then call 7-Zip to compress it.  You might want to consider dumping this out to a DropBox folder, and boom-goes-the-dynamite, you’ve got some seriously backed-up databii.

Lastly, we run an AnonymizeDatabase.sql script to clear out and reset all of the email addresses, so that we can use the database in a test environment without fear of accidentally sending bogus test emails out to our users, which I’ve done before and it never reflected well on us.
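That anonymization script doesn’t need to be anything fancy.  Here’s a minimal sketch of the idea, assuming a hypothetical Users table with UserID and EmailAddress columns (your table and column names will obviously differ):

-- Overwrite every address with a unique, obviously-fake test address,
-- so nothing in the test environment can ever email a real user.
-- (Hypothetical table and column names for illustration only.)
UPDATE Users
SET EmailAddress = 'testuser' + CAST(UserID AS VARCHAR(20)) + '@example.com'
GO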

Run that batch file anytime you want to get a backup, or create a scheduled task in Windows to run it every night.

Anyhoo, that’s about it.  It’s quick, it’s dirty, but it worked for us in a pinch.  Microsoft is just getting rolling on Azure and adding more stuff every month, so I’m sure they will provide a more elegant solution sooner or later, but this will get us by for now.

Have you had a similar experience?  How are you handling SQL Azure backups?

Prologue

So I was working for a company a few years ago. The company had been around for a while, and had a bunch of genuinely intelligent senior developers working for them.

Over the course of several years, inevitably a set of common problems arose. Writing data access code was repetitive and error-prone. Many teams were sharing the same database but using different code, so there were inconsistencies in how similar business data was treated. Each application’s log files were all over the place, and each had its own approach to error handling. Some teams would go off on their own and unnecessarily reinvent the wheel. Moving a developer from one team to another required as much ramp-up time as a new hire. Each team used a different build and versioning strategy, with the most common strategy being “none.” Setting up a test environment with multiple applications took days. Recreating the production environment was virtually impossible. Chaos. Dogs and cats living together. Mass hysteria.

To address these issues, some of the more senior architects took it upon themselves to build a framework that would greatly simplify everyone’s life. By putting in a little design upfront, they could build a framework layer that would solve many of the problems that the developers had been muddling through over the years, while at the same time homogenizing the code base.

Of course, the company had attempted this before. In fact, there were several previous frameworks over the years. But those previous frameworks were not as good as they could be, either because of design flaws, or changes in the way that the company’s applications work, or because they used outdated technology, or because the previous no-longer-with-the-company designers were now generally considered to be idiots. Anyhow, the new architects had learned from these mistakes, and were designing a new framework that would do a much better job of solving the problems. Due to the scope of such a project, it had been a work-in-progress for about a year. Sure, they were still working out some kinks, and it was not completely finalized yet, but this is a technology investment, and some growing pains were to be expected.

Déjà vu

Well, OK, I lied. This wasn’t one company I worked for. It was several different companies, all with the same story. In fact, it’s a little eerie how much you see this exact scenario play out at companies all over the industry.

Some senior developers have identified some recurring pain points for the developers, and they want to do something about it. As the company has grown, more and more developers have come on board, each with less and less experience, and things need to be brought back under control. By providing a framework, you can lay out the boundaries for developers to operate in, which will encourage consistency, will encourage code reuse, and in the end will allow the company to produce higher-quality software in less time with fewer developers, which will require less maintenance cost over time. In other words, it will pursue the single ultimate goal that should be at the center of every design decision, namely that which will advance the overall long-term profitability of the company more than any other option.

It sounds like a brilliant idea. And if it were to be accomplished, it would be great. But the unfortunate truth is that it doesn’t work. Without exception, I have only seen this result in more work for the developers, longer development cycles, more bugs, poorly compromised designs, and (worst of all) excessive, unhealthy conflict between the development teams.

Admit it, you’ve seen it too. Maybe you were even involved.

So What’s The Problem?

So why does this go wrong?

The problem is usually not intent. I don’t want to make it sound like the people involved are morons or that they have any desire to do harm to their company. In fact, usually they are excellent developers who are trying to do the best they can to solve a problem. I can’t fault them for that. I just think the approach is a little misguided.

And the problem is not the people down in the trenches, pushing back on every change the framework team wants to introduce. No, these people are trying to get a job done. Their marching orders are not to solve the whole company’s crosscutting problems, but to ship their product on time and on budget, and many of them believe, perhaps rightfully so, that the framework keeps them from doing that as efficiently as they could.

Again, the problem is the approach.

The Challenge Of Frameworks

So what is a framework? Generally, people think of a framework as something that helps you get your job done by providing access to new functionality that you didn’t have before. This is usually the selling point, used when convincing a manager or developer that this is all such a great idea, but the reality is that the true nature of a framework lies not in what it helps you to do, but rather in how it limits you.

For example, the Slightly-Almighty, Moderately-Informative, Usually-Reliable Wikipedia says:

A software framework, in computer programming, is an abstraction in which common code providing generic functionality can be selectively overridden or specialized by user code providing specific functionality…

Software frameworks have these distinguishing features that separate them from libraries or normal user applications:

1. inversion of control – In a framework, unlike in libraries or normal user applications, the overall program’s flow of control is not dictated by the caller, but by the framework.[1]

2. default behavior – A framework has a default behavior. This default behavior must actually be some useful behavior and not a series of no-ops.

3. extensibility – A framework can be extended by the user usually by selective overriding or specialized by user code providing specific functionality.

4. non-modifiable framework code – The framework code, in general, is not allowed to be modified. Users can extend the framework, but not modify its code.

 

One of the key points here is that the framework is dictating the application flow, rather than the developer who is using it. This is what Martin Fowler (who literally wrote the book on refactoring) would describe as a Foundation Framework:

A Foundation Framework is … built prior to any applications that are built on top of it. The idea is that you analyze the needs of the various applications that need the framework, then you build the framework. Once the framework is complete you then build applications on top of it. The point is that the framework really needs to have a stable API before you start work on the applications, otherwise changes to the framework will be hard to manage due to their knock-on effects with the applications.

While this sounds reasonable in theory, I’ve always seen this work badly in practice. The problem is that it’s very hard to understand the real needs of the framework. As a result the framework ends up with far more capabilities than are really needed. Often its capabilities don’t really match what the applications really need.

 

He recommends instead a Harvested Framework:

To build a framework by harvesting, you start by not trying to build a framework, but by building an application. While you build the application you don’t try to develop generic code, but you do work hard to build a well-factored and well designed application.

With one application built you then build another application which has at least some similar needs to the first one. While you do this you pay attention to any duplication between the second and first application. As you find duplication you factor it out into a common area; this common area is the proto-framework.

As you develop further applications each one further refines the framework area of the code. During the first couple of applications you’d keep everything in a single code base. After a few rounds of this the framework should begin to stabilize and you can separate out the code bases.

While this sounds harder and less efficient than FoundationFramework it seems to work better in practice.

 

I’m not sure I would even call this a framework, because all of the things that make it work best are the parts that take it further and further from being a conventional “framework”.

So Are All Frameworks Bad?

Sweet suffering succotash, no. In my humble opinion, the .NET Framework is a thing of beauty. Back in my Win32 C++ days, MFC was not perfect, but it worked serviceably well for what it was intended for, namely abstracting away the Win32 API. CMS frameworks like DotNetNuke and Drupal and Joomla have become very popular. Apparently there is a subset of people who don’t hate Flash with a passion, and apparently those people love it. MVC frameworks like Rails and Django have caught on like wildfire, with ASP.NET MVC picking up a lot of momentum as well. Microsoft Azure and Google AppEngine are in the process of changing how we will build scalable cloud-based applications into the next decade.

Have you noticed a pattern here? None of them were built by you or anyone you know. They were not built to solve a business need, they were built to reinvent a platform. They were not built to get everyone using a “Customer” object the same way, they were built to make it easier for you to do whatever you want with whatever data you need. They were not built by 3 architects for 20 developers, they were built by 30 or 300 architects for 20,000 or 200,000 developers or more. They were not designed and built and delivered and completed in a few months, they were designed and dog-fooded and tested and tweaked and redesigned over years by some of the smartest computer science experts in the business. And yet, despite all that, most of them still sucked, and the ones we use today are the select few that survived.

The thing is this: you and your internal development team of architects are not going to build the next great framework. You’re not going to build a good one. You’re not even going to build an acceptable one.

And the other thing is this: If a framework is not great, it is awful, counterproductive, and destructive.

Get In Line

By definition, most frameworks try to define a new way for your developers to develop software. They keep you from doing certain things that have been seen as problematic, and require you to do things the “right way”, assuming of course that the architects have actually thought through the right way to do things.

The problem is that there are plenty of good ways already, ways that those developers are already trained in and have spent years mastering, and you are not really as clever as you think you are. You can’t think of everything. Even if you could, you can’t design for everything. And even if you could, you shouldn’t. Trying to shoehorn them into an incomplete, shoddy, and unnecessarily restrictive framework will only breed resentment, at which point you are bleeding money that you’ll never see on an expense report. The productivity difference between a happy developer and a disgruntled developer is enormous, and constantly underestimated. You will also alienate and drive away all of your good developers, leaving you only with the not-so-great developers that really don’t have any better options.

Atlas Shrugged

In order to create a framework, you are taking on a massive responsibility. It’s not a case of adding a few features. You are building an entirely new layer of abstraction. In doing so, it is your responsibility to ensure that your framework provides the developer with every possible thing he will need, otherwise he will be stuck. If you create a data access framework, but never quite could figure out how to get BLOBs working, you’re really leaving the developer up a creek when he needs to store BLOBs. Sure, it’s a growth process, and there are always bugs to be fixed and features to be added, but when you are forcing a development team to use this framework, and in the 11th hour they realize it doesn’t have a critical feature that they need, you are introducing more obstacles than you are removing.

But We Have To Create Reusable Code!

No you don’t. Reusable code is bad.

So What?

So do we give up? Maybe. It depends.

So how do you boil the ocean? There are two answers:

1. You don’t. It’s too big of a problem.

2. One pot at a time.

It all depends on your goal.  Is it critical that you actually boil the entire ocean, or do you benefit from every little bit?

Ask yourself, do you REALLY need a framework? Do you REALLY have a problem here that needs to be solved? Do you REALLY think you will save your company time and money by pursuing this? Be honest. Try to be objective. If you find yourself getting the slightest bit excited about the idea of building a framework, recuse yourself from the decision because you are not thinking clearly.

Sure, a well-designed framework may save time and money once it is complete, but it may never be complete, and it may never be any good, and the small improvement may not save your company enough to justify the huge expense. As awful as it may seem, the honest answer may be that it is in your company’s best interest to plow ahead with the same busted up junk you’ve had all along. It may not be the most rewarding thing in the world, but you are probably not getting paid to fulfill your dreams, you are getting paid to write the damn code.

So what do we do? Are we doomed to mediocrity? Not necessarily. The other option is to get your head out of the clouds and solve a few small problems at a time. Keep your eye out for redundancies and duplicated code, make note of them, but don’t do anything right away. When you are building a new component, don’t pull your hair out if it slightly resembles some other code; you don’t have to reuse everything. Once you’ve identified a few low-hanging redundancies, go back and build some small libraries to consolidate that code. Don’t think big picture. Keep it simple. Keep it low-impact. Keep it clean. Put the big guns away. Keep constantly refactoring to make it a little better every day, and before you know it you’ll have a system that doesn’t completely suck.

Too much legacy code never gets cleaned up because everyone thinks it is too hard to throw it all out and rewrite it, and no project manager will allow the developers to waste time refactoring code that already works. They are all probably right. A huge refactoring project is probably a waste of money, and it will almost certainly fail. But small, steady, incremental improvements will almost certainly make your world a better place.

OK, maybe it’s not always inherently bad, but it is definitely not what it’s cracked up to be. And it’s definitely not the end-all-be-all that most of us were taught.

In my early years, I was dogmatically taught never to write the same code twice. Anytime you see duplicated code, or even similar code, refactor it into a function, right? What if the code was slightly different? Create some parameters. What if you need one extra field in some cases? Add a little logic around that to decide what to return. What if you run into a case where you can’t fix one caller without breaking another caller? Add another layer of abstraction and logical branching to your function. At this point, you started out trying to improve your code-hygiene a little bit, and all you’ve done is spend the whole day creating a monstrosity that is almost impossible to understand. Well sure, you don’t think so, but hey, nobody likes to admit their kids are ugly.

Always, always, always stop and ask yourself, “Am I creating more problems than I am solving?”

Keep It Simple, [Obscenity]!

The fact of the matter is, when you try to create reusable code, you are inevitably making things a lot more complicated. It’s rarely a case of just taking what you have already written and exposing it to another component or team.

If you are using the code for yourself, you just have to make sure it works for you based on how you are going to use it, and you should be able to clearly visualize all of the possible scenarios. However, in order to make the code reusable, you are creating a black box to be called by another developer; as such you need to make it work no matter what, with any sort of crazy input, and you will need to make sure that it is clearly reporting back to the caller when something goes wrong. You need to make sure that the interface makes sense to everyone, not just you. You may even need to write up some documentation (ugh). And when you are all done and have covered every base, someone will call your component in a way you never expected, and of course it won’t work because all of the code we write is rife with bugs, and everyone will argue about how this component “should” be used, and someone will demand to see the unit test coverage stats, and someone else will schedule a meeting to define the scope, and someone else will ramble on about how we need a framework, and the manager will roll his eyes and swear never to let these damn developers try to build something fancy again, and nobody is actually getting any real work done.

Worser Is More Gooder

Of course, some will argue that you should always try to make your components as reusable as possible, because it requires you to work through all of those quality issues no matter what, and will produce better software. This would be great if your sole goal was to create the best software in the world, but nobody is paying you to do that. No, you are being paid to write software of sufficient quality within the time and budget allotted. If you spend unnecessary time gold-plating your code, it may make you feel cool, but it is outright wasteful. I’m not saying to cut corners when you should not, but what I AM saying is that there definitely are times when it actually does make sense to cut corners. You need to draw a line in the sand of exactly how good this product really needs to be and stick to that, otherwise you’ll never finish. You can build software that is really cool, or you can build software that ships. Only one of those options is actually useful to society.

But, But, But….

OK, so don’t make something reusable if you don’t need to. However, what if you think you are going to need it to be reusable? I have a secret to tell you: YAGNI.

I can’t tell you how many times I’ve been working on a very specific implementation to solve a very specific problem, and somebody (who usually has initials in his job title) chimes in saying “well, make this and that and the other thing configurable, so when we sell some other product to some other company, we can include it just by flipping a few switches.” This person is figuring that “it’s just an ‘if’ statement, what’s the big deal?”, but that person is not taking into account the increased code complexity and the exponential increase in testing requirements. Often times, this is even a former developer, who actually knows how code works, but has been out of the trenches for long enough that it has impaired their judgment about code complexity and its day-to-day impacts.

But then again, math solves everything. Say your QA department originally had 143 test scripts to run through before release, and now you’ve added some new configurable switch into the system. If you really want to be thorough, you should now have 286 test scripts, because you now have to retest everything with the switch turned on, and then again with the switch turned off. Say you had actually added two different switches? 572 test scripts. That’s a lot of work for a feature that will never get used. And come on, between you and me and the desk, we all know that Mr. Executive SVP COO is never going to close those magical deals with XYZ Bank and ABC Manufacturing, partly because he has no idea what he’s selling, but also partly because you guys can’t ship anything because you now have 1144 test scripts to run through to get a small release out the door.

So How Do I Know What To Reuse?

If you aren’t going to stand around and guess what to reuse, how will you know? Easy, stop trying to prematurely reuse anything. Instead, refactor your existing code to reuse that which is already duplicated, where you can reasonably say it will actually benefit from reuse and reduce overall maintenance and complexity.

How do you know when that is? Use that fancy experience and judgment of yours, it’s the only thing that separates you from the robots that will be doing your job some day.

Well what if I don’t have the necessary experience and judgment? Easy, pay me to write your code for you. mmooney@mmdbsolutions.com. </plug>

Anyhow, my rule of thumb is: Never build something to be reused until you actually have two systems that actually need to reuse it right now. If you only have one thing that will use it right now, and that other mythical feature will also use it some day, deal with it when that feature comes along (hint: never). Even if that feature comes along, the chances that you actually predicted and designed for it correctly are approximately zero, so you’re still going to have to make design changes anyway. However, if you kept it lean and simple, it’s simply a matter of adding the new code you need to a small lightweight base. But if you tried to build something big and fancy and configurable, you’ll now have to spend twice as much time disassembling all the unnecessarily complicated junk you built around it to support the thing that you were unsuccessfully trying to support in the first place. The easiest code to refactor is the code that doesn’t yet exist.

So yeah, YAGNI.

 

This post is part 2 of a 382-part series on how to manage database changes, primarily for SQL Server, starting here.

I figured this would be a good week to discuss ways that you can make your world a better place by making small changes to things you do in your everyday work.  No, this post is not about inconvenient truths or sustainability or hybrids or manbearpig.  This post is about the importance of local development databases. 


The situation you see all too often is that a development team has a single database server in their development environment that they share, and everyone is developing application code locally while simultaneously making changes to a shared database instance.  Bad, bad, bad.

Captain Obvious

These days, most everyone develops their code locally.  That’s just what you do.  Many developers have learned the hard way that this is important, and won’t tolerate any divergence.  And for the less experienced developers who are doing it just because they are told to, eventually they will make the same mistakes and learn from them too.  This is such an easy lesson to learn that you don’t see too many people violate it intentionally.

Even if you HAVE to develop on a server environment, you’ll probably at least find a way to isolate yourself.  For example, SharePoint developers don’t tend to install the SharePoint platform on their local machines, mostly because it requires a server OS, but also because SharePoint is pure, unadulterated evil that will steal the very soul of any machine it comes into contact with.  Nonetheless, in those cases where a local machine is not practical, the developer will install the SharePoint software onto a virtual machine so that they can still work in isolation.

This local development approach is critically important to any form of version control or change management.  For all practical purposes, developers must have a workable environment that they can fully control and work in peace.  From there, developers check their code into source control, and hopefully it gets built from source control before being deployed to another server.  This gives each developer a degree of control over how much the other developers can screw them up, and more importantly it ensures that every change is traceable back to a date and time and person responsible.

This approach is so ingrained in so many developers that often we take it for granted.  Just try to remind yourself regularly how awful it was that time that everyone was working directly on the same development server, and nobody could keep track of who changed what when.  Or better yet, how fired up everyone got the last time somebody sneaked directly into the production server and started mucking around.

The database is like the village bicycle

I am consistently surprised how often developers go to so much trouble to isolate their local application development environment, and then point their local application code to a shared development database server that the whole team is working on.

If you never need to make database changes, and nobody on your team needs to make database changes, this can certainly work.  In that case, the database behaves like a third party service, rather than an actively developed part of the system.

However, if you are ever making database changes, you need to isolate your database for the exact same reasons that you need to isolate your application code.

Imagine that you are working on a project that involves several layers of DLLs communicating with each other.  Because you are in active development, you and your team are constantly making changes that affect the interfaces between those DLLs.  The result is that you continually need to check in your changes in whole batches; you can’t just check in a few files here and there, because you will be breaking the interfaces for anyone else working in that code.

The same rules must apply to the databases as well, for all of the same reasons.  At any given point in time, anyone should be able to pull the code that is in source control, build it, and run it.  However, if I’m making a series of changes to my local code and the shared development database, my crazy C# changes are isolated on my local machine, but my coworkers are getting my database changes as they happen, so their systems will stop working all of a sudden, and they won’t even know why, or worse yet they will know exactly why and I’ll be the guy “who busted everything up.”

Better yet, after a few days of wasting time on a bad design, I give up on it, and with one or two clicks I can undo all of my code changes and roll back to the main development code stream.  However, there is no one-click rollback for the database schema, so now those changes need to be manually backed out.  Hopefully I kept a good list of the changes so I can do this without missing anything, but we all know that a few things will get missed, and now the development database becomes even more of a mutant branch of the true database schema, full of changes that nobody remembers or owns, and it is all going to blow up and make us all look like fools when we are rushing to deploy it into QA next month.

DVCS isn’t all bad

Distributed Version Control Systems like Git and Mercurial are the fun new fad in version control, and everyone seems to think that they are so much cooler than more traditional and linear systems like Vault.  To me, it seems to grossly overcomplicate an already difficult issue by exacerbating the most problematic concepts, namely branching and merging.  But I’m a crusty old conservative who scoffs at anything new, so maybe (and even hopefully) I’m wrong.  I was quick to dismiss it as a new toy of bored astronauts, but some people a lot smarter than me have done the same and seem to be coming around to it, if not embracing it completely, so I will continue to believe that I am right for now, even though I know I’m probably wrong and will change my mind eventually.

But there is one idea in DVCS systems that I can get on board with, and that’s the idea that everyone is working in their own branch.  As we’ve discussed, you simply cannot be working in the same sandbox as everyone else, or you will have intractable chaos.  You should stay plugged into what everyone else is doing on a regular basis, usually through version control, but you must also isolate yourself, and you must do so thoroughly.

And here’s the thing (and this may very well be the idea that eventually opens my path to DVCS enlightenment): your local machine is a branch.  Granted, it is not a very robust branch, because it only has two states (your current state and the latest thing in source control), but you are still essentially branched until you check in, at which point you will have to merge.  It might be a really small merge, because the changes were small or backwards compatible, or because you were not branched off locally for that long, or you are the only one working on a feature, or because you’ve been communicating with the rest of your team, or because you are the only person who actually does any work, but you are merging nonetheless.

What does this have to do with databases?  Branching is all about isolation.  You must isolate your development environment, and you must do so thoroughly.  If you think of your machine as simply a branch of the source code, it crystallizes the idea that everything you are doing locally is a full stream of code, and it must contain everything needed to run that code, and must represent all of the changes in that code, including the database.  In a broader view, if you were to branch your code to represent a release or a patch or a feature, you obviously should be branching your database code at the same time (assuming of course that your database is under version control).  If that is the case, and if the code on your local machine is nothing more than a primitive branch of what is in source control, then your local machine should also have its own copy of the database.

Database scripting is hard, let’s go shopping!

I know.  This makes pushing your code to the development server more difficult, because you have to script everything out, and honestly the last thing I really want to do is complicate the lives of developers.  In fact, I think the primary purpose of most development processes should be to reduce friction, and to make a developer’s life as worry-free as possible, so that they can focus on the real complicated business problems they are paid to solve, not the silly process crap that people invent to make themselves appear smart and organized.

That being the case, it may make your life a little harder to develop locally and then write up all of the scripts necessary to push those changes to the dev server, but it is definitely worth it.  This is not a theoretical improvement that will hopefully save you time in the distant future, when design patterns rule and everybody’s tasks are on index cards and you’ve achieved 100% code coverage in your unit tests.  No, this is a real, tangible, and immediate benefit, because you will save yourself effort when you deploy it to the next stage, namely QA or production.  At that point, you’ll already have everything organized and listed out, and you did so when you were still working on the code and everything was still fresh in your mind.  In my humble opinion, this is a much more maintainable process than everyone just thrashing around in a wild-west development database, and then after it all spending days trying to figure out which schema differences need to be included to release which features, because the chances of getting that right consistently are almost non-existent.
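To be clear about what I mean by “scripts,” here’s a rough sketch of the kind of small, guarded change script you’d write up as you go, using a made-up table and column name; the same script can then be run against dev, QA, and production in turn:

-- Only add the column if it isn't there yet, so the script is safe to re-run.
-- (dbo.Customer and PreferredName are hypothetical names for this example.)
IF NOT EXISTS(SELECT * FROM sys.columns
WHERE object_id = OBJECT_ID('dbo.Customer') AND name = 'PreferredName')
BEGIN
ALTER TABLE dbo.Customer ADD PreferredName NVARCHAR(100) NULL
END
GO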

And if this is really too much work for you to do well, maybe we can find you a ball to bounce with instead.  Or maybe some UML diagrams and crayons to draw with.  Either way, get the hell out of my code.

Beating a dead horse

Hopefully I’ve convinced you that you should have your own copy of the database for your local development.  I could go on forever giving general reasons for this.  Or, I could give specific examples, like a recent client that did not follow this pattern, where our team was constantly breaking each other because the code and database were out of sync, even though we were just a small team of 3 experienced and practical developers sitting in a single office right next to each other, working together well and communicating all day; the lack of database isolation made the issues unavoidable.

So yeah, I could go on and on.  But I won’t, because it’s getting boring.  If you’re agreeing with me by now, feel free to go read something else.

I can’t run SQL Server locally, my machine sucks!

I’ve heard a lot of people say that they can’t run SQL Server locally, and sometimes they are right, but I think a lot of the time it is an excuse.

Maybe you don’t have a license to SQL Server.  That’s usually fine, SQL Server Express Edition is free.  Sure, it has some limitations, like the lack of a SQL profiler, but there are great free tools out there like the one from AnjLab.  And if you still need a full-featured copy, the developer edition costs less than $50.  Can you or your company not spare $50 for something like that?  Really?

Or maybe your machine doesn’t have enough memory.  It’s true, SQL will eat up memory like nobody’s business, and if you have Visual Studio 2008 and Outlook 2007 running, it can be pretty heavy.  But I’ve found that as long as you have 3 or 4 GB of RAM, it works pretty well, and doesn’t everyone have that these days?  Sure, a lot of you are stuck with crappy old machines that your employer gave you because he considers you to be a high-priced janitor, and he can’t justify in his mind spending a few hundred extra to help you be more productive, but in that case you have bigger problems than anything we’re going to solve here.  I would say, if possible, you should even shell out a few hundred and get some more memory for your machine, even if it’s a work machine and they won’t reimburse you for it.  I know plenty of people who would be opposed to this just out of principle, but those people and their principles can go have their own little pity party and see who comes; in the meantime I’d rather solve the problem and move on.

Too Beaucoup?

However there is certainly one potential problem that can be difficult to overcome.  What if your existing database is just too damn big to run locally?

One recent client had a production database which was used for a million unrelated purposes, and it was 30 GB.  Another recent client had the vast majority of their business data spread across two databases that were each about 300 GB.  Sometimes, the database is just too big to copy down to your local machine.  There are a few ways to deal with the problem.

Sometimes the best option is to separate the schema and the data.  Strip down the data, get rid of the 300 GB, and get the minimum amount of sample data necessary to run your applications.  Maybe clear it out entirely, and have some scripts or automated tests that generate a batch of sample data.  Often times this will require a lot of analysis to determine what is necessary, and what all of the data is being used for, but that’s not an entirely bad thing.  If you get a good and portable development database out of it, while also getting a better understanding of how the data is being used, then that has a lot of benefits.  Granted, this is not easy by any stretch, but it may be doable.  It all depends on your situation.
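As a trivial example of what those sample-data scripts might look like (table and column names invented for illustration), instead of copying down 300 GB of real records you just seed enough rows to exercise the application:

-- Generate a small batch of fake customers instead of copying production data.
-- (dbo.Customer and its columns are hypothetical names for this sketch.)
DECLARE @i INT
SET @i = 1
WHILE @i <= 50
BEGIN
INSERT INTO dbo.Customer (Name, EmailAddress)
VALUES ('Test Customer ' + CAST(@i AS VARCHAR(10)), 'testcustomer' + CAST(@i AS VARCHAR(10)) + '@example.com')
SET @i = @i + 1
END
GO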

Another option is to set up a single high-powered development database server, and give each developer their own instance of the database on that server.  This approach can have its own problems as well, such as people getting confused about which database instance belongs to whom, and having enough disk space to store the inevitable terabytes of data.

So if you have any of these problems, my sympathies to you, and I hope that you can find a workable solution. 

The End

So hopefully you are now inspired to change your process a little bit, or at least entertained for a little while.

Or, if you’ve been trying to do the same thing but aren’t getting the buy-in you need, you have a little more reference material for saying “I read an article about this on the interwebs, and some guy said…”.  It’s one reason I make these things so long, as a favor to you, because the person you are arguing with isn’t actually going to read all of this crap, so maybe they will just give up and let you do whatever you want.  The person that wins the argument usually isn’t the person who is right, it’s usually the person who is willing to waste the most time arguing about it.

Good luck!

So you have a million passwords, right?  Every site out there requires you to enter a username and password to buy a widget that plugs into your doohickey for whatever silly little hobby you have that is supposed to distract you from writing code in your free time so you don’t feel like a complete loser. 

Or better yet, you need a username and password to buy flowers for your wife.  Or you need a username and password if you’re buying a $3.75 Cozy Coupe Wheel With Capnut, plus $7.95 shipping, when all you really need is the damn capnut anyway because the first one got slightly bent during assembly so every now and then the wheel pops off and your horrified toddler is trapped in plastic car wreck.  Or something like that.

And of course, we all know that you shouldn’t be using the same password everywhere.  At the very least, you should be using a different password for everything important and sensitive, like email and banking and online gambling.

Some say Open ID may be the answer.  It certainly is gaining popularity with many sites in the development community.   But the real test will be if it ever catches on with people who have real lives, and really couldn’t care less about your cool new shared authentication mechanism, and don’t really know or care that they shouldn’t reuse their favorite celebrity’s nickname as their password everywhere.

But even then, even if the world were to become thusly enlightened and a large number of the sites out there started using Open ID as their core authentication, there will still be countless little sites out there written by internal IT departments who have never even heard of Open ID and certainly aren’t going to trust some new-fangled “Web 2.0” technology, when they’ve spent the last 10 years working their way up to “Enterprise Architect” of their little fiefdom, and they are certainly smart enough to build a completely secure authentication system from scratch that is going to be so much better than anything anyone has ever seen, thank you very much.

So, yeah, you’re probably still going to be stuck with a million passwords.  Or maybe just half a million, but it’s still the same problem.  If someone dumps a half-ton of manure on your front lawn, are you really relieved that it wasn’t a full ton?

Password Safe to the rescue

I’ve been using Password Safe for years, and I definitely recommend it.  It’s very easy to add new entries, to quickly generate secure passwords, and to attach additional notes (like the answers to the stupid security questions that don’t have clear and definitive answers). 

Of course, it doesn’t have to be Password Safe, there are plenty of other good and free products out there, but I’m not that familiar with them so I’m going to assume that they make your computer burst into flames.

Another benefit of Password Safe, besides the lack of flames, is that the database is very portable, so you can easily copy it to another computer.  However, what about keeping the databases synchronized across multiple computers you ask?

DropBox to the rescue of Password Safe’s … rescue, or something

Since the Password Safe database is just a file, it’s actually pretty easy to keep it synchronized across a few separate machines.  As a pathological-job-hopper-turned-consultant, I’ve usually had some new machine for some reason every six months or so, and I end up with a LOT of copies of my password database floating around.  But after a few years of headaches and manually copying/merging password databases, services like DropBox came along and solved the problem for me.

Since DropBox treats a directory on your machine as a share and automatically syncs that directory across all of your machines through the DropBox cloud (+1 Google buzzword, yay), all you have to do is keep your working copy of the password database on your DropBox share, and voila, you always have your up-to-date passwords at your fingertips.

Well, almost.  Of course, there is one gotcha.  When you have Password Safe open in read/write mode, it locks the file (more specifically, it locks the .plk file).  This will actually block the DropBox sync process and prevent it from synchronizing not just the database file, but also anything else on the share.  If you’re like me, you very rarely make changes to your password list, so I just go into Password Safe and select the option to always open database files as read-only by default, and everyone is happy.

Good luck.

 

I was having this conversation with a coworker last week, and it seemed like the 100th time I’ve said this.  Then it popped up on StackOverflow this morning, and that got me typing, so here’s yet another salty developer’s rant on the topic.

There is no question that NULLs should be avoided when possible, because they can introduce several problems with indexing and query syntax.

Due to the problems that NULLs can introduce, there has been a push to stop using them.  However, like most such movements, it has gotten out of control, to the point that some people fanatically insist that NULLs should never be used in the database.  I was in this camp for a lot of years, before I found it to be a little overzealous.

The answer is somewhere in between.  NULLs should be avoided whenever possible, but there are indeed valid business reasons for storing them. 

Often times you need to store an optional numeric value, and many people will tell you to just store a zero to indicate "no value", but this is an even worse antipattern of storing a magical value that really means something else.  But then what if you can’t use zero for one field, because zero is considered a meaningful value as well, or because the value is being used as a divisor?  Now you need to use some other magical value (-1?) for that field in just this case, and you have an even worse problem, because all of your optional numeric fields are now behaving differently.  Oye.

Dates are an even more compelling candidate for nullable fields.  The standard "no value" that people use in .NET is the default unassigned DateTime value, which is DateTime.MinValue, or more specifically January 1, 0001.  However, you cannot write this magic value into a SQL database, because the minimum value for a SQL Server DATETIME field is January 1, 1753.  You now have to have something that checks that you are translating those values properly as they are written to and read from the database, and you have to have defensive coding all over the place that checks whether your date fields are less than SqlDateTime.MinValue, instead of just checking whether they are equal to DateTime.MinValue.  Double Oye.

My preference is to deal with the values as they truly are, and not to build a lot of artificial constructs to hide the true meaning and usage of the field.  If a field may very well not have a value, make it nullable, and make it nullable in your application objects as well (if your language supports such a thing).  Then, anytime you are using that field, you are required to consider what should be done in the case where it is NULL, but that is actually a good thing.  Generally I am opposed to making developers waste brain cycles on unnecessary code complexity, but that is because it steals focus away from the true business problem being solved; however, in this case, the lack of a value IS part of the true business problem and must be thought through.  If you are just defaulting these values, then the developer writing a formula or algorithm will be less likely to think through those edge conditions where the values are missing, and may not even realize at the time that it is a possibility that those values are missing.
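To put that in concrete terms, here’s a hedged little sketch using an invented Orders table: the column admits outright that the value can be missing, and then every query that touches it has to decide, right there in the query, what a missing value means.

-- The schema says what is true: the discount is optional.
-- (dbo.Orders, DiscountPercent, and TotalAmount are made-up names for this example.)
ALTER TABLE dbo.Orders ADD DiscountPercent DECIMAL(5,2) NULL
GO
-- Any formula that uses it is forced to handle the NULL case explicitly.
SELECT OrderID,
TotalAmount * (1 - ISNULL(DiscountPercent, 0) / 100) AS AmountDue
FROM dbo.Orders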

 

At the very least, it’s probably not nearly as good as it should be.  As a pathological job-hopper who sees one company after another wasting hidden hours and days on insufficient database change control strategies, I’m sick of it.

So how does your company handle database changes?  I’ve asked this of potential employers many times, and I usually just get blank stares, or a vague answer along the lines of “uh… with scripts?”  Sometimes, if they are honest, they just reply “poorly.”  There’s not even an explicit Joel Test for it.  Considering how bad the situation is at most companies, the real test is not whether their database change management is any good, but just whether they are willing to recognize how problematic it really is.

Given how much thought and effort goes into source code control and change management at many of these same companies, it is confusing and a little unsettling that so much less progress has been made on the database change management front.  Many developers can give you a 15 minute explanation of their source code strategy, why they are doing certain things and referencing books and blog posts to support their approach, but when it comes to database changes it is usually just an ad-hoc system that has evolved over time and everyone is a little bit ashamed of it. 

Hopefully we can get to the bottom of why this is and come up with a few ways to make it better.  It’s not like the database is important, is it?

 
I like you, but not in that way

So why don’t developers think about databases the same way that they think about source code? 

The most obvious reason is that database changes are fundamentally different in many ways.  While source code is usually just a collection of files that can be recompiled, versioned, and released at any given time, databases are much more temperamental.  They have existing data, and they have history.  Sure, source code has history, which you can review for reference purposes, but in databases the historical lineage of a table is actually very important.  In C#, if you added a field to a class, and then a week later someone else changed the type, and then a month later someone else changed the name, it usually doesn’t really matter too much when or in what order those changes happened; all that matters is the current state of the code and that everything builds together and works in its latest state.  However, if you were to do the same thing to a field in a table, it definitely makes a difference, because there are data implications at every step of the way.  This alone scares a lot of developers away from maintaining the database.
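To make that difference concrete, here’s a hedged example with made-up names: the same three changes that would be a single diff in the C# class have to be replayed against the database as ordered steps, each one dragging the existing data along with it.

-- Week 1: the column is added. (dbo.Customer and Rating are hypothetical names.)
ALTER TABLE dbo.Customer ADD Rating INT NULL
GO
-- Week 2: the type changes, and the existing data has to survive the conversion.
ALTER TABLE dbo.Customer ALTER COLUMN Rating DECIMAL(5,2) NULL
GO
-- A month later: the rename, which in SQL Server is its own stored procedure call.
EXEC sp_rename 'dbo.Customer.Rating', 'CustomerRating', 'COLUMN'
GO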

To many developers, there is something fundamentally uncontrollable about databases.  They don’t fit into the safe model that we’re used to.  Managing those changes definitely introduces new challenges, and many developers just don’t want to be bothered.

 

Doing Business As

Another major reason for the difference is just cultural.  Developers want the database to work well, and they may even like tinkering around with some stored procedures from time to time, but at the end of the day they like to be able to wash their hands of anything that slightly resembles being a “DBA”.

When you compare the DBAs (database administrators) and SCMs (stupid code monkeys), often times there is just a different way of looking at the world.

Developers often see themselves as the general of a utilitarian army of code that they build and train and then order off to carry out their mission, and they usually won’t hesitate to trash it all and replace it if it proves itself unable to complete the mission.  DBAs on the other hand are used to dealing with a gentle lion that could easily lose its temper and kill everyone in the room if it’s not treated with kindness and respect.  Developers often have the option to wipe the slate clean and start over, and usually want to, especially when they are dealing with someone else’s code.  DBAs however are stuck with the original version and they need to keep it alive and make it work until the point that the data can be migrated to a better system, and we all secretly know that is never going to happen.

 

Code source control is an over-ripe banana

As developers, we are awash with tools and techniques for managing our source code changes.  We have a seemingly endless number of source control platforms to choose from, each with more features than we could ever possibly need or even understand, and some people build entire careers around defining complicated branching and merging strategies for massive codebases.  I figured for a while that the ideas of practical source control were pretty well nailed down, but then you read Eric Sink’s posts on Distributed Version Control Systems, and get an idea of how quickly the field is continuing to evolve.

At some internal IT department, a group of developers is toiling away on a set of dull yet fabulously enterprise-y software that five people will use.  Jimmy checks his code changes into his company’s version control system of choice, where it is automatically held at the gates until it is code-reviewed by a senior developer, and then it is checked into the team’s current release branch.  Meanwhile the continuous integration build server will download, compile, package, and unit test the code to make sure that Jimmy hasn’t broken anything, and that nobody else’s changes broke Jimmy.  The packaged code is then migrated through an integration environment, QA environment, UAT environment, and staging environment on its way to production.  All the while, as the changes are validated from one environment to another, version numbers are automatically assigned to allow anyone to trace back the exact revision for a given build, and the corresponding code changes slowly work their way towards the code-release promised land, the Main Trunk.  Those branches can get pretty damned complicated, even when everything is going smoothly, and it never goes smoothly.


Hopefully this is a process that Jimmy’s company evolved out of a mixture of necessity and forethought over the years.  The other, less attractive, and far more common scenario is that the company hired an astronaut trapped in a developer’s body, bored with his work and not bound by any sense of urgency, who assured the company that he was going to implement a lot of big complicated processes because that’s just what professional companies do. 

In the end, a whole lot of people are paying a whole lot of attention to managing the source code.  Hopefully at your company, you are paying attention to this too.

 

Databases, the redheaded stepchild of source control

Now ask yourself, how does Jimmy handle database changes? 

For example, say that all database changes need to be checked into an isolated directory in source control; after all we’re not savages.  However, since you don’t really “build” a database, that directory is ignored by their continuous integration server.  This in turn breaks the unit tests that point to the integration database server, so Jimmy then needs to run those scripts manually in the integration environment. 

In this process, Jimmy sees other database scripts that were checked in recently in the same source control directory, but he has no way to know which scripts have already been applied to the integration server.  For the briefest moment, Jimmy considers applying those scripts as well, just to make sure that the integration server is fully up-to-date, but then he realizes that he can’t be sure which scripts have already been run without manually comparing the schema and scripts to see which have been applied, and this would make Jimmy the de facto owner of any issues that arise because of it.  With his own tasks and deadlines to worry about, Jimmy doesn’t have the time or patience for this silliness, so he just deploys his scripts, forgets about the others, and hopes for the best. 

It’s worth noting here that this is the kind of thing that can silently kill software quality.  A motivated developer just tried to make things a little better, but the process was so frustratingly inadequate that it was impractical for him to do so.  Software companies depend on their developers taking the initiative to improve things, and when they are discouraged from doing so, either by person or by process, the company will slowly slide backwards into mediocrity, and it will drag every developer’s morale down with it.

 

Now once Jimmy makes the changes to the integration server database, that also breaks other developers who have been using that server for their development.  Those developers now need to stop and download the latest code changes to get back in sync, cursing Jimmy’s name the whole way.

Anyhow, during the next deployment to QA, someone needs to remember that these specific changes need to get deployed.  Since there is no defined strategy for tying database changes to code changes, every time code is deployed there is a little bit of confusion around exactly which database changes need to be released, which were already released, and what order the scripts need to be run in.  Jimmy is getting upset.

Another darker possibility is that instead Jimmy needs to submit his changes to the database review board, a collection of detached idealists, college professors without the college, who will criticize every aspect of the scripts in order to justify their existence, but will not really offer any true value because they don’t understand the business problem that needs to be solved, nor do they appreciate the application considerations beyond the database that need to be satisfied.

One of the long term impacts of this is that Jimmy will look for any possible way to accomplish what he is trying to do without making database changes, because, in his colorful youthful vernacular, “making changes to the database is a sodding bitch.”  And if he does indeed need to change the database, he’ll try to accomplish it just by changing stored procedures, because changing table schemas is even worse.  In the end, he’s definitely not trying to find the appropriate solution to the problem; instead he’s backed into a situation of being forced to find a “good-enough” solution that will minimize his hassle, regardless of the downstream impacts. 

From now on, he’ll look for any way he can accomplish it by only changing stored procedures and not changing the underlying schema.  If he’s lucky (sort of), he’ll find a way that he can just kludge the stored procedures to work around the problem for now, and let it be someone else’s problem to solve later.  He has long since given up trying to find the “right” solution, because it is so exhausting; the current state of things is so busted up that it’s not even worth trying anymore.

Further complicating the issue, some developers and DBAs make one-off changes in the QA and production environments without going through source control.  Either they need to make an emergency fix in one of the environments and forget to go back and add the scripts to source control, or worse they just don’t believe that databases should be under source control (I’ve seen this attitude far too often from some DBAs, because they can’t stand the idea of being forced to use a source control system managed by some other developers, just so that they can make changes to their own database).  Pretty soon, every environment is a perverted branch of the one true database, and trying to identify why a test works in one environment and fails in another quickly becomes a nightmare.

 

Some day, things will be different!

So what’s the solution?  Well, in my humble opinion, we need something like this:

  • The database must be versioned, so that it is easy to tell which changes have been applied and when they were applied (see the sketch after this list).
  • All database changes must be checked into source control.
  • All database changes can be tied to the code changes that they affect, ideally checked into source control as part of the same changeset transaction.
  • The database changes are built along with the code changes.
  • The database changes are deployed along with the code changes.
  • The continuous integration server must be able to build and update its own copy of the database, so that it can run automated tests of code and scripts that are checked in at the same time.
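
To make the versioning point a little more concrete, here is roughly the kind of thing I have in mind.  This is just a sketch to illustrate the idea, and the table, column, and script names are all made up, not a final design:

-- Each deployed script gets a row, so any environment can report exactly
-- which changes it has received, when, and from whom.
CREATE TABLE DatabaseChangeLog
(
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    ScriptName VARCHAR(250) NOT NULL,
    AppliedDateTime DATETIME NOT NULL DEFAULT GETDATE(),
    AppliedByUserName VARCHAR(100) NOT NULL
)

-- The deployment tool (or a disciplined human) records each script as it runs...
INSERT INTO DatabaseChangeLog (ScriptName, AppliedByUserName)
VALUES ('0042_AddNewColumns.sql', 'jimmy')

-- ...so "which scripts have been applied to this server?" becomes a query
-- instead of an archaeology project.
SELECT ScriptName, AppliedDateTime, AppliedByUserName
FROM DatabaseChangeLog
ORDER BY AppliedDateTime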

I’ve seen some companies that have had home-grown utilities that come close to accomplishing this, but in the end they all fell just a little bit short, which is not too bad.  However, the vast majority of companies I’ve seen were not even in the ball park.

Some of you are probably asking, “doesn’t Visual Studio Team System do this?”  Yeah, I think so.  Probably, but who knows.  Honestly I tried working with it a few times, and it caused me nothing but problems.  Sure, I could spend a lot of time mastering all the quirks, but I’m looking for something a little bit more accessible here.  The underlying concepts are hard enough; we need an approach that simplifies it, and I just don’t think that VSTS accomplishes that.  More importantly, and also along the lines of accessibility, VSTS costs a fortune, and so most developers will never have access to it, so I’d like to find something that can work for the other 95% of developers out there that are stuck using reasonably-priced tools.

What about Red Gate’s SQL Compare and SQL Data Compare products?  Absolutely, those are indeed awesome products that can prove invaluable in many cases.  However, they are often leveraged only after an unknown number of database changes have already been made by an unknown number of people.  I think the better solution, whenever possible, is to step back and track the changes as they come in, proactively considering each change, just like we do for regular source code, which allows for far more robust change management and accountability for the database.

So that’s the idea.  In the next few weeks I’ll have some more posts on this topic, getting into some more specifics about how to solve this.  Hopefully we can find a simple, easy-to-manage solution that developers can adapt to quickly, and which promotes database changes to the first-class citizen of the change management world that they deserve to be. 

If you’re following along, I’ve recently named this effort “Parallax” for long and windy reasons.  You can read all about it here.

I’ve copied the project up to CodePlex, at http://parallax.codeplex.com.  Feel free take a look.

There’s not much there yet, just a few database tables, but it can get us started talking about the approach we’re taking to the database.

 

Here’s what we have so far:

(ParallaxDB_1: diagram of the initial Parallax database tables)

 

There are a few things going on here:

Singular Table Names

I’ve used singular table names for all tables.  This is a long and well-worn debate, whether to use singular or pluralized table names.

For years, I was a strong proponent of the pluralized table names.  At the time, it was probably mostly because that’s what I had used before, and like most people I tend to come up with rational arguments to support my subjective preferences, because cognitive dissonance is yucky.

Plural Table Names Are Good

One of the main reasons for plural table names, at least for me, is that from one perspective it matches what the table actually represents.  It is a container that has a list of work items, or a list of users, or a list of whatever.  If we’re going to store a list of work items in a container, by golly we should call that thing “WorkItems”.  It’s the same logic that is applied to object models; if your Person object contains a list of addresses, you’re going to call the property something like “Addresses” or “AddressList”, not “Address”.  The idea of describing a whole list of customers as just “Customer” just rubbed me the wrong way.

Singular Table Names Are More Gooder

However, there are also several benefits to using singular table names.  First, it avoids a lot of confusion about the proper way to pluralize a name. 

Should the table that contains a list of person records be called “People” or “Persons”?  “People” is a more common name, but “Persons” has the benefit of simplicity, because it is just appending an “s” to the end.  I’ve heard this argument go back and forth, and it doesn’t matter who’s right or wrong, because the fact that you’re discussing it at all is just a damn waste of time.  You are introducing artificial and unnecessary complexity to your project, which you and every other developer will have to worry about for the life of the project.  Hidden productivity costs like this add up significantly over time, and they suck the life out of your developers, often times without them even realizing why they hate their job.

 

Also, assuming you follow some sort of standardized naming convention for primary and foreign keys, this can improve the readability and writability of your code, because you will not be mixing singular and plural names.  To me, this:


SELECT * FROM WorkItem
INNER JOIN WorkItemType
ON WorkItem.WorkItemTypeID = WorkItemType.ID

 

Flows a little better than this:


SELECT * FROM WorkItems
INNER JOIN WorkItemTypes
ON WorkItems.WorkItemTypeID = WorkItemTypes.ID

 

Lastly, if you ever want to have any scripts for performing automated maintenance or generating code, it’s a lot easier to deal with a single form of each name and not have to worry about pluralized names.  We’re going to be doing at least a little of that in Parallax, along the lines of the example below.
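
For example, something like this (purely an illustration, not code that exists in Parallax yet): with singular names, a maintenance or code-generation script can derive the conventional foreign key column name for every table with nothing more than string concatenation.

-- With singular table names, the conventional foreign key column name is
-- always just the table name plus "ID"; with plural names ("WorkItems",
-- "People") this script would first need its own de-pluralization rules.
SELECT TABLE_NAME,
    TABLE_NAME + 'ID' AS ForeignKeyColumnName
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'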

Keep It Consistent

So like in most things, there are valid reasons for either approach, and in the end it probably doesn’t matter too much which you choose.  However, what does matter is that you do choose and stick to it.  If you start switching between plural and singular names, it is one of the first steps towards chaos.  Your developers will have to continuously waste an unacceptable amount of time to stop and think “was that table called Person, or Persons, or People, or Peoples, or Peeps, or…?”

 

Primary and Foreign Keys

When I create a table, the first thing I do is create a clustered integer primary key named “ID”.  99% of the time, I do it every time.

Naming Conventions

Could I include the table name in the primary key column, e.g. “WorkItemID” instead of “ID”?  Sure, but I chose not to, again for subjective reasons.  I feel that the primary key is a really important field on the table, and should jump out in a query as the single most important and identifying field in the table.  When you look at a query and are trying to visualize exactly what it is doing, I feel that it makes it just a little bit clearer to see the primary keys first and work out from there.  Of course, one shortcoming is that it can get a little unclear when you start using vague subquery aliases, like “subq.ID”, but I think that is just another good reason to not use vague aliases in the first place.

For foreign keys, I just add “ID” to the referenced table name, such as “WorkItemID”. This way, when you’re writing a query, you never really have to stop and think what the field name is, or worse yet, go look it up in the schema.  Now there are cases where this doesn’t work, like when you have multiple independent foreign keys in a table referencing the same table.  For example, the WorkItem table references the AlmUser table for two different fields, the reporting user and the assigned-to user, and we’ll be adding several more shortly.  In this case, we can’t call them both “AlmUserID”, so we’ll add a prefix to them to identify their purpose, “ReportingAlmUserID” and “AssignedToAlmUserID”.  This can also be used in cases where the reason for the key is not immediately evident, but this starts down a slippery slope of naming things certain ways just because you feel like it on a certain day, which should be avoided as much as possible.

Clustered Index

What does the clustered part mean?  A table can have a single clustered index, and that means that every row in the table will actually be physically stored in that order, which means that retrieving records by that index is faster than with a standard index.  The tradeoff is that inserting new records into the middle of the index can be slow because it affects the physical layout of the table.  However, this works especially well for auto-incrementing integer fields, because the new records always go at the end of the table and the end of the index. 
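
Putting those pieces together, the WorkItem table looks something like this.  This is a simplified sketch rather than the exact schema (the real table has more columns, and the constraint names are just my own convention), and it assumes the WorkItemType and AlmUser tables from the diagram above already exist:

-- The "ID" clustered integer primary key, plus foreign keys named after the
-- table they reference, with prefixes where two keys point at the same table.
CREATE TABLE WorkItem
(
    ID INT IDENTITY(1,1) NOT NULL,
    WorkItemTypeID INT NOT NULL,
    ReportingAlmUserID INT NOT NULL,
    AssignedToAlmUserID INT NULL,
    CONSTRAINT PK_WorkItem PRIMARY KEY CLUSTERED (ID),
    CONSTRAINT FK_WorkItem_WorkItemType FOREIGN KEY (WorkItemTypeID) REFERENCES WorkItemType (ID),
    CONSTRAINT FK_WorkItem_ReportingAlmUser FOREIGN KEY (ReportingAlmUserID) REFERENCES AlmUser (ID),
    CONSTRAINT FK_WorkItem_AssignedToAlmUser FOREIGN KEY (AssignedToAlmUserID) REFERENCES AlmUser (ID)
)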

Natural/Intelligent Keys

This extra, auto-generated ID column approach is called a surrogate key, which means that it is not derived from any application data.  So why reserve a whole extra field for this?  Another approach is to use a “natural” or “intelligent” key, which is a naturally occurring piece of data that is sure to be unique across all the rows.  One of the more common examples you see is using Social Security Number as the key for an Employee table. 

There are several reasons not to use natural keys, but the first and most important to me, is: Get Your Damn Business Rules Out Of My Schema.  I’m designing a database here, and my primary concern is that the data can be reliably and efficiently stored and retrieved.  You may say today that every employee has a Social Security Number and that it’s guaranteed to be unique, and then in 6 months we need to change the application because we need to enter employee records before we have their SSN, or somehow a dead employee’s SSN got reissued to a new employee, or a new government regulation restricts you from storing SSNs in the database, or about a million other things that you won’t think of because you’re really not nearly as smart as you think you are.  If you’ve built your entire system around the assumption that a piece of data will be unique, you’ve really painted yourself into a corner when that assumption is inevitably proven to be wrong.  Wouldn’t you rather just start with a single ID field that you can always rely on to be unique?  Its only purpose in life is to make sure that your data can be reliably identified, and that makes me feel warm and fuzzy.

And for a very real case of this, just this past fall I had to change SportsCommander to allow users to change their username.  Why would you ever allow users to do this?  Because they needed to.  The original implementation auto-generated cryptic usernames, and this was only done to match a legacy system it had replaced, but the users could never remember their username.  So despite how much it drove me nuts to change this after the fact, we accepted that it was critically important to our users’ satisfaction with the site to allow them to select a new username.  In the end, the technology gets less important, and you either have fantastically happy users or you’re selling flowers by the highway.  It would have been easy to say up front that “users can never change their username, why would we ever allow that?”, and if we had used the username as a primary key, we’d really be screwed.  Luckily, we already knew that we were simpleminded fools who could not predict the future, and did our best to protect ourselves against that. 

Of course, in the original version of SportsCommander, we could have also used the USA Volleyball ID number to identify users, because every user in the system is a USAV member with their own number.  However, it turned out that some users had their USAV number change from one season to the next.  Also, we subsequently expanded SportsCommander to support organizations beyond USA Volleyball, so the identifier further ceased to be the grandly unique identifier it originally seemed to be.

 

Also, a natural key is slow.  Sure, if it’s indexed, it can be really fast, but a lookup on a string field will never be as fast as on an integer field.  It also undermines any gain achieved by using clustered indexes, because your data is no longer sequential.  The difference may be negligible a lot of the time, and I definitely don’t want to start doing premature optimization or complicating the code to squeeze out a tiny bit of performance.  However, this is a small thing that can scale significantly depending on the amount of data and tables you have, because you are constantly doing these lookups against almost every row you touch. 

Of course, natural keys aren’t all bad.  One benefit of natural keys is that they can make bulk importing or synchronizing of data easier.  For example, imagine that you are importing a list of employee records and also a list of their timesheet hours.  You’ll make two passes through the data, once for the employee records themselves, and then a second pass for the timesheet hours.  During the second pass, you’ll need to correlate the hours with the employee records that you’ve already inserted, and it’s a lot easier and quicker if you can use a value you already have (like the SSN) than to have to use that value to look up another arbitrary value (like an ID field).  However, in the end, I really don’t care about this all that much.  It’s a small subset of the usage of the database, and I feel it pales in comparison to the other factors.  Sure, it might make a DBA’s life a little more difficult for some data import they have to run in the background from time to time, but honestly I really don’t give a damn.  Life is full of compromises, and the total system is almost always better off, in terms of stability and total development time, with a more consistent and reliable surrogate key.

GUID Keys

So why not use Globally Unique Identifiers (GUIDs) for keys instead of numeric IDs?  GUIDs definitely have their place, but it really depends on what you’re trying to do.

GUIDs have a lot of benefits due to the fact that they are always globally unique, while your cute little identity (i.e. auto-incrementing)  ID is only unique within your table.  The identity isn’t even unique throughout the database, because each table maintains its own list (this of course assumes SQL Server, not Oracle, I don’t know how MySQL handles this because I really don’t care). 

Having globally unique keys can be useful if you are doing any replication from one database to another, or if you need to store the data in several different databases for other reasons, because you can assign a GUID ID to a record and be assured that it won’t conflict with an ID from another database.  For example, if you had a database that stored a master list of users, and those users are represented in several satellite databases, maybe for several separate web applications at the same company, GUID keys would make sense because the data by its very nature should be unique beyond the database. 

Another benefit is that because they are non-sequential, you don’t have to worry about people guessing adjacent values.  If you are going to include any IDs in a query string, it’s usually important to use values that will not allow the user to guess additional values.  If a user is on a page called “EditRecord.aspx?ID=12”, you know for sure that some user is going to try typing in “EditRecord.aspx?ID=13” just to see what would happen.  Ideally in these cases your application will restrict the user from getting access to records that he/she/it/heshe should not have access to, but it’s a good idea to not encourage your users to go looking for these types of vulnerabilities.  If the user sees “EditRecord.aspx?ID=ceb143ce-b4e9-4d59-bedb-bb07c5f0eb41”, good luck trying to guess another value from that table.

However, GUIDs are slow.  Again, if it’s indexed properly, it should perform sufficiently, but much like string fields GUIDs will never be as fast as integers.

Also, GUIDs make debugging a lot harder.  You can’t yell over the cube wall “hey Barack, can you take a look at record 49bebb82-a96c-4bfc-b222-0b9df372ec2a, it looks like you screwed something up again!”.  Even when you are just running a few queries to track down a problem in the middle of the night, it goes a lot quicker and burns fewer brain cycles if you can reference an easily-remembered number, like 8675309, instead of a 36-character identifier.

So in the end, I usually just use integer keys for everything.  That way I get the best performance and the easiest debugging, and it fits my needs 95% of the time.  There are still always cases where you could use a GUID identifier, so I just put an extra GUID column on those tables.  That gives us the fast and easy integer key lookups for most queries, and the GUID uniqueness added on when we need it.  We’ve done that here for the AlmUser table, because there are a lot of cases where you don’t want to have users guessing the user ID number.
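
Here’s a sketch of what that hybrid looks like for AlmUser.  The GUID column name and the column sizes are just placeholders (the real table will have more fields), but the shape is the point: the integer ID stays the primary key for joins, and the GUID is the value we hand out when an identifier has to leave the database.

CREATE TABLE AlmUser
(
    ID INT IDENTITY(1,1) NOT NULL,
    AlmUserGuid UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    FirstName VARCHAR(100) NOT NULL,
    LastName VARCHAR(100) NOT NULL,
    CONSTRAINT PK_AlmUser PRIMARY KEY CLUSTERED (ID),
    CONSTRAINT UQ_AlmUser_AlmUserGuid UNIQUE (AlmUserGuid)
)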

 

Reserved Words

Often times when you are creating a schema, you use table or field names that are reserved by SQL.  You may name a table “User” or name a field “Description”.  You may not think anything of it at the time, and SQL will let you do it without complaint, but then when you try to write queries against it you’ll get syntax errors or weird behavior.  A lot of developers know the frustration of typing that last character of the field name and watching the font color turn blue.

Of course, SQL provides a way around this.  By wrapping your identifiers in brackets (like [User].[Description]), you can help SQL avoid this confusion.  Any database layer that automatically generates SQL code should ALWAYS do this, because you never know what you might encounter.
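
For example (a hypothetical schema, not something in Parallax, which avoids these names entirely): “User” and “Order” are both reserved words, so without the brackets this query won’t even parse, but with them it runs fine.

-- Bracketing the reserved-word identifiers keeps the parser happy.
SELECT [User].UserName, [Order].OrderDate
FROM [Order]
INNER JOIN [User]
ON [Order].UserID = [User].ID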

However, typing in the brackets is a pain, and it hampers readability, and people often forget to do it, and it’s one of the last things I want to have to worry about while I’m writing code.

Most other programming platforms, like C++ or C#, prevent you from even using reserved words for your own purposes, because really, who needs the headache of keeping them straight?  They are called “reserved” for a very good reason, and I hate that SQL doesn’t actually “reserve” these words.

Anyhow, just because SQL doesn’t enforce the rule doesn’t mean that we can’t enforce it ourselves.  That being the case, stay away from this list of SQL Server 2008 reserved words.  You can use them as any part of a name, such as “AlmUser” or “ItemDescription”, but just don’t use them by themselves.

 

AlmUser Table

Just about every database I’ve designed that drives a web application includes some sort of a User table.  It may store authentication information, or personal information, or may just be an intersection point for everything else.

In this case we will store some basic identifying information, like FirstName and LastName.  We’ll expand it over time to include additional fields.  There is also a GUID column; if we ever need to pass a user ID down to the browser, that is the version we’d like to use, so that people can’t guess other users’ IDs.

However, you may notice that it does not include the necessary fields for authentication, like Password.  We’re going to outsource that for now using the ASP.NET Membership Provider interface.  The default ASP.NET Membership Provider just stores the information in some database tables, but you can also have providers to connect to Active Directory, OpenID, or any number of other third-party authentication systems.  For now, we’ll use the default provider, and we may store the tables in the Parallax database just to keep everything together, but there will be NO connection or references between those membership tables and the rest of the Parallax schema.  At any point in time, we will want to be able to pull that out and replace it with another provider.

Granted, this approach may be overkill, and it is usually not necessary for most applications to keep their membership information completely abstracted.  However, as I’ve mentioned, I’m not pretending that we’re building a practical application here with realistic design constraints; this is a sandbox where we will explore several technologies, and membership providers are one of the topics I’d like to explore.

 

Audit Columns

Every table ends in 4 fields, indicating who created the record and when, and who last updated the record and when.  Granted, this only displays two dimensions of the history, rather than a full history of who changed what when, but I’ve found it comes in handy when trying to figure out when the data went awry.

You’ll notice that we use the user GUIDs to store the user.  I’ve seen implementations where they used actual foreign keys to the user table, and it caused enormous problems.  The checking of the key constraint can slow the performance of inserts and updates, albeit slightly, but that’s still a big no-no.  Your auditing code should have no detrimental impact on the database.  Also, what happens if for some reason the ID value is wrong?  You really don’t want to reject the change for this reason, because these are just audit fields, not real transactional information.  It’s important that the value be right, but it’s a lot less important than the rest of the record.  Lastly, and the most crippling, is that by the time the database had grown to hundreds of tables, it was virtually impossible to delete a user record.  Anytime we tried, it would run through every key referencing the table (2 per table) to determine if there were any other records referencing the deleted record.  Due to the number of relationships and the volume of data, this would timeout after a few minutes.

So clearly, foreign keys are out.  We could still use the ID from the table without a key, but I see a few issues with that.  First, every other place in the database that stores an ID reference field has a foreign key.  That’s a core rule of the database, and I don’t want to give anyone, especially myself, the opening to start getting creative with whether to enforce foreign keys.  Also, the value is not very intuitive.  Lastly, I don’t want anyone to see that field, think that someone else forgot a key constraint, and go ahead and start adding them.  If you don’t make it clear in the code how your code is supposed to be used, it’s your own fault when it is inevitably misused.

So in the past I’ve leaned towards putting the username in there, and it worked great.  It was straightforward, unique, succinct, and human-readable.  However, as we mentioned above when discussing natural keys, even usernames can change.

So we’re trying the user GUID this time around.  It’s not nearly as friendly as the username, but it certainly is unique.  And it avoids the usage confusion that you get with ID fields.  I’m still not crazy about this one, but we’ll see how it goes.
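
So the tail end of every table will look something like this.  It’s just a sketch; the audit column names (and the Name column) are placeholders I’m using for illustration, and note that there is deliberately no foreign key on the GUID columns, for the reasons above:

-- The four audit columns go at the end of every table: who created the
-- record and when, and who last updated it and when.  No FK constraints
-- on the GUID columns, so auditing never slows down or blocks a change.
CREATE TABLE WorkItemType
(
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    Name VARCHAR(100) NOT NULL,
    CreatedByUserGuid UNIQUEIDENTIFIER NOT NULL,
    CreatedDateTime DATETIME NOT NULL,
    UpdatedByUserGuid UNIQUEIDENTIFIER NOT NULL,
    UpdatedDateTime DATETIME NOT NULL
)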

 

Data Types

NVARCHAR vs. VARCHAR

For a long time I always used VARCHAR, because I didn’t have a reason to use NVARCHAR.  Then, I ran into a few apps that needed to have NVARCHARs here and there because we had some international customers, and it was a pain keeping track of which fields were which.  So I just started using NVARCHAR for everything, but in hindsight I think that was a mistake.  Doug Blystone (co-founder of MMDB Solutions and a guy who knows more about SQL databases than I could ever hope to) has finally convinced me to stop using them unless absolutely necessary.  I usually shrug off performance issues for the sake of simplicity until they are a demonstrable problem, but this is one of the cases where it can lead to a hidden and pervasive performance drag that won’t jump out at you, but will slow you down just a little bit every step of the way.  These are the types of performance issues that add a second or two to the page load here and there, which makes your users subconsciously hate your guts, even if they don’t really know why.  It takes up twice as much space, it makes comparisons of string values take longer because there are twice as many bytes to compare, and it’s just not necessary unless you’re dealing with Asian languages (or any language that may have thousands of characters).  We are years removed from the day that we can afford to hire someone to translate our applications to Japanese, and I’m guessing we’ll never do business in China because I hate Commies.

Simple Types

That being said, we’re using VARCHAR.  We’ll also be using UNIQUEIDENTIFIER, DATETIME, DATE, TIME, BIT, INT, FLOAT, TEXT, and maybe VARBINARY.  And that’s it.  No smallint or real or money or any of that crap.  We are going to use the minimum number of logically distinct data types that meet our needs.  I have no interest in getting into a data-type arms war, trying to save a byte here or there to create some sort of idealistically pure representation of the data’s inner beauty.  Neither I nor any other developer will waste a minute of time worrying about whether we can cast this bigint into an integer or that real into a float without data loss.  We’re going to keep this simple, so help me God, and so you and your sql_variant friends better get the hell off my lawn before I get my shotgun.

I actually haven’t used the new DATE and TIME fields yet, but I’ve long felt the need for them.  A DATETIME field is really two different things, which usually works out OK, but when you need to store just a date or just a time, it really gets clunky.  Then when you start having to worry about time zones, it can get just plain weird.  We have a few cases in SportsCommander now where we have to do a little code gymnastics to store just a date or just a time in a DATETIME field, and have the inactive part properly ignored without interfering with the active part.  This will be my first foray into using these data types, so we’ll see how well they work.
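
Here’s the sort of place I expect them to help.  This is a hypothetical table, not part of the Parallax schema, but it shows the idea: an all-day event only needs a DATE, and an optional start time only needs a TIME, so there’s no fake midnight or fake 1900-01-01 date to tiptoe around like there is when everything gets crammed into a DATETIME.

CREATE TABLE ScheduledEvent
(
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    EventDate DATE NOT NULL,
    StartTime TIME NULL
)

-- Find everything happening on a given day, with no time-portion gymnastics.
SELECT ID, EventDate, StartTime
FROM ScheduledEvent
WHERE EventDate = '2011-06-15'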

 

So that’s a start for the database stuff.  It ran a little long for 3 tables, but you need to state your ground rules up front, even if it’s just for your own benefit, lest you start getting lazy and breaking them.  So here we go.

I created a CodePlex project for hosting the ALM system.  I even gave it one of those fancy codenames the kids are always talking about: Parallax ALM

Parallax is a scientific term to describe how the relationships of items in space look different from different angles.  According to Wikipedia:

Parallax is an apparent displacement or difference in the apparent position of an object viewed along two different lines of sight, and is measured by the angle or semi-angle of inclination between those two lines.

 

A simple example is the speedometer in a car.  When you are driving you can read your speedometer accurately, because you are staring straight at the speedometer, the way it was designed to be looked at.  However, when your passenger looks at the speedometer, he/she/it will see something different; because the indicator is slightly raised off the surface, the relationship of the indicator to the surface will be distorted, and so it will appear that you are going slower than you actually are (assuming, of course, that you are driving on the correct side of the road).  I won’t even get into why objects in the mirror are closer than they actually appear, but for now you can just assume that it’s a government conspiracy.

In a car, this is a simple case of right and wrong.  The driver is right, and anyone who disagrees can get the hell out and walk.  However, on a software project, that’s not always the case.  As developers, we usually just assume that we are right and the business/executive/PM/QA/ people are idiots, but if anyone has the authority or persuasiveness to tell everyone else to get out and walk, it sure isn’t a bunch of sociologically-crippled developers.

Really, in a software project, everyone is usually a little bit right and wrong.  Sure, there are often gen-u-ine jackasses out there, but they are usually a lot less common than we think, and we are usually right far less than we want to believe.

So we need to get everyone on the same page.  Or reading the same book.  But everyone speaks a different language, especially from one job description to another, so everyone needs their own version.  We need to present the same information to everyone in a way that makes sense to them as individuals, and maybe even makes their life easier.  Just watch an executive’s eyes glaze over as you explain iterative development to him, but put yourself in his shoes as he’s trying to explain why he needs some sort of metric, anything really, to know what the status of a project is and whether it’s falling behind schedule.  All of that information is really the same stuff, just looked at from a different angle to fill a different job need.  You cannot reliably count on people to solve these problems themselves, because most people have an enormous aversion to putting themselves in someone else’s position, mostly because it throws a giant spotlight on everything they themselves may be doing wrong.  So let’s solve the problem with some software.  You’ll still need to have people want to work together, but hopefully it will actually be easy for them this time around.

So there you go.

Posted in ALM.
Technologies

So what technologies are we going to use for this thing?  Hopefully a little bit of everything, as one of my goals is to use this to learn some new things.

But to start, it will look like this:

  • SQL Server 2008
  • IIS/ASP.NET 3.5
  • LINQ-to-SQL

 

Moving forward, I’d like to introduce the following technologies at some point, just for fun:

  • Entity Framework (EF)
  • NHibernate
  • Windows Communication Foundation (WCF)
  • Windows Workflow Foundation (WWF)
  • Silverlight
  • ASP.NET MVC
  • Microsoft Reporting Services
  • IronPython
  • F# (eh, why not?)

 

Some of these may be used to complement what already exists, such as exposing a WCF layer or sprinkling some Silverlight here or there.

Other things may require replacing a whole layer of code, like changing the data layer from LINQ-to-SQL to the Entity Framework.  Hopefully I’ll be able to create side-by-side instances of the system running each type of data layer, which would be interesting for comparing the performance vs. development effort of each.

Some things I just want to learn, like ASP.NET MVC and IronPython, so I’m going to try to find a way to get that into the product somehow.

Obviously, very little of this experimentation would make any business sense in the real world, but that’s part of the fun of it.  I’ll build out the initial data layer in LINQ-to-SQL because that’s something I already know and have experience with, but then replacing it with NHibernate and/or the Entity Framework will be a great learning exercise, and hopefully teach me a thing or two about the true pros and cons of each, beyond marketing fluff and fanboy crap.  And along the way, I’ll try to post everything I learn on this blog.

Development Tools

As far as tools, I’m using Microsoft Visual Studio 2008 Team Suite, mostly because we already have it through the Microsoft BizSpark program.  However, I realize that it’s pretty impractical for most people to purchase Team Suite, so I’ll try to keep everything limited to the basic tools available in the Standard or Professional editions.  Since one of my goals is to provide something that other people can learn from (whether it be from my successes or failures), I think it’s important that I stick to tools that are readily available to most developers.

As such, I’ll steer clear of things like Visual Studio testing suite, and I’ll make use of more open source tools like NUnit and WatiN instead. 

For databases, I’ll avoid the original Visual Studio database projects (because they are mostly useless), and I’ll avoid the fancy new Visual Studio SQL Server database projects (because they aren’t readily available to most developers and they are too complicated in my opinion).  However, this is still a sticky problem, as I have yet to see a good and free/cheap tool for managing database development.  Doug Blystone (co-founder of MMDB and resident SQL Server whiz) has been working on a utility that consolidates a lot of the do’s and don’ts of concurrent SQL development that we’ve learned over the years, and we plan to release this very soon.  If you find it useful, great; if not, you can at least follow along with everything I’m doing.

 

Anyway, I’m going to get started on the data model first, so stay tuned.

Posted in ALM.