The recent Seagate hard drive debacle combined with the timely discussions on Time Machine restores got me thinking.

When I architect enterprise systems for my clients, I’m always talking about the same things:  performance, capacity, availability and security.  These concepts sound simple enough up front, but when we get down to it, there’s always more to them than meets the eye.

Take availability, for example.  What does it mean to have your data “available” to you?  I’m willing to bet that at face value, this means having what you consider important always accessible.  If we’re talking Web 2.0, this might mean that services like gmail, flickr, dropbox, or mobileme are always online when you want to access them.  If we’re talking your DNS-323 NAS, that means that the thing is always up and responding on your network.  It also means the ability to handle - and recover from - failure.  Regardless of the scenario, it should also mean that the data that is important to you is kept safe and is recoverable.  After all, a catastrophic hard drive failure on my work laptop was one of the driving reasons for me to buy the 323 in the first place.

Having your data available to you under different scenarios means a lot of different things.  Inherently, architecting for availability means overcoming single points of failure, and evaluating the risks and trade-offs for your particular setup.  It also means carefully considering fault tolerance, data backup and recovery procedures.

Hit the jump for more of my rambling… and what that means to you and your NAS.

Let’s start by looking at some risk scenarios.

In the case of the 323, the simple fact of having two drives means I can be protected against the failure of one drive.  For example, both disks are spinning away, either in a raid-1 array or in some sort of replicated scheme whereby they are rsynced hourly or nightly.  What happens if a drive fails?  If they are raided, there’s a good chance that the remaining drive holds out long enough for you to replace the failed drive.  That is the prime benefit of raid-1.

But what happens if you replace the failed drive, and then try to recover a file you saved to the 323 the other day, but accidentally deleted?  It’s gone.  For good.  Raid-1 ensures your physical hard drives are available, but it doesn’t protect you from being an idiot and trashing your data.

It gets a bit worse if you’re like me and chose to stay away from raid-1.  I use a scheme where important folders from drive A get replicated to drive B nightly at 3am.  What happens if I copy over a bunch of important files at 5pm, go out for the night, and drive A fails at 10pm?  Drive B is still functional, but it does not have a copy of those important files, so they’ve been lost for good.  Had I chosen raid-1, I would have had a copy of them by virtue of the mirror.  In my case, however, it’s come down to a trade-off.  Since I want to maximize usable hard drive space and not duplicate all my data, I’m sacrificing the real-time copy benefit of raid-1 in exchange for maximum capacity.  It’s a risk I choose to live with.
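To make that exposure window concrete, here is a minimal sketch that lists the files changed since the last replication run, which are the files that would be gone if drive A died right now. It assumes the nightly job touches a marker file when it finishes successfully. The function name `files_at_risk` and the marker file are my own placeholders, not something the 323 provides.

```shell
#!/bin/sh
# files_at_risk SOURCE_DIR MARKER_FILE
# Prints every regular file under SOURCE_DIR modified after MARKER_FILE.
# Assumption: the replication job runs `touch MARKER_FILE` after each
# successful run, so the marker's mtime is the time of the last good copy.
files_at_risk() {
    src=$1
    marker=$2
    if [ ! -f "$marker" ]; then
        echo "no marker file: replication has never completed" >&2
        return 1
    fi
    # -newer compares modification times against the marker file
    find "$src" -type f -newer "$marker"
}
```

Called as something like `files_at_risk /mnt/HD_a2/important /mnt/HD_b2/.last-replication` (both paths are assumptions about your layout), an empty result means everything on drive A has already made it to drive B.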

There are two concepts at play here - physical disk availability, and data availability through backup.  To ensure physical disk availability, many of you choose to run your 323 in raid-1.  But have you given thought to the actual disks you’ve put in your NAS?  If you’re like me, you’ve got two Seagate 1TB drives, the same model number, made around the same time, probably made in the same factory, maybe even made in the same batch.  And if you’re like me, that means both drives are susceptible to that crappy firmware bug.  It also means that both drives are far more likely to fail at the same time.  Talk about putting all your eggs in one basket!  (Or uh, two identical ones, as the case may be).  Having said that, I would actually recommend everyone run two drives of the same capacity, but from different manufacturers.  For example, a 1TB Seagate and a 1TB Western Digital.  That would rather effectively minimize the risk of both drives failing at the same time (or both being affected by a stupid firmware bug).  In my case, I have to yank both Seagate drives out of the NAS and reflash their firmware to patch them up.  Or just leave them be and hope that they are never corrupted by the firmware bug.  In both cases I lose; I could brick both drives with a firmware update, or both drives could brick themselves if I don’t upgrade the firmware.  The risk of data loss is very real.  If I had been smarter and run one Seagate and one Western Digital, the risk of data loss would be much lower, because I’d only have to pull one drive (the Seagate) to fix the firmware.

In terms of data availability, the question shifts slightly to “how do I make sure my data is available not only at this instant, but also if my NAS dies?”  Or maybe even, “What if I’m an idiot and realize that I need that important file I accidentally deleted last week?”  Apple tries to answer both questions with Time Machine, by allowing you to go back in time from a different piece of hardware and grab an earlier revision of that file, from before you overwrote or deleted it.  Same with dropbox, and its ability to grab previous revisions of a file.

There’s a third type of availability I usually find myself talking about, and that’s service availability.  Loosely defined, that means that whatever services the solution is providing are expected to be available all the time.  This is of most importance to web 2.0, cloud computing, and web sites in general.  I always expect to be able to get at my gmail, facebook, flickr, whatever.  In terms of my home NAS, however, it means something different.  I expect it to respond when I try to access it; I expect the shares to always be available, and I expect my data to be there.  I expect SSH to be running, and I also expect the iTunes sharing to work.  Consistently.  (And it makes me mad that mt-daapd sucks and never does.)  Service availability in this case is provided by the software running on the NAS.  There are a lot of different things we can do to ensure service availability on the 323, but I can assure you they are out of the scope of this article, and not particularly necessary for a home network.

Now that you are thinking about different risk scenarios, and have an understanding of the types of availability, the questions shift to:  “Which configuration is right for me?”  “How much thought and planning should go into my NAS setup?”  The choices you make to ensure your desired level of availability really depend on your own requirements for the type of data you’re storing on the NAS.

Is it just a dump for your music and home movies?  Or a time machine/carbon copy cloner backup point for all the computers in your house?  Maybe it’s a shared network drive for all the corporate data in your small business.  Ultimately, it’s up to you to decide.  In any case, give some serious thought to how valuable your data is and how you would feel if it’s gone.  Architect the level of availability in your own home network accordingly.  And remember this mantra:  “Raid is for disk availability, not data backup!”  Having RAID-1 alone is nowhere near enough protection to keep your important data safe.  Like that digital video of your kid taking his/her first steps.  Or your entire mp3 library.

The last thing I want you all to think about is recovery of your data in the event that something goes wrong.  This comes back to trying to eliminate single points of failure, and knowing how to recover if one of them does fail.  The 323 has redundant disks, but what if a power supply blows?  If the network interface fails?  If lightning strikes and the power supply, network card AND both disks are toast?  Hmm…  What if fire strikes and destroys your PCs and your NAS?  What if you are robbed and both are stolen?  What if your computer got a virus or otherwise blew up and you needed to restore from a backup?  (Time machine or otherwise?)  How would you recover that important data?

In the first scenario (a dead 323), we might have been saved if the data was backed up nightly to an external disk, or maybe even ANOTHER (secondary) 323 on our home network.
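A nightly backup to an external disk could be sketched roughly like this. It assumes rsync is installed, and it uses a marker file on the external disk to confirm the disk is really mounted before copying anything. The function name and the marker file name `.backup-target` are hypothetical; you would create the marker once, by hand, on the external disk.

```shell
#!/bin/sh
# backup_to_external SOURCE_DIR DEST_DIR
# Mirrors SOURCE_DIR into DEST_DIR, but only if DEST_DIR contains a marker
# file proving the external disk is mounted there. Without this check, a
# missing disk would make rsync fill up whatever holds the empty mount point.
MARKER_NAME=.backup-target   # hypothetical marker file name

backup_to_external() {
    src=$1
    dest=$2
    if [ ! -f "$dest/$MARKER_NAME" ]; then
        echo "refusing to back up: $dest/$MARKER_NAME not found" >&2
        return 1
    fi
    # -a preserves permissions and timestamps; --delete mirrors deletions.
    # The marker is excluded, which also protects it from --delete.
    rsync -a --delete --exclude="/$MARKER_NAME" "$src/" "$dest/"
}
```

Keep in mind that because of `--delete` this is a mirror, not a history: a file you delete by accident today disappears from the external disk at the next run.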

In the second scenario (fire or theft), the only way our data would have been saved is if it was backed up off-site, either through rsync to a remote server, or to an online backup service.

In the third scenario, the point I’m trying to make is that backups are only as good as your ability to restore from them.  In the comments on my “Poor Man’s Apple Time Capsule” article, people are (wonderfully) questioning the benefit of enabling Time Machine backups to a NAS if you can’t restore from them.  And you know what?  They’re 100% right.  But the onus does not fall on me to test it: you are responsible for your data, and only you can gauge the importance of it.
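One simple way to test a restore is to restore into a scratch directory and compare it against the live data. The sketch below is only that, a sketch: the function name `verify_restore` is made up, and it compares file names and contents only, not permissions or timestamps.

```shell
#!/bin/sh
# verify_restore ORIGINAL_DIR RESTORED_DIR
# Returns 0 only if both trees contain the same files with the same contents.
verify_restore() {
    original=$1
    restored=$2
    # diff -r walks both trees recursively and exits non-zero on any difference
    if diff -r "$original" "$restored" >/dev/null 2>&1; then
        echo "OK: restored copy matches the original"
    else
        echo "MISMATCH: restored copy differs from the original"
        return 1
    fi
}
```

If the live data has changed since the backup was taken, a mismatch is expected, so this is most useful right after a backup run or against folders that rarely change.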

My conclusion is this, and it’s really quite simple: 

Determine how important your data is to you.  Consider the main causes for data loss (see below), and what you bought the NAS box for in the first place.  Figure out how to get your data back if you do something stupid, or if something stupid happens to you.  Figure out if your solution makes sense. If it doesn’t, take a few steps to make sure you are protected.  I have a feeling that’s why you bought a NAS in the first place :)

 Main causes for Data Loss:

  • 44% Hardware Failure: Most hard drive manufacturers have reduced their warranties from 36 to 12 months
  • 30% Human Error: Accidental file overwrites and deletions
  • 12% Software Corruption: Programming errors, improper application terminations
  • 7% PC Virus: Inferior anti-virus software, updated signature files.
  • 7% Theft, fire, flood or other natural disaster
    (source)

Options for data availability, backup and recovery on the D-link DNS-323 NAS:

  • Ok:  Use raid-1 (maybe).  Or some folder replication scheme.
  • Good:  Do the above and also take nightly backups — say, to an external 1TB hdd in a USB enclosure, mounted through the USB port on the back of the 323
  • Better:  Take nightly snapshots of your data (a snapshot is a backup with history) using rsnapshot or backupnetclone, or of your Time Machine volume, to an external disk or to another 323 on your network with redundant disks.
  • Best: Take nightly snapshots of your data to a remote location.  And test your recovery procedure!
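To show what “a backup with history” can look like in practice, here is a minimal sketch of a hard-link snapshot using rsync’s `--link-dest` option. This is not rsnapshot itself, just the same underlying idea: every run creates a complete-looking directory, but unchanged files are hard links to the previous snapshot and take up no extra space. The function name, the `snapshot.<timestamp>` naming and the `latest` symlink are my own assumptions.

```shell
#!/bin/sh
# take_snapshot SOURCE_DIR SNAPSHOT_ROOT
# Creates SNAPSHOT_ROOT/snapshot.<timestamp> and points SNAPSHOT_ROOT/latest at it.
take_snapshot() {
    src=$1
    # Resolve to an absolute path: --link-dest and the symlink both need one
    root=$(cd "$2" && pwd -P) || return 1
    stamp=$(date "+%Y-%m-%d.%H-%M-%S")
    new="$root/snapshot.$stamp"
    latest="$root/latest"

    if [ -d "$latest" ]; then
        # Files unchanged since the previous snapshot become hard links to it
        prev=$(cd "$latest" && pwd -P) || return 1
        rsync -a --link-dest="$prev" "$src/" "$new/" || return 1
    else
        # First run: nothing to link against, so take a full copy
        rsync -a "$src/" "$new/" || return 1
    fi

    # Repoint "latest" at the snapshot just taken
    rm -f "$latest"
    ln -s "$new" "$latest"
}
```

Since each snapshot is a normal directory tree, recovering last week’s version of a file is just a matter of copying it out of the right `snapshot.*` folder. The timestamp has one-second resolution, so two runs within the same second would collide.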

Questions?  Comments?  Fears?  Real-life horror stories?  Discuss in the comments!


5 Comments to “Some thoughts on your data, backups, failed drives, and architecting for availability”  

  1. 1 grits

    One rational approach is a simple classification scheme for your data.
    1) If you don’t care about it, you don’t back it up. Think system files, source files you can d/l, etc.
    2) If you’ll be seriously inconvenienced by its loss, back it up to your 323.
    3) If you’ll have a nervous breakdown if it’s lost, back it up to a remote location. Think newborn pictures of your only child. (If you have more than one child, you know they all look the same as newborns anyway.) The remote location could be Amazon S3, or even a friend’s remote 323. With this approach, you hopefully get a smaller and smaller subset of data as you go up the chain of importance. Now horto needs to write me a script to log into my S3 account and replicate my Tier 3 items to a specified bit bucket.

  2. 2 Kyle

    What about using rsync to copy changed data to the second HDD? I’m with you on your earlier post about the reliability of raid. Does anyone know if this box uses mdadm? If so, it wouldn’t be so bad to take the disk out and plug it into a Linux box to get the data off of one drive.
    Anyway, to add to your post about syncing to a “replicated” folder, I use this script to make backups of my Ubuntu server:

    This makes a base image:

    ————-start————-
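    #!/bin/sh
    # Base image: a full copy of / into a new timestamped directory.
    TIMESTAMP=$(date "+%Y-%m-%d.%H-%M-%S")
    name=/media/backup/backup.${TIMESTAMP}
    echo STARTING......
    date
    # -a archive mode, -H keep hard links, -x stay on one filesystem
    rsync -aHx --progress --numeric-ids --delete \
    --include-from=backup.include --exclude-from=backup.excludes \
    --delete-excluded / "$name/"
    echo
    echo
    echo FINISHED...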
    ———–end——————-

    This next script updates the backup and only takes up drive space for what has been added or changed, since it uses hard links:

    ————-start————-
    #!/bin/sh
    # Incremental update: hard-link the previous backup, then rsync changes into the copy.
    TIMESTAMP=$(date "+%Y-%m-%d.%H-%M-%S")
    name=/media/backup/backup.${TIMESTAMP}
    current=/media/backup/backup.current
    # backup.current must already be a symlink to the previous backup
    old=$(readlink "$current")
    echo STARTING......
    date
    echo
    echo COPYING HARDLINKS...
    cp -al "$old" "$name"
    echo
    echo STARTING RSYNC BACKUP...
    rsync -vaHx --progress --numeric-ids --delete \
    --include-from=/root/backup.include --exclude-from=/root/backup.excludes \
    --delete-excluded / "$name/" >> "$name.log"
    echo
    echo LINKING THE NEW BACKUP WITH backup.current...
    rm "$current"
    ln -s "$name" "$current"
    echo
    echo FINISHED...
    ———–end——————-

    At the end you can type to see the size of the backup, and see that each hard-linked backup only grows by the size of the files you added to it.
    I don’t have one of these boxes yet but I’m really thinking of getting one for me and some family members. I love your posts here and how this little guy is ‘hackable’.

  3. 3 Kyle

    Oops, the command for “typing to see the size” was left out:

    # du -shc *

    626M backup.2008-12-04.15-54-38
    2.5M backup.2008-12-04.15-54-38.log
    60M backup.2009-01-20.09-02-13
    48K backup.2009-01-20.09-02-13.log
    204M backup.2009-04-01.10-50-00
    1.4M backup.2009-04-01.10-50-00.log
    0 backup.current
    46G total

  4. 4 Ro

    I’ve been thinking through this as well. I’m pretty happy with my 323; it’s been pretty reliable for the past year of operation. I’ve been wondering whether taking a 3rd drive and rotating it through my RAID 1 array weekly would satisfy my need for “off-site” backup. The idea was that I’d plug it into the 323, it would replicate the data onto the new drive, and I would take the old drive and put it away in a fireproof safe (or equivalent). Yes, I could lose a week’s worth of data, but isn’t that a hugely economical way to up the certainty of recovery in the event of a failure?

  5. 5 admin

    @Ro: Yup, that’s a great idea. Another option might be to mount an external hdd using the USB port on the back, and rsync from your RAID array to that external drive.
