Drives Die

So we had yet another calamity in the Systems Boy household last week: a hard drive failure in a four-year-old, 15" PowerBook. Oddly, a workmate had the exact same thing happen to him within days of our catastrophe. In fact, there's been all manner of hardware failure in recent days. I know that drives are prone to dying after a number of years, but geez! It sure seems like lately there's been a steady shit stream aimed squarely at the tech fan. Makes me ponder the more cosmic aspect of this biz.

[Gazes dreamily off into space for a moment. Then abruptly snaps to.]

The trigger for this failure, ironically, was our attempt to make a backup. (Oh, technology gods, thou art a riot!) See, our original goal was to update the OS to Leopard, but with all the craziness going on these days we decided to clone the drive before we proceeded with said update. But in the course of cloning, it would appear in retrospect, we hit a bad block and triggered the first of what would be many, many disk errors. Unable to pull a backup, we began our descent into drive repair hell in our latest heroic attempt to salvage that ever-important thing contained on and lost from drives: the data.

File-Level Attempts

Our first try was with Disk Utility, which consistently reported, in all red text, that it could neither verify nor repair the file system. Right. On to attempt number two.

Disk Warrior is my go-to utility for any sort of file system damage that Disk Utility is unable to repair. I've rarely seen a disk that one of these two apps couldn't fix. Today would be one of those rare days. After mounting the drive on a known good system using Target Disk Mode, we let Disk Warrior perform its initial scan of the drive. What we found was decidedly ugly. Disk Warrior told us that it was unable to replace the borked directory with its shiny new, replacement directory because of a "disk malfunction."

That's when we knew the drive was fried.

Disk Warrior Report: Bad News

When a hard drive has problems, 99% of the time those problems are directory related. That is, the hard drive contains data about the files on disk — where they belong, how many there are, how the disk is partitioned and so on. And usually, when there is a problem with a drive, it is because this information has been corrupted somehow. These days there are numerous utilities that can easily and accurately repair these sorts of problems, Apple's included Disk Utility among them. Sometimes the damage is too extensive, though, so we turn to something a bit more drastic, like Disk Warrior. Disk Warrior forgoes the repair, and instead scans the disk and creates a brand-spankin' new directory, replacing the broken one with its new one once you've made sure everything is cool, and perhaps made a backup. Now, when Disk Warrior is unable to do this it's indicative of a much more serious problem. When this happens it is very likely that the drive hardware is beginning to fail.

Time for a new drive.

What Disk Warrior does in these instances is show you the best picture it can muster of the drive's contents in a read-only preview, and then advise you to back up as much as you can before total failure. So that's what we did. You're never sure how much time you have in these situations, so we went through folder by folder trying to locate and back up the most important files first. With each successive copy the drive became slower and slower. Luckily, we were able to pull the most recent, most important files. Most everything else was backed up or able to be easily reconstructed.

Block-Level Attempts

Once we had gotten the most important stuff we decided to see what else we could get. I tried running some rsync commands and got some stuff that way, but not much, and it was taking forever. Once I'd given up trying things at the file level, I decided to make my last ditch effort with a well-worn but powerful little UNIX command called, simply, dd. (No, it does not stand for "Drives Die," though maybe it should.)

The dd command reads data from a disk at the block level. By default it copies from standard input to standard output, but the if and of options point it at an input and an output file of your choosing. I use dd by running it on the /dev entry of the drive in question and writing the output to a disk image file (DMG):

sudo dd bs=512 if=/dev/disk3s3 of=/Volumes/Work/LastDitch-DD-01.dmg conv=noerror,sync

The good thing about dd is that you can instruct it to keep going past damaged sections of the disk. That's what the "noerror" option is for; its companion, "sync," pads any short or unreadable block with zeros so that everything after it stays at the right offset in the image. The downside to dd is that it wants to read the entire disk, and that makes it very slow. In this instance I was not able to rescue any data, mainly because, as I soon discovered from my dd runs, the disk was just too far gone. I did learn some interesting strategies for using dd to recover data though.
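To see what "sync" does to the output, here is a small experiment that is safe to run anywhere, since it uses a scratch file in place of a real /dev entry. The file names and the 1000-byte size are made up for the demonstration.

```shell
#!/bin/sh
# Copy a 1000-byte scratch file with bs=512 and conv=noerror,sync,
# then compare the size of the source with the size of the copy.
work=$(mktemp -d)
head -c 1000 /dev/urandom > "$work/src.img"

# Same options as the rescue command, minus sudo and the real device.
dd bs=512 if="$work/src.img" of="$work/copy.img" conv=noerror,sync 2>/dev/null

src_size=$(( $(wc -c < "$work/src.img") ))
copy_size=$(( $(wc -c < "$work/copy.img") ))
echo "source: $src_size bytes, copy: $copy_size bytes"
```

The copy comes out at 1024 bytes, because the short final block is padded with zeros up to the block size. On a failing disk the same padding stands in for blocks that could not be read.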

The first thing you can try if dd is running slowly is to increase the block size. This is how much data dd reads in a single operation before moving on to the next. The default is 512 bytes. I've read that upping it to 51200 will sometimes yield speedier results:

sudo dd bs=51200 if=/dev/disk3s3 of=/Volumes/Work/LastDitch-DD-02.dmg conv=noerror,sync

In my case it did not, primarily, I believe, because there was a problem in the beginning of the drive, and dd was having trouble moving past that spot. So another thing you can tell dd to do is to skip a certain portion of the drive, say the first 2 GBs. Note that skip is counted in blocks of bs bytes, not in bytes, so at bs=51200 roughly 2 GB works out to 40000 blocks:

sudo dd bs=51200 if=/dev/disk3s3 skip=40000 of=/Volumes/Work/LastDitch-DD-03.dmg conv=noerror,sync

Finally, you can also tell dd to only write in 1 GB chunks, using the count option, which is likewise measured in blocks (20000 blocks of 51200 bytes is about 1 GB):

sudo dd bs=51200 if=/dev/disk3s3 count=20000 skip=40000 of=/Volumes/Work/LastDitch-DD-04.dmg conv=noerror,sync

I was getting some good results after having skipped the first 2 GBs (apparently they were really damaged), so I decided to write a script that would skip the first 2 GBs and then begin writing out 1 GB chunks of data. It would've looked something like this:

sudo dd bs=51200 if=/dev/disk3s3 count=20000 skip=40000 of=/Volumes/Work/LastDitch-DD-Chunk-01.dmg conv=noerror,sync

sudo dd bs=51200 if=/dev/disk3s3 count=20000 skip=60000 of=/Volumes/Work/LastDitch-DD-Chunk-02.dmg conv=noerror,sync

sudo dd bs=51200 if=/dev/disk3s3 count=20000 skip=80000 of=/Volumes/Work/LastDitch-DD-Chunk-03.dmg conv=noerror,sync

...

Etc, etc, up to the 40 GBs needed to scour the drive. I never got to write the script, though, because the last dd command seized up and the drive began making the clicking, knocking and whirring sounds of its agonized and tortured death. It was quickly dying. We could do no more.
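A minimal sketch of what such a script might look like follows. It is not the script that was planned, only one way to do it. It assumes that skip and count are given in blocks of bs bytes, which is how dd counts them. The helper names bytes_to_blocks and rescue_chunks are hypothetical, and the output naming follows the chunk files above.

```shell
#!/bin/sh
# bytes_to_blocks BYTES BS
# Number of BS-sized blocks needed to cover BYTES, rounded up.
bytes_to_blocks() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# rescue_chunks SRC OUTDIR START_BLOCK CHUNK_BLOCKS NUM_CHUNKS [BS]
# Hypothetical helper: runs one dd per chunk, skipping START_BLOCK blocks
# up front and writing CHUNK_BLOCKS blocks into each numbered image file.
# For a real device, run it as root with something like /dev/disk3s3 as SRC.
rescue_chunks() {
    src=$1; outdir=$2; start=$3; chunk=$4; num=$5; bs=${6:-51200}
    i=0
    while [ "$i" -lt "$num" ]; do
        skip=$(( start + i * chunk ))
        out=$(printf '%s/LastDitch-DD-Chunk-%02d.dmg' "$outdir" $(( i + 1 )))
        # Keep going if one chunk fails; later chunks may still be readable.
        dd bs="$bs" if="$src" of="$out" skip="$skip" count="$chunk" \
            conv=noerror,sync 2>/dev/null ||
            echo "chunk $(( i + 1 )) reported an error" >&2
        i=$(( i + 1 ))
    done
}

# Example (assumed values): skip about 2 GB, then write 38 chunks of about 1 GB.
#   rescue_chunks /dev/disk3s3 /Volumes/Work \
#       "$(bytes_to_blocks 2000000000 51200)" \
#       "$(bytes_to_blocks 1000000000 51200)" 38
```

Writing each chunk to its own file means a seized-up run costs only the chunk in progress, and the readable chunks can be concatenated afterwards.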

At this point, mainly for my own edification, I decided to see what could be done outside the confines of my home office. I decided to get a quote from Drive Savers.

Hardware-Level Attempt

Drive Savers, perhaps wisely, does not list prices for their services on their website. To get an estimate you have to give them a call. When I called them I was greeted by a very friendly and helpful service person (yes, person), which was really nice. The last thing you want to deal with when you're having a mechanical failure is a machine. The person on the other end of the line asked me a few basic questions to gauge what state the drive was currently in, things like what attempts I had made to rescue the data, whether the drive would mount, and the like. After entering this info into her system she directed me to the "Tips, Techniques and Solutions" page on their website (very useful; love the drive sound audio samples), stressing above all that in order to have the best chance of recovery at this point the drive should not be powered on again. She also offered up some information about the company and what they do: For one, they started with Mac data recovery and are an all-Mac shop, which surprised me a little. She also pointed me to information on the Drive Savers clean room, a vital part of data recovery at the hardware level. She then took my email and contact info and gave me both a written and verbal estimate of how much I could expect to spend should I decide to go ahead and have Drive Savers attempt to save my data (I don't think they'll actually save the drive). All in all it was a very pleasant and informative experience. Normally I am loath to use the phone for business, but Drive Savers really seems to know what they're doing, at least when it comes to pre-sales customer service, and that counts for a lot in my book.

This is, of course, all prep for the fact that, if you do want to make the attempt at data recovery, you'll be expected to drop a significant amount of money. This is hardly surprising. Those clean rooms don't look particularly cheap to build or maintain. And if data recovery at the hardware level is anything like it is at the software level, it is a laborious and time-consuming process. I was given a range of prices ($500 to $2,700) and told that the cheapest I could expect to get away with (the economy plan, which isn't as fast as some of the other, more expensive plans) was $500. But it was likely I'd pay somewhere closer to the upper part of the range, more like $1,500 to $2,000. It all depended, of course, on how much data Drive Savers could recover.

I didn't really find these prices particularly surprising. I'd long heard how much such a recovery could cost. That it would be pricey. I was glad that I was not in a situation that required me to fork out this amount of money. I'm glad such a service exists for the odd catastrophe, though I hope never to have to use it. Drive Savers' website offers advice on keeping backups:

"Backup strategies:

* Invest in redundant backup systems

* Establish a structured backup procedure to make copies of all critical data files, using software compatible with the operating system and applications

* Periodically test the backups to verify that data, especially databases and other critical files, are being backed up properly

* Keep at least one verified copy of critical data offsite"

Sage advice, all. Take it from those who know all too well.

The Belly of the Beast

Once we'd decided not to use a hardware data recovery service the only thing left to do was spec out, buy and install a new hard drive. This wasn't terribly difficult, but as is so often the case, there was the odd snag or two.

Before we even bought a drive, I wanted to see how hard it would be to open the PowerBook for servicing. If it was going to be a bear (and some PowerBooks are certainly easier to crack than others) I'd let the fine technicians at Tekserve do the job. So I went in search of manuals and instructions for this particular model of PowerBook. Without too much trouble I was able to locate, at Apple's site, the manual for our 1.67 GHz, 15" Aluminum PowerBook. It contained no instructions for hard drive replacement, which is generally a sign that Apple would rather you not attempt the repair yourself. That got me a little worried.

Finally, however, I found instructions (great instructions, no less) at the venerable (awesome, actually) iFixit.com. iFixit, for those of you who don't know, provides step-by-step, illustrated guides on taking apart and performing repairs on Apple hardware. For free. They're amazing. I feel guilty not buying anything from their site. Oh yeah, they also sell parts, tools and service. I love them. And from what I could see, the repair would be tedious (lots of screws) and would require a trip to the hardware store (blasted tiny hex screws!), but it would be doable. Still, taking things one step at a time, I thought I'd perform the teardown before buying the drive. Just in case.

And perform I did. Using iFixit's excellent guide, I was able to crack the PowerBook in short order. I was ready to buy a drive.

Buying a Drive

There are two things SysAdmins typically are, particularly when it comes to technology: cheap and lazy. Hunting for a replacement drive brought both of these qualities in my personality to bear. I was looking for the cheapest replacement I could find, at the location closest to my house, a SysAdmin's dream hunt. The closest proper computer tech shop to me is Tekserve, with Best Buy a close second. Tekserve doesn't list what bare drives they carry, if any. But Best Buy seems to have the goods. But Best Buy is still a good half hour train ride, so I did some physical recon at my nearest Radio Shack, which happens to be right around the corner. They informed me that, though they did not have any bare drives in stock, they did have portable USB drives on sale, drives from which I could pull the internal component and install it in the now drive-less PowerBook. In fact, they had a 160 GB Iomega Prestige for less than a bare drive would have run me at Best Buy: a mere 75 clams post-sales-tax. Not bad. I took it.

I'd like to pause here and see if anyone can guess why this didn't work out for me. You have pretty much all the data you need in this article to figure it out. But don't feel bad if you can't. The good lord knows I surely didn't. I'll wait a minute... Pretend there's Jeopardy countdown music playing... Aaand...

Okay. Did you guess it?

I got the drive home, popped it out of its case and went to put it in the open PowerBook. But it didn't fit. (Have you guessed it yet?) Here's the thing: PowerBooks use 2.5" ATA drives (Parallel ATA, or PATA), but drives in today's externals are all now SATA (Serial ATA) drives. Blast!

Oh well. At least it was cheap.

Another quick look at the web revealed that all the bare drives at Best Buy were SATA as well. Blast again!

The nearest ATA drive I could find was at J&R, which is all the way downtown, almost at the very tippy-tip of Manhattan — far. So that's where we went.

Once we got back, we installed the drive and — the very first thing to go right all day — it worked. Perfectly. Things were finally looking up.

Once we had installed the drive it was simply a matter of formatting it, installing the latest version of Leopard (which is all we ever wanted to do in the first place) and copying over the rescued and reconstructed data. Oh, did I mention that the reason the client wanted Leopard was for Time Machine? Yup. Backups. Great timing. So we set up Time Machine as well. All that went exceedingly smoothly and our repair is, at last, complete. Whew! What an ordeal!

But, man, did I ever learn a lot.

The Life and Death of Hard Drives

So yes, drives die. How they die, though, is almost as important as how they lived, and certainly as interesting. It's somewhat comforting to know that this drive, while quite dead indeed, did not die in vain. Rarely have I had the opportunity to learn so much about practical drive recovery. I have that PowerBook drive — specifically its death, in fact — to thank for my lesson.

More Data Recovery

It's been a bad couple of weeks for data loss in the Systems Boy household. Fortunately, it's been a fairly good week for data recovery, so we've mostly broken even, minus the time lost recovering data, of course.

Most recently, something seems to have taken a large (by which I mean everything) bite out of a very important CSS file. See, we tend to use Coda to build sites at our house, and we tend to work over the network as the most expedient means to that end. Now, working on a website over the network is not without its perils, as I'm sure you're aware. Particularly if you're working wirelessly, and particularly if you're working on a server of unknown reliability. So, a very awesome someone I know (okay, yes! I have a girlfriend!) was doing exactly that when all of a sudden her CSS file appeared to be completely empty. Mind you, she was not working on the file. She merely had it open while she worked on another document in another tab. But after switching to the CSS tab, the CSS file — which she'd been working on obsessively for about a week — appeared to be empty.

Now I've had the same thing happen to me after a network dropout — or, more likely, a server disconnect — and the solution in my case was to simply shut down and restart Coda. Mine was largely a cosmetic issue brought on, I assume, by Coda's inability to reconnect to the documents after a disconnect. So I told her to simply restart Coda, confident that the problem would correct itself. But it didn't. Even after restarting Coda, still no CSS joy. The file was there, but it was completely empty!

This is the point at which panic generally sets in. (And no, that is not a reference to the makers of Coda.)

Panic

If there's anything I've learned in my near-ten years of professional systems work, it's that data is rarely ever completely wiped out in a single stroke. And if there's anything else I've learned, it's not to panic. So I coolly, calmly set about the task of recovering the file while my exhausted and infuriated sweetheart went to bed.

The first thing I did was to check the server to see if any backups had been made. I know that her provider, and some of the software she uses, make automatic backups from time to time. So I downloaded anything and everything I could find from the server that might prove useful, including a backup of the entire site for safekeeping. I soon discovered that there was nothing even remotely recent enough to contain the missing CSS file. So I started looking in the local home account, first by grepping for anything with "css" in the name. Some Coda cache files came up, some of which were fairly recent, but none yielded the data I was searching for. I searched /tmp as well, to no avail.

Finally, after a couple of hours of downloading and grepping and searching and hoping, I was about ready to give up. As a last ditch effort I decided to use the find command on the entire local hard drive:

find / -name '*.css*'

This command will search the entire file system for any file whose name contains the string ".css" (the quotes around the pattern keep the shell from expanding it before find ever sees it). And it turned out to be the winner. The command yielded a ton of useless results, many of which came from application documentation. But in the end a Coda cache file turned up in:

/private/var/tmp/501

Of all places!

Moreover, this file had a time stamp very near the time of the disappearing data. So I made a copy of it (okay, I made, like, four copies of it) and uploaded it to the server. The next day my sweetie confirmed: I'd found the file! The day was saved!
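In hindsight, that time stamp suggests a way to cut down on the useless results: ask find for recently modified files only. A rough sketch, where find_recent is a made-up helper name and the two-day window is an arbitrary assumption:

```shell
#!/bin/sh
# find_recent DIR PATTERN DAYS
# List regular files under DIR whose names match PATTERN and that were
# modified less than DAYS days ago. Permission errors are discarded.
find_recent() {
    find "$1" -type f -name "$2" -mtime "-$3" 2>/dev/null
}

# Example (quote the pattern so the shell leaves it alone):
#   find_recent / '*.css*' 2
```

Application documentation tends to carry old modification dates, so a filter like this should drop most of it while keeping fresh cache files.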

So remember, people: Stay calm, and always try find before giving up the ghost. And for poop's sake, make a backup!

Whew! That was close!

Note To Self: Restart autofs

I just looked all over Hell's half acre for this (okay, I performed a perfunctory Google search) and I couldn't find a definitive answer. Now I know and I just wanted to make a quick note of it for posterity. In the olden days (i.e., a few months ago), in order to get any mounted shares to re-mount, we would restart automount thusly:

sudo killall -HUP automount

This no longer works. Now we must restart autofs. To restart autofs on Mac, do this:

sudo killall -HUP autofsd

To be additionally thorough, though this should not be necessary, you could also restart automount, which now looks slightly different (note the "d", which is new):

sudo killall -HUP automountd

None of this is surprising, but then again, if you're not sure you're doing it right (like you run the command and nothing happens and you want to be sure you're doing the right thing) it helps to have it written down somewhere.

Enjoy!

Default Shell Hell

There's a common occurrence in the world of systems administration. Once I describe it you'll probably all nod your heads knowingly and go, "Yeah, that happens to me all the time." It happened to me recently, in fact.

I was attempting to set a Linux system to authenticate via a freshly-built LDAP server — something I've done many, many times — and it just wasn't working. I could authenticate and log in fine via the shell, but no matter what I tried, whenever I would attempt to log in to Gnome, I'd get an error message saying that my session was ended after less than 10 seconds, that maybe my home account was wonky or I was out of disk space, and that I could read some error messages about the problem in a log called .xsession-errors in my home account.

Of course, certain that my home account was fine and that I had plenty of disk space, the first thing I checked was the .xsession-errors log, which yielded little useful information, and what little it did yield led me on a complete and utter wild goose chase. From everything I could glean from this rather sparse log, there seemed to be a problem with Gnome or X11 not recognizing the user. I showed the error to some UNIX-savvy co-workers, one of whom demonstrated that, when booting into run-level 3, logging in and then starting X, login worked fine, thus proving my hypothesis. So began several days of research into Linux run-levels, Gnome, X11, PAM, NSS and LDAP authentication on Linux. All of which was exceptionally informative, but which, of course, failed to yield a positive result.

The final, desperate measure was to scour every forum I could, and try every possible fix therein. And, lo and behold, there, at the bottom of some obscure post on some unknown Linux forum (okay, maybe not that unknown), was my answer: set the default shell. Could it be so simple?

But wait, wasn't the default shell set on my server already?

I checked my server, and sure enough, because of a typo in my Record Descriptor header, the default shell had not been set for my users. Seems X11/Gnome needs this to be explicitly specified in an LDAP environment, because in said environment it is (for some reason that remains beyond me) unable to read the system default.

Setting the default shell for users on my LDAP server (yes, it is a Mac OS X Server) did the trick, and I can now log in normally to Linux over LDAP.
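A quick way to spot this from the Linux client is to look at the seventh field of each passwd entry as the name service returns it. The sketch below assumes a glibc system where getent is available; users_without_shell is a hypothetical helper name.

```shell
#!/bin/sh
# users_without_shell
# Read passwd-format lines on stdin and print the login name of every
# entry whose shell field (field 7) is empty.
users_without_shell() {
    awk -F: '$7 == "" { print $1 }'
}

# On the client, feed it whatever the name service returns:
#   getent passwd | users_without_shell
# If LDAP enumeration is disabled, query one account at a time instead:
#   getent passwd someuser | users_without_shell
```

Any name it prints is an account that would hit the same wall at the Gnome login screen.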

So, after days of researching a problem the solution all boiled down to one, dumb, overlooked setting on my server, a fact I found referenced only at the bottom of some strange and obscure internet forum. Sound familiar? What, pray tell then, should we call this phenomenon? We really need a term for it. Or perhaps an axiom? Maybe a law or a razor or a constant. Something like:

"For every seemingly complex OS problem there is almost always an astoundingly simple solution which can usually be found at the bottom of one of the more obscure internet forums."

A corollary of which might go something like:

"Always check the bottoms of forums first."

We'll call it Systems Boy's Razor. Yeah, that should do nicely.

If anyone has any better suggestions here, I'm always open. Feel free to let 'em rip in the comments. Otherwise, check your default shells, people. Or at least make sure you have them set.

NetBoot Part 4

So this is going great. I have a really solid Base OS Install, and a whole buttload of packages now. Packages that set everything from network settings to custom and specialized users. I can build a typical system in about 45 minutes, and I can do most of the building from my office (or any other computer in the lab that has ARD installed).

I'm also getting fairly adept at making packages. A good many of my packages are just scripts that make settings to the system, so I'm getting pretty handy with the bash and quite intimate with dscl. But, perhaps most importantly, I'm learning how to make all sorts of settings in Leopard via the command-line that I never knew how to do.

The toughest one so far has been file sharing. In our lab we share all our Work partitions to the entire internal network over AFP and SMB. In the past we used SharePoints to modify the NetInfo database to do so, but this functionality has all been moved over to Directory Services. To complicate matters, SAMBA no longer relies simply on standard SMB configuration files in standard locations, and the starting and stopping of the SMB daemon is handled completely by launchd. So figuring this all out has been a headache. But I think I've got it!

Setting Up AFP

Our first step in this process is setting up the share point for AFP (Apple Filing Protocol) sharing. This wasn't terribly difficult to figure out, especially now that I've been using Directory Services to create new users. To create an AFP share in Leopard, you use dscl. Once you grok the syntax of dscl it's fairly easy to use. It basically goes like this:

command node -action Data/Source value

The "Data Source" is the thing you're actually operating on. I like to think of it as a plist entry in the database — like a hierarchically structured file — which it basically is, or sometimes I envision the old-style NetInfo structures. To get the needed values for my new share, I used dscl to look at a test share I'd created in the Sharing Preferences:

dscl . -read SharePoints/TEST

The output looked like this:

dsAttrTypeNative:afp_guestaccess: 1

dsAttrTypeNative:afp_name: TEST

dsAttrTypeNative:afp_shared: 1

dsAttrTypeNative:directory_path: /Volumes/TEST

dsAttrTypeNative:ftp_name: TEST

dsAttrTypeNative:sharepoint_group_id: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXX

dsAttrTypeNative:smb_createmask: 644

dsAttrTypeNative:smb_directorymask: 755

dsAttrTypeNative:smb_guestaccess: 1

dsAttrTypeNative:smb_name: TEST

dsAttrTypeNative:smb_shared: 1

AppleMetaNodeLocation: /Local/Default

RecordName: TEST

RecordType: dsRecTypeStandard:SharePoints

Okay. So I needed to use dscl to create a record in the SharePoints data source with all these values. Fortunately, the "sharepoint_group_id" is not required for the share to work, because I'm not yet sure how to generate that number. But create the share with all the other values and you should be okay:

sudo dscl . -create SharePoints/my-share

sudo dscl . -create SharePoints/my-share afp_guestaccess 1

sudo dscl . -create SharePoints/my-share afp_name My-Share

sudo dscl . -create SharePoints/my-share afp_shared 1

sudo dscl . -create SharePoints/my-share directory_path /Volumes/HardDrive

sudo dscl . -create SharePoints/my-share ftp_name my-share

sudo dscl . -create SharePoints/my-share smb_createmask 644

sudo dscl . -create SharePoints/my-share smb_directorymask 755

sudo dscl . -create SharePoints/my-share smb_guestaccess 1

sudo dscl . -create SharePoints/my-share smb_name my-share

sudo dscl . -create SharePoints/my-share smb_shared 1

This series of commands will create a share called "My-Share" out of the drive called "HardDrive."
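Since these commands will be run on every machine, they can be wrapped in a small function. This is only a sketch under a few assumptions. The function name create_share and the DSCL override variable are invented here, and the share name is used as-is for afp_name, ftp_name and smb_name. The attribute list is copied from the commands above.

```shell
#!/bin/sh
# create_share NAME PATH
# Issue the same dscl calls as above for one share. DSCL defaults to
# "sudo dscl"; set DSCL="echo dscl" for a dry run that only prints them.
create_share() {
    name=$1; path=$2
    rec="SharePoints/$name"
    ${DSCL:-sudo dscl} . -create "$rec" </dev/null
    # One "attribute value" pair per line; stdin of each dscl call is
    # redirected so it cannot swallow the rest of the list.
    while read -r key value; do
        ${DSCL:-sudo dscl} . -create "$rec" "$key" "$value" </dev/null
    done <<EOF
afp_guestaccess 1
afp_name $name
afp_shared 1
directory_path $path
ftp_name $name
smb_createmask 644
smb_directorymask 755
smb_guestaccess 1
smb_name $name
smb_shared 1
EOF
}

# Example: create_share my-share /Volumes/HardDrive
```

The dry-run mode is handy for checking the generated commands before letting them loose on the Directory Services database.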

After modifying the Directory Services database, it's always smart to restart it:

sudo killall DirectoryService

And we need to make sure AFP is running by starting the daemon and reloading the associated Launch Daemons:

sudo AppleFileServer

sudo launchctl unload /System/Library/LaunchDaemons/com.apple.AppleFileServer.plist

sudo launchctl load -F /System/Library/LaunchDaemons/com.apple.AppleFileServer.plist

Not the easiest process, but not too bad. SMB was much tougher to figure out.

Setting Up SMB

Setting up SMB works similarly, but everything is in a completely different and not necessarily standard place. To wit, Leopard has two different smb.conf files: one that's auto-generated (and which you should not touch) in /var/db, and one in the standard /etc location. Fortunately, it turned out, I didn't have to modify either of these. But still, it led to some confusion. The way SMB is managed in Leopard is rather roundabout and interdependent. Information about each SMB share is stored in a flat file, one per share, in /var/samba/shares. So, to create our "my-share" share, we need a file named for the share (but all lower-case):

sudo touch /var/samba/shares/my-share

And in that file we need some basic SMB info to describe the share:

#VERSION 3

path=/Volumes/HardDrive

comment=HardDrive

usershare_acl=S-1-1-0:F

guest ok=yes

directory mask=755

create mask=644
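Creating the file and filling it in can be done in one step with a here-document. The following is a sketch, not a tested Leopard recipe. The function name write_smb_share is hypothetical, the optional third argument exists only so the function can be tried against a scratch directory, and using the last path component as the comment is an assumption based on the example above.

```shell
#!/bin/sh
# write_smb_share NAME PATH [DIR]
# Write the share description shown above into DIR (default
# /var/samba/shares), in a file named for the share, lower-cased.
# Writing to the default location needs root.
write_smb_share() {
    name=$1; path=$2; dir=${3:-/var/samba/shares}
    file="$dir/$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')"
    cat > "$file" <<EOF
#VERSION 3
path=$path
comment=$(basename "$path")
usershare_acl=S-1-1-0:F
guest ok=yes
directory mask=755
create mask=644
EOF
}

# Example (as root): write_smb_share my-share /Volumes/HardDrive
```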

Next (and this was the tough part to figure out) we need to modify one, single, very important preference file that basically informs launchd that SMB should now be running:

sudo defaults write /Library/Preferences/SystemConfiguration/com.apple.smb.server "EnabledServices" '(disk)'

This command modifies the file com.apple.smb.server.plist in our /Library/Preferences/SystemConfiguration folder. That file is watched by launchd such that when it is modified thusly, launchd knows to start and run the smbd daemon in the appropriate fashion. Still, for good measure, I like to reload the LaunchDaemon for the SMB server by hand. Don't need to, but it's a nice idea:

sudo launchctl unload /System/Library/LaunchDaemons/com.apple.smb.server.preferences.plist

sudo launchctl load -F /System/Library/LaunchDaemons/com.apple.smb.server.preferences.plist

That's pretty much it! There are a few oddities: For one, the new share will not initially appear in the Sharing Preferences pane, nor will the Finder show it as a Shared Folder when you open the window.

Shared Folder: This Won't Show Without a Reboot


But the share will be active, and all will be right with the world after a simple reboot. (Isn't it always!) Also, if you haven't done it already, you may have to set permissions on your share using chmod in order for anyone to see it.

I was kind of surprised at how hard it was to set up file sharing via the command-line. But I'm glad I stuck with it and figured it out. It's good knowledge to have.

Hopefully someone else will find it useful as well.