Alan Hargreaves' Blog

The ramblings of a former Australian SaND TSC* Principal Field Technologist

Daaaaaaaad, the computer isn’t booting

These were the words that my 10-year-old boy yelled to me on Sunday.

I’m documenting this because I tried to imagine facing it as an end user, rather than as a Solaris kernel support engineer, and shuddered.

The machine he is talking about is the OpenSolaris box that I installed for them and recently upgraded to SRU4 on Friday (more on supported updates shortly).

The box (silver) had been sitting at the boot load screen (those of us old enough to remember the original Battlestar Galactica would refer to it as the cylon screen) with the disk light hard on and the disk rattling away threatening to send itself off into hyperspace. It had apparently been like this for a few hours (he lost interest and went to watch TV before thinking to tell me).

He’d tried resetting it and it didn’t help.

“OK”, I thought, “I’ve just upgraded the box, maybe there was a problem with SRU4, let’s boot into the prior boot environment.” Easy enough to do, just reset and select the prior environment in grub.

No dice. Same issue.

Failsafe boot? No, it appears that we don’t have one of those.

Right, a single-user boot. I want to have a look at what is happening on the console, so we need to get rid of the graphics crud at the start.

This isn’t too hard. Have a look at the options in the text boot entry to be sure, but all I did was hit ‘e’ (edit) in grub, ‘d’ (delete) on the splash and graphics lines, ‘e’ (edit) on the unix line to take the “,Graphics… ” stuff off the end of the command, hit Enter to go back a screen, then hit ‘b’ (boot) and watch what happens.

I didn’t have to wait long.

Let me give you a little more background on this machine. It really is scrounged together. The root pool consists solely of a 4GB disc removed from an Ultra 10.

The root zpool was 100% full. The disc full messages scrolled for a while.

After a few minutes we got the prompt asking for a login name and password to drop us into a single-user root shell. OK, let’s go looking for where the space issue is.

A ‘zfs list’ showed me that rpool/export/home was a little larger than I expected. Unfortunately, as the pool was full, I couldn’t mount it. No worries, let’s poke around on / to find something to remove and free enough space to mount things.

A good place to look for such space on a workstation is in /var/log, specifically the Xorg logs.

Let’s remove one of those, ….

Bzzzzzt wrong.

Copy on write… In order to unlink a file we need to write a new block for the directory entry. Oops, no free blocks.

The trick is to lose the space without having to rewrite the directory entry. We need to truncate one of the logs.

# : > Xorg.0.log.old

Much better. For good measure I zapped Xorg.0.log as well.
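For anyone who wants to see the truncation trick in isolation, here is a minimal sketch that runs in a scratch directory rather than on a full pool. The file name demo.log and the helper truncate_in_place are made up for illustration. It only demonstrates that the redirect empties the file while leaving the same file (same inode, same name) in place; it does not simulate a full zpool.

```shell
#!/bin/sh
# Sketch only: runs in a scratch directory, not on a real full pool.
# demo.log is a hypothetical stand-in for a log such as Xorg.0.log.old.

truncate_in_place() {
    # ':' is the shell's no-op command. The '>' redirection opens the
    # file for writing and truncates it to zero length; the name and
    # its directory entry are left untouched.
    : > "$1"
}

scratch=$(mktemp -d)
log="$scratch/demo.log"
printf 'pretend this is a lot of log output\n' > "$log"

# Record the inode number before and after to show it is the same file.
inode_before=$(ls -i "$log" | awk '{print $1}')
truncate_in_place "$log"
inode_after=$(ls -i "$log" | awk '{print $1}')
size_after=$(wc -c < "$log" | tr -d ' ')

echo "size after truncate: $size_after bytes"
echo "inode before: $inode_before, after: $inode_after"

rm -rf "$scratch"
```

One caveat worth keeping in mind: if a snapshot still references the file's blocks, truncating the live copy will not hand that space back to the pool.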

OK, that looks much better.

Let’s mount rpool/export/home and have a look.

# zfs mount rpool/export/home

Ahhh, the kids’ home directories each have a largish core file in them. Remove those, then unmount /export/home. Now, as I mounted rpool/export/home and not rpool/export, a directory got created in /export. We need to remove that or the filesystem/local service won’t start correctly (it will complain about /export having stuff in it).

Log out of that shell and the system continues on to milestone=multiuser. We’re good again, and Jake is off to do his daily moves in Kingdom of Loathing and resume his Club Penguin.
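A rough sketch of how one might hunt down core files like these, assuming they carry the default name core (the post does not say what they were called). The helper find_big_cores, the users kid1 and kid2, and the 1 MB threshold are all hypothetical, and the demo builds its own fake home tree in a scratch directory instead of touching /export/home.

```shell
#!/bin/sh
# Sketch only: builds a fake home tree in a scratch directory.
# kid1 and kid2 are hypothetical users; "core" is the assumed file name.

find_big_cores() {
    dir=$1
    min_kb=$2
    # find's -size counts 512-byte blocks, so 2 blocks per kilobyte.
    find "$dir" -type f -name core -size +"$((min_kb * 2))" -print
}

scratch=$(mktemp -d)
mkdir -p "$scratch/kid1" "$scratch/kid2"

# A 2 MB file named core for kid1, and a tiny one for kid2.
dd if=/dev/zero of="$scratch/kid1/core" bs=1024 count=2048 2>/dev/null
printf 'x' > "$scratch/kid2/core"

expected="$scratch/kid1/core"
found=$(find_big_cores "$scratch" 1024)
echo "core files over 1 MB: $found"

rm -rf "$scratch"
```

Listing before deleting is deliberate: it gives you a chance to look at what matched before anything is removed.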


Written by Alan

April 6, 2009 at 8:26 am

Posted in OpenSolaris

15 Responses


  1. sounds like time for some quotas on /export/home, and possibly modifying coreadm
    not to mention a trip to the storage closet, surely you can find a couple 9GB or larger drives to use for the root pool, use the 4GB drive as a paperweight.
    James

    James Dickens

    April 6, 2009 at 8:55 am

  2. I ran into a similar problem with a virtual machine where I had been rather stingy in allocating my virtual disk (for good reasons of course). The problem is freeing those first few blocks so that rm can work. I didn’t think of truncating the file in the shell. I saved my neck by removing some old snapshots.
    This problem is bound to be confusing and frustrating to users who don’t understand why they can’t remove a file to free disk space when the disk is full. People need to know how to get that initial foothold so they can start cleaning up in earnest. We now have a couple of ideas – are there others? How does one disseminate such information?

    Rand Huntzinger

    April 6, 2009 at 10:34 am

  3. It seems like each zpool should put in a mini internal reservation of a few KB or a meg so you can always unlink entries. It is good to know the redirect hack to get around a full pool, but it is so lame to need it.

    Bill Hathaway

    April 6, 2009 at 11:02 am

  4. This "cylon" screen is one of the things I like *least* about OpenSolaris. If *anything* goes wrong during the boot, you’re hosed and have to hardware reset.
    In fact, if you ever make the mistake of svcadm disable gdm, you have to reboot if you don’t have remote access to the machine.
    ISTR that I’ve filed at least one CR on this issue. I consider this problem a high priority bug.
    Until this problem is resolved (perhaps by giving virtual terminal access to other text mode login screens, or by giving a hotkey to disable the cylon graphics) I recommend always disabling the graphical boot in grub.

    Garrett D'Amore

    April 6, 2009 at 11:22 am

  5. It’s worth noting that in build 111 and higher you should be able to hit any key during the boot sequence to cause "cylon" to turn off, and return control to the console.

    Daniel Price

    April 6, 2009 at 1:05 pm

  6. Dan, you wouldn’t happen to have the CR # handy that made that change?
    Garrett, sounds like the change Dan mentions may address your concern.
    Bill, perhaps, but, which is going to be easier to explain to an end user? Truncating a file or temporarily reducing a reservation?
    Rand, that’s why I wrote this blog 🙂
    James, yea well I have a 9gb disk here and a couple of other 4’s. I *was* going to mirror the initial 4, now I’m thinking of migrating to the 9.
    alan.

    Alan Hargreaves

    April 6, 2009 at 1:17 pm

  7. Alan – I was thinking of some very small amount of internal reserved zpool space that could be used to allow unlinking even when it appeared full. I definitely didn’t think of it as something that would require manual fiddling with by the end user, so perhaps saying a ZFS reservation isn’t the right term. I agree that if we have to tell an end-user to modify a ZFS property to be able to delete files in a space full scenario it isn’t useful.

    Bill Hathaway

    April 6, 2009 at 1:27 pm

  8. coreadm -d process 🙂

    andrewk8

    April 6, 2009 at 2:12 pm

  9. Alan, you might also want to consider "coreadm -d process", especially on a machine for the kids

    Boyd Adamson

    April 6, 2009 at 5:35 pm

  10. I miss a "logadm" entry in OpenSolaris default crontab. That should be included in the default installation, don’t you think?
    What about a blog entry about logadm?

    Antonio

    April 7, 2009 at 2:49 am

  11. I agree with Bill that it would be nice if some space was reserved so that the inability to rm wouldn’t occur but of course that isn’t the case now. It sounds easy to implement but I’ve been around long enough to know that what sounds easy isn’t always.

    Rand Huntzinger

    April 7, 2009 at 5:41 am

  12. Hi,
    Sounds like a bug to me, basically, as even if your root pool becomes full the box should still be usable enough to recover. Thus the FS/kernel should know this and reserve some space so that one can never find oneself unable to use admin commands to fix a problem such as the above.
    btw, what is the screen you’re all talking about? Galactica makes me sea sick.. lol
    Best Regards,

    Edward O'Callaghan

    April 7, 2009 at 10:40 am

  13. This is a bug. The system should not crash when it runs out of disk space. It should continue working and report an out of disk space error.
    To me this implies that a certain amount of space needs to be reserved to preserve the operation of the system.

    MC

    April 7, 2009 at 11:54 pm

  14. MC, I’m not sure where you got the idea that the system crashed.
    It didn’t.
    It was taking forever to boot because it was repeatedly trying to log startup messages to a log on /.
    The bit that made it harder was that you don’t see the console by default in the build I am on (see Dan’s comment above, this is fixed in dev).
    Alan.

    Alan Hargreaves

    April 8, 2009 at 12:08 am

  15. Sorry, I’m not blaming it on you 🙂 My point stands, it errored in a way it should not.

    MC

    April 8, 2009 at 6:18 am


