Sunday, March 21, 2010

Linux boot: playing with initrd (initial RAM disk)

This is a tutorial on how initrd works from a user's perspective. I'm experimenting with using Blogspot to post tutorials. I'm going to post this in several stages and edit as I go. The example was run on a CentOS 5 VM with a custom-compiled Linux 2.6.18 kernel. It contains Red Hat-specific details, but things are probably similar on other desktop/server Linux distros.
When your Linux system boots, it goes through a number of stages. Starting from the very low level, pressing the power button starts the BIOS firmware, which runs a power-on self-test (POST) and establishes a baseline for the supported hardware. Then the firmware figures out which device to boot from, usually a hard drive. If you have GRUB or another bootloader, that small program knows what is on your disks and helps you select and boot whichever operating system suits you. Linux then begins to load. Here is a screenshot of what early Linux boot looks like, up to about where it starts poking around for hard drives:
The "Booting 'CentOS (2.6-18.8-UV5)'" line shows GRUB's human-friendly entry name right before it actually boots our kernel. After a few other boot debug statements, we see that the kernel is uncompressed and booted. Finally, we see "Red Hat nash version 5.1.19.6 starting." This post is about what happens between those two points and how we get to nash.
In the old days (and on embedded systems), booting was simple. You had a hard drive and a kernel. The kernel had the driver built in, so it never had trouble handling the hard drive it lived on or needed to boot from. But as systems have grown, the kernel typically no longer bundles all of the drivers. For booting, the most critical ones are the file system drivers and the hard drive controller drivers. So you compile a kernel module for your hard drive controller and put it on the hard drive. Then we just need to read it off the hard drive at boot so we can load the driver...no wait.... We have a chicken-and-egg problem: we need the hard drive driver to read the hard drive. initrd does the magic to solve this.
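To see which side of the chicken-and-egg problem a given driver falls on, you can check whether it was compiled into the kernel or built as a module. Here is a rough sketch, assuming your distro installs the kernel config as a text file (commonly /boot/config-<version>, which is not shown in my listing). The function name driver_linkage is made up for this example:

```shell
# Report how a kernel option was configured by reading a kernel config file:
#   =y means compiled into the kernel, =m means built as a loadable module.
# driver_linkage is a hypothetical helper; the config path is an assumption.
driver_linkage() {
    config_file=$1
    option=$2
    # Take the first matching "OPTION=value" line, if there is one.
    line=$(grep "^${option}=" "$config_file" 2>/dev/null | head -n 1)
    case "$line" in
        "${option}=y") echo "built-in" ;;
        "${option}=m") echo "module" ;;
        *)             echo "not set" ;;
    esac
}
```

For example, driver_linkage "/boot/config-$(uname -r)" CONFIG_EXT3_FS would print "module" on a system where ext3 has to come from the initrd.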
Let's take a look at what's in /boot. On my machine:
[mcmaster@gespenst initrd]$ ls -lh /boot/
(abbreviated)
total 49M
drwxr-xr-x 2 root root 1.0K Nov 12 21:18 extlinux
drwxr-xr-x 2 root root 1.0K Sep 3 07:30 grub
-rw------- 1 root root 3.0M Jan 14 2009 initrd-2.6.18.8-UV4.img
drwx------ 2 root root 12K Aug 4 2008 lost+found
-rw-r--r-- 1 root root 79K Mar 12 2009 message
-rw-r--r-- 1 root root 923K Jan 14 2009 System.map-2.6.18.8-UV4
-rw-r--r-- 1 root root 2.1M Jan 14 2009 vmlinuz-2.6.18.8-UV4
In order to boot, we need two key items: a kernel and some data to feed it. vmlinuz is the Linux kernel image. The "z" in the name indicates that the image is compressed (originally with gzip-style compression) and decompresses itself at boot; other algorithms are also supported. The data is in the initrd image. It's an INITial Ram Disk for us to boot to. Let's see what's in it. Set up a sandbox directory somewhere on your system to play in:
[mcmaster@gespenst ~]$ mkdir ~/buffer/initrd
[mcmaster@gespenst ~]$ cd ~/buffer/initrd/
And copy over the initrd so we can mess with it (on my system it was readable only by root):
[mcmaster@gespenst initrd]$ sudo cp /boot/initrd-2.6.18.8-UV4.img .
[mcmaster@gespenst initrd]$ sudo chown mcmaster:mcmaster initrd-2.6.18.8-UV4.img
What does trusty ol' "file" say about it?
[mcmaster@gespenst initrd]$ file initrd-2.6.18.8-UV4.img
initrd-2.6.18.8-UV4.img: gzip compressed data, from Unix, last modified: Wed Jan 14 09:52:20 2009, max compression
Ah ha! So let's uncompress it. Note that gunzip decides what to do from the file name suffix, so it gets angry if the name doesn't end in something it recognizes. Either rename the file to something ending in .gz or pipe it:
[mcmaster@gespenst initrd]$ cat initrd-2.6.18.8-UV4.img |gunzip >initrd-2.6.18.8-UV4
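If you do this a lot, the suffix issue can be sidestepped with a small helper that reads the file through stdin. This is only a sketch: the function name gunzip_any is made up, and the check against the gzip magic bytes (1f 8b) is my own addition so that a non-gzip file is rejected up front.

```shell
# Decompress a gzip file whatever it is called.
# gunzip_any is a hypothetical helper name, not a standard tool.
gunzip_any() {
    src=$1
    dst=$2
    # gzip data always starts with the two bytes 0x1f 0x8b.
    magic=$(head -c 2 "$src" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" != "1f8b" ]; then
        echo "$src: not gzip data" >&2
        return 1
    fi
    # Reading from stdin means gunzip never sees the file name or its suffix.
    gunzip -c < "$src" > "$dst"
}
```

With that defined, gunzip_any initrd-2.6.18.8-UV4.img initrd-2.6.18.8-UV4 does the same job as the pipe above.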
What did we get?
[mcmaster@gespenst initrd]$ file initrd-2.6.18.8-UV4
initrd-2.6.18.8-UV4: ASCII cpio archive (SVR4 with no CRC)
A cpio archive. And I always thought the cpio command was useless. What's in the box?
[mcmaster@gespenst initrd]$ cpio --verbose -t <initrd-2.6.18.8-UV4
(abbreviated)
lrwxrwxrwx 1 root root 3 Jan 14 2009 sbin -> bin
drwx------ 3 root root 0 Jan 14 2009 lib
-rw------- 1 root root 27840 Jan 14 2009 lib/dm-mirror.ko
-rw------- 1 root root 37268 Jan 14 2009 lib/ehci-hcd.ko
...
-rw------- 1 root root 159732 Jan 14 2009 lib/scsi_mod.ko
drwx------ 3 root root 0 Jan 14 2009 dev
crw------- 1 root root 4, 67 Jan 14 2009 dev/ttyS3
crw------- 1 root root 1, 5 Jan 14 2009 dev/zero
crw------- 1 root root 4, 10 Jan 14 2009 dev/tty10
drwx------ 3 root root 0 Jan 14 2009 etc
drwx------ 2 root root 0 Jan 14 2009 etc/lvm
-rw------- 1 root root 15911 Jan 14 2009 etc/lvm/lvm.conf
-rwx------ 1 root root 2354 Jan 14 2009 init
drwx------ 2 root root 0 Jan 14 2009 sys
drwx------ 2 root root 0 Jan 14 2009 proc
drwx------ 2 root root 0 Jan 14 2009 bin
-rwx------ 1 root root 852164 Jan 14 2009 bin/kpartx
-r-x------ 1 root root 1464040 Jan 14 2009 bin/lvm
-rwx------ 1 root root 2381980 Jan 14 2009 bin/nash
lrwxrwxrwx 1 root root 10 Jan 14 2009 bin/modprobe -> /sbin/nash
-rwx------ 1 root root 1038596 Jan 14 2009 bin/dmraid
-rwx------ 1 root root 470244 Jan 14 2009 bin/insmod
drwx------ 2 root root 0 Jan 14 2009 sysroot
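The intermediate uncompressed file isn't strictly needed just to peek inside. As a sketch, the two steps can be chained into one pipeline that only lists the kernel modules. The function name list_initrd_modules is made up, and I'm assuming GNU cpio here for the --quiet option (it suppresses the trailing block count):

```shell
# List the kernel modules (*.ko) inside a gzip-compressed cpio image
# without writing the decompressed archive to disk.
# list_initrd_modules is a hypothetical helper; --quiet assumes GNU cpio.
list_initrd_modules() {
    gunzip -c < "$1" | cpio -t --quiet | grep '\.ko$'
}
```

On my image, list_initrd_modules initrd-2.6.18.8-UV4.img should print the lib/*.ko entries from the listing above.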
Be careful extracting: there are some device nodes in there, which can have odd repercussions if created needlessly. Chances are you don't really want those created on extraction. Just run cpio as a normal user and creating them will fail harmlessly. Let's extract it:
[mcmaster@gespenst initrd]$ mkdir decompressed
[mcmaster@gespenst initrd]$ cd decompressed/
[mcmaster@gespenst decompressed]$ cpio -i <../initrd-2.6.18.8-UV4
cpio: dev/ttyS3: Operation not permitted
cpio: dev/zero: Operation not permitted
...
cpio: dev/tty10: Operation not permitted
13718 blocks
[mcmaster@gespenst initrd]$ ls -lh decompressed/
total 32K
drwx------ 2 mcmaster mcmaster 4.0K Dec 30 18:03 bin
drwx------ 3 mcmaster mcmaster 4.0K Dec 30 18:03 dev
drwx------ 3 mcmaster mcmaster 4.0K Dec 30 18:03 etc
-rwx------ 1 mcmaster mcmaster 2.3K Dec 30 18:03 init
drwx------ 3 mcmaster mcmaster 4.0K Dec 30 18:03 lib
drwx------ 2 mcmaster mcmaster 4.0K Dec 30 18:03 proc
lrwxrwxrwx 1 mcmaster mcmaster 3 Dec 30 18:03 sbin -> bin
drwx------ 2 mcmaster mcmaster 4.0K Dec 30 18:03 sys
drwx------ 2 mcmaster mcmaster 4.0K Dec 30 18:03 sysroot
As non-root we couldn't create the device nodes, so those entries failed.
But that's okay: we just want to look at the normal files and know that those devices would have existed.
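For repeat experiments, the decompress and extract steps can be wrapped into one function. This is a sketch rather than a polished tool; the name unpack_initrd is made up. Run as a normal user it will still report the device nodes it cannot create, and it returns cpio's non-zero status in that case even though the regular files are extracted.

```shell
# Unpack a gzip-compressed cpio image into a directory.
# unpack_initrd is a hypothetical helper name.
unpack_initrd() {
    image=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # The cd happens in a subshell, so the caller's working directory is
    # untouched and a relative image path still resolves on the gunzip side.
    # -i extracts, -d creates leading directories as needed.
    gunzip -c < "$image" | (cd "$dest" && cpio -i -d)
}
```

For example: unpack_initrd initrd-2.6.18.8-UV4.img decompressed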
So, this is the initial filesystem loaded onto your box. What happens now is that the file "init" is run. The kernel looks for /init in the initial RAM filesystem by default (the rdinit= boot parameter can point it elsewhere). Let's see what's in there:
#!/bin/nash

mount -t proc /proc /proc
setquiet
echo Mounting proc filesystem
echo Mounting sysfs filesystem
mount -t sysfs /sys /sys
echo Creating /dev
mount -o mode=0755 -t tmpfs /dev /dev
mkdir /dev/pts
mount -t devpts -o gid=5,mode=620 /dev/pts /dev/pts
mkdir /dev/shm
mkdir /dev/mapper
echo Creating initial device nodes
mknod /dev/null c 1 3
mknod /dev/zero c 1 5
mknod /dev/systty c 4 0
...
mknod /dev/ttyS3 c 4 67
echo Setting up hotplug.
hotplug
echo Creating block device nodes.
mkblkdevs
echo "Loading ehci-hcd.ko module"
insmod /lib/ehci-hcd.ko
...
echo "Loading dm-snapshot.ko module"
insmod /lib/dm-snapshot.ko
echo Waiting for driver initialization.
stabilized --hash --interval 250 /proc/scsi/scsi
mkblkdevs
echo Scanning and configuring dmraid supported devices
echo Scanning logical volumes
lvm vgscan --ignorelockingfailure
echo Activating logical volumes
lvm vgchange -ay --ignorelockingfailure VolGroup00
resume /dev/VolGroup00/LogVol01
echo Creating root device.
mkrootdev -t ext3 -o defaults,ro /dev/VolGroup00/LogVol00
echo Mounting root filesystem.
mount /sysroot
echo Setting up other filesystems.
setuproot
echo Switching to new root and running init.
switchroot
It's a script executed by the "nash" interpreter. And now you know where /dev/null and some other important devices come from. Basically, we mount the basic kernel information pseudo-filesystems (procfs, sysfs) and create important device nodes such as /dev/null, without which some programs freak out. Next, we load the device drivers we packed into the initrd image. They can be seen in /lib. If you are using LVM, it will then scan for LVs now that we have the bootstrap filesystem and device drivers loaded.
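Since the order in which drivers load can matter, it's handy to pull just the insmod lines out of an extracted init script. A minimal sketch, with a made-up function name (init_modules):

```shell
# Print the module files an init script loads with insmod, in load order.
# init_modules is a hypothetical helper name.
init_modules() {
    # Keep only lines that start with "insmod" and print their first argument.
    sed -n 's/^insmod[[:space:]]\{1,\}\([^[:space:]]*\).*/\1/p' "$1"
}
```

Running init_modules decompressed/init on my image would start with /lib/ehci-hcd.ko and end with /lib/dm-snapshot.ko.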
Now here's my favorite part. Up to this point, / has been a throwaway filesystem from the initrd. We need to get rid of it so we can put our real filesystem there. This is handled in the last part of the script:
echo Creating root device.
mkrootdev -t ext3 -o defaults,ro /dev/VolGroup00/LogVol00
echo Mounting root filesystem.
mount /sysroot
echo Setting up other filesystems.
setuproot
echo Switching to new root and running init.
switchroot
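We mount our real root at a temporary mount point, /sysroot. Hmm, that's not quite right, but it's getting closer. And then, after a bit of prep, switchroot? What magic is that? As it turns out, there is a special black magic system call for swapping the root, pivot_root(2). (Strictly speaking, a cpio-based initial filesystem like this one lives on the kernel's rootfs, which cannot be pivoted, so in that case the switch is done by moving the new root's mount over / and chroot-ing into it. The idea is the same.) From the man page pivot_root(2):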
int pivot_root(const char *new_root, const char *put_old);

DESCRIPTION
pivot_root() moves the root file system of the current process to the directory put_old and makes new_root the new root file system of the current process.
Fancy. Once that's done, we are pretty much set. We have our expected filesystem mounted. The last thing to do is to run /sbin/init, and we move on to our normal system startup.
Hopefully this gave you some background on how we get from GRUB to our normal boot. Any comments, suggestions, corrections, etc. are most welcome!