ZFS Internals (part #3)
From the MANUAL page: The zdb command is used by support engineers to diagnose failures and gather statistics. Since the ZFS file system is always consistent on disk and is self-repairing, zdb should only be run under the direction of a support engineer.
Hello all…
Ok, we are analyzing the ZFS on-disk format (like Jack, in parts). In my first post we saw the copy-on-write semantics of ZFS in a VI editing session. It’s important to notice a few things about that test:
1) First, we were updating a non-structured flat file (text);
2) Second, all blocks were reallocated (data and metadata);
3) Third, VI (and any software I know), for the sake of consistency, rewrites the whole file when changing something (add/remove/update);
So, that’s why the “non-touched block” was reallocated in our example, and this has nothing to do with COW (thanks to Eric for pointing that out). But that is normal behavior for software updating a flat file. It is so normal that we only remember it when we run tests like these, because in normal situations those “other write” operations do not have a big impact. VI actually creates another copy, and at the end moves “new” -> “current” (rsync, for example, does the same). So, updating a flat file is the same as creating a new one. That’s why I talked about mail servers that work with mboxes… I don’t know about Cyrus, but the others that I know rewrite the whole mailbox for almost every update operation. Maybe a mail server that manages an mbox like a structured database, with fixed-size records, could rewrite a line without needing to rewrite the whole file, but I don’t know of any MTA/POP/IMAP server like that. See, these tests remind us why databases exist. ;-)
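Just to make that pattern concrete, here is a minimal sketch (the file names are made up) of what an editor like VI effectively does when you save; note that the original file is never updated in place, it is replaced:

# cp current current.new
# vi current.new
# mv current.new current

From the filesystem’s point of view, after the mv “current” is a brand new file with brand new blocks, even if only one line changed.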
Well, in this post we will look at some internals of the ext2/ext3 filesystem to better understand ZFS.
Let’s do it…
ps.: I did these tests on an Ubuntu desktop.
# dd if=/dev/zero of=fakefs bs=1024k count=100
# mkfs.ext3 -b 4096 fakefs
mke2fs 1.40.8 (13-Mar-2008)
fakefs is not a block special device.
Proceed anyway? (y,n) y
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
25600 inodes, 25600 blocks
1280 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=29360128
1 block group
32768 blocks per group, 32768 fragments per group
25600 inodes per group

Writing inode tables: done
Creating journal (1024 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
Ok, filesystem created, and you can see important information about it, which I think is really nice. I think ZFS could give us some information too. I know, it’s something like “marketing”: “we did not pre-allocate anything”… c’mon! ;-)
ps.: Pay attention that we have created the filesystem using 4K blocks, because that is the biggest size available.
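If you ever want to double-check that afterwards, the superblock can be queried at any time with tune2fs (assuming the image file is still called fakefs); it should print a “Block size: 4096” line:

# tune2fs -l fakefs | grep "Block size"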
Now we mount the brand new filesystem, and put a little text file on it.
# mount -oloop fakefs mnt/
# debugfs fakefs
debugfs: stats
... snipped ...
Directories: 2
 Group  0: block bitmap at 8, inode bitmap at 9, inode table at 10
           23758 free blocks, 25589 free inodes, 2 used directories
debugfs: quit
# du -sh /etc/bash_completion
216K    /etc/bash_completion
# cp -pRf /etc/bash_completion mnt/
Just to be sure, let’s umount it and see what we get from debugfs:
# umount mnt/
# debugfs fakefs
debugfs: stats
Directories: 2
 Group  0: block bitmap at 8, inode bitmap at 9, inode table at 10
           23704 free blocks, 25588 free inodes, 2 used directories
debugfs: ls
 2 (12) .    2 (12) ..    11 (20) lost+found    12 (4052) bash_completion
debugfs: ncheck 12
Inode   Pathname
12      /bash_completion
Very nice! Here we already have a lot of information:
– The file is there. ;-)
– The filesystem used 54 blocks to hold our file. We know that by looking at the free blocks line (23758 - 23704 = 54);
– The inode of our file is 12 (don’t forget it)
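By the way, you don’t need an interactive session for quick checks like these: debugfs accepts a single command through the -R option (again assuming our image file is called fakefs), which is handy for scripting:

# debugfs -R "ncheck 12" fakefs
# debugfs -R "stats" fakefs | grep "free blocks"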
But let’s go ahead…
debugfs: show_inode_info bash_completion
Inode: 12   Type: regular    Mode:  0644   Flags: 0x0   Generation: 495766590
User:  1000   Group:  1000   Size: 216529
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 432
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x48d4e2da -- Sat Sep 20 08:47:38 2008
atime: 0x48d4d858 -- Sat Sep 20 08:02:48 2008
mtime: 0x480408b3 -- Mon Apr 14 22:45:23 2008
BLOCKS:
(0-11):1842-1853, (IND):1854, (12-52):1855-1895
TOTAL: 54
Oh, that’s it! Our file is using 53 data blocks (53 * 4096 = 217,088 bytes, enough for our 216,529 bytes), plus one metadata block (the indirect block at 1854). We already have the locations too: 12 data blocks at positions 1842-1853, and 41 data blocks at positions 1855-1895. But no, we don’t just believe it! We need to see it for ourselves…
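Before we dig the blocks out, a quick aside on why the layout looks like that: an ext2/ext3 inode holds only 12 direct block pointers, so blocks 0-11 get listed on their own, and from block 12 on the filesystem needs the single indirect block (1854) that appears between the two extents. The arithmetic checks out, and even matches the Blockcount field, which is counted in 512-byte sectors:

# echo $(( (216529 + 4095) / 4096 ))
53
# echo $(( 12 + 41 + 1 ))
54
# echo $(( 432 / 8 ))
54

Now, to the raw blocks…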
# dd if=Devel/fakefs of=/tmp/out skip=1842 bs=4096 count=12
# dd if=Devel/fakefs of=/tmp/out2 skip=1855 bs=4096 count=41
# cp -pRf /tmp/out /tmp/out3 && cat /tmp/out2 >> /tmp/out3
# diff mnt/bash_completion /tmp/out3
9401a9402
> \ Não há quebra de linha no final do arquivo
debugfs: quit
ps.: That’s just the Brazilian Portuguese for “No newline at end of file”. ;-)
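A quick note on what dd is doing up there: skip counts input blocks of bs bytes, so skip=1842 bs=4096 starts reading at byte 1842 * 4096 of the image, exactly at the first direct block of our file:

# echo $(( 1842 * 4096 ))
7544832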
Now let’s do the same as we did for ZFS…
# vi mnt/bash_completion
   (change "unset BASH_COMPLETION_ORIGINAL_V_VALUE" to "RESET BASH_COMPLETION_ORIGINAL_V_VALUE")
# umount mnt
# sync
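If you prefer to script that edit, something like the sed line below would do it; note that sed -i also writes a whole new copy and renames it over the original, so it triggers exactly the same reallocation as VI:

# sed -i 's/unset BASH_COMPLETION_ORIGINAL_V_VALUE/RESET BASH_COMPLETION_ORIGINAL_V_VALUE/' mnt/bash_completion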
For the last part, we will look at the new layout of the filesystem…
# mount -oloop fakefs mnt
# debugfs fakefs
debugfs: show_inode_info bash_completion
Inode: 12   Type: regular    Mode:  0644   Flags: 0x0   Generation: 495766590
User:  1000   Group:  1000   Size: 216529
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 432
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x48d4f6a3 -- Sat Sep 20 10:12:03 2008
atime: 0x48d4f628 -- Sat Sep 20 10:10:00 2008
mtime: 0x48d4f6a3 -- Sat Sep 20 10:12:03 2008
BLOCKS:
(0-11):24638-24649, (IND):24650, (12-52):24651-24691
TOTAL: 54
Almost the same as ZFS… the data blocks are totally new ones, but the metadata remains in the same inode (12, Generation: 495766590)! Without copy-on-write, the metadata is rewritten in place, which in a system crash was the reason for fsck (ext2). But we are working with an ext3 filesystem, so the answer to that possible failure is “journaling”. Ext3 has a log (as you can see in the creation messages), where the filesystem records what it is about to do before it really does it. That does not solve the problem, but it makes recovery much easier in the case of a crash.
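If you want to see that journal from the outside, dumpe2fs (from the same e2fsprogs package as debugfs) should show it; on ext3 the journal lives in reserved inode 8, and the filesystem carries the has_journal feature flag:

# dumpe2fs -h fakefs | grep -i journal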
But what about our data? We need to see our data, show me our…….. ;-)
From the new locations:
# dd if=Devel/fakefs of=/tmp/out4 skip=24638 bs=4096 count=12
# dd if=Devel/fakefs of=/tmp/out5 skip=24651 bs=4096 count=41
# cp -pRf /tmp/out4 /tmp/out6 && cat /tmp/out5 >> /tmp/out6
# diff mnt/bash_completion /tmp/out6
9402d9401
< \ Não há quebra de linha no final do arquivo
and the original (old) data blocks...
# dd if=Devel/fakefs of=/tmp/out skip=1842 bs=4096 count=12
# dd if=Devel/fakefs of=/tmp/out2 skip=1855 bs=4096 count=41
# cp -pRf /tmp/out /tmp/out3 && cat /tmp/out2 >> /tmp/out3
# diff mnt/bash_completion /tmp/out3
9397c9397
< RESET BASH_COMPLETION_ORIGINAL_V_VALUE
---
> unset BASH_COMPLETION_ORIGINAL_V_VALUE
9401a9402
> \ Não há quebra de linha no final do arquivo
That's it! Copy-on-write means: "never rewrite a live block (data or metadata)". Here we can see one of the big differences between ZFS and other filesystems. VI rewrites the whole file (creating a new one), and OK, both filesystems allocated new data blocks for it, but the metadata handling used different approaches (and that is what actually makes the whole difference). But what if we change just one block? What if our program did not create a new file, and just rewrote the block in place? Subject for another post...
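Just to make that last question concrete (an illustrative sketch only; newblock.bin is a made-up 4K file), dd can rewrite a single block of a file in place: conv=notrunc keeps the rest of the file intact, while seek=5 overwrites only the sixth 4K block. That is the workload we will want to throw at ZFS next:

# dd if=newblock.bin of=mnt/bash_completion bs=4096 seek=5 count=1 conv=notrunc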
peace!
rsync will do block-level updates if you use the --inplace argument.
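For example (hypothetical paths): without --inplace, rsync builds a temporary copy and renames it, just like VI; with it, rsync writes the changed blocks directly into the existing destination file:

# rsync --inplace /etc/bash_completion mnt/bash_completion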