ZFS Internals (part #3)
From the MANUAL page: The zdb command is used by support engineers to diagnose failures and gather statistics. Since the ZFS file system is always consistent on disk and is self-repairing, zdb should only be run under the direction of a support engineer.
Hello all…
Ok, we are analyzing the ZFS on-disk format (like Jack, in parts). In my first post we saw the copy-on-write semantics of ZFS in a VI editing session. It’s important to notice a few things about that test:
1) First, we were updating a non-structured flat file (text);
2) Second, all blocks were reallocated (data and metadata);
3) Third, VI (and any software I know), for the sake of consistency, rewrites the whole file when changing something (add/remove/update);
So, that’s why the “non-touched block” was reallocated in our example, and this has nothing to do with COW (thanks to Eric for pointing that out). But that is normal behavior for software updating a flat file. It is so normal that we only remember it when we run tests like these, because in normal situations those “other write” operations do not have a big impact. VI actually creates another copy, and at the end moves “new” -> “current” (rsync, for example, does the same). So, updating a flat file is the same as creating a new one. That’s why I talked about mail servers that work with mboxes… I don’t know about Cyrus, but the others that I know rewrite the whole mailbox for almost every update operation. Maybe a mail server that manages an mbox like a structured database, with fixed-size records, could rewrite a line without needing to rewrite the whole file, but I don’t know of any MTA/POP/IMAP server like that. See, these tests remind us why databases exist. ;-)
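Just to make that pattern concrete, here is a minimal sketch (the file names are made up) of what an editor like VI effectively does when you save; note that the original file is never updated in place, it is replaced:

# cp current current.new
# vi current.new
# mv current.new current

From the filesystem’s point of view, after the mv “current” is a brand new file with brand new blocks, even if only one line changed.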
Well, in this post we will look at some internals of the ext2/ext3 filesystem to better understand ZFS.
Let’s do it…
ps.: I did these tests on an Ubuntu desktop.
# dd if=/dev/zero of=fakefs bs=1024k count=100
# mkfs.ext3 -b 4096 fakefs
mke2fs 1.40.8 (13-Mar-2008)
fakefs is not a block special device.
Proceed anyway? (y,n) y
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
25600 inodes, 25600 blocks
1280 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=29360128
1 block group
32768 blocks per group, 32768 fragments per group
25600 inodes per group

Writing inode tables: done
Creating journal (1024 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 29 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
Ok, filesystem created, and you can see important information about it, which I think is really nice. I think ZFS could give us some information too. I know, it’s something like “marketing”: “we did not pre-allocate anything”… c’mon! ;-)
ps.: Pay attention that we have created the filesystem using 4K blocks, because that is the biggest size available.
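If you ever want to double-check that afterwards, the superblock can be queried at any time with tune2fs (assuming the image file is still called fakefs); it should print a “Block size: 4096” line:

# tune2fs -l fakefs | grep "Block size"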
Now we mount the brand new filesystem, and put a little text file on it.
# mount -oloop fakefs mnt/
# debugfs fakefs
debugfs: stats
... snipped ...
Directories: 2
 Group  0: block bitmap at 8, inode bitmap at 9, inode table at 10
           23758 free blocks, 25589 free inodes, 2 used directories
debugfs: quit
# du -sh /etc/bash_completion
216K    /etc/bash_completion
# cp -pRf /etc/bash_completion mnt/
Just to be sure, let’s umount it and see what we get from debugfs:
# umount mnt/
# debugfs fakefs
debugfs: stats
Directories: 2
 Group  0: block bitmap at 8, inode bitmap at 9, inode table at 10
           23704 free blocks, 25588 free inodes, 2 used directories
debugfs: ls
 2 (12) .    2 (12) ..    11 (20) lost+found    12 (4052) bash_completion
debugfs: ncheck 12
Inode   Pathname
12      /bash_completion
Very nice! Here we already have a lot of information:
– The file is there. ;-)
– The filesystem used 54 blocks to hold our file. We know that by looking at the free blocks line (23758 - 23704 = 54);
– The inode of our file is 12 (don’t forget it)
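By the way, you don’t need an interactive session for quick checks like these: debugfs accepts a single command through the -R option (again assuming our image file is called fakefs), which is handy for scripting:

# debugfs -R "ncheck 12" fakefs
# debugfs -R "stats" fakefs | grep "free blocks"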
But let’s go ahead…
debugfs: show_inode_info bash_completion
Inode: 12   Type: regular    Mode:  0644   Flags: 0x0   Generation: 495766590
User:  1000   Group:  1000   Size: 216529
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 432
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x48d4e2da -- Sat Sep 20 08:47:38 2008
atime: 0x48d4d858 -- Sat Sep 20 08:02:48 2008
mtime: 0x480408b3 -- Mon Apr 14 22:45:23 2008
BLOCKS:
(0-11):1842-1853, (IND):1854, (12-52):1855-1895
TOTAL: 54
Oh, that’s it! Our file is using 53 data blocks (53 * 4096 = 217,088 bytes, enough for our 216,529 bytes), plus one metadata block (the indirect block at 1854). We already have the locations too: 12 data blocks at positions 1842-1853, and 41 data blocks at positions 1855-1895. But no, we don’t just believe it! We need to see it for ourselves…
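Before we dig the blocks out, a quick aside on why the layout looks like that: an ext2/ext3 inode holds only 12 direct block pointers, so blocks 0-11 get listed on their own, and from block 12 on the filesystem needs the single indirect block (1854) that appears between the two extents. The arithmetic checks out, and even matches the Blockcount field, which is counted in 512-byte sectors:

# echo $(( (216529 + 4095) / 4096 ))
53
# echo $(( 12 + 41 + 1 ))
54
# echo $(( 432 / 8 ))
54

Now, to the raw blocks…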
# dd if=Devel/fakefs of=/tmp/out skip=1842 bs=4096 count=12
# dd if=Devel/fakefs of=/tmp/out2 skip=1855 bs=4096 count=41
# cp -pRf /tmp/out /tmp/out3 && cat /tmp/out2 >> /tmp/out3
# diff mnt/bash_completion /tmp/out3
9401a9402
> \ Não há quebra de linha no final do arquivo
debugfs: quit
ps.: That’s just the Brazilian Portuguese for “No newline at end of file”. ;-)
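A quick note on what dd is doing up there: skip counts input blocks of bs bytes, so skip=1842 bs=4096 starts reading at byte 1842 * 4096 of the image, exactly at the first direct block of our file:

# echo $(( 1842 * 4096 ))
7544832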
Now let’s do the same as we did for ZFS…
# vi mnt/bash_completion
   (change "unset BASH_COMPLETION_ORIGINAL_V_VALUE" to "RESET BASH_COMPLETION_ORIGINAL_V_VALUE")
# umount mnt
# sync
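If you prefer to script that edit, something like the sed line below would do it; note that sed -i also writes a whole new copy and renames it over the original, so it triggers exactly the same reallocation as VI:

# sed -i 's/unset BASH_COMPLETION_ORIGINAL_V_VALUE/RESET BASH_COMPLETION_ORIGINAL_V_VALUE/' mnt/bash_completion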
For the last part, we will look at the new layout of the filesystem…
# mount -oloop fakefs mnt
# debugfs fakefs
debugfs: show_inode_info bash_completion
Inode: 12   Type: regular    Mode:  0644   Flags: 0x0   Generation: 495766590
User:  1000   Group:  1000   Size: 216529
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 432
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x48d4f6a3 -- Sat Sep 20 10:12:03 2008
atime: 0x48d4f628 -- Sat Sep 20 10:10:00 2008
mtime: 0x48d4f6a3 -- Sat Sep 20 10:12:03 2008
BLOCKS:
(0-11):24638-24649, (IND):24650, (12-52):24651-24691
TOTAL: 54
Almost the same as ZFS… the data blocks are totally new ones, but the metadata remains in the same inode (12, Generation: 495766590)! Without copy-on-write, the metadata is rewritten in place, which in a system crash was the reason for fsck (ext2). But we are working with an ext3 filesystem, so the answer to that possible failure is “journaling”. Ext3 has a log (as you can see in the creation messages), where the filesystem records what it is about to do before it really does it. That does not solve the problem, but it makes recovery much easier in the case of a crash.
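If you want to see that journal from the outside, dumpe2fs (from the same e2fsprogs package as debugfs) should show it; on ext3 the journal lives in reserved inode 8, and the filesystem carries the has_journal feature flag:

# dumpe2fs -h fakefs | grep -i journal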
But what about our data? We need to see our data, show me our…….. ;-)
From the new locations:
# dd if=Devel/fakefs of=/tmp/out4 skip=24638 bs=4096 count=12
# dd if=Devel/fakefs of=/tmp/out5 skip=24651 bs=4096 count=41
# cp -pRf /tmp/out4 /tmp/out6 && cat /tmp/out5 >> /tmp/out6
# diff mnt/bash_completion /tmp/out6
9402d9401
< \ Não há quebra de linha no final do arquivo
and the original (old) data blocks...
# dd if=Devel/fakefs of=/tmp/out skip=1842 bs=4096 count=12
# dd if=Devel/fakefs of=/tmp/out2 skip=1855 bs=4096 count=41
# cp -pRf /tmp/out /tmp/out3 && cat /tmp/out2 >> /tmp/out3
# diff mnt/bash_completion /tmp/out3
9397c9397
< RESET BASH_COMPLETION_ORIGINAL_V_VALUE
---
> unset BASH_COMPLETION_ORIGINAL_V_VALUE
9401a9402
> \ Não há quebra de linha no final do arquivo
That's it! Copy-on-write means: "never rewrite a live block (data or metadata)". Here we can see one of the big differences between ZFS and other filesystems. VI rewrites the whole file (creating a new one), and OK, both filesystems allocated new data blocks for it, but the metadata handling used different approaches (and that is what actually makes the whole difference). But what if we change just one block? What if our program did not create a new file, and just rewrote the block in place? Subject for another post...
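Just to make that last question concrete (an illustrative sketch only; newblock.bin is a made-up 4K file), dd can rewrite a single block of a file in place: conv=notrunc keeps the rest of the file intact, while seek=5 overwrites only the sixth 4K block. That is the workload we will want to throw at ZFS next:

# dd if=newblock.bin of=mnt/bash_completion bs=4096 seek=5 count=1 conv=notrunc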
peace!
rsync will do block-level updates if you use the --inplace argument.
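For example (hypothetical paths): without --inplace, rsync builds a temporary copy and renames it, just like VI; with it, rsync writes the changed blocks directly into the existing destination file:

# rsync --inplace /etc/bash_completion mnt/bash_completion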