ZFS Internals (part #8)
PLEASE BE AWARE THAT ANY INFORMATION YOU MAY FIND HERE MAY BE INACCURATE, AND COULD INCLUDE TECHNICAL INACCURACIES, TYPOGRAPHICAL ERRORS, AND EVEN SPELLING ERRORS.
From the MANUAL page: The zdb command is used by support engineers to diagnose failures and gather statistics. Since the ZFS file system is always consistent on disk and is self-repairing, zdb should only be run under the direction by a support engineer.
DO NOT TRY IT IN PRODUCTION. USE AT YOUR OWN RISK!
This is something tricky, and I think it is important to write about… I’m talking about ZFS vdevs.
ZFS has two types of vdevs: logical and physical. So, from the ZFS on-disk specification, we know that a physical vdev is a writeable media device, and a logical vdev is a grouping of physical vdevs.
Let’s see a simple diagram using a RAIDZ logical vdev, and five physical vdevs:
                   +---------------------+
                   |      Root vdev      |
                   +---------------------+
                              |
                      +--------------+
                      |    RAIDZ     |          VIRTUAL VDEV
                      +--------------+
                              |
                        +----------+
                        |  128KB   |            CHECKSUM
                        +----------+
                              |
       32KB        32KB        32KB        32KB       Parity
     .------.    .------.    .------.    .------.    .------.
    |-______-|  |-______-|  |-______-|  |-______-|  |-______-|
    | vdev1  |  | Vdev2  |  | Vdev3  |  | Vdev4  |  | Vdev5  |   PHYSICAL VDEVS
     '-____-'    '-____-'    '-____-'    '-____-'    '-____-'
The diagram above is just an example, and in it the data handled by the RAIDZ virtual vdev is a 128KB block. I chose that size only to make my math easy, so the block divides equally among the four data vdevs (128KB / 4 = 32KB each, plus parity). ;-)
But remember that with RAIDZ we always have a full stripe, no matter the size of the data.
The important part here is the filesystem block. When I saw the first video presentation about ZFS, I had the wrong perception about the diagram above. As we can see, if the system requests a block, let’s say the 128KB block above, and physical vdev 1 returns the wrong data, ZFS just fixes the data on that physical vdev, right? Wrong… and that was my wrong perception. ;-)
The ZFS RAIDZ virtual vdev does not know which physical vdev (disk) returned the wrong data. And here I think there is a great level of abstraction that shows the beauty of ZFS… because the filesystems are there (on the physical vdevs), but there is no explicit relation! A filesystem block has nothing to do with a disk block. So the checksum of the data block is not kept at the physical vdev level, and ZFS cannot know directly which disk returned the wrong data without a “combinatorial reconstruction” to identify the culprit. From vdev_raidz.c:
    static void
    vdev_raidz_io_done(zio_t *zio)
    {
    ...
            /*
             * If the number of errors we saw was correctable -- less than or equal
             * to the number of parity disks read -- attempt to produce data that
             * has a valid checksum. Naturally, this case applies in the absence of
             * any errors.
             */
    ...
That gives a good understanding of the design of ZFS. I really like this way of solving problems, with specialized parts like this one. Somebody might think that this behaviour is not optimal. But remember that this is something that should not happen all the time.
With a mirror we have a whole different situation, because every device holds all the data, so ZFS can verify the checksum and, on a mismatch, read the other vdevs looking for the right answer. Remember that we can have an n-way mirror…
In the source we can see that a normal read goes to just one device:
    static int
    vdev_mirror_io_start(zio_t *zio)
    {
    ...
            /*
             * For normal reads just pick one child.
             */
            c = vdev_mirror_child_select(zio);
            children = (c >= 0);
    ...
So, ZFS knows whether this data is OK or not, and if it is not, it can fix it. But it does that without knowing about a “disk” as such, only about a physical vdev. ;-) The procedure is the same as for RAIDZ, minus the combinatorial reconstruction. And as a final note, the resilver of a block is not copy-on-write, so in the code we have a comment about it:
    /*
     * Don't rewrite known good children.
     * Not only is it unnecessary, it could
     * actually be harmful: if the system lost
     * power while rewriting the only good copy,
     * there would be no good copies left!
     */
So the physical vdev that has a good copy is not touched.
Since we need to see it to believe it…
    mkfile 100m /var/fakedevices/disk1
    mkfile 100m /var/fakedevices/disk2
    zpool create cow mirror /var/fakedevices/disk1 /var/fakedevices/disk2
    zfs create cow/fs01
    cp -pRf /etc/mail/sendmail.cf /cow/fs01/
    ls -i /cow/fs01/
    4 sendmail.cf
    zdb -dddddd cow/fs01 4
    Dataset cow/fs01 [ZPL], ID 30, cr_txg 15, 58.5K, 5 objects, rootbp [L0 DMU objset] \\
    400L/200P DVA[0]=<0:21200:200> DVA[1]=<0:1218c00:200> fletcher4 lzjb LE contiguous \\
    birth=84 fill=5 cksum=99a40530b:410673cd31e:df83eb73e794:207fa6d2b71da7

        Object  lvl   iblk   dblk  lsize  asize  type
             4    1    16K  39.5K  39.5K  39.5K  ZFS plain file (K=inherit) (Z=inherit)
                                     264  bonus  ZFS znode
            path    /sendmail.cf
            uid     0
            gid     2
            atime   Mon Jul 13 19:01:42 2009
            mtime   Wed Nov 19 22:35:39 2008
            ctime   Mon Jul 13 18:30:19 2009
            crtime  Mon Jul 13 18:30:19 2009
            gen     17
            mode    100444
            size    40127
            parent  3
            links   1
            xattr   0
            rdev    0x0000000000000000
    Indirect blocks:
                   0 L0 0:11200:9e00 9e00L/9e00P F=1 B=17

                    segment [0000000000000000, 0000000000009e00) size 39.5K
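So, we have a mirror of two disks, and a little file on it… let’s do a little math, and smash the data block on the first disk. The DVA offset (0x11200) is counted from the end of the 4MB (0x400000) reserved at the start of each vdev for the labels and the boot block, and the block is 0x9e00 bytes long, which is 79 sectors of 512 bytes: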
    zpool export cow
    perl -e "\$x = ((0x400000 + 0x11200) / 512); printf \"\$x\\n\";"
    dd if=/tmp/garbage.txt of=/var/disk1 bs=512 seek=8329 count=79 conv="nocreat,notrunc"
    zpool import -d /var/fakedevices/ cow
    cat /cow/fs01/sendmail.cf > /dev/null
    zpool status cow
      pool: cow
     state: ONLINE
    status: One or more devices has experienced an unrecoverable error. An
            attempt was made to correct the error. Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: https://www.sun.com/msg/ZFS-8000-9P
     scrub: none requested
    config:

            NAME            STATE     READ WRITE CKSUM
            cow             ONLINE       0     0     0
              mirror        ONLINE       0     0     0
                /var/disk1  ONLINE       0     0     1
                /var/disk2  ONLINE       0     0     0

    errors: No known data errors
Now let’s export the pool and read our data from the same offset in both disks:
    dd if=/var/fakedevices/disk1 of=/tmp/dump.txt bs=512 skip=8329 count=79
    dd if=/var/fakedevices/disk2 of=/tmp/dump2.txt bs=512 skip=8329 count=79
    diff /tmp/dump.txt /tmp/dump2.txt
    head /tmp/dump.txt
    #
    # Copyright (c) 1998-2004 Sendmail, Inc. and its suppliers.
    # All rights reserved.
    # Copyright (c) 1983, 1995 Eric P. Allman. All rights reserved.
    # Copyright (c) 1988, 1993
    # The Regents of the University of California. All rights reserved.
    #
    # Copyright 1993, 1997-2006 Sun Microsystems, Inc. All rights reserved.
    # Use is subject to license terms.
    #
    head /etc/mail/sendmail.cf
    #
    # Copyright (c) 1998-2004 Sendmail, Inc. and its suppliers.
    # All rights reserved.
    # Copyright (c) 1983, 1995 Eric P. Allman. All rights reserved.
    # Copyright (c) 1988, 1993
    # The Regents of the University of California. All rights reserved.
    #
    # Copyright 1993, 1997-2006 Sun Microsystems, Inc. All rights reserved.
    # Use is subject to license terms.
    #
So, never burn your physical vdevs, because you can (almost) always get some files back from them. Even if ZFS can’t. ;-)
peace
You’ve got great stuff here, and I’m trying to take it to the next level for data recovery/forensic work. Say I know I have corruption at a known physical offset on a known physical device. It may or may not be located within a file, but I want to fix it intelligently. How can I efficiently do a limited combinatorial reconstruction so I can either fix that area of the disk, or at least identify a file? The zpool scrub is wasteful, and perhaps zdb -c would be faster, but both seem like overkill if I can cross-reference to a file, or even enumerate the full directory tree of a pool?
Hello David!
Thanks, and I’m glad you found something useful on this site. ;-)
Take a look and see if this is what you are searching for:
https://www.solarisinternals.com/wiki/index.php/ZFS_forensics_scrollback_script
https://hub.opensolaris.org/bin/view/Project+forensics/ZFS-Forensics
I did not look at the script, but it seems to be the thing you are looking for. Anyway, if you are trying to understand the code that identifies file corruption on ZFS, you can look at the source code, as ZFS already has it.
Leal
I found the foundation of what I need to do in one of the proceedings ..
https://www.osdevcon.org/2008/files/osdevcon2008-max.pdf
If I read this right, I’ll have to brute force file-walk the entire directory tree to know for sure if a certain range of physical blocks is in use. Certainly it would have to be compiled instead of interpreted in a shell script, since I want to get the data in minutes, not hours. Since I also have to deal with the fact that I only know the true physical block numbers on the disks, then this does add a whole level of complexity that makes it something that I’m not comfortable doing.
Curious, let’s say that the code was magically written, any ideas on how long it would likely take to return the name of an affected file given just the physical device name and block number? Is the data organized on the HD in such a way that the directory tree could be parsed efficiently? Worst case scenario, are we talking about enduring a few GB of near random I/O, or would it actually be much worse?
I’m looking for some thoughts on whether it even could be written to find the file in a few minutes, or would it potentially take much longer.
Oh, RAID and filesystems. What a subject.
Open source OSes are great in how you can recover “things” or “lost files” when you know what you’re doing. Even in the worst case scenario you can have faith.
I wonder when compression at the fs level will be available in other filesystems besides ZFS.
Cheers!