ZFS Internals (part #7)
From the MANUAL page: The zdb command is used by support engineers to diagnose failures and gather statistics. Since the ZFS file system is always consistent on disk and is self-repairing, zdb should only be run under the direction of a support engineer.
Ok, we talked about dva's in the first and second posts, and that is really the main concept behind block pointers and indirect blocks in ZFS. The dva (data virtual address) is a combination of a physical vdev and an offset. By default, ZFS stores one DVA for user data, two DVAs for filesystem metadata, and three DVAs for metadata that's global across all filesystems in the storage pool, so any block pointer in a ZFS filesystem has at least two copies. If you look at the link above (the great post about Ditto Blocks), you will see that the ZFS engineers were thinking about adding the same feature for data blocks, and here you can see that it is already there.
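By the way, the knob that brings ditto blocks to user data is the copies property of the dataset. A minimal sketch, using the same dataset names of this post (and note that the setting only applies to blocks written after the change):

# zfs get copies cow/fs01
# zfs set copies=2 cow/fs01   # from now on, plain data blocks get two DVAs too

ZFS will try to place the extra copy on a different vdev, just like it already does for the metadata ditto blocks.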
# ls -li /cow/fs01/
4 file.txt
# md5sum /cow/fs01/file.txt
6bf2b8ab46eaa33b50b7b3719ff81c8c  /cow/fs01/file.txt
So, let’s smash some ditto blocks, and see some ZFS action…
# zdb -dddddd cow/fs01 4
... snipped ...
Indirect blocks:
               0 L1  1:11000:400 0:11000:400 4000L/400P F=2 B=8
               0 L0  1:20000:20000 20000L/20000P F=1 B=8
           20000 L0  1:40000:20000 20000L/20000P F=1 B=8

                segment [0000000000000000, 0000000000040000) size  256K
You can see above the two dva's for the L1 (level 1) block pointer: 0:11000 and 1:11000.
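Just to decode those numbers: in a dva like 0:11000:400, the first field is the vdev, the second is the offset (in hex) inside that vdev, and the third is the allocated size (0x400 = 1K, which is why 1K is enough to smash one copy). The offset is counted after the 4MB reserved at the start of the vdev for the two front labels plus the boot block, so finding the sector inside the flat file that backs the vdev is simple arithmetic (the same account we will use below):

# perl -e 'printf "%d\n", (0x400000 + 0x11000) / 512;'   # 4MB label/boot area + dva offset, in 512-byte sectors
8328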
# zpool status cow pool: cow state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM cow ONLINE 0 0 0 /var/fakedevices/disk0 ONLINE 0 0 0 /var/fakedevices/disk1 ONLINE 0 0 0 errors: No known data errors
Ok, now we need to export the pool and simulate a silent data corruption on the first vdev, one that wipes the first copy of that block pointer (1K). We will write 1K of zeros where one bp copy should be.
# zpool export cow
# perl -e "\$x = ((0x400000 + 0x11000) / 512); printf \"\$x\\n\";"
8328
# dd if=/var/fakedevices/disk0 of=/tmp/fs01-part1 bs=512 count=8328
# dd if=/var/fakedevices/disk0 of=/tmp/firstbp bs=512 iseek=8328 count=2
# dd if=/var/fakedevices/disk0 of=/tmp/fs01-part2 bs=512 skip=8330
# dd if=/dev/zero of=/tmp/payload bs=1024 count=1
# cp -pRf /tmp/fs01-part1 /var/fakedevices/disk0
# cat /tmp/payload >> /var/fakedevices/disk0
# cat /tmp/fs01-part2 >> /var/fakedevices/disk0
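Splitting the backing file in three parts and gluing it back works fine, but just as a side note (a sketch, assuming a dd that supports conv=notrunc), the same 1K of zeros could be written in place with a single command:

# dd if=/dev/zero of=/var/fakedevices/disk0 bs=512 seek=8328 count=2 conv=notrunc

The conv=notrunc is what keeps dd from truncating the file right after the two overwritten sectors.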
That’s it, our two disks are there, with the first one corrupted.
# zpool import -d /var/fakedevices/ cow
# zpool status cow
  pool: cow
 state: ONLINE
 scrub: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        cow                         ONLINE       0     0     0
          /var/fakedevices/disk0    ONLINE       0     0     0
          /var/fakedevices/disk1    ONLINE       0     0     0

errors: No known data errors
Well, as always the import procedure was fine, and the pool seems to be perfect. Let’s see the md5 of our file:
# md5sum /cow/fs01/file.txt
6bf2b8ab46eaa33b50b7b3719ff81c8c  /cow/fs01/file.txt
Good too! So ZFS does not know about the corrupted pointer? Let’s look again…
# zpool status
  pool: cow
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        cow                         ONLINE       0     0     0
          /var/fakedevices/disk0    ONLINE       0     0     1
          /var/fakedevices/disk1    ONLINE       0     0     0
As always, we need to actually access the data for ZFS to identify the error. So, following the ZFS concept, and the message above, the first dva must be as good as it was before at this point. Let’s look..
# zpool export cow
# dd if=/var/fakedevices/disk0 of=/tmp/firstbp bs=512 iseek=8328 count=2
# dd if=/var/fakedevices/disk1 of=/tmp/secondbp bs=512 iseek=8328 count=2
# md5sum /tmp/*bp
70f12e12d451ba5d32b563d4a05915e1  /tmp/firstbp
70f12e12d451ba5d32b563d4a05915e1  /tmp/secondbp
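Just as a hedged alternative to the raw dd above: zdb also has a -R option to read a block straight from a vdev:offset:size triple, without exporting the pool (I’m assuming here the usual syntax, which may vary between zdb versions):

# zdb -R cow 0:11000:400   # first copy, on vdev 0
# zdb -R cow 1:11000:400   # second copy, on vdev 1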
;-)
But something about the ZFS behaviour confuses me, because I get better information about what was fixed if I execute a scrub first. Look (executing a scrub just after importing the pool):
# zpool scrub cow
# zpool status cow
  pool: cow
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Sun Dec 28 17:52:59 2008
config:

        NAME                        STATE     READ WRITE CKSUM
        cow                         ONLINE       0     0     0
          /var/fakedevices/disk0    ONLINE       0     0     1  1K repaired
          /var/fakedevices/disk1    ONLINE       0     0     0

errors: No known data errors
Excellent! 1K is exactly the amount of bad data we injected into the pool (and the only difference between one output and the other)! But the automatic repair did not show us that, and it does not matter if we execute a scrub afterwards. I think the information is simply not gathered during the automatic resilver (the self-heal on read), which is something the scrub process does do. If there is just one code path to fix a bad block, both scenarios should give the same information, right? So, being very simplistic (and speculating a lot, without looking at the code), it seems like there are two procedures, or a little bug…
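A hedged way to compare what each path actually recorded, without going to the source, is the FMA error log, since checksum errors detected by the read path and by the scrub should both end up there as ereports:

# fmdump -eV | grep -i checksum   # look for ereport.fs.zfs.checksum events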
Well, I will try to check that in the code, or use DTrace to look at the stack traces for one process and the other, when I get some time. But I’m sure the readers know the answer, and I will not need to…
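If someone wants to beat me to it, something like this one-liner is what I have in mind (just a sketch: I’m assuming the fbt provider is available and that zio_checksum_error() is the function that flags a bad copy in this build):

# dtrace -n 'fbt:zfs:zio_checksum_error:return /arg1 != 0/ { @[stack()] = count(); }'

Running that during a normal read of the damaged file, and then again during a scrub, should show whether the repair goes through the same stack or not.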
;-)
peace