ZFS Internals (part #7)
From the MANUAL page: The zdb command is used by support engineers to diagnose failures and gather statistics. Since the ZFS file system is always consistent on disk and is self-repairing, zdb should only be run under the direction of a support engineer.
Ok, we talked about dva's in the first and second posts, and that is really the main concept behind block pointers and indirect blocks in ZFS. The dva (data virtual address) is a combination of a physical vdev and an offset. By default, ZFS stores one DVA for user data, two DVAs for filesystem metadata, and three DVAs for metadata that's global across all filesystems in the storage pool, so any block pointer in a ZFS filesystem has at least two copies. If you look at the link above (the great post about Ditto Blocks), you will see that the ZFS engineers were thinking about adding the same feature for data blocks, and here you can see that it is already there.
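By the way, the knob that brings ditto blocks to user data is the copies property of the dataset. A minimal sketch, using the same dataset names of this post (and note that the setting only applies to blocks written after the change):

# zfs get copies cow/fs01
# zfs set copies=2 cow/fs01   # from now on, plain data blocks get two DVAs too

ZFS will try to place the extra copy on a different vdev, just like it already does for the metadata ditto blocks.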
# ls -li /cow/fs01/
4 file.txt
# md5sum /cow/fs01/file.txt
6bf2b8ab46eaa33b50b7b3719ff81c8c  /cow/fs01/file.txt
So, let’s smash some ditto blocks, and see some ZFS action…
# zdb -dddddd cow/fs01 4
... snipped ...
Indirect blocks:
               0 L1  1:11000:400 0:11000:400 4000L/400P F=2 B=8
               0 L0  1:20000:20000 20000L/20000P F=1 B=8
           20000 L0  1:40000:20000 20000L/20000P F=1 B=8

                segment [0000000000000000, 0000000000040000) size  256K
You can see above the two dva's for the L1 (level 1) block pointer: 0:11000 and 1:11000.
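Just to decode those numbers: in a dva like 0:11000:400, the first field is the vdev, the second is the offset (in hex) inside that vdev, and the third is the allocated size (0x400 = 1K, which is why 1K is enough to smash one copy). The offset is counted after the 4MB reserved at the start of the vdev for the two front labels plus the boot block, so finding the sector inside the flat file that backs the vdev is simple arithmetic (the same account we will use below):

# perl -e 'printf "%d\n", (0x400000 + 0x11000) / 512;'   # 4MB label/boot area + dva offset, in 512-byte sectors
8328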
# zpool status cow pool: cow state: ONLINE scrub: none requested config: NAME STATE READ WRITE CKSUM cow ONLINE 0 0 0 /var/fakedevices/disk0 ONLINE 0 0 0 /var/fakedevices/disk1 ONLINE 0 0 0 errors: No known data errors
Ok, now we need to export the pool and simulate a silent data corruption on the first vdev, one that wipes the first copy of that block pointer (1K). We will write 1K of zeros where one bp copy should be.
# zpool export cow
# perl -e "\$x = ((0x400000 + 0x11000) / 512); printf \"\$x\\n\";"
8328
# dd if=/var/fakedevices/disk0 of=/tmp/fs01-part1 bs=512 count=8328
# dd if=/var/fakedevices/disk0 of=/tmp/firstbp bs=512 iseek=8328 count=2
# dd if=/var/fakedevices/disk0 of=/tmp/fs01-part2 bs=512 skip=8330
# dd if=/dev/zero of=/tmp/payload bs=1024 count=1
# cp -pRf /tmp/fs01-part1 /var/fakedevices/disk0
# cat /tmp/payload >> /var/fakedevices/disk0
# cat /tmp/fs01-part2 >> /var/fakedevices/disk0
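Splitting the backing file in three parts and gluing it back works fine, but just as a side note (a sketch, assuming a dd that supports conv=notrunc), the same 1K of zeros could be written in place with a single command:

# dd if=/dev/zero of=/var/fakedevices/disk0 bs=512 seek=8328 count=2 conv=notrunc

The conv=notrunc is what keeps dd from truncating the file right after the two overwritten sectors.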
That’s it, our two disks are there, with the first one corrupted.
# zpool import -d /var/fakedevices/ cow
# zpool status cow
  pool: cow
 state: ONLINE
 scrub: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        cow                         ONLINE       0     0     0
          /var/fakedevices/disk0    ONLINE       0     0     0
          /var/fakedevices/disk1    ONLINE       0     0     0

errors: No known data errors
Well, as always the import procedure was fine, and the pool seems to be perfect. Let’s see the md5 of our file:
# md5sum /cow/fs01/file.txt
6bf2b8ab46eaa33b50b7b3719ff81c8c  /cow/fs01/file.txt
Good too! So ZFS does not know about the corrupted pointer? Let’s look again…
# zpool status
  pool: cow
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        cow                         ONLINE       0     0     0
          /var/fakedevices/disk0    ONLINE       0     0     1
          /var/fakedevices/disk1    ONLINE       0     0     0
As always, we need to actually access the data for ZFS to identify the error. So, following the ZFS concept, and the message above, the first dva must be as good as it was before at this point. Let’s look..
# zpool export cow
# dd if=/var/fakedevices/disk0 of=/tmp/firstbp bs=512 iseek=8328 count=2
# dd if=/var/fakedevices/disk1 of=/tmp/secondbp bs=512 iseek=8328 count=2
# md5sum /tmp/*bp
70f12e12d451ba5d32b563d4a05915e1  /tmp/firstbp
70f12e12d451ba5d32b563d4a05915e1  /tmp/secondbp
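Just as a hedged alternative to the raw dd above: zdb also has a -R option to read a block straight from a vdev:offset:size triple, without exporting the pool (I’m assuming here the usual syntax, which may vary between zdb versions):

# zdb -R cow 0:11000:400   # first copy, on vdev 0
# zdb -R cow 1:11000:400   # second copy, on vdev 1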
;-)
But something about the ZFS behaviour confuses me, because I get better information about what was fixed if I execute a scrub first. Look (executing a scrub just after importing the pool):
# zpool scrub cow
# zpool status cow
  pool: cow
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Sun Dec 28 17:52:59 2008
config:

        NAME                        STATE     READ WRITE CKSUM
        cow                         ONLINE       0     0     0
          /var/fakedevices/disk0    ONLINE       0     0     1  1K repaired
          /var/fakedevices/disk1    ONLINE       0     0     0

errors: No known data errors
Excellent! 1K is exactly the amount of bad data we injected into the pool (and the only difference between one output and the other)! But the automatic repair did not show us that, and it does not matter if we execute a scrub afterwards. I think the information is simply not gathered during the automatic resilver (the self-heal on read), which is something the scrub process does do. If there is just one code path to fix a bad block, both scenarios should give the same information, right? So, being very simplistic (and speculating a lot, without looking at the code), it seems like there are two procedures, or a little bug…
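A hedged way to compare what each path actually recorded, without going to the source, is the FMA error log, since checksum errors detected by the read path and by the scrub should both end up there as ereports:

# fmdump -eV | grep -i checksum   # look for ereport.fs.zfs.checksum events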
Well, I will try to check that in the code, or use DTrace to look at the stack traces for one process and the other, when I get some time. But I’m sure the readers know the answer, and I will not need to…
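If someone wants to beat me to it, something like this one-liner is what I have in mind (just a sketch: I’m assuming the fbt provider is available and that zio_checksum_error() is the function that flags a bad copy in this build):

# dtrace -n 'fbt:zfs:zio_checksum_error:return /arg1 != 0/ { @[stack()] = count(); }'

Running that during a normal read of the damaged file, and then again during a scrub, should show whether the repair goes through the same stack or not.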
;-)
peace