ZFS Internals (part #1)
From the MANUAL page: The zdb command is used by support engineers to diagnose failures and gather statistics. Since the ZFS file system is always consistent on disk and is self-repairing, zdb should only be run under the direction of a support engineer.
Do we have a deal? ;-)
A few days ago I was trying to figure out how ZFS copy-on-write semantics really work and to understand the ZFS on-disk layout, and my friends were the zfsondiskformat specification, the source code, and zdb. I searched the web looking for something like what I'm writing here and could not find anything. That's why I'm writing this article, thinking it can be useful for somebody else.
Let's start with a two-disk pool (disk0 and disk1):
# mkfile 100m /var/fakedevices/disk0
# mkfile 100m /var/fakedevices/disk1
# zpool create cow /var/fakedevices/disk0 /var/fakedevices/disk1
# zfs create cow/fs01
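If you want to double-check the layout before going on, zpool status lists both file vdevs as top-level devices (output omitted here):

# zpool status cow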
The recordsize is the default (128K):
# zfs get recordsize cow/fs01
NAME      PROPERTY    VALUE    SOURCE
cow/fs01  recordsize  128K     default
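By the way, recordsize is an ordinary per-dataset property, so you could lower it before writing the test file if you wanted it split into more (smaller) FSBs; it only affects files written after the change. Just a sketch of the idea:

# zfs set recordsize=8K cow/fs01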
OK, we can use the THIRDPARTYLICENSEREADME.html file from “/opt/staroffice8/” as a good test file (size: 211045 bytes). First, we need the object ID (aka the inode number):
# ls -i /cow/fs01/
4 THIRDPARTYLICENSEREADME.html
Now the nice part…
# zdb -dddddd cow/fs01 4
... snipped ...
Indirect blocks:
     0 L1  0:9800:400 1:9800:400 4000L/400P F=2 B=190
     0 L0  1:40000:20000 20000L/20000P F=1 B=190
 20000 L0  0:40000:20000 20000L/20000P F=1 B=190

        segment [0000000000000000, 0000000001000000) size 16M
Now we need the concepts from the zfsondiskformat doc. Let's look at the first block line:
0 L1 0:9800:400 1:9800:400 4000L/400P F=2 B=190
The L1 means this block is one level of indirection above the data (the level counts how many block pointers need to be traversed to arrive at the data). The “0:9800:400” is: the device where this block lives (0 = /var/fakedevices/disk0), the offset from the beginning of the disk (0x9800), and the size of the block (0x400 = 1K), respectively. The second triple, “1:9800:400”, is a second copy of the same block on the other disk; ZFS keeps two copies of metadata blocks by default. So ZFS is using two disk blocks to hold the pointers to the file data…
P.S.: 0:9800 is Data Virtual Address 1 (DVA1), and 1:9800 is DVA2.
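The sizes and offsets are in hex, and the shell can do the conversions for you (a sketch; the 0x400000 adjustment is the 4MB of vdev labels and boot block that, per the on-disk spec, sit before the allocatable space on each vdev):

# echo $((0x400))
1024
# echo $((0x9800 + 0x400000))
4233216

So this 1K block sits roughly 4.2MB into the backing file of disk0 (and of disk1, for the second copy).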
At the end of the line there are two other important pieces of information: F=2 and B=190. The first is the fill count, which describes the number of non-zero block pointers under this block pointer. Remember, our file is greater than 128K (the default recordsize), so ZFS needs two filesystem blocks (FSBs) to hold it. The second is the birth time: the number (190) of the txg (transaction group) in which that block was written.
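A quick sanity check on that FSB count, with nothing but shell arithmetic: one 128K record holds 131072 bytes, and our file is 211045 bytes, so it cannot fit in one record:

# echo $((211045 / 131072)) $((211045 % 131072))
1 79973

One full record plus 79973 bytes spilling into a second one, hence the F=2 on the L1 line.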
Now, let's get our data! Looking at the second block line, we have:
0 L0 1:40000:20000 20000L/20000P F=1 B=190
Based on the zfsondiskformat doc, we know that L0 is the block level that holds the data (there can be up to six levels). At this level, the fill count has a slightly different interpretation: here F= just tells whether the block has data or not (0 or 1), unlike levels 1 and above, where it means “how many” non-zero block pointers sit under this block pointer. So we can see our data using zdb's -R option:
# zdb -R cow:1:40000:20000 | head -10
Found vdev: /var/fakedevices/disk1
cow:1:40000:20000
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  505954434f44213c 50206c6d74682045  <!DOCTYPE html P
000010:  2d222043494c4255 442f2f4333572f2f  UBLIC "-//W3C//D
000020:  204c4d5448204454 6172542031302e34  TD HTML 4.01 Tra
000030:  616e6f697469736e 2220224e452f2f6c  nsitional//EN" "
000040:  772f2f3a70747468 726f2e33772e7777  http://www.w3.or
000050:  6d74682f52542f67 65736f6f6c2f346c  g/TR/html4/loose
That's nice! 16 bytes per line, and that is our file. Note that the hex columns are 64-bit little-endian words, so the ASCII bytes appear reversed within each word. Let's read it for real:
# zdb -R cow:1:40000:20000:r
... snipped ...
The intent of this document is to state the conditions under which
VIGRA may be copied, such that the author maintains some semblance
of artistic control over the development of the library, while
giving the users of the library the right to use and distribute
VIGRA in a more-or-less customary fashion, plus the right to
P.S.: Don't forget that this is just the first 128K of our file…
We can assemble the whole file like this:
# zdb -R cow:1:40000:20000:r 2> /tmp/file1.dump
# zdb -R cow:0:40000:20000:r 2> /tmp/file2.dump
# cat /tmp/file2.dump >> /tmp/file1.dump
# diff /tmp/file1.dump /cow/fs01/THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file1.dump
5032d5031
<
OK, that warning is something we can understand. But let's change something in that file to see copy-on-write in action... we will use vi to change the "END OF TERMS AND CONDITIONS" line (four lines before the EOF) to "FIM OF TERMS AND CONDITIONS".
# vi THIRDPARTYLICENSEREADME.html
# zdb -dddddd cow/fs01 4
... snipped ...
Indirect blocks:
     0 L1  0:1205800:400 1:b400:400 4000L/400P F=2 B=1211
     0 L0  0:60000:20000 20000L/20000P F=1 B=1211
 20000 L0  0:1220000:20000 20000L/20000P F=1 B=1211

        segment [0000000000000000, 0000000001000000) size 16M
All blocks were reallocated! The L1, and the two L0 (data blocks). That is a little strange... I was expecting to see all the block pointers reallocated (metadata), plus only the data block holding the bytes I changed. The first data block, which holds the first 128K of our file, has moved from the second device (1) to the first (0), and the second block is still on the first device (0), but at another location. We can be sure by looking at the new offsets and the new birth txg (B=1211). Let's see our data again, getting it from the new locations:
# zdb -R cow:0:60000:20000:r 2> /tmp/file3.dump
# zdb -R cow:0:1220000:20000:r 2> /tmp/file4.dump
# cat /tmp/file4.dump >> /tmp/file3.dump
# diff /tmp/file3.dump THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file3.dump
5032d5031
<
OK, and the old blocks: are they still there?
# zdb -R cow:1:40000:20000:r 2> /tmp/file1.dump
# zdb -R cow:0:40000:20000:r 2> /tmp/file2.dump
# cat /tmp/file2.dump >> /tmp/file1.dump
# diff /tmp/file1.dump THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file1.dump
5027c5027
< END OF TERMS AND CONDITIONS
---
> FIM OF TERMS AND CONDITIONS
5032d5031
<
Really nice! In our test, ZFS copy-on-write moved the whole file from one region of the disk to another. But what if we were talking about a really big file, say 1GB? Many 128K data blocks, and just a 1K change: would ZFS copy-on-write reallocate all the data blocks too? And why did ZFS reallocate the "untouched" block in our example (the first L0 data block)?
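Here is a sketch of the experiment I have in mind (the file name and the <object-id> placeholder are hypothetical; you would get the real object ID from ls -i, and dd's conv=notrunc rewrites one record without truncating the rest of the file):

# mkfile 1m /cow/fs01/bigfile
# sync
# zdb -dddddd cow/fs01 <object-id> > /tmp/big.before
# dd if=/dev/urandom of=/cow/fs01/bigfile bs=128k count=1 seek=4 conv=notrunc
# sync
# zdb -dddddd cow/fs01 <object-id> > /tmp/big.after
# diff /tmp/big.before /tmp/big.after

Comparing the B= birth txgs before and after should tell us whether only the touched L0 block (and the metadata above it) moved.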
Something to look at another time. Stay tuned... ;-)
peace.
Nice writeup, clear and easy to understand your point!
I’m interested to read your findings with larger files. I agree that the reallocation of unchanged FSBs is unexpected and am curious to understand why.
Absolutely excellent work!!!
Hello
Your article is very interesting, since very little information is available about the zdb command!
I have two remarks about your test:
First, in my opinion, for the COW mechanism to take place, you have to take a snapshot of the filesystem:

zfs snapshot cow@test_snap
Second, if you use vi to modify the file, the entire file will be written to disk, so won't all the blocks be reallocated?
If you really want to see only the second block reallocated, you have to copy the file under another name, edit the copied file, and then do a

rsync --inplace newfile oldfile
Hello guy! Good points... but vi (like any software, I think) has to read the whole file to update it. Imagine a 1GB mailbox file: to append a new message there is no need to read the whole file (append), but what about a change in the middle of the file, because of a user edit (webmail access)?
I think there is no way, in a situation like that, to update a file without reading it. ;-)
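One could actually watch this with truss (just a sketch; the output path is made up):

# truss -t open,read,write -o /tmp/vi.truss vi /cow/fs01/THIRDPARTYLICENSEREADME.html
# grep -c '^write(' /tmp/vi.truss

If the editor rewrites the whole file on save, the write() calls will add up to the full 211045 bytes, not just the changed line.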
I think this is important to know when configuring a ZFS server for mail storage. It is the old maildir/mailbox discussion, now with the ZFS COW variable: is it better to handle thousands of users with many small files on ZFS, or to have a few multi-gigabyte files rewritten over and over again?
ZFS probably tries to preserve good readahead performance by avoiding unnecessary fragmentation. Your 2 FSBs were probably reallocated together because it was only 2 blocks: minimal IO impact, and there is benefit to keeping the file's blocks together. Try a larger file, like 1MB, change 1 block, and I suspect it will only reallocate the changed block. It is probably the case that files smaller than the recordsize get reallocated entirely, and any larger files only have their changed blocks reallocated. This is only speculation, however. :)
Eric is right.
vi is not a mail client. It is a very ancient program.
It may be sensible for a mail client to keep the mail file open and modify only the parts of the mail file that need to change.
But vi doesn't care. It writes the entire file when you exit or run the :w command.
But the copy-on-write mechanism has nothing to do with that.
It is designed for taking fast snapshots of a filesystem (for backup purposes, for instance). The feature already exists in VxFS (and in UFS).
When you take a snapshot of a filesystem, each time you try to write a file block, ZFS reads the old block and saves it in a backing store (hence copy-on-write).
When you read the snapshot, you access the old blocks if they were modified, and the current blocks if they didn't change.
The big difference between VxFS snapshots and ZFS snapshots is that ZFS keeps its backing store inside the pool, and you don't have to allocate a specific size. It's a bit like the checkpoint feature of VxFS.
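For example, the idea is easy to see with the pool from the article (a sketch; the snapshot name is made up):

# zfs snapshot cow/fs01@before
# vi /cow/fs01/THIRDPARTYLICENSEREADME.html
# ls /cow/fs01/.zfs/snapshot/before/

The pre-edit version of the file stays readable under .zfs/snapshot.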
But you should post a question on opensolaris.org/os; the Sun Microsystems people will explain it better than I can.
Hello Guy!
I think you are misunderstanding the ZFS “cow” feature. The copy-on-write is not related to the snapshot feature. COW is used to guarantee filesystem consistency without “fsck”, so there is no chance of leaving the filesystem in an inconsistent state. For that, ZFS *never* overwrites a block: whenever ZFS needs to write something, “data” or “metadata”, it writes it to a brand new location. Here in my example, because my test was with a “flat” file (not structured), vi (and not because it is an ancient program ;) needed to rewrite the whole file to update a little part of it. You are right that this has nothing to do with COW. But I did show that COW takes place in “data” updates too. And sometimes we forget that flat files are updated like that…
Thanks a lot for your comments!
Leal.
Great site and good writeup. Perhaps the best thing to consider is that instead of referring to ZFS as COW (Copy on Write), Sun called it ROW (Relocate on Write).