ZFS Internals (part #1)
From the MANUAL page: The zdb command is used by support engineers to diagnose failures and gather statistics. Since the ZFS file system is always consistent on disk and is self-repairing, zdb should only be run under the direction of a support engineer.
Do we have a deal? ;-)
A few days ago I was trying to figure out how ZFS copy-on-write semantics really work and to understand the ZFS on-disk layout, and my friends were the zfsondiskformat specification, the source code, and zdb. I searched the web looking for something like what I'm writing here and could not find anything. That's why I'm writing this article, thinking it can be useful for somebody else.
Let's start with a two-disk pool (disk0 and disk1):
# mkfile 100m /var/fakedevices/disk0
# mkfile 100m /var/fakedevices/disk1
# zpool create cow /var/fakedevices/disk0 /var/fakedevices/disk1
# zfs create cow/fs01
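If you want to double-check the layout before going on, zpool status lists both file vdevs as top-level devices (output omitted here):

# zpool status cow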
The recordsize is the default (128K):
# zfs get recordsize cow/fs01
NAME      PROPERTY    VALUE    SOURCE
cow/fs01  recordsize  128K     default
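By the way, recordsize is an ordinary per-dataset property, so you could lower it before writing the test file if you wanted it split into more (smaller) FSBs; it only affects files written after the change. Just a sketch of the idea:

# zfs set recordsize=8K cow/fs01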
OK, we can use the THIRDPARTYLICENSEREADME.html file from “/opt/staroffice8/” as a good test file (size: 211045 bytes). First, we need the object ID (aka the inode number):
# ls -i /cow/fs01/
4 THIRDPARTYLICENSEREADME.html
Now the nice part…
# zdb -dddddd cow/fs01 4
... snipped ...
Indirect blocks:
     0 L1  0:9800:400 1:9800:400 4000L/400P F=2 B=190
     0 L0  1:40000:20000 20000L/20000P F=1 B=190
 20000 L0  0:40000:20000 20000L/20000P F=1 B=190

        segment [0000000000000000, 0000000001000000) size 16M
Now we need the concepts from the zfsondiskformat doc. Let's look at the first block line:
0 L1 0:9800:400 1:9800:400 4000L/400P F=2 B=190
The L1 means this block is one level of indirection above the data (the level counts how many block pointers need to be traversed to arrive at the data). The “0:9800:400” is: the device where this block lives (0 = /var/fakedevices/disk0), the offset from the beginning of the disk (0x9800), and the size of the block (0x400 = 1K), respectively. The second triple, “1:9800:400”, is a second copy of the same block on the other disk; ZFS keeps two copies of metadata blocks by default. So ZFS is using two disk blocks to hold the pointers to the file data…
P.S.: 0:9800 is Data Virtual Address 1 (DVA1), and 1:9800 is DVA2.
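The sizes and offsets are in hex, and the shell can do the conversions for you (a sketch; the 0x400000 adjustment is the 4MB of vdev labels and boot block that, per the on-disk spec, sit before the allocatable space on each vdev):

# echo $((0x400))
1024
# echo $((0x9800 + 0x400000))
4233216

So this 1K block sits roughly 4.2MB into the backing file of disk0 (and of disk1, for the second copy).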
At the end of the line there are two other important pieces of information: F=2 and B=190. The first is the fill count, which describes the number of non-zero block pointers under this block pointer. Remember, our file is greater than 128K (the default recordsize), so ZFS needs two filesystem blocks (FSBs) to hold it. The second is the birth time: the number (190) of the txg (transaction group) in which that block was written.
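A quick sanity check on that FSB count, with nothing but shell arithmetic: one 128K record holds 131072 bytes, and our file is 211045 bytes, so it cannot fit in one record:

# echo $((211045 / 131072)) $((211045 % 131072))
1 79973

One full record plus 79973 bytes spilling into a second one, hence the F=2 on the L1 line.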
Now, let's get our data! Looking at the second block line, we have:
0 L0 1:40000:20000 20000L/20000P F=1 B=190
Based on the zfsondiskformat doc, we know that L0 is the block level that holds the data (there can be up to six levels). At this level, the fill count has a slightly different interpretation: here F= just tells whether the block has data or not (0 or 1), unlike levels 1 and above, where it means “how many” non-zero block pointers sit under this block pointer. So we can see our data using zdb's -R option:
# zdb -R cow:1:40000:20000 | head -10
Found vdev: /var/fakedevices/disk1
cow:1:40000:20000
          0 1 2 3 4 5 6 7   8 9 a b c d e f  0123456789abcdef
000000:  505954434f44213c 50206c6d74682045  <!DOCTYPE html P
000010:  2d222043494c4255 442f2f4333572f2f  UBLIC "-//W3C//D
000020:  204c4d5448204454 6172542031302e34  TD HTML 4.01 Tra
000030:  616e6f697469736e 2220224e452f2f6c  nsitional//EN" "
000040:  772f2f3a70747468 726f2e33772e7777  http://www.w3.or
000050:  6d74682f52542f67 65736f6f6c2f346c  g/TR/html4/loose
That's nice! 16 bytes per line, and that is our file. Note that the hex columns are 64-bit little-endian words, so the ASCII bytes appear reversed within each word. Let's read it for real:
# zdb -R cow:1:40000:20000:r
... snipped ...
The intent of this document is to state the conditions under which
VIGRA may be copied, such that the author maintains some semblance
of artistic control over the development of the library, while
giving the users of the library the right to use and distribute
VIGRA in a more-or-less customary fashion, plus the right to
P.S.: Don't forget that this is just the first 128K of our file…
We can assemble the whole file like this:
# zdb -R cow:1:40000:20000:r 2> /tmp/file1.dump
# zdb -R cow:0:40000:20000:r 2> /tmp/file2.dump
# cat /tmp/file2.dump >> /tmp/file1.dump
# diff /tmp/file1.dump /cow/fs01/THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file1.dump
5032d5031
<
OK, that warning is something we can understand. But let's change something in that file to see copy-on-write in action... we will use vi to change the "END OF TERMS AND CONDITIONS" line (four lines before the EOF) to "FIM OF TERMS AND CONDITIONS".
# vi THIRDPARTYLICENSEREADME.html
# zdb -dddddd cow/fs01 4
... snipped ...
Indirect blocks:
     0 L1  0:1205800:400 1:b400:400 4000L/400P F=2 B=1211
     0 L0  0:60000:20000 20000L/20000P F=1 B=1211
 20000 L0  0:1220000:20000 20000L/20000P F=1 B=1211

        segment [0000000000000000, 0000000001000000) size 16M
All blocks were reallocated! The L1, and the two L0 (data blocks). That is a little strange... I was expecting to see all the block pointers reallocated (metadata), plus only the data block holding the bytes I changed. The first data block, which holds the first 128K of our file, has moved from the second device (1) to the first (0), and the second block is still on the first device (0), but at another location. We can be sure by looking at the new offsets and the new birth txg (B=1211). Let's see our data again, getting it from the new locations:
# zdb -R cow:0:60000:20000:r 2> /tmp/file3.dump
# zdb -R cow:0:1220000:20000:r 2> /tmp/file4.dump
# cat /tmp/file4.dump >> /tmp/file3.dump
# diff /tmp/file3.dump THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file3.dump
5032d5031
<
OK, and the old blocks: are they still there?
# zdb -R cow:1:40000:20000:r 2> /tmp/file1.dump
# zdb -R cow:0:40000:20000:r 2> /tmp/file2.dump
# cat /tmp/file2.dump >> /tmp/file1.dump
# diff /tmp/file1.dump THIRDPARTYLICENSEREADME.html
Warning: missing newline at end of file /tmp/file1.dump
5027c5027
< END OF TERMS AND CONDITIONS
---
> FIM OF TERMS AND CONDITIONS
5032d5031
<
Really nice! In our test, ZFS copy-on-write moved the whole file from one region of the disk to another. But what if we were talking about a really big file, say 1GB? Many 128K data blocks, and just a 1K change: would ZFS copy-on-write reallocate all the data blocks too? And why did ZFS reallocate the "untouched" block in our example (the first L0 data block)?
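Here is a sketch of the experiment I have in mind (the file name and the <object-id> placeholder are hypothetical; you would get the real object ID from ls -i, and dd's conv=notrunc rewrites one record without truncating the rest of the file):

# mkfile 1m /cow/fs01/bigfile
# sync
# zdb -dddddd cow/fs01 <object-id> > /tmp/big.before
# dd if=/dev/urandom of=/cow/fs01/bigfile bs=128k count=1 seek=4 conv=notrunc
# sync
# zdb -dddddd cow/fs01 <object-id> > /tmp/big.after
# diff /tmp/big.before /tmp/big.after

Comparing the B= birth txgs before and after should tell us whether only the touched L0 block (and the metadata above it) moved.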
Something to look at another time. Stay tuned... ;-)
peace.
Nice writeup, clear and easy to understand your point!
I’m interested to read your findings with larger files. I agree that the reallocation of unchanged FSBs is unexpected and am curious to understand why.
Absolutely excellent work!!!
Hello
Your article is very interesting, since very little information is available about the zdb command!
I have two remarks about your test:
First, in my opinion, for the COW mechanism to take place, you have to take a snapshot of the filesystem:

zfs snapshot cow@test_snap
Second, if you use vi to modify the file, the entire file will be written to disk, so won't all the blocks be reallocated?
If you really want to see only the second block reallocated, you have to copy the file under another name, edit the copied file, and then do a

rsync --inplace newfile oldfile
Hello guy! Good points... but vi (like any software, I think) has to read the whole file to update it. Imagine a 1GB mailbox file: to append a new message there is no need to read the whole file (append), but what about a change in the middle of the file, because of a user edit (webmail access)?
I think there is no way, in a situation like that, to update a file without reading it. ;-)
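One could actually watch this with truss (just a sketch; the output path is made up):

# truss -t open,read,write -o /tmp/vi.truss vi /cow/fs01/THIRDPARTYLICENSEREADME.html
# grep -c '^write(' /tmp/vi.truss

If the editor rewrites the whole file on save, the write() calls will add up to the full 211045 bytes, not just the changed line.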
I think this is important to know when configuring a ZFS server for mail storage. It is the old maildir/mailbox discussion, now with the ZFS COW variable: is it better to handle thousands of users with many small files on ZFS, or to have a few multi-gigabyte files rewritten over and over again?
ZFS probably tries to preserve good readahead performance by avoiding unnecessary fragmentation. Your 2 FSBs were probably reallocated together because it was only 2 blocks: minimal IO impact, and there is benefit to keeping the file's blocks together. Try a larger file, like 1MB, change 1 block, and I suspect it will only reallocate the changed block. It is probably the case that files smaller than the recordsize get reallocated entirely, and any larger files only have their changed blocks reallocated. This is only speculation, however. :)
Eric is right.
vi is not a mail client. It is a very ancient program.
It may be sensible for a mail client to keep the mail file open and modify only the parts of the mail file that need to change.
But vi doesn't care. It writes the entire file when you exit or run the :w command.
But the copy-on-write mechanism has nothing to do with that.
It is designed for taking fast snapshots of a filesystem (for backup purposes, for instance). The feature already exists in VxFS (and in UFS).
When you take a snapshot of a filesystem, each time you try to write a file block, ZFS reads the old block and saves it in a backing store (hence copy-on-write).
When you read the snapshot, you access the old blocks if they were modified, and the current blocks if they didn't change.
The big difference between VxFS snapshots and ZFS snapshots is that ZFS keeps its backing store inside the pool, and you don't have to allocate a specific size. It's a bit like the checkpoint feature of VxFS.
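For example, the idea is easy to see with the pool from the article (a sketch; the snapshot name is made up):

# zfs snapshot cow/fs01@before
# vi /cow/fs01/THIRDPARTYLICENSEREADME.html
# ls /cow/fs01/.zfs/snapshot/before/

The pre-edit version of the file stays readable under .zfs/snapshot.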
But you should post a question on opensolaris.org/os; the Sun Microsystems people will explain it better than I can.
Hello Guy!
I think you are misunderstanding the ZFS “cow” feature. The copy-on-write is not related to the snapshot feature. COW is used to guarantee filesystem consistency without “fsck”, so there is no chance of leaving the filesystem in an inconsistent state. For that, ZFS *never* overwrites a block: whenever ZFS needs to write something, “data” or “metadata”, it writes it to a brand new location. Here in my example, because my test was with a “flat” file (not structured), vi (and not because it is an ancient program ;) needed to rewrite the whole file to update a little part of it. You are right that this has nothing to do with COW. But I did show that COW takes place in “data” updates too. And sometimes we forget that flat files are updated like that…
Thanks a lot for your comments!
Leal.
Great site and good writeup. Perhaps the best thing to consider is that instead of referring to ZFS as COW (Copy on Write), Sun called it ROW (Relocate on Write).