ZFS Internals (part #6)
From the MANUAL page: The zdb command is used by support engineers to diagnose failures and gather statistics. Since the ZFS file system is always consistent on disk and is self-repairing, zdb should only be run under the direction of a support engineer.
In this post let’s look at an important feature of the ZFS filesystem that I think is not well understood. I will try to show some aspects of it, and my understanding of this technology. I’m talking about the ZFS Intent Log.
Actually, the first distinction the development team wants us to understand is between the two pieces:
– ZIL: the code.
– ZFS intent log: the log itself (which can be on a separate device, hence “slog”).
The second point I think is important to note is the purpose of this ZFS feature. Every filesystem I know of that uses some logging technology uses it to maintain filesystem consistency (making the fsck job faster). ZFS does not…
ZFS is always consistent on disk (a transactional, all-or-nothing model), so there is no need for fsck, and if you find some inconsistency (I did find one), it’s a bug. ;-) The copy-on-write semantics are what guarantee the always-consistent state of ZFS. I talked about it in my earlier posts about ZFS Internals; even the name of the zpool used in the examples was cow. ;-)
So, why does ZFS have a log feature (or a slog device)? Performance is the answer!
A primary goal of ZFS development was consistency, and honoring all filesystem operations. That seems trivial, but some filesystems (and even some disks) do not.
So, for ZFS, a sync request is a sync done.
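To make that concrete: from an application’s point of view, a synchronous write is one that may only return once the data is on stable storage, typically requested with fsync(2) or by opening the file with O_DSYNC. Here is a minimal sketch in Python (the file names and function names are made up for the example; the os module just exposes the POSIX calls):

```python
import os

def write_sync(path, data):
    """Write data and force it to stable storage before returning.

    The fsync() is what makes this a synchronous request: on ZFS,
    the call may only return once the data is safe -- "a sync
    request is a sync done".
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # do not return until the data is durable
    finally:
        os.close(fd)

def write_async(path, data):
    """Write data and return immediately; the OS may keep it in RAM
    and flush it to disk later, together with other writes."""
    with open(path, "wb") as f:
        f.write(data)  # no fsync: this write can wait for the truck

# Hypothetical usage:
write_sync("/tmp/tv.bin", b"the TV cannot wait")
write_async("/tmp/bed.bin", b"the bed can go with the next truck")
```

An application opening its files with O_DSYNC gets the first behavior on every write(2), without an explicit fsync.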
For ZFS, a perfect world would be:
+----------------+              +--------------------------------+
| write requests | -----------> | ZFS writes to the RAM's buffer |
+----------------+              +--------------------------------+
So, when the RAM’s buffer is full:
+--------------------------+              +------------------------+
| ZFS RAM's buffer is full | -----------> | ZFS writes to the pool |
+--------------------------+              +------------------------+
To understand this, imagine you are moving your residence from one city to another. You have the TV, bed, computer desk, etc. as the “write requests”, the truck as the RAM’s buffer, and the other house as the disks (the ZFS pool). The best approach is to have sufficient RAM (a big truck), put all your stuff into it, and transfer the whole thing at once.
But you know, the world is not perfect…
The writes that can wait for the others (all the stuff going to the new house together) are asynchronous writes. But there are some that cannot wait, and need to be written as fast as possible. Those are the synchronous writes, the filesystem’s nightmare, and that is where the ZFS intent log comes in.
If we continue with our example, we would have two (or more) trucks: one going to the new city full, and others carrying just one request each (a TV, for example). Imagine the situation: the bed, computer, etc. are loaded into one truck, but the TV needs to go now! OK, so send a special truck just for it… And before the TV truck returns, we have another TV that cannot wait either (another truck).
So the ZFS intent log comes in to address that problem. *All* synchronous requests are written to the ZFS intent log instead of going straight to the disks (the main pool).
ps.: Actually, not *all* sync requests are written to the ZFS intent log. I will talk about that in another post…
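The whole flow can be sketched as a toy model. This is only an illustration of the idea, not the real ZIL code: the class and file names are made up, and plain files stand in for the log device and the main pool:

```python
import os

class ToyIntentLog:
    """Toy model of the write flow: every write lands in a RAM buffer;
    synchronous writes are *also* appended to an intent log and
    fsync'ed before the client is acknowledged. The buffer is flushed
    to the main pool in one batch when it fills up."""

    def __init__(self, log_path, pool_path, buffer_limit=4):
        self.log_path = log_path
        self.pool_path = pool_path
        self.buffer_limit = buffer_limit
        self.ram_buffer = []

    def write(self, record, sync=False):
        self.ram_buffer.append(record)  # every write lands in RAM
        if sync:
            # The sync request is durable (and ack'able) right here,
            # long before the buffer is flushed to the main pool.
            fd = os.open(self.log_path,
                         os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
            try:
                os.write(fd, record + b"\n")
                os.fsync(fd)
            finally:
                os.close(fd)
        if len(self.ram_buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        """The batch commit: write the whole buffer to the pool at once."""
        with open(self.pool_path, "ab") as pool:
            for record in self.ram_buffer:
                pool.write(record + b"\n")
            pool.flush()
            os.fsync(pool.fileno())
        self.ram_buffer = []
```

Note that during normal operation the log file is only ever appended to, never read; the asynchronous records reach the pool straight from the RAM buffer.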
The sequence is something like:
+----------------+       +--------------------------------+
| write requests | ----> | ZFS writes to the RAM's buffer |--------+
+----------------+       +--------------------------------+        |
                                                                   |
  ZFS acks the clients   +-------------------------------------+   |
 <---------------------- | syncs are written to ZFS intent log |<--+
                         +-------------------------------------+
So, when the RAM's buffer is full:
+--------------------------+              +------------------------+
| ZFS RAM's buffer is full | -----------> | ZFS writes to the pool |
+--------------------------+              +------------------------+
See, we can make the world a better place. ;-)
If you pay attention, you will see that the flow goes in only one direction:
requests ------> RAM --------> disks
or
requests ------> RAM --------> ZFS intent log
                  |
                  +----------> disks
The conclusion is: the intent log is never read while the system is running, only written. I think this is something that is not very clear: the intent log only needs to be read after a crash, that’s all. If the system crashes and there is data that was written to the ZFS intent log but not yet written to the main pool, the system needs to read the ZFS intent log and replay any transactions that are not on the main pool.
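That replay step can be sketched as a toy function. This is an illustration only, with plain files standing in for the intent log and the main pool (the real ZIL replays full transaction records, not lines of text): any record present in the log but missing from the pool gets reapplied.

```python
def replay_intent_log(log_path, pool_path):
    """Toy sketch of crash recovery: reapply every log record that
    never made it to the pool. This is the only time the intent log
    is ever read. Returns the number of records replayed."""
    try:
        with open(log_path, "rb") as f:
            logged = f.read().splitlines()
    except FileNotFoundError:
        return 0  # no intent log, nothing to replay

    try:
        with open(pool_path, "rb") as f:
            in_pool = set(f.read().splitlines())
    except FileNotFoundError:
        in_pool = set()

    replayed = 0
    with open(pool_path, "ab") as pool:
        for record in logged:
            if record not in in_pool:
                pool.write(record + b"\n")  # the transaction is reapplied
                replayed += 1
    return replayed
```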
From the source code:
/*
 * The zfs intent log (ZIL) saves transaction records of system calls
 * that change the file system in memory with enough information
 * to be able to replay them. These are stored in memory until
 * either the DMU transaction group (txg) commits them to the stable pool
 * and they can be discarded, or they are flushed to the stable log
 * (also in the pool) due to a fsync, O_DSYNC or other synchronous
 * requirement. In the event of a panic or power fail then those log
 * records (transactions) are replayed.
 * ...
ps.: One more place where the ZFS intent log is called the ZIL. ;-)
Some days ago, looking at Sun's Fishworks solution, I saw that there is no redundancy on the Fishworks slog devices: just one device, or a stripe... So I realized that the window of exposure is around five seconds! I mean: the system writes the data to the ZFS intent log device (slog), just after that the slog device fails, and before the system writes the data to the main pool (which takes around five seconds), the system crashes.
OK, but before you start to use Sun's approach, I think there is at least one bug that needs to be addressed:
6733267 "Allow a pool to be imported with a missing slog"
But if you do not export the pool, I think you are in good shape...
In another post of the ZFS Internals series, I will try to go deeper into the ZFS intent log and look at some aspects I did not touch on here. Obviously, with some action... ;-)
But now i need to sleep!
peace.