ZFS Internals (part #9)

PLEASE BE AWARE THAT ANY INFORMATION YOU MAY FIND HERE MAY BE INACCURATE, AND COULD INCLUDE TECHNICAL INACCURACIES, TYPOGRAPHICAL ERRORS, AND EVEN SPELLING ERRORS.

 From the MANUAL page:
 The zdb command is used by  support  engineers  to  diagnose
 failures and gather statistics. Since the ZFS file system is
 always consistent on disk and is self-repairing, zdb  should
 only be run under the direction by a support engineer.

DO NOT TRY IT IN PRODUCTION. USE AT YOUR OWN RISK!

Some builds ago there was a big change in OpenSolaris regarding ZFS. It was not a change in ZFS itself: the change was the addition of a new scheduling class, but one with great impact on ZFS.

OpenSolaris had six scheduling classes until then:
– Timeshare (TS);
This is the classic one. Each process (thread) gets an amount of time to use the processor, and that “amount of time” is based on priorities. This scheduling class works by changing the process priority.
– Interactive (IA);
This is something interesting in OpenSolaris, because it is designed to give a better response time to the desktop user: the window that is active gets a priority boost from the OS.
– Fair Share (FSS);
Here the processor is divided (fairly? ;-) into units, so the administrator can allocate the processor resources in a controlled way. I have a screencast series about Solaris/OpenSolaris features where you can see a live demonstration of resource management and this FSS scheduling class. Take a look at the containers series…
– Fixed Priority (FX);
As the name suggests, the OS does not change the priority of the thread, so the time quantum of the thread is always the same.
– Real Time (RT);
This is intended to guarantee a good response time (latency). So it is like a special queue at the bank (for people with special needs, a pregnant lady, the elderly, or somebody with many, many dollars). Actually, that kind of person does not go to the bank... hmmm, bad example, anyway…
– System (SYS);
For the bank owner. ;-)
Hilarious, because this is where the problem with ZFS was! Actually, the SYS class was not prepared for ZFS’s transaction group sync processing.

There were many problems with the behaviour of the ZFS I/O scheduler:

6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops
6494473 ZFS needs a way to slow down resilvering
6721168 slog latency impacted by I/O scheduler during spa_sync
6586537 async zio taskqs can block out userland commands
ps.: I would add to this list the scrub process too…

The solution, the new scheduling class (#7), is called:

System Duty Cycle Scheduling Class (SDC);

The first thing I thought when reading the theory statement from the project was that it was not so clear why you would fix an I/O problem by changing the scheduling class, actually messing with the management of the processor resources. Well, that’s why I’m not a kernel engineer… thinking it over, it seems like a clever solution and, given the complexity of ZFS, the easy way to control it.
As you know, ZFS has I/O priorities and deadlines, and synchronous I/Os (like sync writes and reads) have the same priority. My first idea was to have separate slots for each type of operation. It’s interesting, because this problem was the subject of a post from Bill Moore about how ZFS handles a massive write load while keeping up with the reads.
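To make the "priorities and deadlines" idea a bit more concrete, here is a minimal sketch (a toy, not the ZFS code). It assumes that each I/O gets a deadline equal to its arrival time plus its priority value, and that the queue always issues the I/O with the earliest deadline. The class name, the priority constants and the time units are all hypothetical, chosen only for illustration.

```python
import heapq
import itertools

# Hypothetical priority values (lower means more urgent).
# These are NOT the real ZFS constants, just numbers for the example.
PRIO_SYNC_READ = 0
PRIO_SYNC_WRITE = 0   # same priority as sync reads, as described in the text
PRIO_ASYNC_WRITE = 4
PRIO_RESILVER = 10


class DeadlineQueue:
    """Toy I/O queue: issue the pending I/O with the earliest deadline."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO among equal deadlines

    def submit(self, name, priority, now):
        # Assumption: deadline = arrival time + priority value.
        deadline = now + priority
        heapq.heappush(self._heap, (deadline, next(self._seq), name))

    def issue(self):
        # Pop and return the name of the I/O with the earliest deadline.
        _deadline, _seq, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Issue everything in the queue and return the names in issue order."""
    order = []
    while len(queue) > 0:
        order.append(queue.issue())
    return order


q = DeadlineQueue()
q.submit("resilver-1", PRIO_RESILVER, now=0)
q.submit("async-write-1", PRIO_ASYNC_WRITE, now=1)
q.submit("sync-read-1", PRIO_SYNC_READ, now=5)
q.submit("sync-write-1", PRIO_SYNC_WRITE, now=5)
issue_order = drain(q)
```

In this toy model an old low-priority I/O eventually gets an early enough deadline to be issued, and a sync read and a sync write that arrive together are simply served in arrival order, because nothing distinguishes them.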
So, there were some discussions about why another scheduling class should be created instead of just using the SYS class. The answer was that the SYS class was not designed to run kernel threads that are large consumers of CPU time and, by definition, SYS class threads run without preemption from anything other than real-time and interrupt threads.
And more:

Each (ZFS) sync operation generates a large batch of I/Os, and each I/O may need to be compressed and/or checksummed before it is written to storage. The taskq threads which perform the compression and checksums will run nonstop as long as they have work to do; a large sync operation on a compression-heavy dataset can keep them busy for seconds on end.
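
Just to picture what a "duty cycle" means for those busy taskq threads, here is a small simulation. It is a sketch under my own assumptions, not the SDC implementation: time is divided into ticks, a CPU-hungry batch thread always has work, and it is allowed to run a tick only while the fraction of time it has spent on the CPU is at or below its target duty cycle. Otherwise the tick goes to whatever else is runnable. The function name and the tick model are hypothetical.

```python
def simulate_duty_cycle(target, ticks):
    """Return the fraction of ticks used by a batch thread limited to `target`.

    target: desired duty cycle, between 0.0 and 1.0
    ticks:  number of scheduling ticks to simulate
    """
    on_cpu = 0    # ticks the batch thread actually ran
    elapsed = 0   # ticks simulated so far
    for _ in range(ticks):
        # Duty cycle so far; 0.0 before the first tick to avoid dividing by zero.
        duty = on_cpu / elapsed if elapsed else 0.0
        if duty <= target:
            on_cpu += 1   # batch thread keeps its high priority and runs
        # else: it is over budget, so other threads get this tick
        elapsed += 1
    return on_cpu / ticks


half = simulate_duty_cycle(0.5, 100)       # batch thread gets about half the CPU
unlimited = simulate_duty_cycle(1.0, 100)  # no limit: it runs nonstop
```

With a target of 1.0 the batch thread behaves like the nonstop taskq threads described above; with a lower target, the other threads get the remaining ticks.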

ps.: And we are not even talking about dedup yet; this seems like a must-fix for the evolution of ZFS…
You see how wonderful our work is: by definition, NAS servers have no CPU bottleneck, which is why ZFS can have all these wonderful features, and why new features like dedup are coming. But the fact that CPU is not a problem actually was the problem. ;-) It’s like giving me Lewis Hamilton’s MP4-25. ;-)))
There is another important point with this PSARC integration: now we can observe the ZFS I/O processing, because a new system process was introduced with the name zpool-poolname, which gives us observability using simple commands like ps and prstat. Cool!
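As a small illustration of that observability, here is a sketch that picks the zpool-poolname processes out of ps-style output. It assumes two columns (PID and command name), like what something such as `ps -e -o pid,comm` would print. The helper name and the sample output are made up for the example; I did not capture them from a real system.

```python
def find_zpool_processes(ps_output):
    """Return (pid, pool_name) pairs for processes named zpool-<poolname>."""
    prefix = "zpool-"
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip the header and anything malformed
        pid = int(fields[0])
        comm = fields[1].strip()
        if comm.startswith(prefix):
            found.append((pid, comm[len(prefix):]))
    return found


# Hypothetical sample output, for illustration only.
SAMPLE_PS_OUTPUT = """\
  PID COMMAND
    0 sched
    5 zpool-rpool
  412 zpool-tank
  733 /usr/sbin/nscd
"""

zpool_procs = find_zpool_processes(SAMPLE_PS_OUTPUT)
```

On a real system you would feed the function the text captured from ps, and then point prstat at the PIDs it returns.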
I confess I have not had the time to do a good test of this new scheduling class implementation and how it performs. This fix was committed in build 129, so it should be in the production-ready release of the OpenSolaris OS (2010.03). It would be nice to hear comments from people who are already using this new OpenSolaris implementation, and how dedup is performing with it.
peace