Elevator Algorithm
Hello there, this will be a long post… so, I think if you have no time to waste, think again before you go ahead.
And, no, I'm not Morpheus, so I cannot offer you the truth. Actually, I can just offer you my doubts. And no pills, I do not like them.
My blog is like a notepad, and I'm not a kernel developer, so what I describe in this post is just my findings and is in no way an authoritative or official resource about the OpenSolaris kernel or the SCSI interface. It's here in the hope it can be:
1) Useful
2) Fixed
In any order…
Well, for my two or three loyal readers… here we go!
After my last post about performance (II), I'm trying (really hard) to understand the "background" behind the Solaris/OpenSolaris device driver interface, mainly the SCSI disk driver (sd), the elevator/reordering algorithm, and their impact on workloads (performance and consistency, if any).
I do think it's an important point in the storage stack, and I guess we do not consider it this way most of the time. It's like F1: the driver and the car get all the attention, but many races are decided in the pit stop.
In my quest, the first thing that caught my attention was this comment:
..
 * A one-way scan is natural because of the way UNIX read-ahead
 * blocks are allocated.
...
The sentence above is out of context, and we will talk more about the function this comment is trying to explain, but I really think we need to review our concepts.
So, if you read my last post about performance, you know that I talked about the disk queue and command reordering inside the device driver. The block comment above is from the function sd_add_buf_to_waitq, the first function that caught my attention. It's through that function that an I/O command is sent to the disk. As is common in OpenSolaris code, it is very well commented and simple to read. Here is the prototype of the function we will discuss in this post:
...
static void
sd_add_buf_to_waitq(struct sd_lun *un, struct buf *bp)
...
So, the function receives two parameters: basically, one is the target and the other is the command. It's important for us to understand where we are… we are not talking about filesystems, about ZPL, about DMU, or SPA. We are way down, really close to the disks, and more: we have the power to live and let die.
Sorry for the simplistic approach, but I will refer to the first argument as the "disk" we want to write to, and to the second argument as the "data" we want to write to that disk. The *bp is a pointer to a buffer that carries much more information the device driver needs to do the job.
Before talking about that procedure in particular, let's agree that in a normal scenario we would expect a FIFO algorithm for this kind of work. So, if we need to write data "B", "C", "D", "E", and "F" at locations "10", "14", "18", "19", and "21", that would be the order in which we execute the task.
Now, back to that procedure: the *bp carries the piece of information everything depends on, the location where the data will be written. The BLKNO tells the absolute block where that data will be written, and so, if another flag (disksort, enabled by default) is set on that device, everything changes.
Let's say we have a SCSI disk, sd1, with a ZFS filesystem on it. The requests are sent to that disk through the sd_add_buf_to_waitq function, and will either be sent directly to the device or go to the wait queue. Now, imagine we have a queue (the letters above) like this (lower BLKNOs first):
+------------------------+
| 10 | 14 | 18 | 19 | 21 |
+------------------------+
Those I/Os were sent to the driver, and now we want to write the letter "A", whose BLKNO is "8". Let's follow the logic in sd_add_buf_to_waitq and understand how this new I/O will be delivered:
...
/* If the queue is empty, add the buf as the only entry & return. */
if (un->un_waitq_headp == NULL) {
...
But the queue is not empty, no luck for us… what next?
/*
 * If sorting is disabled, just add the buf to the tail end of
 * the wait queue and return.
 */
...
if (un->un_f_disksort_disabled || un->un_f_enable_rmw) {
...
That's not the case either, but if the answer were "yes" (disksort disabled), this would be FIFO and we would be queued after the "21". But sorting is enabled by default for all devices. So, the logic to insert the "A" in the wait queue is as follows:
-> Is the BLKNO of (A) lower than the BLKNO of the first request (B)?
-> No. Then search for the first inversion in the list, because it should be ascending. If an inversion is found, the "A" is added before or after it, depending on its BLKNO. In this case there is no inversion, so the "A" is placed at the end and becomes the first request of the "second" list:

+---------------------------------+
| 10 | 14 | 18 | 19 | 21 | 08 |
+---------------------------------+
Yes, a second list. We have two lists implemented on the waitq:
– The first holds the requests whose BLKNO is greater than the BLKNO of the current (first) request;
– The second holds the requests that came in after their BLKNO had already been passed;
Well, it's not FIFO, but we get the same behaviour in our example, don't you agree? The last request (A) will be the last one to be served. Fair enough…
To keep the math simple, assuming this disk can handle each request in ~10ms, we will have our request "done" in ~60ms (it is the sixth in line).
But remember: this is not FIFO, and that first example was just a coincidence. Luck for some guys, providence for others. So, let's say that in the next millisecond six more requests arrive: "O", "P", "Q", "R", "S", and "T".
BLKNO: “4”, “7”, “1000”, “6”, “2000”, “3000” respectively (random as life).
Let’s see how our waitq would look (following the algorithm):
+---------------------------------------------------------------------+
| 10 | 14 | 18 | 19 | 21 | 1000 | 2000 | 3000 | 04 | 06 | 07 | 08 |
+---------------------------------------------------------------------+
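Just to convince myself I was reading the code right, here is a minimal sketch of that insertion logic. This is NOT the real sd.c code (no locking, and the names blkno and next stand in for the buf's b_blkno and av_forw), just a toy model of the one-way scan described above; feeding it the example requests prints exactly the queue shown:

/* elevator_sketch.c: toy model of the waitq insertion, not sd.c itself. */
#include <stdio.h>
#include <stdlib.h>

struct req {
	long blkno;		/* stands in for the buf's b_blkno */
	struct req *next;	/* stands in for av_forw */
};

/* Insert bp into the queue headed by *headp, keeping the two ascending
 * runs: the "first list" still ahead of the head, then the wrapped one. */
static void
waitq_insert(struct req **headp, struct req *bp)
{
	struct req *ap = *headp;

	bp->next = NULL;
	if (ap == NULL) {		/* empty queue: trivial case */
		*headp = bp;
		return;
	}
	if (bp->blkno < ap->blkno) {
		/* Sorts before the head: it belongs in the second list. */
		while (ap->next != NULL) {
			if (ap->next->blkno < ap->blkno) {
				/* Inversion found: walk the second list
				 * until a larger block number shows up. */
				do {
					if (bp->blkno < ap->next->blkno)
						goto insert;
					ap = ap->next;
				} while (ap->next != NULL);
				goto insert;
			}
			ap = ap->next;
		}
		goto insert;	/* no inversion: bp starts the second list */
	}
	/* Sorts at/after the head: stay in the first list, stopping at the
	 * first larger block number, or at the inversion if we reach it. */
	while (ap->next != NULL) {
		if (ap->next->blkno < ap->blkno || bp->blkno < ap->next->blkno)
			goto insert;
		ap = ap->next;
	}
insert:
	bp->next = ap->next;
	ap->next = bp;
}

int
main(void)
{
	/* B..F, then A, then O, P, Q, R, S, T (same BLKNOs as the example) */
	long blknos[] = { 10, 14, 18, 19, 21, 8, 4, 7, 1000, 6, 2000, 3000 };
	struct req *head = NULL, *r;
	size_t i;

	for (i = 0; i < sizeof (blknos) / sizeof (blknos[0]); i++) {
		r = calloc(1, sizeof (*r));
		r->blkno = blknos[i];
		waitq_insert(&head, r);
	}
	for (r = head; r != NULL; r = r->next)
		printf("%ld ", r->blkno);  /* 10 14 18 19 21 1000 2000 3000 4 6 7 8 */
	printf("\n");
	return (0);
}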
Hmmm, now our latency has changed a little… from ~60ms to ~120ms (our "A" is now the twelfth in line, at ~10ms per request)! We can have I/Os taking half a second, and so somebody waiting for the 4K block of an email header, or a really bad browser experience. But the logic in the sd.c code is not dumb; it has a reason, a purpose, and it is really simple: provide more bandwidth.
The idea is this:
– We do a "head seek", and do as much work as we can…
– Then "seek again" and do the same thing.
The purpose is to get the most out of each "head seek". Whether that is the better default behaviour is not so clear to me.
But the main concern should be the length of that list. The bigger it gets, the bigger the latencies, and you will not have control over your workload's service times. If you need "controlled" latency, you need to configure that list down to a few elements, or none at all. You really want to check this parameter if you need to handle a heavy random IOPS workload.
The last OpenSolaris distro is 2009.06, and that version still has the old default of 35 elements (zfs_vdev_max_pending) for that list. That number can give you all the random latency you don't want in a critical IOPS environment. The new value of 10 is better, but not perfect for all workloads. So, take a close look at your disks' latency times.
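For reference, this is how I would set that tunable, assuming the usual mechanisms of that era of (Open)Solaris (check your release before relying on it):

# persistent, in /etc/system (takes effect at the next boot):
set zfs:zfs_vdev_max_pending = 10

# or on the live system with mdb, not persistent across reboots:
echo zfs_vdev_max_pending/W0t10 | mdb -kw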
I wrote a D script to look further into this situation, and what I saw was that the number of elements is totally dependent on the workload. I'm in the middle of these tests, but what I can tell you is that 10 was always better than 35. But just two or three elements was not better in all the situations I tested; in some workloads 15 was the better number. It all depends on the disk queue, service times, workload, retries, etc…
The last point about reordering is consistency, and the comment I quoted at the beginning of this post. Besides the performance problems we can have with this configuration, I think there are some consistency issues we can face too. My script prints the following information:
– Device throttle (THR; default 256, and not a problem here; the max commands allowed in transport…);
– ncmds_in_driver (NCMDSD; I think these are the commands in the waitq plus "others" I could not trace, not sure);
– ncmds_in_transport (NCMDST; the commands already sent to the device, and I hope there is no other sorting logic there);
– Write Cache Enabled (WCE);
– The BLKNO of the request in question (BS, from the buf);
I did my initial tests on a ZFS environment with just one disk in the pool, and a really idle system to make things easier. Here are two spa_syncs from that test, with zfs_vdev_max_pending = 10:
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426757
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 3 BS: 68987048
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 1 NCMDST: 0 BS: 100771666
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 3 NCMDST: 2 BS: 69233318
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 2 BS: 100771666
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233378
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 100771670
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 1 NCMDST: 0 BS: 32530
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 33042
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 3 NCMDST: 2 BS: 154239250
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 3 BS: 154239762
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 1 NCMDST: 0 BS: 101426647
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42218576
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233386
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 68986953
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233322
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 100771642
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101482890
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 68987055
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233372
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426764
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42115694
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42218578
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42752890
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101171462
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426771
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 1 NCMDST: 0 BS: 101171464
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 3 NCMDST: 1 BS: 69233376
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42752873
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 68987055
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233372
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101171460
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426764
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 0 BS: 101426646
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233282
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426663
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42752866
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 100771670
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42115694
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 1 BS: 101426648
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101039490
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101171406
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 1 NCMDST: 0 BS: 100771666
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 3 NCMDST: 1 BS: 69233318
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 0 BS: 101426764
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 3 BS: 68987055
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233386
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 2 BS: 101171400
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101040002
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42115694
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233374
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426774
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42218576
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101482898
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101171460
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42752873
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 68987062
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101482898
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42752890
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233318
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101171466
2010 Sep 4 20:38:07 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426777
Here is the second, 30 seconds later…
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 1 BS: 101426787
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233394
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 1 NCMDST: 0 BS: 100771690
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 3 NCMDST: 2 BS: 69233450
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42115702
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233466
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 1 NCMDST: 0 BS: 101426784
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426790
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42752898
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233450
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 1 NCMDST: 0 BS: 101171514
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 3 NCMDST: 1 BS: 69233460
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 0 BS: 101426813
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 3 BS: 68987101
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42115702
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233466
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233462
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 1 NCMDST: 0 BS: 32532
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 33044
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 3 NCMDST: 2 BS: 154239252
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 3 BS: 154239764
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 3 NCMDST: 0 BS: 101426793
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233384
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233454
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101171506
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42218586
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42218594
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101171508
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101171516
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42218584
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 100771690
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 3 NCMDST: 1 BS: 101426780
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426796
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 68987108
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233462
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 68987124
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233448
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101482906
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101171468
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42218584
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 2 BS: 100771690
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 68987108
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 68987108
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426820
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426836
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233418
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 100771674
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42115702
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426820
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42752905
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42752930
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426820
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42752905
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101171506
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 68987071
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233448
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 1 NCMDST: 0 BS: 101171514
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 69233460
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 3 NCMDST: 2 BS: 42218592
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101482906
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101426833
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 4 NCMDST: 3 BS: 68987121
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 42752930
2010 Sep 4 20:38:37 - DID: 1 BCT: 154296320 WCE: 0 THR: 256 NCMDSD: 2 NCMDST: 1 BS: 101171516
You can see above two spa_syncs, 30 seconds apart. Everything's fine… but…
In the first execution, at the "beginning", we have this:
2010 Sep 4 20:38:07 - ... NCMDSD: 1 NCMDST: 0 BS: 32530
2010 Sep 4 20:38:07 - ... NCMDSD: 2 NCMDST: 1 BS: 33042
2010 Sep 4 20:38:07 - ... NCMDSD: 3 NCMDST: 2 BS: 154239250
2010 Sep 4 20:38:07 - ... NCMDSD: 4 NCMDST: 3 BS: 154239762
33042 - 32530 = 512 blocks (x 512 bytes) = 256 KB = the size of a ZFS label -> these two writes are L0 and L1
154239762 - 154239250 = 512 blocks (x 512 bytes) = 256 KB = the size of a ZFS label -> these two writes are L2 and L3
First, why are those blocks at the beginning of the report, when they should be the last update?
And the two-phase commit says that ZFS updates L0 and L2 first…
… and only after that are labels L1 and L3 updated. L0 and L1 are at the beginning of the vdev (disk), and L2 and L3 are at the end. So, with the two-phase commit, updating one label at the head and one at the tail of the disk, ZFS tries to make it harder to lose access to a pool after a logical failure, which is more likely to be contiguous. It is interesting to see that these are the only commands showing the 1, 2, 3, 4 pattern in ncmds in driver (and 0, 1, 2, 3 in transport). It will be interesting to see the latencies for these updates.
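For reference, here is a little sketch of the label geometry that makes those numbers line up, as I understand the ZFS on-disk format: four labels of 256 KB each, two at the front of the vdev and two at the back. The vdev size used below is hypothetical, not the disk from the trace above; the point is only that consecutive label writes sit 512 blocks apart.

/* zfs_label_layout.c: where the four labels live on a vdev of "psize" bytes
 * (my reading of the on-disk format; not ZFS code). */
#include <stdio.h>
#include <stdint.h>

#define	ZFS_LABEL_SIZE	(256ULL * 1024)		/* each label is 256 KB */
#define	DEV_BSIZE	512ULL			/* block size behind b_blkno */

int
main(void)
{
	uint64_t psize = 73ULL * 1024 * 1024 * 1024;	/* hypothetical 73 GB vdev */
	uint64_t off[4];
	int i;

	off[0] = 0;				/* L0, start of the vdev */
	off[1] = ZFS_LABEL_SIZE;		/* L1, right after L0 */
	off[2] = psize - 2 * ZFS_LABEL_SIZE;	/* L2, near the end */
	off[3] = psize - ZFS_LABEL_SIZE;	/* L3, last 256 KB */

	for (i = 0; i < 4; i++)
		printf("L%d: byte offset %llu (block %llu)\n", i,
		    (unsigned long long)off[i],
		    (unsigned long long)(off[i] / DEV_BSIZE));
	/* L1 - L0 == L3 - L2 == 512 blocks: the 256 KB gap seen in the trace. */
	return (0);
}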
I changed zfs_vdev_max_pending to 1 and the problem almost went away completely. But I still saw a few label updates in the middle of the reports sometimes. To be continued…
peace
Very interesting. When reading your post and interpreting the results (probably wrongly), it reads like "ZFS tries its best to maintain data integrity by doing L0 + L2 / L1 + L3 updates, and the sd driver reorders transactions for the sake of throughput". Is this correct?
Does OpenSolaris have no concept of barriers (reorder domains) as Linux has?
Hello Robert!
Yeah, that's what it seems… as I said, I tried configuring zfs_vdev_max_pending to 1 to see if that order would change, and it did. The only problem was that the label updates were not 100% in order either. Almost every update was in the right order (the last 4 requests were the label updates), but I did see "some" in the middle of the report too.
Another thing… at no time did I see L0 and L2, then L1 and L3 (in that exact order). Even with zfs_vdev_max_pending configured to "1", the order was L0, L1, L2, and L3.
I need to go further on this, but with the tunable set to "1", we need to know whether ZFS is updating the labels in the wrong order (a bug?), or whether there is another reordering algorithm elsewhere…
I suppose any tuning will have an impact on other areas of the system.
When you tune down the I/O queue, it may help with latency, but I would speculate it will negatively impact throughput.
Furthermore, I remember there was an issue with swap/paging performance when you tune down the sd queue depth. So if your swap is on a zvol, and you tune down zfs_vdev_max_pending, and the system starts paging, I am pretty sure the performance will be bad.
Hello Sean,
You are right, this tuning will have an impact on throughput performance, and that is the subject of this post. The system is already configured (by default) for throughput, and if that is your workload, this tuning is not for you.
And about swapping, I think there is no configuration that makes it work well. ;-) The system should not have its "running" programs paged out to disk.
Last, this post is about performance and consistency; I talked about reordering issues regarding the ZFS label updates. So, in my opinion there are two dimensions to this "old" code and its assumptions.
When I mentioned paging/swapping, I meant "in the case when paging/swapping happens" – nobody wants it to happen, but it does happen.
In this case, it does require a higher throughput, and a short queue depth will make the performance VERY bad. See Technical Instruction Note: 1011936.1 for an example.
I see your point, Sean, and as a general rule I understand it should be taken into consideration. But I have never seen a server providing services while swapping, so I really do not consider this scenario. When a server is in a swapping situation, things are not working anymore; we would be talking about "some coins" in a chaotic situation. But I do agree that, as a general rule, this should be taken into account.
Swap for me is just a concept. ;-)
In other words, all your servers have been oversized. I am pretty sure the sales folks were pretty happy about it :-)
BTW, paging and swapping are two different concepts; I am pretty sure the Solaris Internals book or the Solaris performance tuning books cover the topics.
Hello Sean,
Interesting discussion, but I don't know if our expertise in configuring servers is the right direction for it.
Maybe my servers are all oversized, but you can be sure of two things:
– The servers I manage do not swap, period.
– I do not do just-in-time server management.
I always work with a trade-off, and you will always pay somewhere:
a) you pay to not have a problem; or
b) you pay when you have a problem (a degradation of the services is a problem, and depending on your clients/company, it can be a huge one).
So, I respect the fact that you can manage your servers without any margin, and I'm sure you are thinking about the requirements of your job. Even if I do not. ;-)
If I understood your point correctly, it's better to have a system working "badly" 90% of the time (without the tuning), and working well (or better) when swapping (10% of the time). Is that right? Or do your servers swap more frequently? Because that would be more gain for your configuration…
My experience has shown me that a critical system "must" keep its running processes in memory, even if my desktop swaps from time to time.
PS: About paging/swapping, read your first comment. Did you read it before or after the Solaris books? ;-)
I read the Solaris performance tuning book (Adrian Cockcroft) over 10 years ago, way before ZFS.
A few clarifications – swapping is bad, moderate paging is not necessarily bad.
Paying a few extra hundred or thousand bucks for a few GB of RAM is OK; paying a few hundred thousand for a few hundred GB of RAM is not.
Yes, there is always a trade-off.
Anyway, I don't mean anything hostile; I would just like to remind you of the side effects of tuning down the queue depth.
C'mon, I really like a good discussion; that's the cool thing about blogging.
You were not hostile, and thank you for your comments! I’m sure others think the same as you, and this is the real purpose of the comments section, to generate discussion about the subject.
The only problem is that I always miss the beers and laughs in these web conversations… ;-)