What I do not like about ZFS…
Well, it has been a long time since my last post; I have been quite busy with my day-to-day job and other projects that consume a lot of my time.
I really like blogging, but right now I just do not have the time I need to write here. It was nice to read the comments about my ZFS Internals series (and I still need to update the PDF version). I do not even remember my last post about ZFS, and I still have a lot of comments on those posts to answer. Thank you!
I'm doing a lot of interesting things at work and on other projects that will be nice to write about soon… but what I really feel obliged to write about is my opinion on ZFS's not-so-good points.
If you are used to reading my blog, I do not need to tell you that I think this is a great piece of software, and that I think it is the way to go for storage these days. But like everything else in life, it has problems, and I will say (sorry), huge ones.
Actually, I would title this post “ZFS errors”, but who am I to say that? So I prefer to give my personal opinion after using it and hacking on it a little. Besides, it's not fair to talk only about the good points of something, because that's not reality. And the problems we have faced with a technology, even if they are not so critical for us, can be a huge problem for other people, and our experience can help them.
Sorry ZFS hackers, but here we go…
Number 1: RAIDZ: Simple as that: it solves a real problem that maybe two guys in the whole universe knew about, and creates many others… RAIDZ simply cannot be used in production; if you need even a little performance, you cannot use it. And that leaves you running production with just one configuration: mirror. If you want to diversify, you can do RAID 1. Compared with other storage vendors, your cost per gigabyte will always be haunted by this ghost: raw/2.
It's not useless: you can use it at home, or even for backups in a datacenter solution, but you need a robust infrastructure to deal with long resilvers and really poor restore procedures. “Everything is a full stripe” solves one problem and creates a performance bottleneck and a nightmare for the resilver process (ZFS needs to traverse the whole filesystem to resilver a disk). If you care, I can give you one piece of advice: if you want to use it, three disks per set is the maximum you should go for.
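As a rough illustration of that advice (the pool and device names here are hypothetical), a mirrored production layout versus a backup pool built from narrow three-disk RAIDZ sets would look something like this:

# mirrored pairs: the only layout I end up trusting for production performance
zpool create prodpool mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0
# narrow RAIDZ sets (three disks each): acceptable for backup-style workloads
zpool create bkpool raidz c2t0d0 c2t1d0 c2t2d0 raidz c2t3d0 c2t4d0 c2t5d0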
Number 2: L2ARC: Some time ago I already wrote about it, but it is worth talking about again in this list. The ZFS ARC is damn good code, and it really can bring huge performance gains over time! But looking at the features of the Oracle Storage 7420, I was scared:
Up to 1.15PB (petabyte) raw capacity
Up to 1TB DDR3 DRAM
Up to 4TB flash-enabled read-cache
Up to 1.7TB flash-enabled write-cache
If ZFS solved the write problem using SSDs, I think it created another one for reads. If we lose 4TB of read cache, we will not have the universe as we know it anymore. No, don't tell me that this will not happen…
Solid state drives are persistent devices, so I would expect it to be a priority not to lose their contents on reboots, failovers, or SSD failures (e.g., by mirroring). The performance impact is so huge that the system will effectively not be available, and the second most critical problem in storage is availability (the first is losing data, and ZFS cares a lot about our data). Every presentation, every paper, everything about the L2ARC gives the same “answer”: the data is on disk, and we can read it from there if we lose the SSD contents. No, no, we cannot get it from there! ZFS loves cheap disks, so our disks are 7200 RPM SATA disks… they know how to store data, but they do not know how to read it. These 7200 RPM SATA drives should come with a banner saying: pay to write it and pray to read it.
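Just to be concrete about what I mean (device names are hypothetical): the cache devices are added to the pool like this, and after any reboot they come back empty, so the working set has to be repopulated from those slow disks:

# add two SSDs as L2ARC (cache) devices to an existing pool
zpool add tank cache c3t0d0 c3t1d0
# 'zpool iostat -v tank' shows them under a "cache" section; after a reboot they start cold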
Number 3: FRAGMENTATION
No tears, no tears… this is something that can force you to do a full ZFS send and recreate the dataset or pool. Incremental snapshot replication can be a terrible experience. Copy-on-write for the win, but it needs a defrag to keep the title!
Number 4: Myths
Here I could write a whole blog entry, but let's stick with tuning for now. When you have a huge installation, there is no code that can give you the best numbers without tuning. Prefetch, max_pending, recordsize, the L2ARC population task, txg sync time, the write throttle, scrub_limit, resilver_min_time… and those are just some of the knobs I think every “medium” ZFS installation should tune from the start. My book has others… ;-)
And we are not even talking about other OS-specific knobs outside ZFS that can make the whole difference between “everything is working fine” and “the performance is crap”: disksort, the Nagle algorithm, the DNLC, and so on.
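Just as a sketch of what I mean, and not a recipe (the values below are illustrative only, the dataset name is hypothetical, and the right numbers depend on your workload and Solaris release), a tuning baseline could look like this:

* /etc/system
set zfs:zfs_prefetch_disable = 1
set zfs:zfs_vdev_max_pending = 10
set zfs:zfs_resilver_min_time_ms = 3000
set zfs:l2arc_write_max = 8388608
* DNLC size
set ncsize = 250000

Plus the per-dataset and network pieces, for example:

zfs set recordsize=8k tank/db          (match the record size to the database block size)
ndd -set /dev/tcp tcp_naglim_def 1     (disable the Nagle algorithm)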
OK, done. Feel free to curse at me in the comments section.
peace
Losing the contents of the L2ARC SSDs seems like a waste of resources that should be fixed by Oracle ASAP. The whole point of their ZFS appliance is to win over their competitors by allowing customers to buy cheap and unreliable storage (7200 RPM enterprise-y SATA disks) and make it run better with lots of fast cache (memory and SSDs). We cannot make memory persistent without a lot of expensive/complex hardware… but the data on the SSDs is there, persistent as we need it. It is a waste to throw it all away after a reboot. It even makes their cluster offering look silly, since who would want to fail over a cluster node holding all that precious cached data in its memory/SSDs, only to fall back to a bunch of 7200 RPM disks?
If Oracle wants to move the Unified Storage ZFS appliances to the high end of the spectrum, they need to do a lot of work. I would grade the 74x0s as mid-range (a.k.a. “glorified” low-end).
Perhaps moving it to a high-end offering would be too much trouble; the rumors that Oracle is going to acquire NetApp seem to hold some truth.
—
On RAIDZ performance: other redundancy offerings are long overdue. There is no one-size-fits-all solution, and RAIDZ + RAID-1 do not cover all the requirements… so why not offer RAID-5/RAID-6, plain ol’ JBOD, etc.? We know the ZFS code footprint is quite small… would it be hard?
I agree that Sun’s engineers were sometimes too hard-nosed, expecting people to have only the limited set of needs they could imagine in their walled garden. The truth is that reality is much more complicated, and a storage offering should adapt to fit its customers (even if only by introducing some coolness that isn’t a mass success at first).
—
Fragmentation… as we’ve seen, a real PITA. As we’ve discussed over and over, this has been mentioned to Sun engineers multiple times, and it felt like we were pointing a cross at the devil. Reality strikes again and fragmentation is very real… all those cheap disks that ZFS allowed us to buy and use in enterprise scenarios are now struggling to deliver even the lowest IOPS. It kills me to see a disk resilver on a RAIDZ pool going at 2MB/s. ZFS send/receive becomes almost impossible when you’re trying to build an incremental feed from a pool that’s highly fragmented.
—
If Oracle doesn’t work hard and fast, I cannot see all those Exalogic and Exadata customers still willing to spend their money on the cool 7720 rack just to be kicked in the ass when a failure occurs (or worse, when they take a normal maintenance window to fail over the cluster node and discover their cluster has become a bunch of cheap 7200 RPM disks trying to serve that Exadata monster).
Sometimes I think Oracle should give us, the brave 74x0 beta testers, a trophy or award for dedicating so much time to uncovering their product’s flaws. It feels like we’re giving Oracle/Sun all the decades of storage experience that their competitors (NetApp/EMC/HP/IBM) had to slowly accumulate.
—
To the list of missing features I would add tiering… our latest issues with ZFS caching have made me realize that tiering isn’t just the old storage vendors trying to extend the life of their products (that’s also true) but a real requirement. Some critical applications must be on fast disks/SSDs all the time, and the business has to pay for that if it wants the performance/reliability.
Too bad our SLA requirements aren’t always clear… that would make deciding where to put data much easier and avoid all the hassle when things break.
—
I really hope someone at Oracle reads this. If they think the 7000s are anything but glorified mid-range storage servers… time to wake up.
That was a long comment, sorry :-)
You probably meant “losing cache”, instead of “loosing”, which would be to loosen it up, make it less tight.
We have had good luck with RAID-Z striping. You described problems, but from reading about them, I could not tell what you’ve tried.
Did you read https://blogs.sun.com/roch/entry/when_to_and_not_to ?
I spent a week studying that article, before I designed and implemented my first RAID-Z pool.
It would be a lie to say that performance is stellar; but it is decent, and well within expectations. But then again, reliability was always far more important to us than performance (nothing else performs faster in that configuration anyway).
Oh, thanks UX-admin! I’ve changed it…
Perhaps a better and more recent understanding of ZFS would assist us all.
Remember this:
1. You can build your OWN Hardware OR you can purchase from Sun.
2. Much of the Oracle Software is FREE.
3. If your Business values Support Contracts, you can purchase one inexpensively (a few dollars per day).
4. If you pay Oracle not one penny and use other Hardware then don’t blame them for the performance (or lack thereof) resulting from your choice.
5. ZFS is NEW. Not all the Bugs have been worked out YET, but it is “working”.
6. You can file a Bug Report for FREE and they will fix it for FREE (if they choose to do so; often expensive decisions will go before a Committee).
7. You can pay them or someone else to fix specific things that are important to YOU and that you value not just with words but with ca$h.
8. MOST Important. ZFS works best with LOTS of puny, fast, expensive Hard Drives and SSDs. If YOU WANT lower performance you can use slow, cheap Hard Drives and a small amount of RAM with an old 386 Processor AND it (YOUR “Storage Solution”) will still work, SLOWLY.
Check out this Document:
Sun ZFS Storage Appliance
https://www.oracle.com/us/products/servers-storage/sun-zfs-storage-family-ds-173238.pdf
Quotes from that Document:
1. “Oracle Solaris ZFS transparently executes writes to low-latency SSD media so that writes can be quickly acknowledged, allowing the application to continue processing. Oracle Solaris ZFS then automatically flushes the data to high-capacity drives as a background task. Another type of SSD media acts as a cache to reduce read latency, and Oracle Solaris ZFS also transparently manages the process of copying frequently accessed data into this cache to seamlessly satisfy read requests from clients.”
2. The “Sun ZFS Storage 7120” uses “the same feature-rich software as the high-end configurations, and delivers 12 TB to 120 TB of raw capacity using 7200 RPM SAS drives with 96GB Write cache.”
3. (Page 6:)
“Processor: 1x 4-core 2.4 GHz Intel® Xeon® Processor
Main memory: 24 GB
Configuration options: 12 TB to 120 TB using high capacity SAS-2 7200 rpm disks
Base system: 12 TB (12x1 TB) or 24 TB (12x2 TB)”
ZFS works best with LOTS of Spindles. Configure your System as shown in quote 3 above (taken from Page 6 of that PDF) and measure the performance.
With “Software RAID” you need about as much processing power (and RAM) as you would get with a dedicated (MIPS/FPGA) RAID Card (for $1000).
You _could_ buy a “Sandy Bridge” instead of a Xeon and put it on a Motherboard that has PCI-e x 16 Slots to hold two “High Speed SAS Controllers”.
You can buy 1TB HDs for less than $100 – so the total cost of that “Storage Solution” (including paying you to do the wiring) is WAY under $3000. You could pay 10 times as much if you go a different route (with better performance, but not 10x better).
Yes, ZFS has some faults. File a Bug Report and hope for the best. Think what it will be like when ZFS is 10 years old. They might even have a ‘Hardware RAID-Z4 Card’ for it by then.
You do realize that RAIDZ with Z=1 is NOT RAID 1? RAIDZ is basically RAID 5, but the Z refers to the number of parity disks (1, 2, or 3). It seems, from reading your post (I have not read the others) and some of the comments, that this is being misunderstood.
I assume there is no English translation of the book available?
I doubt our Portuguese-speaking co-worker would want to do a translation just for us….
Hello Rainer, you are right: at this time there is just a Brazilian Portuguese version. Thanks for the interest; I hope I can do an English version soon too.
Leal
I am trying to understand the problems first.
1. RAID-Z
By saying performance is bad, I assume you are talking about random I/O, which requires lots of IOPS?
It has been explained in numerous blog entries that a RAIDZ group has the IOPS of a single disk.
In this case, I suppose the answer is an SSD log device, which gives you lots of IOPS? (See the sketch after this list.)
In the meantime, I am under the impression that read and sequential I/O are fine in terms of performance?
About resilver: isn’t ZFS still better than a RAID-5 resync? At least only filesystem data is resilvered, not the whole disk?
2. L2ARC
How is it any different from other filesystems? After a cold reboot, all filesystems need to repopulate their caches, so why is this a ZFS problem? At least after the cache warms up, the read performance is acceptable?
4. Myths
Isn’t this the same for all other filesystems/volume managers?
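For reference, what I mean by an SSD log device in point 1 is roughly the following (hypothetical device names); it helps synchronous write IOPS, not random reads:

# attach a mirrored SSD log (slog) to an existing pool
zpool add tank log mirror c4t0d0 c4t1d0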
Hi Marcelo
Thanks for the article (in fact, all of your articles on this blog are phenomenal).
I am by no means well versed in ZFS, but I am trying to get there… I was wondering if one could circumvent the L2ARC issue, in specific setups, by using server-side read caching, or something like a read-optimized Whiptail (whiptailtech.com) device sitting in front of the ZFS appliances.
Or even more of a stretch, virtualizing the ZFS head within a large VM (8-way, 256GB vRAM) protected by fault tolerance.
I understand that this does not solve the issue, except in a specific situation, and most likely for smaller installations, but would either of the above ideas suffice to protect read performance after reboot? In the first instance, reads are cached outside of the head. In the second instance, it is a brute force attempt to keep the head online. Would option 1) even work?
Thanks and keep up the great work
Donal
Hello Donal!
Thanks for the nice words. It’s good to hear that these random ideas are useful for somebody… ;-)
I’m not sure I fully understood the first proposed setup, but as I understand it, you are talking about a cache outside the ZFS NAS, so on the “clients”. That is good. All cache is good…
How well it works will depend on whether you are using a local filesystem (iSCSI), a network filesystem, or a cachefs, for example.
The second approach is, like you say, “brute force”, but no doubt it is a good option if it adds availability to the solution. Still, I do not like to rely on a solution that “will not fail”. I don’t believe in infinite MTTDL, but in the best MTTR. ;-)
Constantin gave a good idea for the L2ARC problem, which is to use it as stripes and divide the impact in the case of an SSD failure. But there is still no solution for the reboot case.
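Something like this, just as a sketch (device names are hypothetical): several smaller cache devices instead of one big one, so the L2ARC is spread across them and a single failed SSD only costs a fraction of the cached working set:

zpool add tank cache c3t0d0 c3t1d0 c3t2d0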
Hope that helps.
Leal
Hi Leal
Thanks for the feedback, and also to Constantin for the striped L2ARC suggestion
The DDRdrive X1 has its own battery to preserve the cache (normally used as a ZIL device) during a power outage. I’m not certain if the DDRdrive can be used as an L2ARC device. From looking at the specs, it certainly should be possible.
My first suggestion, to explain a bit further, is to use a content-based read cache within the client servers (in my case, ESXi servers). That functionality will soon be available and works by maintaining a small area of vmkernel memory for read caching. However, there are several hardware devices that can do this using local SSDs within the client server, or on the SAN/network between the clients and the storage array head. My worry was that ZFS must maintain read cache consistency at the appliance head for the filesystem to work properly.
How about RDAM between appliances, whereby the components of the L2ARC are mirrored over, say, InfiniBand or 10G Ethernet? If that is possible, you would have guaranteed L2ARC availability on one of the ZFS heads, meaning that during any outage the L2ARC ‘working set’ is warm on the remaining ZFS appliance head. Does that capability exist?
Thanks again for the education!
Donal
P.S. I mean RDMA as opposed to RDAM.
OK, I think I understood your suggestion well, and I really think it will work. But let me make myself clear: when I’m talking about the problem with the L2ARC, I’m talking about the code, not about the device. Do you understand my point? An SSD is a persistent device; the problem is that ZFS cannot reuse that data after a reboot. In the case of the L2ARC SSD device, my experience is that it is much more common to have maintenance, and thus a reboot, than an SSD failure (which is not true for an SSD used for the ZIL, for example).
And regarding the proposed cache on the client, the protocol matters because it impacts the efficiency of the cache. If you are using iSCSI, your cache is the “guy”, because the filesystem is yours. Using NFS, you will need to do some tuning of the timeo values, or maybe use some kind of cachefs. And so on… For a comparison of cache efficiency between a network filesystem and a local filesystem, take a look here:
https://www.eall.com.br/blog/?p=89
Hope that helps!
Leal
Hi Leal
Thanks very much again for your response. I had been missing the point that what was being discussed here is the code requirement for a persistent L2ARC.
I will be using a mixture of iSCSI (using COMSTAR) and NFS v3. My hypervisor (ESXi) does not (yet) support pNFS, but when it does, the cachefs will become very critical to our architecture. In the meantime, I will research the link you posted and look for ways to improve my current NFS design.
Thanks again for your advice.
Donal
Marcelo,
If I make two RAIDZ vdevs and stripe between them, would that be the same as RAID 50? Would this be an option for better performance than stripe + mirror?
I have a comparison table from an Intel RAID controller that shows RAID 50 being faster (read/write) than RAID 10:
https://s15.postimg.org/v34vn9ym3/intel_raid.jpg
Is this table reliable, or is it marketing?
Thanks.
Hi Daniel,
RAIDZ is a similar RAID solution (like RAID 5), but not exactly the same. RAIDZ differs in performance and guarantees when compared with “standard” RAID 5 or RAID 6.
Here you have a “masterpiece” explaining the whole thing: https://blogs.oracle.com/roch/entry/when_to_and_not_to
Basically, if you need performance, go with mirror + load share (a.k.a. RAID 10). RAID 5 has some flaws when rewriting blocks, and the ZFS RAIDZ technology fixes that (to the detriment of performance). Write speed on RAIDZ or mirror should not differ much, but reads will be heavily impacted (see the link above).
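To answer the layout question directly (device names are hypothetical): yes, listing more than one RAIDZ vdev in the same pool makes ZFS stripe across them, which is roughly the RAID 50 idea; the mirror layout is the RAID 10 equivalent:

# two RAIDZ vdevs, striped (RAID 50-like)
zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 raidz c1t3d0 c1t4d0 c1t5d0
# three mirrored pairs, striped (RAID 10-like)
zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0 mirror c1t4d0 c1t5d0

But keep in mind that each RAIDZ vdev still gives you roughly the random-read IOPS of a single disk, so for random I/O the mirrored layout usually wins.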
Here you can see another “masterpiece” from one of the fathers of ZFS, where you can learn more about the RAIDZ implementation:
https://blogs.oracle.com/bonwick/entry/raid_z
You can find more information in my ZFS Internals series; IIRC, Part #8 has a deep dive on RAIDZ.
Hi,
I want to say something about the Oracle ZFS Appliance.
Firstly,
DO NOT BUY AN ORACLE ZFS APPLIANCE IF YOU NEED PERFORMANCE, STABILITY, or an SLA.
If you buy one, I can guarantee you will be in trouble like me.
We bought a 7420… I have been using that storage for nearly 2 years. It kills me. My partners and I very much regret it.
We paid $1M for the 7420, only to get buggy software, slow speeds, and unstable IOPS.
First, we used it with RAID-Z, but it killed us all with its sluggish performance. I needed to use it with iSCSI, but it was slower than one ordinary SAS disk.
There is a cp command problem: if you copy something in a single thread, it is slower than my PC or a USB disk.
We called Oracle, opened an SR, tried everything; there is no solution.
I changed the whole design to get it to a tolerable state:
made the pool mirrored;
set all disk configurations to “latency”;
removed all iSCSI usage;
used only NFS;
used only dNFS (for the Oracle databases).
A $1M storage system gives me 4GB/hour for database queries (best effort), despite all the glorified hardware.
Nowadays, a problem has appeared again: the storage randomly stops all operations for a limited time window. There is no log, no error, no information at all about that situation on the storage. I sent a support bundle to Oracle for investigation, and they also could not find anything. (There is no switch, NIC, or cable problem, because all the servers print logs; I use different switches, protocols, and NICs, and they are all affected.)
If you want to use it only for RMAN backup/restore, go ahead and buy it. For anything else, stay away from it.
This is a recommendation.
peace.
Edit: sorry for the mistake; it is 4GB/minute, not per hour, and that is single-threaded.
Hi Marcelo,
I came across this blog by chance. I was curious to know whether you have ever used DragonFly BSD and its HAMMER filesystem? A number of the things that have been mentioned here, particularly performance-damaging filesystem fragmentation, have been mitigated there by some quite clever algorithms and pruning.
I don’t know where this “raidz = bad performance” comes from. In my experience with raidz, even using 12 drives in raidz2 (all in one vdev), I’ve pushed the performance of all 10 data drives to the max: >1GB/sec writes across all 12, using Xeon X5560 processors and two 3Gb/s SAS Dell MD1000 drive trays in split-bus mode. And that’s using WD Blue 1TB drives with SAS/SATA interposers.
On other systems where I’ve done raidz2 I get similar performance, usually within 80-90% of the max bandwidth of all the drives added together. If a single drive does 150MB/sec read/write, take 12 drives, subtract 2 for raidz2, and I get 1000 to 1250 MB/sec.
IOPS, same thing: multiply the IOPS of a single drive by the total number minus 2 (if doing raidz2) and I get 80-90% of that.
The ONE TIME I had a problem with ZFS was when it made an Oracle redo log inaccessible. Why? Because it had a checksum error – WIN FOR ZFS! Database stopped, but catching that data error was MUCH more important than the database being up.
I’ve used ZFS since its first arrival in Solaris 10 – in some cases I did tune certain things, but never really got any improvement over the already stellar performance. The ONLY things I’ve ever really needed to mess with:
/etc/system:
* allow more concurrent I/Os to be queued per vdev
set zfs:zfs_vdev_max_pending = 32
* raise the sd driver's per-device command throttle to match
set sd:sd_max_throttle = 4096
But that’s all part of system administration – if you’ve never tuned anything for your workload, you’re not a real sysadmin.