A simple tip about FMA
Solaris 10 and OpenSolaris have many new features, and users coming from GNU/Linux, or even from older versions of Solaris, often don't know about some of them. We have heard a lot about ZFS and DTrace, for example, but there is much more!
In my previous post, A contract between OpenSolaris and GNU/Linux users, I talked about another great feature: SMF.
Today we will see a little tip about FMA (Fault Management Architecture). If you think what I will post here is all FMA is about, or if you think this facility is not "as nice as ZFS or DTrace", think again!
FMA is a complex (for the developers ;), enterprise-grade fault management architecture, open source and available to use and learn from in the OpenSolaris OS. Take a look at this simple example: we can monitor hardware faults on our server just by issuing a single command:
# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Apr 01 18:57:21.0621 c5541e9b-ddb1-c214-c2af-e3d358ae1a8e ZFS-8000-FD
As you can see in the output above, there is one fault with the message ID ZFS-8000-FD. Every fault reported by the FMA facility has a message ID, and you can look any of them up at this site. If you look specifically for the message ID ZFS-8000-FD, you will see that the description for that message is:
The number of I/O errors associated with a ZFS device exceeded acceptable levels.
and the FMA subsystem has taken some action for us (Automated Response):
The device has been offlined and marked as faulted. An attempt will be made to activate a hot spare if available.
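By the way, you do not need to browse the site to read this: on a live system, "fmadm faulty" lists the open faults, and on recent builds it prints the same description and automated response inline (a quick check, assuming the fault has not been repaired yet):

# fmadm faulty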
Well, we do not have a hot spare in this test system, so our pool must be in a degraded state. Let's look:
# zpool status
  pool: test
 state: DEGRADED
 scrub: resilver completed after 0h1m with 0 errors on Thu Nov 20 18:03:58 2008
config:

        NAME        STATE     READ WRITE CKSUM
        test        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c0t2d0  REMOVED      0     0     0
            c0t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
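Before we can tell FMA anything, we have to actually replace the faulted disk. A minimal sketch, assuming the replacement comes up under the same device name (c0t2d0); if it gets a new name, pass that as a second argument:

# zpool replace test c0t2d0

ZFS will resilver the new disk into the mirror automatically.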
So, after changing the faulted disk and seeing our pool in good shape again, we must inform the FMA subsystem about our action:
# fmadm repair c5541e9b-ddb1-c214-c2af-e3d358ae1a8e
# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Apr 01 18:57:21.0621 c5541e9b-ddb1-c214-c2af-e3d358ae1a8e ZFS-8000-FD
Apr 01 19:30:20.0122 c5541e9b-ddb1-c214-c2af-e3d358ae1a8e FMD-8000-4M Repaired
We can look up that message ID (FMD-8000-4M) on the Sun site to see its description:
All faults associated with an event id have been addressed.
and the automated response:
Some system components offlined because of the original fault may have been brought back online.
I think it is a good practice (just my opinion, not a recommendation) to rotate the log after that:
# fmadm rotate fltlog
fmadm: fltlog has been rotated out and can now be archived
So…
# fmdump
TIME                 UUID                                 SUNW-MSG-ID
fmdump: /var/fm/fmd/fltlog is empty
OK, you might say, "but I can see the errors in the ZFS pool with the zpool status command, no big deal"… but keep in mind that FMA is a facility for the whole system. Now tell me how you would see this error:
# fmdump
TIME                 UUID                                 SUNW-MSG-ID
Apr 01 00:23:54.7851 79e662bf-d4fb-771e-8553-9876ca7912c5 INTEL-8001-94
If you look up the description for that error, you will see this:
This message indicates that the Solaris Fault Manager has received reports of single bit correctable errors from a Memory Module at a rate exceeding acceptable levels, and a Memory Module fault has been diagnosed. No data has been lost, and pages of the affected Memory Module are being retired as errors are encountered. The recommended service action for this event is to schedule replacement of the affected Memory Module at the earliest possible convenience. The errors are correctable in nature so they do not present an immediate threat to system availability, however they may be an indication of an impending uncorrectable failure mode. Use 'fmadm faulty' to identify the dimm to replace.
And there is more:
# fmdump -v -u 79e662bf-d4fb-771e-8553-9876ca7912c5
TIME                 UUID                                 SUNW-MSG-ID
Apr 01 00:23:54.7851 79e662bf-d4fb-771e-8553-9876ca7912c5 INTEL-8001-94
  100%  fault.memory.intel.dimm_ce
        Problem in: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=testserver:serial=02020208170712cf50:revision=C1/motherboard=0/memory-controller=0/dram-channel=1/dimm=1/rank=3
           Affects: mem:///motherboard=0/memory-controller=0/dram-channel=1/dimm=1/rank=3
               FRU: hc://:product-id=X7DB8:chassis-id=0123456789:server-id=testserver:serial=02020208170712cf50:revision=C1/motherboard=0/memory-controller=0/dram-channel=1/dimm=1
          Location: DIMM2B
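The same repair workflow from the ZFS example applies here: once the module in slot DIMM2B has been physically replaced, we close the loop with FMA (a sketch, using the UUID from the output above):

# fmadm repair 79e662bf-d4fb-771e-8553-9876ca7912c5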
So, show some respect for FMA! ;-)
peace.
Nice entry. One note about the log rotation. I discourage using ‘fmadm rotate’ directly. Used repeatedly, you’ll overwrite historical information. Better to use ‘logadm’ with options to force a rotation of the FMA logs. I’ve written details on that here: https://blogs.sun.com/sdaven/entry/fma_log_files
Thanks for the reply, Scott! Actually, I forgot to mention that I like to keep the errors reported by the fmdump utility on another system. That way I still have the history, and the server's fltlog works just as a "tail" ;-).
But it's nice to know that there is a better approach to rotating the logs that preserves the history.
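For anyone reading this later, a minimal sketch of the approach Scott describes, assuming the stock /etc/logadm.conf entries for the FMA logs are in place (see his post for the details):

# logadm -p now -v /var/fm/fmd/fltlog

This forces an immediate rotation through logadm, so the rotated copies are kept and archived instead of being overwritten.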
Thanks again.