Solaris 10 u3 – SC 3.2 ZFS/NFS HA with NON-shared disks using AVS (part II)
First of all, I would like to thank Suraj Verma, Venkateswarlu Tella, Venkateswara Chennuru, and the great OpenSolaris community.
If you do not have the AVS setup already, take a look at part I of this howto. Ok, let’s go..
First, we will need to create a custom agent (Resource Type), because the SUNW.HAStoragePlus resource needs global devices for HA, and this howto is about HA using local devices.
So, let's create the Resource Type named: MRSL.NONsharedDevice.
We will use the directory /usr/scnonshared:
# mkdir /usr/scnonshared
In that directory let’s create a file named MRSL.NONsharedDevice.rt with the following lines:
##############################################
# Resource Type: NONsharedDevice             #
# byLeal                                     #
# 17/ago/2007                                #
##############################################
RESOURCE_TYPE = NONsharedDevice;
RT_DESCRIPTION = "NON-Shared Device Service";
VENDOR_ID = MRSL;
RT_VERSION = "1.0";
RT_BASEDIR = "/usr/scnonshared/bin";
START = nsd_svc_start;
STOP = nsd_svc_stop;

{
PROPERTY = Zpools;
EXTENSION;
STRINGARRAY;
TUNABLE = AT_CREATION;
DESCRIPTION = "The Pool name";
}

{
PROPERTY = Mount_Point;
EXTENSION;
STRINGARRAY;
TUNABLE = AT_CREATION;
DESCRIPTION = "The mount point";
}
Let’s go through some lines of that file:
p.s: Lines starting with “#” are just comments…
1) RESOURCE_TYPE: Here we have the name of our brand new Resource Type (agent).
2) RT_BASEDIR: That's the base directory of the Resource Type, where the start/stop methods "must" be located.
3,4) START/STOP: The two methods that we need to implement to mount/unmount our pool/fs. The filenames will be "nsd_svc_start" and "nsd_svc_stop".
Here you can download the two perl scripts: nsd_svc_start, nsd_svc_stop.
Ok, ok, I'm not a "real" Perl hacker, but I think the two scripts do the job… they are basically the same, and I don't know if we can use a single script for both the start and stop methods. But at least we could put the shared functions in an "include" file.. feel free to enhance it!
p.s: You will need to change the string “primaryhostname” in both scripts to the name of your server (master of the sndr volumes).
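For readers who want to see the idea before downloading anything, here is a minimal sketch (in plain shell, not the actual Perl, and ignoring all the AVS state handling the real nsd_svc_start performs) of what a start method of this kind does. The option letters are the ones SC passes to the methods, and Zpools is the extension property from the RT file above:

#!/usr/bin/ksh
# Simplified sketch only - the real nsd_svc_start also checks the SNDR
# replica state and runs the reverse sync when needed.

while getopts 'R:T:G:' opt; do
    case $opt in
        R) RESOURCE=$OPTARG ;;   # resource name
        T) RESTYPE=$OPTARG ;;    # resource type
        G) GROUP=$OPTARG ;;      # resource group
    esac
done

# For extension properties, scha_resource_get prints the type on the
# first line and the value(s) after it; here we assume a single pool.
POOL=`scha_resource_get -O Extension -R $RESOURCE -G $GROUP Zpools | tail -1`

# Import the pool only if it is not already imported on this node.
zpool list $POOL > /dev/null 2>&1 || zpool import -f $POOL
exit $?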
The sections between "{}" are extension properties that we need to customize each resource that we will create from this Resource Type:
1) Zpools: Like in the SUNW.HAStoragePlus RT, we need this extension to associate the ZFS pool with the MRSL.NONsharedDevice resource.
2) Mount_Point: We need this extension to mount the ZFS pool/fs, because we must set the ZFS mountpoint property to "legacy" for each ZFS pool that we want to use in an HA solution with "local" devices (see the example just below).
Both extensions must be provided “AT_CREATION” time, and there is no default.
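Just as a reminder of what that legacy setting looks like (assuming the pool and mount point names used later in this howto), it is something like:

# zfs set mountpoint=legacy POOLNAME
# mount -F zfs POOLNAME /dir1/dir2/POOLNAME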
If you change the Perl scripts, keep in mind that the important point is the control of AVS's synchronization/replication. As Jim Dunham said to me:
“Golden Rule of AVS and ZFS:”
When replicating data between Solaris hosts, never allow ZFS to have a storage pool imported on both nodes when replication is active, or becomes active. ZFS does not support shared writer, and thus if ZFS on one node, sees a replicated data block from ZFS on another node, it will panic Solaris.
and more..
Now in a failback or switchback scenario, you do have a decision to make. Do you keep the changes made to the SNDR secondary volume (most likely), or do you discard any changes, and just continue to reuse the data on the SNDR primary volume (least likely). The first thing that needs to be done to switchback, which is automatically provided by the NFS resource group, is to ZFS legacy unmount the ZFS storage pool on the SNDR secondary node.
– If you want to retain the changes made on the SNDR secondary, in a script perform a “sndradm -n -m -r
– If you want to dispose of the changes made on the SNDR secondary, do nothing.
Next allow the NFS resource group to ZFS legacy mount the ZFS storage pool on the SNDR primary node, and now you are done.
…
In this HOWTO we will "retain the changes made on the SNDR secondary", so the stop/start scripts must handle that. If you are implementing a solution that does not need to retain the changes, you can edit the scripts and remove the sndradm lines. Remember that you need to put those files in the directory "/usr/scnonshared/bin/" (the RT_BASEDIR), rename them to "nsd_svc_start" and "nsd_svc_stop", and give them execution permissions.
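Just to make the "retain the changes" case concrete, a hedged sketch of the reverse-sync step, run on the SNDR primary node, using the same group name as the rest of this howto (a full reverse copy would use -m -r instead of -u -r; the actual scripts remain the reference):

# sndradm -C local -g POOLNAME -n -u -r
# sndradm -C local -g POOLNAME -n -w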
You can run the start/stop scripts on the command line to test them on your system. You will need to create two directories on both nodes:
# mkdir /var/log/scnonshared
# mkdir -p /var/scnonshared/run
Be sure that the ZFS pool is unmounted and exported, also, keep in mind the “AVS/ZFS golden rule”… You can try the scripts running a command similar to:
/usr/scnonshared/bin/nsd_svc_start -R poolname-nonshareddevice-rs -T MRSL.NONsharedDevice -G poolname-rg
and
/usr/scnonshared/bin/nsd_svc_stop -R poolname-nonshareddevice-rs -T MRSL.NONsharedDevice -G poolname-rg
The SC subsystem will call the scripts with the options we have used above…
p.s: I think this is information I should be able to find without having to write a script that prints "$@". It would be nice to find it in the scha_* man pages…
The scripts will log everything on screen, and in the log files named: nsd_svc_stop.`date`.log and nsd_svc_start.`date`.log…
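Before and after each test run, two quick sanity checks help (just a suggestion): the pool should not be imported on the node that is supposed to be idle, and the replica state should be what you expect:

# zpool list
# sndradm -P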
Ok, after `cd /usr/scnonshared`, we can register our new RT:
# clresourcetype register -f MRSL.NONsharedDevice.rt MRSL.NONsharedDevice
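If you want to double-check the registration, the new RT and its methods should show up with:

# clresourcetype show MRSL.NONsharedDevice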
…and configure the whole resource group:
# clresourcegroup create -p PathPrefix=/dir1/dir2/POOLNAME/HA poolname-rg
# clreslogicalhostname create -g poolname-rg -h servernfs servernfs-lh-rs
# clresource create -g poolname-rg -t MRSL.NONsharedDevice -p Zpools=POOLNAME -p Mount_Point=/dir1/dir2/POOLNAME poolname-nonshareddevice-rs
# clresourcegroup online -M poolname-rg
Now the resource group associated with the ZFS pool (POOLNAME) is online, and hence the ZFS pool too. Before configuring the resource for the NFS services, you will need to create a file named dfstab.poolname-nfs-rs in the directory /dir1/dir2/POOLNAME/HA/SUNW.nfs/.
p.s: As you know, dfstab is the file containing commands for sharing resources across a network (NFS shares; "man dfstab" for information about the syntax of that file).
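Just as an illustration (choose the share options that make sense for your environment), a minimal dfstab.poolname-nfs-rs could contain a single line like:

share -F nfs -o rw /dir1/dir2/POOLNAME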
To satisfy the SC "validate" check when configuring the SUNW.nfs resource, we will need to have the ZFS pool mounted on both nodes. So let's put the AVS software in logging mode (on the primary node):
# sndradm -C local -g POOLNAME -n -l
After that, we can import the pool (on secondary node):
# zpool import -f POOLNAME
Now we can proceed with the cluster configuration:
# clresource create -g poolname-rg -t SUNW.nfs -p Resource_dependencies=poolname-nonshareddevice-rs poolname-nfs-rs
So now we can unmount/export the ZFS pool on the secondary node, put the AVS software back in replication mode, and go to the last step: bring all the resources online.
On the secondary node:
# umount /dir1/dir2/POOLNAME
# zpool export POOLNAME
On primary node:
# sndradm -C local -g POOLNAME -n -u
# sndradm -C local -g POOLNAME -n -w
# clresourcegroup online -M poolname-rg
That’s it, you can test the failover/switchback scenarios with the commands:
# clnode evacuate <nodename>
The command above will take all the resources from that node and bring them online on the other cluster node.. or you can use the clresourcegroup command to switch the resource group to a specific host (nodename):
# clresourcegroup switch -n <nodename> poolname-rg
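Either way, you can watch where the resource group ends up with:

# clresourcegroup status poolname-rg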
WARNING: There is a "timer" (60 seconds by default) that keeps resource groups from switching back onto a node after the resources have been evacuated from it (man clresourcegroup).
If you need to undo all the configuration (if something goes wrong), here is the step-by-step procedure:
# clresourcegroup offline poolname-rg
# clresource disable poolname-nonshareddevice-rs
# clresource disable poolname-nfs-rs
# clresource delete poolname-nfs-rs
# clresource delete poolname-nonshareddevice-rs
# clresource disable servernfs-lh-rs
# clresource delete servernfs-lh-rs
# clresourcegroup delete poolname-rg
I think we can enhance this procedure using a point-in-time copy of the data, to avoid "inconsistency" issues during the synchronization task… but that is something I will leave for you to comment on! That's all..
Edited by MSL (09/24/2007):
This "Agent" is changing over time (for the better, I guess :), and I will use the comments section like a "Changelog". So, if you want to implement this solution, I recommend you read the comments section and see if there are any changes to the above procedure. The "stop" and "start" scripts are always at permanent links, and the updated RT file can be downloaded here.
Very cool and pretty useful I am sure.
Regarding your comment "It would be nice to find it in the scha_* man pages…", I guess you can find this information in rt_callbacks(1HA) or in the Developer's Guide (link found at https://opensolaris.org/os/community/ha-clusters/ohac/Documentation)
Suraj Verma commented something to me that I think is very relevant, and because of that I'm posting it here:
"You say that the extension properties can only be set when creating the resource (AT_CREATION). However I guess it will be safe to have them modified when the resource is disabled (WHEN_DISABLED). This will give you the flexibility of adding/removing mountpoints without deleting the resource".
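If the properties were indeed declared TUNABLE WHEN_DISABLED as Suraj suggests, changing one of them would be something along these lines (a sketch only; the property value shown is just a placeholder):

# clresource disable poolname-nonshareddevice-rs
# clresource set -p Mount_Point=/dir1/dir2/OTHERPOOL poolname-nonshareddevice-rs
# clresource enable poolname-nonshareddevice-rs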
The first "major" change in that procedure is that the "Mount_Point" extension was removed from the RT file. I have decided to control the mount/umount operation using just "zpool import/export". I added three other properties to the Agent:
1 – Sync_Mode: So the admin can set the AVS/sndr synchronization to full, or update (default).
2 – Last_Started: We need to know whether we need to run a reverse sync or not when the replica state is "logging".
3 – SNDR_primary: We need to know which host is the SNDR primary, and I think here is the best place to keep that.
So, there is one line in the procedure that needs to be changed from:
# clresource create -g poolname-rg -t MRSL.NONsharedDevice -p Zpools=POOLNAME -p Mount_Point=/dir1/dir2/POOLNAME poolname-nonshareddevice-rs
to:
# clresource create -g poolname-rg -t MRSL.NONsharedDevice -p Zpools=POOLNAME -p SNDR_primary=yourprimaryhostname -p Sync_Mode=UPDATE poolname-nonshareddevice-rs
That’s all, for now…
Methods:
https://www.eall.com.br/hp/Solaris/nsd_svc_start
https://www.eall.com.br/hp/Solaris/nsd_svc_stop
https://www.eall.com.br/hp/Solaris/nsd_monitor_start
https://www.eall.com.br/hp/Solaris/nsd_monitor_stop
https://www.eall.com.br/hp/Solaris/nsd_probe
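For reference, and assuming the filenames above, the monitor methods are declared in the RT file with the standard rt_reg keywords (nsd_probe is typically launched by the monitor start method itself, not declared in the RT file):

MONITOR_START = nsd_monitor_start;
MONITOR_STOP = nsd_monitor_stop;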
I was getting dependency errors trying to bring the resource group online. So, talking to the devel team (thanks to Suraj Verma), I was pointed to the URL:
https://docs.sun.com/app/docs/doc/817-4227/6mjp2sind?l=en&a=view (Deciding Which Start and Stop Methods to Use).
Like the title suggests, there are variables to take into account when deciding which start/stop methods to use in a specific Data Service.
"Services that use network address resources might require that start or stop steps be done in a particular order that is relative to the logical hostname address configuration. The optional callback methods Prenet_start and Postnet_stop allow a resource type implementation to do special start-up and shutdown actions before and after network addresses in the same resource group are configured to go up or configured to go down."
And more
"When deciding whether to use the Start, Stop, Prenet_start, or Postnet_stop methods, first consider the server side. When bringing online a resource group containing both data service application resources and network address resources, the RGM calls methods to configure the network addresses to go up before it calls the data service resource Start methods. Therefore, if a data service requires network addresses to be configured to go up at the time it starts, use the Start method to start the data service."
So, I had to create two new methods:
https://www.eall.com.br/hp/Solaris/nsd_prenet_start
https://www.eall.com.br/hp/Solaris/nsd_postnet_stop
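In RT file terms, and assuming the same RT_BASEDIR, that means adding the standard keywords:

PRENET_START = nsd_prenet_start;
POSTNET_STOP = nsd_postnet_stop;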
Here you can download the Solaris package tarball.
What happens if one of the secondary node's disks fails? It looks like you're opening yourself up to a scenario where NODE2 is not in a good state (it would be degraded if switching over to it).
What if the primary NODE1 detects a bad disk and starts to resilver? That will generate a LOT of IO across this AVS solution.
My question is, is it better to forget about zfs and use this solution with a UFS meta device?
Hello Jermy, thanks for your feedback!
First, what you are describing is a scenario a two-node cluster is not supposed to handle (two failures). The SC software is supposed to handle one failure, and do its best with two or more. What you are saying is: "The first node crashed, and you don't have the second one up and running to take over the services". That situation will not work in any scenario… Imagine that instead of the disk failing, the CPU on the second node fails. Keep in mind that for a resource group to switch over, there needs to be one node ready to bring that resource group online.
Another thing: in the scenario you have described, the resource would still be online because it is a mirror, so one disk is enough. :)
The ZFS resilver feature has nothing to do with this solution. The problem with a ZFS resilver/scrub is the IO on the node where the disks are, because that will impact performance on the host where the service is running. And you can choose to have a dedicated interface for the AVS replication.
Last thing: ZFS resilver/scrub is a feature. Remember that… maybe you get a performance hit, but you will know about "silent data corruption", which UFS and other RAID solutions won't tell you about.
Thanks again.
Hi, interesting post. I'll write you later with a few questions!
I agree with the other comments! Adding this to my favorites!
Thanks a lot) Very helpful =-*
Thanks) there is something interesting here))
Exactly what I was looking for, thank you very much!
Very interesting and informative site! Good job done by you guys, Thanks
Hi,
this is a fantastic agent, and I have it up and running so far flawlessly. However, when my cluster restarts AVS appears to lose its configuration?
if I run "sndradm -C local -g POOLNAME -n -u" I'm told:
Remote Mirror: avs1 /dev/rdsk/c8t1d0s0 /dev/rdsk/c8t1d0s1 avs2 /dev/rdsk/c8t1d0s0 /dev/rdsk/c8t1d0s1
sndradm: warning: SNDR: /dev/rdsk/c8t1d0s0 ==> /dev/rdsk/c8t1d0s0 not already enabled
dsstat also returns nothing…
so far my workaround is to run, on both nodes:
dscfgadm -d
rm /etc/dscfg_cluster
rm /etc/dscfg_local
echo "/dev/did/rdsk/d2s0" > /etc/dscfg_cluster
dscfgadm
and then syncing the primary to the slave again.. obviously not ideal :(
any suggestions??
and cheers for this awesome agent… would be nice if the AVS community got behind this!!
Hello Chris,
This is an *old* problem with no fix that I know of. If you do a search on Google about the AVS configuration getting lost, you will see a lot of emails from me. ;-)
You could ask again on the OHAC mailing list with a CC to the storage list. I did talk with Jim Dunham about this behaviour, without luck.
Thanks for your comment!
from all the googling I did, I found a lot of emails from you and a lot of pushback from devs saying it wasn't a problem. However, here I am with the same problem as you, using OpenSolaris 2009.06.. :(
I may give this a shot with the latest Solaris release and the packages from the Sun website, to see if it's just an OpenSolaris issue.
Such a shame that this problem exists, as your solution is fantastic providing AVS works, which it does until a node restarts!!!
I gather your current agent code is here: "https://www.eall.com.br//hp/Solaris/MRSLnonshareddevice-2.2.tar.gz"?
once again… awesome work with the agent!
Well, I tried using the supported packages from the Sun website and I have the same problem…
What I have noticed is that prior to running sndradm, if I run dscfg -l -s /dev/did/rdsk/d5s0 | grep -v "#", the output is empty. Once I set up the replication with sndradm and rerun the dscfg command, it returns one line with setid: 1 setid-ctag
I would have thought it would return the same as /etc/dscfg_local, which in my case has:
cm: 128 64 - - - - - - -
sndr: nas1 /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t1d0s1 nas2 /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t1d0s1 ip sync tank1 - setid=1; -
sv: /dev/rdsk/c1t1d0s0 - -
sv: /dev/rdsk/c1t1d0s1 - -
dsvol: /dev/rdsk/c1t1d0s0 - sndr
dsvol: /dev/rdsk/c1t1d0s1 - sndr
So perhaps the problem isn't that the configuration is lost, but that the configuration isn't even being written to the cluster database on the shared disk???
Wonder how I could manually get that info onto the shared storage.. I did try dd if=/etc/dscfg_local of=/dev/did/rdsk/d5s0 bs=512k count=11 but without any luck heh
maybe you have some ideas? :)
Apparently I can update the contents of the cluster database via dscfg -C - -a file
however this appears to be only updating dscfg_local
arggh :(
time to give up…
Hello Chris!
Yes, the version 2.2 is the current agent version (a.k.a Smith) ;-)
That was the version I used in the presentation last year at the OHAC Summit (San Francisco, California, 2009).
Leal
Thanks for a wonderful post, I've been looking for such information, I will join your RSS feed now.
You can definitely see your enthusiasm in the work you write. The world hopes for even more passionate writers like you who aren't afraid to say what they believe. Always follow your heart.