Friday, July 23, 2010

A scheduler for Netapp snapmirrors

Snapmirroring is a good way to protect the data on your Netapp filer: every night, for instance, you replicate all or part of the data on your production filer to a remote hot spare filer. If you want to use an asynchronous snapmirror policy, you have to rely on a kind of cron table to launch your replications.

In my view, though, this cron feature is not a great idea if you want to avoid too many snapmirrors running at the same time (slowing down your network and increasing the latency of your filer). Of course, you can configure the cron table to give some replications enough time to finish before launching new ones. But this turns out to be complicated to manage if you have a lot of volumes or qtrees to replicate. What is more, doing it this way, you may waste part of the night, and replications may then run into business hours, hurting responsiveness during peak hours.

To cope with these problems, I decided to write a scheduler that takes a list of volumes/qtrees to replicate and ensures that only a certain number of replications (and thus a bounded amount of bandwidth) run at the same time. When a slot is freed, it launches the next replication in the list. If you know Veritas Netbackup, you'll see where I got my inspiration.
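The slot idea can be sketched in a few lines of perl. This is only an illustration, not the code of the script: the names start_transfer, poll_running and run_queue are made up, and the filer is replaced by a counter that pretends each transfer needs a given number of polling rounds.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the filer: each transfer needs a number of
# polling rounds before it finishes. The real script asks the filer instead.
my %rounds_left;

sub start_transfer {
    my ($name, $rounds) = @_;
    $rounds_left{$name} = $rounds;
}

# Advance the simulation by one polling round and return the names of the
# transfers that are still running afterwards.
sub poll_running {
    for my $name (keys %rounds_left) {
        delete $rounds_left{$name} if --$rounds_left{$name} <= 0;
    }
    return sort keys %rounds_left;
}

# Run the queue in order, never keeping more than $max_slots transfers active.
# Returns a list of [round, name] pairs recording when each transfer started.
sub run_queue {
    my ($max_slots, @queue) = @_;
    my @started;
    my @running;
    my $round = 0;
    while (@queue || @running) {
        # Fill every free slot with the next entry of the list.
        while (@queue && @running < $max_slots) {
            my $job = shift @queue;
            start_transfer($job->{name}, $job->{rounds});
            push @running, $job->{name};
            push @started, [$round, $job->{name}];
        }
        @running = poll_running();   # a real scheduler would sleep between polls
        $round++;
    }
    return @started;
}

my @log = run_queue(2,
    { name => 'vol1', rounds => 3 },
    { name => 'vol2', rounds => 1 },
    { name => 'vol3', rounds => 2 },
);
print "round $_->[0]: started $_->[1]\n" for @log;
```

With 2 slots, vol1 and vol2 start immediately and vol3 only starts once vol2 has finished and freed its slot.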

The script is written in perl and you'll need the Netapp ONTAPI APIs (well, in fact just the perl part). You can download it here.

Install the scheduler :
mkdir /opt/netapp-manageability-sdk
tar xvfz /tmp/snapmirrorScheduler.tar.gz -C /opt/netapp-manageability-sdk
In the /opt/netapp-manageability-sdk/prod directory, install the ONTAPI API (just the perl part, don't bother with C, java etc). Then, you may need to patch a file in the API : /opt/netapp-manageability-sdk/prod/lib/perl/NetApp/ The invoke function of my API version did not handle the case where it receives a reference to an object as an argument instead of a scalar. So here are the changes you may need to apply :
diff -r1.1
< $xi->child_add(new NaElement($key, $value));
> unless( ref($value)){ # we have a scalar variable
> $xi->child_add(new NaElement($key, $value));
> } else { # we have a reference object variable : must be treated in another way (bug from Netapp)
> my $newElement = new NaElement($key);
> $newElement->child_add( $value);
> $xi->child_add( $newElement);
> }
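To see what the patch changes, here is a small self-contained sketch of the same branching. FakeElement is a made-up stand-in for NaElement (it only mimics new and child_add, plus a to_string helper I added for printing), and add_argument is a hypothetical name for the patched part of invoke.

```perl
use strict;
use warnings;

# Hypothetical stand-in for NaElement, only here to show the branching logic.
package FakeElement;

sub new {
    my ($class, $name, $content) = @_;
    return bless { name => $name, content => $content, children => [] }, $class;
}

sub child_add {
    my ($self, $child) = @_;
    push @{ $self->{children} }, $child;
    return $self;
}

# Serialize the element and its children as a simple XML-like string.
sub to_string {
    my ($self) = @_;
    my $inner = defined $self->{content} ? $self->{content} : '';
    $inner .= $_->to_string for @{ $self->{children} };
    return "<$self->{name}>$inner</$self->{name}>";
}

package main;

# Same test as in the patch: a plain scalar becomes a leaf element, while a
# reference (an already built element) is wrapped in a new element named $key.
sub add_argument {
    my ($parent, $key, $value) = @_;
    unless (ref $value) {
        $parent->child_add(FakeElement->new($key, $value));
    } else {
        my $wrapper = FakeElement->new($key);
        $wrapper->child_add($value);
        $parent->child_add($wrapper);
    }
    return $parent;
}

my $request = FakeElement->new('some-request');
add_argument($request, 'name', 'myVol13');                           # scalar
add_argument($request, 'options', FakeElement->new('option', 'x'));  # object
print $request->to_string, "\n";
```

Without the ref test, the object would be passed where a string is expected, which is the case the patch takes care of.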

Now, let's give the script the root password of your filers. Look for the file lib/ and replace the TOCHANGE string with your password. Snapmirrors are launched on the filers that receive the replication, so it should be the password of those filers. I assumed that all the filers share the same password; if your configuration differs, you'll have to change the constructor block a bit.

Finally, you have to edit the configuration files. The tarball contains two of them : etc/scheduleSnapMirror_nas01a-ibm.conf and etc/scheduleSnapMirror_nas01b-ibm.conf. The files you create must be named etc/scheduleSnapMirror_<filer-hostname>.conf, and there must be one configuration file for each receiving filer. Each line describes one replication you want to launch. Replications at the top of the file are launched first and the ones at the bottom last. You can replicate just qtrees, whole volumes, or a mix of both.
To fully understand the syntax of the configuration file, you must know that in my case, I have 2 main filers, nas1a and nas1b, and 2 filers in my backup data center, nas01a-ibm and nas01b-ibm. nas1a replicates to nas01a-ibm and nas1b replicates to nas01b-ibm. What is more, if I have a volume myVol13 on nas1a, the replicated volume on nas01a-ibm is named R_myVol13. This allows me to keep my replication lines short (if you have different conventions, you may need to hack the code a bit, as explained later).

And now, it should work! Just execute the script :
/opt/netapp-manageability-sdk/bin/ --verbose nas01a-ibm
And you should see the first replications beginning.
If everything's OK, you can put this command in cron to run it every night, for instance :

mkdir /var/log/netapp
cat > /etc/cron.d/netapp <<-EOF
# schedule snapmirror execution on Netapp in order not to launch too many replications at the same time
03 22 * * * root /usr/local/bin/ nas01a-ibm
05 22 * * * root /usr/local/bin/ nas01b-ibm
EOF

Some explanations about the script :
One function you might want to change is maxTransfersAllowed. It defines how many transfers are allowed at the same time. I defined a business-hours policy (lowTransfer variable) and an out-of-business-hours policy (fullTransfer variable).
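As a rough idea of what such a policy can look like, here is a minimal sketch. Only the names maxTransfersAllowed, lowTransfer and fullTransfer come from the script; the values (2 and 8) and the business hours (08:00 to 19:00) are assumptions chosen for the example, and the real function may be written differently.

```perl
use strict;
use warnings;

# Assumed values: only the variable names come from the script.
my $lowTransfer  = 2;   # simultaneous transfers during business hours
my $fullTransfer = 8;   # simultaneous transfers outside business hours

# Assumed business hours: from 08:00 (included) to 19:00 (excluded).
my ($business_start, $business_end) = (8, 19);

# Return how many simultaneous transfers are allowed for a given hour (0-23).
# Without an argument, the current local hour is used.
sub maxTransfersAllowed {
    my ($hour) = @_;
    $hour = (localtime)[2] unless defined $hour;
    my $in_business = $hour >= $business_start && $hour < $business_end;
    return $in_business ? $lowTransfer : $fullTransfer;
}

printf "%02d:00 -> %d transfers\n", $_, maxTransfersAllowed($_) for (3, 10, 22);
```

The scheduler only has to call such a function before each new launch, so a long replication list automatically slows down when the morning begins.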
If you use the verbose mode, you'll find a lot of information about your transfers in the /tmp/.scheduleSnapMirror_<filer>-ibm.debug file. Every 5 minutes, the script writes to that file which transfers are running and how many bytes have been transferred.
You will also see many eval constructs in the script, there to catch errors. I did that because I sometimes get XML serialization problems that I never really understood. What is more, you don't want your replication policy to stop if you lose network connectivity for a few seconds.
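The pattern looks roughly like the sketch below. with_retry is a hypothetical helper written for this example (the script itself just wraps its calls in eval), and the failing call is simulated instead of talking to a filer.

```perl
use strict;
use warnings;

# Run $action and retry up to $attempts times when it dies, waiting $delay
# seconds between attempts. Returns the action's result, or undef if every
# attempt failed.
sub with_retry {
    my ($attempts, $delay, $action) = @_;
    for my $try (1 .. $attempts) {
        my $result;
        # eval returns 1 only if the block ran to the end without dying.
        my $ok = eval { $result = $action->(); 1 };
        return $result if $ok;
        my $err = $@ || 'unknown error';
        chomp $err;
        warn "attempt $try failed: $err\n";
        sleep $delay if $delay && $try < $attempts;
    }
    return;
}

# Simulated API call: fails twice (as if the network dropped), then succeeds.
my $calls  = 0;
my $status = with_retry(3, 0, sub {
    $calls++;
    die "connection lost\n" if $calls < 3;
    return 'transfer status ok';
});
print "got '$status' after $calls calls\n";
```

An error is logged and the loop simply tries again, so a short network cut does not kill the whole night of replications.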
By default, the transfer rate (maxRate) is 8704 kb/s; you may want to change it.
The last thing that can be interesting to change is the launchEasyTransfer function. There, you'll see the code that links the origin volume (filerA:volume) to the destination volume (filerA-ibm:R_volume). You may need to adapt it to your environment.
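To give an idea of the kind of mapping involved, here is a minimal sketch based on my naming convention. It is not the code of launchEasyTransfer: replication_pair and the %source_of table are made up for the example, and it only handles whole volumes, not qtrees.

```perl
use strict;
use warnings;

# Destination filer => source filer, following the setup described above.
# Adapt this table to your own hostnames.
my %source_of = (
    'nas01a-ibm' => 'nas1a',
    'nas01b-ibm' => 'nas1b',
);

# Build the (source, destination) locations for a volume, using the R_ prefix
# convention for the replicated volume on the receiving filer.
sub replication_pair {
    my ($dest_filer, $volume) = @_;
    my $src_filer = $source_of{$dest_filer}
        or die "no source filer known for $dest_filer\n";
    return ("${src_filer}:${volume}", "${dest_filer}:R_${volume}");
}

my ($src, $dst) = replication_pair('nas01a-ibm', 'myVol13');
print "$src -> $dst\n";   # prints "nas1a:myVol13 -> nas01a-ibm:R_myVol13"
```

If your replicated volumes use another prefix, or keep the same name on both sides, this is the only place that has to change.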