Over the last few weeks I have received a lot of requests about what Veeam and NetApp designs could look like for customers of different sizes. The combination of Veeam and NetApp is all about minimizing RPO and RTO in the modern datacenter. It combines NetApp’s storage-based snapshot and replication features (Snapshot, SnapVault and SnapMirror) with the orchestration and granular restore capabilities of Veeam. Additionally, the combination minimizes management overhead and eliminates dependencies.
If we talk about different designs, we should first define what the main goal of each design is. That’s why I defined the following requirements for this article:
– SMB design (RTO > 24 hrs. and RPO > 24 hrs.)
– MID design (RTO < 24 hrs. and RPO < 24 hrs.)
– ENT design (RTO 0-2 hrs. and RPO 0-1 hrs.)
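To make the three tiers easier to apply, here is a minimal Python sketch that maps a pair of recovery targets to a design. The function name `pick_design` is hypothetical. How to treat targets of exactly 24 hrs. or mixed cases (for example a low RTO with a high RPO) is my assumption, since the list above does not define them; the sketch falls back to SMB.

```python
def pick_design(rto_hours: float, rpo_hours: float) -> str:
    """Map recovery targets (in hours) to one of the three design tiers."""
    if rto_hours < 0 or rpo_hours < 0:
        raise ValueError("RTO and RPO must be non-negative")
    # ENT: RTO 0-2 hrs. and RPO 0-1 hrs.
    if rto_hours <= 2 and rpo_hours <= 1:
        return "ENT"
    # MID: both targets below 24 hrs.
    if rto_hours < 24 and rpo_hours < 24:
        return "MID"
    # Assumption: everything else (including exactly 24 hrs. and
    # mixed cases the list does not cover) is treated as SMB.
    return "SMB"


for targets in [(1, 0.25), (8, 4), (48, 48)]:
    print(targets, "->", pick_design(*targets))
```

A customer who can tolerate one hour of downtime and 15 minutes of data loss lands in ENT, while a branch office with two-day targets lands in SMB.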
Looking at IT today, my feeling is that more than 70% of customers need at least the MID design and more than 50% require an enterprise solution, as it is unacceptable for them to be offline or to lose any data in case of a disaster.
At the SMB level, where you don’t need a low RPO or RTO, a combination of NetApp E-Series as the source and E-Series as the repository for the Veeam server is a valid design. With that you get a high-performing and very stable primary storage, for example with SAS drives, for your VMware environment, and a Veeam server that takes care of all backup and restore tasks. As the target, it is best to use a high-density storage system such as an E-Series with a set of NL-SAS drives as the Veeam repository. This design is mostly for SMB customers or branch offices where there is no need for high availability. In this scenario the restore capabilities in case of a disaster are also limited, and it will take a lot of time until you are fully operational again. Veeam provides single-item restore capabilities and can restore several VMs instantly by booting them directly from the repository, but in case of a disaster you still need to restore all VMs back to new hardware, and this will take a lot of time. Another disadvantage is that there is no optimized snapshot handling or Backup from Storage Snapshot available with NetApp E-Series and Veeam.
The MID solution is a combination of NetApp FAS and Veeam. In this scenario you have a primary NetApp FAS as your storage for VMware and a Veeam server with a direct-attached E-Series as the backup target. The benefit is that you can now leverage the Veeam integration into NetApp’s Data ONTAP. The first benefit is that you can back up the data directly from a NetApp storage snapshot. This takes a lot of load off the VMware environment and greatly reduces the VMware snapshot handling problem by minimizing the time a VMware snapshot needs to stay open during the backup. You can find more about Backup from Storage Snapshot in my previous posts. Secondly, you can combine the agentless, consistent storage snapshots created by Veeam with NetApp crash-consistent snapshots to minimize the RPO. As Veeam can restore from snapshots even if it did not create them itself, this is a great way to improve your RPO.
If you need to restore files or even VMs, you can use the NetApp snapshots as the source for the restore. With that you optimize the RTO of VMs and files, as there is no performance issue with NetApp snapshots. But as in the SMB design, in case of a disaster you still need to restore the whole environment back to new systems and servers before you are back online. As there is no replication in this design, the disaster RTO is still very high.
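To illustrate why mixing Veeam-created snapshots with NetApp crash-consistent snapshots improves the RPO, here is a small Python sketch that computes the longest stretch without a restore point. The schedules (Veeam every 6 hours, NetApp every hour), the date and the helper name `largest_gap` are made up for illustration and are not taken from a real configuration.

```python
from datetime import datetime, timedelta
from typing import List


def largest_gap(restore_points: List[datetime], window_end: datetime) -> timedelta:
    """Return the longest stretch without a restore point, up to window_end."""
    points = sorted(restore_points)
    if not points:
        raise ValueError("need at least one restore point")
    # Gaps between consecutive restore points ...
    gaps = [later - earlier for earlier, later in zip(points, points[1:])]
    # ... plus the gap between the last point and the end of the window.
    gaps.append(window_end - points[-1])
    return max(gaps)


day = datetime(2024, 1, 1)  # arbitrary example day
end = day + timedelta(hours=24)

# Hypothetical schedules, for illustration only.
veeam = [day + timedelta(hours=h) for h in range(0, 24, 6)]   # consistent
netapp = [day + timedelta(hours=h) for h in range(0, 24)]     # crash consistent

veeam_only = largest_gap(veeam, end)
combined = largest_gap(veeam + netapp, end)
print("Veeam snapshots only:", veeam_only)
print("Veeam + NetApp snapshots:", combined)
```

With these example schedules the largest gap shrinks from six hours to one hour, with the caveat that the additional restore points are only crash consistent.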
In today’s always-on business, a modern datacenter requires more than just a bit of storage and a few servers. It is all about having availability capabilities in several layers. Looking at the design above, it is all about combining the functionality of NetApp storage systems with the features of Veeam. At the primary datacenter, a NetApp MetroCluster can make sure that in case of a disaster no data is lost (RPO=0) and the applications can access the data with no outage (RTO=0), as the data is synchronously mirrored between two sites. MetroCluster is certainly the solution that provides the highest level of availability in case of a disaster, but you can also use a regular NetApp cDOT cluster on the primary side if MetroCluster is not possible. In a secondary location you have another NetApp FAS system that serves as the SnapVault and/or SnapMirror destination for your primary NetApp. Finally, there is a Veeam Backup & Replication server, either at your secondary site or at a third location.
The Veeam server is the central orchestration tool for every backup, restore, Snapshot, SnapMirror or SnapVault activity within the whole design. It creates an application-consistent VMware snapshot, followed by a volume Snapshot on ONTAP. Right after this, the VMware snapshot is deleted, as you now have the application-consistent state at the ONTAP level. As soon as the VMware snapshot is committed, a SnapVault or SnapMirror update can be triggered to transfer the data from the MetroCluster directly to the secondary NetApp. There it can either be kept at the Snapshot level or, in version 9, be used as the source for a Backup from Storage Snapshot. In this scenario the primary storage is completely unaffected from a performance perspective during the backup, as everything is transferred from the secondary NetApp ONTAP system.
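The order of the orchestration steps described above can be sketched in a few lines of Python. This is only an illustration of the sequence: the function `run_backup` and the VM, volume and system names are hypothetical, and nothing here calls a real Veeam or ONTAP API.

```python
from typing import List


def run_backup(vm: str, volume: str, secondary: str) -> List[str]:
    """Return the ordered steps of one orchestrated backup run (illustration only)."""
    steps: List[str] = []
    # 1. Application-consistent VMware snapshot.
    steps.append(f"create VMware snapshot of {vm}")
    # 2. Volume Snapshot on ONTAP captures that consistent state.
    steps.append(f"create ONTAP Snapshot of {volume}")
    # 3. The VMware snapshot is deleted right away, so it stays open only briefly.
    steps.append(f"delete VMware snapshot of {vm}")
    # 4. SnapVault or SnapMirror update moves the data to the secondary system.
    steps.append(f"update SnapVault/SnapMirror {volume} -> {secondary}")
    # 5. The backup reads from the secondary system, not from primary storage.
    steps.append(f"back up {vm} from Snapshot on {secondary}")
    return steps


# Hypothetical names, for illustration only.
for number, step in enumerate(run_backup("vm01", "vol_vmware", "fas-secondary"), 1):
    print(number, step)
```

The point of this ordering is that the VMware snapshot is already gone before any data is transferred, and the transfer to the repository never touches the primary storage.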
Furthermore, you can use the data stored in your repository to run a copy job to the cloud via Cloud Connect or to write it out to tape for long-term retention. The data can of course also be used for all other Veeam restore capabilities such as instant recovery or Veeam Explorer.
Besides that, you can also use the NetApp Snapshots as the source for your restore. By leveraging the Snapshots you will see the same performance during restores as in your production environment, as the Snapshots are mounted directly to vSphere. The combination of NetApp SnapVault/SnapMirror and Veeam can reduce the RTO in case of a disaster of your primary system (MetroCluster or cluster offline) to ~ 1 hr.
The RPO can be reduced to 15 min., depending on the configuration you use in the jobs.
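As a rough rule of thumb for how the job configuration drives the RPO, here is a small Python sketch. It rests on a simplifying assumption of mine: the SnapVault/SnapMirror transfer starts right after each snapshot and finishes before the next job run, so the worst case is one job interval plus the transfer time. The function name and the example numbers are hypothetical.

```python
def worst_case_rpo_minutes(job_interval_min: float, transfer_min: float = 0.0) -> float:
    """Worst-case data loss on the secondary site, in minutes.

    Assumption: a transfer starts right after each snapshot and completes
    before the next job run. If the primary fails just before a transfer
    finishes, the newest usable copy is one interval plus one transfer old.
    """
    if job_interval_min <= 0 or transfer_min < 0:
        raise ValueError("interval must be positive, transfer time non-negative")
    return job_interval_min + transfer_min


print(worst_case_rpo_minutes(15))      # job every 15 min., transfer time ignored
print(worst_case_rpo_minutes(15, 5))   # job every 15 min., 5 min. transfer
```

Under this assumption a 15 min. RPO requires a job interval of 15 min. with a negligible transfer time, or a shorter interval if the transfers take noticeable time.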
I hope this post answered some of your questions. Feel free to comment and share.