Technology Blog


Major Improvements in Exchange 2013’s High Availability – An Exchange Architecture Perspective!



Exchange Server 2010

Exchange 2010 introduced a new feature called the Database Availability Group (DAG), along with many changes to its core architecture compared with Exchange 2007. In Exchange 2010, new features such as incremental deployment, mailbox database copies, and database availability groups work together with other features such as shadow redundancy and the transport dumpster to provide a new, unified platform for high availability and site resilience.

With the Exchange Server 2007 SP3 features (CCR and SCR), we would experience quite substantial downtime while restoring a database. Fortunately, the Exchange 2010 DAG overcomes this limitation and avoids that downtime in email services. The important point here is that we are not talking about one Exchange server mirroring another. Replication goes further, down to the mailbox database level, where copies of a mailbox database are shared between Exchange servers.

At any point in time, only one copy of a mailbox database is active, while the other copies stay in standby mode and are kept up to date in a healthy state. In Exchange 2010, an administrator can add up to 16 Mailbox servers to a DAG and potentially have 16 copies of each mailbox database.

No doubt, Exchange 2010 made tremendous progress in site resilience for the messaging service and its data compared with previous versions of Exchange. By using the native site resilience features in Exchange 2010 and proper planning, you are able to activate a second datacenter to serve a failed datacenter’s clients. This process is referred to as a datacenter switchover. It is well documented and generally well understood, although it takes time to perform and requires administrator intervention to begin.

As we have all experienced, a datacenter switchover in Exchange 2010 is operationally complex. This is because recovery of the mailbox databases (DAG) and recovery of client access (namespace) are tied together in Exchange 2010. This leads to other challenges for Exchange 2010 in certain scenarios:

If you lost a significant portion of your Client Access servers or CAS array, or a significant portion of your DAG, you were in a situation where you needed to do a datacenter switchover.

You could deploy a DAG across two datacenters and host the witness in a third datacenter and enable failover for the Mailbox role for either datacenter. But you didn’t get failover for the messaging service, because the namespace still needed to be switched over for the non-Mailbox roles.

All that said, the biggest challenge with Exchange 2010 is that the namespace is a single point of failure. The most significant single point of failure in the messaging system is the published FQDN that you provide to end users, because it tells the user where to connect. Changing the IP address for that published FQDN is not easy: you have to change DNS and deal with DNS latency, which in some parts of the world is very bad. You also have to deal with name caches in browsers, which typically last around 30 minutes or more.

Exchange Server 2013

In Exchange 2013, the DAG architecture remains unchanged, but there are plenty of enhancements to high availability:

Reduction in IOPS over Exchange 2010: In Exchange 2013, the system is able to provide fast failover while using a high checkpoint depth on the passive copy (100 MB), compared with the low checkpoint depth (5 MB) that passive copies used in Exchange 2010.

Because passive copies use a 100-MB checkpoint depth, their pre-reads have been de-tuned to no longer be so aggressive. As a result of increasing the checkpoint depth and de-tuning the aggressive pre-reads, IOPS for a passive copy is about 50 percent of the active copy IOPS in Exchange 2013, whereas in Exchange 2010 IOPS for a passive database copy was equal to IOPS for an active copy.

As a result of these changes, Exchange 2013 provides a 50 percent reduction in IOPS over Exchange 2010.
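To make the passive-copy arithmetic concrete, here is a rough sketch in Python. The ratios (1.0 for Exchange 2010, 0.5 for Exchange 2013) come from the text above; the function name, the figure of 100 IOPS, and the three passive copies are made-up illustration values, and the sketch assumes the active copy's IOPS is held constant so that only the passive-copy effect is visible.

```python
# Passive-to-active IOPS ratios as described in the text above.
PASSIVE_RATIO = {"Exchange 2010": 1.0, "Exchange 2013": 0.5}


def total_copy_iops(active_iops: float, passive_copies: int, version: str) -> float:
    """Return the IOPS for one active copy plus all of its passive copies.

    Hypothetical helper: active_iops is assumed identical across versions.
    """
    return active_iops * (1 + passive_copies * PASSIVE_RATIO[version])


if __name__ == "__main__":
    # Illustrative numbers only: 100 IOPS on the active copy, 3 passive copies.
    for version in PASSIVE_RATIO:
        print(version, total_copy_iops(100.0, 3, version))
```

With these illustrative inputs, the three passive copies cost 300 IOPS under the Exchange 2010 ratio and 150 IOPS under the Exchange 2013 ratio.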

Multiple Databases Per Volume: Exchange 2013 is optimized so that it can use large, multi-terabyte disks in a JBOD configuration more efficiently.

With multiple databases per volume, you can have the same size disks storing multiple database copies, including lagged copies. The goal is to drive the distribution of users across the number of volumes that exist, providing you with a symmetric design where during normal operations each DAG member hosts a combination of active, passive, and optional lagged copies on the same volumes.

Another benefit of using multiple databases per volume is that it reduces the amount of time needed to restore data protection in the event of a failure that necessitates a reseed (for example, a disk failure).

AutoReseed: In Exchange 2013, AutoReseed is designed to automatically restore database redundancy after a disk failure by using spare disks that have been provisioned on the system. In the event of a disk failure where the disk is no longer available to the operating system, or is no longer writable, a spare volume is allocated by the system, and the affected database copies are reseeded automatically.

AutoReseed is integrated with multiple databases per volume and it is capable of restoring redundancy for multiple databases in parallel.
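The idea can be illustrated with a toy model in Python. This is only a sketch of the behavior described above, not how Exchange implements it: the `Volume` class, the `auto_reseed` function, and the volume and database names are all hypothetical, and "reseeding" is reduced to reassigning copy names to a spare.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    name: str
    healthy: bool = True
    copies: list[str] = field(default_factory=list)  # database copies on this volume


def auto_reseed(volumes: list[Volume], spares: list[Volume]) -> list[tuple[str, str]]:
    """Reassign copies from failed volumes to spares; return (copy, spare) pairs."""
    reseeded = []
    for volume in volumes:
        if volume.healthy or not volume.copies:
            continue  # nothing to restore for this volume
        if not spares:
            break  # no spare left: redundancy stays degraded
        spare = spares.pop(0)
        # All copies that shared the failed volume move to the same spare together.
        spare.copies.extend(volume.copies)
        volume.copies.clear()
        reseeded.extend((copy, spare.name) for copy in spare.copies)
    return reseeded
```

Because several database copies share one volume, a single spare allocation restores redundancy for all of them in one pass, which is the point of combining the two features.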

Managed Availability: Exchange 2013 server roles include a new monitoring and high availability feature known as Managed Availability.

With the Exchange Server 2013 Management Pack, Managed Availability is also integrated with Microsoft System Center Operations Manager (SCOM). Any issues that Managed Availability escalates are sent to SCOM via an event monitor.

Managed Availability includes three main asynchronous components that are constantly doing work. Administrators remain in control with the ability to configure server-specific and global overrides.

Probe Engine: Responsible for taking measurements on the server and collecting the data; results of those measurements flow into the monitor.

Monitor: Contains business logic used by the system to determine whether something is healthy, based on the data that is collected and the patterns that emerge from all collected measurements.

Responder Engine: Responsible for recovery actions. When something is unhealthy, the first action is to attempt to recover that component via multi-stage recovery actions, which can include recycling an application pool, restarting a service, restarting a server, and removing a server from service.

If recovery actions are unsuccessful, Managed Availability escalates the issue to a human through event log notifications.
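The probe, monitor, and responder chain described above can be sketched roughly as follows in Python. This is a simplified illustration, not the Managed Availability API: the function names, the sample count, the failure threshold, and the returned strings are all assumptions made for the example.

```python
from typing import Callable

Probe = Callable[[], bool]  # hypothetical probe: True when a measurement succeeds
RecoveryAction = tuple[str, Callable[[], None]]  # (description, action to run)


def is_healthy(probe: Probe, samples: int = 5, max_failures: int = 1) -> bool:
    """Monitor: judge health from several probe results rather than a single one."""
    failures = sum(1 for _ in range(samples) if not probe())
    return failures <= max_failures


def respond(probe: Probe, actions: list[RecoveryAction]) -> str:
    """Responder: run staged recovery actions, escalating if none restore health."""
    if is_healthy(probe):
        return "healthy"
    for name, action in actions:
        action()
        if is_healthy(probe):  # re-check after every stage
            return f"recovered by: {name}"
    return "escalated to administrator"
```

The recovery actions are ordered from least to most disruptive, so a cheap fix such as recycling an application pool is tried before anything that takes the server out of service.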

High Availability Message Flow: In Exchange 2013, a Mailbox server receives a message from any SMTP server that’s outside the Transport high availability boundary. The Transport high availability boundary is a database availability group (DAG), or an Active Directory site in non-DAG environments.

Before acknowledging receipt of the primary message, the primary Mailbox server initiates a new SMTP session to a shadow Mailbox server within the Transport high availability boundary and makes a shadow copy of the message. In DAG environments, a shadow server in a remote Active Directory site is preferred.

The primary server processes the primary message and delivers it to users within the Transport high availability boundary or relays it to the next hop. The primary server queues a discard status for the shadow server that indicates the primary message was successfully delivered, and the primary server moves the primary message into the local Primary Safety Net.

The shadow server periodically polls the primary server for the discard status of the primary message.

When the shadow server determines the primary server successfully delivered the primary message or relayed it to the next hop, the shadow server moves the shadow message into the local Shadow Safety Net.

The message is retained in the Primary Safety Net and the Shadow Safety Net until the message expires.
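The message flow above can be walked through with a small Python model. It is a sketch under simplifying assumptions: the `MailboxServer` class and the `receive`, `deliver`, and `poll` functions are hypothetical names, messages are reduced to ID strings, and Safety Net expiry is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class MailboxServer:
    name: str
    shadow_queue: set[str] = field(default_factory=set)
    discard_status: set[str] = field(default_factory=set)
    primary_safety_net: set[str] = field(default_factory=set)
    shadow_safety_net: set[str] = field(default_factory=set)


def receive(shadow: MailboxServer, message_id: str) -> str:
    """Make the shadow copy first, then acknowledge receipt to the sender."""
    shadow.shadow_queue.add(message_id)
    return "250 OK"


def deliver(primary: MailboxServer, message_id: str) -> None:
    """After delivery, queue a discard status and keep the message in Safety Net."""
    primary.discard_status.add(message_id)
    primary.primary_safety_net.add(message_id)


def poll(shadow: MailboxServer, primary: MailboxServer) -> None:
    """Shadow server asks the primary which shadow messages were delivered."""
    delivered = shadow.shadow_queue & primary.discard_status
    shadow.shadow_queue -= delivered
    shadow.shadow_safety_net |= delivered
    primary.discard_status -= delivered
```

Until the primary reports a discard status, the shadow server still holds the message in its shadow queue, so a primary server failure before delivery does not lose the message.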

Site Resilience: In Exchange 2013, significant changes have been made to address the challenges of Exchange 2010 site resilience: namespace simplification, consolidation of server roles, separation of CAS array and DAG recovery, and easier load balancing. The consolidation of server roles seems to be the biggest change to the Exchange architecture in the 2013 release: all five Exchange roles that were introduced in Exchange 2007 have been consolidated into the CAS and Mailbox roles. With these changes, Exchange 2013 provides new site resilience options, such as the ability to use a single global namespace.

In addition, for customers with more than two locations in which to deploy messaging service components, Exchange 2013 also provides the ability to configure the messaging service for automatic failover in response to failures that required manual intervention in Exchange 2010.

Microsoft Exchange Server 2013 adds a lot of other features as well. To go through all of them in detail, see Alan Maddison’s TechNet Magazine article, “Microsoft Exchange Server 2013: E-mail improved”: http://technet.microsoft.com/en-us/magazine/jj851175.aspx

