Shadow Redundancy and Server Outages

Exchange Server 2010 has a feature that tries to ensure that emails in transport cannot be lost. This feature is called Shadow Redundancy and lots of information on how it works can be found on the Internet.

But what happens if a mailbox server or site is unavailable? Items will queue in a single location, and that location becomes a single point of failure. So whilst you have an outage (planned or otherwise), you run an increased risk of losing mail if a second outage in transport corrupts the mail.que database.

Let us examine the details by considering one type of outage; other types of problem can occur and produce the same results. If I dismount a database in an Active Directory site and then send an email to an Exchange 2010 mailbox in that site, the email will queue on an Exchange 2010 Hub Transport server in that site. The queue will be visible with Get-Queue and will go into the Retry state. Here is a picture showing the Exchange Management Shell output for one such site:

The first cmdlet shows one email queuing for a mailbox database (the one that is offline). The DeliveryType is MapiDelivery and the NextHopDomain (the next target) is the offline database.

The second cmdlet in the above picture shows the effect of a second email being sent. The queue now holds two messages, and both of them are on FAB-RED-HUB1. Should FAB-RED-HUB1 fail at this point and the mail.que database become corrupt as a result, these emails would be lost.

What you cannot see from the screenshot is the effect of Delayed Acknowledgement. When a Hub Transport server receives an email from an SMTP server that does not support Shadow Redundancy, it delays acknowledging the message long enough to ensure the message exists on two servers, that is, until the message has a shadow. In the above example this is not possible: the inbound email comes from the Internet directly into this Active Directory site, so there is nowhere else to send it.

Delayed Acknowledgement is set to 30 seconds by default, so on the arrival of the first message the sending server has its acknowledgement delayed by the full 30 seconds. On the arrival of the second message the delivery queue is already in retry, so no delay is applied, because DelayedAckSkippingEnabled is set to True by default. The reasoning behind that setting is that if delivery would take more than 30 seconds, or the target queue is in retry, the message is likely to still be queued after 30 seconds, so there is no point in delaying.

The problem here is that the first message was protected for only 30 seconds. Had the mailbox database outage (or other failure) been resolved within those 30 seconds, the delay would have protected the message, because the previous hop would not yet have been told that the message was queued. The second message (and all subsequent messages) are acknowledged immediately, added to the outbound queue, and exist in a single place that is a single point of failure.
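The acknowledgement behaviour described above can be sketched as a small decision function. This is a simplified model written for illustration, not Exchange code: the names `AckPolicy` and `ack_delay` are hypothetical, and the only values taken from the article are the 30 second default and the DelayedAckSkippingEnabled default of True.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AckPolicy:
    """Hypothetical container for the two settings discussed above."""
    max_ack_delay_seconds: float = 30.0   # default delay per the article
    skipping_enabled: bool = True         # models DelayedAckSkippingEnabled


def ack_delay(policy: AckPolicy,
              sender_supports_shadow: bool,
              queue_in_retry: bool,
              seconds_until_delivered: Optional[float]) -> float:
    """Return how long the hub waits before acknowledging a message.

    seconds_until_delivered is None when the next hop never accepts the
    message within the window (for example, the database stays offline).
    """
    if sender_supports_shadow:
        # The sender keeps its own shadow copy, so no delay is needed.
        return 0.0
    if policy.skipping_enabled and queue_in_retry:
        # The queue is already known to be stuck, so waiting is pointless.
        return 0.0
    if seconds_until_delivered is None:
        # Nothing was delivered in time: the sender waits the full window.
        return policy.max_ack_delay_seconds
    # Acknowledge as soon as the message is delivered, or at the timeout.
    return min(seconds_until_delivered, policy.max_ack_delay_seconds)


policy = AckPolicy()
# First message: database offline, queue not yet in retry.
first = ack_delay(policy, False, False, None)
# Second message: the queue is now in retry, so the delay is skipped.
second = ack_delay(policy, False, True, None)
print(first, second)
```

In this model the first message is held back for the whole window while every later message is acknowledged at once, which is exactly the window of exposure the article describes.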

Service Pack 1 for Exchange 2010 adds Shadow Redundancy Promotion. This ensures that the message lives on two transport servers within a site if the NextHopDomain is unavailable. It is, however, disabled by default.

To enable Shadow Redundancy Promotion, edit the EdgeTransport.exe.config file on all Hub Transport servers and set ShadowRedundancyPromotionEnabled to True. Once the file is saved, restart the Microsoft Exchange Transport service on every server. EdgeTransport.exe.config is found in \Program Files\Microsoft\Exchange Server\V14\bin.
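For anyone scripting the change across several servers, the edit could be sketched roughly as follows. The function name `enable_promotion` is hypothetical, and the sketch assumes the setting is stored as an `<add key="..." value="..." />` entry under `appSettings`, which is the usual layout for .NET configuration files but is not spelled out above. The sketch drops XML comments and the XML declaration when it rewrites the text, so back up the real file first.

```python
import xml.etree.ElementTree as ET

# Setting name taken from the article; the appSettings layout is assumed.
KEY = "ShadowRedundancyPromotionEnabled"


def enable_promotion(config_xml: str) -> str:
    """Return the config text with the promotion setting set to true."""
    root = ET.fromstring(config_xml)
    app_settings = root.find("appSettings")
    if app_settings is None:
        # Create the section if the file does not have one yet.
        app_settings = ET.SubElement(root, "appSettings")
    for entry in app_settings.findall("add"):
        if entry.get("key") == KEY:
            entry.set("value", "true")
            break
    else:
        # The key was not present, so add it.
        ET.SubElement(app_settings, "add", {"key": KEY, "value": "true"})
    return ET.tostring(root, encoding="unicode")


# Illustrative input only, not a real EdgeTransport.exe.config.
sample = (
    "<configuration><appSettings>"
    '<add key="ShadowRedundancyPromotionEnabled" value="false" />'
    "</appSettings></configuration>"
)
print(enable_promotion(sample))
```

The function works on text rather than a path so it can be tried safely on a copy. The Microsoft Exchange Transport service still has to be restarted afterwards for the change to take effect.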

This second screenshot shows the effect of enabling Shadow Redundancy Promotion on all my hub transport servers and restarting the transport service on each machine. The screenshot follows on immediately from the above example.

In the above you can now see that the queue that did contain the messages for the mailbox database (FAB-RED-HUB1\216) is now empty, and that there is a shadow queue containing the two messages on FAB-RED-HUB1 instead. FAB-RED-HUB2 (in the same site) now holds the queue to the offline database. If a transport server fails whilst the database is offline, no email will be lost, as it can be redelivered from the other transport server.
