In Exchange 2010 (not SP1) and Exchange 2007 there was no memory of unavailable transport servers and so the round robin method of load balancing across the hubs in the target delivery site or smarthosts used by connectors sourced to your current server was just that – round robin.
Though if a server was unavailable the next server in the list was selected and connected to and the first server in the list was moved to the end of the list of servers to use. This resulted in an uneven distribution of load when servers were offline. Imagine the scenario where you have three hub transports in the London Active Directory site (HL1, HL2 and HL3) which were installed in that order. A Hub Transport server in another AD Site will deliver up to 20 messages per connection and will make the connections in a round robin fashion. Therefore if HL1 is offline the connection will automatically be made to HL2. Upon completing the connection the first server in the list will be moved to the end of the list – in this example HL1 will move to the back of the list.
The next connection to the London site will use the list HL2, HL3, HL1 for delivery, and as HL2 is running will connect to HL2 and deliver its email and move HL2 to the back of the list. The third connect will go to HL3. The fourth connection will attempt to reach HL1 and fail, so deliver to HL2 and move HL1 to the back of the list.
The result of this is that HL2 will get 66% of email delivered to HL3’s 33% and not a 50/50 distribution once one server is down. When all servers in the site are operational the distribution will be 1/3 of connections each and even load balancing.
Exchange 2010 SP1 records downed servers in a separate list which it will attempt to connect to on a separate sequence (unrelated to email delivery). So taking the above example and HL1 is offline (again) and the source server is Exchange 2010 SP1 it will fail to connect and deliver to HL2, move HL2 to the bottom of the list and remove HL1 from the available servers list. Therefore HL2 and HL3 will get 50% of connections each – no overloading of the next hub in install order.
The source Exchange 2010 SP1 server will maintain this list of unavailable servers and will attempt to connect to the unavailable server regularly. It does this once a minute for four minutes (known as the QueueGlitchRetryCount and QueueGlitchRetryInterval), then it changes to TransientFailureRetryCount and TransientFailureRetryInterval, which is six times, once every five minutes. After 35 minutes going through the Glitch and Transient retry intervals Exchange will only attempt to connect once every 10 minutes (the OutboundConnectionFailureRetryInterval value) or 15 minutes if on an Edge Transport server.
Once the server is online again it is added back into the round-robin load-balancing list for connections to remote sites or smarthost endpoints. This does mean though that if a server is offline for more than 35 minutes it will be up to 10 minutes before Exchange 2010 SP1 attempts to connect to it for transport and email delivery.
To see which servers are on your unavailable list run Get-ExchangeDiagnosticInfo -Process EdgeTransport -Component SmtpOut -Argument verbose . The Get-ExchangeDiagnosticInfo cmdlet is covered further in my next blog today.
Leave a Reply