In Exchange 2007 and Exchange 2010 RTM (that is, before SP1) there was no memory of unavailable transport servers. The round-robin load balancing across the hubs in the target delivery site, or across the smarthosts used by connectors sourced to your current server, was therefore just that: round robin.
If a server was unavailable, the next server in the list was selected and connected to instead, and the first server in the list was moved to the end of the list. This resulted in an uneven distribution of load when servers were offline. Imagine a scenario where you have three Hub Transport servers in the London Active Directory site (HL1, HL2 and HL3), installed in that order. A Hub Transport server in another AD site will deliver up to 20 messages per connection and will make its connections in a round-robin fashion. If HL1 is offline, the first connection attempt fails and the connection is automatically made to HL2 instead. Upon completing the connection, the first server in the list is moved to the end of the list; in this example HL1 moves to the back.
The next connection to the London site uses the list HL2, HL3, HL1. As HL2 is running, the source server connects to HL2, delivers its email and moves HL2 to the back of the list. The third connection goes to HL3. The fourth connection attempts to reach HL1 and fails, so it delivers to HL2 and moves HL1 to the back of the list again.
The result is that once HL1 is down, HL2 receives roughly two thirds of the connections and HL3 only one third, rather than a 50/50 split. When all servers in the site are operational, each receives one third of the connections and the load is evenly balanced.
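The walk-through above can be checked with a small simulation. This is only a sketch of the behaviour as described in this post, not Exchange's actual code: the function name `simulate_pre_sp1` and the model (connect to the first reachable server, then move only the head of the list to the back) are my own assumptions.

```python
from collections import Counter, deque


def simulate_pre_sp1(servers, offline, connections):
    """Model of the pre-SP1 behaviour described above (an assumption, not Exchange code).

    Each connection goes to the first reachable server in the list, and
    afterwards only the server at the head of the list moves to the back.
    """
    queue = deque(servers)
    counts = Counter()
    for _ in range(connections):
        # First reachable server, scanning from the head without reordering.
        target = next((s for s in queue if s not in offline), None)
        if target is None:
            break  # nothing reachable, mail would queue
        counts[target] += 1
        queue.rotate(-1)  # head of the list moves to the back
    return counts


if __name__ == "__main__":
    hubs = ["HL1", "HL2", "HL3"]
    print(simulate_pre_sp1(hubs, offline={"HL1"}, connections=300))
    print(simulate_pre_sp1(hubs, offline=set(), connections=300))
```

With HL1 offline, 300 connections split 200 to HL2 and 100 to HL3. With all three online each hub gets 100.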
Exchange 2010 SP1 records downed servers in a separate list, which it retries on a separate schedule (unrelated to email delivery). Taking the above example again, with HL1 offline and an Exchange 2010 SP1 source server: the source fails to connect to HL1, delivers to HL2 instead, moves HL2 to the bottom of the list and removes HL1 from the list of available servers. HL2 and HL3 therefore get 50% of connections each, with no overloading of the next hub in install order.
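The SP1 behaviour can be modelled in the same way. Again this is a sketch under my own assumptions: `simulate_sp1` is a hypothetical name, and the background reconnection to the unavailable list is deliberately left out so that only the load distribution is shown.

```python
from collections import Counter, deque


def simulate_sp1(servers, offline, connections):
    """Model of the SP1 behaviour described above (an assumption, not Exchange code).

    A server that fails to answer is moved from the available list to a
    separate unavailable list, so later connections no longer try it first.
    """
    available = deque(servers)
    unavailable = []
    counts = Counter()
    for _ in range(connections):
        # Any unreachable server at the head is taken out of rotation.
        while available and available[0] in offline:
            unavailable.append(available.popleft())
        if not available:
            break  # nothing reachable, mail would queue
        counts[available[0]] += 1
        available.rotate(-1)  # the server just used moves to the bottom
    return counts, unavailable


if __name__ == "__main__":
    counts, down = simulate_sp1(["HL1", "HL2", "HL3"], {"HL1"}, 300)
    print(counts, down)
```

Here 300 connections split 150 to HL2 and 150 to HL3, and HL1 ends up on the unavailable list.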
The source Exchange 2010 SP1 server maintains this list of unavailable servers and attempts to reconnect to each of them regularly. It does this once a minute for four minutes (the QueueGlitchRetryInterval and QueueGlitchRetryCount settings), then switches to TransientFailureRetryCount and TransientFailureRetryInterval, which is six attempts, one every five minutes. After about 35 minutes of working through the glitch and transient retry intervals, Exchange will only attempt to connect once every 10 minutes (the OutboundConnectionFailureRetryInterval value), or every 15 minutes on an Edge Transport server.
Once the server is online again it is added back into the round-robin load-balancing list for connections to remote sites or smarthost endpoints. This does mean, though, that if a server is offline for more than 35 minutes, it can be up to 10 minutes after it comes back before Exchange 2010 SP1 attempts to connect to it for email delivery.
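To make the retry timing concrete, here is a sketch that lists the minutes at which reconnection attempts would happen, using the Hub Transport values quoted above as defaults. The function `retry_times` is hypothetical, and it assumes each attempt happens exactly one interval after the previous one, which may not match Exchange's real timing. Under that assumption the glitch and transient phases end at minute 34, in line with the roughly 35 minutes mentioned above.

```python
def retry_times(horizon,
                glitch_count=4, glitch_interval=1,
                transient_count=6, transient_interval=5,
                outbound_interval=10):
    """Minutes after the first failure at which a reconnect is attempted.

    Simplified model (an assumption): every attempt occurs exactly one
    interval after the previous attempt.
    """
    times = []
    t = 0
    # Glitch phase, then transient phase.
    for count, interval in ((glitch_count, glitch_interval),
                            (transient_count, transient_interval)):
        for _ in range(count):
            t += interval
            times.append(t)
    # Afterwards, one attempt per OutboundConnectionFailureRetryInterval.
    while t + outbound_interval <= horizon:
        t += outbound_interval
        times.append(t)
    return [x for x in times if x <= horizon]


if __name__ == "__main__":
    print(retry_times(60))
    # [1, 2, 3, 4, 9, 14, 19, 24, 29, 34, 44, 54]
```

The gap between the last two phases is where the "up to 10 minutes" delay comes from: a server that recovers at minute 45 in this model is not tried again until minute 54.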
To see which servers are on your unavailable list, run Get-ExchangeDiagnosticInfo -Process EdgeTransport -Component SmtpOut -Argument verbose. The Get-ExchangeDiagnosticInfo cmdlet is covered further in my next blog post today.
6 responses to “Hub Transport Load Balancing”
OK, could you help me with the scenario below? (E2K10)
I have four AD sites with Hub servers A, B, C and D, with MPLS connections between them, and every site link has the same cost of 10.
Suppose a user in Site A sends email to a user in Site D. The Site A Hub server will open a remote queue directly to the Site D Hub server and the email is delivered.
What if the Site D Hub fails and the same email flow happens again from the Site A user to the Site D user? Which route will the Site A Hub server use: will it send the email to the Site B or C Hub server, and why?
Let me know if anything is unclear. Thanks!
@Charles, it will never route to B or C as B and C are not on the least cost route. The least cost route is A>D. So if D is offline then email queues at A.
Can you please elaborate more on that, Thanks!
I only read this a few days (3) ago as I was on holiday, and I wanted to check the following.
The remote delivery in the above scenario should be A->B, since D has failed.
* First it will look for Cost
* Second – No. of Hops
* Third – Alphanumerical connector.
So in my scenario above, if Site D has failed, then given the same cost and hop count, will it use the third point and queue at the Site B Hub?
Correct me if I am wrong. Thanks!
The cost of the link between A and D is 10. There is no link that is any cheaper, therefore all inter-site transport will use this link. Other links (for example A-B-D) are more expensive (20 in the case of the example).
Exchange uses least cost routing. It will always route on the least cost.
Therefore regardless of the state of the hubs in D, the route from A to D is the least cost route and in your example this is A-D. If the hubs in D are unavailable (or some other reason that means hubs in A cannot reach the hubs in D) then mail will queue at A until D is available. B and C are not on the least cost route, so mail will never go there.
Thanks very much for the explanation…got it.
So let's say the cost between sites A-C was 5, compared to 10 between sites A-B. Would mail then be queued at the Site C Hub server, as that route totals 15 (A-C-D) compared to 20 (A-B-D)?
Can you please help me understand the alphanumeric method?
#Wish you a Very Happy New Year….!
No – the cost A-D is still 10. The cost ABD and ACD are still more than this, so it will still queue on A for D to become available.
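The arithmetic in this thread can be sketched with a standard shortest-path calculation over the site link costs. This is an illustration only: `least_cost_route` is a hypothetical helper using Dijkstra's algorithm, not Exchange's routing code, and it deliberately takes no account of whether the hubs in a site are reachable, which is the point made above.

```python
import heapq


def least_cost_route(links, source, target):
    """Cheapest path over site links using Dijkstra's algorithm.

    `links` maps (site, site) pairs to a cost. Server availability is not
    an input: the route depends on link costs only.
    """
    graph = {}
    for (a, b), cost in links.items():
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    heap = [(0, source, [source])]
    seen = set()
    while heap:
        cost, site, path = heapq.heappop(heap)
        if site == target:
            return cost, path
        if site in seen:
            continue
        seen.add(site)
        for nxt, link_cost in graph.get(site, []):
            if nxt not in seen:
                heapq.heappush(heap, (cost + link_cost, nxt, path + [nxt]))
    return None


if __name__ == "__main__":
    # Full mesh between A, B, C and D, every link costing 10.
    mesh = {("A", "B"): 10, ("A", "C"): 10, ("A", "D"): 10,
            ("B", "C"): 10, ("B", "D"): 10, ("C", "D"): 10}
    print(least_cost_route(mesh, "A", "D"))        # (10, ['A', 'D'])
    # Same mesh, but with the A-C link reduced to 5.
    cheaper_ac = dict(mesh)
    cheaper_ac[("A", "C")] = 5
    print(least_cost_route(cheaper_ac, "A", "D"))  # (10, ['A', 'D'])
```

In both cases the result is A-D at cost 10, because A-C-D still totals 15. So mail for D queues at A whenever D cannot be reached.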