Today I visited a client who had noticed that no log files had ever been removed after any backup within Exchange 2010 SP1. Fortunately, they had enough log disk space for about eight months of log generations. Unfortunately, we were already four months into that period, so the clock was ticking, and the nightly incremental backups were taking longer and longer.
They were getting the following error in their backup datacentre, logged as Event ID 2136 for the MSExchangeRepl service in the Application event log:
Unable to communicate with the Microsoft Exchange Information Store service to coordinate log truncation for database ‘name’ due to an RPC communication failure. Error 3355379671 Extended Error: 0
What the error does not clearly say is that the Microsoft Exchange Replication service (MSExchangeRepl) on the server in the DR site (a passive node in the DAG) needs to communicate via RPC with the Microsoft Exchange Information Store service on the server holding the active copy of the database.
In my client's case, the Exchange team, the network team and the firewall team are all different people, and these teams are in different countries. On this client's network, the Replication network for the DAG had been opened to allow RPC traffic, but the MAPI (Client) network had not.
When Exchange in the DR site needs to check which logs it can truncate (a process it performs every 15 minutes), it has to talk to the Microsoft Exchange Information Store service on the server holding the active copy of the database, and name resolution was returning (as expected) the IP address of that server on the MAPI/Client network. This network blocked RPC between servers, and so (as one of the many issues they now attribute to this problem) logs could not be truncated and Event ID 2136 was logged once per database on the passive node in the DR site. The two servers in the primary site could reach each other over RPC, so the event was not logged in the primary site.
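The two facts that matter here (which address the active server's name resolves to, and whether RPC can reach that address) can be checked from the passive server with a few lines of script. The following is a minimal Python sketch, not something we ran on the day. It only tests TCP port 135, the RPC endpoint mapper; a successful connection is necessary but not sufficient, since RPC also uses other ports that a firewall may block. The function names and the default target are my own illustration.

```python
import socket
import sys

RPC_ENDPOINT_MAPPER_PORT = 135  # well-known TCP port for the RPC endpoint mapper


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses that name resolution gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def can_connect(address, port=RPC_ENDPOINT_MAPPER_PORT, timeout=3.0):
    """Return True if a TCP connection to address:port succeeds within timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def check_rpc_path(hostname):
    """Map each resolved address to whether the endpoint mapper port is reachable."""
    return {addr: can_connect(addr) for addr in resolve_ipv4(hostname)}


if __name__ == "__main__":
    # Pass the name of the server holding the active copy as the first argument.
    target = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    try:
        for addr, ok in check_rpc_path(target).items():
            print(addr, "reachable" if ok else "blocked or not listening")
    except socket.gaierror:
        print("could not resolve", target)
```

Run on the passive node against the active server's name, this shows which network the name resolves onto and whether that path is open for the endpoint mapper at least.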
To solve this log growth problem without waiting for a response from the firewall team, we added a record to the hosts file on the passive server to override DNS name resolution, and within 15 minutes 2TB of log files had disappeared across all servers. Name resolution was then reverted to DNS and the firewall team was contacted.
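For anyone unfamiliar with the hosts file, a record is simply an IP address followed by one or more names, and on Windows the file lives at %SystemRoot%\System32\drivers\etc\hosts. The small Python sketch below builds such a line. The address and server names are hypothetical; I am assuming the record points the active server's name at an address on a network where RPC is permitted, such as the Replication network.

```python
import ipaddress


def hosts_entry(ip, fqdn, short_name=None):
    """Build one hosts-file line mapping fqdn (and optional short name) to ip."""
    ipaddress.ip_address(ip)  # raises ValueError if ip is not a valid address
    names = [fqdn] + ([short_name] if short_name else [])
    return f"{ip}\t{' '.join(names)}"


# Hypothetical values: 10.0.1.20 stands in for the active server's address
# on the DAG replication network, and mbx01 for the server's name.
print(hosts_entry("10.0.1.20", "mbx01.example.com", "mbx01"))
```

The printed line is what would be appended to the hosts file on the passive server, and removed again once the firewall is fixed.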