Performance Monitoring Outlook on Office 365
Outlook on Office 365 must be one of the most widely used SaaS products in the world. Measuring Outlook performance is relatively straightforward, but a mechanism called long-polling can skew the figures.
In this article, we look at the impact of long-polling.
Measuring the performance of the Microsoft Outlook service has become more straightforward with recent changes to the protocol used. In earlier releases, requests and responses were transported on two separate TCP connections, making it difficult to correlate them. The current release uses MAPI over HTTP, which follows the familiar pattern of POSTing a request and receiving a response on the same connection. Even though the traffic is encrypted using HTTPS, we can measure request latency merely by looking at the time between a request flowing towards the service and the response coming back. Many solutions can do this, including Wireshark with TRANSUM.
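To make the timing idea concrete, here is a minimal sketch of how request latency could be derived from packet timestamps alone, without decrypting anything. It assumes the capture has already been reduced to a list of (timestamp, direction) events for a single connection; the event format and the function name request_latencies are illustrative assumptions, not how TRANSUM or any particular tool works.

```python
from typing import Iterable, List, Tuple

# Hypothetical event format: (timestamp_seconds, direction), where direction
# is "req" for client-to-service data and "resp" for service-to-client data.
Event = Tuple[float, str]


def request_latencies(events: Iterable[Event]) -> List[float]:
    """Pair each request with the next response on the same connection."""
    latencies = []
    pending = None  # timestamp of the request still awaiting a response
    for ts, direction in sorted(events):
        if direction == "req":
            if pending is None:
                pending = ts  # first packet of a new request
        elif direction == "resp" and pending is not None:
            latencies.append(ts - pending)  # first response packet closes it
            pending = None
    return latencies


# Two requests: one answered quickly, one left hanging for 300 seconds.
events = [(0.00, "req"), (0.12, "resp"), (5.00, "req"), (305.00, "resp")]
print(request_latencies(events))
```

Note that the second observation looks like a very slow request if all we have is timing, which is exactly the measurement problem discussed in the rest of this article.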
If you are an Outlook user, you'll know that when an email arrives in your inbox, you get a pop-up bubble notifying you. The arrival of an email is an asynchronous event, i.e. we don't know if or when it is going to happen. The Outlook client software deals with this situation by sending a request to the service asking, "Do I have new emails?" (1 in the diagram above). Rather than responding immediately, the service leaves the request hanging. If nothing arrives in the inbox within a timeout period, the service responds indicating nothing has arrived (2). The client immediately sends another "Do I have new emails?" request (3).
If an email arrives before the timeout, the service sends a response indicating mail has arrived (4). The client gets the details (5 & 6) and then issues another "Do I have new emails?" request (7). This mechanism is called long-polling.
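The sequence described above can be sketched as a small simulation. This is only an illustration of the long-polling pattern, not Outlook's actual implementation: the 30-second timeout, the function name simulate_long_polls and the assumption that fetching the mail details takes no time are all made up for the example.

```python
def simulate_long_polls(arrivals, timeout, end_time):
    """Return (start, duration, outcome) for each long-poll begun before end_time.

    arrivals: times (seconds) at which emails reach the inbox.
    timeout:  how long the service holds a poll open (assumed value).
    """
    polls = []
    now = 0.0
    pending = sorted(arrivals)
    while now < end_time:
        deadline = now + timeout
        if pending and pending[0] <= deadline:
            # Mail arrives while the poll is open: the service answers at once.
            finish = max(pending.pop(0), now)
            outcome = "mail"
        else:
            # Nothing arrived: the service answers only when the timeout expires.
            finish = deadline
            outcome = "timeout"
        polls.append((now, finish - now, outcome))
        now = finish  # the client immediately issues the next poll
    return polls


for poll in simulate_long_polls([70.0, 75.0], timeout=30.0, end_time=120.0):
    print(poll)
```

Every poll that ends in "timeout" shows up on the wire as a request whose response took the full timeout period, even though nothing was slow.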
As the long poll timeouts can be tens or hundreds of seconds, aggregated Outlook response time figures may be skewed.
The graph above is from a monitoring system we engineered for a client. The graph shows 95th percentile request latency values for Outlook on Office 365 based on 5-minute samples. The plot contains just over 12 million requests to the Outlook service.
The performance looks dreadful through the night because there are many idle Outlook clients and few emails, so a large proportion of the observations are long-poll timeouts. Users in India start at around 3 am UTC, and then early shifts in the UK start at around 6 am. Once we get into the working day, the request latency figures look much better. To be clear: there is nothing wrong with the performance of this system; it's just a monitoring issue.
We are investigating ways to characterise the encrypted long-polls, but luckily the high number of real email requests compared to long-polls eliminates the measurement problem during the working day.
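A toy calculation shows why the same long-polls dominate the 95th percentile at night but vanish from it during the day. The numbers are invented for illustration (0.2-second real requests, 300-second long-poll timeouts, and the sample sizes); they are not taken from the monitored system, and the nearest-rank percentile used here is just one common definition.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value covering pct% of the sample."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


REAL = 0.2        # assumed latency of a real email request, in seconds
LONG_POLL = 300.0  # assumed long-poll timeout, in seconds

# Night: few real requests, so 20 long-poll timeouts are 20% of the sample.
night = [REAL] * 80 + [LONG_POLL] * 20
# Day: the same 20 long-poll timeouts are only 2% of the sample.
day = [REAL] * 980 + [LONG_POLL] * 20

print("night p95:", percentile(night, 95))
print("day p95:  ", percentile(day, 95))
```

In this sketch the long-polls only stop affecting the 95th percentile once they make up 5% or less of the observations in a sample, which matches the pattern of poor figures overnight and good figures during the working day.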
I don't have an easy solution for you, but being aware of the long-poll avoids that embarrassing moment when you pick up the phone to Microsoft to incorrectly complain about the performance of Outlook.
Author: Paul is CEO of the Site Reliability Engineering company, Advance7. Through services, technology and consultancy, Advance7 helps banks, financial services companies and insurance companies adopt SRE practices within a traditional enterprise services environment.