The problem – A new version of an old application running VERY slow!

A couple of weeks ago I was asked to analyze an application at one of my client's data centers. This application was a new deployment of an existing application. The reason we were analyzing this new version was that while the old version took about 10 minutes to complete its processing, the new version was taking over 4 hours to complete. The client had contacted the software vendor seeking assistance. Unfortunately, they received the response we have all heard at one time or another: "it sounds like a network problem." Since we specialize in troubleshooting network problems, we felt comfortable that we could isolate the issue.

Getting the correct tool – Free!

The first task at hand was to determine whether this was really a network problem. For this task we brought out Iperf, one of our trusty free tools. Why Iperf? First, it is free, which fits everyone's budget. Second, it is a single 120 KByte executable file; there is nothing to install, which eliminates concerns on the part of server administrators. We simply put the executable on a network share accessible by all the servers and copied it to each of the servers for testing.

The Setup!

From a DOS prompt we ran the following command on the application server:
"c:\program files\iperf\iperf" -s
This setup made the application server the Iperf server. As the Iperf server, this host began listening on TCP port 5001, waiting for a connection from the Iperf client. On the SQL server we ran this command:
"c:\program files\iperf\iperf" -c <ip address of application server>
Upon running the above command, the Iperf client, in this case the SQL server, began transmitting TCP packets to the Iperf server. By default the transfer lasts 10 seconds. At the completion of the transfer, both the server and the client report the number of bytes transferred and the rate at which they were transferred.

The Gold!

Here we have the results from the first test we ran:
------------------------------------------------------------
Client connecting to 10.0.0.20, TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[1868] local 10.0.0.30 port 1139 connected with 10.0.0.20 port 5001
[ ID] Interval       Transfer     Bandwidth
[1868]  0.0-10.0 sec   647 MBytes   543 Mbits/sec
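As an aside, the 10-second default is fine for a quick check. If a longer or more granular measurement is wanted, Iperf's standard -t (test length in seconds) and -i (interval between periodic reports) options can be added to the client command. This is only a sketch of the options we find useful, not something this engagement required; the example below would run for 60 seconds and print an interim report every 10 seconds:

"c:\program files\iperf\iperf" -c 10.0.0.20 -t 60 -i 10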
First result – Enough Bandwidth!

So throwing more bandwidth at the problem, the most common "solution" used today, would not have fixed this issue! By running a single TCP stream from the SQL server to the application server, we were able to achieve a throughput rate of 543 Megabits per second. Is this good? From the application's standpoint it was. Analysis of the production version of the application showed that it required only 43 Megabits per second of bandwidth to complete in 10 minutes, as expected. While the available bandwidth was more than enough for the application, we wanted to make sure that the link was able to operate at the advertised Gigabit speed. This was accomplished by increasing the number of concurrent TCP streams. To do this we used the -P parameter with Iperf:
"c:\program files\iperf\iperf" -c <ip address of application server> -P 8
By increasing the number of concurrent connections to 8, we were able to achieve a data rate of 944 Megabits per second, or 94.4% utilization on the Gigabit network.
------------------------------------------------------------
Client connecting to 10.0.0.20, TCP port 5001
TCP window size: 8.00 KByte (default)
------------------------------------------------------------
[1692] local 10.0.0.30 port 1265 connected with 10.0.0.20 port 5001
[1772] local 10.0.0.30 port 1260 connected with 10.0.0.20 port 5001
[1708] local 10.0.0.30 port 1264 connected with 10.0.0.20 port 5001
[1724] local 10.0.0.30 port 1263 connected with 10.0.0.20 port 5001
[1836] local 10.0.0.30 port 1256 connected with 10.0.0.20 port 5001
[1740] local 10.0.0.30 port 1262 connected with 10.0.0.20 port 5001
[1788] local 10.0.0.30 port 1259 connected with 10.0.0.20 port 5001
[1756] local 10.0.0.30 port 1261 connected with 10.0.0.20 port 5001
[1804] local 10.0.0.30 port 1258 connected with 10.0.0.20 port 5001
[1852] local 10.0.0.30 port 1255 connected with 10.0.0.20 port 5001
[1868] local 10.0.0.30 port 1254 connected with 10.0.0.20 port 5001
[1820] local 10.0.0.30 port 1257 connected with 10.0.0.20 port 5001
[ ID] Interval       Transfer     Bandwidth
[1836]  0.0-10.0 sec   102 MBytes  85.8 Mbits/sec
[1756]  0.0-10.0 sec  88.5 MBytes  74.2 Mbits/sec
[1804]  0.0-10.0 sec  95.1 MBytes  79.7 Mbits/sec
[1724]  0.0-10.0 sec  83.1 MBytes  69.7 Mbits/sec
[1852]  0.0-10.0 sec   114 MBytes  95.2 Mbits/sec
[1788]  0.0-10.0 sec   112 MBytes  94.1 Mbits/sec
[1708]  0.0-10.0 sec   114 MBytes  95.2 Mbits/sec
[1868]  0.0-10.0 sec   112 MBytes  93.7 Mbits/sec
[1740]  0.0-10.0 sec  63.2 MBytes  53.0 Mbits/sec
[1772]  0.0-10.0 sec  77.1 MBytes  64.7 Mbits/sec
[1692]  0.0-10.0 sec  88.4 MBytes  74.2 Mbits/sec
[1820]  0.0-10.0 sec  78.1 MBytes  65.5 Mbits/sec
[SUM]  0.0-10.0 sec  1.10 GBytes   944 Mbits/sec
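It is worth noting that both tests ran with the default 8.00 KByte TCP window shown in the output above. On a Gigabit LAN that small window, rather than the wire, is often what caps a single stream. Had we wanted to confirm that, Iperf's -w option sets the socket buffer/window size; the command below is simply a sketch of that extra knob, which we did not need once the single-stream and multi-stream results showed the network had plenty of headroom:

"c:\program files\iperf\iperf" -c 10.0.0.20 -w 64K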
So, what has all of this told us? This is not a network problem! What do we know after running these tests?
The network infrastructure between these two servers is capable of transferring data at Gigabit speeds.
The Network Interface Cards (NICs) on the servers are capable of Gigabit speeds.
The IP and TCP stacks on the servers are capable of Gigabit speeds.
Same old story – The network is not the Problem!

Not this time! It looks like the vendor was mistaken in blaming the customer's network for the slowdown. Having proven it was not the network, it was time for us to move up the protocol stack and begin analyzing the conversation between the client and the server. The first step in this analysis was to place a Gigabit port aggregation tap between the application server and the Ethernet switch. Why a tap? The tap allowed us to monitor all of the traffic between the server and the rest of the network in real time, without having to change the configuration of the Ethernet switch. A PC with a Gigabit NIC was attached to the Gigabit tap (a TAP gives 100% pure access to the data; a SPAN port is not acceptable for REAL TIME analysis – see Tim's now famous article), and a filter was configured to capture only the packets between the application server and the SQL server. A test job was run and the application traffic was captured. I then began to analyze the traffic between the two servers. If something takes a long time to complete, it is usually one of two things: a few packets with large delays between them, or many packets with small delays between each packet.

Now the Facts

In this case, it was a few packets with long delays. In fact, for every SQL query we observed a 200ms delay. In our normal day-to-day life, 200ms is a pretty short period of time. However, waiting that long between each of 72,000 SQL queries adds up: 72,000 × 200ms is 14,400 seconds, or 4 hours. Here is a small section of the trace:
Frame  Description
[1729] The application server sends a SQL request to the SQL server
[1730] The SQL server acknowledges the request at the TCP layer
[1731] The SQL server sends the first packet of the response
[1732] The application server waits 200ms before sending an acknowledgment
[1733] The SQL server sends the remainder of the response
[1734] The application server acknowledges the response
[1735] The SQL server completes the call
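Because the capture already contained only the traffic between these two servers, gaps like the one at frame 1732 are easy to surface from the command line as well. As a sketch (capture.pcap is a placeholder file name, and -Y is the display-filter option in current tshark releases), the following lists every frame that arrived more than 100ms after the previous one:

tshark -r capture.pcap -Y "frame.time_delta > 0.1"

Every frame that comes back is a stall worth explaining; in this trace, those stalls were the 200ms Delayed Acknowledgements discussed below.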
Why was the application server waiting 200ms before sending the acknowledgement to the response packet from the SQL server? By default, the Microsoft TCP stack sends a TCP acknowledgement for every second TCP data packet received. The idea is to minimize the number of acknowledgements. If the receiving TCP stack receives only one TCP data packet, it waits 200ms before sending the acknowledgement. This is known as TCP Delayed Acknowledgements. In normal situations the last packet of a data stream may be a single packet. Waiting 200ms to acknowledge it is not a problem, since the receiver has already received all of the requested data and is not waiting on anything to proceed. In this case, however, the SQL server was sending a single response packet and waiting for an acknowledgement before sending the remainder of the response. The application server would not send that acknowledgement until it received a second data packet or the 200ms timer expired.

Now we know!

At this point we knew the network was not the problem and that Delayed Acknowledgements were causing the slowdown. So, what could be done to solve this problem? There were two courses of action. The first would be to determine why the SQL server was waiting for the acknowledgement after the first response packet; in many cases this behavior lies in the way the application is written or within the SQL server code itself. The second was to get the application server to acknowledge every packet. The downside to this solution is the increase in traffic on the network. For this problem we chose the second solution: the network between the servers was a Gigabit network and was not going to be impacted by a few extra acknowledgement packets. With a little research on the Internet I found a registry setting that allows us to change the number of data packets that must be received before sending an acknowledgement. The name of this registry value is TcpAckFrequency. By default this value is set to 2, meaning the receiver sends a TCP acknowledgement for every 2 data packets received. By changing this value to 1, we would be able to get the server to acknowledge every data packet received and eliminate the Delayed Acknowledgement problem. By following the instructions in the Microsoft Knowledgebase, we were able to locate the spot where we needed to add the value and set it to 1 (a sketch of the command appears at the end of this paper). The server was rebooted as per the instructions.

WOW! 3 minutes versus 4 hours!!

The test job was rerun. After making the change to the registry, the job completed in 3 minutes and 20 seconds. This was a vast improvement over the 4 hours it was taking prior to the change. The client was happy and I was happy. The tap was removed from between the server and the network and I was on to the next analysis job. Is this a change I would recommend making on servers in general? Absolutely not! Delayed Acknowledgements and the default value of 2 for TcpAckFrequency exist for good, solid reasons. In this case we found a unique situation that warranted changing this value to a non-standard setting, and it solved a very specific problem.

Who we are!

At Network Protocol Specialists, LLC it is our belief that through careful analysis and a thorough understanding of the protocols, any problem on the network can be isolated, diagnosed and resolved. Replacing network equipment and upgrading the infrastructure will never solve problems such as the one described in this paper.
Network Protocol Specialists, LLC is made up of a group of dedicated network troubleshooters who enjoy both solving network problems and teaching others how to solve their own network-related problems. We believe that by being both instructors and analysts, we are able to bring our real-world experience into the classroom, as well as bring the theory and concepts from the classroom into the real world.
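For reference, the registry change described earlier comes down to a single per-interface DWORD value. The command below is only a sketch: {INTERFACE-GUID} is a placeholder for the GUID of the NIC carrying the SQL traffic (found under the Interfaces key), and the Microsoft Knowledgebase instructions, including the reboot, should be followed rather than this one-liner alone:

reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{INTERFACE-GUID}" /v TcpAckFrequency /t REG_DWORD /d 1 /f

Setting TcpAckFrequency to 1 makes that interface acknowledge every data packet instead of every second one, which is exactly the behavior change that took this job from 4 hours down to 3 minutes and 20 seconds.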