Packet Loss: Modelling Microbursts and Packet Queues

What is a primary cause of slow throughput in today’s wide-bandwidth networks? Packet loss.
How just a few packet losses can have such a debilitating effect on throughput is another matter, and one that has been very well studied – it takes us into the fields of TCP flow control and congestion avoidance.
But what is the primary cause of packet loss?
Today we’re less likely to find problems in our physical infrastructure, such as corroded cables and connectors or EM interference, and must face the more ephemeral possibility of buffer overflow: a consequence of the severe congestion that occurs, quite simply, when increasingly common large bursts of packets arrive at a router and must wait their turn to be transmitted. Long queues are even more likely if the egress link is over-subscribed or feeds a WAN that is slower than the ingress links.
The subject of microbursts is never more prominent than in the securities-trading industry, where huge sums depend on receiving news in time and placing orders first. How can we tell if our network has microburst problems and is prone to dropping packets? My search of the web has turned up only one approach. To detect over-subscription you must measure bandwidth utilisation, and to detect microbursts that might appear for only a few hundred microseconds, you measure utilisation over correspondingly small intervals. Many tools compete on their ability to do this better and better. But really?
How many false positives do you get? It is quite possible to drive a link at 100% utilisation for very long periods without overflowing a buffer – if packets arrive back-to-back at exactly the egress rate, utilisation sits at 100% indefinitely while the queue never holds more than a packet or two. An over-subscribing burst can be handled without loss because that is the whole point of buffers. The severity of a microburst depends on the length of the packet queue when the burst arrives. In other words, you have to take into account all the preceding flows into the queue.
Some network devices understand our requirements very well – they record the high-water marks of packet-queue lengths over time so that we can tell when, how often, and how close we get to overflowing a buffer. Measures of utilisation, over any length of time interval, are little more than a curiosity.
What can we do if we don’t have such sensible network devices, or we are designing a new facility? With a capture file we can, of course, measure bandwidth utilisation, perhaps averaged over time intervals as short as we like. What isn’t widely known is that it is very easy to calculate and display what we really need – the length of the packet queue as it changes with the arrival and departure of every packet. Simple arithmetic tells us the time to transmit a packet of known length over a link of known speed, and tracking queue length involves only incrementing and decrementing a counter as packets arrive and leave a queue.
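To make that concrete, here is a minimal sketch of such a queue model in Python. It is my own illustration, not NetData’s code, and it assumes we already have each packet’s sniffer timestamp, its wire length, and a flag marking the packets inferred to have been lost downstream (more on that inference below):

# A minimal sketch of the queue model described above – not NetData's code.
# Assumed inputs: each packet's sniffer timestamp in seconds, its wire
# length in bytes, and a flag marking packets inferred lost downstream,
# sorted by arrival time; plus a guessed egress speed in bits per second.

def model_queue(packets, egress_bps):
    """Yield (time, queued_packets, queued_bytes, was_lost) per arrival."""
    in_flight = []      # (departure_time, length) of packets queued or sending
    link_free_at = 0.0  # when the egress link next falls idle
    for t, length, was_lost in packets:
        # Drain packets whose transmission finished before this arrival.
        in_flight = [(d, l) for d, l in in_flight if d > t]
        yield t, len(in_flight), sum(l for _, l in in_flight), was_lost
        if was_lost:
            continue    # a dropped packet never occupies buffer or link time
        # Serialisation time: 8 bits per byte over the egress link.
        departs = max(t, link_free_at) + 8.0 * length / egress_bps
        in_flight.append((departs, length))
        link_free_at = departs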
Thanks to a capture provided by @Sake Blok for a networking challenge at SharkFest 2020 we have a record of all the packets heading for a downstream queue behind a slow link. The large numbers of selective acks (SACKs) indicate quite severe packet losses and we might want to confirm that they are the result of buffer overflow. Furthermore, can we tell the size of the buffer, and what is the speed of the bottleneck?
If we model a queue with an output (egress) speed of exactly 916 Kbps, we get this chart of the queue length indicated in two ways: as the number of packets, and as the space occupied by packets in the queue.
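To put that speed in perspective: at 916 Kbps a full-size 1,500-byte packet occupies the link for 1,500 × 8 ÷ 916,000 ≈ 13.1 ms, so a burst of just 80 such packets represents over a second of queued transmission time.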
The model requires us to know which packets were lost after they passed the sniffer, and our NetData tool infers all the losses from the appearance of selective acks and retransmissions.
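As a much-simplified illustration of that inference – NetData’s actual logic is richer and also examines the SACK blocks themselves – a data segment that the sniffer saw once and later sees retransmitted has presumably been dropped downstream of the capture point:

def infer_losses(segments):
    """segments: (time_s, tcp_seq, length_bytes) tuples in capture order.
    Returns indices of original transmissions that were later seen again –
    a crude proxy for packets dropped downstream of the sniffer."""
    first_seen = {}                     # tcp_seq -> index of first copy
    lost = set()
    for i, (t, seq, length) in enumerate(segments):
        if seq in first_seen:
            lost.add(first_seen[seq])   # retransmission: original likely lost
        else:
            first_seen[seq] = i
    return lost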
Each red marker indicates the length of the queue when a packet was lost, and the fact that packet losses occurred only and always when the modelled length reached a peak confirms the accuracy of the model. The peak in the first burst was lower, probably because the queue was handling unseen traffic. The peak in the second burst is particularly interesting because packets were not lost at a uniform queue length measured in Kbytes, but when the number of packets reached a peak. The following chart shows that a proportion of packets in this burst were very small, and the queue-length evidence suggests that all queued packets were allocated the same buffer space, irrespective of their length.
This fish-net chart focuses only on the second burst and overlays the queue-length graphs with near-vertical short strips arrayed in steeply sloping diagonal lines, to indicate which packets and TCP connections occupied the egress link at any time. Horizontal ticks indicate the ends of packets, and the heights of the strips are proportional to packet length according to the scale on the left.
The popup points to a short packet with only 44 bytes of data and the prevalence of short packets is clearly visible where the horizontal ticks are close together.
Thick red strips in the second half of the chart indicate when the lost packets would have been transmitted if not dropped off the queue. Yes, this chart is hard to read, but it gives us a picture of all the packets in the queue at any time. If the capture contained packets from multiple connections, strip colours would reveal how packets from different connections were interleaved on the link.
Further validation of the model arises from the measurements of round-trip times plotted as black markers at times when the packets left the queue, not when captured by the sniffer. RTTs should increase when packets spend more time in a queue, and that seems to be the case here.
We can judge the effect on RTTs quite precisely because our queueing model knows exactly when each packet arrived at and departed the queue. For the next chart NetData overlays the calculated relative queuing times as green circles, and we see how closely the black and green markers track each other.
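In other words, for each packet the model predicts that measured RTT ≈ base RTT + (departure time − arrival time), where the base RTT is the fixed round-trip delay of the rest of the path; if the modelled link speed is right, the black and green markers should differ by only that constant offset.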
We can be confident in our assessment of the queue’s link speed by experimenting with the model; here we increase the speed by just one percent and regenerate the chart:
Packets are no longer lost at a uniform queue length, and the difference between RTTs and waiting times no longer remains constant throughout a burst. The modelling on the earlier charts tells us that all the lost packets were victims of an overflowing queue limited to 325 packets and about 500 Kbytes, waiting to be transmitted over a link with an effective speed of 916 Kbps.
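Using the model_queue() sketch from earlier, that experiment might look like the following hypothetical run, with `packets` and its loss flags assumed to come from the capture. At the right speed the losses cluster at a single peak – about 325 packets or 500 Kbytes here – while the faster run scatters them:

# Run the model at the estimated speed and again one percent faster,
# printing the modelled queue length at each inferred loss.
for speed in (916_000, 925_160):        # 916 Kbps, then one percent faster
    at_loss = [(pkts, nbytes) for _, pkts, nbytes, lost
               in model_queue(packets, speed) if lost]
    print(f"{speed} bps: queue lengths at first losses = {at_loss[:5]}")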
Could you predict the downstream packet losses from only the rate of bits passing the sniffer – that is, if all you had was the black line on the chart below?
Bob Brownell has more than 45 years’ experience in communications and IT, initially designing and building networks, computer-controlled systems and packet-switching systems in Australia and Europe. He is a founder and director of Measure IT Pty Ltd, a firm that specialises in diagnostic analysis of IT systems, and over the last 20 years has been developing NetData, a powerful network analyser with unique visualisation capabilities that is now freely available. It characterises virtually all transactions with a broad range of application decoders that includes all the major database protocols. NetData has been licensed by IBM, major Australian banks and government departments to diagnose the most complex IT performance problems around the world. He is a graduate of the University of Tasmania.
Contact
If you would like to understand microbursts in your network and investigate the behaviour of packet queues, packet shapers and packet policers – if you have any intractable performance problem or would like to extend your Wireshark skills and learn more about NetData – please send an email to
NetData Lite can be downloaded free from-
Or see Phil’s YouTube channel that includes training videos: