
Don’t assume anything when troubleshooting!

I was working in a large, heterogeneous network environment when I took on a problem: scanners at field offices were unable to transfer documents across the WAN. Working on it led me far down the rabbit hole, or, to be more specific, into a black hole.

The field techs had already changed out scanners, but the site continued to be intermittently unable to send documents across the WAN. Some documents transferred, some didn’t. After checking permissions and general settings on the scanner, I started looking at the network path.

I did a traceroute to the scanner to check all the hops. Things looked OK on the surface with normal responses coming back from each hop. Taking it a step further, I started checking each hop for its capability to pass fully loaded frames. To do this I used the following:

ping x.x.x.x -f -l 1472 (Windows ping: -f sets the Do Not Fragment bit, -l sets the payload size to 1472 bytes). Assuming the standard Ethernet MTU of 1500 bytes, 1472 bytes is the largest ICMP payload that each hop should pass without fragmenting. That is because the 20-byte IP header and the 8-byte ICMP header take up 28 of the 1500 bytes available, leaving 1472 bytes for the actual payload.
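The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this post, 192.0.2.1 is a documentation address standing in for x.x.x.x, and the Linux flags (-M do, -s) are the iputils equivalents of the Windows -f and -l, included on the assumption that you may be testing from either platform.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_icmp_payload(mtu: int = 1500) -> int:
    """Largest ICMP payload that fits in one unfragmented packet."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def build_ping_command(host: str, payload: int, system: str) -> list[str]:
    """Build a single do-not-fragment ping for the given platform name."""
    if system == "Windows":
        # -f: set Do Not Fragment, -l: payload size, -n: number of pings
        return ["ping", "-f", "-l", str(payload), "-n", "1", host]
    if system == "Linux":
        # -M do: prohibit fragmentation, -s: payload size, -c: number of pings
        return ["ping", "-M", "do", "-s", str(payload), "-c", "1", host]
    raise ValueError(f"no ping flags known for {system!r}")


# Hypothetical target address, used only to show the resulting command.
print(" ".join(build_ping_command("192.0.2.1", max_icmp_payload(), "Windows")))
# prints: ping -f -l 1472 -n 1 192.0.2.1
```

Passing the platform name in explicitly keeps the sketch easy to check on any machine; in practice you might pass platform.system() instead.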

As I started testing each hop, I hit one router between the edge and my location that would not pass 1472 bytes. I reduced the payload to 1460 (ping x.x.x.x -f -l 1460), and it passed. I kept adjusting the size until I found the cutoff: 1468 bytes would pass, but 1469 would not. Taking a capture of the traffic at the scanner, I could see that the scanner was setting the Do Not Fragment bit in the IP header by default. So when the scanner sent fully loaded frames, with payloads of 1469 to 1472 bytes, the packets were dropped at that hop, effectively black-holing the traffic. If the scanner happened to send frames that were not fully loaded, with payloads of less than 1469 bytes, the traffic would pass through the router and the document would be received.
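I found the cutoff by hand, but the same search can be automated. Here is a minimal sketch of a binary search for the largest payload that passes, assuming that every size up to some cutoff passes and every size above it fails. The probe is passed in as a function; find_max_payload and simulated_hop are hypothetical names, and the simulated hop simply mimics the 1468-byte cutoff from this story rather than sending real pings.

```python
from typing import Callable, Optional


def find_max_payload(passes: Callable[[int], bool],
                     low: int = 0, high: int = 1472) -> Optional[int]:
    """Return the largest payload size in [low, high] for which passes() is True.

    Assumes sizes pass up to a cutoff and fail above it.
    Returns None if even the smallest size fails.
    """
    if not passes(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2   # round up so the range always shrinks
        if passes(mid):
            low = mid                 # mid got through, look higher
        else:
            high = mid - 1            # mid was dropped, look lower
    return low


def simulated_hop(size: int) -> bool:
    """Stand-in for a real do-not-fragment ping: drops anything over 1468 bytes."""
    return size <= 1468


cutoff = find_max_payload(simulated_hop)
print(cutoff, cutoff + 28)   # payload cutoff, and the implied path MTU
```

A real probe would wrap a do-not-fragment ping and report whether a reply came back. That part is left out here because reply handling differs between platforms.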

This revelation changed how I troubleshot other applications that were having intermittent problems at field offices: capture the traffic, check whether the “Do Not Fragment” bit (defined in RFC 791) is set in the IP header, and then check the traffic path for black holes. In a Wireshark capture, you can also look for ICMP Type 3 Code 4 messages (Destination Unreachable, fragmentation needed and DF set, defined in RFC 792), which a router sends when it drops a packet that is too large to forward without fragmenting. If oversized packets vanish and no such message comes back, the path contains a black hole. By correlating the issues on a geographical map, I found that the hand-off between two telcos in a large MPLS network was coming up 4 bytes short: 1468 bytes of payload plus 28 bytes of headers is 1496 bytes, not the expected 1500. This ended up being an expensive, month-long fix that involved upgrading telco equipment at many locations, but it resolved countless issues that 40,000+ users had been experiencing intermittently.
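To make the two capture checks concrete, here is a rough sketch that reads the Do Not Fragment bit and spots an ICMP Type 3 Code 4 message in raw IPv4 bytes. The helper names are hypothetical, and the sample packet is built by hand for illustration, not taken from a real capture. The next-hop MTU field comes from RFC 1191, and older routers may leave it at zero.

```python
import struct

DF_FLAG = 0x4000          # Do Not Fragment bit in the IPv4 flags/fragment field
ICMP_PROTOCOL = 1


def inspect_ipv4(packet: bytes) -> dict:
    """Report the DF bit and any 'fragmentation needed' ICMP message."""
    header_len = (packet[0] & 0x0F) * 4                 # IHL is in 32-bit words
    (flags_frag,) = struct.unpack("!H", packet[6:8])
    info = {"df": bool(flags_frag & DF_FLAG),
            "frag_needed": False, "next_hop_mtu": None}
    if packet[9] == ICMP_PROTOCOL and len(packet) >= header_len + 8:
        icmp_type, icmp_code, _checksum, _unused, mtu = struct.unpack(
            "!BBHHH", packet[header_len:header_len + 8])
        if icmp_type == 3 and icmp_code == 4:           # frag needed and DF set
            info["frag_needed"] = True
            info["next_hop_mtu"] = mtu                  # may be 0 on old routers
    return info


def sample_frag_needed(mtu: int = 1496) -> bytes:
    """Hand-built example packet (checksums left at zero, addresses made up)."""
    ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, DF_FLAG, 64,
                            ICMP_PROTOCOL, 0,
                            bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
    icmp_header = struct.pack("!BBHHH", 3, 4, 0, 0, mtu)
    return ip_header + icmp_header


print(inspect_ipv4(sample_frag_needed()))
```

In Wireshark itself, the display filter icmp.type == 3 && icmp.code == 4 narrows a capture to the same messages.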

The moral of the story?

Telco’s main job is to provide transport.

Don’t assume anything when troubleshooting!

Author - John Modlin -

