I came across an interesting problem today that I think serves as an excellent example of why packet analysis is such a critical skill for network engineers.
A few days ago, the internal network belonging to one of my employer's customers was compromised by a malicious party. Since the customer had connectivity into our network by way of a VPN tunnel and we didn't want to knowingly expose ourselves or other customers to the threat, we saw fit to temporarily sever the VPN while the breach was tended to by another party. We also upgraded the site's core switch to better support a feature useful in the analysis of the breach.
Shortly thereafter, we began receiving reports of problems with Internet connectivity from the site. Everything was reachable, it was just... slow. And worse, the issue seemed not to have any uniform effect: One person experiencing the issue might sit next to someone else who was completely immune and who noticed no difference from the day before. This of course made troubleshooting frustrating, to say the least.
First we tried reversing the firmware upgrade on the core switch, as it was reasonable to suspect we may have encountered some obscure bug, but this was quickly revealed to be a
red herring as the issue persisted. On-site engineers verified that they could still reach everything (excluding of course our internal resources which were no longer reachable as a result of severing the VPN) and speed tests showed mostly normal results. There was no correlation between affected users and any access switch, VLAN, or IP subnet. We also confirmed about seventeen times that Internet traffic was in fact traveling from end hosts through the firewall directly to the Internet, with no proxy or caching servers in between.
Frustrated with the lack of progress in isolating the issue, I asked for a packet capture to be performed at the site. The testing procedure was as follows:
- Open a web browser
- Navigate to about:blank (this ensures that the test begins with a clean slate and guards against any rogue HTTP requests resulting from leaving the prior page)
- Start a promiscuous packet capture in Wireshark
- Navigate to packetlife.net and wait for the page to completely load
- Stop and save the capture.
My motivation for choosing packetlife.net as the test target was more than mere vanity. When you load a major web site like yahoo.com or cnn.com, you're actually generating a huge number of HTTP requests under the hood for content from a dozen or so sources (content delivery networks, embedded advertisements, etc.). By using a simple, familiar site I knew what sort and number of HTTP requests to expect, which makes the job of analyzing each packet in the capture a good deal easier.
After quickly sanity-checking the packet capture to ensure that the test was completed as as designed, my first step was to isolate the initial TCP session triggered by the web request. A quick way to do this if you know the IP address of the remote server is to apply a display filter showing only traffic to or from that IP:
ip.addr==174.143.213.184
Then, right-click on the first TCP packet (the one with only the SYN flag set) and select "Follow TCP Stream". This will apply a new display filter showing only packets which belong to that TCP session. (It will also generate a new dialog showing the contents of the HTTP request, which in many cases can be quite handy but wasn't of interest at this time.)
I reviewed the TCP exchange and everything appeared normal, until I noticed the packet timestamps. After the initial three-way TCP handshake (SYN, SYN/ACK, ACK) completed, there was a delay of about five seconds before the HTTP GET request was transmitted (between packets #32 and #101 in the figure below). Under normal conditions, there should be practically zero delay between the handshake completion and the initial HTTP request, so I knew I was on to something.
(The delta time column in the screenshot was
added to the default column set to show the difference in seconds between each packet and the one displayed immediately before it.)
One important clue to note is that the delay here is almost exactly five seconds; a round number to us humans. This suggests that some process triggered by or in parallel with the TCP connection is intelligently timing out at five seconds, at which point the HTTP request proceeds.
I clear the display filter and begin sifting through the packets which were captured between the establishment of the TCP session and the first HTTP request. Most of it is ambient noise. There's nothing terribly interesting except for a few unanswered TCP requests from the local host to a remote server that used to be accessible via the VPN tunnel we disconnected earlier. The requests were destined for port 5274; a quick google of the port number yielded no clues. But the answer finally clicked when I looked up the destination IP address: It was a Trend Micro antivirus server.
What does an antivirus application have to do with Internet connectivity? One of the services provided by Trend Micro and countless other AV products is the ability to filter URLs and block access from the local machine to questionable web sites. As I understand it (and I could be wrong, as we're well outside my area of expertise now), the client software hijacks the web browser's HTTP request and sends the URL to a central server which compares it against a blacklist. In this scenario, that server is no longer accessible, and the connection attempt times out after - you guessed it - five seconds. The client then "fails open," meaning that the HTTP request is allowed to continue despite receiving no response from the filtering server. (In hindsight, a big red error page complaining about AV software issues would have all but eliminated our time spent troubleshooting.)
Five seconds may not seem like a substantial amount of time, but remember how I mentioned earlier that large sites tend to pull content from myriad sources just to render a single page? If several of those request are being hijacked and delayed one after another, the cumulative delay can quickly grow to the point where angry phone calls are placed. This explanation also reveals why some users were immune to the issue: They were using an alternate antivirus product, or didn't have the URL filtering feature enabled.
Fortunately, the resolution for this issue ended up being trivial to implement. But how long do you suppose troubleshooting might have gone on had we not taken the time to inspect the wire directly? Hone your packet analysis skills well, as the investment will quickly pay off.