
Packet Loss: The Hidden Cause of Remote Access Performance Woes

Originally published on Fastmode, July 3, 2024.


Of the major determinants of network performance for remote users, packet loss is the most serious. So, if your applications are crawling or it’s taking 10 minutes for a file to download, chances are it’s not a bandwidth or latency issue but a packet loss problem.

Remote access performance is a big deal for enterprises now that around 50% of their staff work away from the office for at least a couple of days a week. Workers who expect office-LAN levels of performance when they work away from the office are usually disappointed. The result is lower productivity for the user and frustration for IT, whose response to requests for support is unlikely to be any more helpful than "It's your network, sorry".

Network teams have long regarded unreliable remote access as someone else’s problem. If users are connecting through consumer-grade broadband or hotel Wi-Fi, there’s not much IT can do to help. Radical solutions – like router upgrades or faster broadband links – might help alleviate the problem for home workers, but of course they can’t take those solutions with them when they’re on the move.

Network vendors have also tended to shrug off responsibility for performance in the last mile. However, as hybrid work has grown, SD-WAN, ZTNA and VPN vendors have been forced to pay attention to the problem. Their claims of improved performance usually center on low latency.

Packet loss – the other major factor in performance – is rarely mentioned. This is because while there are some solutions to the latency problem, packet loss is not something many vendors address. They don’t talk about it much because they can’t fix it.

Latency matters but even a very low latency network can grind to a halt when you introduce even small amounts of packet loss.

This is illustrated in Figure 1. The top line in the chart shows the effect of latency on throughput assuming no packet loss. As latency rises towards the right-hand end of the scale, throughput falls, but not so dramatically that it would make a noticeable difference to performance. The lower lines of the chart show the same scenario with 0.0046% packet loss. In this case, adding packet loss to a very low latency connection reduces throughput by around 90%. Even at 50ms, throughput is reduced by more than 95%.

Figure 1. Source: US Department of Energy.

There are three things to highlight here. The first is that anything below 100ms would typically be regarded as acceptable latency for SaaS apps, so you’d expect great performance at 50ms. The second is that 0.0046% is a very low degree of packet loss, so we’re not creating an unrealistic scenario. In busy periods a consumer Internet connection could easily reach 1-2%. The third point is to underline the dramatic negative consequences of a tiny amount of packet loss. Even with low latency, the throughput of a 100Mb/sec broadband connection would fall to below 5Mb/sec.
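The relationship between loss, latency and throughput that Figure 1 illustrates is often approximated with the well-known Mathis model, which bounds steady-state TCP throughput at roughly (MSS / RTT) × (C / √p), where p is the loss rate and C ≈ √(3/2). A minimal sketch (the segment size and the specific loss/latency values are illustrative, and the model gives an upper bound, not a prediction of any one connection's speed):

```python
import math

def mathis_throughput_mbps(mss_bytes, rtt_ms, loss_rate):
    """Upper bound on TCP throughput (Mb/s) from the Mathis model:
    rate <= (MSS / RTT) * (C / sqrt(p)), with C ~= sqrt(3/2)."""
    rtt_s = rtt_ms / 1000.0
    c = math.sqrt(3.0 / 2.0)
    bytes_per_sec = (mss_bytes / rtt_s) * (c / math.sqrt(loss_rate))
    return bytes_per_sec * 8 / 1e6

# 1460-byte segments, 50ms RTT, 0.0046% loss (the Figure 1 scenario)
print(mathis_throughput_mbps(1460, 50, 0.000046))
# Same connection at a busy-period consumer loss rate of 1%
print(mathis_throughput_mbps(1460, 50, 0.01))
```

Note how the bound collapses as loss rises: at 1% loss and 50ms the model caps throughput at under 3Mb/sec, regardless of how fast the underlying link is.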

The back-off that causes the drop in performance is the network's way of coping with congestion. Crudely put, the TCP/IP protocols' job is to keep things moving and protect the core network. When a transmitting host starts to lose packets, it knows to back off, or the network finds a new route around the trouble.

This principle of keeping the freeways clear is why most packet loss – perhaps 95% – happens at the user edge. It follows that solutions to the network performance issue must also be applied close to the user.

 

This can be done in two ways. First by providing remote users with an ultra-low latency (5-20ms) connection to the nearest point of presence (PoP). There are obvious practical limitations to achieving that performance with physical PoPs, but almost no constraints on the supply of virtual PoPs, which the new generation of network as-a-service companies are building on existing cloud and telco infrastructure. Using AI, a virtual PoP can be spun up instantaneously wherever the user happens to be.

The second part of the solution is a technique called accelerated pre-emptive packet recovery, which enables lost packets to be recovered very close to the user in these PoPs. The combined effect of packet recovery and low latency has been shown in independent tests to increase throughput by a factor of 30 or more.

Figure 2 shows the result of a speed test by the European lab Broadband Testing. It shows the impact of 5% packet loss on transmission of a 100Mb file over a 300Mb/sec connection. Even with zero latency, throughput on the control connection falls to just over 10Mb/sec. At 100ms latency, it is less than 0.5Mb/sec. The same connection with the architectural features described above shows respectable performance throughout the latency range and still manages 10x the throughput in the most challenging case (5% packet loss, 100ms latency).


Figure 2. Source: Broadband Testing.

With hybrid work now a firmly established trend, CIOs and network teams will increasingly be tasked with bringing the user experience of remote and mobile users up to the level of their office colleagues. This will mean scrutinizing the performance of existing networking solutions at the point of use, which for many users at least some of the time will be on remote and unreliable connections.

Several tools are available for networking teams to model the impact of packet loss on remote access connections.
