The importance of being liberal in a Cisco environment

Okay, so today I grappled with a Cisco sized gorilla and won. -- me, earlier this year, apparently feeling rather chuffed with myself.

I had recently launched a new service for a client, and a harvester on the Internet was experiencing timeouts in trying to connect, and so our data was not being harvested by the harvester that we needed. There was no evidence of a problem in the web-server logs because no HTTP request ever had a chance to make it through.

It seems that Cisco products (some unstudied subset, but likely firewalls and NATs) seem to play a bit too fast and loose for Linux’s liking and change packets in ways that makes the Linux firewall (iptables) stateful connection tracking occassionally see such traffic as INVALID. This manifests in in things such as connection timeouts, and as such you won’t notice it in things like webserver logs. In a traffic capture, you may recognise it as a lot of retransmissions.

Possibly it is affected by TCP window scaling as well. It is only noticeable to traffic from the internet (from my point of view in New Zealand, so presumably related to round-trip-time), and as such is most likely noticeable only on production systems accepting connections from the internet.

Possibly browsers may be less affected (they are more robust), and smaller spiders and other non-browsers would likely be more affected.

Others on the internet have certainly experienced the same:

The solution is to enable the ‘nf_conntrack_tcp_be_liberal’ flag:
nf_conntrack_tcp_be_liberal - BOOLEAN
0 - disabled (default)
not 0 - enabled
Be conservative in what you do, be liberal in what you accept from others.
If it's non-zero, we mark only out of window RST segments as INVALID.

echo 1 > /proc/sys/net/netfilter/nf_conntrack_tcp_be_liberal

Post change, I experienced immediate resolution of this issue, much to the relief of stakeholders.
THANK YOU! YOUR MAGIC WORKED!!!! Woo, I’m too excited!
The test harvest run beautifully with 3,812 records in return.
In my experience of this, devices behind a load-balancer do not seem to be affected, probably because our load-balancer has a habit of stripping window-scaling TCP options I think. Its much more likely to present itself when the server is terminating the TCP connection itself.


Popular posts from this blog

ORA-12170: TNS:Connect timeout — resolved

Getting MySQL server to run with SSL

From DNS Packet Capture to analysis in Kibana