[lvs-users] NFCT and PMTU

Julian Anastasov ja at ssi.bg
Mon Sep 10 22:16:43 BST 2012


On Mon, 10 Sep 2012, lvs at elwe.co.uk wrote:

> I have a number of LVS directors running a mixture of CentOS 5 and CentOS 
> 6 (running kernels 2.6.18-238.5.1 and 2.6.32-71.29.1). I have applied the 
> ipvs-nfct patch to the kernel(s).
> When I set /proc/sys/net/ipv4/vs/conntrack to 1 I have PMTU issues. When 
> it is set to 0 the issues go away. The issue is when a client on a network 
> with a <1500 byte MTU connects. One of my real servers replies to the 
> client's request with a 1500 byte packet and a device upstream of the 
> client will send an ICMP must fragment. When conntrack=0 the director 
> passed the (modified) ICMP packet on to the client. When conntrack=1 the 
> director doesn't send an ICMP to the real server. I can toggle conntrack 
> and watch the PMTU work and not work.

	I can try to reproduce it with a recent kernel.
Can you tell me which forwarding method is used? NAT? Do
you have a test environment, so that you can see what
is shown in the logs when IPVS debugging is enabled?

	Do you mean that when conntrack=0 the ICMP is forwarded
back to the client instead of being forwarded to the real server?

	Now I remember some problems with ICMP:

- I don't see this change in 2.6.32-71.29.1:

commit b0aeef30433ea6854e985c2e9842fa19f51b95cc
Author: Julian Anastasov <ja at ssi.bg>
Date:   Mon Oct 11 11:23:07 2010 +0300

    nf_nat: restrict ICMP translation for embedded header

    Skip ICMP translation of embedded protocol header
    if NAT bits are not set. Needed for IPVS to see the original
    embedded addresses because for IPVS traffic the IPS_SRC_NAT_BIT
    and IPS_DST_NAT_BIT bits are not set. It happens when IPVS performs
    DNAT for client packets after using nf_conntrack_alter_reply
    to expect replies from real server.

    Signed-off-by: Julian Anastasov <ja at ssi.bg>
    Signed-off-by: Simon Horman <horms at verge.net.au>

diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index e2e00c4..0047923 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -462,6 +462,18 @@ int nf_nat_icmp_reply_translation(struct nf_conn *ct,
 			return 0;
+	if (manip == IP_NAT_MANIP_SRC)
+		statusbit = IPS_SRC_NAT;
+	else
+		statusbit = IPS_DST_NAT;
+	/* Invert if this is reply dir. */
+	if (dir == IP_CT_DIR_REPLY)
+		statusbit ^= IPS_NAT_MASK;
+	if (!(ct->status & statusbit))
+		return 1;
 	pr_debug("icmp_reply_translation: translating error %p manip %u "
 		 "dir %s\n", skb, manip,
 		 dir == IP_CT_DIR_ORIGINAL ? "ORIG" : "REPLY");
@@ -496,20 +508,9 @@ int nf_nat_icmp_reply_translation(struct nf_conn *ct,
 	/* Change outer to look the reply to an incoming packet
 	 * (proto 0 means don't invert per-proto part). */
-	if (manip == IP_NAT_MANIP_SRC)
-		statusbit = IPS_SRC_NAT;
-	else
-		statusbit = IPS_DST_NAT;
-	/* Invert if this is reply dir. */
-	if (dir == IP_CT_DIR_REPLY)
-		statusbit ^= IPS_NAT_MASK;
-	if (ct->status & statusbit) {
-		nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
-		if (!manip_pkt(0, skb, 0, &target, manip))
-			return 0;
-	}
+	nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
+	if (!manip_pkt(0, skb, 0, &target, manip))
+		return 0;
 	return 1;

	If this patch does not help, we will have to debug it.
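
	To make the check in the patch easier to follow, here is a
small userspace sketch of the statusbit test that the patch moves to
the top of nf_nat_icmp_reply_translation(). It is not kernel code:
the function name nat_wants_icmp_translation and the shortened enum
names are made up for illustration, and the bit values are the ones
I recall from nf_conntrack_common.h, so please check them against
your tree.

```c
#include <stdbool.h>

/* Assumed values, as recalled from include/linux/netfilter/
 * nf_conntrack_common.h; verify against your kernel tree. */
enum {
	IPS_SRC_NAT  = 1 << 4,
	IPS_DST_NAT  = 1 << 5,
	IPS_NAT_MASK = IPS_SRC_NAT | IPS_DST_NAT
};

/* Simplified stand-ins for IP_NAT_MANIP_* and IP_CT_DIR_*. */
enum manip_type { MANIP_SRC, MANIP_DST };
enum ct_dir { DIR_ORIGINAL, DIR_REPLY };

/* Hypothetical helper: returns true when the conntrack entry has the
 * NAT bit that corresponds to this manipulation and direction, i.e.
 * when the ICMP error (and its embedded header) should be translated.
 * With the patch applied, a false result means "leave the packet
 * untouched and return 1". */
bool nat_wants_icmp_translation(unsigned long status,
				enum manip_type manip,
				enum ct_dir dir)
{
	unsigned long statusbit;

	if (manip == MANIP_SRC)
		statusbit = IPS_SRC_NAT;
	else
		statusbit = IPS_DST_NAT;

	/* Invert if this is reply dir. */
	if (dir == DIR_REPLY)
		statusbit ^= IPS_NAT_MASK;

	return (status & statusbit) != 0;
}
```

	For IPVS connections neither NAT bit is set in ct->status, so
this test is false for every manip/dir combination and, with the
patch, the embedded addresses are left as they are for IPVS to see.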

> I would happily leave conntrack off, but it has a huge performance impact. 
> With my traffic profile the softirq load doubles when I turn off 
> conntrack. My busiest director is doing 2.1Gb of traffic and with 
> conntrack off it can probably only handle 2.5Gb.

	It would be interesting to see such a comparison
for conntrack=0 and 1. Can you confirm both numbers again?
2.1 is not better than 2.5.

> I am hoping that this issue has been observed and fixed and someone will 
> be able to point me to the patch so I can back port it to my kernels (or 
> finally get rid of CentOS 5!).
> Thanks
> Tim


