[lvs-users] NFCT and PMTU

lvs at elwe.co.uk lvs at elwe.co.uk
Tue Sep 11 21:52:53 BST 2012


That patch solves the problem, at least for ICMP port unreachable packets.
I tested ICMP port unreachable packets without the patch and, like the ICMP
must-fragment packets, they were not being forwarded with conntrack=1. So
it all looks good.

The port unreachable test is a really useful way to test ICMP forwarding. 
I will be using that in the future!
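
For reference, here is a minimal sketch of such a test (the VIP 192.0.2.10
and port 40000 are placeholders, not my real setup): send a UDP datagram
through the director to a port nothing answers on and watch for the ICMP
port unreachable coming back. On Linux a connected UDP socket surfaces that
ICMP error as ECONNREFUSED on the next recv().

#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	/* Placeholder VIP and an unused port - adjust for your own setup. */
	struct sockaddr_in vip = { .sin_family = AF_INET,
				   .sin_port = htons(40000) };
	char buf[16];
	int fd;

	inet_pton(AF_INET, "192.0.2.10", &vip.sin_addr);

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	/* connect() so the kernel reports ICMP errors for this flow to us */
	connect(fd, (struct sockaddr *)&vip, sizeof(vip));
	send(fd, "ping", 4, 0);

	/* Blocks until a reply or an ICMP error arrives; in a real test set
	 * SO_RCVTIMEO so a silently dropped ICMP shows up as a timeout. */
	if (recv(fd, buf, sizeof(buf), 0) < 0 && errno == ECONNREFUSED)
		printf("ICMP port unreachable was forwarded back\n");
	else
		printf("no ICMP error seen\n");

	close(fd);
	return 0;
}

If the error shows up with conntrack=0 but not with conntrack=1, that
reproduces the forwarding problem discussed below.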

Thank you for your help
Tim

On Tue, 11 Sep 2012, Julian Anastasov wrote:

>
> 	Hello,
>
> On Mon, 10 Sep 2012, lvs at elwe.co.uk wrote:
>
>> I have a number of LVS directors running a mixture of CentOS 5 and CentOS
>> 6 (running kernels 2.6.18-238.5.1 and 2.6.32-71.29.1). I have applied the
>> ipvs-nfct patch to the kernel(s).
>>
>> When I set /proc/sys/net/ipv4/vs/conntrack to 1 I have PMTU issues. When
>> it is set to 0 the issues go away. The issue occurs when a client on a network
>> with a <1500 byte MTU connects. One of my real servers replies to the
>> client's request with a 1500 byte packet and a device upstream of the
>> client will send an ICMP must-fragment message. When conntrack=0 the director
>> passes the (modified) ICMP packet on to the client. When conntrack=1 the
>> director doesn't send an ICMP to the real server. I can toggle conntrack
>> and watch the PMTU work and not work.
>
> 	I can try to reproduce it with a recent kernel.
> Can you tell me what forwarding method is used? NAT? Do
> you have a test environment, so that you can see what
> is shown in logs when IPVS debugging is enabled?
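>
> 	For reference (an illustrative sketch, assuming a kernel built
> with CONFIG_IP_VS_DEBUG): both knobs live under /proc/sys/net/ipv4/vs/ -
> "conntrack" for the toggle above and "debug_level" for the IPVS debug
> output that goes to the kernel log - and can be set by writing to the
> proc files:
>
> #include <stdio.h>
>
> /* Write a value to an IPVS sysctl, e.g. set_vs_sysctl("debug_level", "12")
>  * for verbose IPVS debugging or set_vs_sysctl("conntrack", "0") for the
>  * toggle above. Equivalent to echoing the value into the proc file. */
> static int set_vs_sysctl(const char *name, const char *value)
> {
> 	char path[128];
> 	FILE *f;
>
> 	snprintf(path, sizeof(path), "/proc/sys/net/ipv4/vs/%s", name);
> 	f = fopen(path, "w");
> 	if (!f)
> 		return -1;
> 	fputs(value, f);
> 	return fclose(f);
> }
>
> int main(void)
> {
> 	/* Illustrative value: higher debug_level means more verbose output. */
> 	return set_vs_sysctl("debug_level", "12") ? 1 : 0;
> }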
>
> 	Do you mean that when conntrack=0 the ICMP is forwarded
> back to the client instead of being forwarded to the real server?
>
> 	Now I remember some problems with ICMP:
>
> - I don't see this change in 2.6.32-71.29.1:
>
> commit b0aeef30433ea6854e985c2e9842fa19f51b95cc
> Author: Julian Anastasov <ja at ssi.bg>
> Date:   Mon Oct 11 11:23:07 2010 +0300
>
>    nf_nat: restrict ICMP translation for embedded header
>
>    Skip ICMP translation of embedded protocol header
>    if NAT bits are not set. Needed for IPVS to see the original
>    embedded addresses because for IPVS traffic the IPS_SRC_NAT_BIT
>    and IPS_DST_NAT_BIT bits are not set. It happens when IPVS performs
>    DNAT for client packets after using nf_conntrack_alter_reply
>    to expect replies from real server.
>
>    Signed-off-by: Julian Anastasov <ja at ssi.bg>
>    Signed-off-by: Simon Horman <horms at verge.net.au>
>
> diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
> index e2e00c4..0047923 100644
> --- a/net/ipv4/netfilter/nf_nat_core.c
> +++ b/net/ipv4/netfilter/nf_nat_core.c
> @@ -462,6 +462,18 @@ int nf_nat_icmp_reply_translation(struct nf_conn *ct,
> 			return 0;
> 	}
>
> +	if (manip == IP_NAT_MANIP_SRC)
> +		statusbit = IPS_SRC_NAT;
> +	else
> +		statusbit = IPS_DST_NAT;
> +
> +	/* Invert if this is reply dir. */
> +	if (dir == IP_CT_DIR_REPLY)
> +		statusbit ^= IPS_NAT_MASK;
> +
> +	if (!(ct->status & statusbit))
> +		return 1;
> +
> 	pr_debug("icmp_reply_translation: translating error %p manip %u "
> 		 "dir %s\n", skb, manip,
> 		 dir == IP_CT_DIR_ORIGINAL ? "ORIG" : "REPLY");
> @@ -496,20 +508,9 @@ int nf_nat_icmp_reply_translation(struct nf_conn *ct,
>
> 	/* Change outer to look the reply to an incoming packet
> 	 * (proto 0 means don't invert per-proto part). */
> -	if (manip == IP_NAT_MANIP_SRC)
> -		statusbit = IPS_SRC_NAT;
> -	else
> -		statusbit = IPS_DST_NAT;
> -
> -	/* Invert if this is reply dir. */
> -	if (dir == IP_CT_DIR_REPLY)
> -		statusbit ^= IPS_NAT_MASK;
> -
> -	if (ct->status & statusbit) {
> -		nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
> -		if (!manip_pkt(0, skb, 0, &target, manip))
> -			return 0;
> -	}
> +	nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
> +	if (!manip_pkt(0, skb, 0, &target, manip))
> +		return 0;
>
> 	return 1;
> }
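>
> 	To show what the moved check does, here is a stand-alone
> illustration (with simplified stand-ins for the kernel flags, not part
> of the patch itself): for IPVS traffic neither IPS_SRC_NAT nor
> IPS_DST_NAT is set on the conntrack entry, so the embedded header is
> left alone and IPVS still sees the original addresses inside the ICMP
> error:
>
> #include <stdio.h>
> #include <stdbool.h>
>
> /* Local stand-ins for the conntrack status flags; the exact bit
>  * positions do not matter for the logic, only that SRC and DST differ. */
> #define IPS_SRC_NAT  (1 << 4)
> #define IPS_DST_NAT  (1 << 5)
> #define IPS_NAT_MASK (IPS_SRC_NAT | IPS_DST_NAT)
>
> enum { DIR_ORIGINAL = 0, DIR_REPLY = 1 };
> enum { MANIP_SRC = 0, MANIP_DST = 1 };
>
> /* Mirrors the early return added above: translate the embedded header
>  * only when the matching NAT bit is set on the conntrack entry. */
> static bool translate_embedded(unsigned long status, int dir, int manip)
> {
> 	unsigned long statusbit = (manip == MANIP_SRC) ? IPS_SRC_NAT : IPS_DST_NAT;
>
> 	if (dir == DIR_REPLY)
> 		statusbit ^= IPS_NAT_MASK;
> 	return status & statusbit;
> }
>
> int main(void)
> {
> 	/* IPVS connection: no NAT bits -> 0, embedded header untouched. */
> 	printf("IPVS reply dir:    %d\n",
> 	       translate_embedded(0, DIR_REPLY, MANIP_DST));
> 	/* Plain netfilter DNAT: IPS_DST_NAT set -> 1, translation happens. */
> 	printf("nf DNAT reply dir: %d\n",
> 	       translate_embedded(IPS_DST_NAT, DIR_REPLY, MANIP_SRC));
> 	return 0;
> }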
>
> 	If this patch does not help we have to debug it
> somehow.
>
>> I would happily leave conntrack off, but it has a huge performance impact.
>> With my traffic profile the softirq load doubles when I turn off
>> conntrack. My busiest director is doing 2.1Gb of traffic and with
>> conntrack off it can probably only handle 2.5Gb.
>
> 	It is interesting to know about such a comparison
> for conntrack=0 and 1. Can you confirm both numbers again?
> 2.1 is not better than 2.5.
>
>> I am hoping that this issue has been observed and fixed and someone will
>> be able to point me to the patch so I can back port it to my kernels (or
>> finally get rid of CentOS 5!).
>>
>> Thanks
>> Tim
>
> Regards
>
> --
> Julian Anastasov <ja at ssi.bg>
>



