Problems with IPVS

Roberto Nibali ratz at drugphish.ch
Tue Oct 17 16:35:42 BST 2006


>  Dumps attached on previous e-mail were done on bond0 interface which is
> facing proxy. tcpdumps done on proxy confirms the problem.

Hehe, you definitely want to use all possible features of Linux 
networking. How is your bonding configured, ALB? There is an outstanding 
issue with packet reassembly on bond devices using ALB. It's highly 
unlikely that you're experiencing it, though, but it could explain why 
your ethereal trace doesn't look perfect :).
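If you want to check quickly, the bonding driver exports its mode via 
procfs (path as on a stock 2.6 kernel):

```shell
# Show the bonding mode of bond0; "adaptive load balancing" means ALB
grep "Bonding Mode" /proc/net/bonding/bond0
```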

>  tcpdump.cap - DNAT case
>  tcpdump2.cap - LVS case
>  tcpdump3.cap - LVS case and Nokia phone

Still no data at my end.

>>>  1. phone sends SYN packet to proxy;
>>
>> Means (from previous email context):
>>
>> Phone --> GRE tunnel --> netwap --> fwmark --> LVS --> proxy
> 
>  Yes. netwap is interface on the same server running LVS.

Ok.

>> How many devices are we talking about including Phone and proxy?
> 
>  Phone, SGSN/GGSN, PIX firewall (one end of GRE is there), server, proxy.

Excellent, thanks. Does the PIX belong to the carrier? I presume the IP 
addresses behind the PIX are still non-publicly routable?

@Joe: In case you want to update the LVS-Howto:
       http://en.wikipedia.org/wiki/SGSN
       http://tools.ietf.org/html/rfc3344

>>>  2. proxy responds with SYN,ACK;
>>>  3. phone sends ACK;
>>
>> Beautiful, if this goes through LVS, it's already a big step towards a 
>> correctly working LVS.
> 
>  Nokia phones work through LVS without problems.

Hmm, since you talk about re-transmissions, I wonder whether one of the 
following contexts applies (http://tools.ietf.org/html/rfc3344#page-83):

C.1. TCP Timers

    When high-delay (e.g. SATCOM) or low-bandwidth (e.g. High-Frequency
    Radio) links are in use, some TCP stacks may have insufficiently
    adaptive (non-standard) retransmission timeouts.  There may be
    spurious retransmission timeouts, even when the link and network
    are actually operating properly, but just with a high delay because
    of the medium in use.  This can cause an inability to create or
    maintain TCP connections over such links, and can also cause unneeded
    retransmissions which consume already scarce bandwidth.  Vendors
    are encouraged to follow the algorithms in RFC 2988 [31] when
    implementing TCP retransmission timers.  Vendors of systems designed
    for low-bandwidth, high-delay links should consult RFCs 2757 and
    2488 [28, 1].  Designers of applications targeted to operate on
    mobile nodes should be sensitive to the possibility of timer-related
    difficulties.

C.2. TCP Congestion Management

    Mobile nodes often use media which are more likely to introduce
    errors, effectively causing more packets to be dropped.  This
    introduces a conflict with the mechanisms for congestion management
    found in modern versions of TCP [21].  Now, when a packet is dropped,
    the correspondent node's TCP implementation is likely to react as
    if there were a source of network congestion, and initiate the
    slow-start mechanisms [21] designed for controlling that problem.
    However, those mechanisms are inappropriate for overcoming errors
    introduced by the links themselves, and have the effect of magnifying
    the discontinuity introduced by the dropped packet.  This problem has
    been analyzed by Caceres, et al. [5].  TCP approaches to the problem
    of handling errors that might interfere with congestion management
    are discussed in documents from the [pilc] working group [3, 9].
    While such approaches are beyond the scope of this document,
    they illustrate that providing performance transparency to mobile
    nodes involves understanding mechanisms outside the network layer.
    Problems introduced by higher media error rates also indicate the
    need to avoid designs which systematically drop packets; such designs
    might otherwise be considered favorably when making engineering
    tradeoffs.
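To make the C.2 point concrete, here's a toy cwnd trace (Tahoe-style 
reaction, my simplification, segments per RTT purely illustrative): a 
single media-error loss is treated as congestion and throws the sender 
back into slow start:

```shell
# Toy congestion-window simulation (Tahoe-style reaction to loss:
# ssthresh = cwnd/2, cwnd = 1). All numbers are invented.
cwnd_sim() {
    awk 'BEGIN {
        cwnd = 1; ssthresh = 64
        for (rtt = 1; rtt <= 10; rtt++) {
            if (rtt == 6) {                  # one loss caused by a media error
                ssthresh = int(cwnd / 2); cwnd = 1
            } else if (cwnd < ssthresh) {
                cwnd *= 2                    # slow start: exponential growth
            } else {
                cwnd += 1                    # congestion avoidance: linear growth
            }
            printf "RTT %2d: cwnd=%d\n", rtt, cwnd
        }
    }'
}
cwnd_sim
```

It takes another four round trips just to get back to where it was; on 
a GPRS link with seconds of RTT, that alone can look like a hung 
connection.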

But then we'd definitely have a problem with IPVS. However, let's not 
jump to conclusions too early.

>>>  4. phone sends HTTP GET request;
>>>  5. proxy ACKs packet 4;
>> Only ACK? No data?
> 
>  Yes.

Window size? Advertised window size?
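You can pull them straight out of your capture, e.g. (assuming the 
tcpdump2.cap from your earlier mail and GNU grep):

```shell
# Print the advertised windows from the LVS-case capture
tcpdump -nn -r tcpdump2.cap tcp | grep -o 'win [0-9]*'
```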

>>>  6. proxy sends HTTP data packet;
>>>  7. proxy sends another HTTP data packet;
>>>  8. proxy sends FIN packet;
>>>
>>>  weird things starts here
>>>
>>>  9. phone once more sends ACK packet acknowledging packet 2 
>>> (duplicate of packet 3);
>> Does the proxy have SACK/FACK support enabled?
> 
>  Proxy is CentOS4 Linux server running Squid.

And you see nothing unusual in your Squid logs when connecting with the 
SE phones?

> # sysctl net.ipv4.tcp_fack net.ipv4.tcp_sack
> net.ipv4.tcp_fack = 1
> net.ipv4.tcp_sack = 1

Does disabling (just for a test) SACK change anything?
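Something like this on the proxy, just for the duration of the test 
(remember to set them back to 1 afterwards):

```shell
# Temporarily disable SACK/FACK on the proxy (revert after the test)
sysctl -w net.ipv4.tcp_sack=0
sysctl -w net.ipv4.tcp_fack=0
```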

>>>  10. and one more dupe of packet 3;
>>>  11.-14. proxy repeats packet 6. 4 times.
>> It has to. Is ECN enabled?
> 
>  Once again, sysctl says no. Both on the LVS server and on the proxy.

What are the kernel versions? (Sorry, if this is a dupe.)

>>>  The problem is that LVS does not pass packets 11. to 14. to phone. Why?
>> Because packet 8 was FIN and LVS is not stateful with regard to TCP 
>> sessions and retransmits.
> 
>  But the phone has not acknowledged that FIN yet?

Sure, but we act on the first FIN seen with regard to template 
expiration, IIRC:

http://www.drugphish.ch/~ratz/IPVS/ip__vs__proto__tcp_8c.html#a36

But I'd need to check the code again. Take this with a grain of salt.
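To sketch what I mean (a toy model only -- the timeouts and the event 
trace are invented, this is *not* the logic from ip_vs_proto_tcp.c): 
once the first FIN is seen, the connection entry's remaining lifetime 
shrinks, so a late retransmit can hit an already-expired entry and gets 
dropped.

```shell
# Toy model of connection-entry expiry after the first FIN (timeouts
# and trace are invented; not the real IPVS state table).
lvs_sim() {
    printf '%s\n' '0 SYN' '1 SYNACK' '2 ACK' '3 DATA' '4 FIN' '125 DATA-retransmit' |
    awk '{
        t = $1 + 0; flags = $2
        if (flags ~ /FIN/)   { expire = t + 60 }   # first FIN: shrink entry timeout
        else if (!expire)    { expire = t + 900 }  # established entry lives longer
        verdict = (t < expire) ? "forwarded" : "DROPPED (entry expired)"
        printf "t=%ss %s -> %s\n", t, flags, verdict
    }'
}
lvs_sim
```

With plain DNAT the retransmits after the FIN are still forwarded, 
which could match what you're seeing.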

>>>  In case of DNAT packets 11.-14. are passed to phone which at the end 
>>> acknowledges packets 6. and 7. and then acknowledges packet 8. thus 
>>> closing TCP connection.
>> Here I don't follow your statements, sorry.
> 
>  If I set up DNAT instead of LVS then packets 11.-14. are sent to the 
> phone. In case of LVS they are not.

So you get to see packets 11-14 on the outbound interface of LVS from 
Squid, but never on the inbound interface (direction of PIX)? This is 
very odd!

>  And after phone receives those packets it sends ACK to packets 6. and 
> 7. and then to 8.

But only for DNAT.

Regards,
Roberto Nibali, ratz
-- 
echo 
'[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc


More information about the lvs-users mailing list