load balancing trouble at a high load
Hideaki Kondo
toreno4257 at nifty.com
Sun Jul 2 14:57:14 BST 2006
Hello,
I think I have found the reason for the limit (about 28230)
on ActConn + InActConn.
The default of /proc/sys/net/ipv4/ip_local_port_range is "32768 61000".
In short, 61000 - 32768 = 28232.
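
The arithmetic above can be checked with a small Python sketch. The function names are only illustrative, and the input string is the default value quoted above; on a real machine it would be read from /proc/sys/net/ipv4/ip_local_port_range. If both end ports are usable, the count is one more than the difference.

```python
def parse_port_range(text):
    """Parse the two integers stored in ip_local_port_range."""
    low, high = (int(field) for field in text.split())
    return low, high


def port_span(low, high):
    """Difference used in this message (61000 - 32768)."""
    return high - low


# Default value quoted in this message.
low, high = parse_port_range("32768 61000")
span = port_span(low, high)
print(span)      # 28232
print(span + 1)  # 28233, if both end ports are counted
```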
Our test environment has only one client.
The hash key of ip_vs_conn_tab (connection table) is based on
protocol, s_addr(caddr), s_port(cport), d_addr(vaddr), and d_port(vport).
So I think that, for one client connecting to the same virtual server,
only cport varies, and the maximum number of distinct hash keys is
28232 (by default).
Therefore, I think there is a limit on ActConn + InActConn for each client
at a high load: the hash keys available in ip_vs_conn_tab for connections
from the same client to the same virtual server (to a realserver) are used up.
I think the strange behavior at a high load was caused by this.
In short, the load balancing trouble at a high load is mainly caused by
two things: ip_vs_conn_tab is managed by a hash key based on the above
elements, and the port range of a single client is limited.
But I think this behavior of ip_vs is not a problem in a real environment.
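
As a rough illustration of this reasoning, the following Python sketch counts the distinct connection keys that clients can produce. It is only a model of the key tuple, not the ip_vs code; the client addresses are hypothetical, and the port count follows the 61000 - 32768 figure used above.

```python
from collections import namedtuple

# Same fields as the hash key described in this message.
ConnKey = namedtuple("ConnKey", "protocol caddr cport vaddr vport")


def max_entries(clients, low=32768, high=61000,
                vaddr="192.168.0.101", vport=80):
    """Count distinct keys when every client uses each of its source ports once."""
    keys = set()
    for caddr in clients:
        # range(low, high) yields high - low ports, matching the figure above.
        for cport in range(low, high):
            keys.add(ConnKey("TCP", caddr, cport, vaddr, vport))
    return len(keys)


# Hypothetical client addresses, for illustration only.
one = max_entries(["192.168.0.1"])
three = max_entries(["192.168.0.1", "192.168.0.2", "192.168.0.3"])
print(one)    # 28232
print(three)  # 84696
```

With a single client the ceiling is the size of its port range; with several clients the ceiling grows in proportion, which is why the limit should rarely be reached outside a one-client test.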
# I'm very sorry for my poor English.
Thanks a lot.
Best regards,
----- Original Message -----
From: "Hideaki Kondo" <kondo.hideaki at oss.ntt.co.jp>
To: <lvs-users at LinuxVirtualServer.org>
Sent: Thursday, May 25, 2006 7:22 PM
Subject: Re: load balancing trouble at a high load
>
> Hello,
>
> I would like to add a supplementary explanation to my report,
> because the information I gave was incomplete.
>
>> <<Trouble Process>>
>> (1) Give a high load to LB1 with while_wget & while_ab from CL1.
>> (while_wget and while_ab are simple shell scripts.
>> while_wget repeats "wget -O index.html http://192.168.0.101:80/index.html"
>> without sleep.
>> while_ab also repeats "ab -n 10000 -c 10 http://192.168.0.101:80/index.html"
>> without sleep.)
>> LB1 is correctly load balancing to RS1 and RS2 by round robin at this time.
>>
>> (2) After a few minutes, ActiveConns + InActiveConns seems to reach
>> a maximum limit(?).
>
> It seems that the trouble always occurs when ActiveConn + InActConn is
> close to a total of 28231, as follows.
>
> ------------------------------------------------------------------------
> IP Virtual Server version 1.2.0 (size=4096)
> Prot LocalAddress:Port Scheduler Flags
>   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
> TCP  192.168.0.101:http rr
>   -> rs02:http                    Masq    1      0          1
>   -> rs01:http                    Masq    1      1          28229
> ------------------------------------------------------------------------
>
>> Then intentionally take down the NIC (eth0) of RS2 by manually executing
>> "ifconfig eth0 down".
>> (while_wget & while_ab freeze at this time.
>> This is expected.)
>> And change the weight from 1 to 0 by manually executing "ipvsadm -e -t 192.168.0.101:80
>> -r 192.168.1.2:80 -w 0 -m". (I tested this without ldirectord
>> to check the behavior of LVS in detail.)
>>
>> (3) Give a new high load to LB1 with while_wget & while_ab from CL1
>> in place of the old high load from (1).
>> LB1 correctly sends http packets only to RS1 at this time.
>>
>> (4) Then intentionally bring the NIC (eth0) of RS2 back up by manually executing
>> "/etc/init.d/network restart".
>> After a while, LB1 starts sending http packets to both RS1 and RS2 even though
>> the weight of RS2 is still 0. Moreover, LB1 sends far fewer packets to RS2
>> than to RS1.
>> (This strange behavior continues indefinitely, so I think the cause
>> is not necessarily the TCP-layer retransmit process.
>> In fact, the strange behavior stops when I stop the high load from CL1.)
>
> I enabled "IP virtual server debugging" in "make menuconfig",
> built kernel-2.6.9-22.EL, set "net.ipv4.vs.debug_level=15"
> in sysctl.conf, and added "kern.* /var/log/kernel.log" to syslog.conf.
> As far as I can tell from kernel.log, LB1 does not seem to be
> load balancing to RS2 at stage (4). ???
> So I cannot rule out that the strange behavior is related to
> the TCP-layer retransmit process, etc.
>
> Checking with "ipvsadm -Lc", there are many connections in the TIME_WAIT state,
> and the InActConn count seems to reflect them.
> By the way, according to the ip_vs source code (ip_vs_proto_tcp.c),
> IP_VS_TCP_S_TIME_WAIT is 2*60*HZ.
> When I changed IP_VS_TCP_S_TIME_WAIT from 2*60*HZ to 10*HZ etc. (much smaller
> than 2*60*HZ), the strange behavior seemed to improve.
>
> Is IP_VS_TCP_S_TIME_WAIT related to the cause of the trouble?
> I think some timers in LVS are related to the behavior ...??
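
A rough way to see why a shorter TIME_WAIT timeout might help is to estimate how many entries linger in the table: roughly the connection rate multiplied by the entry lifetime. The Python sketch below is only a back-of-the-envelope model. It assumes that every connection stays in the table for the full timeout, that the ceiling is the 28232 figure for one client, and the rate of 1000 connections per second is a hypothetical value, not a measurement.

```python
def steady_state_entries(conn_per_sec, timeout_sec, ceiling=28232):
    """Entries lingering in the table: rate * lifetime, capped at the ceiling."""
    return min(int(conn_per_sec * timeout_sec), ceiling)


def max_rate_below_ceiling(timeout_sec, ceiling=28232):
    """Highest connection rate that keeps the table under the ceiling."""
    return ceiling / timeout_sec


# 2*60*HZ corresponds to 120 seconds, 10*HZ to 10 seconds.
print(round(max_rate_below_ceiling(120), 1))  # 235.3
print(round(max_rate_below_ceiling(10), 1))   # 2823.2

# Hypothetical load of 1000 connections per second.
print(steady_state_entries(1000, 120))  # 28232 (ceiling reached)
print(steady_state_entries(1000, 10))   # 10000
```

Under these assumptions, a 120-second timeout fills the table at a few hundred connections per second from one client, while a 10-second timeout leaves room for roughly ten times that rate.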
>
>>
>> (5) After observing this strange behavior for a while, change the weight from 0 to 1 by
>> manually executing "ipvsadm -e -t 192.168.0.101:80 -r 192.168.1.2:80 -w 1 -m".
>> But the strange behavior continues indefinitely: LB1 sends far fewer packets
>> to RS2 than to RS1 in spite of round robin.
>
> To be precise, there are two cases.
> In one case, LB1 sends far fewer packets to RS2 than to RS1;
> in the other case, LB1 sends no packets to RS2 at all.
> In fact, the same two cases also occur in (4).
>
>>
>> (6) Then stop all high load (while_wget & while_ab) from CL1, and wait a few
>> minutes until ActiveConns + InActiveConns is close to 0.
>> Then start a new high load from CL1 with while_wget & while_ab;
>> LB1 correctly and evenly load balances to RS1 and RS2, the same as in (1).
>> Is this trouble related to some timers, u_threshold, dest->flag
>> (IP_VS_DEST_F_OVERLOAD/IP_VS_DEST_F_AVAILABLE), etc. in LVS?
>> Is this strange behavior correct according to the specification of LVS?
>> (I don't understand the specification of LVS in detail.)
>> Is there any way to cope with this trouble?
>>
>> I'm sorry for asking so many questions.
>> If you have any information or hints about the trouble,
>> would you please share them with me?
>>
>
> Thanks in advance.
> Best regards,
>
> --
> Hideaki Kondo
--
Hideaki Kondo