DR Load balancing active/inactive connections

RU Admin lvs-user at camden.rutgers.edu
Tue Nov 28 13:37:28 GMT 2006


On Tue, 28 Nov 2006, Horms wrote:

> On Tue, Nov 21, 2006 at 08:57:59AM -0500, RU Admin wrote:
>>
>> I've been using IPVS for almost two years now.  I started out with 6
>> machines (1 director, 5 real servers) and was using LVS-NAT.  During
>> the first year that I was running that email server everything worked
>> perfectly with LVS-NAT.  About a year ago, I decided to set up another
>> email server, this time with 5 machines (1 director, 4 real servers)
>> and decided it was time to get LVS-DR working, which I successfully
>> did.  I then decided to switch over my first email server (the one
>> with 6 machines) to LVS-DR, since the other LVS-DR server was working
>> great. Both of my email servers have been working great with LVS-DR
>> for the past year, with one major exception (which has just recently
>> started getting worse, because of the large volumes of connections
>> coming into the servers).  The problem I am having is that my
>> active/inactive connections are not being listed properly.  What I
>> mean is that the counters for my active/inactive connections just keep
>> going up and up, and are constantly skewed.  I read through a
>> good number of archived messages on this mailing list, and I keep
>> seeing everyone saying "Those numbers ipvsadm is showing are just
>> for reference, they don't really mean anything, don't worry about
>> them."   Well, I can tell you first hand, when you use wlc (weighted
>> least connections), those numbers obviously DO mean something.  My
>> machines are no longer being balanced equally because my
>> connection counts are off, and this is really affecting the
>> performance of my email servers.  When running "ipvsadm -lcn", I can
>> see connections with the CLOSE state going from 00:59 to 00:01, and
>> then magically going back to 00:59 again for no reason.  The same
>> holds true for ESTABLISHED connections, I see them go from 29:59 to
>> 00:01 and then back to 29:59, and I know for a fact that the
>> connection from the client has ended.
>
> I seem to recall a bug relating to connection entries having
> the behaviour you describe above due to a race in reference counting.
> Which version of the kernel do you have? Is there any chance of updating
> it to something like 2.6.18?

I'm using a stock Debian Sarge kernel (2.6.8-2-686-smp).  I can definitely
build the latest kernel, and if you feel that it will help then I'll do
that.  It's always risky making a major kernel change on a production
machine, which is why I wanted to hold off from making that change until
someone else familiar with IPVS felt that it might help.

>
>> I'm currently using "IP Virtual Server version 1.2.0", and I know that
>> there is a 1.2.1 version available, but my problem is that my email
>> servers are in a production environment, and I really don't want to
>> recompile a new kernel with the latest IPVS if that isn't going to
>> solve the problem.  I'd hate to cause other problems with my system
>> because of a major kernel upgrade.
>>
>> I can only hope that someone has some suggestions.  I am a firm
>> supporter of IPVS, and as I said I've been using it for 2 years now
>> and one of my email servers handles over 30,000,000 emails in one
>> month (or almost 1 million emails a day).  So we rely heavily on
>> IPVS.  There is another department in our organization that spent
>> thousands of dollars on FoundryNet load balancing products, and
>> I've been able to accomplish the same tasks (and handle a higher load)
>> by using IPVS, so clearly IPVS is a solid product.  Unfortunately, I
>> just really need to figure out what is going on with the connection
>> count problems.
>>
>> I'm not sure what information you guys need, but here's some info about
>> my setup.  If you need any more details, feel free to ask.
>>
>> 6 Dell PowerEdge SC1425
>> Dual Xeon 3.06GHz processors
>> 2GB DDR
>> 160GB SATA
>> Running Debian Sarge
>>
>> 1 machine is the director, the other 5 are the real servers.  All 6
>> machines are on the same subnet (with public IPs), and the director is
>> using LVS-DR for load balancing.  Just to give you an idea as to the
>> types of connection numbers I'm getting:
>>   Prot LocalAddress:Port Scheduler Flags
>>     -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
>>   TCP  vip.address.here:smtp wlc
>>     -> realserver1.ip.here:smtp     Route   50     648        2357
>>     -> realserver2.ip.here:smtp     Route   50     650        2231
>>     -> realserver3.ip.here:smtp     Route   50     648        2209
>> Whereas when using LVS-NAT (which was 100% perfect), my numbers would be
>> something like:
>>     -> realserver1.ip.here:smtp     Route   50     16         56
>>     -> realserver2.ip.here:smtp     Route   50     14         50
>>     -> realserver3.ip.here:smtp     Route   50     15         48
>
> I assume that the dumps above are for similar traffic rates.

Yes, almost identical traffic rates as compared to the mail logs on the 
servers for incoming email traffic.
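
To illustrate why skewed counters matter under wlc, here is a minimal Python
sketch of how a weighted least-connection choice could be made from the
ActiveConn/InActConn columns.  It assumes, as I understand the kernel's wlc
scheduler, that an active connection counts 256 times as much as an inactive
one and that the server with the lowest overhead-to-weight ratio wins.  The
function names are hypothetical, and this is not the kernel code itself.

```python
from fractions import Fraction


def wlc_overhead(active, inactive):
    # Assumption: active connections weigh 256x as much as inactive ones.
    return active * 256 + inactive


def pick_wlc(servers):
    """servers maps name -> (weight, active, inactive).

    Returns the name with the smallest overhead/weight ratio.
    Servers with weight 0 are skipped.
    """
    usable = {name: s for name, s in servers.items() if s[0] > 0}
    return min(
        usable,
        key=lambda name: Fraction(
            wlc_overhead(usable[name][1], usable[name][2]), usable[name][0]
        ),
    )


# Counters taken from the LVS-DR listing quoted in this thread.
dr_counts = {
    "realserver1": (50, 648, 2357),
    "realserver2": (50, 650, 2231),
    "realserver3": (50, 648, 2209),
}

print(pick_wlc(dr_counts))
```

If some of those active entries are stale, the real server holding them looks
busier than it really is and is passed over for new connections, which would
match the uneven balancing described above.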

>
> I am wondering if the problem is that for some reason the
> linux-directors are not seeing the part of the close sequence
> that is sent by the end-user (it won't see the portion sent by
> the real-servers). Supposing for a minute that this is the case,
> it would explain the strange numbers, and those strange numbers
> will be affecting how wlc allocates connections.

But shouldn't IPVS time out?  I thought that was the purpose of the
timeouts: when the director doesn't see a close event after a specified
period of time, the entry simply times out.
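
One way to check whether entries really expire is to compare two snapshots of
the connection table taken a few seconds apart and list the entries whose
remaining time went up instead of down.  The Python sketch below does that on
text in the column layout I believe "ipvsadm -lcn" prints (pro, expire, state,
source, virtual, destination).  The sample addresses are made up, and the
function names are hypothetical.  Note that a timer also restarts legitimately
whenever a packet arrives for that connection, so the result is only a list of
candidates to look at, not proof of a bug.

```python
def parse_expire(text):
    # "29:59" -> 1799 seconds; also accepts an hh:mm:ss style value.
    total = 0
    for part in text.split(":"):
        total = total * 60 + int(part)
    return total


def parse_entries(output):
    """Map (source, virtual, destination) -> (state, seconds left)."""
    entries = {}
    for line in output.splitlines():
        fields = line.split()
        # Skip the banner and the column header; keep 6-field data rows.
        if len(fields) != 6 or fields[0] not in ("TCP", "UDP"):
            continue
        _pro, expire, state, source, virtual, destination = fields
        entries[(source, virtual, destination)] = (state, parse_expire(expire))
    return entries


def refreshed(before, after):
    """Entries still in the same state whose remaining time increased."""
    return sorted(
        key
        for key, (state, left) in after.items()
        if key in before and before[key][0] == state and left > before[key][1]
    )


# Hypothetical snapshots using documentation address ranges.
BEFORE = """IPVS connection entries
pro expire state       source             virtual            destination
TCP 00:03  CLOSE       198.51.100.7:40112 192.0.2.10:25      192.0.2.21:25
TCP 12:40  ESTABLISHED 198.51.100.9:51514 192.0.2.10:25      192.0.2.22:25
"""

AFTER = """IPVS connection entries
pro expire state       source             virtual            destination
TCP 00:58  CLOSE       198.51.100.7:40112 192.0.2.10:25      192.0.2.21:25
TCP 12:30  ESTABLISHED 198.51.100.9:51514 192.0.2.10:25      192.0.2.22:25
"""

for key in refreshed(parse_entries(BEFORE), parse_entries(AFTER)):
    print("timer went back up:", key)
```

A CLOSE entry that keeps showing up in this list is the more suspicious case,
since the client should no longer be sending anything for it.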

>
>> I use keepalived to manage the director and to monitor the real
>> servers. The only "tweaking" that I've done to IPVS is that I have
>> to run this:
>>   /sbin/ipvsadm --set 1800 0 0
>> before starting up keepalived, just so that the active connections
>> will stay active for 30 minutes.  In other words, we allow our users
>> to idle their connection for 30 minutes, and after that, then the
>> connection should be terminated.  And I put "0 0" there, because from
>> what I've read, that tells ipvsadm to not change those other two
>> values (in other words, leave the defaults as is).
>>
>> That's about all I can think of.  The only other weird thing that I had
>> to do was to tweak some networking settings on the real servers to fix
>> the pain-in-the-@$$ ARP issues that come with DR.  But I doubt those
>> changes would have anything to do with the director's load balancing
>> problems. Those tweaks were only done on the real servers, and they
>> were to just silence the broadcasting of the MAC address for the VIP
>> (dummy0) interfaces on the real servers.
>
> How exactly did you deal with ARP?  There are several methods.

On the real servers, I'm first bringing up the dummy0 interface with the 
VIP, then I use "sysctl" and set the following:
   net.ipv4.conf.dummy0.rp_filter=0
   net.ipv4.conf.dummy0.arp_ignore=1
   net.ipv4.conf.dummy0.arp_announce=2
Then I bring up eth0 with the real server's regular IP address, and with 
"sysctl", I set the following (includes a repeat of the above options):
   net.ipv4.conf.default.rp_filter=0
   net.ipv4.conf.all.rp_filter=0
   net.ipv4.conf.lo.rp_filter=0
   net.ipv4.conf.dummy0.rp_filter=0
   net.ipv4.conf.eth0.rp_filter=0

   net.ipv4.conf.default.arp_ignore=1
   net.ipv4.conf.all.arp_ignore=1
   net.ipv4.conf.lo.arp_ignore=1
   net.ipv4.conf.dummy0.arp_ignore=1
   net.ipv4.conf.eth0.arp_ignore=1

   net.ipv4.conf.default.arp_announce=2
   net.ipv4.conf.all.arp_announce=2
   net.ipv4.conf.lo.arp_announce=2
   net.ipv4.conf.dummy0.arp_announce=2
   net.ipv4.conf.eth0.arp_announce=2

The ARP problem was the one thing that kept me from moving to LVS-DR for a 
long time.  I finally started playing with all of the net.ipv4.conf 
options and bringing up the interfaces in a specific order, and finally 
stumbled across a method that actually worked.  I'm sure some of the above 
options don't need to be set, but it finally works, and I'm a little 
afraid to touch it.
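
For anyone who wants to reproduce the second batch of settings without typing
all fifteen lines, here is a small Python sketch that generates them from the
interface names and values listed above.  It only reproduces my list; it makes
no claim about which of the settings are actually required.  The function name
is hypothetical.

```python
# Interface names and values exactly as listed in this message.
INTERFACES = ("default", "all", "lo", "dummy0", "eth0")
SETTINGS = {"rp_filter": 0, "arp_ignore": 1, "arp_announce": 2}


def sysctl_lines(interfaces=INTERFACES, settings=SETTINGS):
    """Return one 'net.ipv4.conf.<iface>.<key>=<value>' line per pair."""
    return [
        f"net.ipv4.conf.{iface}.{key}={value}"
        for key, value in settings.items()
        for iface in interfaces
    ]


for line in sysctl_lines():
    print(line)
```

The output could be saved to a file and loaded with "sysctl -p <file>" once
the interfaces exist, which should be equivalent to setting each value by
hand.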

I'm going to try to build the latest 2.6.18 now, and hopefully sometime
later this week I can install the new kernel and reboot our director.
Unfortunately I've never been able to get keepalived to handle a
MASTER/SLAVE director properly, so I only have one director in front of
the real servers.  If I make a mistake, our main university email
server will be down.

Thanks for your help!

Craig



>
> -- 
> Horms
>  H: http://www.vergenet.net/~horms/
>  W: http://www.valinux.co.jp/en/
>


      +---------------------------+-------------------------------------+
      | Craig Hynes               | Systems Programmer/Administrator    |
      | master at camden.rutgers.edu | Rutgers Camden Computing Services   |
      | (856) 225-2668            | http://computing.camden.rutgers.edu |
      +---------------------------+-------------------------------------+
