Major issue with LVS-DR when a server gets overloaded

Roberto Nibali ratz at drugphish.ch
Wed Feb 14 23:19:13 GMT 2007


> I've been using a Foundry Networks ServerIron XL until now, with DSR
> ("Direct Server Response" aka "Direct Response", "DR") to load-balance
> one virtual server to six real web servers.
> 
> All 6 real servers are identical and have the same weight. When one of
> the real web servers gets overloaded, the URL checked on it starts
> sending a 500 status code instead of the normal 200. In this case, I
> would expect the real server to be taken out of the load-balancing and
> all traffic to be sent to the other 5 remaining real servers. But no.
> 
> In the above scenario, what I saw in the ServerIron logs was that the
> real server was properly detected as "down", but all web traffic got
> sent to this server instead of having it taken offline(!!), nothing
> sent to the others, thus timeouts and some 500 errors for all clients,
> and a real server in bad shape, needing to be rebooted in many cases.

This is either a massive bug in the ServerIron firmware or a
configuration glitch on your side. Care to post the relevant part of the
configuration?

> I thought this was a bug with the ServerIron. So I looked at LVS.
> 
> I implemented a parallel identical setup using LVS and keepalived. The
> setup is similar, with LVS-DR and all 6 real web servers. Only the
> virtual server IP address changes, obviously, to keep both setups
> in parallel.
> 
> My limited testing worked fine, but when I started sending real
> traffic, the exact same issue as with the ServerIron happened!
> Symptoms:
> - The virtual IP address no longer responds on port 80, it times out
> - The real server having problems gets REALLY overloaded
> - All other 5 real servers no longer receive any traffic
> - ipvsadm shows no new active or inactive connections (counters stay the
> same), only the persistent connections counters decrease slowly (as
> expected since there are no new connections...)
> 
> Even after restarting the problematic real server, keepalived re-adds
> it properly, but nothing works anymore. I need to restart keepalived
> (which flushes the ipvsadm configuration by default) for things to
> start working again.

How exactly do you get your RS to dynamically switch from HTTP response 
code 200 to 500? Have you checked the HTTP response header using a CLI 
tool like curl, lynx or wget?

> I am really confused. I've tried stopping the web daemon on one of the
> real servers under production load, and it gets taken out as expected,
> and all keeps working fine. It seems that only when the web server still
> responds with 500 status and gets detected as down, then up, then down
> again etc. does the problem appear. Note that the setup can work fine
> for hours and hours, the issue only appears when a real server has a
> problem.

This, however, sounds more like a "flapping" or threshold ping-pong issue.

> I would like to have tried some kind of "keep the real server disabled
> for n seconds when it's detected as down" in order to keep the check
> from flip-flopping like this, but there is no such setting in
> keepalived AFAICS.
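
To make the hold-down idea concrete, here is a minimal sketch of the
logic being asked for. It is not a keepalived feature or API; the class
name, the hold_seconds parameter and the explicit "now" argument are all
assumptions made for illustration.

```python
class HoldDownGate:
    """Keep a server out of rotation for hold_seconds after any failed check."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.down_until = None  # time before which the server stays out

    def update(self, check_ok, now):
        """Feed one health-check result; return True if the server may serve."""
        if not check_ok:
            # Every failure (re)starts the hold-down timer.
            self.down_until = now + self.hold_seconds
            return False
        if self.down_until is not None and now < self.down_until:
            # The check passed, but the timer has not expired yet.
            return False
        self.down_until = None
        return True


# Usage: with a 30 second hold-down, a failure at t=10 keeps the server
# out until t=40 even if checks in between succeed.
gate = HoldDownGate(30)
for t, ok in [(0, True), (10, False), (15, True), (39, True), (40, True)]:
    print(t, ok, gate.update(ok, t))
```

The point of the design is that a single passing check is not enough to
re-add a server that failed moments ago, which damps the down/up/down
cycle described above.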

Would it be possible and good enough for you to use the threshold
limitation feature by setting an upper and lower threshold for the
number of active + inactive connections?
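
For reference, the behaviour of such a pair of thresholds can be modelled
as below. This is a simplified sketch, not the IPVS code: the class and
the numbers are made up, and it assumes the semantics described in the
ipvsadm man page for -x/--u-threshold and -y/--l-threshold (no new
connections once the count goes over the upper value, new connections
again once it drops below the lower value).

```python
class ThresholdGate:
    """Hysteresis on the active + inactive connection count of one server."""

    def __init__(self, upper, lower):
        if not 0 < lower < upper:
            raise ValueError("need 0 < lower < upper")
        self.upper = upper
        self.lower = lower
        self.overloaded = False

    def update(self, conns):
        """Feed the current connection count; return True if new
        connections may still be scheduled to this server."""
        if self.overloaded:
            # Only recover once the count has fallen below the lower mark.
            if conns < self.lower:
                self.overloaded = False
        elif conns > self.upper:
            self.overloaded = True
        return not self.overloaded


# Usage with made-up limits: stop at more than 100, resume below 60.
gate = ThresholdGate(upper=100, lower=60)
for conns in (50, 101, 80, 59, 100):
    print(conns, gate.update(conns))
```

The gap between the two values is what prevents the ping-pong: a server
hovering around a single limit would otherwise be enabled and disabled on
every small change in its connection count.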

Regards,
Roberto Nibali, ratz
-- 
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
