load balancer sending to the wrong real servers
dburke at techteam.com
Mon Sep 13 13:42:48 BST 2004
Sorry if this is a repost, I don't remember ever actually seeing it get
This one is really weird to me. We've been using heartbeat and
ldirector with lvs for a few years now with relatively few problems.
Indeed there's been a lot of time where we haven't had to think about
I'll try and detail the problem clearly. First the background...
We have a setup of 8 servers total.
Two of them are Dell 1U's running Fedora C2 acting as the Load Balancers
for a total of 6 VIP's. Using heartbeat and ldirector they failover to
Two of them are SunFire boxes, running Solaris 8 and iPlanet web server.
The remaining 4 are Dell boxes running Fedora C2 with apache.
Two weeks ago we upgraded the linux boxes to Fedora C2, applying all the
updates with my favorite utility (apt-get) :)
After the upgrade we started having problems with the Solaris boxes. It
seemed like after a while LB2 just decided not to balance them anymore,
so http requests were timing out. Once we restarted heartbeat on LB2 it
was fine for nearly a day before the problem came back. The time between
failures was not an exact interval (the last time we did an upgrade we
had an ARP problem that recurred every 4 hours like clockwork).
What got weird, however, was that after a week of dealing with this we
started seeing LB2 take the traffic that should have been going to the
Sun boxes and direct it to the linux boxes. Since apache on those boxes
didn't have a virtual host set up for that IP, people were getting the
default apache "fresh install" page. Once we bounced heartbeat on LB2
everything was fine for a while again.
Through research I found that the NOARP flag doesn't seem to work in
linux, which is odd, but we were seeing the vast majority of the traffic
go to just one linux web server while the tables in ipvsadm were not
changing. So I have applied the arp_ignore and arp_announce settings on
the linux web servers, and that seems fine now. However, nothing has
changed on the Sun boxes, and the VIP for that service is not on any of
the linux web servers, so that doesn't explain why LVS started sending
If somehow the arp problem was causing the linux web servers to steal
traffic from the solaris servers I can accept that, but I need a
technical reason why, for when my boss comes to me for a root cause.
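
For anyone wanting to compare notes on the arp_ignore and arp_announce change mentioned above, here is a rough sketch of how those settings are usually written out as sysctl lines. The values 1 and 2 and the interface names "all" and "lo" are the ones commonly recommended for real servers that hold a VIP on a non-ARPing interface; they are assumptions for illustration, not a record of exactly what was applied on these boxes. The helper name sysctl_lines is made up for this example.

```python
# Sketch only: builds sysctl.conf-style lines for the two ARP knobs.
# The values (arp_ignore=1, arp_announce=2) are the commonly recommended
# ones for LVS real servers and are an assumption here.

ARP_SETTINGS = {"arp_ignore": 1, "arp_announce": 2}


def sysctl_lines(interfaces=("all", "lo")):
    """Return one 'net.ipv4.conf.<iface>.<key> = <value>' line per setting."""
    lines = []
    for iface in interfaces:
        for key, value in ARP_SETTINGS.items():
            lines.append(f"net.ipv4.conf.{iface}.{key} = {value}")
    return lines


if __name__ == "__main__":
    # Print the fragment so it can be reviewed before going into sysctl.conf.
    print("\n".join(sysctl_lines()))
```

The output is just text to review; nothing is written to the system, so it is safe to run on any machine.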