Realserver failover problem using ssl and tomcat

Jason Downing jasondowning at hytechscales.com
Wed Jun 28 02:30:33 BST 2006


Setup: 
  a. Single Linux director with a VIP of 192.168.0.240 and a RIP of 192.168.0.16
  b. Two realservers with RIPs of 192.168.0.14 and 192.168.0.15, called realserver1 and realserver2 respectively
  c. Three computers in total, using LVS-DR
  d. The realservers run Tomcat with SSL. Each realserver has a copy of the SSL certificate; the director does not have one. Sessions are managed with a Tomcat cluster.
  e. The director runs ldirectord. There is no heartbeat/director failover for the purposes of this problem.
  f. ldirectord uses a negotiate page served by Tomcat to monitor realserver health
  g. Debian Sarge (3.1, current version) with a 2.4.27 kernel. uname -r output: 2.4.27-2-386. All three machines run the same OS.
  h. If I connect three client computers to the server farm, the system works well and the load is balanced.
  i. The load balancer can move a client from one server to another mid-session, and this works fine too.
The problem: If I disconnect realserver1 by pulling out its ethernet cable, clients connected to realserver2 are OK, but clients connected to realserver1 are not. By "OK" I mean that a client logged into my site (connected to a Tomcat server) can click another link, one that requires the session to remain valid, and the page loads fine. By "not OK" I mean that when a client that was connected to realserver1 clicks the same link, the new page does not load. BUT: if the client clicks that same link again, the page does load fine.

Important: A second click loads the page. If I remove realserver1 and then wait one minute, the failover is perfect and the page loads from the new server on the first click. It's only a problem in the first 45 seconds or so. I am running ldirectord in debug mode, and I can see on the screen that it detects the missing server within 4 seconds and then issues the ipvsadm commands to remove it from the pool. If I run ipvsadm -L -n ten seconds after removing realserver1, realserver1 is already gone from the server pool. So ipvs has been told that the server is down, but it still routes packets to it for another 30 seconds or so.

I have run tcpdump on realserver2: the first click does not arrive there, but the second click does. I think ipvs is routing the first packet incorrectly (presumably still to realserver1), and that it takes some 30 seconds to act on the ipvsadm command that takes realserver1 out of the pool.
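The checks described above can be scripted roughly as follows. This is a minimal sketch, not part of the original setup: the function names rip_in_pool and count_conn_entries are hypothetical, and it assumes the installed ipvsadm supports -c (--connection) to list the connection table. Whether leftover connection entries explain the delay is a guess that this would help test, not something established here.

```shell
#!/bin/sh
# Sketch only: both helpers read ipvsadm output on stdin, so they can be
# tried without touching the running director.

# Succeeds if the realserver still appears as a destination line
# ("  -> RIP:port ...") in `ipvsadm -L -n` output.
rip_in_pool() {
    # $1 = realserver IP, e.g. 192.168.0.14
    grep -F -q -- "-> $1:"
}

# Prints how many lines of `ipvsadm -L -n -c` output have the realserver
# in the last (destination) column.
count_conn_entries() {
    # $1 = realserver IP
    awk -v rip="$1" 'index($NF, rip ":") == 1 { n++ } END { print n + 0 }'
}

# Possible use on the director (needs root):
#   ipvsadm -L -n    | rip_in_pool 192.168.0.14 && echo "still in pool"
#   ipvsadm -L -n -c | count_conn_entries 192.168.0.14
```

If the second command still reports entries for 192.168.0.14 after the first one says the server has left the pool, that would be consistent with existing connections continuing to be sent to the removed realserver.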

Is this normal? Is there any kind of setting I can change to make ipvs take notice of the ipvsadm commands more quickly?

I tried reducing the TCP connection timeout to 5 seconds. This made round robin swap the client from one server to another much more often (which is fine), but then when I disconnect realserver1, clients connected to realserver2 also take 2 clicks to work, as if they had been scheduled to move to realserver1 just as it was removed. ipvs still takes 45 seconds to register that realserver1 is gone and stop routing packets to it.
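The post does not say how the timeout was reduced. One way to do it, assuming the installed ipvsadm supports --set, is sketched below. The wrapper name set_ipvs_tcp_timeout and the IPVSADM override variable are hypothetical conveniences for a dry run, not part of ipvsadm.

```shell
#!/bin/sh
# Sketch only. Set IPVSADM=echo to print the arguments instead of
# changing anything on the director.
set_ipvs_tcp_timeout() {
    # $1 = TCP session timeout in seconds.
    # --set takes three values: tcp, tcpfin, udp. A value of 0 keeps the
    # current timeout for that entry.
    "${IPVSADM:-ipvsadm}" --set "$1" 0 0
}

# Possible use on the director (needs root):
#   set_ipvs_tcp_timeout 5
```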

If I stop Tomcat using /etc/init.d/tomcat stop rather than pulling out the network cable, the problem does not occur. I wonder whether it has to do with my Tomcat sessions, but that would not explain why the packet does not arrive at the working realserver.

ldirectord.cf:

checktimeout=3
negotiatetimeout=3
checkinterval=3
autoreload=no
logfile="/var/log/ldirectord.log"
quiescent=no
#Our custom service on 10004 also running, and taken offline when tomcat is taken offline
virtual=192.168.0.240:10004
        real=192.168.0.14:10004 gate 1 "HeartbeatTestPages/Test", "Test Page"
        real=192.168.0.15:10004 gate 1 "HeartbeatTestPages/Test", "Test Page"
        service=http
        scheduler=rr
        protocol=tcp
        checktype=negotiate
        checkport=8080
virtual=192.168.0.240:443
        real=192.168.0.14:443 gate 1 "HeartbeatTestPages/Test", "Test Page"
        real=192.168.0.15:443 gate 1 "HeartbeatTestPages/Test", "Test Page"
        service=http
        scheduler=rr
        protocol=tcp
        checktype=negotiate
        checkport=8080

Note that I do the negotiate check on port 8080. Otherwise, when realserver1 is disconnected, ldirectord pauses (no logs on screen) for a minute or so and only then issues the ipvsadm commands to take realserver1 out of the pool.
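For reference, the checkinterval and negotiatetimeout values in ldirectord.cf above give a rough upper bound on how long ldirectord should need to notice a dead realserver. The sketch below simply adds the two. Treating the sum as the worst case is a simplification made here, not something taken from the ldirectord documentation.

```shell
#!/bin/sh
# Values copied from ldirectord.cf above.
checkinterval=3      # seconds between health checks
negotiatetimeout=3   # seconds before a negotiate check is declared failed

# Assumed worst case: the server dies just after a successful check, so
# detection waits one full interval plus one full timeout.
worst_case=$((checkinterval + negotiatetimeout))
echo "$worst_case"   # prints 6
```

A bound of about 6 seconds is in line with the 4 seconds observed in debug mode, which suggests the remaining 30 to 45 seconds are spent somewhere other than in ldirectord's health check.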

Output of iptables --list:
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

