firewall marks + tunneling + persistence = ERR! state
Jaroslav Libák
jarol1 at seznam.cz
Tue Nov 28 20:32:09 GMT 2006
Hello
I have 2 directors and 2 real servers (with more than 4 real servers planned for the future), running heartbeat 2.0.7 + LVS (built into the 2.6.18.3 kernel) + ldirectord, all compiled from the latest sources because packages aren't available for Oracle Enterprise Linux 4u4.
I tried NAT/DR/TUN and got them all working without problems (the biggest problem was getting heartbeat 2.x set up). I decided to use tunneling because it seems to be very fast and the real servers don't have to be on the same network. I'm using firewall marks and persistence at the same time. Everything seems to be working correctly (connections really are persistent), except that ipvsadm -l -c shows weird data.
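For readers following along, a setup of the kind described above could be sketched roughly as follows. Only tunneling, firewall marks and the 600 second persistence come from the post; the VIP, the real-server addresses, the mark value 1, port 80 and the rr scheduler are placeholders chosen for illustration. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: VIP, real-server addresses, mark value, port and scheduler
# are placeholders, not values taken from the original post.
VIP=192.0.2.10
MARK=1
REALSERVERS="198.51.100.11 198.51.100.12"

# Print each command by default; set DRY_RUN=0 to really run it (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$@"; fi
}

lvs_setup() {
    # Mark HTTP traffic to the VIP so LVS can match it by firewall mark.
    run iptables -t mangle -A PREROUTING -d "$VIP" -p tcp --dport 80 \
        -j MARK --set-mark "$MARK"
    # Virtual service keyed on the mark, with 600 s persistence.
    run ipvsadm -A -f "$MARK" -s rr -p 600
    # Real servers reached through IP-IP tunneling (-i).
    for rs in $REALSERVERS; do
        run ipvsadm -a -f "$MARK" -r "$rs" -i
    done
}

lvs_setup
```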
When I run it, I get some connections in the ERR! state. Persistence is set to 600 seconds (10 minutes), and after that these connections disappear. Without persistence there are no such connections, and if I don't use firewall marks they aren't there either. Instead, without firewall marks there are "NONE" connections, which from what I have read LVS uses to handle persistence. These "connections" resemble my ERR! connections in this sense: after they disappear, the client can be routed to a different real server.
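To see how many entries sit in each state, the connection listing can be summarised with a small filter. This is a sketch that assumes the usual layout of the listing: two header lines, then one entry per line with the state in the third column. The function name count_states is made up for this example.

```shell
#!/bin/sh
# Count connection entries per state from `ipvsadm -L -c -n` style output.
# Assumes two header lines and the state in the third column.
count_states() {
    awk 'NR > 2 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on a live director (needs root):
#   ipvsadm -L -c -n | count_states
```

Running it once with persistence enabled and once without should make it easy to compare the ERR! and NONE counts.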
Could anyone confirm that the ERR! state is harmless in this case? I suspect it happens because firewall mark support was added to LVS later and ipvsadm wasn't updated to handle it properly, or because somebody forgot to set the connection state to "NONE" in the C code when firewall marks and persistence are used together.
Another issue is master-backup synchronization. Both directors are running in master-slave mode at the same time (in heartbeat terms they are set up as a symmetric cluster with resource stickiness). When I click refresh in Firefox several times while viewing a load-balanced page, I get a FIN_WAIT connection for every refresh. So I used ipvsadm to set the tcpfin timeout to 15 seconds to get rid of them quickly. Is this OK, by the way? It was about 2 minutes before, which I think is far too long. What is worse, I get an "established" connection on the slave for every refresh. I have read that this is due to a simplification in the synchronization code. Unfortunately, these "established" connections on the slave have a much longer timeout (2 minutes). I'm worried that this could cause the slave director to crash sooner than the master during a DoS attack: the master would remove FIN_WAIT entries quite quickly, but the slave wouldn't. I have put all 3 defence strategies into auto mode just in case.
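The tuning described above could look roughly like the sketch below. The 15 second value is from the post. The sketch assumes the three defence strategies are the drop_entry, drop_packet and secure_tcp sysctls under net.ipv4.vs, where the value 1 selects automatic mode. As before, commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: the 15 s tcpfin value is from the post; the rest is illustrative.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "$@"; fi
}

tune_director() {
    # --set takes three values in seconds: tcp tcpfin udp.
    # Per ipvsadm(8), a value of 0 leaves that timeout unchanged.
    run ipvsadm --set 0 15 0
    # Value 1 puts each defence strategy into automatic mode.
    for knob in drop_entry drop_packet secure_tcp; do
        run sysctl -w "net.ipv4.vs.$knob=1"
    done
}

tune_director

# To check the resulting timeouts afterwards (as root):
#   ipvsadm -L --timeout
```

These settings are local to the director they are run on, so the same commands would have to be applied on both machines.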
I'm using a hash table size of 2^20 (which doesn't limit the maximum number of entries in it; it just sets the number of rows, and each row then holds a linked list). Does this cause any slowdown in LVS? I hope the code doesn't iterate over all the table entries.
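With a chained hash table of this kind, a lookup would normally hash the key to one row and walk only that row's list, so the cost depends on the number of entries per row rather than on the number of rows. The arithmetic below gives a rough feel for the trade-off. It assumes each row is a two-pointer list head (8 bytes on a 32-bit kernel), and the connection count is a made-up figure, not a measurement.

```shell
#!/bin/sh
# Rough numbers for a 2^20-row table. ROW_BYTES assumes a two-pointer list
# head on a 32-bit kernel; CONNS is a hypothetical number of entries.
TAB_BITS=20
ROWS=$(( 1 << TAB_BITS ))
ROW_BYTES=8
CONNS=4000000

# Memory taken by the empty table itself, in KiB.
table_kib() { echo $(( ROWS * ROW_BYTES / 1024 )); }
# Average number of entries per row, rounded up.
avg_chain() { echo $(( (CONNS + ROWS - 1) / ROWS )); }

echo "rows=$ROWS table=$(table_kib)KiB avg_chain=$(avg_chain)"
```

Under these assumptions a larger table costs a fixed amount of memory up front and keeps the per-row lists short.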
Jaro