CI/SSI and LVS integration (was: [CI] Re: keepalive with Cluster infrastructure health check.)

Alexandre CASSEN alexandre.cassen at
Tue May 14 14:59:52 BST 2002

Hi Aneesh,

> SSI is also a kernel patch. libcluster is the interface for the user
> application in accessing many of the cluster information.


> You mean what happens when the  application migrates to other node ?

Yes, exactly. In your CI terminology, what is the "application"? The LVS
code, the web application, the fs, or all of them together?

IMHO, if it is all of them together, this can result in some framework code
duplication. What I mean:

Introducing network load balancing into a logical network topology requires
some network-level abstraction. This means that we clearly identify the
virtual servers (VIP) and the realservers. The load balancing framework
then schedules inbound connections on the VIP to the elected realserver.

This load balancing framework can be incorporated into a global cluster
infrastructure product (like CI/SSI) only if that product respects the LVS
internal framework design. Otherwise a large part of the LVS code (for
instance the scheduling decisions) will have to be backported to the global
cluster infrastructure framework (to deal with the CI/SSI CVIP). This
backporting process can be very time consuming.

If LVS is part of a CI/SSI cluster, more than one node will run the LVS
code. Since the CI/SSI design uses the "CVIP", a VIP that roams across all
LVS nodes, the LVS connection table will be altered. One solution is to
introduce some kind of CI/SSI CVIP stickiness to a specific LVS node. This
means that all traffic will be directed to a specific LVS director. That
solves the problem: with this stickiness all connections are accepted by
the same LVS node, so scheduling decisions stay consistent and the LVS
internal design is respected.

This is a requirement since LVS is able to deal with LVS-NAT|TUN|DR.
Moreover, we can use fwmark to mark inbound traffic before processing the
scheduling decision (I don't really know the impact of using fwmark in the
CI code).

Another point: if you want failover protection for the CVIP in CI/SSI, the
CVIP takeover must (IMHO) be handled by a dedicated hot standby routing
protocol (VRRP). This really raises the question: what is the meaning of
the CVIP when used with LVS, since the CVIP must stick to a specific LVS
node?

> If i understand your point here. I have  the server/daemon  running at
> node1 and when it migrates to node2( as per the decision taken by load
> leveler ) and if the server requires persistents connection what will
> happen ?

Yes, exactly. What is the CI/SSI meaning of the "node migration process"?
This sounds like a node election/scheduling process, doesn't it? If yes, is
it possible to stick the CVIP to a specific LVS node?

An LVS cluster topology, in short "LVS load balancing", is a multiple
active/active LVS topology. This active/active functionality must (IMHO) be
handled by a specific dedicated hot standby protocol. This is why we (the
LVS team) have started to develop a robust VRRP framework embedded into

From my point of view, LVS director clustering must handle the following
topics:

* VIP takeover
* LVS director connections table synchronization.

The current development status of those 2 points is:

* VIP takeover: done using the VRRP keepalived framework. Currently IPSEC-AH
needs to be tuned, especially the IPSEC seqnum synchronization in fallback
* LVS director connections table sync: currently this part is handled by an
LVS component called syncd, which subscribes to a multicast group over
which the connection table is sent. This kernel space process can be driven
by the VRRP framework (according to a specific VRRP instance state).
Currently the LVS syncd cannot be used in an active/active env, since only
one MASTER syncd can exist sending connection info to the multicast group.
This will be enhanced I think (our time is too short :) )
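
To make the active/active limitation concrete, here is a toy model of a
one-way table sync. It is only a sketch, not the LVS kernel code: the class
names, the connection key layout and the in-memory "multicast group" are all
assumptions made for illustration.

```python
# Toy model of a one-way connection table sync (NOT the real LVS syncd).
# All names are hypothetical; the "multicast group" is a plain list of
# subscribers instead of a real multicast socket.

class MulticastGroup:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, director):
        self.subscribers.append(director)

    def send(self, sender, entry):
        # Deliver the entry to every subscriber except the sender.
        for director in self.subscribers:
            if director is not sender:
                director.receive(entry)


class Director:
    def __init__(self, name, group, role):
        self.name = name
        self.role = role          # "MASTER" sends, "BACKUP" only receives
        self.group = group
        self.conn_table = {}      # (client_ip, client_port, vip) -> realserver
        group.subscribe(self)

    def add_connection(self, key, realserver):
        self.conn_table[key] = realserver
        if self.role == "MASTER":
            self.group.send(self, (key, realserver))

    def receive(self, entry):
        key, realserver = entry
        self.conn_table[key] = realserver


group = MulticastGroup()
node1 = Director("LVS node1", group, "MASTER")
node2 = Director("LVS node2", group, "BACKUP")

node1.add_connection(("10.0.0.7", 40000, "VIP1"), "RIP2")
node2.add_connection(("10.0.0.8", 40001, "VIP2"), "RIP3")

# node2 learned node1's connection, but node1 never hears about node2's.
print(sorted(node2.conn_table))
print(sorted(node1.conn_table))
```

With a single sender, the connections accepted on the BACKUP side are never
replicated, which is the gap for an active/active setup.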

> Well In the case of already existing connection which will be doing read
> and write( If i understand correctly ) they will be carried out as
> remote operation on the socket,which means even though the application
> migrates the socket still remain at the first node(node1).( Brian
> should be able to tell more about that. Brian ? ).
> We haven't decided about what should happen to the  main socket (
> socket() ) that is bound to the address.

Ok. For me, the point that I don't understand is the CVIP point :/

Just let me sketch a simple env :

example specs: Two VIPs exposed to the world, loadbalancing using LVS on a
realserver pool.

              |     WAN area     |

     .............................................[CI/SSI cluster].....
     .                                                                .
     .                                                                .
     .     +----[VIP1]---+      +----[VIP2]---+                       .
     .     |             |      |             |                       .
     .     |  LVS node1  |      |  LVS node2  |                       .
     .     +-------------+      +-------------+                       .
     .                                                                .
     .     +----[RIP1]---+      +----[RIP2]---+     +----[RIP3]---+   .
     .     |             |      |             |     |             |   .
     .     |  app node1  |      |  app node2  |     |  app node3  |   .
     .     +-------------+      +-------------+     +-------------+   .

WAN clients will access VIP1 and VIP2, which expose a virtual service to
the world (http for example). LVS node1 will then schedule traffic to app
node1,2,3 (and the same for LVS node2).

What is the meaning of the CVIP with regard to VIP1,2? If I understand
correctly, the CVIP will move periodically between LVS node1 and node2. This
will break the LVS internal scheduling coherence because the connection
table will no longer have

In the end this means that the CVIP node assignment in the CI/SSI design is
something like a scheduling decision. This scheduling decision will disturb
the LVS scheduling decision, since the LVS scheduling code requires the VIP
to be owned by the director to guarantee scheduling performance. These 2
scheduling levels introduce a conceptual problem, which shows up clearly
with persistent connections.

For example, if LVS (node1,2) load balances SSL servers (app node1,2,3), a
connection must stick to a specific app node, otherwise the SSL session
will be broken. This stickiness holds as long as the client/app node
connection is present in the LVS conn table. If the LVS VIP moves from LVS
node1 to LVS node2, then LVS node2 will not have the entry in its
connection table, so the LVS scheduler will run again and another app node
can be elected, which breaks this persistence requirement.
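
The persistence problem can be reproduced with a very small model. This is
a sketch under assumptions: each director keeps its own table, the scheduler
is plain round robin (LVS offers several schedulers), and the names and
addresses are made up.

```python
# Toy model: each director keeps its OWN connection table, and the
# scheduler (plain round robin here) only runs on a table miss.
# Names and addresses are made up for illustration.
from itertools import cycle

class Director:
    def __init__(self, name, realservers, start=0):
        self.name = name
        self.conn_table = {}                  # client -> realserver
        # 'start' just gives each director a different round robin state.
        self._rr = cycle(realservers[start:] + realservers[:start])

    def handle(self, client):
        if client not in self.conn_table:     # table miss: schedule
            self.conn_table[client] = next(self._rr)
        return self.conn_table[client]

apps = ["app node1", "app node2", "app node3"]
lvs1 = Director("LVS node1", apps)
lvs2 = Director("LVS node2", apps, start=1)

client = "192.0.2.10"
first = lvs1.handle(client)   # new SSL session, scheduled by LVS node1
again = lvs1.handle(client)   # same director: table hit, same app node
moved = lvs2.handle(client)   # VIP moved: LVS node2 has no entry

print(first, again, moved)
```

As long as the client keeps hitting the same director the table hit keeps
it on the same app node; once the VIP lands on the other director the
scheduler runs again and may pick a different one.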

So I can see 2 solutions :

* Replicating the whole director connection table in realtime
* Stick CVIP to a specific LVS node.

=> Then the LVS VIP takeover will be handled by VRRP (for example).
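
As an illustration of what "handled by VRRP" means for VIP ownership, here
is a deliberately simplified election sketch. Real VRRP works with periodic
advertisements and timers; this only models the outcome (highest priority
among live routers wins, higher IP address as tie-break). The dict keys,
priorities and addresses are assumptions, and this is not keepalived code.

```python
# Simplified VRRP-like election (sketch only, not keepalived code).
from ipaddress import ip_address

def elect_master(routers):
    """Return the name of the router that should own the VIP.

    routers: list of dicts with hypothetical keys
             'name', 'priority', 'ip', 'alive'.
    """
    candidates = [r for r in routers if r["alive"]]
    if not candidates:
        return None
    # Highest priority wins; on equal priority the higher IP wins.
    best = max(candidates,
               key=lambda r: (r["priority"], ip_address(r["ip"])))
    return best["name"]

routers = [
    {"name": "LVS node1", "priority": 150, "ip": "192.168.0.1", "alive": True},
    {"name": "LVS node2", "priority": 100, "ip": "192.168.0.2", "alive": True},
]

owner_before = elect_master(routers)   # LVS node1 owns the VIP
routers[0]["alive"] = False            # LVS node1 fails
owner_after = elect_master(routers)    # LVS node2 takes the VIP over

print(owner_before, "->", owner_after)
```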

> right now I am thinking of using a script that will select potential LVS
> directors manually and sync the table between them using --sync option
> and failover using VRRP

Oh OK. But what will be the script's key decision for selecting a specific
LVS director? This will introduce some scheduling decisions.

This is the part that is a little obscure for me: what is the key decision
for selecting the LVS director?

> I was trying to bring a Cluster Wide IP functionality to the SSI cluster
> using LVS and VRRP( keepalive ). Also I will be looking into the initial
> work done by Kai. That means nothing is decided. We could either use LVS
> and VRRP or the code from HP's NSC for SCO unixware.

ah OK... This cluster Wide IP must depend on the takeover protocol you will
choose (VRRP, ...).

> The patch was intended to make people able to run LVS with keepalive on
> a CI cluster. Whether it forms a base of Cluster Wide IP for SSI is yet
> to be decided.

Yes, yes no problem for me, I will add your patch into keepalived, really
not a problem. I just want to know a little bit more on CI/SSI :)

> I was actually pointing at the  initialize_nodemap function. If it fails
> how am i going to inform the main line code. Does a exit() will do it.

The Keepalived code doesn't use pthread (for portability reasons); the
thread functionality is provided by a central I/O MUX. So to end a thread,
just use return() at the end of your thread function.
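
The "return ends the thread" idea can be sketched with a minimal cooperative
scheduler. This is not the keepalived API: the Master class, the function
names and the ready queue are hypothetical, and a real I/O MUX would also
wait on file descriptors and timers rather than just drain a queue.

```python
# Minimal cooperative scheduler sketch (hypothetical names, not the
# keepalived API). A "thread" is a function run once by the central
# loop; it ends by returning and re-registers work if it needs more.
from collections import deque

class Master:
    def __init__(self):
        self.ready = deque()

    def add(self, func, arg):
        self.ready.append((func, arg))

    def run(self):
        while self.ready:
            func, arg = self.ready.popleft()
            func(self, arg)

log = []

def init_node_map(master, arg):
    if not arg["cluster_available"]:
        log.append("node map init failed, VRRP not started")
        return                       # ends this thread; no exit() needed
    log.append("node map ready")
    master.add(vrrp_tick, 2)         # schedule the follow-up thread

def vrrp_tick(master, remaining):
    log.append(f"vrrp tick, {remaining} left")
    if remaining > 1:
        master.add(vrrp_tick, remaining - 1)

m = Master()
m.add(init_node_map, {"cluster_available": True})
m.run()
print(log)
```

On failure the init function simply returns without scheduling anything
else, so the rest of the program keeps running and nothing VRRP-related is
started.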

> If my initialize_node_map fails then VRRP( I use the term VRRP because
> there is another keep alive with SSI. To differentiate i guess for the
> time being VRRP is ok ? ) cannot run with CI .

(The VRRP name is ok for me; some people confuse VRRP and vrrpd, which is
why I use keepalived.)

Ok for init_node_map.

> I am not aware of one. But it will be good to have a mapping  of node
> number with  node IP address. For example in a configuration. with node1
> having two IP address one for cluster interconnect and other for
> external network, it will be good to have an interface like
> clusternode_t  cluster_node_num( uint32_t  addr_ip).
> On both the IP it will return me node1.

Hmm, yes, I agree. As a part of your libcluster.
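
A rough sketch of such a lookup, in Python rather than C for brevity. Only
the idea of cluster_node_num(addr_ip) comes from the discussion above; the
address table, the helper ip_to_u32 and the None return for unknown
addresses are assumptions, and libcluster itself may look quite different.

```python
# Sketch of an IP -> node number lookup. The table contents and helper
# names are illustrative; only the idea of cluster_node_num(addr_ip)
# comes from the discussion.
import socket
import struct

NODE_ADDRESSES = {
    1: ["10.0.0.1", "192.0.2.1"],    # interconnect IP, external IP
    2: ["10.0.0.2", "192.0.2.2"],
}

def ip_to_u32(addr):
    """Dotted quad -> 32-bit integer (network byte order value)."""
    return struct.unpack("!I", socket.inet_aton(addr))[0]

# Reverse index built once: every address of a node maps to that node.
_IP_TO_NODE = {ip_to_u32(ip): node
               for node, ips in NODE_ADDRESSES.items() for ip in ips}

def cluster_node_num(addr_ip):
    """Return the node number owning addr_ip (uint32), or None."""
    return _IP_TO_NODE.get(addr_ip)

print(cluster_node_num(ip_to_u32("10.0.0.1")),
      cluster_node_num(ip_to_u32("192.0.2.1")))
```

Both the interconnect address and the external address of a node resolve to
the same node number.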

> The CI/SSI also allows an application to know about the node up/down
> events by registering for SIGCLUSTER signal. But using SIGCLUSTER
> doesn't fit well into the existing VRRP( keepalive) model with IO
> multiplexer. In fact first i tried to do it with signal . But later
> scrapped it because it doesn't fit well.

Oh OK.

I was thinking of an AF_CLUSTER family. Creating a socket with this family
would open a kernel channel to the CI/SSI core code. The CI/SSI code could
then broadcast kernel messages on node status, received in userspace to
update the node status => no polling, so less overhead.
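
To show the event-driven pattern, here is a small sketch. AF_CLUSTER does
not exist, so a socketpair stands in for the proposed kernel channel, and
the text message format ("<node> <UP|DOWN>") is purely an assumption.

```python
# AF_CLUSTER does not exist; a socketpair stands in for the proposed
# kernel channel so the event-driven pattern can be shown end to end.
import selectors
import socket

kernel_side, user_side = socket.socketpair()
node_status = {}

def on_cluster_event(sock):
    # Hypothetical message format: "<node number> <UP|DOWN>\n"
    for line in sock.recv(4096).decode().splitlines():
        node, state = line.split()
        node_status[int(node)] = state

sel = selectors.DefaultSelector()
sel.register(user_side, selectors.EVENT_READ, on_cluster_event)

# The "kernel" broadcasts two status changes.
kernel_side.sendall(b"2 UP\n3 DOWN\n")

# The I/O multiplexer wakes up only when there is something to read.
for key, _ in sel.select(timeout=1):
    key.data(key.fileobj)

sel.close()
kernel_side.close()
user_side.close()
print(node_status)
```

Because the descriptor is just another entry in the multiplexer, it would
sit next to the other sockets of the daemon without any polling loop.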

Best regards,
