LVS performance bug

Roberto Nibali ratz at drugphish.ch
Fri Mar 16 08:31:45 GMT 2007


Hi Graeme,

Thanks for the analysis, it's very much appreciated, especially since most 
of us here do not have much time left at the moment.

>> So the only thing I see shooting up higher in memory used is
>> buffers/cache used seems to grow. But in the slabinfo the ip_vs_conn
>> active objects grows fast. I watched it grow during the test from 39K
>> objects to over 2 million objects. Maybe something isn't being reset or
>> returned to the pool. We are running the OPS patch(one packet
>> scheduling) because we are using LVS for the udp service DNS. I'm sure
>> it treats connections differently than the regularly hashed connections
>> thing. 
> 
> Aside from OPS, is this a relatively stock kernel (or a distributed
> one); ie. not custom compiled by you? I'm going to have a pitch at
> something slightly out of my normal range here... I'm wondering if the
> conns/sec * conn time in sec is greater than the default connection
> table size - the error you see is from ip_vs_conn.c:

Unless I'm misinterpreting you, your statement

conns/sec * conn time in sec > default connection table size

is almost always true; however, that's why each bucket is a linked list:

struct ip_vs_conn entries go into the buckets, distributed by a simple 
Jenkins hash function:

Buckets (# == hash table size := n)
[entry 1  ] --> entry 1.1
[entry 2  ] --> entry 2.1 --> entry 2.2 --> entry 2.3
[entry 3  ] --> entry 3.1 --> entry 3.2
[...]
[entry n-1]
[entry n  ] --> entry n.1 --> entry n.2
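
A minimal userspace sketch of the bucket layout above, to show that
inserting more entries than there are buckets only lengthens the chains.
Everything here is an assumption for illustration: struct conn is a
hypothetical stand-in for struct ip_vs_conn, the hash is a toy
multiplicative hash rather than the Jenkins hash, and malloc() stands in
for the slab allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TAB_BITS 12                     /* default mentioned in this thread */
#define TAB_SIZE (1u << TAB_BITS)       /* 4096 buckets */

struct conn {                           /* hypothetical stand-in for struct ip_vs_conn */
    uint32_t key;
    struct conn *next;
};

static struct conn *tab[TAB_SIZE];      /* bucket heads, all empty at start */

/* Toy multiplicative hash (NOT the Jenkins hash); result is in [0, TAB_SIZE) */
static unsigned int bucket_of(uint32_t key)
{
    return (uint32_t)(key * 2654435761u) >> (32 - TAB_BITS);
}

/* Insert at the head of the bucket's chain; 0 on success, -1 if out of memory */
static int conn_insert(uint32_t key)
{
    struct conn *cp = malloc(sizeof(*cp));
    if (cp == NULL)
        return -1;                      /* the only hard failure is the allocator */
    unsigned int b = bucket_of(key);
    cp->key = key;
    cp->next = tab[b];
    tab[b] = cp;
    return 0;
}

/* Insert keys 0..n-1 and return how many insertions succeeded */
static size_t fill(size_t n)
{
    size_t ok = 0;
    for (size_t i = 0; i < n; i++)
        if (conn_insert((uint32_t)i) == 0)
            ok++;
    return ok;
}

/* Total number of entries across all chains */
static size_t conn_count(void)
{
    size_t total = 0;
    for (unsigned int b = 0; b < TAB_SIZE; b++)
        for (struct conn *cp = tab[b]; cp != NULL; cp = cp->next)
            total++;
    return total;
}

/* Length of the longest chain, i.e. the worst-case lookup walk */
static size_t longest_chain(void)
{
    size_t longest = 0;
    for (unsigned int b = 0; b < TAB_SIZE; b++) {
        size_t len = 0;
        for (struct conn *cp = tab[b]; cp != NULL; cp = cp->next)
            len++;
        if (len > longest)
            longest = len;
    }
    return longest;
}
```

Calling fill(3 * TAB_SIZE) stores three times as many entries as there
are buckets; nothing is rejected, the chains just get longer.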

>   cp = kmem_cache_alloc(ip_vs_conn_cachep, GFP_ATOMIC);
>   if (cp == NULL) {
>           IP_VS_ERR_RL("ip_vs_conn_new: no memory available.\n");
>           return NULL;
>   }

If we can no longer allocate space (sizeof(struct ip_vs_conn), HW cache 
aligned) from the cache to which ip_vs_conn_cachep points using 
GFP_ATOMIC, we bail out.

> in turn, ip_vs_conn_cachep is defined inside ip_vs_conn_init:
> 
>   /*
>    * Allocate the connection hash table and initialize its list heads
>    */
>   ip_vs_conn_tab = vmalloc(IP_VS_CONN_TAB_SIZE*sizeof(struct
> list_head));
>   if (!ip_vs_conn_tab)
>           return -ENOMEM;

This is the table memory.

>   /* Allocate ip_vs_conn slab cache */
>   ip_vs_conn_cachep = kmem_cache_create("ip_vs_conn",
>                                         sizeof(struct ip_vs_conn), 0,
>                                         SLAB_HWCACHE_ALIGN, NULL, NULL);

This is the pointer to the slab cache from which the table's entries are 
allocated.

>   if (!ip_vs_conn_cachep) {
>           vfree(ip_vs_conn_tab);
>           return -ENOMEM;
>   }

If there wasn't enough space (in our case 256 bytes) to get the initial 
slab object for struct ip_vs_conn, we will of course also free the table 
space allocated, since there's no chance in hell that even one 
connection can be inserted into the lookup table. This really should not 
happen on any box, unless there's been a serious memory leak in the 
kernel previously.
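
For scale, a rough estimate of the slab memory involved, assuming the
256-byte object size mentioned above and the object counts reported from
slabinfo earlier in the thread. Slab bookkeeping overhead is ignored, and
the helper name is made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define CONN_OBJ_BYTES 256ul    /* object size quoted in this thread */

/* Bytes held by n live connection objects, ignoring slab overhead */
static unsigned long conn_slab_bytes(unsigned long n)
{
    return n * CONN_OBJ_BYTES;
}
```

With 39,000 objects this comes to about 10 MB; with 2 million objects it
is 512,000,000 bytes, roughly 488 MiB, all of which has to come from
kernel memory.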

> The ip_vs_conn_tab is therefore sized according to IP_VS_CONN_TAB_SIZE,
> which is set in the compile process and defaults to 12. This gives a
> table size of 4096 (2^12). If you hit your server with a *very* high

Correct.

> connection rate (as in busy DNS) then you're going to exhaust your
> connection table in no time, especially if the DNS servers take a little

Not quite, since the table entries are linked lists. The average chain 
length per lookup gets higher and will kill your performance when you 
have a slow or small L1/L2 cache; however, there's nothing wrong with 
having a smaller table. That said, I believe nowadays the default size 
in the kernel configuration could be increased to 2^16 entries.

> longer to respond when loaded (in this case you get a variant of the
> "thundering herd" problem; namely that when responses start to take
> longer, you get more requests).
> 
> I'd try recompiling with IP_VS_TAB_BITS set to something higher.

What this will do is shorten the average lookup (and increase the 
initially allocated static memory for the table), since the linked 
lists will be shorter.
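
The trade-off can be put in numbers with a small sketch. It assumes an
evenly distributing hash, and that each bucket head is two pointers wide
(as struct list_head is); the function names are made up for this
example.

```c
#include <assert.h>
#include <stddef.h>

/* Average entries per bucket for a table of 2^bits buckets */
static double avg_chain_len(unsigned long entries, unsigned int bits)
{
    return (double)entries / (double)(1ul << bits);
}

/* Static memory for the bucket heads: two pointers per bucket */
static size_t table_bytes(unsigned int bits)
{
    return ((size_t)1 << bits) * 2 * sizeof(void *);
}
```

For the 2 million entries reported in this thread, 12 bits gives an
average chain of about 488 entries, while 16 bits gives about 30.5. The
price is a bucket array 16 times larger, which on a 64-bit box means
going from 64 KiB to 1 MiB.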

> I'd also try not using OPS, to see whether or not that in itself is the
> problem *or* if the more straightforward schedulers exhaust the
> connection table before OPS does.

I have already forgotten how OPS works :). But yes, this could be an 
option, as well.

Take care and best regards,
Roberto Nibali, ratz

ps: Hope all is fine with your new family member and I reckon your 
memory is pretty much exhausted too at the moment :)
-- 
echo 
'[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
