Scalability problems are interesting. This weekend we deployed a new release and suddenly started getting LDAP alerts from one data centre. All fingers pointed to something new in the code making more LDAP queries, but I didn't think we had made any LDAP code changes; in fact, we were moving away from LDAP to MySQL. We use 15G of memcached, and looking at the memcached stats I found that we were seeing evictions. We had bumped memory from 12G to 15G in the other data centre during the same release, and that data centre was not having issues. So this problem was interesting.
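For anyone wanting to do the same check, here is a rough sketch of reading the eviction counter out of memcached's `stats` reply. The reply is plain text made of `STAT <name> <value>` lines ending with `END`. The function names and the sample replies are made up for illustration; this is not the tooling we actually used.

```python
def parse_memcached_stats(raw: str) -> dict[str, str]:
    """Parse the text reply of memcached's `stats` command into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        # Keep only "STAT <name> <value>" lines; this skips the final "END".
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def eviction_delta(before: dict[str, str], after: dict[str, str]) -> int:
    """Evictions that happened between two snapshots of the same server."""
    return int(after["evictions"]) - int(before["evictions"])


# Made-up sample replies, for illustration only.
SAMPLE_BEFORE = "STAT curr_items 1000\r\nSTAT evictions 10\r\nEND\r\n"
SAMPLE_AFTER = "STAT curr_items 1000\r\nSTAT evictions 250\r\nEND\r\n"
```

The `evictions` counter is cumulative since the server started, so comparing two snapshots taken a few minutes apart says more than a single reading.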
At last we found out that our new release had made 90% of the memcached data stale. We store our objects as JSON in memcached, and removing deprecated fields from the objects made the JSON already in the cache unusable. The bug in the code was that the old JSON entries were still sitting in the cache, consuming memory. As new JSON was pumped into memcached, old and new entries fought for memory, and because memcached's LRU works per slab class, evictions took out good data along with the stale data. This in turn caused a thundering herd problem in LDAP.
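One way to keep unusable JSON from lingering is to tag every cached entry with a schema version and delete anything that does not match on read. The sketch below is only an illustration of that idea, not our actual fix: `SCHEMA_VERSION`, `DictCache`, `cache_put` and `cache_get` are hypothetical names, and `DictCache` is a plain dict standing in for a real memcached client.

```python
import json

SCHEMA_VERSION = 2  # hypothetical: bump whenever the cached JSON shape changes


class DictCache:
    """Stand-in for a memcached client (get/set/delete on a plain dict)."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)


def cache_put(cache, key, obj):
    """Store the object wrapped with the schema version it was written under."""
    cache.set(key, json.dumps({"v": SCHEMA_VERSION, "obj": obj}))


def cache_get(cache, key, load_from_source):
    """Return the cached object, or reload it if the entry is missing or stale."""
    raw = cache.get(key)
    if raw is not None:
        try:
            entry = json.loads(raw)
        except ValueError:
            entry = None
        if isinstance(entry, dict) and entry.get("v") == SCHEMA_VERSION:
            return entry["obj"]
        # Stale or unreadable: free the memory now, even if the reload below fails.
        cache.delete(key)
    obj = load_from_source(key)
    cache_put(cache, key, obj)
    return obj
```

This only cleans up entries that get read again. Stale keys that nobody asks for stay in memory until they expire or are evicted, so setting an expiry time on cached entries would be a sensible companion to it.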
LDAP is nice, but when it comes to scaling beyond 2-3 million objects it sucks: internally OpenLDAP uses a BDB backend, and after a point both reads and writes start taking a long time no matter what you do. The good news is that in the coming month we are moving from LDAP to MySQL, which should make us scalable again. I will write a blog post on how to migrate millions of customers from LDAP to MySQL without them noticing. The analogy is changing the tires of race cars and sending them back onto the track without disrupting the ongoing race.