Skip to main content

From NOSQL to MySql (BDB to mysql) - Part1

Before I joined the company it was evolving from a search and share company to a cloud file server and we were using lucene as a datastore for metadata but it was always lagging in updates as  you cant update data realtime in lucene. Then we were going to NFS for metadata but it was not scaling for metadata calls as NFS protocol was never meant to scale to web2.0 volume. So someone picked up Berkeley DB as a backend. The only reason to not pick any other NoSQL products was that none of them had multi row transactions capabilities and relational db was not picked as it would take time to port code to relational db. I had recently joined at that time and then company became a hit and we scaled from 3-4 nodes in one data centre to 60-100 nodes across multiple datacentre. We are still going through the scaling phase and believe me it is not fun scaling Berkeley db.  Berkeley db works fine if you can hold your dataset in memory but as soon as you start hitting dbs of size 10G or more it becomes a nightmare. The main issues we find with Berkeley db are:

  1. Its difficult or I should say impossible to find people qualified enough to diagnose Berkeley db crashes. We hired a consultant at some costly hourly rates but all he would give is theoretical knowledge and not enough details of the crash.
  2. As its an in memory db if tomcat is hung and kill -hup wont work then we always have to  cross our fingers doing a kill -9 as sometimes we would lose data.
  3. Sometimes DB would get corrupt pages and we would have to do a catastrophic recovery and we would lose data so temporarily I had to build an alternate replicated mysql db writing some ETL code for partial recovery of metadata.
  4. As Berkeley db is an in memory db  the only way to query data is to write a JSP. Simple queries from marketing or products team would require us to write a jsp. Updating data to fix corrupt data was no fun either.
  5. We tried hard to enable replication to other node to scale read operations but our write rates were so high that replication would always lag and it works in development but never in production.
  6. Backing up Berkeley db and rsync of logs to filers would cause IO issues on nodes.
  7. Log churn on Berkeley db would cause IO issues.
  8. To add an index you have to write code and db would become fragmented so lately we are just shutting down nodes every week and copying db to another folder and moving it back to defrag it or in worse case we have to dump/reload the db.
  9. So due to above issues we were left with pinning customers to a node and nodes would become hotspots and one bad customer would affect other customers.
  10. We had to put throttles in system as we cant serve a customer from two or more nodes due to in memory issues.
  11. Very few engineers would have the aptitude to dig deep into Berkely db code and fix issues, it took me some time to digest but it opened my vision to NOSQL dbs and I was able to apply same concepts when we added cassandra to some other usecase.
  12. Sometimes we throw more hardware to solve the issue and its not cost effective in longer term.
  13. Indexes was another issue, you have to write code to add an index and that means you will have to go thought whole release cycle to get index on a db. 
  14. Adding salt to injury was that indexes wont get created on db startup, they would only get created when someone accesses a db, for a customer with 3-4M files this could take anywhere from 10-20 mins if the server is busy or 1 min if server is idle. This unpredictable performance of bdb was making us nuts.  Whenever we would create an index we had to remember to run empty snapshot queries to generate the indexes on weekend and that would mean painful exercise to repeat on 60 nodes as one or the other node would get hung and you would end up spending anywhere from 4-5 hours fixing it.
  15. BDB Locks were another pain, bdb support page level locks instead of row level locks and that means more chance of deadlocks, we would get constant deadlocks if a background thread like Trash purge Quartz job is aggressively trying to purge data and same customer is trying to add files at same time. We had to implement application level locks to prevent such scenarios and that complicates the code.

Ultimately we became tired and decided to move to Mysql and it took almost an year to port code to mysql with so many fires going on to scale Berkely db due to 20% growth month over month.  But in short NOSQL is good and is still used at company for many usecases but when it comes to  transactions/adhoc querying Mysql is still the champion and unless you are writing a desktop client application, inmemory dbs are bad idea. I despise Berkeley db everyday to be used a server application.

Finally Mysql backend is going live today to some customers and once its live and scaling we would start removing throttles and start serving one customer from multiple nodes in coming months. My anticipation is that we would consolidate hardware and cut the no of servers in half as app nodes will now be sitting mostly idle. I would cover more details on the Mysql sharding and how scaling with Mysql goes in coming months.

Part1
Part2
Part3
Part4
Part5

Comments

Popular posts from this blog

Haproxy and tomcat JSESSIONID

One of the biggest problems I have been trying to solve at our startup is to put our tomcat nodes in HA mode. Right now if a customer comes, he lands on to a node and remains there forever. This has two major issues: 1) We have to overprovision each node with ability to handle worse case capacity. 2) If two or three high profile customers lands on to same node then we need to move them manually. 3) We need to cut over new nodes and we already have over 100+ nodes.  Its a pain managing these nodes and I waste lot of my time in chasing node specific issues. I loath when I know I have to chase this env issue. I really hate human intervention as if it were up to me I would just automate thing and just enjoy the fruits of automation and spend quality time on major issues rather than mundane task,call me lazy but thats a good quality. So Finally now I am at a stage where I can put nodes behing HAProxy in QA env. today we were testing the HA config and first problem I immediat...

Spring 3.2 quartz 2.1 Jobs added with no trigger must be durable.

I am trying to enable HA on nodes and in that process I found that in a two test node setup a job that has a frequency of 10 sec was running into deadlock. So I tried upgrading from Quartz 1.8 to 2.1 by following the migration guide but I ran into an exception that says "Jobs added with no trigger must be durable.". After looking into spring and Quartz code I figured out that now Quartz is more strict and earlier the scheduler.addJob had a replace parameter which if passed to true would skip the durable check, in latest quartz this is fixed but spring hasnt caught up to this. So what do you do, well I jsut inherited the factory and set durability to true and use that public class DurableJobDetailFactoryBean extends JobDetailFactoryBean {     public DurableJobDetailFactoryBean() {         setDurability(true);     } } and used this instead of JobDetailFactoryBean in the spring bean definition     <bean i...

Adding Jitter to cache layer

Thundering herd is an issue common to webapp that rely on heavy caching where if lots of items expire at the same time due to a server restart or temporal event, then suddenly lots of calls will go to database at same time. This can even bring down the database in extreme cases. I wont go into much detail but the app need to do two things solve this issue. 1) Add consistent hashing to cache layer : This way when a memcache server is added/removed from the pool, entire cache is not invalidated.  We use memcahe from both python and Java layer and I still have to find a consistent caching solution that is portable across both languages. hash_ring and spymemcached both use different points for server so need to read/test more. 2) Add a jitter to cache or randomise the expiry time: We expire long term cache  records every 8 hours after that key was added and short term cache expiry is 2 hours. As our customers usually comes to work in morning and access the cloud file server it ...