Before I joined, the company was evolving from a search-and-share product into a cloud file server. We were using Lucene as the datastore for metadata, but it always lagged on updates because you can't update data in real time in Lucene. We then moved the metadata to NFS, but that didn't scale for metadata calls either, as the NFS protocol was never meant to scale to Web 2.0 volumes. So someone picked Berkeley DB as the backend. The only reason not to pick any other NoSQL product was that none of them had multi-row transaction capabilities, and a relational db wasn't picked because it would have taken too long to port the code to it.

I had recently joined at the time, and then the company became a hit: we scaled from 3-4 nodes in one data centre to 60-100 nodes across multiple data centres. We are still going through that scaling phase, and believe me, it is not fun scaling Berkeley DB. Berkeley DB works fine if you can hold your dataset in memory, but as soon as you start hitting dbs of 10G or more it becomes a nightmare. The main issues we found with Berkeley DB are:
- It's difficult, or I should say impossible, to find people qualified enough to diagnose Berkeley DB crashes. We hired a consultant at a costly hourly rate, but all he would give us was theoretical knowledge, never enough detail about the actual crash.
- As it's an in-memory db, if Tomcat is hung and kill -HUP won't work, we always have to cross our fingers and do a kill -9, because sometimes we would lose data.
- Sometimes the db would get corrupt pages and we would have to do a catastrophic recovery, losing data, so as a stopgap I had to build an alternate replicated MySQL db and write some ETL code for partial recovery of the metadata.
- As Berkeley DB is an in-memory db, the only way to query the data is to write a JSP. Even simple queries from the marketing or product teams would require us to write a JSP. Updating data to fix corruption was no fun either.
- We tried hard to enable replication to another node to scale read operations, but our write rates were so high that replication would always lag; it works in development but never in production.
- Backing up Berkeley DB and rsyncing its logs to the filers would cause IO issues on the nodes.
- Log churn on Berkeley DB would also cause IO issues.
- To add an index you have to write code, and the db would become fragmented, so lately we are just shutting nodes down every week and copying the db to another folder and back to defragment it, or in the worst case dumping and reloading the db.
- Because of the issues above, we were left pinning customers to a node, so nodes would become hotspots and one bad customer would affect the other customers.
- We had to put throttles in the system because we can't serve a customer from two or more nodes due to the in-memory design.
- Very few engineers have the aptitude to dig deep into the Berkeley DB code and fix issues. It took me some time to digest, but it opened my eyes to NoSQL dbs, and I was able to apply the same concepts when we added Cassandra for another use case.
- Sometimes we would just throw more hardware at the problem, which is not cost effective in the long term.
- Indexes were another issue: you have to write code to add an index, and that means going through a whole release cycle just to get an index onto a db.
- Rubbing salt in the wound, indexes don't get created on db startup; they only get created when someone accesses the db. For a customer with 3-4M files that could take anywhere from 10-20 minutes if the server was busy, or 1 minute if it was idle, and this unpredictable performance was driving us nuts. Whenever we created an index we had to remember to run empty snapshot queries on the weekend to build it, which meant a painful exercise repeated across 60 nodes, as one node or another would hang and you would end up spending 4-5 hours fixing it. (The index sketch right after this list shows what "writing code to add an index" involves.)
- BDB locks were another pain. BDB supports page-level locks instead of row-level locks, which means more chance of deadlocks; we would get constant deadlocks when a background thread like the Trash purge Quartz job was aggressively purging data while the same customer was adding files. We had to implement application-level locks to prevent such scenarios, and that complicates the code (a sketch of that locking idea follows this list).
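To make the index complaint concrete, here is a minimal sketch of what "adding an index" means in code, using the Berkeley DB Java Edition API (the core edition's Java binding looks almost identical). The record layout and all names here are hypothetical, not our actual schema:

```java
import com.sleepycat.je.*;

// Sketch only: a secondary index over a hypothetical file-metadata db,
// keyed by customer id. In BDB an index is code, not a DDL statement.
public class CustomerIdIndex implements SecondaryKeyCreator {

    // Extract the secondary key (the customer id) out of each primary record.
    // Hypothetical layout: the first 8 bytes of the record hold the customer id.
    @Override
    public boolean createSecondaryKey(SecondaryDatabase secondary,
                                      DatabaseEntry primaryKey,
                                      DatabaseEntry primaryData,
                                      DatabaseEntry result) {
        result.setData(primaryData.getData(), 0, 8);
        return true;
    }

    public static SecondaryDatabase open(Environment env, Database primary)
            throws DatabaseException {
        SecondaryConfig cfg = new SecondaryConfig();
        cfg.setAllowCreate(true);
        cfg.setTransactional(true);
        cfg.setSortedDuplicates(true);          // many files per customer
        cfg.setKeyCreator(new CustomerIdIndex());
        // Populating on open is the lazy, build-when-the-db-is-first-touched
        // behaviour described above: the 10-20 minute stall for a 3-4M file customer.
        cfg.setAllowPopulate(true);
        return env.openSecondaryDatabase(null, "files_by_customer", primary, cfg);
    }
}
```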
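And this is roughly the shape of the application-level locking we ended up layering on top of BDB to keep the purge job and customer writes from deadlocking each other. It's a simplified sketch of the per-customer-lock idea, not our actual code, and the names are made up:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Simplified sketch of a per-customer application-level lock. Background jobs
// (e.g. the Trash purge) and user-facing writes both grab the customer's lock
// before touching Berkeley DB, so they serialize at the app layer instead of
// deadlocking on BDB page locks. Names are hypothetical.
public class CustomerLocks {
    private final ConcurrentHashMap<Long, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(long customerId) {
        return locks.computeIfAbsent(customerId, id -> new ReentrantLock());
    }

    public void withCustomerLock(long customerId, Runnable work) {
        ReentrantLock lock = lockFor(customerId);
        lock.lock();
        try {
            work.run();
        } finally {
            lock.unlock();
        }
    }
}

// usage (hypothetical):
//   customerLocks.withCustomerLock(custId, () -> trashPurger.purge(custId));
//   customerLocks.withCustomerLock(custId, () -> fileStore.addFile(custId, file));
```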
Ultimately we got tired of it and decided to move to MySQL. It took almost a year to port the code, with so many fires going on to keep scaling Berkeley DB through 20% month-over-month growth. In short, NoSQL is good and is still used at the company for many use cases, but when it comes to transactions and ad-hoc querying, MySQL is still the champion, and unless you are writing a desktop client application, in-memory dbs are a bad idea. Every day I regret that Berkeley DB was used for a server application.
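For contrast, here is the kind of ad-hoc question that used to require writing and deploying a JSP, answered with a few lines of JDBC against MySQL. The table, column, and host names are made up for illustration:

```java
import java.sql.*;

// Hypothetical ad-hoc query: how many files does customer 42 have?
// With Berkeley DB this meant writing a JSP; with MySQL it's one SQL statement.
public class AdhocQuery {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://shard-01.example.com/metadata"; // hypothetical shard
        try (Connection conn = DriverManager.getConnection(url, "report_ro", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT COUNT(*) FROM files WHERE customer_id = ?")) {
            ps.setLong(1, 42L);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                System.out.println("files for customer 42: " + rs.getLong(1));
            }
        }
    }
}
```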
Finally, the MySQL backend is going live today for some customers. Once it's live and scaling, we will start removing the throttles and begin serving a single customer from multiple nodes in the coming months. My expectation is that we will consolidate hardware and cut the number of servers in half, as the app nodes will now be sitting mostly idle. I will cover more details on the MySQL sharding, and how scaling with MySQL goes, in the coming months.
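I'll save the sharding details for the follow-up posts, but the general idea behind serving one customer from multiple app nodes is small enough to sketch. One common approach is a directory that maps each customer to a shard; the sketch below illustrates that concept, not necessarily our exact implementation, and every name in it is hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical customer -> shard directory. In practice this mapping would live
// in a small, replicated MySQL table (or a cache in front of it), so any app
// node can route a request for any customer instead of pinning customers to nodes.
public class ShardDirectory {
    private final Map<Long, String> customerToShard = new ConcurrentHashMap<>();

    public void assign(long customerId, String shardJdbcUrl) {
        customerToShard.put(customerId, shardJdbcUrl);
    }

    public String shardFor(long customerId) {
        String shard = customerToShard.get(customerId);
        if (shard == null) {
            throw new IllegalStateException("no shard assigned for customer " + customerId);
        }
        return shard;
    }
}

// usage (hypothetical):
//   directory.assign(42L, "jdbc:mysql://shard-01.example.com/metadata");
//   Connection conn = DriverManager.getConnection(directory.shardFor(42L), user, pass);
```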
Part1
Part2
Part3
Part4
Part5