Skip to main content

From NOSQL to MySql (BDB to mysql) - Part1

Before I joined the company it was evolving from a search and share company to a cloud file server and we were using lucene as a datastore for metadata but it was always lagging in updates as  you cant update data realtime in lucene. Then we were going to NFS for metadata but it was not scaling for metadata calls as NFS protocol was never meant to scale to web2.0 volume. So someone picked up Berkeley DB as a backend. The only reason to not pick any other NoSQL products was that none of them had multi row transactions capabilities and relational db was not picked as it would take time to port code to relational db. I had recently joined at that time and then company became a hit and we scaled from 3-4 nodes in one data centre to 60-100 nodes across multiple datacentre. We are still going through the scaling phase and believe me it is not fun scaling Berkeley db.  Berkeley db works fine if you can hold your dataset in memory but as soon as you start hitting dbs of size 10G or more it becomes a nightmare. The main issues we find with Berkeley db are:

  1. Its difficult or I should say impossible to find people qualified enough to diagnose Berkeley db crashes. We hired a consultant at some costly hourly rates but all he would give is theoretical knowledge and not enough details of the crash.
  2. As its an in memory db if tomcat is hung and kill -hup wont work then we always have to  cross our fingers doing a kill -9 as sometimes we would lose data.
  3. Sometimes DB would get corrupt pages and we would have to do a catastrophic recovery and we would lose data so temporarily I had to build an alternate replicated mysql db writing some ETL code for partial recovery of metadata.
  4. As Berkeley db is an in memory db  the only way to query data is to write a JSP. Simple queries from marketing or products team would require us to write a jsp. Updating data to fix corrupt data was no fun either.
  5. We tried hard to enable replication to other node to scale read operations but our write rates were so high that replication would always lag and it works in development but never in production.
  6. Backing up Berkeley db and rsync of logs to filers would cause IO issues on nodes.
  7. Log churn on Berkeley db would cause IO issues.
  8. To add an index you have to write code and db would become fragmented so lately we are just shutting down nodes every week and copying db to another folder and moving it back to defrag it or in worse case we have to dump/reload the db.
  9. So due to above issues we were left with pinning customers to a node and nodes would become hotspots and one bad customer would affect other customers.
  10. We had to put throttles in system as we cant serve a customer from two or more nodes due to in memory issues.
  11. Very few engineers would have the aptitude to dig deep into Berkely db code and fix issues, it took me some time to digest but it opened my vision to NOSQL dbs and I was able to apply same concepts when we added cassandra to some other usecase.
  12. Sometimes we throw more hardware to solve the issue and its not cost effective in longer term.
  13. Indexes was another issue, you have to write code to add an index and that means you will have to go thought whole release cycle to get index on a db. 
  14. Adding salt to injury was that indexes wont get created on db startup, they would only get created when someone accesses a db, for a customer with 3-4M files this could take anywhere from 10-20 mins if the server is busy or 1 min if server is idle. This unpredictable performance of bdb was making us nuts.  Whenever we would create an index we had to remember to run empty snapshot queries to generate the indexes on weekend and that would mean painful exercise to repeat on 60 nodes as one or the other node would get hung and you would end up spending anywhere from 4-5 hours fixing it.
  15. BDB Locks were another pain, bdb support page level locks instead of row level locks and that means more chance of deadlocks, we would get constant deadlocks if a background thread like Trash purge Quartz job is aggressively trying to purge data and same customer is trying to add files at same time. We had to implement application level locks to prevent such scenarios and that complicates the code.

Ultimately we became tired and decided to move to Mysql and it took almost an year to port code to mysql with so many fires going on to scale Berkely db due to 20% growth month over month.  But in short NOSQL is good and is still used at company for many usecases but when it comes to  transactions/adhoc querying Mysql is still the champion and unless you are writing a desktop client application, inmemory dbs are bad idea. I despise Berkeley db everyday to be used a server application.

Finally Mysql backend is going live today to some customers and once its live and scaling we would start removing throttles and start serving one customer from multiple nodes in coming months. My anticipation is that we would consolidate hardware and cut the no of servers in half as app nodes will now be sitting mostly idle. I would cover more details on the Mysql sharding and how scaling with Mysql goes in coming months.



Popular posts from this blog

RabbitMQ java clients for beginners

Here is a sample of a consumer and producer example for RabbitMQ. The steps are
Download ErlangDownload Rabbit MQ ServerDownload Rabbit MQ Java client jarsCompile and run the below two class and you are done.
This sample create a Durable Exchange, Queue and a Message. You will have to start the consumer first before you start the for the first time.

For more information on AMQP, Exchanges, Queues, read this excellent tutorial
import com.rabbitmq.client.Connection; import com.rabbitmq.client.Channel; import com.rabbitmq.client.*; public class RabbitMQProducer { public static void main(String []args) throws Exception { ConnectionFactory factory = new ConnectionFactory(); factory.setUsername("guest"); factory.setPassword("guest"); factory.setVirtualHost("/"); factory.setHost(""); factory.setPort(5672); Conne…

Logging to Graphite monitoring tool from java

We use Graphite as a tool for monitoring some stats and watch trends. A requirement is to monitor impact of new releases as build is deployed to app nodes to see if things like
1) Has the memcache usage increased.
2) Has the no of Java exceptions went up.
3) Is the app using more tomcat threads.
Here is a screenshot

We changed the installer to log a deploy event when a new build is deployed. I wrote a simple spring bean to log graphite events using java. Logging to graphite is easy, all you need to do is open a socket and send lines of events.
import org.slf4j.Logger;import org.slf4j.LoggerFactory; import; import; import; import java.util.HashMap; import java.util.Map; public class GraphiteLogger { private static final Logger logger = LoggerFactory.getLogger(GraphiteLogger.class); private String graphiteHost; private int graphitePort; public String getGraphiteHost() { return graphiteHost; } public void setGraphite…

What a rocky start to labor day weekend

Woke up by earthquake at 7:00 AM in morning and then couldn't get to sleep. I took a bath, made my tea and started checking emails and saw that after last night deployment three storage node out of 100s of nodes were running into Full GC. What was special about the 3 nodes was that each one was in a different Data centre but it was named same app02.  This got me curious I asked the node to be taken out of rotation and take a heap dump.  Yesterday night a new release has happened and I had upgraded spymemcached library version as new relic now natively supports instrumentation on it so it was a suspect. And the hunch was a bullseye, the heap dump clearly showed it taking 1.3G and full GCs were taking 6 sec but not claiming anything.

I have a quartz job in each jvm that takes a thread dump every 5 minutes and saves last 300 of them, checking few of them quickly showed a common thread among all 3 data centres. It seems there was a long running job that was trying to replicate pending…