Skip to main content

Mysql sharding at our company - Part3 (Shard schema)

As discussed in Part2 of the series we do Horizontal sharding for any schemas that will store 100M+ rows. As we are a cloud file server no data is shared between two customers a perfectly isolation can be achieved easily. One year back when I was thinking on designing the schema there were many alternatives

  1. One shard per customer mapped to one database schema : Rejected this idea because mysql stores 1 or more files per table in physical file system and linux file system chokes after some no of files in a folder. We had faced this problem when storing the real files on filers (topic of another blog post).
  2. One shard per customer maped to one set of tables in database schema : This would solve the issue of multiple files in a folder but again it would lead to too many files on the disk and operating system can choke on it. Also we have customers to do a trial for 15 day and never signup, so too much for ops team to manage for these trials.
  3. Many customers in one shard mapped to one database schema: This would solve both issue one and two, but this is again  too many schemas to manage for operations team when they have to setup replication or write any scripts to manage the schemas.
  4. Many customers in one shard mapped to one set of tables in one database schema. : This is the approach we finally ended up picking as it suits both engineering and operations needs.
On each Mysql server we create a set of schemas and within each schema we have a cluster of tables that comprises a shard. To figure out what customer lives in what set of tables we use a master db called a lookup db aka DNS db. Each query first looks up the master db to figure out what shard this customer lives in, this is a highly  read intensive db so we cache this data in memcache. Once we figure out the shard then we lookup what schema and what server this shard lives on. based on that information the application looks up appropriate connection pool in spring and executes the query. This is how the topology looks like at 10000 ft.


This is the structure of our dns db tables

 A typical schema in a shard db has tables with name like
folders_${TBL_SUFFIX}, file_${TBL_SUFFIX}.  Here TBL_SUFFIX is unique within cluster so that shard can be moved easily. To make it unique for now we just append schema_name and table set number to it to it. So let say for schema c1_db1 the tables for shard 10 and 15 would look like
folders_c1_db1_t1
folders_c1_db1_t2
files_c1_db1_t1
files_c1_db1_t2

We could have just appended shard_id to the table names also to make then unique but this makes logical and physical mapping hard. Later if we move the shard to a different host all we need to do is move the entire schema containing many shards to a diff host and change metadata db mappings and flush cache and post a message to zookeeper. App nodes listen to zookeeper for such events and refresh connection pools.

Part1 of series

Part2 of series

Part3 of series

Part4 of series 

Comments

Popular posts from this blog

RabbitMQ java clients for beginners

Here is a sample of a consumer and producer example for RabbitMQ. The steps are
Download ErlangDownload Rabbit MQ ServerDownload Rabbit MQ Java client jarsCompile and run the below two class and you are done.
This sample create a Durable Exchange, Queue and a Message. You will have to start the consumer first before you start the for the first time.

For more information on AMQP, Exchanges, Queues, read this excellent tutorial
http://blogs.digitar.com/jjww/2009/01/rabbits-and-warrens/

+++++++++++++++++RabbitMQProducer.java+++++++++++++++++++++++++++
import com.rabbitmq.client.Connection; import com.rabbitmq.client.Channel; import com.rabbitmq.client.*; public class RabbitMQProducer { public static void main(String []args) throws Exception { ConnectionFactory factory = new ConnectionFactory(); factory.setUsername("guest"); factory.setPassword("guest"); factory.setVirtualHost("/"); factory.setHost("127.0.0.1"); factory.setPort(5672); Conne…

Logging to Graphite monitoring tool from java

We use Graphite as a tool for monitoring some stats and watch trends. A requirement is to monitor impact of new releases as build is deployed to app nodes to see if things like
1) Has the memcache usage increased.
2) Has the no of Java exceptions went up.
3) Is the app using more tomcat threads.
Here is a screenshot

We changed the installer to log a deploy event when a new build is deployed. I wrote a simple spring bean to log graphite events using java. Logging to graphite is easy, all you need to do is open a socket and send lines of events.
import org.slf4j.Logger;import org.slf4j.LoggerFactory; import java.io.OutputStreamWriter; import java.io.Writer; import java.net.Socket; import java.util.HashMap; import java.util.Map; public class GraphiteLogger { private static final Logger logger = LoggerFactory.getLogger(GraphiteLogger.class); private String graphiteHost; private int graphitePort; public String getGraphiteHost() { return graphiteHost; } public void setGraphite…

What a rocky start to labor day weekend

Woke up by earthquake at 7:00 AM in morning and then couldn't get to sleep. I took a bath, made my tea and started checking emails and saw that after last night deployment three storage node out of 100s of nodes were running into Full GC. What was special about the 3 nodes was that each one was in a different Data centre but it was named same app02.  This got me curious I asked the node to be taken out of rotation and take a heap dump.  Yesterday night a new release has happened and I had upgraded spymemcached library version as new relic now natively supports instrumentation on it so it was a suspect. And the hunch was a bullseye, the heap dump clearly showed it taking 1.3G and full GCs were taking 6 sec but not claiming anything.



I have a quartz job in each jvm that takes a thread dump every 5 minutes and saves last 300 of them, checking few of them quickly showed a common thread among all 3 data centres. It seems there was a long running job that was trying to replicate pending…