Skip to main content

Mysql JDBC driver and Streaming large resultset

We are moving one legacy component from Berkeley db to mysql so that we can replicate it to distribute read requests and shard it to scale it. I had to dump the entire db contents and stream it over http to some other component. the db content for a single customer can range from 100K to 3-4 Million records. We are using spring JDBC to make the task of dealing with jdbc api simple. Now I was using a ResultSetExtractor to stream the resultset like this.

            ResultSetExtractor resultSet = new ResultSetExtractor() {
                @Override
                public Object extractData(ResultSet result) throws SQLException, DataAccessException {
                    while(result.next()){
                        XXXXX
                    }
                    return "";
                }
            };

getJdbcTemplate(context).getJdbcOperations().query( sql, resultSetExtractor);

But it appears that Mysql native JDBC driver loads entire resultset into memory before passing the control onto ResultSetExtractor and that was causing OOM.

By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate, and due to the design of the MySQL network protocol is easier to implement. If you are working with ResultSets that have a large number of rows or large values, and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
              java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.


This is explained here http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html 

It seems SimpleJdbcTemplate doesnt have a setFetchSize method and JdbcTemplate has a setFetchSize but that doesn not work when you call the query method and if you use statement then the applySettings method applies fetchSize only if its >0.  So the solution is to use PreparedStatement directly

callback = new PreparedStatementCallback() {
                @Override
                public Void doInPreparedStatement(PreparedStatement pstmt) throws SQLException, DataAccessException {
                    ResultSet rs = pstmt.executeQuery();
                    resultSetExtractor.extractData(rs);
                    rs.close();
                    return null;
                }
            };

        executeStreamed(jdbcTemplate, callback, sql);

    /**
     * http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html
     * Unless you specify the statement settings as below the mysql driver is going to load all results in memory.
     *
     * @param jdbcTemplate
     * @param callback
     * @param sql
     */
    protected void executeStreamed(SimpleJdbcTemplate jdbcTemplate, PreparedStatementCallback callback, final String sql) {
        PreparedStatementCreator creator = new PreparedStatementCreator() {
            @Override
            public PreparedStatement createPreparedStatement(Connection conn) throws SQLException {
                PreparedStatement pstmt = conn.prepareStatement(sql, java.sql.ResultSet.TYPE_FORWARD_ONLY,
                        java.sql.ResultSet.CONCUR_READ_ONLY);
                pstmt.setFetchSize(Integer.MIN_VALUE);
                return pstmt;
            }
        };
        jdbcTemplate.getJdbcOperations().execute(creator, callback);
    }


Comments

  1. Alternatively, we can subclass JdbcTemplate and override applySettings method and use this subclass.

    ReplyDelete

Post a Comment

Popular posts from this blog

RabbitMQ java clients for beginners

Here is a sample of a consumer and producer example for RabbitMQ. The steps are
Download ErlangDownload Rabbit MQ ServerDownload Rabbit MQ Java client jarsCompile and run the below two class and you are done.
This sample create a Durable Exchange, Queue and a Message. You will have to start the consumer first before you start the for the first time.

For more information on AMQP, Exchanges, Queues, read this excellent tutorial
http://blogs.digitar.com/jjww/2009/01/rabbits-and-warrens/

+++++++++++++++++RabbitMQProducer.java+++++++++++++++++++++++++++
import com.rabbitmq.client.Connection; import com.rabbitmq.client.Channel; import com.rabbitmq.client.*; public class RabbitMQProducer { public static void main(String []args) throws Exception { ConnectionFactory factory = new ConnectionFactory(); factory.setUsername("guest"); factory.setPassword("guest"); factory.setVirtualHost("/"); factory.setHost("127.0.0.1"); factory.setPort(5672); Conne…

Logging to Graphite monitoring tool from java

We use Graphite as a tool for monitoring some stats and watch trends. A requirement is to monitor impact of new releases as build is deployed to app nodes to see if things like
1) Has the memcache usage increased.
2) Has the no of Java exceptions went up.
3) Is the app using more tomcat threads.
Here is a screenshot

We changed the installer to log a deploy event when a new build is deployed. I wrote a simple spring bean to log graphite events using java. Logging to graphite is easy, all you need to do is open a socket and send lines of events.
import org.slf4j.Logger;import org.slf4j.LoggerFactory; import java.io.OutputStreamWriter; import java.io.Writer; import java.net.Socket; import java.util.HashMap; import java.util.Map; public class GraphiteLogger { private static final Logger logger = LoggerFactory.getLogger(GraphiteLogger.class); private String graphiteHost; private int graphitePort; public String getGraphiteHost() { return graphiteHost; } public void setGraphite…

What a rocky start to labor day weekend

Woke up by earthquake at 7:00 AM in morning and then couldn't get to sleep. I took a bath, made my tea and started checking emails and saw that after last night deployment three storage node out of 100s of nodes were running into Full GC. What was special about the 3 nodes was that each one was in a different Data centre but it was named same app02.  This got me curious I asked the node to be taken out of rotation and take a heap dump.  Yesterday night a new release has happened and I had upgraded spymemcached library version as new relic now natively supports instrumentation on it so it was a suspect. And the hunch was a bullseye, the heap dump clearly showed it taking 1.3G and full GCs were taking 6 sec but not claiming anything.



I have a quartz job in each jvm that takes a thread dump every 5 minutes and saves last 300 of them, checking few of them quickly showed a common thread among all 3 data centres. It seems there was a long running job that was trying to replicate pending…