
New Relic aha moments with a Java app

I am integrating New Relic at our startup, and over the last two weeks there have been several "aha" moments. Here are some of them.

1) Our DBAs had done some profiling a year ago and told me that a "SELECT 1 FROM DUAL" query was taking most of the time in the database. My reaction was: this is the query we fire to validate connections coming out of the commons-dbcp pool, so I can't take it out, and why would "SELECT 1 FROM DUAL" take any time in the database anyway? But then I installed New Relic on 16 app servers in one data centre, went to the Database tab, and immediately saw this query being fired 40-50K times. That was an aha moment, and I immediately started looking for alternatives.

I finally settled on tomcat-dbcp because it has a property called validationInterval (default 30 seconds). What it means is that it will still fire "SELECT 1 FROM DUAL", but only if that connection hasn't been validated in the last 30 seconds. The fix goes live this weekend, so fingers crossed that there are no serious issues with tomcat-dbcp in a highly concurrent environment; UAT tests look promising so far.
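For illustration, here is roughly what that looks like as a Tomcat JDBC pool resource in context.xml. This is a sketch, not our actual config: the JNDI name, driver, URL, and sizing are made up, and the exact attribute set depends on the pool version you ship.

```xml
<!-- validationInterval (ms) throttles how often the validation query
     runs per connection; 30000 (30 sec) is the documented default. -->
<Resource name="jdbc/appDB" auth="Container" type="javax.sql.DataSource"
          factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
          driverClassName="oracle.jdbc.OracleDriver"
          url="jdbc:oracle:thin:@dbhost:1521:ORCL"
          maxActive="50"
          testOnBorrow="true"
          validationQuery="SELECT 1 FROM DUAL"
          validationInterval="30000"/>
```

With commons-dbcp, testOnBorrow="true" validates on every checkout; with the Tomcat pool, the validationInterval check short-circuits most of those validations.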

2) The second aha moment was with New Relic's plugin support. I used to write platform software at my previous startup, so I really like extensible software, especially platforms like Eclipse or Facebook that can be extended with plugins. We use RabbitMQ as our queuing backbone and it is rock solid. I asked ops to install the New Relic plugin, and immediately I saw that one queue had 1.5M messages. After asking the developer some questions, I found out they had changed queue names and this one was never purged. Those 1.5M messages were consuming only about 75MB of memory, so it was not a big issue, but it was still an aha moment. We purged the queue and the problem was solved.
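For anyone checking the same thing by hand, the command-line equivalent on the broker host is roughly the following (the queue name is hypothetical, and purge_queue needs a reasonably recent rabbitmqctl; on older brokers you would purge via the management plugin or a client API instead):

```shell
# depth and memory per queue
rabbitmqctl list_queues name messages memory

# drop the stale queue's backlog (or delete the queue outright)
rabbitmqctl purge_queue stale_queue_name
```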

3) The third aha moment came when I installed the memcached plugin. I could immediately see that one class of memcached servers, which stores static metadata, was using 10% of its memory, while another class, which stores transactional metadata, was using 80-90%. Memcached uses slab allocation, so at 80-90% memory usage you are bound to see evictions, and we were seeing them at a rate of 10 per second. That is not much, but why should we see evictions at all? All of this data was always present in the memcached stats, but we were not looking at it proactively, because both memcached and RabbitMQ have rarely given us issues, and spotting this kind of trend across 20 memcached servers is not easy from raw stats output. This weekend we are redistributing the memory allocation: giving more memory to the transactional servers and reclaiming unused memory from the metadata servers.
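The trend we were eyeballing boils down to simple arithmetic on two fields from memcached's "stats" output (bytes in use vs. the configured limit_maxbytes). A minimal sketch, with made-up numbers for two hypothetical 1 GB nodes:

```java
public class SlabUsage {
    // Percent of the configured cache memory actually in use,
    // computed from memcached "stats": bytes / limit_maxbytes.
    static double utilizationPct(long bytes, long limitMaxBytes) {
        return 100.0 * bytes / limitMaxBytes;
    }

    public static void main(String[] args) {
        long oneGb = 1_073_741_824L;
        // hypothetical transactional node: ~85% used, eviction territory
        System.out.println(utilizationPct(912_680_550L, oneGb));
        // hypothetical metadata node: ~10% used, memory to reclaim
        System.out.println(utilizationPct(107_374_182L, oneGb));
    }
}
```

Trivial per server, but the point is that nothing was graphing this ratio across all 20 nodes until the plugin did.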

4) One more thing that was good to know is our average response time for all APIs across the board. We had all these stats at a daily level, but no trend information on an hourly basis. As the chart showed, there was not much difference between one DC and the other over the last 34 hours.

Of course, not everything is perfect with New Relic; I have some minor gripes as well:

1) There is no aggregation for plugins. I wanted plugin data coming from all servers aggregated, the way New Relic does it for web transactions. We have 400+ servers, and I don't want a dashboard comparing I/O averages server by server; I really want to focus on the servers that are anomalies. I have 100+ MySQL servers, and the number is growing fast, but I don't want to install the MySQL plugin because the graphs are per server, which makes them useless to me. I have the same issue with the haproxy plugin. I already have all of this in Cacti and Graphite, so I am looking at whether I can use OpenTSDB to produce one graph that lets me focus on anomalies. VividCortex has a nice bubble paradigm for this kind of analysis, but beyond the bubble view that tool didn't hold much interest for me for MySQL.

2) New Relic is good at detecting surface-level issues and pointing the developer in the right direction, but after that, what? Our app uses a fair-share thread pool, and every API request is routed through it. The fair-share pool ensures that no one customer hogs all the resources on a machine. The problem is that New Relic doesn't follow transaction traces into async activity: it tells me that 99% of the time is spent in the method that delegates to the fair-share thread pool, which is useless to me. I have not given up on it, but I would need to spend time annotating code with custom @Trace annotations or deploying a YAML file with pointcuts to trace on each server. I will try that later in my free time, and only if we decide to buy New Relic.
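For context, the fair-share idea itself is simple. Here is a minimal, synchronous sketch (class and method names are mine, not our real pool): per-customer queues drained round-robin, so a customer with a huge backlog cannot starve one with two requests.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Queue;

public class FairShareSketch {
    // Drain per-customer queues one task at a time, round-robin,
    // so no single customer monopolizes the workers.
    static List<String> drainFairly(LinkedHashMap<String, Queue<String>> queues) {
        List<String> order = new ArrayList<>();
        boolean tookSomething = true;
        while (tookSomething) {
            tookSomething = false;
            for (Queue<String> q : queues.values()) {
                String task = q.poll();
                if (task != null) {
                    order.add(task);
                    tookSomething = true;
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Queue<String>> queues = new LinkedHashMap<>();
        queues.put("custA", new ArrayDeque<>(Arrays.asList("a1", "a2", "a3")));
        queues.put("custB", new ArrayDeque<>(Arrays.asList("b1")));
        System.out.println(drainFairly(queues)); // [a1, b1, a2, a3]
    }
}
```

The tracing gap is that the real pool hands each dequeued task to a worker thread, and that hop is where the agent loses the transaction; the com.newrelic.api.agent @Trace annotation (its exact async options depend on the agent version) is the hook I'd have to sprinkle into the worker side to stitch the trace back together.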

3) New Relic only shows errors that are sent back to the browser as errors, but what about exceptions that get gobbled up? Or it tells me the /webdav URL is erroring and sending 500s at a rate of 15 per minute, but now what? Which exception happened? The same URL could have failed with different exceptions for different customers, and I can't tail logs on 100+ servers. I have a home-grown APM that probably does a better job here, so we will probably stick with that for the time being.
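To make that gripe concrete, this is the kind of "gobbled up" exception I mean; the method below is a made-up example, but the pattern shows up everywhere:

```java
public class SwallowedError {
    // The caught exception is never logged or rethrown, so the request
    // returns 200 and an errors-by-response-code APM never sees it.
    static int parseOrDefault(String input) {
        try {
            return Integer.parseInt(input);
        } catch (NumberFormatException e) {
            // swallowed: no log, no rethrow -- invisible to the APM
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("oops")); // -1, and nobody is the wiser
    }
}
```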

Overall I am very happy with New Relic, and maybe I just need to play with more of the advanced features. It is good at detecting general trends and pointing you in the right direction. It has less clutter than the other APM tools I was trying, and best of all, it has been super fast for me so far, even when I am analyzing the last 3 days' worth of data.

