Skip to main content

New relic aha moments with Java app

I am integrating New Relic at our startup and in last 2 weeks there were several Aha moments.  Below are some of the Aha moments

1) Our DBAs had done some profiling an year ago and told me that there is this "Select 1 from dual" query that is taking most of the time in database.  I was like this is the query that we do for validating connection out of commons-dbcp pool so I cant take it out and why would select 1 from dual take time in database.  But then I installed new relic on 16 app servers in a data centre and then I went to Database tab and immediately I see this query being fired 40-50K times. This was an Aha moment and immediately I started looking for alternatives.







Finally I settled on tomcat-dbcp because it has a property called as validationInterval(default is 30 sec). So what it means is it would still fire select 1 from dual but will fire it on the connection only if it hasnt fired it in last 30 sec. This weekend the fix is going live so crossing fingers if there are no serious issues with tomcat-dbcp in highly concurrent environment, UAT tests looks promising so far.

2) Second Aha moment was with Plugins support for new relic. I used to write platform software in my previous startup so I really like extensible software. Especially platforms like eclipse or facebook that are extensible with plugins. We use RabbitMQ as our queuing backbone and its rock solid. I asked ops to install the new relic plugin and immediately I see some queue has 1.5M messages. Upon asking questions to the developer I found out they had changed queue names and this one was not purged.  These 1.5M messages were consuming 75M memory so it was not a big issue but still this was an Aha moment. We purged the queue and problem was solved.

3) Third Aha moment was when I installed memcahed plugin. Immediately I can see that one class of memcached servers that store static metadata is using 10% memory but another class of memcached servers that store transactional metadata is using 80-90% of memory. Memcached has a slab allocation so if you are using 80-90% memory you are sure of seeing evictions and we were seeing at rate of 10 per sec which is not much but still why should we see evictions. Earlier also all this data was present in memcached stats but we were not looking into it pro-actively because both memcached and rabbitMQ have rarely given us issues and finding this kind of trend over 20 memcached servers was not easy from stats output.  This weekend we are redistributing the memory allocation and giving more memory to transactional servers and reclaiming unused memory from metadata servers.

4) Also one more thing that was good to know is our average response time for all apis across the board. We had all these stats at daily level but no trend information on hourly basis.  Below as you can see there is not much difference between one DC to other DC over last 34 hours.
DC1

DC2
Offcourse not everything is perfect with new relic, I have some minor cribs also:

1) For plugins there is no aggregation. I wanted an aggregation for plugin data coming from all servers like they do for web transactions. we have close to 400+ servers and I dont want to see a dashboard to compare IO averages that look like this as I really want to focus on servers that are anomalies. I have close to 100+ Mysql servers and growing fast but I don’t want to install Mysql Plugin because the graphs are per server so its useless, I have same issue with haproxy plugin, I already have all this in cacti and graphite so I am seeing if I can use OpenTsDB for this and produce one graph that allows me to focus on anamolies.  Vividcortex has a nice bubble paradigm for this kind of analysis but again that tool didnt had much of interest to me for Mysql except Bubble view.



2) New relic is good at detecting surface level issues and pointing the developer in right direction but after that what??  Our app uses a Fair Share thread pool http://neopatel.blogspot.com/2013/06/java-fair-share-threadpool.html and every api request is routed via this. The Fair share thread pool ensures no one customer is hogging all resources in one machine. But problem is new relic doesnt detect transaction trace in Async activity, it tells me 99% time is spent in this method that delegates to fair share thread pool but its useless to me.  I have not given up on it but then I would need to spend time mutating code with custom @Trace annotation or deploying a yaml file with cut points to trace in each server. But this one I would try later in free time and if we decide to buy new relic.

3) New relic only shows errors that are sent back to browser as error but what about exceptions that are gobbled up or it tells me /webdav url has error and it send 500 at a rate of 15 per minute, but now what, which exception happend as same url could have failed due to different exception for different customer and I cant tail logs on 100+ servers?  I have a home grown APM that probably does a better job here so we would probably stick to that for time being for this.

Overall very happy with new relic and may be I need to play with more advanced features.  Its good at detecting some general trends and pointing you in right direction. It has less clutter than the other APM tools I was trying and best of all its super fast so far for me even If I am analyzing last 3 days worth of data.

Comments

Popular posts from this blog

Haproxy and tomcat JSESSIONID

One of the biggest problems I have been trying to solve at our startup is to put our tomcat nodes in HA mode. Right now if a customer comes, he lands on to a node and remains there forever. This has two major issues: 1) We have to overprovision each node with ability to handle worse case capacity. 2) If two or three high profile customers lands on to same node then we need to move them manually. 3) We need to cut over new nodes and we already have over 100+ nodes.  Its a pain managing these nodes and I waste lot of my time in chasing node specific issues. I loath when I know I have to chase this env issue. I really hate human intervention as if it were up to me I would just automate thing and just enjoy the fruits of automation and spend quality time on major issues rather than mundane task,call me lazy but thats a good quality. So Finally now I am at a stage where I can put nodes behing HAProxy in QA env. today we were testing the HA config and first problem I immediat...

Spring 3.2 quartz 2.1 Jobs added with no trigger must be durable.

I am trying to enable HA on nodes and in that process I found that in a two test node setup a job that has a frequency of 10 sec was running into deadlock. So I tried upgrading from Quartz 1.8 to 2.1 by following the migration guide but I ran into an exception that says "Jobs added with no trigger must be durable.". After looking into spring and Quartz code I figured out that now Quartz is more strict and earlier the scheduler.addJob had a replace parameter which if passed to true would skip the durable check, in latest quartz this is fixed but spring hasnt caught up to this. So what do you do, well I jsut inherited the factory and set durability to true and use that public class DurableJobDetailFactoryBean extends JobDetailFactoryBean {     public DurableJobDetailFactoryBean() {         setDurability(true);     } } and used this instead of JobDetailFactoryBean in the spring bean definition     <bean i...

Adding Jitter to cache layer

Thundering herd is an issue common to webapp that rely on heavy caching where if lots of items expire at the same time due to a server restart or temporal event, then suddenly lots of calls will go to database at same time. This can even bring down the database in extreme cases. I wont go into much detail but the app need to do two things solve this issue. 1) Add consistent hashing to cache layer : This way when a memcache server is added/removed from the pool, entire cache is not invalidated.  We use memcahe from both python and Java layer and I still have to find a consistent caching solution that is portable across both languages. hash_ring and spymemcached both use different points for server so need to read/test more. 2) Add a jitter to cache or randomise the expiry time: We expire long term cache  records every 8 hours after that key was added and short term cache expiry is 2 hours. As our customers usually comes to work in morning and access the cloud file server it ...