
New Relic aha moments with a Java app

I am integrating New Relic at our startup, and in the last 2 weeks there were several aha moments. Below are some of them.

1) Our DBAs had done some profiling a year ago and told me that a "select 1 from dual" query was taking most of the time in the database. My reaction was that this is the query we run to validate connections coming out of the commons-dbcp pool, so I can't take it out, and why would select 1 from dual take any time in the database anyway? But then I installed New Relic on 16 app servers in one data centre, went to the Database tab, and immediately saw this query being fired 40-50K times. This was an aha moment, and I immediately started looking for alternatives.







Finally I settled on tomcat-dbcp because it has a property called validationInterval (default is 30 sec). What this means is that it still fires select 1 from dual, but only on a connection that has not been validated in the last 30 sec. The fix goes live this weekend, so fingers crossed that there are no serious issues with tomcat-dbcp in a highly concurrent environment. UAT tests look promising so far.
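To make the effect of a validation interval concrete, here is a minimal sketch of the "validate at most once per interval" idea. It is not the pool's real code: the class name `ThrottledValidator` and its methods are hypothetical, and the counter stands in for actually running select 1 from dual.

```java
/**
 * Sketch of interval-throttled connection validation.
 * Hypothetical class for illustration; a real pool keeps one such
 * timestamp per pooled connection and runs the validation query itself.
 */
final class ThrottledValidator {
    private final long intervalMillis;
    private long lastValidatedAt;
    private boolean everValidated;
    private int queriesFired;

    ThrottledValidator(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Called when the connection is borrowed; returns true if the query was run. */
    synchronized boolean validateOnBorrow(long nowMillis) {
        if (everValidated && nowMillis - lastValidatedAt < intervalMillis) {
            return false; // validated recently enough, skip the round trip
        }
        queriesFired++; // a real pool would execute "select 1 from dual" here
        lastValidatedAt = nowMillis;
        everValidated = true;
        return true;
    }

    synchronized int queriesFired() {
        return queriesFired;
    }

    public static void main(String[] args) {
        ThrottledValidator connection = new ThrottledValidator(30_000L);
        // Simulate one borrow per second for five minutes.
        for (int second = 0; second < 300; second++) {
            connection.validateOnBorrow(second * 1000L);
        }
        System.out.println("borrows=300 validationQueries=" + connection.queriesFired());
    }
}
```

In this simulation a connection borrowed once per second for five minutes runs the validation query 10 times instead of 300, which is the kind of reduction the interval is meant to buy.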

2) The second aha moment was with plugin support in New Relic. I used to write platform software at my previous startup, so I really like extensible software, especially platforms like Eclipse or Facebook that can be extended with plugins. We use RabbitMQ as our queuing backbone and it is rock solid. I asked ops to install the New Relic plugin and immediately saw that one queue had 1.5M messages. After asking the developer, I found out that they had changed queue names and the old queue was never purged. These 1.5M messages were consuming 75M of memory, so it was not a big issue, but it was still an aha moment. We purged the queue and the problem was solved.

3) The third aha moment came when I installed the memcached plugin. I could immediately see that one class of memcached servers, which stores static metadata, was using 10% of its memory, while another class, which stores transactional metadata, was using 80-90%. Memcached uses slab allocation, so if you are using 80-90% of memory you are sure to see evictions. We were seeing them at a rate of 10 per sec, which is not much, but why should we see evictions at all? All this data was already present in the memcached stats, but we were not looking at it proactively, because both memcached and RabbitMQ have rarely given us issues, and spotting this kind of trend across 20 memcached servers was not easy from raw stats output. This weekend we are redistributing the memory allocation, giving more memory to the transactional servers and reclaiming unused memory from the metadata servers.
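For anyone who wants to pull the same two numbers straight out of raw stats output, the sketch below parses the text returned by the memcached `stats` command and derives memory utilisation and an eviction rate from two samples. The class and method names are hypothetical; the only fields it relies on are the standard `bytes`, `limit_maxbytes` and `evictions` counters, and the sample values in `main` are made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that summarises raw memcached "stats" output. */
final class MemcachedStatsSummary {

    /** Parses "STAT <name> <value>" lines, keeping only numeric values. */
    static Map<String, Long> parse(String statsOutput) {
        Map<String, Long> stats = new HashMap<>();
        for (String line : statsOutput.split("\\r?\\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 3 && parts[0].equals("STAT")) {
                try {
                    stats.put(parts[1], Long.parseLong(parts[2]));
                } catch (NumberFormatException ignored) {
                    // non-numeric stats such as "version" are skipped
                }
            }
        }
        return stats;
    }

    /** Share of the configured memory limit that is holding items, in percent. */
    static double memoryUsedPercent(Map<String, Long> stats) {
        return 100.0 * stats.get("bytes") / stats.get("limit_maxbytes");
    }

    /** Evictions per second between two samples of the cumulative counter. */
    static double evictionsPerSecond(Map<String, Long> earlier,
                                     Map<String, Long> later,
                                     double secondsBetween) {
        return (later.get("evictions") - earlier.get("evictions")) / secondsBetween;
    }

    public static void main(String[] args) {
        // Made-up sample values, one minute apart.
        String first = "STAT bytes 800\r\nSTAT limit_maxbytes 1000\r\nSTAT evictions 1000\r\nEND\r\n";
        String second = "STAT bytes 850\r\nSTAT limit_maxbytes 1000\r\nSTAT evictions 1600\r\nEND\r\n";
        Map<String, Long> a = parse(first);
        Map<String, Long> b = parse(second);
        System.out.println("memory used % = " + memoryUsedPercent(b));
        System.out.println("evictions/sec = " + evictionsPerSecond(a, b, 60.0));
    }
}
```

Running something like this against every server in a class and comparing the percentages side by side gives the same picture the plugin graphs made obvious.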

4) One more thing that was good to know is our average response time for all APIs across the board. We had all these stats at a daily level but no trend information on an hourly basis. As the charts below show, there is not much difference between one DC and the other over the last 34 hours.
[Chart: average response time, DC1]

[Chart: average response time, DC2]

Of course, not everything is perfect with New Relic. I have some minor cribs too:

1) For plugins there is no aggregation. I wanted plugin data coming from all servers to be aggregated the way it is for web transactions. We have 400+ servers, and I don't want to stare at a dashboard of per-server IO averages to compare them; I really want to focus on the servers that are anomalies. I have 100+ MySQL servers, and the number is growing fast, but I don't want to install the MySQL plugin because the graphs are per server, which makes it useless to me. I have the same issue with the HAProxy plugin. I already have all this data in Cacti and Graphite, so I am checking whether I can use OpenTSDB to produce one graph that lets me focus on anomalies. VividCortex has a nice bubble paradigm for this kind of analysis, but that tool did not have much of interest to me for MySQL apart from the bubble view.
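One way to get "show me only the anomalies" out of per-server numbers is a robust outlier test. The sketch below is an assumption about how that could be done, not something New Relic, OpenTSDB or VividCortex provides: it takes one metric value per server (say, average IO wait) and flags servers whose distance from the fleet median exceeds a chosen multiple of the median absolute deviation (MAD). The class name, server names and threshold are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: flags servers whose metric is far from the fleet median. */
final class OutlierServers {

    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Returns servers whose |value - median| is more than threshold times the MAD. */
    static List<String> findOutliers(Map<String, Double> metricByServer, double threshold) {
        List<String> outliers = new ArrayList<>();
        if (metricByServer.isEmpty()) {
            return outliers;
        }
        double[] values = metricByServer.values().stream()
                .mapToDouble(Double::doubleValue).toArray();
        double med = median(values);
        double[] deviations = Arrays.stream(values).map(v -> Math.abs(v - med)).toArray();
        double mad = median(deviations);
        for (Map.Entry<String, Double> entry : metricByServer.entrySet()) {
            double deviation = Math.abs(entry.getValue() - med);
            // If the MAD is zero, most servers are identical, so any deviation stands out.
            boolean isOutlier = mad == 0 ? deviation > 0 : deviation / mad > threshold;
            if (isOutlier) {
                outliers.add(entry.getKey());
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        Map<String, Double> ioWait = new TreeMap<>(); // made-up sample data
        ioWait.put("db1", 5.0);
        ioWait.put("db2", 5.5);
        ioWait.put("db3", 4.8);
        ioWait.put("db4", 5.2);
        ioWait.put("db5", 42.0);
        System.out.println("servers to look at: " + findOutliers(ioWait, 5.0));
    }
}
```

The median and MAD are used instead of mean and standard deviation because one badly behaving server would otherwise drag the baseline towards itself and hide.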



2) New Relic is good at detecting surface-level issues and pointing the developer in the right direction, but after that, what? Our app uses a fair share thread pool http://neopatel.blogspot.com/2013/06/java-fair-share-threadpool.html and every API request is routed through it. The fair share thread pool ensures that no one customer hogs all the resources on a machine. The problem is that New Relic does not follow the transaction trace into async activity. It tells me 99% of the time is spent in the method that delegates to the fair share thread pool, which is useless to me. I have not given up on it, but I would need to spend time mutating code with the custom @Trace annotation or deploying a yaml file with cut points to trace on each server. I will try this later in my free time, and only if we decide to buy New Relic.
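Until the async work is traced properly, one stopgap is to measure it by hand. The sketch below is an assumption, not part of New Relic or of the fair share thread pool from the linked post: a hypothetical `TimedTask` wrapper that records how long a task waited before a worker picked it up and how long it then ran, so the "99% in the delegating method" can at least be split into queue wait and real work.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical wrapper that times queue wait and execution of a pooled task. */
final class TimedTask<T> implements Callable<T> {
    private final Callable<T> delegate;
    private final long createdAtNanos = System.nanoTime(); // taken when the task is built
    private volatile long queueWaitNanos = -1;              // -1 until the task starts
    private volatile long runNanos = -1;                    // -1 until the task finishes

    TimedTask(Callable<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public T call() throws Exception {
        long startedAt = System.nanoTime();
        queueWaitNanos = startedAt - createdAtNanos;
        try {
            return delegate.call();
        } finally {
            runNanos = System.nanoTime() - startedAt;
        }
    }

    long queueWaitNanos() {
        return queueWaitNanos;
    }

    long runNanos() {
        return runNanos;
    }

    public static void main(String[] args) throws Exception {
        // A plain single-thread executor stands in for the real pool here.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TimedTask<String> task = new TimedTask<>(() -> {
                Thread.sleep(50); // pretend to do the API work
                return "done";
            });
            String result = pool.submit(task).get();
            System.out.printf("result=%s queueWaitMs=%d runMs=%d%n",
                    result, task.queueWaitNanos() / 1_000_000, task.runNanos() / 1_000_000);
        } finally {
            pool.shutdown();
        }
    }
}
```

The two numbers could be logged or pushed to whatever metrics store is already in place; a long queue wait points at pool sizing or a noisy customer, a long run time points at the work itself.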

3) New Relic only shows errors that are sent back to the browser as errors, but what about exceptions that are swallowed? It also tells me that the /webdav URL has errors and is returning 500s at a rate of 15 per minute, but then what? Which exception happened? The same URL could have failed with a different exception for different customers, and I can't tail logs on 100+ servers. I have a home-grown APM that probably does a better job here, so we will probably stick with that for the time being.
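To illustrate the kind of breakdown that is missing, here is a minimal sketch of counting failures per URL and per root-cause exception class. It is not the home-grown APM mentioned above; the class `ErrorBreakdown` and its methods are hypothetical, and where `record` gets called from (a servlet filter, a catch block) is left open.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical in-memory counter of errors keyed by URL and root-cause class. */
final class ErrorBreakdown {
    private static final int MAX_CAUSE_DEPTH = 20; // guards against cyclic cause chains
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Records one failure for the URL, attributed to the root cause of the error. */
    void record(String url, Throwable error) {
        Throwable root = error;
        for (int depth = 0; depth < MAX_CAUSE_DEPTH && root.getCause() != null; depth++) {
            root = root.getCause();
        }
        counts.computeIfAbsent(key(url, root.getClass()), k -> new LongAdder()).increment();
    }

    /** Number of failures recorded for this URL with this root-cause class. */
    long count(String url, Class<? extends Throwable> type) {
        LongAdder adder = counts.get(key(url, type));
        return adder == null ? 0 : adder.sum();
    }

    private static String key(String url, Class<?> type) {
        return url + " " + type.getName();
    }

    public static void main(String[] args) {
        ErrorBreakdown errors = new ErrorBreakdown();
        errors.record("/webdav", new RuntimeException(new java.io.IOException("disk full")));
        errors.record("/webdav", new IllegalStateException("bad session"));
        System.out.println("IOException on /webdav: "
                + errors.count("/webdav", java.io.IOException.class));
    }
}
```

With counts like these reported per server, "500s on /webdav at 15 per minute" turns into a short list of exception types, which is usually enough to know which log to open first.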

Overall I am very happy with New Relic, and maybe I need to play with its more advanced features. It is good at detecting general trends and pointing you in the right direction. It has less clutter than the other APM tools I was trying, and best of all it has been super fast for me so far, even when I am analyzing the last 3 days' worth of data.
