Better Tools, Happy Engineers

"The only thing constant in a Startup is Change"

If you aren't changing fast enough, order and complacency set in, which leads to mediocrity, and you can soon become obsolete. We do biweekly releases and want to move to weekly and then daily releases. You can increase the release cadence if you are able to automate most of the testing. But automation can't catch everything, and things will break, so the intent is to establish a culture of preventing failures where you can, and when you can't prevent them, diagnosing and recovering from them as fast as possible.

For a long time I had been pushing hard for working centralized logging, and after almost a year of trying to scale the ELK stack, our team has finally put a fast centralized logging system into production that ingests terabytes of logs per minute from thousands of nodes. This weekend we deployed a major architecture change, creating swimlanes in our cloud by directing different types of traffic to different swimlanes, and we also added more capacity to our Memcached servers. I was training my lablab bean plants on Saturday morning when I received a call from the India operations team, and my first thought was that this didn't look good. The operations engineer told me 4-5 support tickets had been opened in the last 4 hours, all related to the datacenter zones where we added the swimlanes. Luckily, in one of the escalation tickets I found a distributed tracing identifier called RequestId, so I quickly did a Kibana search and found that the request had hit a 503 because HAProxy could not find a backend server. I took the string "NOSRV", fired a Kibana query over the last 12 hours, and found a pattern: the errors occurred during the deployment and had not happened again in the last 4 hours. That relaxed me, and I could then think critically about what might have happened.
I did some more queries, root caused it, sent the findings to the deployment team, and confirmed that things were working fine for customers. We will do a postmortem to take steps so it doesn't happen again. Thanks to a working centralized logging solution, I was able to root cause it end to end within 20 minutes and get back to my gardening.
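To make that concrete, here is a minimal sketch of what those two searches might look like as raw Elasticsearch queries behind Kibana. The endpoint, index pattern ("logstash-*"), and field names ("requestId", "message", "@timestamp") are illustrative assumptions, not our actual log schema.

# Sketch only: endpoint, index pattern, and field names are assumptions for illustration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://logging-es.example.internal:9200")  # assumed ES endpoint behind Kibana

# 1. Pull the full trace for the RequestId found in the escalation ticket.
trace = es.search(
    index="logstash-*",
    body={
        "query": {"match": {"requestId": "PUT-REQUEST-ID-HERE"}},  # placeholder id
        "sort": [{"@timestamp": "asc"}],
        "size": 500,
    },
)

# 2. Count HAProxy "NOSRV" hits over the last 12 hours in 15-minute buckets,
#    which is how the "only during the deployment" pattern shows up.
nosrv_trend = es.search(
    index="logstash-*",
    body={
        "query": {
            "bool": {
                "must": [{"match_phrase": {"message": "NOSRV"}}],
                "filter": [{"range": {"@timestamp": {"gte": "now-12h"}}}],
            }
        },
        "aggs": {
            "per_15m": {"date_histogram": {"field": "@timestamp", "fixed_interval": "15m"}}
        },
        "size": 0,
    },
)

for bucket in nosrv_trend["aggregations"]["per_15m"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])

Typed into the Kibana search bar, the same thing is roughly the shorthand requestId:"..." and message:"NOSRV", with the time picker doing the range filter.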

The previous weekend was similar. I had to take my son to a birthday party, and a customer was complaining that his account had been getting locked out due to invalid auth attempts for 3 weeks; no one had been able to chase down which device was responsible, and he escalated to top management. The customer's ticket was in the Ashville data centre, so everyone was checking there, but in one of the email exchanges the customer told me his Active Directory account was getting a lot of invalid attempts from the San Jose data centre. I was baffled and wondered if it was a security issue, but I figured, what the heck, I can just query his email "xxx@xxx.xom" in the San Jose centralized Kibana, and indeed he was right. He had another account in the San Jose data centre, where an old device from one of his offices was still making calls. I sent the customer the initial details, took my son to the birthday party, came back, and root caused the issue within an hour, all thanks to centralized logging, and I was back to working on the things I love.
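As a rough illustration of that second hunt, a query like the one below against the San Jose cluster is all it takes to group one user's invalid auth attempts by source host. Again this is only a sketch: the field names ("userEmail", "eventType", "sourceHost") and the endpoint are assumptions, not the real schema.

# Sketch only: field names and endpoint are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://sanjose-es.example.internal:9200")  # assumed San Jose cluster

result = es.search(
    index="logstash-*",
    body={
        "query": {
            "bool": {
                "must": [
                    {"term": {"userEmail.keyword": "xxx@xxx.xom"}},
                    {"match": {"eventType": "INVALID_AUTH"}},
                ],
                "filter": [{"range": {"@timestamp": {"gte": "now-21d"}}}],  # the 3 weeks of lockouts
            }
        },
        # Group the failures by originating host to pin down the stale device.
        "aggs": {"by_source": {"terms": {"field": "sourceHost.keyword", "size": 10}}},
        "size": 0,
    },
)

for bucket in result["aggregations"]["by_source"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])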

I am lazy by nature and hate doing mundane tasks like grepping logs, so for a long time I pushed the team to get centralized logging working. The problems with grepping logs were:
  1. New machines were added but the log mounts weren't there, so I had to chase people to mount them in the middle of a debugging session, which can sometimes take hours on a weekend.
  2. Mounts would go bad.
  3. Grepping logs for simple things takes hours if done over a day's worth of logs across 200 machines.
  4. You need to understand the entire architecture of all 20-30 services and 3-4 layers of load balancers to grep in a localized area and save time.
  5. "The only thing constant in a Startup is change," and the architecture changes every month, so you need to keep up to speed on all of it.
  6. The support and customer success teams don't know the architecture, so the ticket goes to an engineer who isn't happy chasing issues, sometimes only to explain to the customer that it's a feature, not a bug.
  7. The product management team also doesn't know the architecture well enough to grep things.
  8. With grepping you can't see trends like the ones above unless you are handy enough with Linux to build graphs on the command line (see the sketch after this list).
Centralized logging solves all of this, and best of all, you can send a link to the issue or even put it in the ticketing system.
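Here is the kind of trend query (point 8) that replaces an afternoon of grep-and-awk: error counts per service per hour for the last day. This too is only a sketch; the "service" and "level" fields are assumed for illustration.

# Sketch only: "service" and "level" fields are assumptions about the log schema.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://logging-es.example.internal:9200")  # assumed endpoint

trend = es.search(
    index="logstash-*",
    body={
        "query": {
            "bool": {
                "must": [{"term": {"level.keyword": "ERROR"}}],
                "filter": [{"range": {"@timestamp": {"gte": "now-24h"}}}],
            }
        },
        "aggs": {
            "per_service": {
                "terms": {"field": "service.keyword", "size": 30},
                "aggs": {
                    "per_hour": {"date_histogram": {"field": "@timestamp", "fixed_interval": "1h"}}
                },
            }
        },
        "size": 0,
    },
)

for svc in trend["aggregations"]["per_service"]["buckets"]:
    print(svc["key"], [b["doc_count"] for b in svc["per_hour"]["buckets"]])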

Providing better tools and pushing hard for them should be the job of engineering leads, because the technology landscape keeps changing rapidly and whoever can move fast and recover from failures quickly will survive. Better tools also make for happy engineers: they can get mundane tasks done quickly and go back to doing what they love (in my case programming, gardening, and spending time with family).
