
Better Tools Happy Engineers

"The only thing constant in a Startup is Change"

If you aren't changing fast enough, order and complacency set in, which leads to mediocrity, and you can soon become obsolete. We do biweekly releases and want to move to weekly and then daily releases. You can increase the release cadence if you are able to automate most of the testing. But automation can't detect everything and things may break, so the intent is to establish a culture of preventing failures where you can, and when you can't prevent them, diagnosing and recovering from them as fast as possible.

For a long time I had been pushing hard for working centralized logging, and after almost a year of trying to scale the ELK stack, our team finally put a fast centralized logging system in place that ingests terabytes of logs per minute from thousands of nodes in production. This weekend we deployed a major architecture change, creating swimlanes in our cloud by directing different types of traffic to different swimlanes, and we also added more capacity to our Memcached servers. I was training my lablab bean plants on Saturday morning when I received a call from the India operations team. The operations engineer told me 4-5 support tickets had been opened in the last 4 hours, all related to the datacenter zones where we added the swimlanes, and I thought, this doesn't look good. Luckily, in one of the escalation tickets I found a distributed tracing identifier called RequestId, and a quick Kibana search showed that the request ran into a 503 because HAProxy found no server. I took the string "NOSRV" and fired a Kibana query over the last 12 hours, and the pattern showed the errors happened during the deployment but hadn't recurred in the last 4 hours. That relaxed me, and I could then think critically about what might have happened.
I did some more queries, root-caused it, sent the findings to the deployment team, and confirmed that things were working fine for customers. We would do a postmortem to take steps so it doesn't happen again. Thanks to a working centralized logging solution I was able to root-cause it end to end within 20 minutes, and I was back to my gardening.
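As an illustration, the kind of query behind that search can be expressed with the Python Elasticsearch client roughly like this; the endpoint, index name, field names, and 10-minute bucketing are assumptions for the sketch, not our actual production setup.

# Minimal sketch: find HAProxy "NOSRV" errors in the last 12 hours and bucket them
# over time to see when they started and stopped. Index/field names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://logging-cluster:9200"])  # hypothetical cluster endpoint

query = {
    "query": {
        "bool": {
            "must": [
                {"match_phrase": {"message": "NOSRV"}},
                {"range": {"@timestamp": {"gte": "now-12h"}}}
            ]
        }
    },
    "aggs": {
        "errors_over_time": {
            # bucket the hits into 10-minute slices to expose the trend
            "date_histogram": {"field": "@timestamp", "fixed_interval": "10m"}
        }
    },
    "size": 0
}

result = es.search(index="haproxy-*", body=query)
for bucket in result["aggregations"]["errors_over_time"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])

The date histogram is what makes the trend obvious: the error buckets line up with the deployment window and then drop to zero, which is exactly what told me the issue was no longer ongoing.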

The previous weekend something similar happened. I had to take my son to a birthday party, and a customer was complaining that his account had been getting locked out due to invalid auth attempts for 3 weeks; no one had been able to chase down which device it was, and he escalated to top management. The customer ticket was in the Ashville data centre, so everyone was checking there, but in one of the email exchanges the customer told me his Active Directory account was getting a lot of invalid attempts from the San Jose data centre. I was baffled, wondering if it was a security issue, but I figured I could just query his email "xxx@xxx.xom" in the San Jose centralized Kibana, and indeed he was right. He had another account in the San Jose data centre where an old device from one of his offices was making the calls. I sent the customer the initial details, took my son to the birthday party, came back, and root-caused the issue within 1 hour, all due to centralized logging, and I was back to enjoying the things I love working on.
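Again as a rough sketch with placeholder index and field names (and a placeholder email address), finding the device boiled down to filtering that account's failed-auth entries and grouping them by the client that sent them:

# Minimal sketch: pull the invalid-auth log entries for one account over the last
# 3 weeks and group them by source, to spot the rogue device. Names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://sjc-logging-cluster:9200"])  # hypothetical San Jose endpoint

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"user_email": "user@example.com"}},   # placeholder account
                {"match_phrase": {"message": "Invalid Auth"}},
                {"range": {"@timestamp": {"gte": "now-21d"}}}    # roughly the 3 weeks of lockouts
            ]
        }
    },
    "aggs": {
        "by_source": {
            # group failures by calling client so the offending device stands out
            "terms": {"field": "client_ip.keyword", "size": 10}
        }
    },
    "size": 0
}

result = es.search(index="auth-logs-*", body=query)
for bucket in result["aggregations"]["by_source"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])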

I am lazy by nature and hate doing mundane tasks like grepping logs, so for a long time I was pushing the team to get centralized logging working. The problems with grepping logs were:
  1. New machines were added but the log mounts weren't there, so I had to chase people to mount them in the middle of a debugging session, which can sometimes take hours on a weekend.
  2. Mounts would go bad.
  3. Grepping logs for simple things takes hours if done over a day's worth of logs across 200 machines.
  4. You need to understand the entire architecture of all 20-30 services and 3-4 layers of load balancers to grep in a localized area and save time.
  5. "The only thing constant in a Startup is change", and the architecture changes every month, so you need to keep up to speed on all of it.
  6. The support or customer success team doesn't know the architecture, so the ticket goes to an engineer who isn't happy to chase issues, sometimes only to explain to the customer that it's a feature, not a bug.
  7. The product management team also doesn't know the architecture well enough to grep things.
  8. With grepping you can't see trends like the ones above unless you are handy enough with Linux to build graphs on the command line.
Centralized logging solves all of this, and best of all you can send a link to the issue or even put it in the ticketing system.

Providing better tools and pushing hard for them should be the job of engineering leads, as the technology landscape keeps changing rapidly and whoever can move fast and recover from failures quickly will survive. Better tools also create happy engineers: they can get the mundane things done quickly and go back to doing what they love (in my case programming, gardening, and spending time with family).
