
Better Tools Happy Engineers

"The only thing constant in a Startup is Change"

If you aren't changing fast enough, order and complacency set in, which leads to mediocrity, and you can soon become obsolete. We do biweekly releases and want to move to weekly and then daily releases. You can increase the release cadence if you are able to automate most of the testing. But automation can't detect everything and things may break, so the intent is to establish a culture of preventing failures where you can, and when you can't prevent them, diagnosing and recovering from them as fast as possible.

For a long time I had been pushing hard for working centralized logging, and after almost a year of trying to scale the ELK stack, our team finally put a fast centralized logging system in place that ingests terabytes of logs per minute from thousands of nodes in production. This weekend we deployed a major architecture change, creating swimlanes in our cloud by directing different types of traffic to different swimlanes, and we also added more capacity to our Memcached servers. I was training my lablab bean plants on Saturday morning when I received a call from the India operations team. The operations engineer told me 4-5 support tickets had been opened in the last 4 hours, all related to the datacenter zones where we added the swimlanes, and I thought, this doesn't look good. Luckily, in one of the escalation tickets I found a distributed tracing identifier called RequestId, and a quick Kibana search showed that the request ran into a 503 because HAProxy found no server. I took the string "NOSRV" and fired a Kibana query over the last 12 hours, and the pattern showed the errors happened during the deployment but hadn't recurred in the last 4 hours. That relaxed me, and I could then think critically about what might have happened.
I did some more queries, root-caused it, sent the findings to the deployment team, and confirmed that things were working fine for customers. We would do a postmortem to take steps so it doesn't happen again. Thanks to a working centralized logging solution I was able to root-cause it end to end within 20 minutes, and I was back to my gardening.
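As an illustration, the kind of query behind that search can be expressed with the Python Elasticsearch client roughly like this; the endpoint, index name, field names, and 10-minute bucketing are assumptions for the sketch, not our actual production setup.

# Minimal sketch: find HAProxy "NOSRV" errors in the last 12 hours and bucket them
# over time to see when they started and stopped. Index/field names are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://logging-cluster:9200"])  # hypothetical cluster endpoint

query = {
    "query": {
        "bool": {
            "must": [
                {"match_phrase": {"message": "NOSRV"}},
                {"range": {"@timestamp": {"gte": "now-12h"}}}
            ]
        }
    },
    "aggs": {
        "errors_over_time": {
            # bucket the hits into 10-minute slices to expose the trend
            "date_histogram": {"field": "@timestamp", "fixed_interval": "10m"}
        }
    },
    "size": 0
}

result = es.search(index="haproxy-*", body=query)
for bucket in result["aggregations"]["errors_over_time"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])

The date histogram is what makes the trend obvious: the error buckets line up with the deployment window and then drop to zero, which is exactly what told me the issue was no longer ongoing.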

The previous weekend something similar happened. I had to take my son to a birthday party, and a customer was complaining that his account had been getting locked out due to invalid auth attempts for 3 weeks; no one had been able to chase down which device it was, and he escalated to top management. The customer ticket was in the Ashville data centre, so everyone was checking there, but in one of the email exchanges the customer told me his Active Directory account was getting a lot of invalid attempts from the San Jose data centre. I was baffled, wondering if it was a security issue, but I figured I could just query his email "xxx@xxx.xom" in the San Jose centralized Kibana, and indeed he was right. He had another account in the San Jose data centre where an old device from one of his offices was making the calls. I sent the customer the initial details, took my son to the birthday party, came back, and root-caused the issue within 1 hour, all due to centralized logging, and I was back to enjoying the things I love working on.
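Again as a rough sketch with placeholder index and field names (and a placeholder email address), finding the device boiled down to filtering that account's failed-auth entries and grouping them by the client that sent them:

# Minimal sketch: pull the invalid-auth log entries for one account over the last
# 3 weeks and group them by source, to spot the rogue device. Names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://sjc-logging-cluster:9200"])  # hypothetical San Jose endpoint

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"user_email": "user@example.com"}},   # placeholder account
                {"match_phrase": {"message": "Invalid Auth"}},
                {"range": {"@timestamp": {"gte": "now-21d"}}}    # roughly the 3 weeks of lockouts
            ]
        }
    },
    "aggs": {
        "by_source": {
            # group failures by calling client so the offending device stands out
            "terms": {"field": "client_ip.keyword", "size": 10}
        }
    },
    "size": 0
}

result = es.search(index="auth-logs-*", body=query)
for bucket in result["aggregations"]["by_source"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])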

I am lazy by nature and hate doing mundane tasks like grepping logs, so for a long time I was pushing the team to get centralized logging working. The problems with grepping logs were:
  1. New machines were added but the log mounts weren't there, so I had to chase people to mount them in the middle of a debugging session, which can sometimes take hours on a weekend.
  2. Mounts would go bad.
  3. Grepping logs for simple things takes hours if done over a day's worth of logs across 200 machines.
  4. You need to understand the entire architecture of all 20-30 services and 3-4 layers of load balancers to grep in a localized area and save time.
  5. "The only thing constant in a Startup is change", and the architecture changes every month, so you need to keep up to speed on all of it.
  6. The support or customer success team doesn't know the architecture, so the ticket goes to an engineer who isn't happy to chase issues, sometimes only to explain to the customer that it's a feature, not a bug.
  7. The product management team also doesn't know the architecture well enough to grep things.
  8. With grepping you can't see trends like the ones above unless you are handy enough with Linux to build graphs on the command line.
Centralized logging solves all of this, and best of all you can send a link to the issue or even put it in the ticketing system.

Providing better tools and pushing hard for them should be the job of engineering leads, as the technology landscape keeps changing rapidly and whoever can move fast and recover from failures quickly will survive. Better tools also create happy engineers: they can get the mundane things done quickly and go back to doing what they love (in my case programming, gardening, and spending time with family).
