
Better Tools Happy Engineers

"The only thing constant in a Startup is Change"

If you aren't changing fast enough, complacency sets in, which leads to mediocrity, and you can soon become obsolete. We do biweekly releases and want to move to weekly and then daily releases. You can only increase the release cadence if you automate most of the testing. But automation can't catch everything and things will break, so the intent is to build a culture around two questions: how do you prevent failures, and when you can't prevent them, how fast can you diagnose and recover?

For a long time I had been pushing hard for working centralized logging, and after almost a year of trying to scale the ELK stack, our team finally has a fast centralized logging system in production, ingesting terabytes of logs per minute from thousands of nodes. This weekend we deployed a major architecture change: we created swimlanes in our cloud by directing different types of traffic to different swimlanes, and we also added more capacity to our Memcached servers. I was training my lablab bean plants on Saturday morning when I received a call from the India operations team, and my first thought was that this didn't look good. The operations engineer told me 4-5 support tickets had been opened in the last 4 hours, all related to the data center zones where we had added the swimlanes. Luckily, in one of the escalation tickets I found a distributed tracing identifier, a RequestId, and a quick Kibana search showed that the request had run into a 503 because HAProxy found no backend server. I took the string "NOSRV", fired a Kibana query over the last 12 hours, and found a pattern: the errors happened during the deployment and hadn't occurred in the last 4 hours. That relaxed me enough to think critically about what might have happened.
I did some more queries, root-caused it, sent the findings to the deployment team, and confirmed that things were working fine for customers. We will do a postmortem and take steps so it doesn't happen again. Thanks to a working centralized logging solution I was able to root-cause it end to end within 20 minutes, and I was back to my gardening.
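
For readers who want to see what those lookups look like, here is a minimal sketch using the Python elasticsearch client against an ELK stack like ours. The index pattern, the field names (RequestId, message, @timestamp), and the cluster URL are illustrative assumptions, not our actual schema.

```python
# Minimal sketch of the two lookups described above, using elasticsearch-py.
# Index pattern, field names and cluster URL are assumptions -- adjust to
# whatever your logging schema actually uses.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://logging.example.com:9200")  # hypothetical cluster URL

# 1. Pull every log line carrying the RequestId from the escalation ticket,
#    oldest first, to reconstruct the request end to end.
request_trace = es.search(
    index="logs-*",                                          # assumed index pattern
    query={"match": {"RequestId": "REQUEST-ID-FROM-TICKET"}},
    sort=[{"@timestamp": {"order": "asc"}}],
    size=200,
)

# 2. Check whether HAProxy "NOSRV" (503, no backend server found) is still
#    happening by restricting the same search to the last 12 hours.
nosrv_recent = es.search(
    index="logs-*",
    query={
        "bool": {
            "must": [{"match_phrase": {"message": "NOSRV"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-12h"}}}],
        }
    },
    sort=[{"@timestamp": {"order": "desc"}}],
    size=50,
)

for hit in nosrv_recent["hits"]["hits"]:
    print(hit["_source"]["@timestamp"], hit["_source"]["message"])
```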

The previous weekend was similar. I had to take my son to a birthday party, and a customer had been complaining that his account kept getting locked out due to invalid auth attempts for 3 weeks; nobody could chase down which device was responsible, and he had escalated to top management. The customer's ticket was in the Ashville data centre and everyone was checking there, but in one of the email exchanges the customer told me his Active Directory account was getting a lot of invalid attempts from the San Jose data centre. I wondered whether it was a security issue, but I figured I could just query his email "xxx@xxx.xom" in the San Jose centralized Kibana, and indeed he was right. He had another account in the San Jose data centre, and an old device from one of his offices was making the calls. I sent the customer the initial details, took my son to the birthday party, came back, and root-caused the issue within an hour, all thanks to centralized logging, and I was back to working on the things I love.
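
A rough sketch of that second lookup, again with the Python client: filter the San Jose cluster by the customer's email and the failed-auth text, then group the hits by source host so the stale device stands out. The field names (user_email, source_host) and values are hypothetical placeholders.

```python
# Sketch of the invalid-auth investigation. Field names (user_email,
# source_host), the email value and the cluster URL are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://sanjose-logging.example.com:9200")

result = es.search(
    index="logs-*",
    query={
        "bool": {
            "filter": [
                {"term": {"user_email": "customer@example.com"}},
                {"match_phrase": {"message": "Invalid Auth"}},
                {"range": {"@timestamp": {"gte": "now-21d"}}},  # the three weeks in question
            ]
        }
    },
    size=0,  # we only care about the aggregation, not individual hits
    aggs={"by_source": {"terms": {"field": "source_host", "size": 10}}},
)

# The bucket with the big doc_count points at the device making the calls.
for bucket in result["aggregations"]["by_source"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```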

I am lazy by nature and I hate doing mundane tasks like grepping logs, which is why I pushed the team for so long to get centralized logging working. The problems with grepping logs were:
  1. New machines would be added but the log mounts weren't there, so I had to chase people to mount them in the middle of a debugging session, which can sometimes take hours on a weekend.
  2. Mounts would go bad.
  3. Grepping for even simple things takes hours when done over a day's worth of logs across 200 machines.
  4. You need to understand the entire architecture of all 20-30 services and 3-4 layers of load balancers to narrow the grep to a localized area and save time.
  5. "The only thing constant in a Startup is Change", and the architecture changes every month, so you need to keep up to speed on all of it.
  6. The support and customer success teams don't know the architecture, so the ticket gets handed to an engineer who isn't happy chasing issues, sometimes only to explain to the customer that it's a feature, not a bug.
  7. The product management team also doesn't know the architecture well enough to grep for things.
  8. With grepping you can't see trends like the ones above unless you are handy enough with Linux to graph things on the command line (see the sketch after this list).
Centralized logging solves all of this, and best of all, you can send someone a link to the issue or even drop it into the ticketing system.
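
Point 8 is where a query DSL beats grep most clearly: instead of piping grep output into command-line plotting, a single date-histogram aggregation gives you the trend. A sketch under the same assumed schema as above:

```python
# Hourly error counts for the last 24 hours -- the kind of trend that is
# painful to extract by grepping 200 machines. Schema assumptions as above.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://logging.example.com:9200")

trend = es.search(
    index="logs-*",
    query={
        "bool": {
            "must": [{"match_phrase": {"message": "NOSRV"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-24h"}}}],
        }
    },
    size=0,
    aggs={"per_hour": {"date_histogram": {"field": "@timestamp", "fixed_interval": "1h"}}},
)

for bucket in trend["aggregations"]["per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```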

Providing better tools and pushing hard for them should be the job of engineering leads, because the technology landscape keeps changing rapidly and whoever can move fast and recover from failures quickly will survive. Better tools also create happy engineers, because they can get the mundane things done quickly and go back to doing what they love (in my case programming, gardening, and spending time with family).
