Saturday, February 28, 2015

New Relic and Statuspage.io integration

Monitoring tools expose a lot of data, and we use Nagios, Cacti, Graphite, New Relic, Mixpanel, Flurry, Boundary and many more.  But one ask from the Support and Marketing teams is: how can they know internally when something is wrong? We can't expect them to wade through so many systems and applications to make sense of what is operational and what is not.  For example, we use a lot of services to serve the cloud filer server solution; this is the first page of the status of our services in New Relic, and the list spans 2 pages.

The Support team and management rely on the Operations team to notify them when an issue is ongoing. To make this easy, I did a proof-of-concept integration of the application responsible for serving the main website with Statuspage.io.  The idea is simple:

  1. Create public metrics in Statuspage.io that are human readable.
  2. Query New Relic and various other systems for application status.
  3. Map New Relic's green/yellow/red status, and the status of other systems, to a Statuspage.io status.
  4. If there is a status change, update the Statuspage.io metric status.
  5. Statuspage.io lets users subscribe to status changes via SMS or email, so Support can sign up for that.
  6. Once this is fully baked, we can make it public and let customers do the same.
 

Since this is a proof of concept, I wrote a Python script that integrates only New Relic, and only for one application. I set up a cron job that runs it every 2 minutes. We can enhance the script to derive status from various tools such as New Relic, Nagios and Boundary and map it transitively to a human-readable status. We can then even load this page on TV screens in all offices to improve visibility.
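A minimal sketch of how statuses from several tools could be folded into one human-readable status, assuming a "worst reading wins" rule. The Nagios mapping and the choice to treat an unrecognized reading as degraded are my assumptions for illustration; only the New Relic mapping comes from the script in this post.

```python
# Statuspage.io component statuses, ordered from least to most severe.
SEVERITY = ["operational", "degraded_performance", "partial_outage", "major_outage"]

# Per-tool mappings. Only the New Relic one is taken from the script in this
# post; the Nagios mapping is a hypothetical example.
TOOL_STATUS_MAPS = {
    "newrelic": {"green": "operational", "yellow": "degraded_performance", "red": "major_outage"},
    "nagios": {"OK": "operational", "WARNING": "degraded_performance", "CRITICAL": "major_outage"},
}


def worst_status(readings):
    """Fold (tool, raw_status) pairs into the most severe mapped status."""
    worst = "operational"
    for tool, raw in readings:
        # Assumption: a reading we cannot map is treated as degraded, not ignored.
        mapped = TOOL_STATUS_MAPS.get(tool, {}).get(raw, "degraded_performance")
        if SEVERITY.index(mapped) > SEVERITY.index(worst):
            worst = mapped
    return worst


print(worst_status([("newrelic", "green"), ("nagios", "WARNING")]))
```

Taking the worst reading keeps the public page conservative: a single tool reporting trouble is enough to flag the component.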

The script below was written as a proof of concept; I cooked it up in 2-3 hours including API research, so use it with caution.

import sys
import requests
import logging
from logging.handlers import RotatingFileHandler

spiPageId = "PutYourPageIdHere"
cloudComponents = ["Metadata APIs","Web Interface","Webdav","Sharing"]
nrToSpiStatusMap = {"green":"operational", "yellow":"degraded_performance", "red":"major_outage"}

def getNewRelicApplications(apiKey):
    """Return {application name: health_status} for the New Relic apps we care about."""
    interestedApps = ["cloud", "mobile"]
    status = {}
    url = "https://api.newrelic.com/v2/applications.json"
    headers = {'X-Api-Key': apiKey}
    logging.info(url)
    r = requests.get(url, headers=headers)
    r.raise_for_status()  # fail loudly on a bad key or an API error
    payload = r.json()
    for application in payload["applications"]:
        if application["name"] in interestedApps:
            status[application["name"]] = application["health_status"]
    return status

def getSpiComponents(spiApiKey):
    """Fetch all components of the Statuspage.io page."""
    url = "https://api.statuspage.io/v1/pages/%s/components.json" % spiPageId
    # The header name must not include a trailing colon.
    headers = {'Authorization': "OAuth %s" % spiApiKey}
    logging.info(url)
    r = requests.get(url, headers=headers)
    return r.json()

def updateSpiComponent(spiApiKey, componentId, status):
    """Set the status of a single Statuspage.io component."""
    url = "https://api.statuspage.io/v1/pages/%s/components/%s.json" % (spiPageId, componentId)
    headers = {'Authorization': "OAuth %s" % spiApiKey}
    data = {"component[status]": status}
    logging.info(url)
    r = requests.patch(url, data=data, headers=headers)
    return r.json()
   
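def deriveSpiComponentToUpdate(spiApiKey, newRelicAppStatus):
    """Return {component id: new status} for cloud components whose status differs."""
    componentsToUpdate = {}
    nrCloudStatus = newRelicAppStatus.get("cloud")
    spiCloudStatus = nrToSpiStatusMap.get(nrCloudStatus)
    if spiCloudStatus is None:
        # App missing from the response, or a health status we have no mapping for.
        logging.warning("No Statuspage.io mapping for New Relic status %r", nrCloudStatus)
        return componentsToUpdate
    spiComponents = getSpiComponents(spiApiKey)
    for component in spiComponents:
        if component["name"] in cloudComponents and component["status"] != spiCloudStatus:
            componentsToUpdate[component["id"]] = spiCloudStatus
    return componentsToUpdate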

def updateSpiComponents(spiApiKey, componentsToUpdate):   
    for componentId in componentsToUpdate :
        status = componentsToUpdate[componentId]
        updateSpiComponent(spiApiKey, componentId, status)
            
def init():
    logger = logging.getLogger('')
    logger.setLevel(logging.DEBUG)

    log_format = 'Process-%(process)d %(asctime)s %(levelname)-8s %(message)s'
    handler = RotatingFileHandler("statuspage.log", maxBytes= 20 *1024 * 1024 , backupCount=10)   
    handler.setFormatter(logging.Formatter(log_format))
    logger.addHandler(handler)
   
# Main
if __name__ == "__main__":
    newRelicApiKey = sys.argv[1]
    spiApiKey = sys.argv[2]
    init()
    newRelicAppStatus = getNewRelicApplications(newRelicApiKey)
    componentsToUpdate = deriveSpiComponentToUpdate(spiApiKey, newRelicAppStatus)
    updateSpiComponents(spiApiKey, componentsToUpdate)    
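The comparison step can be tried out without any API keys by pulling it into a pure function. This is a sketch, not part of the script above; the function name and the sample components are made up, shaped like the fields the script reads (id, name, status).

```python
def components_needing_update(components, watched_names, target_status):
    """Return {component id: target_status} for watched components not already at it."""
    return {
        c["id"]: target_status
        for c in components
        if c["name"] in watched_names and c["status"] != target_status
    }


# Made-up sample data; real components come from the Statuspage.io API.
sample = [
    {"id": "a1", "name": "Web Interface", "status": "operational"},
    {"id": "b2", "name": "Sharing", "status": "major_outage"},
    {"id": "c3", "name": "Billing", "status": "operational"},
]

# Only "Web Interface" is watched and not yet at the target status.
print(components_needing_update(sample, ["Web Interface", "Sharing"], "major_outage"))
# prints {'a1': 'major_outage'}
```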


Move fast break things but with monitoring

We run a complex system with multiple services, and every 2 or 3 weeks we update the Java applications.  I want to do it every week, as most applications are stateless and can be patched anytime, but the application serving the main website uses sticky sessions. We are working on making sessions fail over; once we do that, we can do mid-week deployments, and that will let us ship more often than every 3 weeks.  This week I pushed a huge infrastructure change related to user ID generation. I had asked the ops team to check the status in New Relic after the midnight deployment; it looked like this, so everyone was happy.


I woke up and checked the New Relic mobile app, and things looked OK to me. After finishing my morning routine I ran my daily exception report, and one thing that caught my eye was 90K exceptions in the last 12 hours in one of the files I had changed.  To gauge the impact I went into New Relic, and it showed an error rate of 0.07% in one of the apps.

I then checked New Relic and saw the blip that caused the 90K errors. While the blip lasted the status must have been red, but it quickly went back to green and the ops team didn't catch it.
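One way to keep a short-lived red from going unnoticed is to remember recent readings and report the worst one seen in a time window, instead of only the current status. The sketch below is an illustration under that assumption; the class and the 12-hour window are hypothetical, and it only helps if the status is polled more often than the blip lasts.

```python
from collections import deque

# New Relic colors used in this post, ranked by severity.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


class StatusHistory:
    """Remembers recent health readings so a brief red is still visible later."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.readings = deque()  # (timestamp, status), oldest first

    def record(self, timestamp, status):
        self.readings.append((timestamp, status))
        # Drop readings that have aged out of the window.
        while self.readings and self.readings[0][0] < timestamp - self.window_seconds:
            self.readings.popleft()

    def worst(self):
        """Most severe status still inside the window, or None if empty."""
        if not self.readings:
            return None
        return max((s for _, s in self.readings), key=lambda s: SEVERITY.get(s, 0))


history = StatusHistory(window_seconds=12 * 3600)
history.record(0, "green")
history.record(120, "red")    # the blip
history.record(240, "green")  # back to green two minutes later
print(history.worst())        # prints red
```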


The issue was a wrong method invocation in one of the classes used by this one specific application, and it took just 15 minutes to fix after reproducing it with a test case.  So why didn't QA, UAT or the automated tests catch it? The issue was like a heisenbug and would occur only when the object was missing from the cache. More than 99% of the calls were served from the cache and only 0.07% went to the DAO layer that had the bug.  I quickly made a patch, ops deployed it, and I can see the issue is now gone.
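A test that deliberately starts with a cold cache is one way to exercise a rarely taken DAO path like this. The sketch below uses Python and hypothetical stand-in classes (the real application is Java and its classes are not shown in this post); it only illustrates the shape of such a test.

```python
class CountingDao:
    """Hypothetical DAO stand-in that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def load(self, user_id):
        self.calls += 1
        return {"id": user_id}


class UserService:
    """Cache-aside lookup: serve from the cache, fall back to the DAO on a miss."""

    def __init__(self, dao):
        self.dao = dao
        self.cache = {}

    def get(self, user_id):
        if user_id in self.cache:
            return self.cache[user_id]
        user = self.dao.load(user_id)
        self.cache[user_id] = user
        return user


def test_cache_miss_goes_through_dao():
    dao = CountingDao()
    service = UserService(dao)
    assert service.get(42) == {"id": 42}  # cold cache: the DAO path runs
    assert dao.calls == 1
    assert service.get(42) == {"id": 42}  # warm cache: the DAO is skipped
    assert dao.calls == 1
    service.cache.clear()                 # force another miss
    service.get(42)
    assert dao.calls == 2


test_cache_miss_goes_through_dao()
```

Tests that only ever run against a warm cache would pass even if the DAO call were broken, which matches how this bug slipped through.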

Without monitoring it would have been difficult to catch this kind of bug. New Relic and our internal monitoring tools make life easier because they expose anomalies.

Wednesday, February 11, 2015

Email slavery

It seems I have become an EmailSlave. The first half of the day is spent just answering emails. There are so many emails where I am copied but need not be. There are many emails that run 1-2 pages, and somewhere far down someone says "@KP please answer this".  So it seems my daily work schedule is:
  1. Sign in to New Relic and check anomalies for 15 minutes.
  2. Check emails related to the production exception reports, and yes, there are a ton of these reports daily. I need a better tool here, as this model is not scalable. I need to reduce the data coming at me so I only see what is relevant, like New Relic does. Maybe I need to create a webapp out of these emails.
  3. Check emails for the next few minutes before team calls.
  4. Do team calls.
  5. Then back to checking emails until I have taken my best shot at answering everyone waiting for my reply.
  6. Attend team meetings on Tue/Thu.

Being an architect and coder at heart, I don't feel satisfied at the end of the day if nothing tangible got done. Yes, I can say I replied to 100 emails in a day and did 2 calls, but that seems like bullshit.  This article, http://tomtunguz.com/burnout/, puts it this way:

"I suspect burnout is much more pronounced for information workers - people who deal in bits each day - because unlike a mason or an architect, the product of much of our work isn’t visible. Even the tools we use, virtual to do lists and email, hide the work we’ve completed: tasks checked off and emails sent."

I had the same problem when I was doing daily calls with the team, but I have reduced it a lot by only talking to the team on Monday, Wednesday and Friday, and instead of talking to 5 people I now talk to only 2.  I need to find something similar for email.


It seems I am not the only one chasing the dreaded #inboxzero. I reached it once on December 15th, but I am back again to the dreaded slavery.

As per this venturebeat article http://venturebeat.com/2015/02/09/sendgrid-launches-iphone-app-so-marketers-can-track-their-email-performance-stats-on-the-move/  "Email’s demise may have been greatly exaggerated — certainly if SendGrid’s efforts are anything to go by.
Last year, we reported that SendGrid had sent out almost as many emails as McDonald’s had sold burgers. A year on from that, SendGrid reports that it has sent more than 300 billion emails since launch, equating to an average of 435 million emails per day, or 15 billion emails per month.
"
Action Items:

  1. Every time I hit Reply All, I will try to think about whether all these people need to be on the email.
  2. Try writing short emails. Twitter makes writing harder because it allows only 140 characters, but you eliminate a lot of garbage when you are constrained to 140 characters.  See http://www.igzebedze.com/2014/12/business-haiku-efficient-emails/?wpst=1