I am a big fan of eliminating human touch points when it comes to scalable systems. When I say "human touch" I mean a step that involves a human to perform the operation rather than relying on bots. A good example is db schema updates. We use mysql sharding and recently we split our global db into 12 sharded databases with identical schema but diffrent hosts. As we were crunched on time, I didnt got time to write an automatic schema applier, the process agreed upon was to put the ddl statements in deployment notes and devops would apply it in all databases.
As usual I am always skeptical of human touch points so after 2 months of going live just for my curiosity I wrote a mysql schema differ that would dump all 12 schemas and diff them and to my surprise there were schema differences. Four of the mysql servers were setup with latin1 characterset rather than utf8_bin. No one caught the issue because that table was created a sleeper mode and the feature is yet to go live next month.
We use sharding in other parts of the system also, so I wrote one more script on those databases and there are indexes missing on some shards. The alter scripts are auto generated by python but applied by humans and this human touch caused the inconsistency.
Lesson learnt is that either you remove the human touch point or write a consistency checker to ensure humans did the job correctly. Once I get time I will write the schema applier bot instead of relying on humans.
Being lazy is one of the key principles of a good software engineer and when someone tries to involve me as a human touch point, my first goal is how can I remove myself from this process :). Like I was asked after every release to grep logs from production nodes and see how is release doing. I would rather have a bot do this job and enjoy my weekend. It was a pain to write the bot but once I write it, I am out of the equation and if I leave the company the process still works as I am not in the critical path.
As usual I am always skeptical of human touch points so after 2 months of going live just for my curiosity I wrote a mysql schema differ that would dump all 12 schemas and diff them and to my surprise there were schema differences. Four of the mysql servers were setup with latin1 characterset rather than utf8_bin. No one caught the issue because that table was created a sleeper mode and the feature is yet to go live next month.
We use sharding in other parts of the system also, so I wrote one more script on those databases and there are indexes missing on some shards. The alter scripts are auto generated by python but applied by humans and this human touch caused the inconsistency.
Lesson learnt is that either you remove the human touch point or write a consistency checker to ensure humans did the job correctly. Once I get time I will write the schema applier bot instead of relying on humans.
Being lazy is one of the key principles of a good software engineer and when someone tries to involve me as a human touch point, my first goal is how can I remove myself from this process :). Like I was asked after every release to grep logs from production nodes and see how is release doing. I would rather have a bot do this job and enjoy my weekend. It was a pain to write the bot but once I write it, I am out of the equation and if I leave the company the process still works as I am not in the critical path.
Comments
Post a Comment