Posts

Showing posts from 2013

Penetration testing and crowdsourcing

We take the security of our data and customers seriously, and any security issue found is patched ASAP. We hired outside security testing companies to do our testing in case the developers missed something. Initially we hired a company, let's call it YYYhat, and they were great in the first few months and found many standard issues like XSS, XSRF and session hijacking, but after some time no new issues were found. Then one day a customer reported an issue that YYYhat had not detected, so we hired another company, let's call it YYYsec. This company was good at finding some SQL injection and some XSS in other parts of the system that YYYhat had constantly missed. But again the pipeline dried up from YYYsec and we thought we were secure.

We even asked our engineers to start downloading penetration test tools and automating them to detect standard issues. But again they didn't find much.

Lately I am observing that these specialized security testing companies are a one skill or some …

Combine mysql alter statements for performance benefits

We have 1200+ shards spread across 20 MySQL servers. I am working on a denormalization project to remove joins and improve the performance of a big snapshot query. I had to alter a table to add 8 denormalized columns, so initially my script was

alter table file_p1_mdb1_t1 add column ctime BIGINT;
alter table file_p1_mdb1_t1 add column size BIGINT;
alter table file_p1_mdb1_t1 add column user_id BIGINT;

In the performance environment, when I generated the alter script and started applying it, I saw the output below, where each alter takes about 30 sec. I was like, this could take 80 hours to alter 1200 shards with 8 alters per table. Even if we do 20 servers in parallel this would still take 4 hours.

Query OK, 1446841 rows affected (33.58 sec)
Records: 1446841  Duplicates: 0  Warnings: 0

Query OK, 1446841 rows affected (31.66 sec)
Records: 1446841  Duplicates: 0  Warnings: 0

Query OK, 1446841 rows affected (31.86 sec)
Records: 1446841  Duplicates: 0  Warnings: 0

Query OK, 1446841 rows affected (32.15 s…
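The estimate above, and the idea in the title of folding several ADD COLUMN clauses into one ALTER TABLE statement, can be sketched roughly as below. This is only an illustration: the helper names are made up, only the three columns shown above are used, and the 30 second figure is the approximate per-alter time from the output above.

```python
SHARDS = 1200
ALTERS_PER_TABLE = 8
SECONDS_PER_ALTER = 30  # approximate, taken from the timings above
SERVERS = 20


def estimate_hours(shards, alters_per_table, seconds_per_alter, parallel=1):
    """Hours needed if every ALTER takes the same time and work splits evenly."""
    total_seconds = shards * alters_per_table * seconds_per_alter
    return total_seconds / parallel / 3600.0


def combined_alter(table, columns):
    """Build one ALTER TABLE statement that adds all the columns at once.

    `columns` is a list of (name, type) pairs. Names are not escaped, so this
    is only meant for trusted, script-generated input.
    """
    clauses = ", ".join(
        "ADD COLUMN %s %s" % (name, col_type) for name, col_type in columns
    )
    return "ALTER TABLE %s %s;" % (table, clauses)


if __name__ == "__main__":
    # One ALTER per column, run serially and then across 20 servers.
    print(estimate_hours(SHARDS, ALTERS_PER_TABLE, SECONDS_PER_ALTER))
    print(estimate_hours(SHARDS, ALTERS_PER_TABLE, SECONDS_PER_ALTER, SERVERS))
    print(combined_alter(
        "file_p1_mdb1_t1",
        [("ctime", "BIGINT"), ("size", "BIGINT"), ("user_id", "BIGINT")],
    ))
```

If a combined statement costs about as much as a single alter, setting `alters_per_table` to 1 in the estimate gives a feel for the possible saving; that is an assumption to measure, not a guarantee.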

Mysql replication and DEFAULT Timestamp

If you are using MySQL replication to replicate data across data centres and using statement-based replication, then don't use the DEFAULT and ON UPDATE clauses on timestamp fields:

last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP

We recently set up a simple ring replication across 3 data centres to manage our user account identities and ran into this issue: when the same statement is applied in another data centre, the timestamp column gets a different value. For now we will remove this and generate the time from Java so that we can run a hash-based consistency checker on replicated rows.
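A hash-based consistency check of the kind mentioned above could look roughly like the sketch below. It is written in Python for brevity rather than Java, and the column names and helper functions are hypothetical. The point is only that when the application supplies the timestamp, the same row hashes to the same value in every data centre.

```python
import hashlib
import time


def now_millis():
    """Timestamp generated by the application instead of by MySQL."""
    return int(time.time() * 1000)


def row_fingerprint(row, columns):
    """Hash the named columns of a row (a dict) in a fixed order."""
    digest = hashlib.sha256()
    for name in columns:
        value = row.get(name)
        # Keep NULL distinct from the empty string and length-prefix each field
        # so that adjacent values cannot run into each other.
        if value is None:
            encoded = b"\x00"
        else:
            encoded = b"\x01" + str(value).encode("utf-8")
        digest.update(len(encoded).to_bytes(4, "big"))
        digest.update(encoded)
    return digest.hexdigest()


if __name__ == "__main__":
    columns = ["user_id", "email", "last_updated"]  # hypothetical columns
    row = {"user_id": 1, "email": "a@example.com", "last_updated": now_millis()}
    replica = dict(row)  # what another data centre should hold
    print(row_fingerprint(row, columns) == row_fingerprint(replica, columns))
```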

Url rewriting XSS and tomcat7

If you have URL rewriting enabled and your site has an XSS vulnerability, then the site can be hacked by reading the URL in JavaScript and sending session IDs to remote servers.

To disable URL rewriting on Tomcat 7 the fix is simple.

Add this to your web.xml:
    <session-config>
        <tracking-mode>COOKIE</tracking-mode>
    </session-config>

Also you should mark your session ID cookie Secure and HttpOnly.

Velocity, code quality and manual QA

Unless you can remove manual QA, you can either have code quality or you can have release velocity. You can't have both goals; you need to compromise on one of them.

We recently did a big release where we replaced a lot of old DWR APIs with REST APIs and revamped a big part of the old ExtJS UI with a Backbone-based UI. The release was done over 4 weeks, and even after delaying it by 1 week, some bugs were found by customers. I had expected some issues and I am a proponent of "fail fast and fix fast". Many of these issues we were able to find and fix over the weekend itself.

In the coming week we found an issue on Friday afternoon, and we had a similar bug fixed for the next release that would fix this bug also, but QA was unsure about merging this fix as most of the QA team was out. This is the second-last week of December and it seems everyone is out for Christmas, so it was decided to delay the fix until QA was available. Now I have to daily …

nginx enable compression and JS/CSS files

So one of my old colleagues was trying to find a quick way to improve performance for his pet project website, so he asked me to take a look. The first thing I did was run pagespeed, and immediately I saw that compression was off and CSS/JS was not aggregated. JS/CSS aggregation would require either build changes or installing pagespeed, so I thought a quick 1 hour fix would be to just enable compression. I went and enabled "gzip on;" in the nginx conf and it gave a good bump on the login page, but for the home page after login it kept complaining about JS/CSS not being compressed. I was scratching my head and left it as is that day.

Last night I finally found out that turning gzip on only compresses text/html by default in nginx. You need to explicitly turn it on for other types. So I went and added the below and the problem was solved.

    gzip_types  text/plain application/xml text/css text/js text/xml application/x-javascript text/javascript application/javascript application/json;


Wh…

120K lines of code changed and 7000 commits in 4+ years and going strong

Wow, svn stats reports are interesting. I ran into a report recently showing that I made 7000+ commits to svn and changed 120K+ lines at my startup. Still I feel like I just joined and do not know enough parts of the system.

I wonder if there is a tool that can detect how many lines of code I have read. If I changed 120K lines and assuming I read 3 times that much code, it seems low.

how a loose wire can lead to strange network issues

On Thursday I started having weird network issues where my Skype would disconnect and the network would be choppy. I had to use 3G on my phone to make Skype calls with the team. But streaming services like Netflix would work fine. I tried everything from changing DNS to restarting the router to shutting down extra devices. Finally I remembered that the last time it happened, Time Warner Cable had asked me to plug the laptop directly into the modem to rule out router issues.

To do this I connected the laptop to the modem with ethernet and it was also choppy. Then suddenly I thought, let me check the connections, and found that the incoming cable connection to the modem needed 1-2 turns to tighten it. That's it, all issues solved, but this loose wire definitely gave me some heartburn for 2 days, and even to my son, who would start complaining because his Netflix would constantly show the red download bar :).

eclipse kepler pydev not working

On my new laptop I installed Eclipse Kepler and then tried installing PyDev, and no matter what I did PyDev wouldn't show up. Finally I figured out that the latest PyDev requires JDK 1.7, and unless I boot Eclipse in a JDK 1.7 VM it won't show up. WTF, it's pretty lame; at least it could show some warning dialog or something.

Anyways my startup uses JDK 1.6 and I didn't want to pollute the default JDK, so the solution was to download the .tar.gz version from Oracle and extract it in some directory. Then I opened eclipse.ini in the Eclipse install directory and added

-vm
/home/kpatel/software/jdk1.7.0_45/bin/java

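and that's it. After I restarted Eclipse, PyDev showed up. Note that the -vm entry has to come before -vmargs in eclipse.ini, with the option and the path on separate lines.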


Change

Change is always hard for me. I installed Ubuntu 12.04 and didn't like Unity as I was so used to Ubuntu 10. I did everything possible to make the new Ubuntu look like classic Ubuntu. But if you want to use classic Ubuntu in Ubuntu 12.04 then it's bare bones and you have to install everything manually. There were many things that kept crashing with no fix for them. Also some things like seeing Skype and Pidgin in the systray were a must for me. Ultimately I gave up and tried Unity.

Honestly I still don't like Unity that much, but I have adjusted to its nuances and slowly started liking it.

Inertia

A lot of inertia is built up when you have been using a laptop for 4 years. I got a new company laptop and I was procrastinating on switching to it. The primary reason was that some of the things that I use daily were not working on the new laptop, mostly Pidgin and PyDev. Also Evolution and other settings have changed in Ubuntu 12.04.

Finally I found out why Pidgin was not working on my new laptop. It seems I have an 8 year old router and somehow one of the devices in my home keeps hijacking the IP address assigned to my new laptop. I found this out because web browsing was randomly slow on the new laptop, and when I did ifconfig I realized it was assigned the 192.168.2.2 IP, and I remembered the hijacking of this IP by the Roku or some other device. Anyways, assigning a static IP to my wireless and ethernet connections seems to have solved the issue.

I am up and running on the new laptop, but still lots of small things are annoying when you change to a new laptop, like the new unity sucks so I switc…

Building reusable tools saves precious time in future

I hate cases where I have made a code fix but the fix is going live in 4 days and the customer can't wait for it. Now you have to go and fix the data manually or give some workaround. Anyways, today I ran into an issue where, due to some recent code change, a memcache key was not getting flushed, and the code fix for that is supposed to go live in 3 days. But a production support guy called me to fix it for the customer, and I can understand the frustration on the customer's part. So the choices were:

1) Ask the production support guy to flush the key in all memcached instances with the help of ops.
2) Manually go and flush the key myself. This could take anywhere from 10-15 min, plus it's boring doing data cleanup.
3) Flush the entire memcached cluster. This is a big NO NO.

So I was about to go down path #2 when I realized that there is an internal REST API that the production support guy can call to flush an arbitrary key from memcached. Hurray, I just told him what keys to flush and he knew how to call such api…

you just have to be desperate

I recently set up a new Ubuntu machine and coincidentally my company Jabber account stopped working on both the new and the old laptop. I was busy so I didn't spend much time on it and sent it to ops, and they said it's an issue on my side. I was puzzled, because how can it be an issue on both laptops?

But using Skype is a pain and it's also not secure, so I had to find a way to fix it. Finally I found that you can start Pidgin in debug mode using
pidgin -d > ~/debug.log 2>&1


Then I tailed the logs and immediately I saw

(17:44:44) account: Connecting to account kpatel@mycompany.com/.
(17:44:44) connection: Connecting. gc = 0x1fc9ae0
(17:44:44) dnssrv: querying SRV record for mycompany.com: _xmpp-client._tcp.mycompany.com
(17:44:44) dnssrv: found 1 SRV entries
(17:44:44) dns: DNS query for '127.0.0.1' queued
(17:44:44) dnsquery: IP resolved for 127.0.0.1

Doing a dig on the record shows it was correct

dig +short SRV _xmpp-client._tcp.mycompany.com
10 0 5222 my…

Human touch and scalable systems

I am a big fan of eliminating human touch points when it comes to scalable systems. When I say "human touch" I mean a step that needs a human to perform the operation rather than relying on bots. A good example is db schema updates. We use MySQL sharding and recently we split our global db into 12 sharded databases with identical schemas but different hosts. As we were crunched on time, I didn't get time to write an automatic schema applier, so the process agreed upon was to put the DDL statements in the deployment notes and devops would apply them in all databases.

As usual I am skeptical of human touch points, so 2 months after going live, just for my curiosity, I wrote a MySQL schema differ that would dump all 12 schemas and diff them, and to my surprise there were schema differences. Four of the MySQL servers were set up with the latin1 character set rather than utf8_bin. No one caught the issue because that table was created in sleeper mode and the feature is yet to go li…
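A schema differ along those lines can be quite small. The sketch below is not the tool described above, just one possible shape for it in Python: it assumes the schema of each host has already been dumped to text (for example with a no-data dump), and the function names and host names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def normalize(ddl):
    """Strip indentation, blank lines and AUTO_INCREMENT counters.

    The counters legitimately differ between shards, so they are removed
    before comparing.
    """
    lines = []
    for line in ddl.splitlines():
        line = re.sub(r"\s*AUTO_INCREMENT=\d+", "", line.strip())
        if line:
            lines.append(line)
    return "\n".join(lines)


def group_by_schema(dumps):
    """Map schema fingerprint -> sorted list of hosts that share it."""
    groups = defaultdict(list)
    for host, ddl in dumps.items():
        fingerprint = hashlib.sha1(normalize(ddl).encode("utf-8")).hexdigest()
        groups[fingerprint].append(host)
    return {fp: sorted(hosts) for fp, hosts in groups.items()}


def odd_ones_out(dumps):
    """Hosts whose schema differs from the most common one."""
    groups = group_by_schema(dumps)
    if not groups:
        return []
    majority = max(groups.values(), key=len)
    return sorted(
        host
        for hosts in groups.values()
        if hosts is not majority
        for host in hosts
    )


if __name__ == "__main__":
    dumps = {
        "db1": "CREATE TABLE t (id INT) ENGINE=InnoDB DEFAULT CHARSET=utf8;",
        "db2": "CREATE TABLE t (id INT) ENGINE=InnoDB DEFAULT CHARSET=utf8;",
        "db3": "CREATE TABLE t (id INT) ENGINE=InnoDB DEFAULT CHARSET=latin1;",
    }
    print(odd_ones_out(dumps))
```

Run on a schedule, a check like this removes the need to trust that every DDL statement was applied by hand on every host.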

Black Friday or Black Thursday

I remember 8 years ago when I came to the US, people used to wake up early to get in line for 6:00 AM or 8:00 AM on the Friday after Thanksgiving. Then 2 years ago retail stores started opening around 12:00 at night, then last year some stores opened at 10:00, but this year stores opened at 6:00 PM on Thanksgiving day itself.

I am not an Indian and haven't started celebrating Thanksgiving yet, but even to me this sounds like BS as it seems Black Friday is becoming Black Thursday.

Perception of downtime

5 years ago when I joined the startup, if a node went down you got some time to analyze it and bring it up at your own pace. Nowadays the startup has grown to millions of users. Today I was called in because a node was inaccessible, and I asked for 10 min and the ops guy was like "huh, this is an emergency". I was like, it's a big stacktrace so give me 10 min to analyze it.

Downtime perception changes when you grow :).

why people think I get better projects to do than them

I got this feedback from two of my colleagues at the company. One joined and left within a year because he said the three initial engineers at the startup get all the best projects and he joined late. I call this BS; there is one principal engineer who joined after him, delivered a game-changing product at the company and earned respect from his peers.

This week I got feedback from another colleague that I am doing the complex and better projects. Hmm... I was like wth.

I do crappy projects also, but maybe what people see is the better ones and not the crappy ones that get delivered side by side.

Also in a startup it all depends on who takes the ball and gets it rolling. We recently uncovered security issues, and as I was curious I took the initiative, got the ball rolling and got it done across all the products that I could.

Similarly the project of converting from BDB or LDAP to MySQL was known to be needed for a long time but no one was owning it, and I took the initiative and got the ball rolli…

Doing side project keeps you motivated and make you smart in your real job

I am already in my 5th year at the company and it's the usual 4 year itch. Things have not become boring in the startup, but for the past year I have become a sweeper. It seems we found this nirvana in MySQL and I am transforming all products to store their data in sharded MySQL to solve the scalability issues in each component. The cloud file system is a field that is yet to be cracked; we now scale to store billions of rows and we can continue to add MySQL servers and scale, but there are still interesting problems like how to scale a single customer to 100M+ files. We currently can scale to 10M+ files for live data and 25M for long-lived archival data. But again, day to day it's boring as there is nothing new to learn.

So how do you motivate yourself when you work from home and your work is becoming boring? Well, do a side project, open source or something that you want to do but can't introduce at your company. I started reading about Hadoop, but again I was just reading and not doing.

Then one of my friends came up to me an…

install ubuntu on dell latitude 6430 and UEFI mode

I got this Dell laptop from my employer and it had Windows 7 on it. I installed Ubuntu side by side with Windows and it wouldn't show the Ubuntu option during boot. I tried installing multiple times, and ultimately I thought I would wipe out Windows and install Ubuntu. I did that and suddenly I got the message "Invalid partition table". I was like, wtf happened.

I tried booting many times with the same result. I thought I would press F12 and boot, and luckily I saw a weird option "ubuntu" under UEFI, so I gave it a chance and it booted up. Now I had to press F12 daily to boot Ubuntu, and after getting tired of this I remembered I had seen this UEFI somewhere in the BIOS settings. I rebooted and pressed F2, and I saw that under Boot Sequence there were Legacy and UEFI. I selected UEFI and chose ubuntu as the option. That's it, no more invalid partition table on boot.


LDAP loses to mysql when it comes to HA

Before I joined my employer they were already using LDAP for storing user and customer data. The reason to pick LDAP was that it matched with Active Directory.

It seems we can no longer scale on ldap. The main reasons for us to move away from ldap:

1) Index creation requires restarting LDAP. WTF. This is a big no-no for any decent-size company because it makes LDAP a single point of failure.
2) Schema changes require an LDAP restart.
3) Not as much community support as we have for MySQL and other relational databases.
4) Scaling developers is tough; for most people LDAP is alien technology.
5) Loading 5K users into LDAP took 2 hours, because after a point, like 5-10M users in an LDAP, the insert performance just sucks.

We recently migrated some customers from LDAP to MySQL and all 5 points above are solved by moving to MySQL. Even the insert performance rocks in MySQL.

Dumping ldap schema

I recently ran into an issue where I checked in an updated schema but it wouldn't get reflected. Ops kept saying it's updated but the app code wouldn't show it.

Anyways, to prove it I wondered how I could dump the LDAP schema, and finally found


ldapsearch -LLL -x -h "xxx.xxx.202.131" -b cn=Subschema -s base '(objectClass=subschema)' +

Ensure you don't miss the + sign at the end; I didn't add it and spent 15 extra min debugging.

Dell latitude 6430u drifting mouse pointer

Today I received a brand new laptop shipped to my home from my company. I opened it and booted it, and immediately I realized something strange: I couldn't use the touchpad, as no matter what I did the pointer would just drift randomly to one corner of the screen. I tried various options and I was surprised to see that even when I touched the sides of the laptop that were not part of the touchpad area, the pointer would drift. I thought maybe it's the first boot, and somehow, using the keyboard, I finished the initial Windows boot. But after a restart, the same issue.

I plugged in an external mouse and it worked fine. I tried using the joystick pointer and it would also drift.

I tried various options and then thought I got a lemon or something went wrong in shipping. I read a forum post about pressing Fn + F6 to disable the joystick. I tried that and it worked like a charm.

WTF, this is the second issue I have seen with Dell; not sure what their QA team tested before certifying the laptop. Anyways I work 90% using e…

Dell latitude 6430u boot order issue

Got a new Dell laptop from the company and the first thing I saw was that it came with an external USB-pluggable DVD drive. I inserted an Ubuntu CD in it and rebooted the computer. I pressed F2 to change the boot order to DVD, then USB, then HDD. After restarting, it wouldn't boot from the Ubuntu CD. I tried various options but then gave up.

Suddenly I remembered F12, so I rebooted and pressed F12 during boot, and from there when I chose DVD it worked fine.

God knows what issue it was but it seems either I am missing something very minor or Dell BIOS is missing something major.

mysql weird trailing space in where query

I was surprised to see that MySQL can be this dumb.

The varchar field preserves trailing spaces, but you can run any of the queries below and all three will match '/Shared/test'

select  * from folders where path ='/Shared/test'
select  * from folders where path ='/Shared/test '
select  * from folders where path ='/Shared/test      '

As per http://bugs.mysql.com/bug.php?id=64772 it's a feature, not a bug. WTF
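The behaviour comes from the PAD SPACE rule that MySQL collations of that era apply to the = operator: trailing spaces are ignored when comparing. A rough model of that rule, in Python and ignoring case folding and other collation details, is shown below. The function name is made up for the illustration.

```python
def pad_space_equal(left, right):
    """Model of '=' under a PAD SPACE collation: trailing spaces are ignored."""
    return left.rstrip(" ") == right.rstrip(" ")


if __name__ == "__main__":
    stored = "/Shared/test"
    for candidate in ("/Shared/test", "/Shared/test ", "/Shared/test      "):
        # All three candidates compare equal to the stored value.
        print(repr(candidate), pad_space_equal(stored, candidate))
    # Leading spaces are still significant.
    print(pad_space_equal(stored, " /Shared/test"))
```

If trailing spaces matter, the MySQL manual notes that LIKE does treat them as significant, so comparing with LIKE (with wildcards escaped) is one way around it.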

Scaling pains changing ldap to mysql was like changing engine of a nascar during race

When you have a big customer base then you get lots of interesting problems like:

1) Customers trying to store 400K files in one folder
2) Customers trying to dump 100+ TB of files in one account
3) Customers with more than half of their data in Trash
4) Customers wanting to create > 64K users in an account
5) Customers with 25K+ users trying to do search/load/sort on users listing

Our startup used to use LDAP to store users and customer metadata and MySQL to store files/folder metadata.

Lately we have been hit by items #4 and #5, and that is causing scaling issues because no matter what we do LDAP write performance sucks when we go beyond a few million users. Earlier LDAP used to scale because we had 30-40 LDAP servers, but to reduce ops management issues we consolidated them to 4 per DC. It worked initially, but lately with #4 and #5 it's not scaling well. LDAP is alien to most programmers; there are only 2-3 guys on the team who know a little bit about it, so most of the time it remains an orphan and be…

mysql order by and bind params

We use a grid to show files, and when the user clicks on a header we sort by it.

I was using a statement like "order by :orderBy :sortDirection" and binding it in sql using spring but it was not doing any sorting.

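After spending 1+ hours on it I found out that MySQL doesn't support bind parameters for the ORDER BY column and direction, yikes. Bind parameters stand in for values, not column names or keywords, so the column and direction have to be put into the SQL text itself.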
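Since the column and direction cannot be bound, one common way to keep the query safe is to map the user's input through a whitelist and only then splice it into the SQL. Below is a minimal sketch of that idea in Python; the same approach carries over to a Spring named-parameter query. The column names, the `folder_id` filter and the function names are hypothetical.

```python
# Hypothetical mapping from grid header names to real column names.
ALLOWED_COLUMNS = {"name": "name", "size": "size", "ctime": "ctime"}
ALLOWED_DIRECTIONS = {"asc": "ASC", "desc": "DESC"}


def build_order_by(order_by, sort_direction):
    """Return an ORDER BY clause built only from whitelisted pieces."""
    try:
        column = ALLOWED_COLUMNS[order_by]
        direction = ALLOWED_DIRECTIONS[str(sort_direction).lower()]
    except KeyError:
        raise ValueError(
            "unsupported sort: %r %r" % (order_by, sort_direction)
        )
    return "ORDER BY %s %s" % (column, direction)


def build_query(order_by, sort_direction):
    """Values such as the folder id stay as bind parameters; only the
    whitelisted ORDER BY clause is concatenated into the SQL text."""
    return (
        "SELECT * FROM files WHERE folder_id = :folderId "
        + build_order_by(order_by, sort_direction)
    )


if __name__ == "__main__":
    print(build_query("size", "DESC"))
```

Anything not in the whitelist is rejected, so a header name coming from the browser never reaches the database as raw SQL.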

This is the second weird thing I noticed in 1 day. The other thing I noticed was that MySQL doesn't support multi-column updates.

You can't do

update files set (size,ctime)=(select size,ctime from versions where versions.file_id=files.file_id) where file_id=:xxx

Apparently you need to do a join query in the update, which makes the query very weird.

But even with these issues MySQL rocks in performance.

Memcached stale keys on new code deployment and evictions

Scalability problems are interesting. This weekend we deployed a new release and suddenly we started getting LDAP alerts from one data centre. All fingers were pointed at something new in the code making more LDAP queries, but I didn't think we had made any LDAP code changes; in fact we were moving away from LDAP to MySQL. We use 15G of memcached, and looking into the memcached stats I found that we were seeing evictions. We had bumped memory from 12G to 15G in the other data centre with the same code and that data centre was not having issues. So this problem was interesting.

At last we found out that our new release had made 90% of the memcached data stale. We store our objects as JSON in memcached, and removing deprecated fields from the objects made the JSON in the cache unusable. But the code bug was that the old JSON was still sitting in the cache consuming memory. As new JSON was getting pumped into memcached, the two were fighting for memory, and memcached LRU is slab based so when it evicts a slab it would ev…

EasyMock partial mocking

Wow, I didn't know I could partially mock a class with EasyMock. Earlier I used to inherit from the class, override the methods to mock, and use AtomicIntegers to determine which methods were called. Recently I realized I was dumb and I should use EasyMock to do that.

In the example below I needed to verify that if I get a JSON parsing exception for data stored in memcached, then the key is deleted from memcached. To test this I had to preserve all the code in the MemCacheWrapper class but mock just the delete method. Here is how I did it.

    @Test
    public void testDeleteOldKeyOnJsonParse() throws Exception {
        String key="user.1";
        MemCacheWrapper sut = EasyMock.createMockBuilder(MemCacheWrapper.class).addMockedMethod("delete").createMock();       
        EasyMock.expect(sut.delete(key)).andReturn(true);
        EasyMock.replay(sut);
        sut.convertJsonToObject(key, User.class, "crap");
        EasyMock.verify(sut);
    }

Creating a Platform rather than tool

There are a few things that I like the most because they are simple, elegant and can be extended by writing plugins.

1) Eclipse
2) Jenkins
3) Firefox
4) Ubuntu


Recently I started playing more with Jenkins, and the more I play with it the more my respect for it grows; the same happened with Eclipse, Firefox and Ubuntu.

The reason I think all of these are great is that they are a platform rather than a tool. They allow users to plug in their own code seamlessly. This way you get users to write code for your tool, and as soon as one guy writes a plugin you see the network effect because other guys now have more reasons to use the platform.

Of course creating a platform is much harder than writing a tool, but given a choice I would rather create a platform than just a tool.

Last minute fixes

I hate last minute fixes, especially on things that I don't understand much.

Today is release day and I was called by my product manager friend about a bug in a feature that I had no clue about.

With some help from the product manager I was able to reproduce the issue locally, and 90% of the problem is solved when you can reproduce the issue. It took 30 min to fix as it was an easy fix. It took more time to merge the damn fix. The reason was that the developer had totally different packages in release and trunk, and in trunk there were also many private interface changes.

The bug had been in production for 2 weeks, so why was it detected only after 2 weeks? Two reasons:
1) This was a custom feature developed for only this customer. I hate these WTF custom features.
2) Well, it seems the customer already knew about the bug, but Professional Services schedules calls on Friday and Friday night is our release, so the customer waited a week to tell us.

Nothing can be done about custom feat…

Applet and httpOnly session cookies

We use the jfileupload applet in our cloud server to allow users to upload a folder hierarchy from the browser.

Recently our security team found an issue that if our site was vulnerable to XSS then anyone could read the jsessionid cookie. To fix this I set the context attribute useHttpOnly="true" in the Tomcat server.xml, and most things were fine but the applet broke.

Now it was giving me nothing except "unable to load" and a NullPointerException string (no stacktrace) in the applet console. I first thought it was some local issue, but then I tried from multiple machines and got the same issue. Googling didn't help.

Finally, after spending 3-4 hours, I found that when the applet tries to download the jar files the requests were coming to Tomcat, and we were applying a WebSessionFilter that would redirect requests with no session to the login page.

Skipping .jar file downloads in the session filter check solved the issue. (I know, I know we should have used apache to serve the jar files and thats w…

weird memcached behavior

I need to chase this, but when I ran the memcached-top command on our prod boxes I saw we had a 1:20 read vs write ratio.


INSTANCE          USAGE  HIT %  CONN  TIME   EVICT  READ  WRITE
memcache01:12345  85.8%  94.7%  4039  1.1ms  1.5M   4.3T  80.9T
memcache02:12345  85.8%  93.4%  4022  1.2ms  1.6M   4.7T  80.5T

This is throwing me off. I will analyze what is causing it because we should technically see more reads and fewer writes.

injecting a request attribute into a jersey rest api

My colleague made an interesting comment that AOP is like a drug: once you have tasted it you can spot cross-cutting concerns, and the mere presence of duplicate code is a sign of one of Martin Fowler's bad smells in code.

We use Jersey to implement REST APIs, and our REST APIs can be called with Session, BasicAuth or OAuth. After a user is successfully authenticated we inject the calling user object into the request as an attribute so the REST API can retrieve it to make further API-level business validations. But that means every API method has to write this ugly piece of code where we inject the request into the signature and then add this one line of code to get the user object.

    public Response getDevicesForCustomer(@Context HttpServletRequest request, ....) {
        User user = (User) request.getAttribute("user");
        ...
    }


This sounds like a perfect cross-cutting concern to be baked into the AOP layer. What would be nice is if we could do it like this

    public Respons…

Jenkins archiving artifacts outside your workspace

It seems in Jenkins you can't really archive artifacts outside your workspace. I had a requirement to start Tomcat (job1), then run WebDriver tests (job2, which runs on a slave) and then archive the Tomcat logs (job3). But Tomcat lives outside the job1 and job3 workspaces. Well, it seems the solution is simple: add a shell step in your job that creates a soft link from the outside folder into your workspace, and then you can use that soft link to archive the artifacts.

Debugging random webdriver issues on jenkins

So one of my friends was facing an issue where he wrote a bunch of WebDriver tests and they all work fine, but when he runs them on a Jenkins slave they randomly fail. He runs his Jenkins job hourly and the problem is that it fails maybe 4 times in 24 hours. So how do you debug the issue? Well, he was adding loggers in the tests and then poring over logs to figure out what went wrong. This is quite a bit of guesswork, and I thought sometimes a picture is worth 1000 words. So it would be nice if I could take a screenshot when the test errors out and save it as an artifact. Guess what, WebDriver already has an API for that. So all I needed to do was add a TestRule like this and add the screenshots directory to the published artifacts in Jenkins. I will know in a week or so whether this saves him a lot of time.

    @Rule
    public TestRule testWatcher = new TestWatcher() {
        @Override
        public void failed(Throwable t, Description test) {
 …

Pagespeed and cache flush

Ran into an issue where customers would complain that random logins are slow and then subsequent requests are fast. It took a long time to debug because it was totally random.

Finally found that because we give each customer a unique subdomain like XXX.yyy.com, pagespeed was caching the aggregated bundles per subdomain. When we configured pagespeed we never configured the cache size, so by default it was taking 100M. Before we did Tomcat HA, domains were pinned to a node so we never ran into the issue, but after HA any customer can be served from any node, so every hour the cache was flushed and domains would see these random slow logins.

It took almost 2-3 hours to debug, and the only reason I was able to figure out the issue was that I was thinking, if I had to write pagespeed, how would I write it. Also I ran du on the cache and it was 267M, and luckily I saw in the apache error logs that the cache clean had run, ran du again and it was 97M.  I feel…

Customers are smarter than you

We just finished a call with a big customer who found 4 security issues in the product. While fixing them took only 1 hour, they were real issues. We have hired third-party consultants to monitor security issues in our product after every release, but it seems they gave the green signal and still the customer found issues.

So in short, customers are always smarter than you and there is a lot to learn from them if you keep your eyes and ears open.

Spring manual applying interceptor to a bean

We had a weird requirement where the same Spring bean needs to be deployed on two different node types (storage and metadata). When it's deployed on a metadata node it needs to talk to a specific shard database, and when it's deployed on a storage node it can talk to multiple shard databases. To achieve this on the metadata node we wrote our own transaction annotation and applied a TransactionInterceptor to the bean in Spring that would start a transaction before every method.

Now in order for the same bean on a storage node to talk to different databases over different transaction managers, we created a pool of beans at startup, each with its own interceptor. There we had to hand-create the beans, but now the problem was how to manually apply the same AOP interceptor. We tried using ProxyFactoryBean but it was not easy, and then my colleague landed on this, which saved the day.

                ProxyFactory proxyFactory = new ProxyFactory(sqlDirectoryService);
                proxyFactory.addAdvice(transact…

abnormal data migration

Cloud storage is a funny and interesting field. I just analyzed one data pattern where a customer sent a 4 TB hard drive and migrating just 1.2TB of it into the cloud created 5M files. Doing some rough calculation the avg size came out to be 300KB, which is weird. Digging deep into the system revealed that the customer had scanned all his documents into TIF and the sizes ranged from 10KB to 100KB to 300KB. wth.

Also they had a special LFT or loft file type that was 4112 bytes each, and there were 2M of them.

As of right now the sharding approach I implemented pins a customer to a shard, and that means if we migrate the entire 4TB we would end up with 30M+ files. Life is going to be interesting in the next few months.

It seems the solution I did a year back is already reaching its limits and I need some other solution to federate a customer's data across multiple shards and machines, while still being able to do a consistent MySQL backup and replication and also a 2-phase commit across multiple servers.

or …

webdriver windows server 2012 IE10 slow sendKeys keytype

I was writing a Selenium test for login. It ran fine on a local VirtualBox VM, but when I ran it on an EC2 Windows Server 2012 instance it was crawling: every keypress took 2-3 sec. At first I thought the autosuggest was slow, but even keypresses in the password field were slow. On googling I found that the 64-bit Selenium WebDriver native driver was the culprit.

Replacing that with a 32 bit driver fixed the issue.

Kibana/logstash integration with reports or JIRA

We use Logstash for our centralized logging. All our app nodes track exceptions and write them to Scribe; at night a Python Scribe consumer aggregates that information and sends an email with the top exceptions, their counts, and the first 200 characters of each exception.  The problem was that I then had to go through the email and, for each row, find the exception in Logstash or in the application logs. This was painful, and I wanted a better way.


So I changed our exception tracking code to include a requestId in the Scribe logs, and this requestId is logged with each exception to Logstash. Now all I needed to do was add a column to my table with a Logstash query link.  The problem was that it wasn't that straightforward. Ultimately, after reading some forums, I found that I need to prepare a search JSON, base64-encode it, and then build a URL from it, and voila, I was done.  What used to take 1 hour to trace 10 exceptions now takes me just 30 mins.
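A minimal sketch of that search JSON, base64, URL sequence in Python. The host name, the JSON keys ("search", "timeframe") and the requestId field syntax are assumptions made for illustration; the exact payload depends on the Kibana version in use.

```python
import base64
import json


def logstash_link(host, request_id):
    """Build a link whose URL fragment is a base64-encoded JSON search."""
    search = {
        # Hypothetical query syntax: match the requestId logged with the exception.
        "search": 'requestId:"%s"' % request_id,
        # Hypothetical key: look back over the last 24 hours (in seconds).
        "timeframe": "86400",
    }
    payload = json.dumps(search, sort_keys=True).encode("utf-8")
    token = base64.b64encode(payload).decode("ascii")
    return "https://%s/#%s" % (host, token)
```

Each row of the nightly email can then carry `logstash_link(host, requestId)` next to the exception count, so tracing an exception is one click.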


    linkPrefix = "https://%s-logs…

java Fair share Threadpool

To avoid the thundering herd problem we only allow X writes and Y reads at a time to the MySQL database from a node. Recently I introduced HA into our Tomcat stack, which reduced the number of nodes by 60%. The HA is helpful, but it can happen that one customer hogs all the threads in the cluster.  Before HA this would cause downtime on only one node, but now it has the potential to bring down 1/4th of the data centre.

To avoid this issue I looked at various alternatives, and the idea I settled on was a fair share thread pool that would put an upper bound on the number of threads per customer. But it was becoming too complex and I was not getting anywhere. I kept it as a background thread, and then the worst happened: yesterday we had downtime because one bad customer gobbled up all the reader threads.

So in a crunch I came up with a Java fair share thread pool approach by implementing a pool of thread pools. Each customer on our site has a random UUID called customerId, and all read/write methods have it in the …
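The pool-of-thread-pools idea can be sketched roughly as below. This is a Python illustration, not the Java code from the post; the class name FairSharePool, the sub-pool sizes, and the choice of hashing the customerId to pick a sub-pool are all assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class FairSharePool:
    """A pool of small thread pools; each customer is pinned to one of them."""

    def __init__(self, pools=4, threads_per_pool=2):
        self._pools = [
            ThreadPoolExecutor(max_workers=threads_per_pool) for _ in range(pools)
        ]

    def pool_index(self, customer_id):
        # Stable hash, so the same customer always lands on the same sub-pool.
        return zlib.crc32(customer_id.encode("utf-8")) % len(self._pools)

    def submit(self, customer_id, fn, *args, **kwargs):
        # A customer can occupy at most threads_per_pool threads at once;
        # extra work queues up behind that customer's own sub-pool.
        pool = self._pools[self.pool_index(customer_id)]
        return pool.submit(fn, *args, **kwargs)

    def shutdown(self):
        for pool in self._pools:
            pool.shutdown(wait=True)
```

With this layout a misbehaving customer can saturate only the sub-pool it hashes to. Customers that share that sub-pool are still affected, so the number of sub-pools is a trade-off between isolation and total thread count.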

Quartz HA remove unused triggers

I have to find a proper solution for this. Recently we deployed Quartz in HA mode, where job info is persisted in the db. I removed one job from the Spring config, but I kept getting a class not found exception because the job info was still in the db.

With RAMJobStore this was fine because it would reinitialize the jobs from config. For now I just deleted the rows from the db to move on, but I need to find a proper solution for this.

delete from QRTZ_SIMPLE_TRIGGERS where TRIGGER_NAME= 'GoogleDocsInfoCleanerJobTrigger';
delete from QRTZ_TRIGGERS where JOB_NAME='GoogleDocsInfoCleanerJob';
delete from QRTZ_JOB_DETAILS where job_name='GoogleDocsInfoCleanerJob';

Junit4 custom method order or custom MethodSorters

JUnit 4 provides three method sorters: org.junit.runners.MethodSorters.NAME_ASCENDING, org.junit.runners.MethodSorters.JVM and org.junit.runners.MethodSorters.DEFAULT.  Normally JUnit recommends that you don't order your tests, so that they can run in any order, and I 100% agree with that for unit tests.

But I had a requirement to write Selenium tests that would first register a user and then run a bunch of tests like add file, move file and copy file.  Each of these methods has to run in order, so I started naming methods like

test10Registration
test11AddFile
test12MoveFile

But soon this started to look like an odd method naming convention, so I wanted something where I could define my own order.  It turned out to be easy: all I needed was a custom runner. So I created this OrderedRunner and an annotation, annotated my test methods with the new annotation, and annotated my test class with RunWith to make it use this runner.

public class OrderedRunner extends BlockJUnit4ClassRu…

showing proper 500 error page

I went to www.hawkelectronics.com and found this.  One should have a proper 500 page on the website instead of exposing these kinds of errors :)


Benefits of indirection

We shard our databases and keep the metadata about which shard is located on which database host in a table.  We recently had a database load spike where, due to the thundering herd problem, lots of our clients would come and execute a very costly query concurrently.  The only real solution is to avoid making this query and change the application architecture, but that would take 1-2 months. We didn't have the luxury of waiting that long, so we were looking for solutions to buy time.

Luckily, to buy time we could throw more hardware at it: we had recently ordered a pair of MySQL servers for some other purpose, so we repurposed them.  Because we had designed our application with a layer of indirection for looking up which customer is located on which shard and which shard is located on which database host, we were able to quickly spin up 2 slaves and connect them to the db host having load issues. As soon as replication was done we cut off the db access and wait for replication to be 10…
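A rough sketch of that two-level lookup, to show why moving a shard is cheap. The class and method names are hypothetical, and in-memory dicts stand in for the metadata table described above.

```python
class ShardDirectory:
    """Two-level lookup: customer -> shard -> database host."""

    def __init__(self, customer_to_shard, shard_to_host):
        self._customer_to_shard = dict(customer_to_shard)
        self._shard_to_host = dict(shard_to_host)

    def host_for(self, customer_id):
        # Callers never hard-code a host; they always resolve through the shard.
        shard = self._customer_to_shard[customer_id]
        return self._shard_to_host[shard]

    def move_shard(self, shard, new_host):
        # Relocating a shard is a single mapping update; customers are untouched.
        self._shard_to_host[shard] = new_host
```

Once a replica has caught up, repointing a shard at the new host is one row change in the shard-to-host mapping, and every customer on that shard follows automatically.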

This year's plants in my vegetable garden

hybrid tea


garden beans seedling
tuver/pigeon pea seedling
okra seedling
cucumber seedling
mint
beet root
basil
carrot/cilantro

toro 7.0hp lawn mower would start and dies within 20 sec

This season, on my second mowing, my Toro 7.0 hp lawnmower would start but die quickly.  I first thought it was an air filter issue, so I cleaned the filter, but it showed the same symptoms.  Then I thought it was a spark plug issue, so I went to Lowe's, got a new spark plug and replaced it, but it still wouldn't run.  So I started googling.

It turned out to be due to the gas I was using.  I had some 4-month-old leftover gas that I used for the first mowing, and then I got new gas for the second mowing. So either it was due to the old gas or the new gas was bad. Anyway, googling told me that it's a carb issue and that I need to open the carb bolt and carb cup and clean them.  In case you don't know where the bolt is, see the image.  This bolt has a hole, and if the air filter is bad and grass clippings or dirt get in, that hole gets clogged.  I also read that this hole can get clogged if the gas has ethanol.




So you need to pinch the fuel line or empty your fuel tank and then open…

2000 unit tests mark

This is great progress. Six to eight months ago the tests were there but not running on Jenkins; after I added them to Jenkins the team stepped up and added more than 1500 unit tests. That's a good sign, because now I am not called on weekends by the Ops team to fix a regression bug.

Code coverage is still only 35%, but we are making progress on improving it; when we started it was near 25%, so this is good progress.

Misconceptions of working from home

People think "oh, you work from home", and 50% of them think you are unemployed, wth.
Many people think you slack.
Your family thinks you can take an hour off early just because you work from home.
Family members come with all sorts of requests and enter your work area many times a day.
Kids don't understand the concept of working from home, and if they are at home instead of daycare they can come in X number of times and want to work on your laptop.

Still, I enjoy working from home because I get long stretches of time to think and implement cool things for the startup.

Mysql deadlock during delete query and index

I had written a hierarchical lock manager using MySQL, and once a thread finished it unlocked by deleting its locks.  The 20-thread perf test was fine, but every day some 100-200 deadlock exceptions would show up randomly on some nodes. Each thread works in isolation, so it didn't make sense to get deadlocks. The query was like

delete from hie_path_locks where customer_id=? and thread_id= ?

Finally, after some trial and error and troubleshooting, I found that we had an index on customer_id only, and when 10 threads in the same workgroup tried to read locks to detect conflicts they would do

select * from hie_path_locks where customer_id=? and thread_id= ?

Apparently MySQL's default isolation level, REPEATABLE READ, takes locks even on rows that are only read while scanning the index, not just on the rows that match; check http://www.mysqlperformanceblog.com/2012/08/28/differences-between-read-committed-and-repeatable-read-transaction-isolation-levels/

Adding an index on (customer_id, thread_id) instead of just customer_id solved the random deadlocks.

Spring 3.2 quartz 2.1 Jobs added with no trigger must be durable.

I am trying to enable HA on our nodes, and in the process I found that in a two-node test setup a job with a frequency of 10 sec was running into a deadlock. So I tried upgrading from Quartz 1.8 to 2.1 by following the migration guide, but I ran into an exception that says "Jobs added with no trigger must be durable.".

After looking into the Spring and Quartz code I figured out that Quartz is now more strict. Earlier, scheduler.addJob had a replace parameter which, if set to true, would skip the durable check; in the latest Quartz this is fixed, but Spring hasn't caught up. So what do you do? Well, I just subclassed the factory, set durability to true, and used that:

public class DurableJobDetailFactoryBean extends JobDetailFactoryBean {
    public DurableJobDetailFactoryBean() {
        setDurability(true);
    }
}

and used this instead of JobDetailFactoryBean in the spring bean definition

    <bean id="restoreJob" class="com.xxx.infrastructure.quartz.DurableJ…

graphite dynamic counters trending

Every day I generate a report using cron of the top exceptions, top queries and top urls across all datacentres and send it via email.

The problem is that after every release the numbers go up and down, because due to some bug a new exception will pop up or an old exception will resurface.  How do I trend and correlate these dynamic counters?

The solution came from my colleague in an informal chat: he recommended that I md5-hash the url, create a Graphite counter for it, and make the count in the email a hyperlink to that counter.




Now I can trend the query, as clicking the count shows me a graph. My next plan is to inline the graph for the top 10 urls in the email itself so I don't even need to click them.
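The hashing trick can be sketched in a few lines of Python. The "urls" prefix, the host name and the 30 day window are assumptions; the link uses Graphite's standard render endpoint with target and from parameters.

```python
import hashlib
from urllib.parse import urlencode


def counter_name(url, prefix="urls"):
    """Map an arbitrary url to a fixed-length, Graphite-safe counter name."""
    # Dots separate levels in Graphite metric names, and raw urls contain dots
    # and slashes, so the md5 hex digest is used as the leaf name instead.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return "%s.%s" % (prefix, digest)


def graph_link(graphite_host, url, days=30):
    """Build the hyperlink that goes behind the count in the email."""
    query = urlencode({"target": counter_name(url), "from": "-%dd" % days})
    return "https://%s/render?%s" % (graphite_host, query)
```

Because the hash is deterministic, the same url (or query, or exception signature) maps to the same counter after every release, which is what makes the trend line continuous.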


Being Analytics driven vs firefight driven

We have doubled our incoming traffic every 6 months, and for the past 3 years we have always been in firefighting mode, where some customer reports an issue or a node goes down and we try to analyze the root cause and fix it.

Lately I am trying to move away from working in a firefight driven mode to an analytics driven mode. What I mean is being proactive: monitoring and understanding the system by gathering various metrics, and fixing issues before the customer notices them. For e.g. to put our nodes in HA mode I had to store a sessionId to userId mapping in the database. The only real reason to do this was our Flash file uploader, because it makes a request but passes the sessionId as a request parameter instead of in the cookie. This causes the request to go to a completely different node.  So to handle this we wrote a session listener that would save the sessionId to userId mapping in the db.  The code went live, and some days later the db suddenly went down. What happened was that the developer forgot to …

Haproxy and tomcat JSESSIONID

One of the biggest problems I have been trying to solve at our startup is putting our Tomcat nodes in HA mode. Right now when a customer comes, he lands on a node and remains there forever. This has three major issues:

1) We have to overprovision each node so it can handle worst-case capacity.

2) If two or three high-profile customers land on the same node, we need to move them manually.

3) We need to keep cutting over new nodes, and we already have 100+ nodes.

It's a pain managing these nodes, and I waste a lot of my time chasing node-specific issues. I loathe it when I know I have to chase yet another env issue.

I really hate human intervention. If it were up to me I would automate things, enjoy the fruits of automation, and spend quality time on major issues rather than mundane tasks. Call me lazy, but that's a good quality.

So finally I am at a stage where I can put nodes behind HAProxy in the QA env. Today we were testing the HA config, and the first problem I immediately saw is that…

%E2%80%90 and links issue

I ran into an issue where a customer creates a public link to a file and pastes it into Word, where it works fine, but when he converts the document to PDF the link no longer works, even though the link in the browser url bar looks exactly the same.  I reproduced it and saw that "h-s" in the link url was getting converted to h%E2%80%90s. After some googling it turned out to be an Adobe bug, http://forums.adobe.com/message/2807241, related to the hyphen character.
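For reference, %E2%80%90 is the percent-encoded UTF-8 form of U+2010 (the Unicode HYPHEN), which looks identical to the ASCII hyphen-minus. The Python sketch below shows the encoding and one possible server-side workaround that maps it back before resolving a link; the normalize_hyphens helper is hypothetical and not something the product is known to do.

```python
import re
from urllib.parse import quote

UNICODE_HYPHEN = "\u2010"  # renders like "-" but is a different code point

# Percent-encoded form of U+2010, matched case-insensitively.
_ENCODED_HYPHEN = re.compile(re.escape(quote(UNICODE_HYPHEN)), re.IGNORECASE)


def normalize_hyphens(url):
    """Replace raw or percent-encoded U+2010 with the ASCII hyphen."""
    return _ENCODED_HYPHEN.sub("-", url).replace(UNICODE_HYPHEN, "-")
```

Only the hyphen sequence is rewritten; every other percent-encoded character in the url is left untouched.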

Final nail in BDB coffin-Part2

Missing indexes can be a real pain.  We were migrating data from BDB to MySQL, and the migration on a few nodes had been going on for 3-4 days. As I was involved in a firefight, I didn't get a chance to look at it. But on one node only 10% of the workgroups had been migrated, and while chasing a customer-reported issue I found that an index was missing.  We created that index and restarted the migration, and wow, it finished in 5 hours.

Finally we ended up with 400M+ more rows in our mysql infrastructure and now bdb is finally out of the product. Hurray!!

Spring and Quartz JobDataMap serialization exception

We don't run our nodes in HA yet; once a customer registers he is assigned a node and he lives there. The problem is that if the node dies we incur downtime for that customer, and we also need to overallocate hardware to prepare for the worst-case scenario.  For the past 6 months I have been working on making the code stateless so that we can do HA and reduce our node count by 4 times.

So we used to run Quartz with the in-memory scheduler, but for HA I need to run Quartz in a cluster. We chose org.quartz.impl.jdbcjobstore.JobStoreTX for this.

The problem was that as soon as I tried it I ran into issues, because I was injecting Spring beans into our Quartz jobs using JobDataMap, and JobStoreTX was trying to serialize the job data into a table, and our Spring beans are not serializable.  There were two options:

1) Load the entire applicationContext in each job and read the bean from there.

2) Use the schedulerContextAsMap.

After evaluating the options I found the scheduler context to be the best option. The way to…

Final nail in BDB coffin

This weekend we will finally put the final nail in the BDB coffin.  We were using BDB in the webdav, Cloud file and backup products. Over the course of the last 6 months my team was able to remove BDB from webdav and Cloud file, and MySQL is scaling fine. We now have billions of rows in MySQL, and last weekend we did a pilot migration of a few backup product nodes.  This weekend we will strive to migrate all the backup nodes. I am hoping this will increase the number of rows in MySQL by 30%.

Mysql and sharding rocks!!

Being ruthless

A lot of the time our system deals with abuse. For e.g. some customer will move the same file between 2 folders the whole day; normally it doesn't cause issues, but in some extreme cases it generates hundreds of thousands of records.  There are also customers who have written bots that make millions of calls in a day. Or sometimes a customer will put malware on FTP and use our servers as a way to spread it, which causes antivirus software to flag our site as a spammer, causing field issues. One of the strategies we use to deal with abuse is to throttle the user for a while, but sometimes it hurts good users too. In some cases the abuse is so heavy that it can bring down the system or hurt other genuine users. In the malware case we just block the customer, as there is no time to reach the customer and solve the issue. Some user might have shared the file accidentally, but we have to be ruthless.

openldap adding an index

We use LDAP to store customer metadata like users, customers and other stuff. Right now each node has its own LDAP, and we are trying to consolidate the LDAPs across different nodes. So I loaded one LDAP with data from 20 nodes, and some of the queries were taking 4000 msec.  I saw that indexes were missing.

I went to slapd.conf and added

index customerNumber,customerDomain,email   eq
index customerName eq,sub

I also updated the cache size to 4G in DB_CONFIG file

set_cachesize   4 0 1




I restarted LDAP and suddenly things were flying, but something was fishy, as the new numbers for 100 threads were faster than the previous 1-thread time for all operations. So I picked one query and ran it manually, and found that a simple query like the one below was coming back empty



ldapsearch -x -H ldap://localhost:389 -b "dc=xxx,dc=com" "(&(objectclass=objuser)(email=kpatel@comcast1.net))"|grep email|more


I removed the index and restarted LDAP, and the same query started returning results. Upon googling …

It seems people work less in december in US

Traffic to my blog was down to 60% in December and I was worried about what had changed, but I am glad to see that it's back to its original level in the 2nd week of January.  It seems people work less in December.


Fun to see bugs

I received this spam email this morning, and the first thing I saw was that instead of an image it had a path of "c:\users\Manoj". :)


LDAP wildcard search

I was able to make an LDAP query with a wildcard on a field like username

(&(username=*kpatel))

but I wasn't able to make it on a field like

(&(customername=*kpatel))

My teammate and I searched and finally found that username was inheriting from a super type uid, and that's why it was working there, but customername was not.

Luckily the solution to the problem was to add a SUBSTR clause to the field in the LDAP schema to enable wildcard matching.

SUBSTR caseIgnoreSubstringsMatch