2016 has been a wild ride. The engineering team at Wootric has had its fair share of ups and downs, and I thought it would be fun to collect some of our best war stories for posterity. Here is a list of things that happened, and what we are doing differently as a result.
1. If you merge into master you have to shepherd your changes to production. If any bugs arise you are expected to fix them.
Before we implemented this policy, people would frequently merge into master without deploying. Eventually some poor soul would have to deploy all of that accumulated code at once, and chaos would ensue. Adding some accountability has vastly improved the stability of our master branch.
2. Do not deploy on Fridays or late at night unless it's really important.
We know you're an awesome programmer, but stuff happens.
3. When writing database queries only select the columns you need.
For months we were having intermittent timeouts on a table we thought was tiny. It turns out that table had a JSONB column which could grow to any size. For most of our customers this column was small, but for a select few it was gigantic, and every query that selected it along with everything else would time out.
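Here is a minimal sketch of the difference. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the `accounts` table and its columns are made up for illustration, not our real schema.

```python
import sqlite3

# Hypothetical schema: "accounts" and its columns are placeholders.
# SQLite has no JSONB type, so a TEXT column stands in for it here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, settings TEXT)"
)
conn.execute(
    "INSERT INTO accounts (name, settings) VALUES (?, ?)",
    ("big-customer", "x" * 5_000_000),  # stand-in for a huge JSONB document
)

# SELECT * drags the giant column along even if we only want the name.
wide_row = conn.execute("SELECT * FROM accounts WHERE id = 1").fetchone()

# Naming the columns keeps the result small no matter how big "settings" gets.
narrow_row = conn.execute(
    "SELECT id, name FROM accounts WHERE id = 1"
).fetchone()

# Rough size of each result, measured as characters once stringified.
wide_size = sum(len(str(value)) for value in wide_row)
narrow_size = sum(len(str(value)) for value in narrow_row)
```

The second query returns a handful of characters while the first returns megabytes, and that gap only shows up for the few customers whose column is huge.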
4. Don't use Angular filters on large lists.
We had several customers complain that our dashboard feedback tab would get slower and slower as they paginated deeper.
The issue was an Angular filter that sorted the feedback by date. We discovered that Angular re-evaluates filters on every digest cycle, so the filter ran over the entire list several times after each AJAX request. Every time new feedback was added, Angular would sort the whole list several times over, and once the list grew large enough the app would freeze. We fixed this by sorting the feedback on the backend.
5. Put something in place to block or throttle traffic.
Eventually a search engine bot or a rogue script will inadvertently DDoS you.
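One common way to throttle is a token bucket per client. The sketch below is an illustration in Python, not what we actually run. The `TokenBucket` class, the `allow_request` helper, and the rate and capacity numbers are all assumptions made up for the example.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the bucket can be tested without sleeping
        self.tokens = float(capacity)
        self.updated_at = clock()

    def allow(self):
        # Refill based on the time elapsed since the last call, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per client, keyed by whatever identifies the caller
# (IP address, API key, ...). The choice of key is an assumption here.
buckets = {}


def allow_request(client_id, rate=5, capacity=10):
    if client_id not in buckets:
        buckets[client_id] = TokenBucket(rate, capacity)
    return buckets[client_id].allow()
```

A request handler would call `allow_request(client_id)` first and return an HTTP 429 when it comes back `False`. In practice you may prefer to do this at the load balancer or in middleware rather than in application code.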
6. Tests aren't free. Be economical.
We had a TDA (Test Driven Apocalypse): our CI builds had crept up to 20 minutes, tests were failing randomly, and development speed was at an all-time low. It was extremely demoralizing to wait 15-20 minutes only to get a random test failure. We called it The Roulette.
Looking back I'm surprised that we put up with it for so long. We didn't fix it in one Herculean push. It was a slow process with a fix here and a fix there. But eventually we got our CI builds under 5 minutes and free of random test failures. In the aftermath of our TDA we've become much more vigilant about new tests during code review.
7. RFCs (Requests for Comments) for larger projects have been instrumental in improving the overall quality of our engineering.
In the past an engineer would be tasked with a project, crawl into a dark hole, and come out days, weeks or months later clutching their precious code. Sometimes this worked out really well, but other times it was disastrous.
RFCs have helped turn our engineering process into a transparent conversation where everyone can contribute. We try our best to keep things civil and focused on solving the problem at hand.
8. There's no I in concurrency.
Traffic was growing steadily, but our system for processing surveys was crumbling under the load. To make matters worse, this system could not be scaled beyond a single process without introducing duplicate data.
We were in a serious pickle. If we couldn't make this system concurrent our entire product would be rendered useless. It took the effort of the entire team to come up with a scalable solution.
We are relieved to say that it was a success and the new system has been running smoothly for a while now.
9. Don't be afraid to explore new technologies.
The performance of our dashboard was getting really bad, and we got locked into the mindset of only trying to optimize our existing stack. It wasn't until Alf created a proof of concept using Elasticsearch that we realized we needed to try something new. Read his blog post about our transition to Elasticsearch.
10. If you are using PostgreSQL make sure to VACUUM ANALYZE your database weekly when traffic is low.
PostgreSQL needs up-to-date statistics to build accurate query plans. If the statistics are inaccurate or there are too many dead rows, the query planner will get confused. For example, it might ignore an obvious index and choose a sequential scan instead.
And for big tables, the default statistics target of 100 is usually not enough. We had to raise the target before VACUUM ANALYZE had any noticeable effect.
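As a rough sketch, here is a small Python helper that builds the two statements involved. The helper itself and the `responses` and `created_at` names in the usage line are hypothetical, and the target of 1000 is only an example value, not the number we settled on.

```python
import re

# Only allow plain lowercase identifiers, since the names are
# interpolated directly into the SQL text below.
IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def maintenance_statements(table, column, statistics_target):
    """Build SQL to raise one column's statistics target and refresh the stats."""
    for name in (table, column):
        if not IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return [
        # Per-column override of default_statistics_target.
        f"ALTER TABLE {table} ALTER COLUMN {column} "
        f"SET STATISTICS {int(statistics_target)};",
        # VACUUM cannot run inside a transaction block, so execute this
        # statement on a connection in autocommit mode.
        f"VACUUM ANALYZE {table};",
    ]


# Hypothetical table and column names, for illustration only.
statements = maintenance_statements("responses", "created_at", 1000)
```

You could run the resulting statements from a weekly cron job during a quiet window. Comparing EXPLAIN output before and after is a reasonable way to check whether the planner started picking the index.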
I think 2017 will be fine, really.
If you are interested in solving these kinds of problems and would like to join our engineering team, please reach out.