
Don’t make users wait: Optimizing the Performance of an Analytics Dashboard

September 29, 2016 By Alf Cora

At Wootric we've been huge fans of PostgreSQL from day one, and we still are. It's a robust, reliable, fully-featured open source database, and in most cases it covers all of your application's data storage needs. On the other hand, we collect customer feedback such as NPS at a huge scale, so it was only a matter of time until code that wasn't written with that scale in mind started causing performance issues, and that time came at the beginning of this year. And users hate performance issues. Time is our most valuable asset these days and nobody wants to spend it waiting for an application, duh!

For developers, this is a different story. A performance bug reveals an algorithm that did its job but has grown old and is about to become obsolete; its days are numbered. A shiny new piece of code is about to arrive and you're the big mind behind it... It's refactor time!

Issues that can be solved by doing things differently

If you use Wootric regularly, you may be familiar with user segments. Basically, every custom attribute passed to us with a survey response can be used as a filter to segment your users. A great feature deserves great UX, so the dashboard shows the list of every value that was ever sent for a given attribute, letting you quickly select values and apply the filters.

In early versions of Wootric, we used to run a background script that regenerated all the properties every 10 minutes, and it worked great at first. But as the data volume kept growing, the script demanded more and more memory, and a few months later it was taking over an hour to complete. It was clearly an inefficient algorithm, since it scanned the whole database even though most of the filters hadn't changed. It became clear to us that the better solution was to build the list incrementally, adding new values as survey replies were received. The result: less database traffic and $ saved by not running that script anymore.
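
For the curious, here's a minimal sketch of the incremental approach, assuming a Rails-style app. The `Response` and `SegmentProperty` models and the `properties` hash are hypothetical names for this example, not our actual schema.

```ruby
# Hypothetical sketch: record new attribute values as responses arrive,
# instead of rescanning the whole table every 10 minutes.
class Response < ApplicationRecord
  after_create :register_segment_properties

  private

  def register_segment_properties
    (properties || {}).each do |name, value|
      # find_or_create_by keeps the list of known values up to date
      # without duplicates and without a full table scan
      SegmentProperty.find_or_create_by(
        account_id: account_id,
        name:       name,
        value:      value.to_s
      )
    end
  end
end
```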

Issues that can be solved with query optimizations

Once we were comfortable with our algorithms, it was time to optimize our database queries. This is a complex topic and it's outside the scope of this article. We got big performance improvements by doing the following (a few of these are sketched after the list):

  • removing unnecessary queries
  • avoiding expensive joins
  • eager loading
  • selecting only the columns we need
  • avoiding counts on big tables
  • optimizing indexes

Regarding the indexes, it's always important to remember that indexes are a trade-off: they make your queries faster at the expense of slower database writes. Sometimes you're fine with the slower writes, to the point where you'd be willing to sacrifice even more write performance in order to get faster queries. That's when you start thinking about refactoring your database, using partitions, or replicating data in multiple formats. The good news is that you don't have to DIY: there are better tools for this job (see below).
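
To make the trade-off concrete, here's a hypothetical migration adding a composite index matching a dashboard query that filters responses by account and date; it's an illustration, not our actual schema.

```ruby
class AddDashboardIndexToResponses < ActiveRecord::Migration[5.0]
  def change
    # Reads filtered by account and date range get much faster; in exchange,
    # every INSERT into responses now also has to update this index.
    add_index :responses, [:account_id, :created_at]
  end
end
```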

Issues that can be solved with caching

Calculating Net Promoter Score -- something we do in real-time for all of our customers -- might look like a simple task. To determine this popular customer loyalty metric, just subtract the % of detractor responses from the % of promoter responses and you're done, right? Well, that works fine until you need to run it for thousands of replies out of a table with a hundred million surveys, all of this within the scope of a web request. Things could get out of control quickly.
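
The formula itself really is that simple. Here it is as plain Ruby, using the standard NPS buckets (promoters score 9-10, detractors 0-6, passives 7-8 are ignored):

```ruby
def nps(scores)
  return 0 if scores.empty?
  promoters  = scores.count { |s| s >= 9 }
  detractors = scores.count { |s| s <= 6 }
  # percentage of promoters minus percentage of detractors
  ((promoters - detractors) * 100.0 / scores.size).round
end

nps([10, 9, 9, 7, 6, 3]) # => 17
```

The hard part isn't the math; it's running it over a huge table inside a web request.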

Our first solution to this problem was to optimize the database queries as much as possible, and when that was no longer enough we started caching things. The first approach was to cache the calculation results. However, users can change the date range of the calculation with just a few clicks, so we built a cache of daily results instead. With this, plus the proper set of caching rules and background jobs to keep the caches warm, we managed to fix the problem for a while.
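
Here's a rough sketch of the daily-bucket idea, assuming `Rails.cache` and a hypothetical `Response` model; it shows the shape of the solution, not our exact implementation:

```ruby
# Cache one small summary per account per day...
def daily_bucket(account_id, day)
  Rails.cache.fetch("nps_bucket/#{account_id}/#{day}") do
    scores = Response.where(account_id: account_id,
                            created_at: day.beginning_of_day..day.end_of_day)
                     .pluck(:score)
    { promoters:  scores.count { |s| s >= 9 },
      detractors: scores.count { |s| s <= 6 },
      total:      scores.size }
  end
end

# ...and serve any date range by combining buckets instead of re-querying.
# Background jobs keep the recent buckets warm, as mentioned above.
def nps_for_range(account_id, from, to)
  buckets    = (from..to).map { |day| daily_bucket(account_id, day) }
  promoters  = buckets.sum { |b| b[:promoters] }
  detractors = buckets.sum { |b| b[:detractors] }
  total      = buckets.sum { |b| b[:total] }
  total.zero? ? 0 : ((promoters - detractors) * 100.0 / total).round
end
```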

But then, our segmentation feature presented a challenge. We offer the ability to calculate NPS for groups of users with any combination of custom attributes. The number of possibilities quickly grows exponentially. It's not practical or efficient to create all these caches. So we felt the need for something that supported the "schema-less" nature of custom attributes but still allowed fast queries.

Issues that can be solved by using better tools

We found the perfect solution to these problems in Elasticsearch, an open source search engine with a fairly simple setup: once a survey has been stored in Postgres, we push it to ES, where it's stored in a format that's optimal for querying later. Indexing is slow compared to writing to a database, so we use background jobs to deal with it. You decide how a document is stored in ES by defining mappings, and as your business logic grows you can have multiple indexes, each with mappings specific to its problem. We also use dynamic templates to deal with our schema-less custom attributes.
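
As an illustration, here's what creating such an index can look like with a plain HTTP request. This uses Elasticsearch 2.x-era syntax (on newer versions the `string`/`not_analyzed` pair becomes `type: "keyword"`), and the index and field names are hypothetical:

```ruby
require "net/http"
require "json"

# Hypothetical index with an explicit mapping plus a dynamic template that
# indexes anything under properties.* as an exact-match string, so that
# schema-less custom attributes can still be filtered on quickly.
mapping = {
  mappings: {
    response: {
      properties: {
        account_id: { type: "integer" },
        score:      { type: "integer" },
        created_at: { type: "date" }
      },
      dynamic_templates: [
        { custom_attributes: {
            path_match: "properties.*",
            mapping:    { type: "string", index: "not_analyzed" } } }
      ]
    }
  }
}

uri = URI("http://localhost:9200/responses")
Net::HTTP.start(uri.host, uri.port) do |http|
  http.put(uri.path, mapping.to_json, "Content-Type" => "application/json")
end
```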

ES has become so popular that almost all languages have a client library to talk to the servers. However, we only needed a small set of features to get started. The official documentation has clear examples on how to connect using HTTP requests with JSON encoded objects, so we decided to build our own library. It started very small, including only the logic to process a couple of requests, and we kept iterating as we needed more.

How simple is that library?

Very. Here you can find a simplified version of our library, which is written in Ruby and has enough endpoints to get you started. Feel free to use it and build on top of it.
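
To give a flavor of the approach (this is a throwaway sketch, not the library linked above), a hand-rolled client can start as little more than a thin wrapper around `Net::HTTP`:

```ruby
require "net/http"
require "json"

class TinyESClient
  def initialize(host: "localhost", port: 9200)
    @http = Net::HTTP.new(host, port)
  end

  # PUT /index/type/id with a JSON document
  def index(index, type, id, doc)
    request(Net::HTTP::Put, "/#{index}/#{type}/#{id}", doc)
  end

  # POST /index/_search with a query hash
  def search(index, query)
    request(Net::HTTP::Post, "/#{index}/_search", query)
  end

  private

  def request(klass, path, body)
    req = klass.new(path, "Content-Type" => "application/json")
    req.body = body.to_json
    JSON.parse(@http.request(req).body)
  end
end

client = TinyESClient.new
client.index("responses", "response", 42, account_id: 7, score: 9)
client.search("responses", query: { term: { account_id: 7 } })
```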

Why not Solr?

Elasticsearch is based on Lucene, just like Solr, and they're both great products, but we had to choose one. We did not want to host it ourselves, and the extensive documentation available for ES was the tiebreaker.

Ok, show me the numbers

The Wootric dashboard displays data in several different ways. We adopted ES progressively, beginning with the Snapshot tab, then Feedback, followed by Tags, and finally Trends. You can see a visual of the customer dashboard we are talking about here. The whole process took about three months, since we were also focusing on new features during that time, but we started seeing results from day one:

[Chart: average response time]

That's right, we went from an average response time of 1700ms to 61.1ms!
At the same time, the volume of requests kept growing, so it's scary to think what would've happened if we hadn't changed anything:

[Chart: request throughput]

But not all of the results can be measured in request response times. Our last challenge, the Trends tab, used to run on a complex caching algorithm in the background. When a user requested the trends chart for a certain time period and set of filters, a complex background job was triggered and the dashboard kept polling the API, waiting for the job to complete. Once we learned about the Date Histogram aggregation in Elasticsearch, that whole algorithm became history.
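
For the curious, a trend query built around the Date Histogram aggregation can be as small as this (field names and the account filter are illustrative); one request returns per-day promoter and detractor counts, which is all a trend chart needs:

```ruby
trend_query = {
  size: 0,                                  # we only want the aggregations
  query: { term: { account_id: 7 } },
  aggs: {
    per_day: {
      date_histogram: { field: "created_at", interval: "day" },
      aggs: {
        promoters:  { filter: { range: { score: { gte: 9 } } } },
        detractors: { filter: { range: { score: { lte: 6 } } } }
      }
    }
  }
}
# e.g. POST /responses/_search with this body
```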

Here's a comparison of the calculation time from both algorithms in a few scenarios:

Test                  PG + Caching   Elasticsearch
Daily                 4384 ms        133 ms
Week over week        1305 ms        64 ms
Month over month      1050 ms        53 ms
30-day rolling avg    4309 ms        125 ms

Conclusion

Application performance tuning is a complex topic. In this blog post we addressed a few problems and solutions based on what's worked for us. By thinking about problems in different ways, optimizing Postgres queries, and caching, we were able to obtain great results, but introducing Elasticsearch was without a doubt the biggest win. It helped us improve performance, save money and clean up code. See what Git says:

[Image: git stats showing the code cleanup]

Start measuring your customer loyalty and Net Promoter Score for free with Wootric
