
Understanding Accuracy in ML Systems

March 12, 2019 By Ryan Smith

Note: An article about this from the CX perspective can be found here. 

The last few years have seen an explosion in companies adopting AI/ML solutions into their business pipelines. Huge advancements in Computer Vision and Natural Language Processing have made it possible to automate tasks we wouldn’t have attempted just five years ago. However, the utility of these methods can sometimes be lost in translation.

At Wootric, we have built a platform to measure and boost customer happiness. Our newest product, CX Insight, uses NLP to, among other things, surface themes in customer feedback. We do this by using our Machine Learning models to “Tag” each feedback with the themes it contains. This can be expressed as a Multi-Label Classification problem, where each feedback can have no tags, one tag, or multiple tags applied to it.

Wootric CXInsight Text Analytics Dashboard

Figure Description: The first comment has the theme “Shipping & Packaging” associated with it, while the second comment has no specific themes, and the fourth comment has “Price” and “Shipping & Packaging” associated.
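To make the multi-label framing concrete, here is a minimal sketch of how such tags are typically encoded for a model, using scikit-learn’s MultiLabelBinarizer and a few hypothetical comments (this is illustrative, not our production pipeline):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical feedback, each paired with the set of themes ("tags") it contains.
# A feedback can carry zero, one, or several tags.
feedback_tags = [
    {"Shipping & Packaging"},            # first comment
    set(),                               # second comment: no specific theme
    {"Price", "Shipping & Packaging"},   # a comment with two themes
]

# Multi-label targets are usually encoded as a binary indicator matrix:
# one row per feedback, one column per tag.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(feedback_tags)

print(mlb.classes_)  # ['Price' 'Shipping & Packaging']
print(y)
# [[0 1]
#  [0 0]
#  [1 1]]
```

A separate binary classifier (or one model with multiple outputs) can then predict each column of that matrix independently.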

Recently, I have found myself frequently answering the question, “The AI got this feedback classification wrong, so why should I trust your product?” As a data scientist, I figured I would give the standard law-of-large-numbers response: as you get more data, individual errors have less of an impact on your results. After a while, though, I started to think a more structured answer would benefit both myself and those asking the question. This blog post is the result of my dive down the rabbit hole in search of a formal answer to that question.

Defining Metrics 

Obviously some models work better than others. In order to assess how “good” a model is, it is important to have a relevant accuracy metric. The standard metrics for assessing classification accuracy are Precision and Recall. At a high level, precision depends on the number of False Positive errors (tagging a feedback we should not have tagged), while recall depends on the number of False Negative errors (failing to tag a feedback we should have tagged). There is an inherent tradeoff between the two: minimizing your false positives will lead to more false negatives, and vice versa. Because of this, it is helpful to use F1-Score, a single metric that combines precision and recall. When we discuss the accuracy of classification systems below, we will be referring to their F1-Score as a benchmark.
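For reference, here is a small sketch of how these metrics fall out of the counts of false positives and false negatives for a single tag (the numbers are illustrative, not from our models):

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 for a single tag.

    Precision: of the feedback we tagged, what fraction should have been tagged.
    Recall:    of the feedback that should have been tagged, what fraction we tagged.
    F1:        harmonic mean of precision and recall.
    """
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts for one tag.
p, r, f1 = precision_recall_f1(true_positives=70, false_positives=30, false_negatives=30)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # precision=0.70 recall=0.70 f1=0.70
```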

Why is Classification Useful? 

Great, so let’s say my model has an F1-Score of 70%. Is this good or bad? And how can I make use of its predictions? These are significant questions that get glossed over in most of the buzz surrounding Machine Learning. It’s important to know what your model will be helping you accomplish, and how its accuracy will factor into that goal. Everyone’s use case is somewhat different, but I will describe a few of our specific goals in a generic way. Hopefully, you will be able to apply the general ideas below to your own ML solutions.

Information Retrieval

A very common application of classification systems is Information Retrieval. Information retrieval is the task where a system attempts to find relevant information to show to a user (usually a human). Search engines are the most well known example, but users of all sorts of products will encounter a version of this problem at some point.

For us at Wootric, information retrieval involves locating customer feedback pertaining to a specific theme. Consider the case where you are interested in what feature requests your customers are making. Having a classification system in place will save you time and spare you the distraction of wading through tons of unrelated feedback.

For a concrete example, let’s say you want to read 100 Feature Requests from users. Now, assume that it takes, on average, 5 seconds to read each feedback. Looking at real-life data from one of our customers, Feature Requests show up in around 6.8% of overall feedback, so without any filtering you would have to read roughly 1,470 pieces of feedback to find 100 of them. That works out to about 7,300 seconds (5 * 100 / 0.068), or a little over 2 hours, to read 100 feature requests.

Now, consider we use our classification model with a Precision of 70% to speed this process up. If we only look at feedback that the model tagged as Feature Request, around 70% of that feedback will actually be Feature Requests. This means it would take 715 seconds (5 * 100 / 0.7), or about 12 minutes, to read 100 feature requests.

In summary, using a classification system would save about 1 hour and 50 minutes of reading through irrelevant feedback. This frees up time to investigate more questions about your data, and/or scale your team to solve other problems. Obviously this is just one specific example, but it shows that even a classification system that errs on 30% of feedback will still contribute a large amount of efficiency to information retrieval. To experiment with the various factors affecting how much time would be saved in your own problem, you can view this simple Desmos Graph. It has descriptions of the various parameters used, and allows you to change them as you wish.

Desmos graph 1
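If you prefer code to graphs, a rough sketch of the same back-of-the-envelope calculation looks like this (the parameter values mirror the example above and can be swapped for your own):

```python
# Back-of-the-envelope estimate of reading time with and without a classifier.
seconds_per_feedback = 5       # average time to read one piece of feedback
target_reads = 100             # how many Feature Requests you want to read
base_rate = 0.068              # fraction of all feedback that are Feature Requests
model_precision = 0.70         # fraction of model-tagged feedback that truly are Feature Requests

# Without a model: read everything until you stumble on enough Feature Requests.
seconds_without_model = seconds_per_feedback * target_reads / base_rate

# With a model: only read feedback the model tagged, of which ~precision are relevant.
seconds_with_model = seconds_per_feedback * target_reads / model_precision

print(f"without model: {seconds_without_model / 3600:.1f} hours")    # ~2.0 hours
print(f"with model:    {seconds_with_model / 60:.1f} minutes")       # ~11.9 minutes
print(f"time saved:    {(seconds_without_model - seconds_with_model) / 60:.0f} minutes")  # ~111 minutes
```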

Aggregate Information 

Another use case for ML classification is comparing various trends across your dataset. As a generic example, think in the context of an image posting app (like Instagram or Snapchat). You may be curious to find out what images are trending among your users. To answer this, you run an image classification algorithm over posted images, and come up with a ranking of various image contents.

Image classification by theme

Now, we know the image recognition algorithm will definitely not be 100% accurate - however, this list is still incredibly helpful. We are still able to tell, with some amount of confidence, that Dogs appear more often in user images than Cats. How to calculate that confidence obviously depends on factors related to the algorithm (accuracy on the Dogs category, accuracy on the Cats category, amount of training data, etc.). Below, we demonstrate this confidence in list ordering for a real-life example in our CX Insight platform.

Categorized Feedback via Text Analytics - Wootric dashboard

In our dashboard (shown above), you can see the relative ranking of various themes: Shipping-Packaging, Product Quality, Price, etc. While it is useful to see what the system predicts as the right ordering, we also care about what our confidence in that ordering is. So, after playing around with some probability distribution math, we end up with the following couple of confidences:

  • We are 99.998% confident that more comments concern Shipping-Packaging than Product Quality

  • We are only 70.35% confident that more comments concern Product Quality than Price


This lets us calculate our confidence in the true ordering for any pair of tags in the list. Of course, this confidence depends on a variety of factors. Since we are working with such a small sample (111 feedback), and there are only two more comments tagged with Product Quality than Price, our confidence is much lower for that ordering of themes. The most notable factors are described below (with their values for the Product Quality vs. Price example in accompanying parentheses):

  • The size of the sample we are comparing (111)

  • The percentage of feedback tagged with PRODUCT QUALITY (13.5%)

  • The percentage of feedback tagged with PRICE (11.7%)

  • The F1-Score of the PRODUCT QUALITY model (67%)

  • The F1-Score of the PRICE model (78%)

  • How often PRODUCT QUALITY shows up in the training data (10.2%)

  • How often PRICE shows up in the training data (15.6%)

  • The size of the training dataset (3,000)

If you would like to see how these various factors affect the confidence in list ordering, see this Desmos Graph.

Desmos graph 2
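The exact model behind the confidences above also folds in the F1-Scores and training-data factors listed earlier, but a crude approximation that captures the sample-size effect is to treat the two tag counts as independent binomials and compare them with a normal approximation. A sketch of that simplified version:

```python
from math import erf, sqrt

def ordering_confidence(n, rate_a, rate_b):
    """Rough P(tag A truly appears more often than tag B) in a sample of size n.

    Simplifying assumptions: the two counts are treated as independent binomial
    variables and approximated as normal. The fuller calculation should also
    account for each model's accuracy (F1), which this sketch ignores.
    """
    mean_diff = n * (rate_a - rate_b)
    var_diff = n * (rate_a * (1 - rate_a) + rate_b * (1 - rate_b))
    z = mean_diff / sqrt(var_diff)
    return 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF at z

# Product Quality (13.5%) vs. Price (11.7%) on a sample of 111 feedback.
print(f"{ordering_confidence(111, 0.135, 0.117):.1%}")
# ~65.7% with this crude approximation; the fuller model above gives 70.35%
```

Even this simplified version makes the intuition clear: a two-comment gap in a 111-comment sample does not support a confident ordering, while a larger gap or a larger sample quickly would.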

What Next?

Naturally, these are not the only two questions that can be answered by ML classification systems. Rather, I meant to provide a basis for answering the question, “Why should I trust this model?” The answer is that it depends on what you are using the model for. In the cases described above, we give mathematical reasons for why and when you should trust our classification models. Hopefully, you can take these ideas, apply them to your own problems, and give your users confidence that your classification system (or any other type of ML application) is useful even in the face of a few errors.

Filed Under: Engineering, AI