When BERT meets Pytorch

August 14, 2019 By Yash Vijay

A walkthrough of using BERT with pytorch for a multilabel classification use-case

It’s been almost a year since the Natural Language Processing (NLP) community had its pivotal ImageNet moment. Pre-trained language models now play exceedingly important roles in NLP pipelines for a wide range of downstream tasks, especially when training data is scarce. They encode general aspects and semantics of text into dense vector representations that are useful across many tasks.

In this post, we focus on Bidirectional Encoder Representations from Transformers (BERT), a general-purpose language representation model open-sourced by Google in November 2018. We won’t go into the finer details of the BERT architecture, since we’re primarily concerned with integrating BERT into custom PyTorch model pipelines. If you’re interested in the underlying technical details, I recommend the following, in order:

  • Read the Attention Is All You Need paper by Google. It introduces the Transformer architecture, which comprises an encoder and a decoder. BERT only uses the encoder.
  • Read this blog by Jay Alammar where he illustratively describes the Transformer. There’s no easier way to understand this architecture than by reading Jay’s blog.
  • Read this series from Miguel Romero and Francisco Ingham. They’ve done a fantastic job of intuitively explaining the math behind BERT.

End Goal

Here at Wootric, we do a lot of research on transfer learning approaches in NLP that can improve our accuracy on the multi-label text classification task on customer and employee feedback for different industries. To learn more about our product, visit this link. 


In this blog, we’re going to incorporate (and fine-tune) a pre-trained BERT model as an encoder for the task of multi-label text classification, in pytorch. Our labels are 11 different tags, as shown below.

["Alerts_Notification", "Bugs", "Customer_Support", "Documentation", "Feature_Request", "Onboarding", "Performance", "Price", "Reporting", "UX_UI", "Value_Prop"]

These tags are common themes in feedback written by consumers about SaaS products.

In addition, we will also see how to perform domain adaptation, i.e., fine-tune all BERT layers on a large amount of text from a different domain (customer feedback) than the one it was pre-trained on (Wikipedia and BooksCorpus), and compare results.

Setting it up

We will be using the amazing implementation of BERT in the PyTorch-Transformers (PT) library by Hugging Face, so make sure you have that set up, along with PyTorch of course! You can install PyTorch-Transformers using pip by running the following command:

pip install pytorch_transformers

Data Pre-processing

As with any other NLP model that takes word embeddings as input, it is imperative to tokenize text inputs and convert them into a tensor of IDs corresponding to the pre-trained BERT model vocabulary. The PT library has a nice interface for this:
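Here is a minimal sketch of what that can look like. BertTokenizer.from_pretrained, tokenize, and convert_tokens_to_ids are the library's own calls; the helper names load_bert_tokenizer and text_to_ids and the max_len default are my own illustrative choices. The helper returns a plain list of IDs, which you would wrap in torch.tensor before feeding it to the model.

```python
def load_bert_tokenizer(name_or_path="bert-base-uncased"):
    """Load a BERT tokenizer by shortcut name or from a local directory."""
    # Imported lazily so text_to_ids below can be tried without the library.
    from pytorch_transformers import BertTokenizer
    return BertTokenizer.from_pretrained(name_or_path)


def text_to_ids(text, tokenizer, max_len=128):
    """Convert one string into a list of BERT vocabulary IDs."""
    # Truncate first, leaving room for the two special tokens BERT expects.
    tokens = tokenizer.tokenize(text)[: max_len - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    return tokenizer.convert_tokens_to_ids(tokens)


# Usage (the vocabulary is downloaded on the first call, then cached):
# tokenizer = load_bert_tokenizer("bert-base-uncased")
# ids = text_to_ids("the dashboard is slow to load", tokenizer)
```

Any object with tokenize and convert_tokens_to_ids methods works here, which makes the helper easy to try with a toy vocabulary before pulling in the real one.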

In the above code snippet you may have noticed the bert-base-uncased argument being passed to the pretrained_model_name_or_path parameter while loading the tokenizer. This can be either the shortcut name of a pre-trained model, which is downloaded once and then loaded from cache, as shown above, or a path to a directory of files written by save_pretrained(). For a tokenizer, that directory holds the vocabulary files; for a pytorch_transformers.PreTrainedModel object, it holds the model weights and config. PT ships multiple pre-trained BERT models, differentiated by either model architecture or data pre-processing methodologies. You can find the list of available pre-trained models to download here. The bert-base-uncased model is the smaller BERT model, trained on lowercased text, so if you’re using this one, make sure your training data is lowercased as well.

We’re now going to build a custom dataset class that uses the BERT tokenizer (as shown above) to map text data to tensors of the corresponding BERT vocabulary IDs, while also adding the right amount of padding. This process is similar to constructing any custom dataset class in PyTorch: inherit from the base Dataset class and override the __getitem__ method. Below is the custom dataset class:
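What follows is a sketch of such a class, not a definitive implementation. The class name FeedbackDataset, the max_len and pad_id parameters, and the choice to also return an attention mask are my assumptions. The tokenizer is any object exposing tokenize and convert_tokens_to_ids, such as a BertTokenizer.

```python
import torch
from torch.utils.data import Dataset


class FeedbackDataset(Dataset):
    """Maps (text, labels) pairs to fixed-length BERT ID tensors."""

    def __init__(self, texts, labels, tokenizer, max_len=128, pad_id=0):
        self.texts = texts          # list of raw strings
        self.labels = labels        # list of multi-hot lists, one slot per tag
        self.tokenizer = tokenizer  # e.g. BertTokenizer.from_pretrained(...)
        self.max_len = max_len
        self.pad_id = pad_id        # [PAD] has ID 0 in the bert-base-uncased vocab

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize, truncate, and add the special tokens BERT expects.
        tokens = self.tokenizer.tokenize(self.texts[idx])[: self.max_len - 2]
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
        ids = self.tokenizer.convert_tokens_to_ids(tokens)

        # Right-pad to a fixed length; the mask marks real tokens with 1.
        n_pad = self.max_len - len(ids)
        mask = [1] * len(ids) + [0] * n_pad
        ids = ids + [self.pad_id] * n_pad

        return (
            torch.tensor(ids, dtype=torch.long),
            torch.tensor(mask, dtype=torch.long),
            torch.tensor(self.labels[idx], dtype=torch.float),
        )
```

Because every item comes back with the same length, the default DataLoader collation can stack items into batches without a custom collate function.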

Model Architecture

Here’s the code for the model below:
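A hedged sketch of one such architecture: a BERT encoder followed by dropout and a single linear head with 11 outputs, one per tag. The class name BertMultiLabelClassifier, the build_model helper, and the dropout value are my assumptions. For brevity the sketch lets from_pretrained load the matching config rather than building a BertConfig by hand.

```python
import torch
import torch.nn as nn


class BertMultiLabelClassifier(nn.Module):
    """A BERT-style encoder followed by a linear multi-label head."""

    def __init__(self, encoder, hidden_size=768, num_labels=11, dropout=0.1):
        super().__init__()
        # `encoder` is expected to behave like pytorch_transformers.BertModel:
        # its forward returns a tuple whose second element is the pooled
        # [CLS] representation of shape (batch, hidden_size).
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids, attention_mask=attention_mask)
        pooled = outputs[1]
        # Return raw logits; the sigmoid is left to the loss function.
        return self.classifier(self.dropout(pooled))


def build_model(name_or_path="bert-base-uncased", num_labels=11):
    """Wire a pre-trained BertModel into the classifier (needs the library)."""
    from pytorch_transformers import BertModel
    encoder = BertModel.from_pretrained(name_or_path)
    return BertMultiLabelClassifier(
        encoder, hidden_size=encoder.config.hidden_size, num_labels=num_labels
    )
```

Passing the encoder in as an argument keeps the head independent of where the weights come from, so the same class works for the stock model and for a domain-adapted one.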

Essentially, I initialize a pre-trained BERT model using the BertModel class. We can then add additional layers to act as classifier heads, much as in any other custom PyTorch architecture. An important point to note here is the creation of a config object using the BertConfig class, with parameters set to match the BERT model in use. If you pre-train another BERT model with a different configuration using either the google-research or Hugging Face implementation, you can also pass the path of the generated config file to the vocab_size_or_config_json argument.


Training this model works like training any other model in PyTorch, so we’re not going to walk through the full code and will only mention a few specifics:

  • We keep the BERT encoder unfrozen so that all weights are updated with every iteration. Alternatively, you can unfreeze only a few deeper layers.
  • Given the number of trainable parameters, it’s useful to train the model on multiple GPUs in parallel. I used 4 Tesla K80s for about 4500 training samples. For parallel training, wrap the model inside the DataParallel module. Just remember that once the model is wrapped, its attributes must be accessed as modelName.module.attribute rather than modelName.attribute.
  • I used Stochastic Gradient Descent with momentum as the optimizer and found that cycling both the learning rate and the momentum really helped to get the training and validation losses down.
  • Since the task here is multi-label classification, make sure to use the right loss function with the right input. For example, if you choose to use BCELoss, make sure to apply the sigmoid activation before calculating the loss. BCEWithLogitsLoss applies the sigmoid activation internally.
  • I trained for about 25 epochs, after which the model started overfitting. 
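For readers who want a starting point, the specifics above can be sketched roughly as follows. This is one possible wiring, not the exact training code behind the results in this post. The helper names and all learning-rate and momentum values are placeholders I picked for illustration, and PyTorch's built-in CyclicLR is just one way to cycle both quantities.

```python
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CyclicLR


def make_training_objects(model, base_lr=1e-4, max_lr=1e-2, steps_up=100):
    """Return (parallel_model, criterion, optimizer, scheduler)."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # DataParallel replicates the model across the visible GPUs; on a
    # CPU-only machine it simply forwards to the wrapped module.
    parallel_model = nn.DataParallel(model.to(device))
    # Expects raw logits and applies the sigmoid internally.
    criterion = nn.BCEWithLogitsLoss()
    optimizer = SGD(parallel_model.parameters(), lr=base_lr, momentum=0.9)
    # Cycles the learning rate between base_lr and max_lr while cycling the
    # momentum between max_momentum and base_momentum in the opposite direction.
    scheduler = CyclicLR(
        optimizer, base_lr=base_lr, max_lr=max_lr, step_size_up=steps_up,
        cycle_momentum=True, base_momentum=0.8, max_momentum=0.9,
    )
    return parallel_model, criterion, optimizer, scheduler


def train_step(parallel_model, criterion, optimizer, scheduler, inputs, targets):
    """Run one optimization step and return the batch loss as a float."""
    parallel_model.train()
    optimizer.zero_grad()
    logits = parallel_model(inputs)
    loss = criterion(logits, targets.to(logits.device))
    loss.backward()
    optimizer.step()
    scheduler.step()  # cyclic schedules advance once per batch, not per epoch
    return loss.item()
```

After wrapping, the original model is reachable as parallel_model.module, which is where attributes such as the classifier head live. If you prefer BCELoss, pass torch.sigmoid(logits) to it instead of the raw logits.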

Domain Adaptation

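For fine-tuning the BERT model on a large corpus of domain-specific data, you should download the scripts from Hugging Face here. There are 2 main scripts: one that pre-generates the training data, and one that fine-tunes the language model on that pre-generated data.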

Input Data Format

The scripts expect a single file as input, consisting of one un-tokenized sentence per line, and one blank line between documents. The sentence splitting is necessary because training BERT involves the next sentence prediction task, where the model predicts whether two sentences are contiguous text from the same document.
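As a small illustration of that format, here is a sketch that turns a list of documents into such a file. The function names are hypothetical, and the regex-based sentence splitter is a deliberately naive stand-in; a proper sentence tokenizer would do a better job on real feedback text.

```python
import re


def split_sentences(document):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def format_corpus(documents):
    """One sentence per line, one blank line between documents."""
    blocks = ["\n".join(split_sentences(doc)) for doc in documents]
    return "\n\n".join(b for b in blocks if b) + "\n"


def write_corpus(documents, path="my_corpus.txt"):
    """Write the formatted corpus to a single UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(format_corpus(documents))
```

Each piece of feedback is treated as its own document here, so sentence pairs for next sentence prediction are only drawn from within a single piece of feedback.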


The next step is to use the pre-generation script to pre-process your data (which should be in the input format mentioned above) into training examples. This script uses the same pre-processing methodology as in the BERT paper and repository. The script also includes an option to generate multiple epochs of pre-processed data, to avoid training on the same random splits every epoch. This should result in a better model. An example set of arguments is shown below:

--train_corpus my_corpus.txt
--bert_model bert-base-uncased
--output_dir training/
--epochs_to_generate 3
--max_seq_len 256

You can then train on this pre-generated data using the fine-tuning script, by pointing it to the folder created by the pre-generation script. Both scripts should be given the same bert_model and case parameters. Note also that max_seq_len does not need to be specified for the fine-tuning script, as it is inferred from the training examples. An example set of arguments is shown below:

--pregenerated_data training/
--bert_model bert-base-uncased
--output_dir finetuned_lm/
--epochs 3

I fine-tuned the bert-base-uncased model using about 800,000 pieces of customer feedback on 4 Tesla T4s for 3 epochs, with each epoch taking about 6 hours.

How do I load this new fine-tuned model as an encoder?

It’s super easy! The fine-tuning script writes the model weights, the model config file, and the vocabulary to the given output_dir. To load these, we can use the BertModel class the same way we used it to load the pre-trained bert-base-uncased model, by passing output_dir as the pretrained_model_name_or_path argument of from_pretrained(), as shown below.

from pytorch_transformers import BertModel, BertTokenizer
bert = BertModel.from_pretrained('finetuned_lm/')
tokenizer = BertTokenizer.from_pretrained('finetuned_lm/')

Yup, that easy! Now you can fine-tune BERT models using data from your domain and incorporate them into your NLP pipeline. I observed that training on my 4500 feedback samples with a domain-adapted bert-base-uncased encoder performed better than training with the original bert-base-uncased encoder.


Here are the improvements in the F1 scores for each tag over our previous best ULMFiT model:

0.12004551886891468: Improvement for Alerts_Notification
0.08922285132464836: Improvement for Bugs
0.21451892229509129: Improvement for Customer_Support
0.1558193767224931: Improvement for Documentation
0.005092426173436415: Improvement for Feature_Request
0.12261143263275459: Improvement for Onboarding
0.07988165680473391: Improvement for Performance
0.008398452996517314: Improvement for Price
0.10675711339400029: Improvement for Reporting
0.06932178626518071: Improvement for UX_UI
0.0089379171594417: Improvement for Value_Prop
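For reference, per-tag numbers like these can be computed as the difference between two models' F1 scores on the same validation set. The sketch below shows one way to do that from multi-hot label rows. The function names are mine, and it assumes predictions have already been thresholded to 0 or 1, which may differ from the evaluation setup used for the figures above.

```python
def f1_per_tag(y_true, y_pred, tags):
    """Compute the F1 score of each tag from multi-hot label rows."""
    scores = {}
    for j, tag in enumerate(tags):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the tag never occurs.
        denom = 2 * tp + fp + fn
        scores[tag] = 2 * tp / denom if denom else 0.0
    return scores


def f1_improvements(baseline, candidate):
    """Per-tag difference: candidate F1 minus baseline F1."""
    return {tag: candidate[tag] - baseline[tag] for tag in candidate}
```

Reporting the difference per tag rather than a single averaged score makes it easier to see which themes benefit most from the new encoder.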

Prior to BERT, we also experimented with some non-transfer learning approaches for this task. To name a few, we tried convolutional neural networks, an LSTM network with a many-to-one architecture, and the Bag of Tricks architecture. BERT consistently outperformed each of these for every tag!

To conclude

So, there you go! Now you have a BERT pipeline of your own that you can build on.

I really hope that you enjoyed reading this article and found it helpful. If you do have any questions please leave a comment!

Filed Under: Engineering, AI, NLP