HuggingFace has been working on SetFit, a model for text classification with small datasets. The aim is to leverage a pretrained Sentence Transformer and use contrastive learning to get more training signal out of a small dataset, by pairing examples with the same label so that they end up close together in the same embedding space.
In this tutorial I will talk you through what SetFit is and how to fine-tune the model to do classification with a small dataset.
Let’s jump into it!
What is SetFit?
To illustrate what SetFit does I will use a visual from the HuggingFace GitHub repository that shows you how this multi-step approach works:
SetFit takes advantage of Sentence Transformers’ ability to generate dense embeddings based on paired sentences. These are the steps it takes to get there:
- In the initial fine-tuning stage, it makes use of the limited labeled input data, shown in the few-shot training data section, through contrastive training, where positive and negative pairs are created by in-class and out-of-class selection. Basically, two sentences with the same label form an in-class (positive) pair and two sentences with different labels form an out-of-class (negative) pair.
- Next, the Sentence Transformer model trains on these pairs (or triplets) and generates a dense vector per example (a vector in which most entries are non-zero, as opposed to a sparse one). In the second step, the classification head trains on the encoded embeddings (numerical representations of each sentence) with their respective class labels.
- At inference time, the unseen example passes through the fine-tuned Sentence Transformer, generating an embedding that when fed to the classification head outputs a class label prediction.
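The pair-building idea in the first step can be sketched in plain Python. This is a simplified illustration, not SetFit's own sampling code: the function name `make_contrastive_pairs` and the example texts are made up for this sketch, and it enumerates every possible pair, whereas the library samples a limited number of pairs.

```python
from itertools import combinations

def make_contrastive_pairs(texts, labels):
    """Pair up every two examples; score 1.0 if their labels match, else 0.0."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(zip(texts, labels), 2):
        score = 1.0 if label_a == label_b else 0.0  # in-class vs out-of-class
        pairs.append((text_a, text_b, score))
    return pairs

# Placeholder data: two examples per class.
texts = ["example a", "example b", "example c", "example d"]
labels = [1, 1, 0, 0]
pairs = make_contrastive_pairs(texts, labels)
print(len(pairs))  # 4 examples give 6 unique pairs
```

Even four labeled sentences yield six training pairs, which is why this approach stretches a small dataset further than training on the raw examples alone.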
The approach is also adaptable: swapping the underlying Sentence Transformer checkpoint is enough to turn this from an `english` model into a multilingual one.
The other advantages, as listed on the supporting repository, are:
- Fast to train: SetFit doesn’t require large-scale models like T0 or GPT-3 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.
- Multilingual support: SetFit can be used with any Sentence Transformer on the Hub, which means you can classify text in multiple languages by simply fine-tuning a multilingual checkpoint.
For the supporting research behind this model, read here: https://github.com/huggingface/setfit.
The data part…
We are going to be working with a custom social media dataset that has very few examples, each labeled as either online abuse or not abuse. In these steps we will use the supporting GitHub repository to load the data from a CSV file and then push it to HuggingFace as a Dataset.
The dataset can be viewed in the viewer:
As you can see, each row contains a text and label pair. This is the result of pushing our CSV file to the dataset. The supporting repository with the raw data can be found here: https://github.com/StatsGary/transformers-playground/blob/main/data/smabuse.csv.
This will be our end result, but first we will step through importing the relevant packages and loading the data.
Getting our imports installed
The first stage of this will be to install all the packages we are going to need to work with datasets, and later, to fine tune our model. These are implemented below:
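Before importing anything, it can help to confirm that the packages are actually available. The sketch below only uses the standard library; the package list is an assumption based on what this tutorial uses, so adjust it to your own setup.

```python
import importlib.util

# Assumed import names for this tutorial (not taken from the original notebook).
REQUIRED = ["setfit", "datasets", "huggingface_hub", "sentence_transformers", "pandas"]

def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Install before continuing:", ", ".join(missing))
```

Note that import names and pip names can differ, for example `sentence_transformers` is installed as `sentence-transformers`.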
Great – now we have these imports we will move to loading our custom csv dataset. This will be covered in the following section.
Loading in the custom dataset
As specified above, we will load this dataset: https://github.com/StatsGary/transformers-playground/blob/main/data/smabuse.csv into our Python notebook, or Python script file. We will use the `load_dataset` function from the Datasets package. This is implemented as below:
To explain these lines:
- `train_dataset` is the dataset we will use to train our custom `setfit` model. We use the `load_dataset` function and specify which file type we wish to load; in this instance it is a CSV file. Next we specify where the data file is located, using the `data_files` parameter, and finally we set the `split` parameter to the training dataset.
- With `eval_dataset` we take the initial training set and split it. The aim is to split the dataset so that 90% of the data is retained in the training set and 10% is used in our validation set (the set that is used to check how well our setfit model performs on data it has not trained on).
Running both of these commands, you will end up with two separate datasets stored in memory.
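The loading and splitting described above could look roughly like the sketch below. The function name `load_splits`, the seed and the local file path are assumptions for illustration; `load_dataset` and `train_test_split` are the Datasets calls the text refers to, and the import is done inside the function so the snippet can be pasted without the package installed.

```python
def load_splits(csv_path, eval_fraction=0.1, seed=42):
    """Load a CSV file as a Dataset and hold back a slice for validation."""
    from datasets import load_dataset  # imported lazily; needs the datasets package

    # A CSV file loaded this way lands in a single "train" split.
    full = load_dataset("csv", data_files=csv_path, split="train")
    # Keep 90% for training and 10% for validation.
    parts = full.train_test_split(test_size=eval_fraction, seed=seed)
    return parts["train"], parts["test"]

# Hypothetical local copy of the CSV from the repository:
# train_dataset, eval_dataset = load_splits("data/smabuse.csv")
```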
The next step is to combine these into a `DatasetDict`, as this is the format HuggingFace Datasets expects the file to be in.
Create the DatasetDict dictionary
In the last section we mentioned that we would need to take our inputs and load them into a `DatasetDict`, which is essentially a dictionary that maps split names to datasets. This is how I have implemented it for our project in the notebook:
All this does is combine our `train_dataset` and `eval_dataset` into a single dictionary of datasets, which is now ready to be uploaded to HuggingFace Datasets for retrieval from the system.
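A minimal sketch of this combining step, assuming the split names `train` and `validation` (the helper name `build_dataset_dict` is made up for this example):

```python
def build_dataset_dict(train_dataset, eval_dataset):
    """Bundle the two splits under the names 'train' and 'validation'."""
    from datasets import DatasetDict  # imported lazily; needs the datasets package

    return DatasetDict({"train": train_dataset, "validation": eval_dataset})

# dataset = build_dataset_dict(train_dataset, eval_dataset)
```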
Push our custom dataset to the hub
We will use the special command `notebook_login()` to log in to our HuggingFace account.
If you do not have a HuggingFace account you will need to sign up for one at huggingface.co and once you have done this you will need to obtain a token.
This can be achieved by:
- Click your profile image at the top right of the screen
- From here, choose Settings
- On the left tab choose Access Tokens
- Click New Token to generate a new token
Once you have a token you will need to register it against your account using `huggingface-cli login`. For the full set of steps, or if you get stuck with these instructions, go to: https://huggingface.co/docs/huggingface_hub/quick-start.
This should take care of the setup and now we will use the command `push_to_hub` to push our Dataset to HuggingFace.
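As a rough sketch, the push step is a single method call on the dataset dictionary. The wrapper `push_dataset` and the repository name in the comment are hypothetical; the call assumes you have already logged in.

```python
def push_dataset(dataset_dict, repo_name):
    """Upload every split in `dataset_dict` to the Hub under `repo_name`."""
    # Requires a prior login, for example notebook_login() from huggingface_hub
    # or `huggingface-cli login` in a terminal.
    dataset_dict.push_to_hub(repo_name)
    return repo_name

# push_dataset(dataset, "my-abuse-dataset")  # hypothetical repository name
```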
You can skip the next bit if you have the dataset already in your environment, but this is how you would pull the dataset from the hub.
Pulling the Dataset from the HuggingFace Hub
The below code will then be used to retrieve our brand new dataset from the HuggingFace Hub:
A couple of things to note:
- We referenced what our dataset was called in the previous section, but this time we prefixed it with a HuggingFace user name or handle. The user name here was StatsGary.
- The print statement returns the dataset dictionary:
This brings back our train and validation Datasets.
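The retrieval described above might be sketched as follows. The dataset name is left as a placeholder because it is not stated here, and `pull_dataset` is a made-up helper; the only library call is `load_dataset` with a `user/name` identifier.

```python
def pull_dataset(user, dataset_name):
    """Fetch a dataset from the Hub using its '<user>/<name>' identifier."""
    from datasets import load_dataset  # imported lazily; needs the datasets package

    repo_id = f"{user}/{dataset_name}"
    return load_dataset(repo_id)

# dataset = pull_dataset("StatsGary", "<dataset-name>")  # placeholder name
# print(dataset)
```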
Bolster the classes with simulated oversampling
Because we have a small dataset we will bolster the number of examples we have. According to SetFit, the model can perform really accurately with around two thousand examples. Because we are not even close to this, we will use just a few.
This will essentially multiply the number of classes by 8 giving us a larger training set to work with. Not ideal, but useful for this tutorial.
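One plausible reading of this step is to shuffle the training rows and keep 8 × the number of classes of them. The sketch below shows that arithmetic on plain Python dictionaries; it is an assumption about the step rather than the notebook's code, and the rows are generated placeholders.

```python
import random

def sample_few_shot(rows, num_classes, per_class=8, seed=42):
    """Shuffle the rows and keep per_class * num_classes of them."""
    shuffled = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[: per_class * num_classes]

# Placeholder rows with two classes (0 and 1).
rows = [{"text": f"post {i}", "label": i % 2} for i in range(100)]
train_subset = sample_few_shot(rows, num_classes=2)
print(len(train_subset))  # 8 * 2 = 16
```

A plain shuffle like this does not guarantee an equal number of examples per class, so it is worth checking the label counts afterwards.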
The modelling part…
In this section we will:
- Load a pretrained sentence transformer model through the `SetFitModel` class
- Create a Trainer (you could create a PyTorch loop here as well; this is outside the scope of this tutorial, but see: https://huggingface.co/docs/transformers/training)
- Train the model
Loading a pre-trained model
There are many sentence transformer models to use on HuggingFace: https://huggingface.co/sentence-transformers. Here I chose one of these sentence transformers. They are used for sentence, text and image embeddings (Sentence BERT was the first described using this framework).
That is all there is to loading a pre-trained model. Take a look at the documentation for the chosen model: https://huggingface.co/sentence-transformers/all-mpnet-base-v2 and feel free to choose another model to see how this can improve results.
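For reference, loading the checkpoint named above is a single call to `SetFitModel.from_pretrained`. The wrapper `load_pretrained` is only there for illustration and keeps the import lazy.

```python
def load_pretrained(model_id="sentence-transformers/all-mpnet-base-v2"):
    """Load a Sentence Transformer checkpoint wrapped as a SetFit model."""
    from setfit import SetFitModel  # imported lazily; needs the setfit package

    return SetFitModel.from_pretrained(model_id)

# model = load_pretrained()
```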
Fine tuning the model head with Trainer
`transformers` has a built-in training loop, which is a wrapper for a PyTorch training loop. This is a more complicated loop than you may have seen with other PyTorch implementations, so we will use the out-of-the-box functionality in the `setfit` package. For those who are interested in how the trainer.py file works, check out the associated repository for the package: https://github.com/huggingface/setfit/blob/main/src/setfit/trainer.py.
Below shows how to achieve this training step:
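A sketch of the trainer setup, assuming the `SetFitTrainer` class from earlier (pre-1.0) releases of `setfit`; newer releases may expose a different trainer interface. Only `num_iterations=20` and the column names come from this tutorial; the batch size, epoch count and learning rate are assumed values.

```python
def build_trainer(model, train_dataset, eval_dataset):
    """Create a SetFit trainer for the text/label dataset."""
    # Lazy imports; both packages must be installed to run this.
    from sentence_transformers.losses import CosineSimilarityLoss
    from setfit import SetFitTrainer

    return SetFitTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss_class=CosineSimilarityLoss,
        batch_size=16,       # assumed value
        num_iterations=20,   # 20 text pairs for contrastive learning
        num_epochs=1,        # assumed value
        learning_rate=2e-5,  # assumed value
        column_mapping={"text": "text", "label": "label"},
    )

# trainer = build_trainer(model, train_dataset, eval_dataset)
```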
The parameters for the setfit trainer are:
- model – the model we initialised in the previous step
- train_dataset – the training dataset for the social media abuse task
- eval_dataset – the held-out dataset the trained model is evaluated against, so we can see how well it generalises beyond the training set
- loss_class – this is equal to `CosineSimilarityLoss`, which measures how far the cosine similarity between the two sentence embeddings in a pair is from that pair's target (similar or dissimilar), with the aim of minimising that error as much as possible. Cosine similarity is well described here: https://en.wikipedia.org/wiki/Cosine_similarity.
- batch_size – the size of the batches of data to process through the network at once
- num_iterations – this controls the contrastive learning process and says we should generate 20 text pairs for contrastive learning. The aim of contrastive learning is to learn the general features of a dataset by teaching the model which data points are similar or different. This will learn difference and similarity in a spatial sense i.e. those sentences closest to what appears to be an abuse item will be labelled abuse, and those further away will be labelled not abuse.
- column_mapping – this just maps the column to the input data, as our fields are called text and label
- num_epochs – this is the number of passes through the network the dataset takes
- learning_rate – the rate at which the network updates its weights while optimising the loss function. A large learning rate can lead to weights not being learned as effectively, whereas a small learning rate can lead to better learnability but much slower training times.
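To make the loss parameter more concrete, here is a small plain-Python sketch of cosine similarity and a squared-error loss against a pair's target. It illustrates the idea behind a cosine similarity loss on toy vectors; the function names are made up and this is not the library's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_loss(u, v, target):
    """Squared error between the pair's cosine similarity and its target."""
    return (cosine_similarity(u, v) - target) ** 2

# Identical directions labelled as similar (target 1.0) give zero loss.
print(pair_loss([1.0, 0.0], [1.0, 0.0], 1.0))
# Identical directions labelled as dissimilar (target 0.0) are penalised.
print(pair_loss([1.0, 0.0], [1.0, 0.0], 0.0))
```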
Once we have all this in place, the last thing to do is to train the network.
Training the network
The last step we need to do to train the network is to simply call the trainer's `train()` method.
Once the training is done we want to evaluate if we are happy with the output. Here we will use accuracy:
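Accuracy is simply the share of predictions that match the true labels. The sketch below computes it by hand on made-up placeholder lists; with a SetFit trainer the equivalent would typically be `trainer.train()` followed by `trainer.evaluate()`.

```python
def accuracy(predictions, references):
    """Fraction of predictions that equal the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Placeholder values, not results from the tutorial's model.
preds = [1, 0, 1, 0]
refs = [1, 0, 0, 0]
print(accuracy(preds, refs))  # 3 of 4 match
```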
Here on my run I was getting in the region of 70% accuracy, which is not great, but for a small number of examples, it actually does rather well. This will vary by the size of the dataset you use, as well as the neural network hyperparameters you use in the model. Let’s say I was happy to now productionise this model for ‘all and sundry’ to use, then the next steps would be how we would achieve this.
Deploying model step…
We will use the HuggingFace hub, as we have been doing throughout, although there will be a local version that is saved to your machine with the same model name, so it could be deployed locally. The following steps will show you how to publish and push this model to the hub.
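Publishing can be sketched as below, assuming the trainer exposes `push_to_hub` and its model exposes `save_pretrained`. The helper `publish_model`, the repository name and the local directory are hypothetical.

```python
def publish_model(trainer, repo_name, local_dir=None):
    """Optionally save a local copy, then push the trained model to the Hub."""
    if local_dir is not None:
        trainer.model.save_pretrained(local_dir)  # local copy for offline use
    trainer.push_to_hub(repo_name)  # requires a prior Hub login
    return repo_name

# publish_model(trainer, "my-setfit-model", local_dir="my-setfit-model")
```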
Now this has been actioned I should see my new model appear in my HuggingFace page:
The next step we move on to is pulling this model down from the HuggingFace hub and using this to make inferences.
The model inference step…
We have come far. This is the last piece of the puzzle. We have tested our model on an evaluation set and have then pushed it to the HuggingFace hub to use. Now we will pass a few examples through our model and see what classifications we get back.
Step one: The first step would be to pull the model down from the hub:
Step two: next we will put some inputs into a list, pass the list to the model and see what it returns:
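Steps one and two can be sketched together. The model identifier is a placeholder, the example texts are neutral stand-ins, and `classify` is a made-up helper; it relies on a SetFit model being callable on a list of strings.

```python
def classify(texts, model_id):
    """Download a SetFit model from the Hub and predict a class per text."""
    from setfit import SetFitModel  # imported lazily; needs the setfit package

    model = SetFitModel.from_pretrained(model_id)
    return model(texts)  # one prediction per input text

# examples = ["first placeholder post", "second placeholder post"]
# preds = classify(examples, "<user>/<model-name>")  # placeholder model id
```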
Step three: we will use the predictions and then create a helper function to get our labels back from the function, as these have been dummy encoded to 1 = risk and 0 = no risk:
This requires a little more explanation. I will break this down:
- We converted our PyTorch tensor into a NumPy array and then cast that to a `list()`. This will give us a list of the class predictions
- We create a function called `return_label()` which has parameters of the val to look up and a list of labels to pass in
- This is a simple if block which states that if the encoding equals 1 then the value is the first element of the labels list, abuse. Otherwise, it is the second element, not abuse.
- We then use the `map()` function to pass the results list to the function i.e. for every element in the list apply the function to assign the label
- Finally, we store the mapped descriptions in a list and assign this to a variable called risks
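The bullets above translate into only a few lines. This is a reconstruction from the description rather than the notebook's exact code, and the `results` list is a placeholder for the converted predictions (for example `list(preds.numpy())`).

```python
def return_label(val, labels):
    """Map an encoded prediction to its label: 1 -> labels[0], else labels[1]."""
    if val == 1:
        return labels[0]
    return labels[1]

labels = ["abuse", "not abuse"]
results = [1, 0, 1]  # placeholder class predictions

# Apply return_label to every prediction and keep the names in a list.
risks = list(map(lambda val: return_label(val, labels), results))
print(risks)
```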
Step four (the finale): all that is left to do is use the `zip()` function to pair each input text with its label as a tuple, wrap this in an outer `list()` to turn the pairs into a list of rows, and convert the result to a pandas `DataFrame()`:
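A short sketch of this final step, with assumed column names and placeholder inputs. The pandas import is kept inside `to_frame` so the pairing logic can be tried on its own.

```python
def pair_rows(texts, risks):
    """Zip texts with their labels and materialise the pairs as a list."""
    return list(zip(texts, risks))

def to_frame(rows):
    """Turn the (text, label) rows into a pandas DataFrame."""
    import pandas as pd  # imported lazily; needs pandas

    return pd.DataFrame(rows, columns=["text", "label"])  # assumed column names

rows = pair_rows(["post one", "post two"], ["abuse", "not abuse"])
print(rows)
# df = to_frame(rows)
```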
The final question – does our model do an okay job with just those few training examples?
Yes, it does a perfect job on those few examples. Can you imagine how much you could improve your estimations if you fed this model a couple of thousand examples? The research suggests that this outperforms standard few shot learning models.
Where can I get the code?
I have captured the working notebooks in my GitHub repository. Feel free to download the workbooks from there: https://github.com/StatsGary/transformers-playground/blob/main/09_SetFit_Custom_Data.ipynb.
I hope you found this tutorial useful and can now see the power of what this model has to offer. Please have a go and adapt it with your own dataset.
Let’s make the neural net-work!