
Few Shot Learning with SetFit


HuggingFace have been working on a model that can be used with small datasets. The aim is to leverage a pretrained transformer and use contrastive learning to augment and extend the dataset, pairing examples with similar labels so that they sit close together in the same embedding space.

In this tutorial I will talk you through what SetFit is and how to fine tune the model to provide a way to do classification with a smaller sized dataset.

Let’s jump into it!

What is SetFit?

To illustrate what SetFit does I will use a visual from the HuggingFace GitHub repository that shows you how this multi-step approach works:

SetFit takes advantage of Sentence Transformers’ ability to generate dense embeddings based on paired sentences. These are the steps it takes to get there:

  1. In the initial fine-tuning phase, it makes use of the limited labeled input data, as shown in the few-shot training data section, through contrastive training, where positive and negative pairs are created by in-class and out-of-class selection. Basically, it looks for similarity between the sentences and tags each pair as either in-class (positive) or out-of-class (negative).
  2. Next, the Sentence Transformer model trains on these pairs (or triplets) and generates dense vectors (where the matrix contains many encodings, as opposed to a sparse matrix) per example. In the second step, the classification head trains on the encoded embeddings (a numerical representation of each word and where it is positioned in the sentence) with their respective class labels.
  3. At inference time, the unseen example passes through the fine-tuned Sentence Transformer, generating an embedding that when fed to the classification head outputs a class label prediction.

The models are also adaptable: a simple switch turns this from an `english` model into a multilingual one.

The other advantages, as listed on the supporting repository, are:

For the supporting research behind this model, read here: https://github.com/huggingface/setfit.

The data part…

We are going to be working with a custom social media dataset, containing only a handful of examples labelled as online abuse alongside examples that are not abuse. In these steps we will use the supporting GitHub repository to load the data from a CSV file and then push this into HuggingFace as a Dataset.

The dataset can be viewed in the viewer:

As you can see, each row contains a text and label pair. This is the result of pushing our CSV file to the Hub as a dataset. The supporting repository with the raw data can be found here: https://github.com/StatsGary/transformers-playground/blob/main/data/smabuse.csv.

This will be our end result, but first we will step through importing the relevant packages and loading the data.

Getting our imports installed

The first stage of this will be to install all the packages we are going to need to work with datasets, and later, to fine tune our model. These are implemented below:
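Something like the following covers the installs and imports used throughout; the exact set (and versions) in the original notebook may differ:

```python
# Install the libraries used in this tutorial (run once per environment)
# !pip install setfit datasets sentence-transformers pandas

from datasets import load_dataset, DatasetDict                  # loading and structuring the data
from setfit import SetFitModel, SetFitTrainer                   # few-shot model and trainer
from sentence_transformers.losses import CosineSimilarityLoss   # contrastive loss used during fine tuning
import pandas as pd                                              # tidying up predictions at the end
```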

Great – now that we have these imports, we will move on to loading our custom CSV dataset. This is covered in the following section.

Loading in the custom dataset

As specified above, we will load this dataset: https://github.com/StatsGary/transformers-playground/blob/main/data/smabuse.csv into our Python notebook, or Python script file. We will use the load_dataset function from the Datasets package. This is implemented as below:
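A minimal sketch of those two commands, assuming a local copy of smabuse.csv and an 80/20 train/evaluation split (the exact split in the original notebook may differ):

```python
# Load the CSV twice, once for each split, assuming data/smabuse.csv is available locally
train_dataset = load_dataset("csv", data_files="data/smabuse.csv", split="train[:80%]")
eval_dataset = load_dataset("csv", data_files="data/smabuse.csv", split="train[80%:]")
```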

To explain these lines: each call to load_dataset reads the CSV file and returns one split of the data as a Dataset.

Running both of these commands, you will end up with two separate datasets stored in memory.

The next step is to combine these into a DatasetDict, as this is the format HuggingFace Datasets expects the file to be in.

Create the DatasetDict dictionary

In the last section we mentioned we would now need to take our inputs and load these into a DatasetDict format, which is essentially a dictionary of Dataset objects. This is how I have implemented it for our project in the notebook:
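A sketch of that step, using the two splits created above:

```python
# Combine both splits into a single DatasetDict, the structure HuggingFace Datasets expects
dataset = DatasetDict({
    "train": train_dataset,
    "validation": eval_dataset,
})
print(dataset)
```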

All this does is combine our train_dataset and eval_dataset into a single dictionary of datasets, which is now ready to be uploaded to HuggingFace Datasets for retrieval from the system.

Push our custom dataset to the hub

We will use the special command notebook_login() to log in to our HuggingFace account.
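A quick sketch of the login step:

```python
from huggingface_hub import notebook_login

# Opens a prompt in the notebook where you paste your access token
notebook_login()
```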

If you do not have a HuggingFace account, you will need to sign up for one at huggingface.co; once you have done this, you will need to obtain an access token.

This can be achieved by:

  1. Click your profile image at the top right of the screen
  2. From here, choose Settings
  3. On the left tab, choose Access Tokens
  4. Click New Token to generate a new token, as in the screenshot below:

Once you have a token you will need to register it against your account using huggingface-cli login. For the full set of steps, or if you get stuck with these instructions, go to: https://huggingface.co/docs/huggingface_hub/quick-start.

This should take care of the setup and now we will use the command push_to_hub to push our Dataset to HuggingFace.
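Something along these lines does the push; the repository name here is a placeholder, so use your own:

```python
# Push the DatasetDict created earlier to your account on the Hub
dataset.push_to_hub("your-username/sm-abuse")
```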

You can skip the next bit if you have the dataset already in your environment, but this is how you would pull the dataset from the hub.

Pulling the Dataset from the HuggingFace Hub

The below code will then be used to retrieve our brand new dataset from the HuggingFace Hub:
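A sketch of the retrieval step, again with a placeholder repository id:

```python
# Pull the dataset back down from the Hub
dataset = load_dataset("your-username/sm-abuse")
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]
```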

A couple of things to note: the repository id passed to load_dataset needs to match the one you pushed to earlier, and the call returns a DatasetDict containing both splits.

This brings back our train and validation Datasets.

Bolster the classes with simulated oversampling

Because we have a small dataset, we will bolster the number of examples we have. SetFit can reportedly perform very accurately with around 2,000 examples; as we are nowhere near that, we will make do with just a few.
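One way to simulate oversampling is simply to repeat the small training split several times; this is a sketch rather than the exact code from the notebook:

```python
from datasets import concatenate_datasets

# Repeat the training split 8 times and shuffle, so the contrastive
# pair generation has more examples per class to draw from
train_dataset = concatenate_datasets([train_dataset] * 8).shuffle(seed=42)
print(train_dataset)
```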

This will essentially multiply the number of examples per class by 8, giving us a larger training set to work with. Not ideal, but useful for this tutorial.

The modelling part…

In this section we will:

  1. Load a pre-trained Sentence Transformer model
  2. Fine tune the model head with the SetFit trainer
  3. Train the network and evaluate it

Loading a pre-trained model

There are many sentence transformer models to choose from on HuggingFace: https://huggingface.co/sentence-transformers. Here I chose one of the general-purpose sentence transformers; these are used for sentence, text and image embeddings (Sentence-BERT was the first model described using this framework).
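Loading the chosen checkpoint as the SetFit body is a one-liner:

```python
# Load a pre-trained Sentence Transformer checkpoint into a SetFit model
model = SetFitModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")
```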

That is all there is to loading a pre-trained model. Take a look at the documentation for the chosen model: https://huggingface.co/sentence-transformers/all-mpnet-base-v2 and feel free to choose another model to see how this can improve results.

Fine tuning the model head with Trainer

The transformers package has a built-in training loop, which is a wrapper around a PyTorch training loop. This is a more complicated loop than you may have seen with other PyTorch implementations, so we will use the out-of-the-box functionality in the setfit package. For those that are interested in how the trainer.py file works, check out the associated repository for the package: https://github.com/huggingface/setfit/blob/main/src/setfit/trainer.py.

Below shows how to achieve this training step:
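A sketch of the trainer setup; the hyperparameter values here are illustrative, and the dataset is assumed to have `text` and `label` columns:

```python
# Configure the SetFit trainer with the model and datasets from earlier
trainer = SetFitTrainer(
    model=model,                      # pre-trained Sentence Transformer body
    train_dataset=train_dataset,      # oversampled training split
    eval_dataset=eval_dataset,        # held-out validation split
    loss_class=CosineSimilarityLoss,  # contrastive loss for the sentence pairs
    metric="accuracy",                # metric returned by trainer.evaluate()
    batch_size=16,
    num_iterations=20,                # number of contrastive pairs generated per example
    num_epochs=1,                     # epochs for training the classification head
)
```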

The parameters for the SetFit trainer are explained in the comments of the sketch above.

Once we have all this in place, the last thing to do is to train the network.

Training the network

The last step we need to do to train the network is to simply use trainer.train():
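In code, that is simply:

```python
# Fine tune the embedding body and fit the classification head
trainer.train()
```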

Once the training is done we want to evaluate if we are happy with the output. Here we will use accuracy:
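The evaluation call returns the metric configured on the trainer:

```python
# Evaluate on the validation split; returns a dict keyed by the metric, e.g. {'accuracy': ...}
metrics = trainer.evaluate()
print(metrics)
```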

Here on my run I was getting in the region of 70% accuracy, which is not great, but for a small number of examples it actually does rather well. This will vary with the size of the dataset you use, as well as the hyperparameters you choose for the model. Let's say I was happy to productionise this model for 'all and sundry' to use; the next steps show how we would achieve this.

Deploying model step…

We will use the HuggingFace hub, as we have been doing throughout, although there will be a local version that is saved to your machine with the same model name, so it could be deployed locally. The following steps will show you how to publish and push this model to the hub.
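A sketch of the publish step; the model repository name is a placeholder:

```python
# Push the fine-tuned model (body + head) to the Hub
trainer.push_to_hub("your-username/setfit-sm-abuse")
```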

Now this has been actioned, I should see my new model appear on my HuggingFace page:

The next step we move on to is pulling this model down from the HuggingFace hub and using this to make inferences.

The model inference step…

We have come far. This is the last piece of the puzzle. We have tested our model on an evaluation set and have then pushed it to the HuggingFace hub to use. Now we will pass a few examples through our model and see what classifications we get back.

Step one: pull the model down from the hub:
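Again with a placeholder repository id, matching whatever you pushed above:

```python
# Download the fine-tuned model from the Hub
model = SetFitModel.from_pretrained("your-username/setfit-sm-abuse")
```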

Step two: next we will put some example inputs in a list, pass them to the model and see what it returns:
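The sentences below are placeholders rather than the original examples:

```python
# A couple of made-up inputs to classify
inputs = [
    "You are an absolute waste of space and everyone knows it",
    "Great post, thanks for sharing the code",
]

# Calling the model returns one encoded label (0 or 1) per input
preds = model(inputs)
print(preds)
```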

Step three: we will take the predictions and create a helper function to map them back to their labels, as these have been dummy encoded to 1 = risk and 0 = no risk:
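A hypothetical helper along these lines would do the decoding:

```python
# The labels were dummy encoded as 1 = risk, 0 = no risk
LABEL_MAP = {1: "risk", 0: "no risk"}

def decode_labels(predictions):
    """Map encoded predictions back to their human-readable labels."""
    return [LABEL_MAP[int(pred)] for pred in predictions]

labels = decode_labels(preds)
print(labels)
```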

This requires a little more explanation; the comments in the helper sketch above break down each part.

Step four (the finale): all that is left to do is pair each input with its decoded label using the zip() function, wrap the result in list() so the zipped pairs become a list, and convert this into a pandas DataFrame():
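Which looks something like this:

```python
# Pair each input sentence with its decoded label and tidy into a DataFrame
results = pd.DataFrame(list(zip(inputs, labels)), columns=["text", "prediction"])
print(results)
```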

The final question – does our model do an okay job with just those few training examples?

Note – these are not my views and have been included for demonstrating the potential of the model

Yes, it does a perfect job on those few examples. Can you imagine how much you could improve your predictions if you fed this model a couple of thousand examples? The research suggests that this approach outperforms standard few-shot learning models.

Where can I get the code?

I have captured the working notebooks in my GitHub repository. Feel free to download the workbooks from there: https://github.com/StatsGary/transformers-playground/blob/main/09_SetFit_Custom_Data.ipynb.

Finishing up!

I hope you found this tutorial useful and can now see the power of what this model has to offer. Please have a go and adapt it with your own dataset.

Let’s make the neural net-work!
