Hugging Face NLP Course
Hugging Face - NLP Course
This note contains my notes on the Hugging Face NLP course: Introduction - Hugging Face NLP Course
1 Transformer Models
Natural Language Processing
The aim of NLP tasks is not only to understand single words, but also to understand the context of those words.
Some common NLP tasks (with examples):
- Classifying whole sentences: Get a sentiment of a review, detect if an email is spam
- Classifying each word in a sentence: Identifying grammatical components (noun, verb, adjective) or named entities (person, location)
- Generating text content
- Extracting an answer from a text: Given a question and context, extract the answer to the question
- Generating a new sentence from an input text: Translating text, summarizing text
NLP isn't limited to text only; it also tackles challenges in speech recognition and computer vision
How do Transformers work
The Transformer architecture was introduced in 2017. Transformer models can be grouped into three categories:
- GPT-like (auto-regressive Transformer models)
- BERT-like (auto-encoding Transformer model)
- BART/T5-like (sequence-to-sequence Transformer models)
Transformers are language models
All transformer model types mentioned have been trained as language models, meaning they were trained on large amounts of raw text in a self-supervised fashion.
Self-Supervised: The objective is automatically computed from the inputs of the model; no human-annotated labels are needed
These pretrained models develop a statistical understanding of language but are not very useful for specific practical tasks. Therefore, the model goes through transfer learning (fine-tuning in a supervised way).
Examples of tasks:
- Predicting the next word in a sentence having read the previous n words (causal language modeling)
- Predicting a masked word in a sentence (masked language modeling)
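Both objectives can be tried out with pipelines. A minimal sketch, assuming common checkpoints (gpt2 and bert-base-uncased are my choice of examples here, not prescribed by the course):
from transformers import pipeline
# Causal language modeling: continue a prompt word by word
generator = pipeline("text-generation", model="gpt2")
print(generator("In this course, we will teach you how to", max_length=30))
# Masked language modeling: predict the word hidden behind [MASK]
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("This course will teach you all about [MASK] models.", top_k=2))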
Transformers are big models
In general: Better performance is achieved by increasing model sizes and the amount of training data. Training models is very costly in time and compute resources (the CO2 impact of training a model can even be reported)
Therefore, sharing pretrained language models is paramount.
Transfer Learning
Pretraining: Act of training a model from scratch (randomly initialized weights, training without prior knowledge)
Pretraining takes in huge amounts of data and can take several weeks.
Fine-tuning: Training done after a model has been pretrained, to make it work for a specific use case
To perform fine-tuning, a pretrained language model is taken and additional training is performed with a dataset specific to the task. Advantages over training another model from scratch:
- Fine-tuning can take advantage of knowledge acquired by the initial model during pretraining (e.g. for general NLP problems, the model already has an understanding of language)
- Fine-tuning takes less data to lead to decent results
- The amount of time and resources needed for good results is much lower
For example, take a pretrained model on the English language and fine-tune it on an arXiv dataset, resulting in a science/research-based model. The knowledge of the pretrained model is transferred.
Transfer Learning: The process of fine-tuning a pretrained model, thereby transferring the knowledge gained during pretraining
General Architecture
Introduction
The model is primarily composed of two blocks:
- Encoder: Receives an input and builds a representation of it (its features). The model is optimized to acquire understanding from the input
- Decoder: Uses the encoder's representation along with other inputs to generate a target sequence. The model is optimized for generating outputs
These parts can be used independently, depending on the task:
- Encoder-only models: Good for tasks that require understanding of the input (sentence classification, named entity recognition)
- Decoder-only models: Good for generative tasks (text generation)
- Encoder-decoder models (also sequence-to-sequence models): Good for generative tasks that also require an input (translation, summarization)
Attention Layers
The attention layers tell the model to pay specific attention to certain words (and more or less ignore the others) when dealing with the representation of each word.
Consider the task of translating text from English to French:
- Input: "You like this course"
- To translate "like": "You" and "like" are important (verb "like" is conjugated based on the subject)
- To translate "this": "this" and "course" are important ("this" is translated differently depending on the associated noun being m or f)
- The other words are not important for the translatin of these specific words -> attention layer removes them
The Original Architecture
Transformers were initially designed for translation. During training, the encoder receives inputs in a certain language, while the decoder receives the same sentences in the desired target language. The attention layers in the encoder can use all the words in a sentence. The decoder, however, works sequentially and can only pay attention to the words already translated (e.g. when we have predicted the first three words of the translated target, we give them to the decoder to predict the fourth word).
To speed training up, the decoder is fed the whole target, but it is not allowed to use future words.
The first attention layer in a decoder block pays attention to all past inputs to the decoder, the second attention layer uses the output of the encoder. This is useful as different languages can put words in different orders.
The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words (e.g. padding tokens used to make all inputs the same length)
Encoder Models
How does it work?
- Encoders transform each input word into a feature vector (the length of the vector is defined by the architecture of the model)
- Important: Feature vector is not only a representation of the actual word, but also influenced by the context (words next to it)
- Feature vector therefore holds the meaning of the word within the text
General
- Use only the encoder of the transformer model
- At each stage, the attention layer can access all the words in the initial sentence
- Have bi-directional attention and are often called auto-encoding models
- Training of these models often involves corrupting a given sentence and tasking the model with reconstructing the initial sentence.
- Best suited for:
- Tasks requiring an understanding of the full sentence
- Named Entity Recognition
- Extractive question answering
- Masked language modeling (find masked word)
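As a quick illustration of an encoder-suited task, a named entity recognition pipeline can be used (a sketch; the pipeline picks a default encoder checkpoint, and the sentence is one of the course's examples):
from transformers import pipeline
# Named entity recognition with an encoder-based model (pipeline default checkpoint)
ner = pipeline("ner", grouped_entities=True)
print(ner("My name is Sylvain and I work at Hugging Face in Brooklyn."))
# expected: grouped entities such as a person, an organization and a location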
Examples
- BERT, ALBERT, DistilBERT, ELECTRA, RoBERTa
Decoder Models
How does it work?
- Words are passed through the decoder, which returns a feature vector per input word
- Compared to the encoder, the decoder uses a masked attention layer → Can only access words before the current word
- Feature vector output by the decoder is then transformed back to a word
General
- Only use the decoder of the transformer model
- Can only access the words positioned before it in the sentence
- Also called auto-regressive models
- Pretraining revolves around predicting the next word in the sentence
- Best suited for text generation
Examples
- GPT, GPT-2, CTRL, Transformer XL
Sequence-to-Sequence Models
How does it work?
- Encoder casts words to a feature vector based on the context
- Decoder takes outputs from the encoder and the start of sequence word
- This sequence was encoded by the encoder and the decoder uses this encoded information to predict a word
- This is repeated in an auto regressive way to predict complete sentences
- This is repeated until an end of the sequence is detected (e.g. a final punctuation mark or a special end-of-sequence token)
Example Workflow
- We want to translate the sentence "Welcome to NYC" to French
- The encoder encodes the sentence into a feature vector ("understand" the English sentence), which is fed to the Decoder
- The Start of Sequence word is then transformed using the decoder, leading to "Bienvenue"
- Then Bienvenue is added to the input sequence to the decoder, resulting in "à"
- "à" is added to the input sequence resulting in "NYC"
General
- Encoder is trained to understand the sequence, decoder is used to generate a sequence based on the understanding of the encoder
- For summarization, the decoder typically uses a much smaller context length than the encoder, leading to a shorter, summarized output
- Best suited for:
- Translation
- Summarization
- Generative question answering
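For example, translation with a sequence-to-sequence pipeline (a sketch; Helsinki-NLP/opus-mt-en-fr is an example checkpoint and the exact output text may vary):
from transformers import pipeline
# Translation with an encoder-decoder (sequence-to-sequence) model
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
print(translator("Welcome to NYC"))
# e.g. [{'translation_text': 'Bienvenue à NYC'}]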
Examples
- BART, mBART, Marian, T5
Bias and Limitations
These models are trained on huge amounts of data scraped from the internet. This can lead to biases, for example:
from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])
>> ['lawyer', 'carpenter', 'doctor', 'waiter', 'mechanic']
result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
>> ['nurse', 'waitress', 'teacher', 'maid', 'prostitute']
Even fine-tuning a model will not be able to remove this intrinsic bias!
2 Using Transformers
Behind the Pipeline
Looking at the code from the first chapter:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier(
[
"I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!",
]
)
Resulting in:
[{'label': 'POSITIVE', 'score': 0.9598047137260437},
{'label': 'NEGATIVE', 'score': 0.9994558095932007}]
The pipeline groups together several steps here:
Preprocessing with a Tokenizer
Transformer models can't process raw text directly, so the text has to be converted first. This is done using a Tokenizer, which is responsible for:
- Splitting the input into words, subwords or symbols (called tokens)
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model
This preprocessing has to be done exactly as when the model was pretrained!
The tokenizer can be retrieved from a model directly using:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
This tokenizer can then be used to transform the input sentence to tokens:
raw_inputs = [
"I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
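The result is a dictionary of PyTorch tensors; a quick sanity check of its structure (the shapes follow from the 16-token padded sequences mentioned below):
print(inputs["input_ids"].shape)       # torch.Size([2, 16])
print(inputs["attention_mask"].shape)  # torch.Size([2, 16]); 0s mark padding positions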
Going through the model
The pretrained model can be downloaded similar to the tokenizer:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
This model only returns the feature vectors (hidden states), not human-readable output. The output generally has three dimensions (checked in the snippet after the list):
- Batch size: number of sequences at the time (2 in our case for the two sentences)
- Sequence length: Length of the numerical representation of the sequence (16 for our sentences)
- Hidden size: Vector dimension of each model input (768 for this model, can reach up to 3072 or more for other models)
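These dimensions can be checked directly on the model output (the shape matches the values listed above):
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
>>> torch.Size([2, 16, 768])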
This output is then sent to the model head to be processed.
The model head is selected based on the task, here we want a sequence classification head:
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
The output of this model has much smaller dimensions now (in our case [2, 2]: two sentences, two labels). These outputs, however, still don't make sense as probabilities. Here we get back:
tensor([[-1.5607, 1.6123],
[ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)
Postprocessing of the output
The values returned by the model head are called logits (raw, unnormalized scores output by the last layer of the model).
Logit: The logit (log-odds) function maps probabilities onto an unbounded continuous scale; in deep learning, "logits" refers to the raw scores a model produces before they are turned into probabilities (e.g. by a softmax)
To get probabilities from logits they now need to go through a SoftMax layer:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
This then results in actual probabilities:
tensor([[4.0195e-02, 9.5980e-01],
[9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)
To get the labels associated with each position, we can use the id2label attribute of the model config and conclude:
- First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
- Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
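For example (the label mapping of this checkpoint):
print(model.config.id2label)
>>> {0: 'NEGATIVE', 1: 'POSITIVE'}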
Models
The AutoModel class can help to easily instantiate any type of model. If the exact model type you want to use is known, you can also directly use that specific class.
Let's look at this based on the BERT encoder model
Creating a transformer
First, the configuration object is loaded:
from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config
model = BertModel(config)
This configuration holds values like the hidden_size, the number of hidden layers, etc.
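Individual configuration values can be inspected directly (hidden_size and num_hidden_layers are standard BertConfig attributes; the values shown are the BERT-base defaults):
print(config.hidden_size)        # 768 by default
print(config.num_hidden_layers)  # 12 by default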
A model can be loaded with random initialization (as in the above code snippet), but will then only output gibberish and would need to be trained from scratch.
The better way is to load a Transformer model which is pretrained:
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")
This model is now initialized with all the weights of the checkpoint and can be used or fine-tuned directly.
A pre-trained model can also easily be saved:
model.save_pretrained("directory_on_my_computer")
Using a Transformer model for inference
A transformer can only handle tokenized inputs. If you have tokenized inputs in the form of a tensor, they can be fed to the model:
output = model(model_inputs)
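A minimal end-to-end sketch (checkpoint and sentences are arbitrary examples, not from the notes):
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

# Tokenize to PyTorch tensors, then feed them to the model
model_inputs = tokenizer(["Hello!", "Cool.", "Nice!"], padding=True, return_tensors="pt")
output = model(**model_inputs)
print(output.last_hidden_state.shape)  # (batch size, sequence length, hidden size)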
Tokenizers
Tokenizers convert the text input into data that can be processed by the model (→ numbers)
The goal of tokenization is to find a way to convert the text in a way which makes the most sense to the model while also minimizing the representation size.
Word-based
The first tokenizer that comes to mind is word-based → Split the sentence into words, assign a value to each word
This leads to a huge vocabulary, since every existing word of a language gets its own token (e.g. over 500'000 words for English).
Words like "dog" would also get a different representation than "dogs", even though they are closely related.
Normally a token is added for unknown words which aren't in our vocabulary (often represented as [UNK] or <unk>) → the goal is to minimize the number of unknown tokens
Character-based
Character based tokenizers split the text into characters, instead of words. This has two advantages:
- Vocabulary is much smaller
- There are fewer unknown tokens
The representation however is less meaningful: each character does not mean a lot on its own (depending on the language)
Also, the model needs to process a very large number of tokens
Subword Tokenization
Subword tokenization tries to get the best of the previous tokenizer approaches.
Frequently used words should not be split into smaller subwords, rare words however are decomposed.
For instance: "annoyingly" could be decomposed into "annoying" and "ly".
These subwords end up providing a lot of semantic meaning while being space efficient.
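This can be observed with a tokenizer (a sketch; the exact split depends on the learned vocabulary, so it may differ from the illustration above):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# A rare word is decomposed into known subword pieces
print(tokenizer.tokenize("annoyingly"))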
Loading and saving
Loading and saving tokenizers is as easy as with models.
Loading a model's tokenizer (e.g. BERT's) from a checkpoint can be done like this:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
The tokenizer can then be used to tokenize input:
tokenizer("Using a Transformer network is simple")
>>>
{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Encoding
Translating text to numbers is known as encoding. Encoding consists of two steps: tokenization and conversion to input IDs.
Tokenization only includes the splitting of the text into tokens:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)
>>> ['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
These tokens can then be transformed into input IDs:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
>>> [7993, 170, 11303, 1200, 2443, 1110, 3014]
Decoding
Decoding goes the other way around and decodes a set of input IDs back to the input text:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
>>> 'Using a Transformer network is simple'
Handling Multiple Sequences
Models generally expect a batch of inputs, not only a single sequence.
Model inputs are tensors, which means that all vectors need to have the same length. If this is not the case, we have to pad the input.
Each tokenizer has its padding ID defined in tokenizer.pad_token_id
The padding tokens have to be excluded from the prediction. If this is not done, it can lead to wrong predictions:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
[200, 200, 200],
[200, 200, tokenizer.pad_token_id],
]
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)
>>> tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
>>> tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
>>> tensor([[ 1.5694, -1.3895],
[ 1.3373, -1.2163]], grad_fn=<AddmmBackward>) # Predictions are different from single sequence prediction, due to the padding tokens
To exclude the padding token from the prediction use an attention mask:
batched_ids = [
[200, 200, 200],
[200, 200, tokenizer.pad_token_id],
]
attention_mask = [
[1, 1, 1],
[1, 1, 0],
]
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
>>> tensor([[ 1.5694, -1.3895],
[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
Longer sequences
Transformer models have a limit on the length of the sequences that can be passed to them. Most models handle up to 512 or 1024 tokens.
To overcome this problem, truncate your sequences.
Putting it all together
Tokenizers can tokenize the sequence automatically:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
They support single or multiple sequences and can pad automatically. Padding strategies can be configured:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
They also support truncation and returning tensors for a specific framework:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
Tokenizers also automatically add special tokens to the beginning and end of the sequence. These special tokens match the ones used when the model was trained.
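This can be seen by decoding the input IDs again (using the tokenizer and sequence from above; BERT-style tokenizers add [CLS] and [SEP], and this uncased checkpoint lowercases the text):
model_inputs = tokenizer(sequence)
print(tokenizer.decode(model_inputs["input_ids"]))
>>> "[CLS] i've been waiting for a huggingface course my whole life. [SEP]"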
3 Fine-Tuning a Pretrained Model
Processing The Data
Looking at the example from the previous chapter on how to train a sequence classifier on one batch in PyTorch:
import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification
# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
"I've been waiting for a HuggingFace course my whole life.",
"This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
# This is new
batch["labels"] = torch.tensor([1, 1])
optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
However, with only two sentences the model will not achieve very good results; instead we should use bigger datasets.
Hugging Face also provides datasets (through the Datasets library), which can be downloaded easily:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
>>> DatasetDict({
train: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 3668
})
validation: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 408
})
test: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 1725
})
})
The dataset contains a training, validation and test set.
We can access one entry of the dataset easily:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]
>>> {'idx': 0,
'label': 1,
'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}
We can see the meaning of the 'label' entry by inspecting the features:
raw_train_dataset.features
>>> {'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'idx': Value(dtype='int32', id=None)}
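ClassLabel can also convert between the integer and its name (a small sketch; int2str is a standard Datasets method):
raw_train_dataset.features["label"].int2str(1)
>>> 'equivalent'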
Preprocessing a Dataset
To prepare the dataset for training, we need to tokenize it first.
Make sure to use the same checkpoint for the tokenizer and the model → this way you ensure compatibility.
We can tokenize the complete dataset at once:
tokenized_dataset = tokenizer(
raw_datasets["train"]["sentence1"],
raw_datasets["train"]["sentence2"],
padding=True,
truncation=True,
)
This however only works if we have enough RAM to store the whole dataset in memory.
We can also use the Dataset.map() method, which gives us more flexibility for preprocessing beyond tokenization. It works by applying a function to each element of the dataset. A function that tokenizes our inputs could look like this:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
This function takes a dictionary and returns a new dictionary with the keys input_ids, attention_mask and token_type_ids. This also works if the example dictionary contains multiple samples, which allows us to use batched=True in our call to map(), significantly speeding up the tokenization process.
We also leave out padding here. This speeds up processing, because later we can pad to the maximum length per batch instead of the maximum length of the whole dataset.
Tokenization of the whole dataset using our map function can be done like this now:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets
>>>
DatasetDict({
train: Dataset({
features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
num_rows: 3668
})
validation: Dataset({
features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
num_rows: 408
})
test: Dataset({
features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
num_rows: 1725
})
})
Dynamic Padding
The function that puts together the samples inside a batch is called the collate function. The default collate function simply converts the samples to PyTorch tensors and concatenates them, which is only possible if they all have the same length.
With dynamic padding, we pad only as much as needed to have the same length in one batch, not the whole dataset, which improves training speed (if you are training on TPU, this might however have a bad effect on training speed!)
The DataCollatorWithPadding class handles exactly this dynamic padding automatically:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
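A quick check (following the course's example): take a few training samples, drop the string columns, and let the collator pad them to the longest sample in the batch:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})
# all tensors now share the length of the longest sample in this batch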
Fine-tuning a model with the Trainer API
Transformers provides the Trainer class, which helps to fine-tune any of the provided pretrained models. Training will run very slowly on a CPU.
The code from the previous section looks like this:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Training
First we need to define a TrainingArguments class, which contains all the hyperparameters the Trainer will use for training and evaluation. The only argument we have to provide is a directory where the trained model and checkpoints should be saved:
from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer")
Now we need to define a model; we will use the AutoModelForSequenceClassification class with two labels:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
We now get a warning because this BERT model has not been pretrained on classifying pairs of sentences: the pretraining head of the model is discarded and a new head suitable for sequence classification is added with randomly initialized weights.
Now that we have our model, we can define a Trainer by passing it all the objects we've initialized:
from transformers import Trainer
trainer = Trainer(
model,
training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
)
To now start the training we use:
trainer.train()
This will start fine-tuning, which takes a couple of minutes on a GPU (e.g. on Colab), and report the training loss every 500 steps. It doesn't tell you how well your model is performing, because:
- We didn't tell the Trainer to evaluate during training (by setting evaluation_strategy to "steps" or "epoch")
- We didn't provide the Trainer with a compute_metrics() function to calculate the metrics
Evaluation
Let's see how we can build a compute_metrics() function. It takes an EvalPrediction object and will return a dictionary mapping strings to floats. We can then use the Trainer.predict() command:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
>>> (408, 2) (408,)
The output of the predict() function is another named tuple with the fields predictions, label_ids and metrics.
To build our compute_metrics() function, we rely on metrics from the Evaluate library. We can load the metrics associated with the dataset directly:
import evaluate
import numpy as np
metric = evaluate.load("glue", "mrpc")
# The predictions are logits, so take the argmax to get the predicted class per sample
preds = np.argmax(predictions.predictions, axis=-1)
metric.compute(predictions=preds, references=predictions.label_ids)
>>> {'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}
This shows that our model has an accuracy of 85.78% and an F1 Score of 89.97%
F_1 = \frac{2}{\text{recall}^{-1}+\text{precision}^{-1}}=2\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\text{TP}}{2\text{TP}+\text{FP}+\text{FN}}
F1 is the harmonic mean of precision (the number of true positives divided by all positive predictions) and recall (the number of true positives divided by all actually positive samples), combining both in one single symmetric metric
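For example, with precision 0.8 and recall 0.9:
F_1 = 2 \cdot \frac{0.8 \cdot 0.9}{0.8 + 0.9} = \frac{1.44}{1.7} \approx 0.847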
Wrapping everything together into the compute_metrics() function:
import numpy as np

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
We can define a new trainer which can now output metrics at the end of each epoch:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
trainer = Trainer(
model,
training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
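With compute_metrics wired in, training can be launched again and will now report accuracy and F1 at the end of each epoch:
trainer.train()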