Hugging Face NLP Course
Hugging Face - NLP Course
This note contains my notes on the Hugging Face NLP course: Introduction - Hugging Face NLP Course
1 Transformer Models
Natural Language Processing
The aim of NLP tasks is not only to understand single words, but also to understand the context of those words.
Some common NLP tasks (with examples):
- Classifying whole sentences: Get a sentiment of a review, detect if an email is spam
- Classifying each word in a sentence: Identifying grammatical components (noun, verb, adjective) or named entities (person, location)
- Generating text content
- Extracting an answer from a text: Given a question and context, extract the answer to the question
- Generating a new sentence from an input text: Translating text, summarizing text
NLP isn't limited to text only; it also tackles challenges in speech recognition and computer vision
How do Transformers work
The Transformer architecture was introduced in 2017. Transformer models can be grouped into three categories:
- GPT-like (auto-regressive Transformer models)
- BERT-like (auto-encoding Transformer model)
- BART/T5-like (sequence-to-sequence Transformer models)
Transformers are language models
All transformer model types mentioned have been trained as language models, meaning they were trained on large amounts of raw text in a self-supervised fashion.
Self-Supervised: The objective is automatically computed from the inputs of the model; no human-annotated labels are needed
These pretrained models develop a statistical understanding of language but are not very useful for specific practical tasks. Therefore, the model goes through transfer learning (fine-tuning in a supervised way).
Examples of tasks:
- Predicting the next word in a sentence having read the previous n words (causal language modeling)
- Predicting a masked word in a sentence (masked language modeling)
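Both objectives can be tried out with pipelines. A minimal sketch, assuming common checkpoints (gpt2 and bert-base-uncased are my choice of examples here, not prescribed by the course):
from transformers import pipeline
# Causal language modeling: continue a prompt word by word
generator = pipeline("text-generation", model="gpt2")
print(generator("In this course, we will teach you how to", max_length=30))
# Masked language modeling: predict the word hidden behind [MASK]
unmasker = pipeline("fill-mask", model="bert-base-uncased")
print(unmasker("This course will teach you all about [MASK] models.", top_k=2))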
Transformers are big models
In general: Better performance is achieved by increasing model sizes and the amount of training data. Training models is very costly in time and compute resources (the CO2 impact of training a model can even be reported)
Therefore, sharing pretrained language models is paramount.
Transfer Learning
Pretraining: Act of training a model from scratch (randomly initialized weights, training without prior knowledge)
Pretraining takes in huge amounts of data and can take several weeks.
Fine-tuning: Training done after a model has been pretrained, to make it work for a specific use case
To perform fine-tuning, a pretrained language model is taken and additional training is performed with a dataset specific to the task. Advantages over training another model from scratch:
- Fine-tuning can take advantage of knowledge acquired by the initial model during pretraining (e.g. for general NLP problems, the model already has an understanding of language)
- Fine-tuning takes less data to lead to decent results
- The amount of time and resources needed for good results is much lower
For example, take a pretrained model on the English language and fine-tune it on an arXiv dataset, resulting in a science/research-based model. The knowledge of the pretrained model is transferred.
Transfer Learning: The process of fine-tuning a pretrained model, thereby transferring the knowledge gained during pretraining
General Architecture
Introduction
The model is primarily composed of two blocks:
- Encoder: Receives an input and builds a representation of it (its features). The model is optimized to acquire understanding from the input
- Decoder: Uses the encoder's representation along with other inputs to generate a target sequence. The model is optimized for generating outputs
These parts can be used independently, depending on the task:
- Encoder-only models: Good for tasks that require understanding of the input (sentence classification, named entity recognition)
- Decoder-only models: Good for generative tasks (text generation)
- Encoder-decoder models (also sequence-to-sequence models): Good for generative tasks that also require an input (translation, summarization)
Attention Layers
The attention layers tell the model to pay specific attention to certain words (and more or less ignore the others) when dealing with the representation of each word.
Consider the task of translating text from English to French:
- Input: "You like this course"
- To translate "like": "You" and "like" are important (verb "like" is conjugated based on the subject)
- To translate "this": "this" and "course" are important ("this" is translated differently depending on the associated noun being m or f)
- The other words are not important for the translatin of these specific words -> attention layer removes them
The Original Architecture
Transformers were initially designed for translation. During training, the encoder receives inputs in a certain language, while the decoder receives the same sentences in the desired target language. The attention layers in the encoder can use all the words in a sentence. The decoder, however, works sequentially and can only pay attention to the words already translated (e.g. when we have predicted the first three words of the translated target, we give them to the decoder to predict the fourth word).
To speed training up, the decoder is fed the whole target, but it is not allowed to use future words.
The first attention layer in a decoder block pays attention to all past inputs to the decoder, the second attention layer uses the output of the encoder. This is useful as different languages can put words in different orders.
The attention mask can also be used in the encoder/decoder to prevent the model from paying attention to some special words (e.g. padding tokens used to make all inputs the same length)
Encoder Models
How does it work?
- Encoders transform each input word into a feature vector (the length of the vector is defined by the architecture of the model)
- Important: Feature vector is not only a representation of the actual word, but also influenced by the context (words next to it)
- Feature vector therefore holds the meaning of the word within the text
General
- Use only the encoder of the transformer model
- At each stage, the attention layer can access all the words in the initial sentence
- Have bi-directional attention and are often called auto-encoding models
- Training of these models often involves corrupting a given sentence and tasking the model with reconstructing the initial sentence.
- Best suited for:
- Tasks requiring an understanding of the full sentence
- Named Entity Recognition
- Extractive question answering
- Masked language modeling (find masked word)
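As a quick illustration of an encoder-suited task, a named entity recognition pipeline can be used (a sketch; the pipeline picks a default encoder checkpoint, and the sentence is one of the course's examples):
from transformers import pipeline
# Named entity recognition with an encoder-based model (pipeline default checkpoint)
ner = pipeline("ner", grouped_entities=True)
print(ner("My name is Sylvain and I work at Hugging Face in Brooklyn."))
# expected: grouped entities such as a person, an organization and a location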
Examples
- BERT, ALBERT, DistilBERT, ELECTRA, RoBERTa
Decoder Models
How does it work?
- Words are passed through the decoder, which returns a feature vector per input word
- Compared to the encoder, the decoder uses a masked attention layer → Can only access words before the current word
- Feature vector output by the decoder is then transformed back to a word
General
- Only use the decoder of the transformer model
- Can only access the words positioned before it in the sentence
- Also called auto-regressive models
- Pretraining revolves around predicting the next word in the sentence
- Best suited for text generation
Examples
- GPT, GPT-2, CTRL, Transformer XL
Sequence-to-Sequence Models
How does it work?
- Encoder casts words to a feature vector based on the context
- Decoder takes outputs from the encoder and the start of sequence word
- This sequence was encoded by the encoder and the decoder uses this encoded information to predict a word
- This is repeated in an auto regressive way to predict complete sentences
- This is repeated until an end of the sequence is detected (e.g. a final punctuation mark or a special end-of-sequence token)
Example Workflow
- We want to translate the sentence "Welcome to NYC" to French
- The encoder encodes the sentence into a feature vector ("understand" the English sentence), which is fed to the Decoder
- The Start of Sequence word is then transformed using the decoder, leading to "Bienvenue"
- Then Bienvenue is added to the input sequence to the decoder, resulting in "à"
- "à" is added to the input sequence resulting in "NYC"
General
- Encoder is trained to understand the sequence, decoder is used to generate a sequence based on the understanding of the encoder
- For summarization, the decoder typically uses a much smaller context length than the encoder, leading to a shorter, summarized output
- Best suited for:
- Translation
- Summarization
- Generative question answering
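For example, translation with a sequence-to-sequence pipeline (a sketch; Helsinki-NLP/opus-mt-en-fr is an example checkpoint and the exact output text may vary):
from transformers import pipeline
# Translation with an encoder-decoder (sequence-to-sequence) model
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
print(translator("Welcome to NYC"))
# e.g. [{'translation_text': 'Bienvenue à NYC'}]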
Examples
- BART, mBART, Marian, T5
Bias and Limitations
These models are trained on huge amounts of data scraped from the internet. This can lead to biases, for example:
from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")
result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])
>> ['lawyer', 'carpenter', 'doctor', 'waiter', 'mechanic']
result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])
>> ['nurse', 'waitress', 'teacher', 'maid', 'prostitute']
Even fine-tuning a model will not be able to remove this intrinsic bias!
2 Using Transformers
Behind the Pipeline
Looking at the code from the first chapter:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier(
[
"I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!",
]
)
Resulting in:
[{'label': 'POSITIVE', 'score': 0.9598047137260437},
{'label': 'NEGATIVE', 'score': 0.9994558095932007}]
The pipeline groups together several steps here:
Preprocessing with a Tokenizer
Transformer models can't process raw text directly, so the text has to be converted first. This is done using a Tokenizer, which is responsible for:
- Splitting the input into words, subwords or symbols (called tokens)
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model
This preprocessing has to be done exactly as when the model was pretrained!
The tokenizer can be retrieved from a model directly using:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
This tokenizer can then be used to transform the input sentence to tokens:
raw_inputs = [
"I've been waiting for a HuggingFace course my whole life.",
"I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
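The result is a dictionary of PyTorch tensors; a quick sanity check of its structure (the shapes follow from the 16-token padded sequences mentioned below):
print(inputs["input_ids"].shape)       # torch.Size([2, 16])
print(inputs["attention_mask"].shape)  # torch.Size([2, 16]); 0s mark padding positions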
Going through the model
The pretrained model can be downloaded similar to the tokenizer:
from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
This model only returns the feature vectors (hidden states), not human-readable output. The output generally has three dimensions (checked in the snippet after the list):
- Batch size: number of sequences at the time (2 in our case for the two sentences)
- Sequence length: Length of the numerical representation of the sequence (16 for our sentences)
- Hidden size: Vector dimension of each model input (768 for this model, can reach up to 3072 or more for other models)
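These dimensions can be checked directly on the model output (the shape matches the values listed above):
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
>>> torch.Size([2, 16, 768])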
This output is then sent to the model head to be processed.
The model head is selected based on the task, here we want a sequence classification head:
from transformers import AutoModelForSequenceClassification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
The output of this model has much smaller dimensions now (in our case [2, 2]: two sentences, two labels). These outputs, however, still don't make sense as probabilities. Here we get back:
tensor([[-1.5607, 1.6123],
[ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)
Postprocessing of the output
The values returned by the model head are called logits (raw, unnormalized scores output by the last layer of the model).
Logit: The logit (log-odds) function maps probabilities onto an unbounded continuous scale; in deep learning, "logits" refers to the raw scores a model produces before they are turned into probabilities (e.g. by a softmax)
To get probabilities from logits they now need to go through a SoftMax layer:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
This then results in actual probabilities:
tensor([[4.0195e-02, 9.5980e-01],
[9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)
To get the labels associated with each position, we can use the id2label attribute of the model config and conclude:
- First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598
- Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005
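For example (the label mapping of this checkpoint):
print(model.config.id2label)
>>> {0: 'NEGATIVE', 1: 'POSITIVE'}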
Models
The AutoModel class can help to easily instantiate any type of model. If the exact model type you want to use is known, you can also directly use that specific class.
Let's look at this based on the BERT encoder model
Creating a transformer
First, the configuration object is loaded:
from transformers import BertConfig, BertModel
# Building the config
config = BertConfig()
# Building the model from the config
model = BertModel(config)
This configuration holds values like the hidden_size, the number of hidden layers, etc.
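Individual configuration values can be inspected directly (hidden_size and num_hidden_layers are standard BertConfig attributes; the values shown are the BERT-base defaults):
print(config.hidden_size)        # 768 by default
print(config.num_hidden_layers)  # 12 by default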
A model can be loaded with random initialization (as in the above code snippet), but will then only output gibberish and would need to be trained from scratch.
The better way is to load a Transformer model which is pretrained:
from transformers import BertModel
model = BertModel.from_pretrained("bert-base-cased")
This model is now initialized with all the weights of the checkpoint and can be used or fine-tuned directly.
A pre-trained model can also easily be saved:
model.save_pretrained("directory_on_my_computer")
Using a Transformer model for inference
A transformer can only handle tokenized inputs. If you have tokenized inputs in the form of a tensor, they can be fed to the model:
output = model(model_inputs)
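A minimal end-to-end sketch (checkpoint and sentences are arbitrary examples, not from the notes):
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

# Tokenize to PyTorch tensors, then feed them to the model
model_inputs = tokenizer(["Hello!", "Cool.", "Nice!"], padding=True, return_tensors="pt")
output = model(**model_inputs)
print(output.last_hidden_state.shape)  # (batch size, sequence length, hidden size)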
Tokenizers
Tokenizers convert the text input into data that can be processed by the model (→ numbers)
The goal of tokenization is to find a way to convert the text in a way which makes the most sense to the model while also minimizing the representation size.
Word-based
The first tokenizer that comes to mind is word-based → Split the sentence into words, assign a value to each word
This leads to a huge vocabulary, since every existing word of a language gets its own token (e.g. over 500'000 words for English).
Words like "dog" would also get a different representation than "dogs", even though they are closely related.
Normally a token is added for unknown words which aren't in our vocabulary (often represented as [UNK] or <unk>) → the goal is to minimize the number of unknown tokens
Character-based
Character based tokenizers split the text into characters, instead of words. This has two advantages:
- Vocabulary is much smaller
- There are fewer unknown tokens
The representation however is less meaningful: each character does not mean a lot on its own (depending on the language)
Also, the model needs to process a very large number of tokens
Subword Tokenization
Subword tokenization tries to get the best of the previous tokenizer approaches.
Frequently used words should not be split into smaller subwords, rare words however are decomposed.
For instance: "annoyingly" could be decomposed into "annoying" and "ly".
These subwords end up providing a lot of semantic meaning while being space efficient.
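This can be observed with a tokenizer (a sketch; the exact split depends on the learned vocabulary, so it may differ from the illustration above):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
# A rare word is decomposed into known subword pieces
print(tokenizer.tokenize("annoyingly"))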
Loading and saving
Loading and saving tokenizers is as easy as with models.
Loading a model's tokenizer (e.g. BERT's) from a checkpoint can be done like this:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
The tokenizer can then be used to tokenize input:
tokenizer("Using a Transformer network is simple")
>>>
{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
Encoding
Translating text to numbers is known as encoding. Encoding consists of two steps: tokenization and conversion to input IDs.
Tokenization only includes the splitting of the text into tokens:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
print(tokens)
>>> ['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
These tokens can then be transformed into input IDs:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
>>> [7993, 170, 11303, 1200, 2443, 1110, 3014]
Decoding
Decoding goes the other way around and decodes a set of input IDs back to the input text:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])
print(decoded_string)
>>> 'Using a Transformer network is simple'
Handling Multiple Sequences
Models generally expect a batch of inputs, not only a single sequence.
Model inputs are tensors, which means that all vectors need to have the same length. If this is not the case, we have to pad the input.
Each tokenizer has its padding ID defined in tokenizer.pad_token_id
The padding tokens have to be excluded from the prediction. If this is not done, it can lead to wrong predictions:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
[200, 200, 200],
[200, 200, tokenizer.pad_token_id],
]
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)
>>> tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
>>> tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
>>> tensor([[ 1.5694, -1.3895],
[ 1.3373, -1.2163]], grad_fn=<AddmmBackward>) # Predictions are different from single sequence prediction, due to the padding tokens
To exclude the padding token from the prediction use an attention mask:
batched_ids = [
[200, 200, 200],
[200, 200, tokenizer.pad_token_id],
]
attention_mask = [
[1, 1, 1],
[1, 1, 0],
]
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
>>> tensor([[ 1.5694, -1.3895],
[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
Longer sequences
Transformer models have a limit on the length of the sequences that can be passed to them. Most models handle up to 512 or 1024 tokens.
To overcome this problem, truncate your sequences.
Putting it all together
Tokenizers can tokenize the sequence automatically:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
They support single or multiple sequences and can pad automatically. Padding strategies can be configured:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")
# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
They also support truncation and returning tensors for a specific framework:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
Tokenizers also automatically add special tokens to the beginning and end of the sequence. These special tokens match the ones used when the model was trained.
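This can be seen by decoding the input IDs again (using the tokenizer and sequence from above; BERT-style tokenizers add [CLS] and [SEP], and this uncased checkpoint lowercases the text):
model_inputs = tokenizer(sequence)
print(tokenizer.decode(model_inputs["input_ids"]))
>>> "[CLS] i've been waiting for a huggingface course my whole life. [SEP]"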
3 Fine-Tuning a Pretrained Model
Processing The Data
Looking at the example from the previous chapter on how to train a sequence classifier on one batch in PyTorch:
import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification
# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
"I've been waiting for a HuggingFace course my whole life.",
"This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
# This is new
batch["labels"] = torch.tensor([1, 1])
optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
However, with only two sentences the model will not achieve very good results; instead we should use bigger datasets.
Hugging Face also provides datasets (through the Datasets library), which can be downloaded easily:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets
>>> DatasetDict({
train: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 3668
})
validation: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 408
})
test: Dataset({
features: ['sentence1', 'sentence2', 'label', 'idx'],
num_rows: 1725
})
})
The dataset contains a training, validation and test set.
We can access one entry of the dataset easily:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]
>>> {'idx': 0,
'label': 1,
'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}
We can see the meaning of the 'label' entry by inspecting the features:
raw_train_dataset.features
>>> {'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'idx': Value(dtype='int32', id=None)}
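ClassLabel can also convert between the integer and its name (a small sketch; int2str is a standard Datasets method):
raw_train_dataset.features["label"].int2str(1)
>>> 'equivalent'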
Preprocessing a Dataset
To prepare the dataset for training, we need to tokenize it first.
Make sure to use the same checkpoint for the tokenizer and the model → this way you ensure compatibility.
We can tokenize the complete dataset at once:
tokenized_dataset = tokenizer(
raw_datasets["train"]["sentence1"],
raw_datasets["train"]["sentence2"],
padding=True,
truncation=True,
)
This however only works if we have enough RAM to store the whole dataset in memory.
We can also use the Dataset.map() method, which gives us more flexibility for preprocessing beyond tokenization. It works by applying a function to each element of the dataset. A function that tokenizes our inputs could look like this:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
This function takes a dictionary and returns a new dictionary with the keys input_ids, attention_mask and token_type_ids. This also works if the example dictionary contains multiple samples, which allows us to use batched=True in our call to map(), significantly speeding up the tokenization process.
We also leave out padding here. This speeds up processing, because later we can pad to the maximum length per batch instead of the maximum length of the whole dataset.
Tokenization of the whole dataset using our map function can be done like this now:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets
>>>
DatasetDict({
train: Dataset({
features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
num_rows: 3668
})
validation: Dataset({
features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
num_rows: 408
})
test: Dataset({
features: ['attention_mask', 'idx', 'input_ids', 'label', 'sentence1', 'sentence2', 'token_type_ids'],
num_rows: 1725
})
})
Dynamic Padding
The function that puts together the samples inside a batch is called the collate function. The default collate function simply converts the samples to PyTorch tensors and concatenates them, which is only possible if they all have the same length.
With dynamic padding, we pad only as much as needed to have the same length in one batch, not the whole dataset, which improves training speed (if you are training on TPU, this might however have a bad effect on training speed!)
The DataCollatorWithPadding class handles exactly this dynamic padding automatically:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
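A quick check (following the course's example): take a few training samples, drop the string columns, and let the collator pad them to the longest sample in the batch:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
batch = data_collator(samples)
print({k: v.shape for k, v in batch.items()})
# all tensors now share the length of the longest sample in this batch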
Fine-tuning a model with the Trainer API
Transformers provides the Trainer class, which helps to fine-tune any of the provided pretrained models. Training will run very slowly on a CPU.
The code from the previous section looks like this:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
Training
First we need to define a TrainingArguments class, which contains all the hyperparameters the Trainer will use for training and evaluation. The only argument we have to provide is a directory where the trained model and checkpoints should be saved:
from transformers import TrainingArguments
training_args = TrainingArguments("test-trainer")
Now we need to define a model; we will use the AutoModelForSequenceClassification class with two labels:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
We now get a warning because this BERT model has not been pretrained on classifying pairs of sentences: the pretraining head of the model is discarded and a new head suitable for sequence classification is added with randomly initialized weights.
Now that we have our model, we can define a Trainer by passing it all the objects we've initialized:
from transformers import Trainer
trainer = Trainer(
model,
training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
)
To now start the training we use:
trainer.train()
This will start fine-tuning, which takes a couple of minutes on a GPU (e.g. on Colab), and report the training loss every 500 steps. It doesn't tell you how well your model is performing, because:
- We didn't tell the Trainer to evaluate during training (by setting evaluation_strategy to "steps" or "epoch")
- We didn't provide the Trainer with a compute_metrics() function to calculate the metrics
Evaluation
Let's see how we can build a compute_metrics() function. It takes an EvalPrediction object and will return a dictionary mapping strings to floats. We can then use the Trainer.predict() command:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
>>> (408, 2) (408,)
The output of the predict() function is another named tuple with the fields predictions, label_ids and metrics.
To build our compute_metrics() function, we rely on metrics from the Evaluate library. We can load the metrics associated with the dataset directly:
import evaluate
import numpy as np
metric = evaluate.load("glue", "mrpc")
# The predictions are logits, so take the argmax to get the predicted class per sample
preds = np.argmax(predictions.predictions, axis=-1)
metric.compute(predictions=preds, references=predictions.label_ids)
>>> {'accuracy': 0.8578431372549019, 'f1': 0.8996539792387542}
This shows that our model has an accuracy of 85.78% and an F1 Score of 89.97%
F_1 = \frac{2}{\text{recall}^{-1}+\text{precision}^{-1}}=2\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\text{TP}}{2\text{TP}+\text{FP}+\text{FN}}
F1 is the harmonic mean of precision (the number of true positives divided by all positive predictions) and recall (the number of true positives divided by all actually positive samples), combining both in one single symmetric metric
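For example, with precision 0.8 and recall 0.9:
F_1 = 2 \cdot \frac{0.8 \cdot 0.9}{0.8 + 0.9} = \frac{1.44}{1.7} \approx 0.847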
Wrapping everything together into the compute_metrics() function:
import numpy as np

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
We can define a new trainer which can now output metrics at the end of each epoch:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
trainer = Trainer(
model,
training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
data_collator=data_collator,
tokenizer=tokenizer,
compute_metrics=compute_metrics,
)
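With compute_metrics wired in, training can be launched again and will now report accuracy and F1 at the end of each epoch:
trainer.train()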