Advanced Machine Learning with Python (Session 1)

Fernando Cervantes

Workshop outcomes

  • Understand the process of training ML models.
  • Load pre-trained ML models and fine-tune them with new data.
  • Evaluate the performance of ML models.
  • Adapt ML models for different tasks from pre-trained models.


0. Setup environment

Select runtime and connect

On the top right corner of the page, click the drop-down arrow to the right of the Connect button and select Change runtime type.

Make sure the Python 3 runtime is selected. For this part of the workshop, a CPU runtime (no hardware accelerator) is enough.

Now we can connect to the runtime by clicking Connect. This will create a Virtual Machine (VM) with compute resources we can use for a limited amount of time.


In free Colab accounts these resources are not guaranteed and can be taken away without notice (preemptible machines).

Data stored in this runtime will be lost when the runtime is deleted, unless it is moved to other storage first.

1. What is Machine Learning (ML)?

Machine Learning (ML)

Sub-field of Artificial Intelligence that develops methods to address tasks that require human intelligence

Artificial intelligence tasks

Common tasks


what is this?


where is something?


where specifically is something?

More tasks addressed in recent years

  • Style transfer

  • Compression of image/video/etc…

  • Generation of content

  • Language processing

Types of machine learning

Depending on how the model is trained

  • Supervised

  • Unsupervised

  • Weakly supervised

  • Reinforcement learning

Inputs and outputs

For a task, we want to model the outcome/output (\(y\)) obtained by a given input (\(x\))

\(f(x) \approx y\)


The complete set of (\(x\), \(y\)) pairs is known as the dataset (\(X\), \(Y\)).


Inputs can be virtually anything, including images, texts, video, audio, electrical signals, etc.

Outputs, in contrast, are expected to be some meaningful piece of information, such as a category, a position, or a value.

Use case: Image classification with the CIFAR-100 dataset

import torch
import torchvision

cifar_ds = torchvision.datasets.CIFAR100(root="/tmp", train=True, download=True)
Files already downloaded and verified
x_im, y = cifar_ds[0]

len(cifar_ds), type(x_im), type(y)
(50000, PIL.Image.Image, int)
y = 19 (cattle)

Introduction to PyTorch

What is a tensor (PyTorch)?

A tensor is a multi-dimensional array. In PyTorch, the term generalizes the usual naming of variables to those with more than two dimensions.

  • zero-dimensional variables are points,
  • one-dimensional variables are vectors,
  • two-dimensional variables are matrices,
  • and variables with three or more dimensions are tensors.
import torch

x0 = torch.tensor(7.0) # This is a point (zero-dimensional)

x1 = torch.Tensor([15, 64, 123]) # This is a vector

x2 = torch.Tensor([[3, 6, 5],
                   [7, 9, 12],
                   [10, 33, 1]]) # This is a matrix

x3 = torch.Tensor([[[[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]],
                    [[2, 0, 1],
                     [0, 2, 3],
                     [4, 1, 5]]]]) # This is a tensor
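The number of dimensions of each object can be checked with the ndim and shape attributes. A minimal sketch, with illustrative variable names and values that are not part of the workshop code:

```python
import torch

# Illustrative values only; these names are not part of the workshop code.
point = torch.tensor(7.0)                          # zero-dimensional
vector = torch.tensor([15.0, 64.0, 123.0])         # one-dimensional
matrix = torch.tensor([[3.0, 6.0], [7.0, 9.0]])    # two-dimensional
cube = torch.zeros(2, 3, 3)                        # three-dimensional

dims = [t.ndim for t in (point, vector, matrix, cube)]
shapes = [tuple(t.shape) for t in (point, vector, matrix, cube)]

print(dims)    # [0, 1, 2, 3]
print(shapes)  # [(), (3,), (2, 2), (2, 3, 3)]
```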


We can use the utilities in torchvision to convert an image from PIL to tensor

from torchvision.transforms.v2 import PILToTensor

pre_process = PILToTensor()

x = pre_process(x_im)

x = x.float()

type(x), x.shape, x.dtype, x.min(), x.max()
 torch.Size([3, 32, 32]),


For convenience, PyTorch’s tensors have their channels axis before the spatial axes.
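Many image libraries store images in Height, Width, Channels (HWC) order instead. A small sketch of the conversion to CHW, assuming a random NumPy array as a stand-in for a CIFAR-100 image:

```python
import numpy as np
import torch

# A random 32x32 RGB image in HWC order, standing in for a CIFAR-100 sample.
rng = np.random.default_rng(0)
im_hwc = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Move the channels axis to the front to obtain CHW order.
x_chw = torch.from_numpy(im_hwc).permute(2, 0, 1)

print(im_hwc.shape, tuple(x_chw.shape))  # (32, 32, 3) (3, 32, 32)
```

To display a CHW tensor with matplotlib, the axes have to be permuted back with permute(1, 2, 0).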

from torchvision.transforms.v2 import Compose, PILToTensor, ToDtype

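pre_process = Compose([
  PILToTensor(), # Convert the PIL image into a CHW tensor
  ToDtype(torch.float32, scale=True) # Cast to float and scale the values to [0, 1]
])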

x = pre_process(x_im)

type(x), x.shape, x.dtype, x.min(), x.max()
 torch.Size([3, 32, 32]),



Exercise: Add the preprocessing pipeline to the CIFAR-100 dataset

cifar_ds = torchvision.datasets.CIFAR100(root="/tmp", train=True, download=True, transform=pre_process)
Files already downloaded and verified

Training, Validation, and Test data

Training set

The examples (\(x\), \(y\)) used to teach a machine/model to perform a task

Validation set

Used to measure the performance of a model during training

This subset is not used for training the model, so it is unseen data.

Test set

This set of samples is not used when training

Its purpose is to measure the generalization capacity of the model

Exercise: Load the test set and split the train set into train and validation subsets

cifar_test_ds = torchvision.datasets.CIFAR100(root="/tmp", train=False, download=True, transform=pre_process)
Files already downloaded and verified
from torch.utils.data import random_split

cifar_train_ds, cifar_val_ds = random_split(cifar_ds, (40_000, 10_000))
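The split is random, so it changes on every run unless a seeded generator is passed. A minimal sketch on a small stand-in dataset (toy_ds and the seed value are assumptions for illustration, not part of the workshop code):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# A small stand-in dataset of 100 (x, y) pairs; not the CIFAR-100 data.
toy_ds = TensorDataset(torch.arange(100).float().reshape(-1, 1),
                       torch.arange(100))

# Seeding the generator makes the split reproducible across runs.
gen = torch.Generator().manual_seed(0)
toy_train_ds, toy_val_ds = random_split(toy_ds, (80, 20), generator=gen)

# The two subsets are disjoint and together cover the whole dataset.
train_idx = set(toy_train_ds.indices)
val_idx = set(toy_val_ds.indices)

print(len(toy_train_ds), len(toy_val_ds), len(train_idx & val_idx))  # 80 20 0
```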

Deep Learning (DL) models

Models that construct knowledge in a hierarchical manner are considered deep models.

Exercise: Create a Logistic Regression model with PyTorch

import torch.nn as nn

lr_clf_1 = nn.Linear(in_features=3 * 32 * 32, out_features=100, bias=True)
lr_clf_2 = nn.Softmax(dim=1) # Normalize over the classes axis


We have to reshape x before feeding it to the model because x is an image with axes: Channels, Height, Width (CHW), but the Logistic Regression input should be a vector.

y_hat = lr_clf_2( lr_clf_1( x.reshape(1, -1) ))

type(y_hat), y_hat.shape, y_hat.dtype
(torch.Tensor, torch.Size([1, 100]), torch.float32)
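The softmax turns the 100 outputs into values that can be read as class probabilities. A short check, using a random stand-in input instead of a CIFAR-100 image (x_demo is a hypothetical name):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same structure as the logistic regression above, fed with a random input.
linear = nn.Linear(in_features=3 * 32 * 32, out_features=100, bias=True)
softmax = nn.Softmax(dim=1)

x_demo = torch.rand(1, 3, 32, 32)
probs = softmax(linear(x_demo.reshape(1, -1)))

print(probs.shape)          # torch.Size([1, 100])
print(probs.sum().item())   # approximately 1.0
print(probs.argmax(dim=1))  # index of the most likely class
```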

Exercise: Create a MultiLayer Perceptron (MLP) model with PyTorch

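mlp_clf = nn.Sequential(
  nn.Linear(in_features=3 * 32 * 32, out_features=1024, bias=True),
  nn.ReLU(), # Assumed activation; the original line is missing from this listing
  nn.Linear(in_features=1024, out_features=100, bias=True),
  nn.Softmax(dim=1) # Squash the outputs into probabilities
)

y_hat = mlp_clf(x.reshape(1, -1))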

type(y_hat), y_hat.shape, y_hat.dtype
(torch.Tensor, torch.Size([1, 100]), torch.float32)

Model optimization

Model fitting/training

A model's behavior depends directly on the values of its set of parameters \(\theta\).

  • \(f(x) \approx y\)
  • \(f_\theta(x) = y + \epsilon = \hat{y}\)


As models increase their number of parameters, they become more complex

Training is the process of optimizing the values of \(\theta\)
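To get a feeling for how quickly the number of parameters grows, they can be counted directly. A sketch under the assumption of a ReLU activation in the larger model; count_parameters is a helper written for this example, not a PyTorch function:

```python
import torch.nn as nn

def count_parameters(model):
    """Return the number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

small = nn.Linear(in_features=3 * 32 * 32, out_features=100)
larger = nn.Sequential(
    nn.Linear(in_features=3 * 32 * 32, out_features=1024),
    nn.ReLU(),  # assumed activation, it has no parameters
    nn.Linear(in_features=1024, out_features=100),
)

print(count_parameters(small))   # 307300
print(count_parameters(larger))  # 3249252
```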

Loss function

This is a measure of the difference between the expected outputs and the predictions made by a model \(L(Y, \hat{Y})\).


We look for smooth loss functions whose gradient we can compute.

Loss function for regression

In the case of regression tasks we generally use the Mean Squared Error (MSE).

\(MSE=\frac{1}{N}\sum\limits_i^N \left(y_i - \hat{y_i}\right)^2\)
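As a quick check of the formula, the MSE can be computed by hand and compared with nn.MSELoss. The values below are made up for illustration:

```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_pred = torch.tensor([1.5, 2.0, 2.0, 5.0])

# Squared errors are 0.25, 0, 1 and 1, so the mean is 2.25 / 4.
mse_manual = ((y_true - y_pred) ** 2).mean()
mse_torch = nn.MSELoss()(y_pred, y_true)

print(mse_manual.item(), mse_torch.item())  # 0.5625 0.5625
```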

Loss function for classification

And for classification tasks we use the Cross Entropy (CE) function.

\(CE = -\frac{1}{N}\sum\limits_i^N\sum\limits_k^C y_{i,k} \log(\hat{y_{i,k}})\)

where \(C\) is the number of classes.


For the binary classification case:

\(BCE = -\frac{1}{N}\sum\limits_i^N \left(y_i \log(\hat{y_i}) + (1 - y_i) \log(1 - \hat{y_i})\right)\)
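The binary formula can be verified in the same way against PyTorch's functional API. A small sketch with made-up labels and probabilities:

```python
import torch
import torch.nn.functional as F

y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])   # expected labels
y_prob = torch.tensor([0.9, 0.2, 0.6, 0.4])   # predicted probabilities

# The BCE formula written out term by term.
bce_manual = -(y_true * torch.log(y_prob)
               + (1 - y_true) * torch.log(1 - y_prob)).mean()
bce_torch = F.binary_cross_entropy(y_prob, y_true)

print(bce_manual.item(), bce_torch.item())  # the two values agree
```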

Exercise: Define the loss function for the CIFAR-100 classification problem

loss_fun = nn.CrossEntropyLoss()


According to the PyTorch documentation, CrossEntropyLoss takes the unnormalized logits as input, not the probabilities themselves, because it applies the softmax internally. So, we don’t need to squash the output of the MLP model.

mlp_clf = nn.Sequential(
  nn.Linear(in_features=3 * 32 * 32, out_features=1024, bias=True),
  nn.ReLU(), # Assumed activation; the original line is missing from this listing
  nn.Linear(in_features=1024, out_features=100, bias=True),
  # nn.Softmax() # <- remove this line
)

We are using a PyTorch loss function, and it expects PyTorch’s tensors as arguments, so we have to convert y to tensor before computing the loss function.

loss = loss_fun(y_hat, torch.LongTensor([y]))

tensor(4.6085, grad_fn=<NllLossBackward0>)
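The value above is close to \(\ln(100) \approx 4.605\), which is what a model with no preference among the 100 classes scores. The sketch below checks this, and also that CrossEntropyLoss on logits matches the CE formula applied to softmax probabilities. The logits and targets are random stand-ins, not outputs of the workshop model:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_classes = 100

logits = torch.randn(4, num_classes)       # raw model outputs (stand-ins)
targets = torch.tensor([19, 3, 42, 7])     # class indices (stand-ins)

# CrossEntropyLoss applies the log-softmax internally.
ce_torch = nn.CrossEntropyLoss()(logits, targets)

# The CE formula written out with one-hot targets and probabilities.
probs = logits.softmax(dim=1)
one_hot = F.one_hot(targets, num_classes).float()
ce_manual = -(one_hot * probs.log()).sum(dim=1).mean()

# A model with no preference (all logits equal) scores ln(C).
ce_uniform = nn.CrossEntropyLoss()(torch.zeros(4, num_classes), targets)

print(ce_torch.item(), ce_manual.item())         # the two values agree
print(ce_uniform.item(), math.log(num_classes))  # both approximately 4.605
```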

Gradient based optimization


Gradient-based methods are able to fit large numbers of parameters when using a smooth Loss function as target.


We compute the gradient of the loss function with respect to the model parameters using the chain rule from calculus. Generally, this is managed by machine learning packages such as PyTorch and TensorFlow with a method called back-propagation.

Gradient Descent

  • \(\theta^{t+1} = \theta^t - \eta \nabla_\theta L(Y, \hat{Y})\)

To back-propagate the gradients we call the backward() method of the loss tensor returned by the loss function.

loss = loss_fun(y_hat, torch.LongTensor([y]))
loss.backward()
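The gradient descent rule can be tried on a problem small enough to follow by hand. A minimal sketch, assuming a toy loss \(L(\theta) = (\theta - 3)^2\) and a learning rate of 0.1, neither of which comes from the workshop:

```python
import torch

# Minimize L(theta) = (theta - 3)^2, whose minimum is at theta = 3.
theta = torch.tensor(0.0, requires_grad=True)
eta = 0.1  # learning rate (assumed value)

for step in range(100):
    loss = (theta - 3.0) ** 2
    loss.backward()                 # d(loss)/d(theta) is stored in theta.grad

    with torch.no_grad():
        theta -= eta * theta.grad   # theta^{t+1} = theta^t - eta * gradient

    theta.grad.zero_()              # reset the accumulated gradient

print(theta.item())  # close to 3.0
```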


Stochastic methods


The gradient descent method requires computing the loss function for the whole training set before performing a single update.

This can be inefficient when large volumes of data are used for training the model.

  • Stochastic methods use a relatively small sample of the training data, called a mini-batch, at a time.

  • This reduces the amount of memory used for the intermediate operations carried out during the optimization process.

Stochastic Gradient Descent (SGD)

  • \(\theta^{t+1} = \theta^t - \eta \nabla_\theta L(Y_{b}, \hat{Y_{b}})\)

  • \(\eta\) controls the update we perform on the current parameter’s values


This parameter in Deep Learning is known as the learning rate
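The update rule is what optim.SGD applies when no momentum or weight decay is set. A small sketch comparing one optimizer step with the same update written by hand, on a made-up two-parameter problem:

```python
import torch
import torch.optim as optim

eta = 0.01  # learning rate
theta = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = optim.SGD([theta], lr=eta)

loss = (theta ** 2).sum()   # the gradient is 2 * theta
loss.backward()

expected = theta.detach() - eta * theta.grad   # update written by hand
optimizer.step()                               # update done by PyTorch

print(theta.detach(), expected)  # both approximately [0.98, -1.96]
```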

Training with mini-batches


PyTorch can operate efficiently on multiple inputs at the same time. To do that, we can use a DataLoader to serve mini-batches of inputs.
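A DataLoader stacks individual samples along a new first axis, the batch axis. A sketch on a small random stand-in dataset (toy_ds is a hypothetical name, and the data is not CIFAR-100):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 random stand-in "images" with integer labels.
toy_ds = TensorDataset(torch.rand(10, 3, 32, 32), torch.randint(0, 100, (10,)))
toy_dl = DataLoader(toy_ds, batch_size=4, shuffle=False)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in toy_dl]
for shapes in batch_shapes:
    print(shapes)
# ((4, 3, 32, 32), (4,))
# ((4, 3, 32, 32), (4,))
# ((2, 3, 32, 32), (2,))   <- the last batch holds the remaining samples
```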

Exercise: Train the MLP classifier

from torch.utils.data import DataLoader

cifar_train_dl = DataLoader(cifar_train_ds, batch_size=128, shuffle=True)
cifar_val_dl = DataLoader(cifar_val_ds, batch_size=256)
cifar_test_dl = DataLoader(cifar_test_ds, batch_size=256)
import torch.optim as optim

optimizer = optim.SGD(mlp_clf.parameters(), lr=0.01, )

Gradients are accumulated on every iteration, so we need to reset the accumulator with optimizer.zero_grad() for every new batch.


To get the parameter values of the new iteration \(\theta^{t+1}\) we use optimizer.step() to compute the update step.
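The accumulation of gradients can be seen on a single parameter. A minimal sketch with a made-up scalar w, calling backward() twice before resetting:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3.0 * w).backward()
first = w.grad.item()    # gradient of 3 * w is 3

(3.0 * w).backward()     # without a reset, the new gradient is added
second = w.grad.item()

w.grad.zero_()           # reset, as optimizer.zero_grad() would
(3.0 * w).backward()
third = w.grad.item()

print(first, second, third)  # 3.0 6.0 3.0
```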

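for x, y in cifar_train_dl:
  optimizer.zero_grad() # Reset the accumulated gradients

  y_hat = mlp_clf( x.reshape(-1, 3 * 32 * 32) ) # Reshape it into a batch of vectors

  loss = loss_fun(y_hat, y)

  loss.backward() # Back-propagate the gradients
  optimizer.step() # Update the parameters
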

Exercise: Train the MLP classifier and track the training and validation loss


To extract the loss function’s value without anything else attached use loss.item().

train_loss = []
train_loss_avg = 0
total_train_samples = 0

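for x, y in cifar_train_dl:
  optimizer.zero_grad() # Reset the accumulated gradients

  y_hat = mlp_clf( x.reshape(-1, 3 * 32 * 32) ) # Reshape it into a batch of vectors

  loss = loss_fun(y_hat, y)

  train_loss.append(loss.item()) # Keep the loss of every mini-batch
  train_loss_avg += loss.item() * len(x)
  total_train_samples += len(x)

  loss.backward() # Back-propagate the gradients
  optimizer.step() # Update the parameters
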
train_loss_avg /= total_train_samples

Because we don’t train the model with the validation set, back-propagation and optimization steps are not needed.

Additionally, we wrap the loop with torch.no_grad() to prevent the generation of gradients that could fill the memory unnecessarily.
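The effect of torch.no_grad() can be checked on any module. A small sketch with a stand-in linear layer (layer and x_demo are hypothetical names):

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=8, out_features=2)
x_demo = torch.rand(4, 8)

tracked = layer(x_demo)          # builds a graph for back-propagation
with torch.no_grad():
    untracked = layer(x_demo)    # same values, no graph kept

print(tracked.requires_grad, untracked.requires_grad)  # True False
```

The outputs are numerically the same in both cases; only the bookkeeping needed for back-propagation is skipped.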

val_loss_avg = 0
total_val_samples = 0

with torch.no_grad():
  for x, y in cifar_val_dl:
    y_hat = mlp_clf( x.reshape(-1, 3 * 32 * 32) ) # Reshape it into a batch of vectors
    loss = loss_fun(y_hat, y)

    val_loss_avg += loss.item() * len(x)
    total_val_samples += len(x)

val_loss_avg /= total_val_samples

Exercise: Train the MLP classifier and track the training and validation loss

import matplotlib.pyplot as plt

plt.plot(train_loss, "b-", label="Training loss")
plt.plot([0, len(train_loss)], [train_loss_avg, train_loss_avg], "r:", label="Average training loss")
plt.plot([0, len(train_loss)], [val_loss_avg, val_loss_avg], "b:", label="Average validation loss")
plt.legend() # Show the labels of each curve

Exercise: Train the MLP classifier and track the training and validation loss through several epochs

num_epochs = 10
train_loss = []
val_loss = []

for e in range(num_epochs):
  train_loss_avg = 0
  total_train_samples = 0

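  for x, y in cifar_train_dl:
    optimizer.zero_grad() # Reset the accumulated gradients

    y_hat = mlp_clf( x.reshape(-1, 3 * 32 * 32) ) # Reshape it into a batch of vectors

    loss = loss_fun(y_hat, y)

    train_loss_avg += loss.item() * len(x)
    total_train_samples += len(x)

    loss.backward() # Back-propagate the gradients
    optimizer.step() # Update the parameters

  train_loss_avg /= total_train_samples
  train_loss.append(train_loss_avg) # Keep the average training loss of this epoch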

  val_loss_avg = 0
  total_val_samples = 0

  with torch.no_grad():
    for x, y in cifar_val_dl:
      y_hat = mlp_clf( x.reshape(-1, 3 * 32 * 32) ) # Reshape it into a batch of vectors
      loss = loss_fun(y_hat, y)

      val_loss_avg += loss.item() * len(x)
      total_val_samples += len(x)

  val_loss_avg /= total_val_samples
  val_loss.append(val_loss_avg) # Keep the average validation loss of this epoch

Exercise: Show the progress of the training throughout the epochs

import matplotlib.pyplot as plt

plt.plot(train_loss, "b-", label="Average training loss")
plt.plot(val_loss, "r-", label="Average validation loss")
plt.legend() # Show the labels of each curve

Performance metrics


Used to measure how well a model carries out a task

  • \(f(x) \approx y\)

  • \(f(x) = y + \epsilon = \hat{y}\)


The output \(\hat{y}\) is called a prediction, a term borrowed from statistical regression analysis.


Selecting the correct performance metrics depends on the training type, task, and even the distribution of the data.
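Accuracy, the fraction of samples whose predicted class matches the expected one, is among the simplest metrics for classification. A sketch computing it by hand on made-up scores for 5 samples and 3 classes:

```python
import torch

# Stand-in scores for 5 samples and 3 classes.
scores = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.3, 0.2, 0.9],
                       [1.1, 0.4, 0.2],
                       [0.1, 0.2, 0.3]])
targets = torch.tensor([0, 1, 2, 1, 0])

predicted = scores.argmax(dim=1)                   # classes 0, 1, 2, 0, 2
accuracy = (predicted == targets).float().mean()   # 3 correct out of 5

print(accuracy.item())  # approximately 0.6
```

When classes are unbalanced, accuracy alone can be misleading, which is one reason the choice of metric depends on the distribution of the data.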

Exercise: Measure the accuracy of the MLP trained to classify images from CIFAR-100

!pip install torchmetrics
from torchmetrics.classification import Accuracy


train_acc_metric = Accuracy(task="multiclass", num_classes=100)

with torch.no_grad():
  for x, y in cifar_train_dl:
    y_hat = mlp_clf( x.reshape(-1, 3 * 32 * 32) )
    train_acc_metric(y_hat.softmax(dim=1), y)

  train_acc = train_acc_metric.compute()

print(f"Training acc={train_acc}")
Training acc=0.12927499413490295

Exercise: Measure the accuracy of the MLP trained to classify images from CIFAR-100

val_acc_metric = Accuracy(task="multiclass", num_classes=100)
test_acc_metric = Accuracy(task="multiclass", num_classes=100)

with torch.no_grad():
  for x, y in cifar_val_dl:
    y_hat = mlp_clf( x.reshape(-1, 3 * 32 * 32) )
    val_acc_metric(y_hat.softmax(dim=1), y)

  val_acc = val_acc_metric.compute()

  for x, y in cifar_test_dl:
    y_hat = mlp_clf( x.reshape(-1, 3 * 32 * 32) )
    test_acc_metric(y_hat.softmax(dim=1), y)

  test_acc = test_acc_metric.compute()

print(f"Validation acc={val_acc}")
print(f"Test acc={test_acc}")

Validation acc=0.125
Test acc=0.12290000170469284

Convolutional Neural Network (CNN or ConvNet)

Convolution layers

The most common operations in DL models for image processing are convolutions.

2D Convolution

The animation shows the convolution of a 7x7-pixel input image (bottom) with a 3x3-pixel kernel (moving window), which results in a 5x5-pixel output (top).
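The output size follows from how many positions the kernel can take along each axis. A sketch of that arithmetic, checked against nn.Conv2d; conv_output_size is a helper written for this example and assumes a dilation of 1:

```python
import torch
import torch.nn as nn

def conv_output_size(size, kernel_size, padding=0, stride=1):
    """Spatial output size of a convolution (dilation of 1 assumed)."""
    return (size + 2 * padding - kernel_size) // stride + 1

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=0)
out = conv(torch.rand(1, 1, 7, 7))   # batch of one 7x7 single-channel image

print(conv_output_size(7, 3), tuple(out.shape))  # 5 (1, 1, 5, 5)
print(conv_output_size(32, 7))                   # 26 for a 32x32 CIFAR image
```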

Exercise: Visualize the effect of the convolution operation

conv_1 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=7, padding=0, bias=True)

x, _ = next(iter(cifar_train_dl))

fx = conv_1(x)

type(fx), fx.dtype, fx.shape, fx.min(), fx.max()
 torch.Size([128, 1, 26, 26]),
 tensor(-0.1479, grad_fn=<MinBackward1>),
 tensor(1.0583, grad_fn=<MaxBackward1>))


The convolution layer is initialized with random values, so the results will vary.

Exercise: Visualize the effect of the convolution operation

plt.rcParams['figure.figsize'] = [5, 5]

fig, ax = plt.subplots(1, 2)
ax[0].imshow(x[0].permute(1, 2, 0))
ax[1].imshow(fx.detach()[0, 0], cmap="gray")


By default, outputs from PyTorch modules are tracked for back-propagation.

To visualize it with matplotlib we have to .detach() the tensor first.

Exercise: Visualize the effect of the convolution operation

conv_1.weight.shape
torch.Size([1, 3, 7, 7])
fig, ax = plt.subplots(2, 2)
ax[0, 0].imshow(conv_1.weight.detach()[0, 0], cmap="gray")
ax[0, 1].imshow(conv_1.weight.detach()[0, 1], cmap="gray")
ax[1, 0].imshow(conv_1.weight.detach()[0, 2], cmap="gray")
ax[1, 1].set_axis_off()

Exercise: Visualize the effect of the convolution operation

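conv_1 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, padding=0, bias=False)

# Overwrite the random weights with a kernel that copies the second (green) channel
conv_1.weight.data[:] = torch.FloatTensor([
  [[[0, 0, 0],
    [0, 0, 0],
    [0, 0, 0]],
   [[0, 0, 0],
    [0, 1, 0],
    [0, 0, 0]],
   [[0, 0, 0],
    [0, 0, 0],
    [0, 0, 0]]]
])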

Exercise: Visualize the effect of the convolution operation

fx = conv_1(x)

fig, ax = plt.subplots(1, 2)
ax[0].imshow(x[0].permute(1, 2, 0))
ax[1].imshow(fx.detach()[0].permute(1, 2, 0))

Experiment with different values and shapes of the kernel

Exercise: Visualize the effect of the convolution operation

conv_1 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, padding=0, bias=False)

# Overwrite the random weights with a sharpening kernel on the first channel
conv_1.weight.data[:] = torch.FloatTensor([
  [[[0, -1, 0], [-1, 5, -1], [0, -1, 0]],
   [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
   [[0, 0, 0], [0, 0, 0], [0, 0, 0]]]
])

fx = conv_1(x)

fig, ax = plt.subplots(1, 2)
ax[0].imshow(x[0].permute(1, 2, 0))
ax[1].imshow(fx.detach()[0, 0], cmap="gray")

Experiment with different values and shapes of the kernel

Exercise: Visualize the effect of the convolution operation

conv_1 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3, padding=0, bias=False)

# Overwrite the random weights with a horizontal-gradient (vertical edge) kernel
conv_1.weight.data[:] = torch.FloatTensor([
  [[[1, 0, -1], [1, 0, -1], [1, 0, -1]],
   [[1, 0, -1], [1, 0, -1], [1, 0, -1]],
   [[1, 0, -1], [1, 0, -1], [1, 0, -1]]]
])

fx = conv_1(x)

fig, ax = plt.subplots(1, 2)
ax[0].imshow(x[0].permute(1, 2, 0))
ax[1].imshow(fx.detach()[0, 0], cmap="gray")

Experiment with different values and shapes of the kernel

Exercise: Implement and train the LeNet-5 model with PyTorch

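lenet_clf = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, bias=True),
    nn.ReLU(), # Assumed activation; the original lines are missing from this listing
    nn.MaxPool2d(kernel_size=2, stride=2), # Assumed pooling type: 28x28 -> 14x14
    nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, bias=True),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2), # 10x10 -> 5x5
    nn.Flatten(), # (N, 16, 5, 5) -> (N, 16*5*5)
    nn.Linear(in_features=16*5*5, out_features=120, bias=True),
    nn.ReLU(),
    nn.Linear(in_features=120, out_features=84, bias=True),
    nn.ReLU(),
    nn.Linear(in_features=84, out_features=100, bias=True),
)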

Pooling layers are used to downsample feature maps to summarize information from large regions.
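A small worked example of downsampling, assuming max pooling with a 2x2 window (the source does not state which pooling type is used) on a made-up 4x4 feature map:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

fmap = torch.tensor([[[[1.0, 2.0, 5.0, 6.0],
                       [3.0, 4.0, 7.0, 8.0],
                       [9.0, 1.0, 2.0, 3.0],
                       [1.0, 1.0, 4.0, 0.0]]]])   # shape (1, 1, 4, 4)

pooled = pool(fmap)   # each 2x2 region is replaced by its maximum

print(tuple(pooled.shape))   # (1, 1, 2, 2)
print(pooled[0, 0].tolist()) # [[4.0, 8.0], [9.0, 4.0]]

# Spatial size through a LeNet-style stack on a 32x32 input:
# conv 5x5 -> 28, pool -> 14, conv 5x5 -> 10, pool -> 5, hence 16*5*5 features.
```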

Exercise: Implement and train the LeNet-5 model with PyTorch

y_hat = lenet_clf(x)

type(y_hat), y_hat.dtype, y_hat.shape, y_hat.min(), y_hat.max()
 torch.Size([128, 100]),
 tensor(-0.1779, grad_fn=<MinBackward1>),
 tensor(0.1641, grad_fn=<MaxBackward1>))

Exercise: Implement and train the LeNet-5 model with PyTorch

num_epochs = 10
train_loss = []
val_loss = []

if torch.cuda.is_available():
  lenet_clf.cuda() # Move the model's parameters to the GPU

optimizer = optim.SGD(lenet_clf.parameters(), lr=0.01)

for e in range(num_epochs):
  train_loss_avg = 0
  total_train_samples = 0

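  for x, y in cifar_train_dl:
    optimizer.zero_grad() # Reset the accumulated gradients

    if torch.cuda.is_available():
      x = x.cuda()
    y_hat = lenet_clf( x ).cpu()

    loss = loss_fun(y_hat, y)

    train_loss_avg += loss.item() * len(x)
    total_train_samples += len(x)

    loss.backward() # Back-propagate the gradients
    optimizer.step() # Update the parameters

  train_loss_avg /= total_train_samples
  train_loss.append(train_loss_avg) # Keep the average training loss of this epoch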

  val_loss_avg = 0
  total_val_samples = 0

  with torch.no_grad():
    for x, y in cifar_val_dl:
      if torch.cuda.is_available():
        x = x.cuda()
      y_hat = lenet_clf( x ).cpu()
      loss = loss_fun(y_hat, y)

      val_loss_avg += loss.item() * len(x)
      total_val_samples += len(x)

  val_loss_avg /= total_val_samples
  val_loss.append(val_loss_avg) # Keep the average validation loss of this epoch

Exercise: Implement and train the LeNet-5 model with PyTorch

plt.plot(train_loss, "b-", label="Average training loss")
plt.plot(val_loss, "r-", label="Average validation loss")
plt.legend() # Show the labels of each curve

Exercise: Implement and train the LeNet-5 model with PyTorch


val_acc_metric = Accuracy(task="multiclass", num_classes=100)
test_acc_metric = Accuracy(task="multiclass", num_classes=100)
train_acc_metric = Accuracy(task="multiclass", num_classes=100)

with torch.no_grad():
  for x, y in cifar_train_dl:
    if torch.cuda.is_available():
      x = x.cuda()
    y_hat = lenet_clf( x ).cpu()
    train_acc_metric(y_hat.softmax(dim=1), y)

  train_acc = train_acc_metric.compute()

  for x, y in cifar_val_dl:
    if torch.cuda.is_available():
      x = x.cuda()
    y_hat = lenet_clf( x ).cpu()
    val_acc_metric(y_hat.softmax(dim=1), y)

  val_acc = val_acc_metric.compute()

  for x, y in cifar_test_dl:
    if torch.cuda.is_available():
      x = x.cuda()
    y_hat = lenet_clf( x ).cpu()
    test_acc_metric(y_hat.softmax(dim=1), y)

  test_acc = test_acc_metric.compute()

print(f"Training acc={train_acc}")
print(f"Validation acc={val_acc}")
print(f"Test acc={test_acc}")

Training acc=0.02437499910593033
Validation acc=0.020899999886751175
Test acc=0.02250000089406967