PyTorch Basics
I never thought I’d start training neural networks one day.
My main reference is this.
PyTorch Basics
Tensors
The basic unit in NumPy is the ndarray, while in PyTorch it is the tensor. Tensors can be computed on the GPU.
Basic operations work the same way, such as ones_like, zeros, and size, and the indexing methods are also the same.
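A minimal sketch of these shared basics (my own example, not from the original post):
import torch

a = torch.zeros(2, 3)   # 2x3 tensor of zeros
b = torch.ones_like(a)  # tensor of ones with the same shape as a
print(a.size())         # torch.Size([2, 3])
print(b[0, 1])          # indexing works like NumPy: tensor(1.)
print(b[:, 1])          # slicing too: tensor([1., 1.])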
There are five basic data types:
- 32-bit float: torch.FloatTensor (default)
- 64-bit integer: torch.LongTensor
- 32-bit integer: torch.IntTensor
- 16-bit integer: torch.ShortTensor
- 64-bit float: torch.DoubleTensor
Besides these numeric types, there are also byte and char types.
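A small sketch (my own example) showing how to request a specific type, either via the dtype argument or the tensor classes listed above:
import torch

x = torch.zeros(3)                    # 32-bit float by default
print(x.dtype)                        # torch.float32
i = torch.zeros(3, dtype=torch.long)  # 64-bit integer
d = torch.DoubleTensor([1.0, 2.0])    # 64-bit float via the legacy class
print(i.dtype, d.dtype)               # torch.int64 torch.float64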
Special Tensor Operations
Addition method 1:
# x is assumed to be an existing 5x3 tensor from earlier, e.g. x = torch.randn(5, 3)
y = torch.rand(5, 3)
print(x + y)
tensor([[ 0.7808, -1.4388, 0.3151],
[-0.0076, 1.0716, -0.8465],
[-0.8175, 0.3625, -0.2005],
[ 0.2435, 0.8512, 0.7142],
[ 1.4737, -0.8545, 2.4833]])
Addition method 2:
print(torch.add(x, y))
tensor([[ 0.7808, -1.4388, 0.3151],
[-0.0076, 1.0716, -0.8465],
[-0.8175, 0.3625, -0.2005],
[ 0.2435, 0.8512, 0.7142],
[ 1.4737, -0.8545, 2.4833]])
Providing an output tensor as argument:
result = torch.empty(5, 3)
torch.add(x, y, out=result)
print(result)
tensor([[ 0.7808, -1.4388, 0.3151],
[-0.0076, 1.0716, -0.8465],
[-0.8175, 0.3625, -0.2005],
[ 0.2435, 0.8512, 0.7142],
[ 1.4737, -0.8545, 2.4833]])
In-place replacement:
# adds x to y
y.add_(x)
print(y)
tensor([[ 0.7808, -1.4388, 0.3151],
[-0.0076, 1.0716, -0.8465],
[-0.8175, 0.3625, -0.2005],
[ 0.2435, 0.8512, 0.7142],
[ 1.4737, -0.8545, 2.4833]])
Note
Any operation ending with _ modifies the variable in place. For example, x.copy_(y) and x.t_() both modify x.
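For example (my own sketch), using the in-place copy and transpose mentioned above:
import torch

x = torch.zeros(2, 3)
y = torch.rand(2, 3)
x.copy_(y)       # copies y into x in place; x now holds y's values
x.t_()           # transposes x in place; x is now 3x2
print(x.size())  # torch.Size([3, 2])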
Changing dimensions:
x = torch.randn(4, 4)
y = x.view(16)
z = x.view(-1, 8) # size -1 is inferred from other dimensions
print(x.size(), y.size(), z.size())
torch.Size([4, 4]) torch.Size([16]) torch.Size([2, 8])
If you have a tensor with only one element, use .item() to get the Python numeric value:
x = torch.randn(1)
print(x)
print(x.item())
tensor([-0.2368])
-0.23680149018764496
Conversion with NumPy
A Torch Tensor and its corresponding NumPy array share the underlying memory, so modifying one changes the other.
Converting a Torch Tensor to NumPy array:
a = torch.ones(5)
print(a)
tensor([1., 1., 1., 1., 1.])
b = a.numpy()
print(b)
[1. 1. 1. 1. 1.]
Observe how the NumPy array's value changes:
a.add_(1)
print(a)
print(b)
tensor([2., 2., 2., 2., 2.])
[2. 2. 2. 2. 2.]
Converting a NumPy Array to a Torch Tensor:
Use from_numpy for automatic conversion:
import numpy as np
a = np.ones(5)
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a)
print(b)
[2. 2. 2. 2. 2.]
tensor([2., 2., 2., 2., 2.], dtype=torch.float64)
Tensors live on the CPU by default, and all CPU tensor types except CharTensor support converting to NumPy and back.
Use the .to method to move Tensors to any device:
# The is_available function checks whether CUDA is available
# ``torch.device`` creates a device object that tensors can be moved to
if torch.cuda.is_available():
    device = torch.device("cuda")          # a CUDA device object
    y = torch.ones_like(x, device=device)  # create a tensor directly on the GPU
    x = x.to(device)                       # or use ``.to("cuda")`` to move a tensor to the GPU
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))       # ``.to`` can also change the dtype
tensor([0.7632], device='cuda:0')
tensor([0.7632], dtype=torch.float64)
Autograd
I didn’t quite understand what this is used for.
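For my own future reference, a minimal sketch of the usual usage (my own example): tensors created with requires_grad=True record the operations applied to them, and calling backward() on a scalar result computes gradients with respect to those tensors.
import torch

x = torch.ones(2, 2, requires_grad=True)  # track operations on x
y = (x * 3).sum()                         # a scalar built from x
y.backward()                              # compute dy/dx
print(x.grad)                             # a 2x2 tensor of 3s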
Neural Network Package & Optimizers
torch.nn is a modular interface specifically designed for neural networks. nn is built on top of Autograd and can be used to define and run neural networks. Here we mainly introduce some commonly used classes.
Convention: we alias torch.nn as nn for convenience. Besides nn, this chapter uses a few other aliases as well.
# First import the relevant packages
import torch
# Import torch.nn and set alias
import torch.nn as nn
# Print the version
torch.__version__
'1.0.0'
Besides the nn alias, we also use nn.functional, which contains commonly used functions for neural networks. These functions have no learnable parameters (for example ReLU, pooling, and dropout). They can be defined in the constructor or simply called in forward; the latter is generally recommended.
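A small sketch of the difference (my own example): nn.ReLU is a module that can live in a constructor, while the functional version is just a plain function call with no state.
import torch
import torch.nn as nn
import torch.nn.functional as F

t = torch.tensor([-1.0, 2.0])
relu_module = nn.ReLU()                # a module version, with no learnable parameters
print(relu_module(t))                  # tensor([0., 2.])
print(F.relu(t))                       # same result via the functional interface
print(list(relu_module.parameters()))  # [] -- nothing to learn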
By convention, nn.functional is aliased as uppercase F for convenient calling:
import torch.nn.functional as F
Defining a Network
PyTorch gives us a ready-made framework for network models: just inherit from nn.Module and implement its forward method, and PyTorch will automatically implement the backward function via autograd. In the forward function you can use any operation supported by tensors, as well as if statements, for loops, print, logging, and other ordinary Python syntax; it is written just like standard Python.
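Before reading the code, here is my own quick arithmetic for where the 1350 below comes from, assuming a 1x1x32x32 input and the standard output-size formula (W - K + 2P) / S + 1:
conv_out = (32 - 3) // 1 + 1     # 3x3 kernel, stride 1, no padding -> 30
pool_out = conv_out // 2         # 2x2 max pooling -> 15
flat = 6 * pool_out * pool_out   # 6 channels * 15 * 15 -> 1350, the in_features of fc1
print(conv_out, pool_out, flat)  # 30 15 1350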
class Net(nn.Module):
    def __init__(self):
        # An nn.Module subclass must call the parent class constructor in its own constructor
        super(Net, self).__init__()
        # Convolutional layer: '1' means the input image is single channel, '6' is the number of output channels, '3' means a 3*3 kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        # Linear layer: 1350 input features, 10 output features
        self.fc1 = nn.Linear(1350, 10)  # How is 1350 calculated? See the forward function below
    # Forward propagation
    def forward(self, x):
        print(x.size())  # Result: [1, 1, 32, 32]
        # Convolution -> Activation -> Pooling
        x = self.conv1(x)  # By the convolution output-size formula, the result is 30
        x = F.relu(x)
        print(x.size())  # Result: [1, 6, 30, 30]
        x = F.max_pool2d(x, (2, 2))  # 2x2 max pooling, the result is 15
        x = F.relu(x)
        print(x.size())  # Result: [1, 6, 15, 15]
        # reshape; '-1' means the size is inferred automatically
        # This flattens [1, 6, 15, 15] to [1, 1350]
        x = x.view(x.size()[0], -1)
        print(x.size())  # This is the 1350 input for the fc1 layer
        x = self.fc1(x)
        return x
net = Net()
print(net)
Net(
(conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
(fc1): Linear(in_features=1350, out_features=10, bias=True)
)
The network's learnable parameters are returned via net.parameters():
for parameters in net.parameters():
    print(parameters)
Parameter containing:
tensor([[[[ 0.2745, 0.2594, 0.0171],
[ 0.0429, 0.3013, -0.0208],
[ 0.1459, -0.3223, 0.1797]]],
...
[[[ 0.1691, -0.0790, 0.2617],
[ 0.1956, 0.1477, 0.0877],
[ 0.0538, -0.3091, 0.2030]]]], requires_grad=True)
...
net.named_parameters can return both the learnable parameters and their names:
for name, parameters in net.named_parameters():
    print(name, ':', parameters.size())
conv1.weight : torch.Size([6, 1, 3, 3])
conv1.bias : torch.Size([6])
fc1.weight : torch.Size([10, 1350])
fc1.bias : torch.Size([10])
input = torch.randn(1, 1, 32, 32)  # The input here corresponds to the 32 in forward
out = net(input)
out.size()
torch.Size([1, 10])
Before backpropagation, all parameter gradients must be zeroed:
net.zero_grad()
out.backward(torch.ones(1, 10))  # Backpropagation is implemented automatically by PyTorch
Note: torch.nn only supports mini-batches, not a single sample at a time; the input must always be a batch.
In other words, even a single sample must be treated as a batch, so every input gets an extra leading dimension. A single image for nn.Conv2d would be 3D (channels x height x width), but the input above has one more dimension, making it 4D, where the leading 1 is the batch size.
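For instance (my own sketch), a single 1x32x32 image can be given a fake batch dimension with unsqueeze:
single = torch.randn(1, 32, 32)  # one single-channel 32x32 image, no batch dimension
batched = single.unsqueeze(0)    # add a fake batch dimension at position 0
print(batched.size())            # torch.Size([1, 1, 32, 32]) -- what the network expects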
I feel I need to review coursera instead of looking at code here.
Forward Propagation, Backward Propagation, Neural Networks
I found these illustrations:

This is a neural network without hidden layers: input on the left, output on the right. Forward propagation goes from input to output; backward propagation passes the loss from the output back toward the input (just like logistic regression).

This is a neural network with hidden layers. Before, without hidden layers, we only needed to adjust one set of parameters; now the hidden layer's parameters need adjusting too, so we use the chain rule to do gradient descent.
The intermediate (hidden) layers and the output layer are where the legendary activation functions come in. Without activation functions, each layer's output would just be a linear function of the previous layer's input, and the network could not fit nonlinear functions.
The input data is multiplied by a weight and added to a bias, then an activation function is applied to get the neuron's output, which is passed on to the next layer's neurons. The weights and biases are the parameters the network learns, while settings such as L1/L2 regularization strength and batch size are hyperparameters.
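As a tiny sketch of that computation (my own example), one layer is just an affine transform followed by a nonlinearity:
x = torch.randn(1, 4)      # one sample with 4 input features
W = torch.randn(4, 3)      # weights: 4 inputs -> 3 neurons
b = torch.randn(3)         # one bias per neuron
h = torch.relu(x @ W + b)  # affine transform + activation = the layer's output
print(h.size())            # torch.Size([1, 3]), passed on to the next layer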
Loss Functions
In nn, PyTorch also provides commonly used loss functions. Below we use MSELoss to calculate mean squared error:
y = torch.arange(0,10).view(1,10).float()
criterion = nn.MSELoss()
loss = criterion(out, y)
# loss is a scalar, we can directly use item to get its Python numeric value
print(loss.item())
28.92203712463379
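Other losses work the same way; for example (my own sketch), nn.CrossEntropyLoss for classification takes raw scores and integer class labels:
criterion_ce = nn.CrossEntropyLoss()
logits = torch.randn(3, 5)                  # 3 samples, 5 classes (raw, unnormalized scores)
labels = torch.tensor([1, 0, 4])            # the correct class index for each sample
print(criterion_ce(logits, labels).item())  # a single scalar loss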
Optimizers
After backpropagation has computed all the parameter gradients, we still need an optimization method to update the network's weights. For example, the update rule for Stochastic Gradient Descent (SGD) is:
weight = weight - learning_rate * gradient
Most optimization methods are implemented in torch.optim, such as RMSProp, Adam, SGD, etc. Below we use SGD for a simple example:
import torch.optim
out = net(input)  # Calling the network here prints the sizes of x from our forward function
criterion = nn.MSELoss()
loss = criterion(out, y)
# Create a new optimizer, SGD only needs parameters to adjust and learning rate
optimizer = torch.optim.SGD(net.parameters(), lr = 0.01)
# Zero gradients first (same effect as net.zero_grad())
optimizer.zero_grad()
loss.backward()
# Update parameters
optimizer.step()
torch.Size([1, 1, 32, 32])
torch.Size([1, 6, 30, 30])
torch.Size([1, 6, 15, 15])
torch.Size([1, 1350])
This completes a full data propagation through the neural network using PyTorch.
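Putting the pieces together, a minimal training-loop sketch (my own example, reusing the net, input, y, criterion, and optimizer defined above):
for step in range(10):
    out = net(input)          # forward pass (this also prints the sizes from forward)
    loss = criterion(out, y)  # compute the loss against the target
    optimizer.zero_grad()     # clear old gradients
    loss.backward()           # backpropagate
    optimizer.step()          # update the parameters
    print(step, loss.item())  # the loss should decrease over the steps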
I didn’t look at the rest in detail.
Fine Tuning
This happens to match my current needs.
What do you do when you don't have much training data for a task? No worries: find a similar model that someone else has already trained, take that ready-made pretrained model, swap in your own data, adjust some parameters, and train again. This is fine-tuning. The classic network models provided with PyTorch (in torchvision) are pretrained on the ImageNet dataset, so they can serve as base models when our own training data is insufficient.
- If the new dataset is similar to the original dataset, you can directly fine-tune the last FC layer or specify a new classifier
- If the new dataset is small and quite different from the original dataset, you can start training from the middle of the model, only fine-tuning the last few layers
- If the new dataset is small and quite different from the original dataset, and the above method still doesn’t work, it’s best to retrain, only using the pretrained model as initialization data for a new model
- The new data must have the same input size as the original; for example, for a CNN the input image size must match, otherwise it will raise an error
- If the input sizes differ, you can add convolution or pooling layers before the last fc layer so that the final output matches the fc layer, but this significantly reduces accuracy, so it's not recommended
- Different layers can use different learning rates. Generally, layers initialized from the pretrained weights should use a learning rate smaller (usually about 10 times smaller) than the base learning rate used for the new layers. This keeps the pretrained weights from being distorted too quickly, while the newly initialized layers can converge quickly at the base learning rate (see the sketch after this list).
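A minimal sketch of what this looks like in code (my own example, assuming torchvision is installed and using resnet18 as the pretrained base; the class count and learning rates are illustrative):
import torchvision

model = torchvision.models.resnet18(pretrained=True)  # ImageNet-pretrained base model
model.fc = nn.Linear(model.fc.in_features, 2)         # replace the last FC layer for a 2-class task

# Pretrained layers get a smaller learning rate than the newly added classifier
base_params = [p for name, p in model.named_parameters() if not name.startswith("fc")]
optimizer = torch.optim.SGD([
    {"params": base_params, "lr": 0.001},  # pretrained layers: ~10x smaller
    {"params": model.fc.parameters()},     # new layer uses the base lr below
], lr=0.01, momentum=0.9)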
For more details, see here
There’s also some visualization content, will add later when needed.