Deep Learning with PyTorch: Getting to Know PyTorch (3) -- A Complete Model Training Workflow + Training on the GPU

山河忽晚 2025-06-29

1 Building the Training Model
This walkthrough uses the CIFAR10 dataset as the example.
1.1 Importing the required modules
```python
import torch
import torchvision
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter  # used for logging in section 1.7
```
1.2 Preparing the dataset
```python
train_data = torchvision.datasets.CIFAR10(
    root='./dataset',
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor()
)
test_data = torchvision.datasets.CIFAR10(
    root='./dataset',
    train=False,
    download=True,
    transform=torchvision.transforms.ToTensor()
)
```
1.3 Checking the dataset size
```python
train_data_size = len(train_data)
test_data_size = len(test_data)
print('Size of the training set: {}'.format(train_data_size))
print('Size of the test set: {}'.format(test_data_size))
```
1.4 Loading the datasets
```python
train_dataloader = DataLoader(train_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)
```
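Before training, it can help to peek at one batch and confirm the tensor shapes. A minimal sketch (not part of the original workflow), using the loaders defined above:

```python
# Sketch: pull a single batch from the training DataLoader and inspect it.
imgs, targets = next(iter(train_dataloader))
print(imgs.shape)     # torch.Size([64, 3, 32, 32]): 64 images, 3 channels, 32x32 pixels
print(targets.shape)  # torch.Size([64]): one class index per image
```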
1.5 Building the model
We take the classic CIFAR10 network structure as the example and implement it as the model below.
Create a model.py file to hold the custom model:
```python
import torch
from torch import nn


class TdModel(nn.Module):
    def __init__(self):
        super(TdModel, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2),   # 3x32x32  -> 32x32x32
            nn.MaxPool2d(2),                                        # 32x32x32 -> 32x16x16
            nn.Conv2d(32, 32, kernel_size=5, stride=1, padding=2),  # 32x16x16 -> 32x16x16
            nn.MaxPool2d(2),                                        # 32x16x16 -> 32x8x8
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),  # 32x8x8   -> 64x8x8
            nn.MaxPool2d(2),                                        # 64x8x8   -> 64x4x4
            nn.Flatten(),                                           # 64x4x4   -> 1024
            nn.Linear(64 * 4 * 4, 64),
            nn.Linear(64, 10)                                       # 10 CIFAR10 classes
        )

    def forward(self, x):
        return self.model(x)
```
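Before wiring the model into the training script, it is worth checking that the shapes line up (in particular the 64 * 4 * 4 input to the first Linear layer). A minimal sketch that could sit at the bottom of model.py:

```python
# Sketch: feed a dummy batch through TdModel to confirm the output shape.
if __name__ == '__main__':
    td = TdModel()
    dummy = torch.ones((64, 3, 32, 32))
    print(td(dummy).shape)  # expected: torch.Size([64, 10])
```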
1.6 Setting the training parameters
```python
from model import *

td = TdModel()

# Loss function: cross entropy for multi-class classification
lose_fn = nn.CrossEntropyLoss()

# Optimizer: stochastic gradient descent
learning_rate = 1e-2
optimizer = torch.optim.SGD(td.parameters(), lr=learning_rate)

# Step counters and number of training epochs
total_train_step = 0
total_test_step = 0
epoch = 10
```
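One detail worth keeping in mind: nn.CrossEntropyLoss expects raw logits together with integer class indices, which is why TdModel ends with a plain Linear layer and no softmax. A tiny sketch with made-up tensors (not from the original post) shows the expected input format:

```python
# Sketch: CrossEntropyLoss takes (N, num_classes) logits and (N,) class indices.
logits = torch.randn(4, 10)          # a fake batch of 4 samples, 10 classes
labels = torch.tensor([3, 0, 9, 1])  # fake ground-truth class indices
print(nn.CrossEntropyLoss()(logits, labels))  # a single scalar loss tensor
```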
1.7 Training and testing the model
Use TensorBoard to record the training process so you can check whether the model meets your requirements.
After each training epoch, run the model over the test set once and use the test loss (or accuracy) to judge how well the model has been trained.
```python
writer = SummaryWriter('logs')

for i in range(epoch):
    print('-------------------- Epoch {} --------------------'.format(i + 1))

    # Training loop
    for data in train_dataloader:
        imgs, targets = data
        outputs = td(imgs)
        loss = lose_fn(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_train_step += 1
        if total_train_step % 100 == 0:
            print('Training step: {0}, Loss: {1}'.format(total_train_step, loss.item()))
            writer.add_scalar('train_loss', loss.item(), total_train_step)

    # Evaluation loop (no gradients needed)
    total_test_loss = 0
    with torch.no_grad():
        for data in test_dataloader:
            imgs, targets = data
            outputs = td(imgs)
            loss = lose_fn(outputs, targets)
            total_test_loss += loss.item()

    print('Total loss on the test set: {}'.format(total_test_loss))
    writer.add_scalar('test_loss', total_test_loss, total_test_step)
    total_test_step += 1

    # Save the model after every epoch
    torch.save(td, './td_model{}.pt'.format(i))
    print('Model saved')
```
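While the script is running, the logged curves can be viewed by starting TensorBoard from the project directory with `tensorboard --logdir=logs` and opening the URL it prints in a browser.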
1.8 Accuracy (for classification problems)
Even with the overall loss on the test set, it is hard to judge how well the model really performs.
For classification problems, accuracy is a more intuitive measure of model quality.
For tasks such as object detection or semantic segmentation, you can instead write the model outputs directly to TensorBoard and inspect them visually (see the sketch below).
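As a rough idea of that kind of visual inspection, a sketch (the tag name is illustrative, and `imgs` is assumed to be an NCHW batch, e.g. from test_dataloader):

```python
# Sketch: write a batch of images (inputs or rendered predictions) to TensorBoard.
writer.add_images('test_batch', imgs, global_step=total_test_step)
```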
How the accuracy calculation works:
```python
import torch

outputs = torch.tensor([[0.1, 0.2, 0.8],
                        [0.5, 0.05, 0.4]])

print(outputs.argmax(0))  # tensor([1, 0, 0]): index of the max in each column
print(outputs.argmax(1))  # tensor([2, 0]):    index of the max in each row (the predicted class)

preds = outputs.argmax(1)
targets = torch.tensor([2, 1])
true_number = (preds == targets)  # tensor([True, False])
print(true_number.sum())          # tensor(1): number of correct predictions
```
Now optimize the code from section 1.7 so that it also counts the total number of correct predictions on the test set:
```python
# Inside the epoch loop, replacing the evaluation block from section 1.7
total_test_loss = 0
total_test_accuracy = 0
with torch.no_grad():
    for data in test_dataloader:
        imgs, targets = data
        outputs = td(imgs)
        loss = lose_fn(outputs, targets)
        total_test_loss += loss.item()
        accuracy = (outputs.argmax(dim=1) == targets).sum()
        total_test_accuracy += accuracy.item()

print('Accuracy on the test set: {}'.format(total_test_accuracy / test_data_size))
writer.add_scalar('test_accuracy', total_test_accuracy / test_data_size, total_test_step)
print('Total loss on the test set: {}'.format(total_test_loss))
writer.add_scalar('test_loss', total_test_loss, total_test_step)
total_test_step += 1

torch.save(td, './td_model_{}.pth'.format(i))
print('Model for epoch {} saved'.format(i))
```
1.9 Complete code
```python
import torch
import torchvision
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter

from model import *

# Prepare the datasets
train_data = torchvision.datasets.CIFAR10(
    root='./dataset',
    train=True,
    download=True,
    transform=torchvision.transforms.ToTensor()
)
test_data = torchvision.datasets.CIFAR10(
    root='./dataset',
    train=False,
    download=True,
    transform=torchvision.transforms.ToTensor()
)

train_data_size = len(train_data)
test_data_size = len(test_data)
print('Size of the training set: {}'.format(train_data_size))
print('Size of the test set: {}'.format(test_data_size))

# Load the datasets
train_dataloader = DataLoader(train_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)

# Model, loss function and optimizer
td = TdModel()
lose_fn = nn.CrossEntropyLoss()
learning_rate = 1e-2
optimizer = torch.optim.SGD(td.parameters(), lr=learning_rate)

# Training parameters
total_train_step = 0
total_test_step = 0
epoch = 50

writer = SummaryWriter('./logs')

for i in range(epoch):
    print('-------------------- Epoch {} --------------------'.format(i + 1))

    # Training loop
    for data in train_dataloader:
        imgs, targets = data
        outputs = td(imgs)
        loss = lose_fn(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_train_step += 1
        if total_train_step % 100 == 0:
            print('Training step: {0}, Loss: {1}'.format(total_train_step, loss.item()))
            writer.add_scalar('train_loss', loss.item(), total_train_step)

    # Evaluation loop
    total_test_loss = 0
    total_test_accuracy = 0
    with torch.no_grad():
        for data in test_dataloader:
            imgs, targets = data
            outputs = td(imgs)
            loss = lose_fn(outputs, targets)
            total_test_loss += loss.item()
            accuracy = (outputs.argmax(dim=1) == targets).sum()
            total_test_accuracy += accuracy.item()

    print('Accuracy on the test set: {}'.format(total_test_accuracy / test_data_size))
    writer.add_scalar('test_accuracy', total_test_accuracy / test_data_size, total_test_step)
    print('Total loss on the test set: {}'.format(total_test_loss))
    writer.add_scalar('test_loss', total_test_loss, total_test_step)
    total_test_step += 1

    # Save the model after every epoch
    torch.save(td, './td_model_{}.pth'.format(i))
    print('Model for epoch {} saved'.format(i))
```
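Two optional refinements, not present in the script above, are worth knowing. Calling td.train() / td.eval() around the two loops matters once a model contains Dropout or BatchNorm layers (TdModel has neither, so here it would only be a habit), and saving the state_dict instead of the whole module is the more portable way to checkpoint. A rough sketch, assuming td and the loops from the script above:

```python
# Sketch of optional refinements (not in the original script).
td.train()   # switch to training mode before the training loop
# ... run the training loop ...
td.eval()    # switch to evaluation mode before the test loop
# ... run the evaluation loop ...

# Alternative checkpoint style: save only the parameters and restore them into a fresh model.
torch.save(td.state_dict(), './td_model_state.pth')
restored = TdModel()
restored.load_state_dict(torch.load('./td_model_state.pth'))
```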
2 Training on the GPU
2.1 Method 1: .cuda()
Only the network model, the loss function, and the data (inputs and targets) can be moved to the GPU.
Usage: call .cuda() on the object.
Network model:
```python
from model import *

td = TdModel()
if torch.cuda.is_available():
    td = td.cuda()
```
Loss function:
```python
lose_fn = nn.CrossEntropyLoss()
if torch.cuda.is_available():
    lose_fn = lose_fn.cuda()
```
Input and target data:
```python
for data in train_dataloader:
    imgs, targets = data
    if torch.cuda.is_available():
        imgs, targets = imgs.cuda(), targets.cuda()
    outputs = td(imgs)
    loss = lose_fn(outputs, targets)
    # ... the test loop needs the same .cuda() calls on imgs and targets
```
2.2 Method 2: .to(device)
Usage: call .to(device) on the object, where device can refer to the CPU or to a specific GPU:
```python
device = torch.device('cpu')     # run on the CPU
device = torch.device('cuda')    # run on the default GPU
device = torch.device('cuda:0')  # run on GPU 0
device = torch.device('cuda:1')  # run on GPU 1
```
Selecting the device:
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
```
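If you want to confirm which device was actually selected, a small optional sketch:

```python
# Sketch: report the selected device.
print(device)
if device.type == 'cuda':
    print(torch.cuda.get_device_name(device))  # e.g. the GPU model name
```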
Moving the objects to the device:
```python
from model import *

td = TdModel()
td = td.to(device)

lose_fn = nn.CrossEntropyLoss()
lose_fn = lose_fn.to(device)

for data in train_dataloader:
    imgs, targets = data
    imgs, targets = imgs.to(device), targets.to(device)
    # ... the test loop moves imgs and targets to the device in the same way
```
3 Testing the Trained Model
Use the trained model for inference by feeding it an input image.
Download an image from the web and read it with PIL.
A PNG image has four channels: RGB plus an alpha (transparency) channel.
Calling image = image.convert('RGB') keeps only the three color channels.
If the image already has exactly three color channels, this operation changes nothing.
With this step, the script works for PNG, JPG, and other common formats.
```python
from PIL import Image
import torchvision

image_path = 'images/dog.png'
image = Image.open(image_path)
image = image.convert('RGB')
print(image)
```
Resize the image to the network's 32x32 input size:
```python
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((32, 32)),
    torchvision.transforms.ToTensor(),
])
image = transform(image)
print(image.shape)  # torch.Size([3, 32, 32])
```
Load the trained network model:
```python
import torch
from model import *

# map_location lets a GPU-trained checkpoint load on a CPU-only machine;
# weights_only=False is needed because the whole module object was saved, not just a state_dict.
model = torch.load('td_model_0.pth', map_location=torch.device('cpu'), weights_only=False)
print(model)
```
Feed the image to the model and read off the prediction:
```python
image = torch.reshape(image, (1, 3, 32, 32))  # add the batch dimension

model.eval()
with torch.no_grad():
    output = model(image)

print(output)                # raw scores for the 10 classes
print(output.argmax(dim=1))  # index of the predicted class
```
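The argmax only returns a class index. To turn it into a readable label, one option (an addition to the original code, assuming the standard torchvision CIFAR10 class order) is:

```python
# Sketch: map the predicted index to a CIFAR10 class name.
# This list matches torchvision.datasets.CIFAR10(...).classes.
classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
print(classes[output.argmax(dim=1).item()])
```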