首页 > 科技周边 > 人工智能

DeepSeek用的GRPO占用大量内存？有人给出了些破解方法

时间：2025-02-07 20:04:23 361浏览收藏

大家好，今天本人给大家带来文章《DeepSeek用的GRPO占用大量内存？有人给出了些破解方法》，文中内容主要涉及到，如果你对科技周边方面的知识点感兴趣，那就请各位朋友继续看下去吧~希望能真正帮到你们，谢谢！

RTX 3080 移动版训练大型语言模型的实用指南

本文旨在指导 GPU 资源受限的开发者如何利用 GRPO (Group Relative Policy Optimization) 训练大型语言模型。DeepSeek-R1 的发布使得 GRPO 成为强化学习训练大型语言模型的热门方法，因为它高效且易于训练。 GRPO 通过利用模型自身生成的训练数据进行迭代改进，目标是最大化生成文本的优势函数，同时保持模型与参考策略的接近性。

选择合适的模型大小和训练方法（全参数微调或参数高效微调 - PEFT）是训练的关键。本文作者 Greg Schoeninger (Oxen.ai CEO) 使用配备 16GB 显存的 RTX 3080 笔记本电脑进行实验，并分享了其经验。

^{原文链接：https://www.oxen.ai/blog/grpo-vram-requirements-for-the-gpu-poor{{TABLE_PLACEHOLDER_1}}}

作者在使用 trl 库的 GRPO 实现时，遇到了显存不足 (OOM) 错误：

torch.OutOfMemoryError: CUDA out of memory.
Tried to allocate 1.90 GiB. GPU 0 has a total capacity of 15.73 GiB of which 1.28 GiB is free. 
Including non-PyTorch memory, this process has 14.43 GiB memory in use. Of the allocated memory 11.82 GiB is allocated by PyTorch, and 2.41 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

实验结果与内存需求分析

作者进行了一系列实验，测试不同模型大小（5亿到140亿参数）在 GSM8K 数据集上训练前 100 步的峰值内存使用情况，并比较了全参数微调和 PEFT 的内存需求。所有实验均在 Nvidia H100 上完成。

使用的模型包括：

GRPO 对内存需求高的原因在于其内部涉及多个模型（策略模型、参考模型、奖励模型）以及每个查询产生的多个输出。

优化内存使用

8位优化器和梯度检查点技术可以有效减少内存占用。8位优化器更高效地存储优化器状态，而梯度检查点则通过在训练过程中拍摄快照来减少内存使用，虽然会降低训练速度。

代码示例

trl 库简化了 GRPO 的使用。以下代码示例展示了如何使用 trl 训练小型模型：

import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer
import re
SYSTEM_PROMPT = """
Respond in the following format:
...
...
"""
def extract_hash_answer(text: str) -> str | None:
if "####" not in text:
return None
return text.split("####")[1].strip()
def get_gsm8k_questions(split = "train") -> Dataset:
data = load_dataset('openai/gsm8k', 'main')[split]
data = data.map(lambda x: {
'prompt': [
{'role': 'system', 'content': SYSTEM_PROMPT},
{'role': 'user', 'content': x['question']}
],
'answer': extract_hash_answer(x['answer'])
})
return data
def extract_xml_answer(text: str) -> str:
answer = text.split("")[-1]
answer = answer.split("")[0]
return answer.strip()
def format_reward_func(completions, **kwargs) -> list[float]:
"""Reward function that checks if the completion has a specific format."""
pattern = r"^\n.*?\n\n\n.*?\n\n$"
responses = [completion[0]["content"] for completion in completions]
matches = [re.match(pattern, r) for r in responses]
return [0.5 if match else 0.0 for match in matches]
def accuracy_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
"""Reward function that extracts the answer from the xml tags and compares it to the correct answer."""
responses = [completion[0]['content'] for completion in completions]
extracted_responses = [extract_xml_answer(r) for r in responses]
return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]
def main():
dataset = get_gsm8k_questions()
model_name = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map=None
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
training_args = GRPOConfig(
output_dir="output",
learning_rate=5e-6,
adam_beta1=0.9,
adam_beta2=0.99,
weight_decay=0.1,
warmup_ratio=0.1,
lr_scheduler_type='cosine',
logging_steps=1,
bf16=True,
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
num_generations=4,
max_prompt_length=256,
max_completion_length=786,
num_train_epochs=1,
save_steps=100,
save_total_limit=1,
max_grad_norm=0.1,
log_on_each_node=False,
)
trainer = GRPOTrainer(
model=model,
processing_class=tokenizer,
reward_funcs=[
format_reward_func,
accuracy_reward_func
],
args=training_args,
train_dataset=dataset,
)
trainer.train()
if __name__ == "__main__":
main()

trl 项目地址：https://github.com/huggingface/trl?ref=ghost.oxen.ai

超参数调整与VRAM估算

num_generations 超参数会显著影响 VRAM 消耗。建议在内存瓶颈解决前使用 num_generations=4。

GitHub 问题讨论：https://github.com/huggingface/trl/issues/2709?ref=ghost.oxen.ai

其他影响 VRAM 的因素包括 batch_size、gradient_accumulation_steps、max_prompt_length、max_completion_length 和 LoRA 的 target_modules。

最后，作者分享了 10 亿参数 Llama 3.2 模型的训练结果，展示了 GRPO 在提高模型准确率方面的潜力。

通过合理的参数设置和优化技术，即使使用资源有限的 RTX 3080 移动版 GPU，也能有效训练大型语言模型。

到这里，我们也就讲完了《DeepSeek用的GRPO占用大量内存？有人给出了些破解方法》的内容了。个人认为，基础知识的学习和巩固，是为了更好的将其运用到项目中，欢迎关注golang学习网公众号，带你了解更多关于产业,GRPO的知识点！

产业 GRPO