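Notes on Fine-Tuning Large Language Models (Unsloth Framework)

Many of the comments below are reposted verbatim, and I do not fully understand every parameter myself. If you run into a parameter whose meaning is unclear, I suggest keeping its default value.

Original articles:

10 GB of VRAM: fine-tuning Qwen2 with Unsloth and running inference with Ollama

Fine-tuning the open-source Chinese model Llama3-Chinese-8B-Instruct with Unsloth

A look at ShareGPT-format fine-tuning datasets

Common instruction formats for large models: ShareGPT and Alpaca (using the settings in llama-factory as an example)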

1 Set up the environment

-ubuntu22.04

-cuda12.1.0

-py310

-torch2.3.0-1.17.1

Install unsloth

pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"

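2 Download the model

From here on, all code is Python. To run it, activate the matching environment, type python to open the interactive interpreter, then enter each statement and press Enter; it executes immediately. This differs from compiled languages, which are built first and then run.

from modelscope import snapshot_download

# Download the model from the ModelScope community (for other platforms, see their documentation) and store it in /root/models. Without the cache_dir argument the files land in the .cache directory instead. Later code can refer to the model through the model_dir variable or through the absolute path.
model_dir = snapshot_download("qwen/Qwen2-7B", cache_dir="/root/models")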

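3 Load the model

from unsloth import FastLanguageModel
import torch

# Maximum sequence length
max_seq_length = 2048

# Use bfloat16 as the data type
dtype = torch.bfloat16

# Whether to load the model in 4-bit precision
load_in_4bit = True

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_dir, # Model path: the model_dir variable from above, or a literal path
    max_seq_length = max_seq_length, # Context window used for training. Qwen2-7B supports 32K tokens and can be extended to 128K with YaRN; 2048 is enough for this test.
    dtype = dtype, # torch.bfloat16 suits the A10 GPU (the ModelScope machines use A10 cards, so that is what I chose)
    load_in_4bit = load_in_4bit, # Fine-tune with 4-bit quantization, which cuts weight memory roughly 4x compared with 16-bit. Quantization maps the weights onto a small set of values; the cost is an accuracy drop of about 1-2%.
)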
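To make the "roughly 4x" figure concrete, here is a back-of-the-envelope sketch. The parameter count is an assumption for a 7B-class model, and the estimate covers the weights only: it ignores quantization metadata, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3

N_PARAMS = 7.6e9  # assumption: rough parameter count of a 7B-class model

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
print(f"ratio: {fp16_gib / int4_gib:.0f}x")
```

This is why a 7B model that would not fit on a 10 GB card in 16-bit becomes workable once the weights are loaded in 4-bit.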

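4 Configure the LoRA training parameters

LoRA (Low-Rank Adaptation) is a low-rank adapter technique for large language models. During fine-tuning it updates only a small fraction of the parameters (roughly 1% to 10%) instead of all of them, which makes fine-tuning for a specific task much cheaper while still improving performance on that task.

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Rank of the fine-tuning update; any number greater than 0. Suggested values: 8, 16, 32, 64, 128. Larger values use more memory and train more slowly but can improve accuracy on complex tasks. 8 is typical for quick fine-tuning, and values up to 128 are used. Too large a value may cause overfitting and hurt model quality.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",], # All modules are fine-tuned here. You can remove some to reduce memory use and speed up training, but this is strongly discouraged.
    lora_alpha = 16,   # Scaling factor for the update. A larger value makes the fine-tune fit your dataset more strongly but may cause overfitting. Set it equal to the rank r, or to twice r.
    lora_dropout = 0,  # Keep at 0 for faster training. Dropout can reduce overfitting, but the effect is small.
    bias = "none",    # Keep at "none" for faster training with less overfitting.
    # [NEW] "unsloth" uses 30% less VRAM and fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # Options: True, False, "unsloth". "unsloth" is recommended here: it reduces memory use by 30% and supports very long context fine-tuning.
    random_state = 3407,  # Random seed. Training needs random numbers, so fixing the seed makes experiments reproducible.
    use_rslora = False,  # Rank-stabilized LoRA, an advanced feature: scales the update by lora_alpha / sqrt(r) instead of lora_alpha / r.
    loftq_config = None, # Advanced feature: initializes the LoRA matrices from the top r singular vectors of the weights. Can improve accuracy somewhat, but memory use spikes at the start.
)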
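To see why LoRA trains so few parameters, the sketch below counts the trainable values LoRA adds to a single weight matrix. The matrix size is an illustrative assumption, not a dimension taken from Qwen2; the overall fraction for a real model depends on r and on which modules are targeted.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x r) and A (r x d_in); the frozen weight W is
    used as W + (lora_alpha / r) * B @ A.
    """
    return r * (d_out + d_in)

d_out = d_in = 4096  # assumption: an illustrative square projection matrix
r = 16

full = d_out * d_in
lora = lora_param_count(d_out, d_in, r)
print(f"full matrix: {full:,} parameters")
print(f"LoRA (r={r}): {lora:,} parameters ({lora / full:.2%} of the matrix)")
```

Doubling r doubles the adapter size, which is the memory and speed cost mentioned in the comment on r above.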

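5 Prepare the dataset

Alpaca-format dataset (the format comes from Stanford Alpaca, an instruction-tuned model built on Meta's open-source LLaMA, and is used for instruction tuning. Each record has three main parts: a task description (instruction), an optional input, and the expected output.)

[
  {
    "instruction": "user instruction (required)",
    "input": "user input (optional)",
    "output": "model response (required)",
    "system": "system prompt (optional)",
    "history": [
      ["first-round instruction (optional)", "first-round response (optional)"],
      ["second-round instruction (optional)", "second-round response (optional)"]
    ]
  }
]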


Example:
[
    {
        "instruction": "基于输入的患者医案记录，直接给出你的证型诊断，无需给出原因。",
        "input": "患者1年前无明显诱因反复出现胸闷心慌，劳累后加重，持续数5-6分钟，休息后可缓解，未予重视，未予治疗。1年来患者胸闷心慌不适反复发作。1周前患者无明显诱因自觉胸闷心慌较前加重，乏力，伴头痛不适，休息后症状未见明显缓解，遂今日至我院门诊就诊，为进一步治疗收入我科。入院时：患者胸闷心慌，伴头痛不适，倦怠乏力，无头晕，无胸痛放射痛，无恶心呕吐，无恶寒发热，无黒朦晕厥，无意识障碍，纳食可，二便尚调，夜寐尚可。",
        "output": "气虚血瘀证"
    },
    {
        "instruction": "基于输入的患者医案记录，直接给出你的证型诊断，无需给出原因。",
        "input": "患者5年前无明显诱因出现腹泻，每日解7-8次黄色稀便，夹有少量粘液，无脓血，无明显腹痛，患者多次至我院及外院就诊，多次查肠镜示慢性结肠炎伴糜烂、结肠息肉，予调整肠道菌群、对症等治疗后症状时有好转，但停药易反复。后患者再次至我院就诊，查肠镜示直肠炎伴糜烂。予对症治疗后症状稍好转。现患者为求进一步系统诊治，遂来我院，由门诊收住入院。入院时：患者腹泻，每日解7-8次黄色稀便，夹有少量粘液，无脓血，无明显腹痛，无里急后重感，无肛门坠胀感，伴有嗳气反酸及上腹烧灼感，时有口酸，无口苦，食纳尚可，夜寐欠安，小便尚调，舌红，苔薄黄，脉弦细。",
        "output": "大肠湿热证"
    }
]

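ShareGPT-format dataset (the ShareGPT format originates from datasets of recorded conversations between users and ChatGPT and is mainly used to train dialogue systems. It focuses on collecting and organizing multi-turn conversations that model the interaction between a user and an AI.)

Compared with the Alpaca format, the ShareGPT format supports more role types, such as human, gpt, observation, and function. The turns form a list of objects stored in the conversations column.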

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "I saw a dress that I liked. It was originally priced at $200 but it's on sale for 20% off. Can you tell me how much it will cost after the discount?"
      },
      {
        "from": "function_call",
        "value": "{\"name\": \"calculate_discount\", \"arguments\": {\"original_price\": 200, \"discount_percentage\": 20}}"
      },
      {
        "from": "observation",
        "value": "{\"discounted_price\": 160}"
      },
      {
        "from": "gpt",
        "value": "The dress will cost you $160 after the 20% discount."
      }
    ],
    "system": "system prompt (optional)",
    "tools": "[{\"name\": \"calculate_discount\", \"description\": \"Calculate the discounted price\", \"parameters\": {\"type\": \"object\", \"properties\": {\"original_price\": {\"type\": \"number\", \"description\": \"The original price of the item\"}, \"discount_percentage\": {\"type\": \"number\", \"description\": \"The percentage of discount\"}}, \"required\": [\"original_price\", \"discount_percentage\"]}}]"
  }
]
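function_call denotes a function call. What is a function call, and what is it for?
A model's training data usually stops at some cutoff date, so the model has no real-time knowledge. If I ask about today's weather, for example, the model cannot know the answer from its parameters alone. Function calling solves this problem.
To a first approximation, a function_call can be understood as an API call: the API is a function that provides some capability.
observation is the observed result, that is, the result of executing the function_call.
tools lists the available tools, that is, a description of each function that a function_call may invoke.
observation is not a new term: readers familiar with hidden Markov models (HMMs) will have met it in the formulation of that model.
The ShareGPT format is simple yet expressive: it handles single-turn and multi-turn conversations easily, and it adds function calling, which makes it extensible.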
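The relationship between function_call and observation can be sketched in a few lines of Python. This is only an illustration of the data flow: the TOOLS registry and run_function_call helper are hypothetical names, and calculate_discount is a local stand-in for the tool described in the sample record above.

```python
import json

def calculate_discount(original_price: float, discount_percentage: float) -> dict:
    """Local stand-in for the tool described in the `tools` field."""
    discounted = original_price * (100 - discount_percentage) / 100
    return {"discounted_price": discounted}

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"calculate_discount": calculate_discount}

def run_function_call(turn: dict) -> dict:
    """Execute a function_call turn and return the matching observation turn."""
    call = json.loads(turn["value"])
    result = TOOLS[call["name"]](**call["arguments"])
    return {"from": "observation", "value": json.dumps(result)}

function_call_turn = {
    "from": "function_call",
    "value": '{"name": "calculate_discount", "arguments": {"original_price": 200, "discount_percentage": 20}}',
}
observation_turn = run_function_call(function_call_turn)
print(observation_turn)
```

The model emits the function_call turn, the surrounding program runs the function, and the observation turn is fed back so the model can write its final gpt reply.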

Examples:

{
  "conversations": [
    {
      "from": "human",
      "value": "你好，我出生于1990年5月15日。你能告诉我我今天几岁了吗？"
    },
    {
      "from": "function_call",
      "value": "{\"name\": \"calculate_age\", \"arguments\": {\"birthdate\": \"1990-05-15\"}}"
    },
    {
      "from": "observation",
      "value": "{\"age\": 31}"
    },
    {
      "from": "gpt",
      "value": "根据我的计算，你今天31岁了。"
    }
  ],
  "tools": "[{\"name\": \"calculate_age\", \"description\": \"根据出生日期计算年龄\", \"parameters\": {\"type\": \"object\", \"properties\": {\"birthdate\": {\"type\": \"string\", \"description\": \"出生日期以YYYY-MM-DD格式表示\"}}, \"required\": [\"birthdate\"]}}]"
}


{
  "conversations": [
    [
      {
        "from": "human",
        "value": "Escribe un discurso que pueda recitar como padrino de la boda de mi mejor amigo."
      },
      {
        "from": "gpt",
        "value": "Queridos invitados, amigos y familiares,\n\nMe siento muy honrado de estar aquí hoy como padrino de bodas de mi mejor amigo [Nombre del novio].\n\nRecuerdo con cariño los días en los que [Nombre del novio] y yo nos conocimos, cuando éramos solo dos jóvenes llenos de sueños y esperanza. Ahora, aquí estamos, celebrando su boda con la persona que ama, [Nombre de la novia].\n\n[Nombre de la novia], te aseguro que [Nombre del novio] es una persona increíble, llena de amor y lealtad. Juntos, han formado un equipo invencible y estoy seguro de que su amor perdurará por siempre.\n\n[Nombre del novio], mi amigo, te deseo todo lo mejor en esta nueva etapa de tu vida. Espero que tu matrimonio esté lleno de amor, alegría y felicidad, y que [Nombre de la novia] sea siempre tu compañera de vida y tu mejor amiga.\n\nA [Nombre de la novia], te deseo una vida llena de amor, paz y felicidad. Espero que [Nombre del novio] sea para ti un marido cariñoso y amoroso, y que juntos encuentren la felicidad que merecen.\n\nPor último, quiero desearles a ambos mucha suerte y que Dios les bendiga en su matrimonio. Espero que esta boda sea solo el comienzo de una larga y feliz vida juntos.\n\n¡Felicidades a los novios!\n\nAtentamente,\n[Tu nombre]"
      }
    ],
    [
      {
        "from": "human",
        "value": "Why are POSIX-shellscripts superior compared to Bash-shellscripts?"
      },
      {
        "from": "gpt",
        "value": "**POSIX-shellscripts** and **Bash-shellscripts** are both scripting languages that use the shell command-line interpreter. **POSIX** is a standard that defines the interface and behavior of many Unix-like operating systems, while **Bash** is a specific implementation of the shell that has many extensions and features beyond the POSIX specification.\n\nSome of the advantages of POSIX-shellscripts over Bash-shellscripts are:\n\n - They are more portable and can run on almost any Linux or Unix-like system, regardless of the shell implementation.\n - They are more compatible with other POSIX-compliant utilities and programs.\n - They are simpler and less prone to errors or unexpected behavior due to Bash-specific features.\n\nSome of the disadvantages of POSIX-shellscripts over Bash-shellscripts are:\n\n - They are less expressive and powerful than Bash-shellscripts, as they lack many features such as arrays, associative arrays, arithmetic expressions, extended pattern matching, etc.\n - They are less convenient and user-friendly than Bash-shellscripts, as they require more syntax and quoting rules.\n - They are less widely used and supported than Bash-shellscripts, as most GNU/Linux systems use Bash as the default shell.\n\nUltimately, the choice between POSIX-shellscripts and Bash-shellscripts depends on your needs and preferences. If you want to write scripts that are portable, compatible and simple, you may prefer POSIX-shellscripts. If you want to write scripts that are expressive, powerful and convenient, you may prefer Bash-shellscripts. However, you should also be aware of the differences and limitations of each language, and avoid using features that are not supported by the target shell."
      },
      {
        "from": "human",
        "value": "Why are POSIX-shellscripts less prone to errors than Bash-shellscripts?"
      },
      {
        "from": "gpt",
        "value": "Its not correct to state that POSIX-shellscripts are more or less prone to errors than Bash-shellscripts as either can be written with functions as well as commands that generate errors. Unlike POSIX-shellscripts, Bash-shellscripts may not be portable across unix platforms as readily as POSIX-shell scripts that provide the same purpose.  POSIX-shellscripts are usually less complex and so less prone to errors than more complex Bash-shellscripts."
      }
    ]
  ]
}
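Since the two formats carry much the same information, one can be converted into the other. The function below is a minimal sketch of an Alpaca-to-ShareGPT conversion; the name alpaca_to_sharegpt and the choice to join instruction and input with a newline are my own assumptions, not part of either format's definition.

```python
def alpaca_to_sharegpt(record: dict) -> dict:
    """Convert one Alpaca-style record into a ShareGPT-style record."""
    for key in ("instruction", "output"):
        if not record.get(key):
            raise ValueError(f"missing required field: {key}")

    conversations = []
    # Earlier rounds, if any, come first.
    for user_turn, model_turn in record.get("history", []):
        conversations.append({"from": "human", "value": user_turn})
        conversations.append({"from": "gpt", "value": model_turn})

    # The optional input is appended to the instruction on its own line.
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n" + record["input"]
    conversations.append({"from": "human", "value": prompt})
    conversations.append({"from": "gpt", "value": record["output"]})

    result = {"conversations": conversations}
    if record.get("system"):
        result["system"] = record["system"]
    return result

example = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
    "history": [["Hello", "Hi, how can I help?"]],
}
converted = alpaca_to_sharegpt(example)
for turn in converted["conversations"]:
    print(turn["from"], "->", repr(turn["value"]))
```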

6 Process the dataset

# Processing function for an Alpaca-style dataset

alpaca_prompt = """下面是一项描述任务的说明，配有提供进一步背景信息的输入。写出一个适当完成请求的回应。

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

# Load the dataset from a local file
from datasets import load_dataset
dataset = load_dataset("json", data_files="./train.json", split = "train")

# Apply the formatting function to the dataset
dataset = dataset.map(formatting_prompts_func, batched = True,)

# ==================================================================================================

# Processing function for a ShareGPT-style dataset


from unsloth.chat_templates import get_chat_template

# Get the chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template="chatml",  # Supported templates: zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping={
        "role": "from",        # Field that holds the role
        "content": "value",    # Field that holds the content
        "user": "human",       # Name of the user role
        "assistant": "gpt"     # Name of the assistant role
    }
)

# Define the formatting function
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

# Load the data from the ModelScope community
from modelscope.msdatasets import MsDataset
dataset = MsDataset.load('OmniData/guanaco-sharegpt-style', split="train")

# Apply the formatting function to the dataset
dataset = dataset.map(formatting_prompts_func, batched=True)
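To get a feel for what the "text" column contains after this step, here is a hand-written approximation of ChatML rendering. It is an illustrative sketch, not Unsloth's implementation: render_chatml and ROLE_MAP are hypothetical names, and the real template may differ in details such as a default system message.

```python
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def render_chatml(conversation: list, add_generation_prompt: bool = False) -> str:
    """Render ShareGPT-style turns as ChatML text (illustrative only)."""
    parts = []
    for turn in conversation:
        role = ROLE_MAP[turn["from"]]
        parts.append(f"<|im_start|>{role}\n{turn['value']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

conversation = [
    {"from": "human", "value": "Hello"},
    {"from": "gpt", "value": "Hi! How can I help?"},
]
print(render_chatml(conversation))
```

For training, the whole conversation is rendered (add_generation_prompt=False). For inference, only the user turns are rendered and the assistant turn is left open, which is what add_generation_prompt=True does in section 10.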

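7 Configure the training parameters

This step uses Hugging Face's trl library.

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Train with SFTTrainer
trainer = SFTTrainer(
    model=model,  # model to train (variable from above)
    tokenizer=tokenizer,  # tokenizer to use (variable from above)
    train_dataset=dataset,  # training dataset (variable from above)
    dataset_text_field="text",  # name of the text field in the dataset
    max_seq_length=2048,  # maximum sequence length
    dataset_num_proc=2,  # number of processes for dataset preprocessing
    packing=False,  # whether to pack sequences (can speed up training on short sequences)
    args=TrainingArguments(
        per_device_train_batch_size=2,  # Batch size per device. Increase it to use more GPU memory; a larger batch also gives smoother training and helps against overfitting.
        gradient_accumulation_steps=4,  # Gradient accumulation steps. Multiplies the effective batch size without increasing memory use. Increase it for a smoother training loss curve.
        warmup_steps=5,  # warmup steps
        max_steps=60,  # Maximum training steps, set to 60 here for a quick run. For a full run that may take hours, comment out max_steps and use num_train_epochs = 1 instead, which means one full pass over the dataset. 1 to 3 passes are usually recommended; more than that tends to overfit.
        learning_rate=2e-4,  # Learning rate. Lower it for slower training that is more likely to converge to a more accurate result. Typical values to try: 2e-4, 1e-4, 5e-5, 2e-5.
        fp16=not is_bfloat16_supported(),  # use fp16 when bfloat16 is not supported
        bf16=is_bfloat16_supported(),  # use bfloat16 when it is supported
        logging_steps=1,  # log every N steps
        optim="adamw_8bit",  # optimizer: 8-bit AdamW
        weight_decay=0.01,  # weight decay (regularization)
        lr_scheduler_type="linear",  # learning rate scheduler type
        seed=3407,  # random seed
        output_dir="outputs",  # Output directory. This relative path resolves against the directory where you started the Python interpreter (I did not know that at first).
    ),
)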
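The batch size, accumulation, warmup, and step settings interact, so a small calculation helps. The sketch below works out the effective batch size and models the shape of a linear schedule with warmup. It assumes a single GPU, and linear_lr is my own helper that mirrors the general shape of the "linear" scheduler; the trainer's exact step indexing may differ by one.

```python
def linear_lr(step: int, base_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 60
num_gpus = 1  # assumption: single-GPU run

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print("effective batch size:", effective_batch)
print("samples seen in", max_steps, "steps:", effective_batch * max_steps)

for step in (0, 5, 30, 60):
    print(step, linear_lr(step, 2e-4, warmup_steps=5, max_steps=max_steps))
```

With these settings each optimizer step sees 8 samples, so the 60-step quick run covers 480 samples, which is far less than one epoch on most datasets.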

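8 Start training

# Record the GPU baseline before training; the statistics in section 9 depend on these values
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

trainer_stats = trainer.train()

9 Training statistics

# Peak GPU memory reserved in total (GB)
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
# Peak GPU memory attributable to LoRA training (GB)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
# Peak memory as a percentage of total GPU memory
used_percentage = round(used_memory / max_memory * 100, 3)
# LoRA training memory as a percentage of total GPU memory
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)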

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

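10 Run inference (this step is not essential: once the first method produced output I went straight to exporting. Each method seemed to fail or succeed for no obvious reason, so if one fails, try another.)

The inference methods below all use the model and tokenizer variables. Reloading a saved model with the loading function from section 3 should also work for inference, though I have not tried it.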

# Method provided by the author of the ShareGPT-style dataset article

from unsloth.chat_templates import get_chat_template

# Get a tokenizer configured with the chat template
tokenizer = get_chat_template(
    tokenizer,  # the tokenizer loaded earlier
    chat_template="chatml",  # Supported templates: zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping={
        "role": "from",  # field that holds the role
        "content": "value",  # field that holds the content
        "user": "human",  # user role
        "assistant": "gpt",  # assistant role
    },  # ShareGPT-style mapping
)

# Enable Unsloth's native inference mode (about 2x faster)
FastLanguageModel.for_inference(model)

# Simulate a user question
messages = [
    {"from": "human", "value": "浙江的省会在哪里？"},  # the user's question ("Where is the capital of Zhejiang?")
]

# Turn the messages into model input
inputs = tokenizer.apply_chat_template(
    messages,  # input messages
    tokenize=True,  # tokenize the result
    add_generation_prompt=True,  # required for generation
    return_tensors="pt",  # return PyTorch tensors
).to("cuda")  # move the data to the GPU

# Generate output with the model
outputs = model.generate(
    input_ids=inputs,  # input IDs
    max_new_tokens=64,  # maximum number of new tokens
    use_cache=True,  # use the KV cache
)

# Decode the generated tokens
tokenizer.batch_decode(outputs)


# ====================================================================================================

# Second method provided by the author of the ShareGPT-style dataset article (streaming output)


from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

# Get a tokenizer configured with the chat template
tokenizer = get_chat_template(
    tokenizer,  # the tokenizer loaded earlier
    chat_template="chatml",  # Supported templates: zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping={
        "role": "from",  # field that holds the role
        "content": "value",  # field that holds the content
        "user": "human",  # user role
        "assistant": "gpt",  # assistant role
    },  # ShareGPT-style mapping
)

# Enable Unsloth's native inference mode (about 2x faster)
FastLanguageModel.for_inference(model)

# Simulate a user question
messages = [
    {"from": "human", "value": "浙江的省会在哪里？"},  # the user's question ("Where is the capital of Zhejiang?")
]

# Turn the messages into model input
inputs = tokenizer.apply_chat_template(
    messages,  # input messages
    tokenize=True,  # tokenize the result
    add_generation_prompt=True,  # required for generation
    return_tensors="pt",  # return PyTorch tensors
).to("cuda")  # move the data to the GPU

# Create a TextStreamer that skips the prompt
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate output, printing the text incrementally through the streamer
_ = model.generate(
    input_ids=inputs,  # input IDs
    streamer=text_streamer,  # streamer that prints tokens as they are generated
    max_new_tokens=128,  # maximum number of new tokens
    use_cache=True,  # use the KV cache
)

# ====================================================================================================

# Method provided by the author of the Alpaca-style dataset article

FastLanguageModel.for_inference(model) # Enable native inference mode (about 2x faster)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "你好", # instruction
        "", # input
        "", # output
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

# ========================================================================================================
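# Variant suggested by GPT

# Assumes the model and tokenizer are already loaded, and that alpaca_prompt is the template defined in section 6.
# The prompt used at inference must match the template used for training, so the template is reused here rather than redefined.

# Format the prompt and pass it to the tokenizer
inputs = tokenizer(
    [
        alpaca_prompt.format("我肚子好疼", "", "")  # instruction, input, empty output
    ], return_tensors="pt"
).to("cuda")

# Run inference with the model
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

# Decode the output
decoded_output = tokenizer.batch_decode(outputs)
print(decoded_output)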

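11 Export the model

Save the fine-tuned result as a LoRA adapter, a small set of files of about 100 MB.

When the saved LoRA adapter is loaded for inference, the full base model and the adapter are loaded together automatically; adapter_config.json records the path of the base model.

# Save the model to the local directory "lora_model"
model.save_pretrained("lora_model")  # save the adapter locally

# Save the tokenizer to the local directory "lora_model"
tokenizer.save_pretrained("lora_model")  # save the tokenizer locally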
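Reloading the saved adapter might look like the sketch below. It reuses the same from_pretrained call as section 3 with the adapter directory as model_name. I have not verified this on the setup described here, and the reload_for_inference wrapper and RELOAD_KWARGS name are my own.

```python
# Keyword arguments mirror the loading call in section 3; only model_name changes.
RELOAD_KWARGS = {
    "model_name": "lora_model",   # directory written by save_pretrained
    "max_seq_length": 2048,
    "dtype": None,                # None lets Unsloth choose a dtype for the GPU
    "load_in_4bit": True,
}

def reload_for_inference():
    """Reload the base model plus LoRA adapter and switch to inference mode."""
    # Imported lazily so this snippet can be read on a machine without a GPU.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(**RELOAD_KWARGS)
    FastLanguageModel.for_inference(model)
    return model, tokenizer
```

Calling model, tokenizer = reload_for_inference() in a fresh interpreter should give the two variables that the inference snippets in section 10 expect.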

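Save the full model

# Merge the adapter into the model in 16-bit format and save it locally with the tokenizer
model.save_pretrained_merged("models/Llama3", tokenizer, save_method = "merged_16bit",)
# Push the merged model and tokenizer to the Hugging Face Hub (the largest model-sharing platform in the world; its relationship to the ModelScope community is like that of GitHub to Gitee. It has been featured on CCTV, yet it is still blocked in mainland China, which is rather ironic.)
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge the adapter into the model in 4-bit format and save it locally with the tokenizer
model.save_pretrained_merged("models/Llama3", tokenizer, save_method = "merged_4bit",)
# Push the merged model and tokenizer to the Hugging Face Hub
model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")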

Save in GGUF format (GGUF is a specification for a binary file format. A trained model converted to GGUF loads faster and uses fewer resources, because GGUF stores the weights with a compact binary encoding, optimized data structures, and memory mapping.) Once saved as GGUF, the model can run on Ollama, which can serve large models on the CPU of a machine with no GPU.

This step may raise errors because compatibility is still uneven. It ran smoothly for me; if you hit an error, try searching for it on Bing.

# Save as 16-bit GGUF (large)
model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save as 8-bit Q8_0 GGUF (medium; used when quantization_method is omitted)
model.save_pretrained_gguf("model", tokenizer,)
model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save as q4_k_m GGUF (small)
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
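To put rough numbers on "large", "medium", and "small", the sketch below estimates file sizes from bits per weight. The parameter count is an assumption for a 7B-class model, and the q4_k_m figure is an approximate average (that method mixes several block types), so treat the output as an order-of-magnitude guide only.

```python
# Approximate bits per weight. q8_0 follows from its block layout
# (32 int8 values plus one fp16 scale = 34 bytes per 32 weights);
# the q4_k_m value is a rough average and varies by model.
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_k_m": 4.85}

def gguf_size_gb(n_params: float, method: str) -> float:
    """Rough GGUF file size in GB (10**9 bytes), ignoring metadata."""
    return n_params * BITS_PER_WEIGHT[method] / 8 / 1e9

N_PARAMS = 7.6e9  # assumption: rough parameter count of a 7B-class model

for method in BITS_PER_WEIGHT:
    print(f"{method:>7}: ~{gguf_size_gb(N_PARAMS, method):.1f} GB")
```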

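12 Import into Ollama

After saving in GGUF format, write a Modelfile; Ollama uses it to create the model.

See the official documentation: Ollama Modelfile

Instruction	Description
`FROM` (required)	Defines the base model to use
`PARAMETER`	Sets the parameters Ollama uses to run the model
`TEMPLATE`	The full prompt template sent to the model
`SYSTEM`	Specifies the system message set in the template
`ADAPTER`	Defines the fine-tuned LoRA adapter to apply to the model
`LICENSE`	Specifies the license
`MESSAGE`	Specifies the message history

PARAMETER settings

Parameter	Description	Value type	Example usage
mirostat	Enables Mirostat sampling to control perplexity. (Default: 0; 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)	int	mirostat 0
mirostat_eta	Controls how quickly the algorithm responds to feedback from the generated text. A lower learning rate adjusts more slowly; a higher one makes the algorithm more responsive. (Default: 0.1)	float	mirostat_eta 0.1
mirostat_tau	Controls the balance between coherence and diversity of the output. Lower values give more focused and coherent text. (Default: 5.0)	float	mirostat_tau 5.0
num_ctx	Sets the size of the context window used to generate the next token. (Default: 2048)	int	num_ctx 4096
repeat_last_n	Sets how far back the model looks to prevent repetition. (Default: 64; 0 = disabled, -1 = num_ctx)	int	repeat_last_n 64
repeat_penalty	Sets how strongly repetition is penalized. Higher values (e.g. 1.5) penalize repetition more strongly; lower values (e.g. 0.9) are more lenient. (Default: 1.1)	float	repeat_penalty 1.1
temperature	The model's temperature. Raising it makes the answers more creative. (Default: 0.8)	float	temperature 0.7
seed	Sets the random seed used for generation. A fixed value makes the model generate the same text for the same prompt. (Default: 0)	int	seed 42
stop	Sets a stop sequence. When the pattern is encountered, the LLM stops generating and returns. Multiple stop patterns can be set by giving several stop parameters in the Modelfile.	string	stop "AI assistant:"
tfs_z	Uses tail-free sampling to reduce the influence of low-probability tokens on the output. Higher values (e.g. 2.0) reduce it more; 1.0 disables the setting. (Default: 1)	float	tfs_z 1
num_predict	Maximum number of tokens to generate. (Default: 128; -1 = unlimited, -2 = fill the context)	int	num_predict 42
top_k	Reduces the probability of generating nonsense. Higher values (e.g. 100) give more diverse answers; lower values (e.g. 10) are more conservative. (Default: 40)	int	top_k 40
top_p	Works together with top_k. Higher values (e.g. 0.95) give more diverse text; lower values (e.g. 0.5) give more focused and conservative text. (Default: 0.9)	float	top_p 0.9
min_p	An alternative to top_p that aims to balance quality and diversity. The parameter p is the minimum probability for a token to be considered, relative to the probability of the most likely token. For example, with p=0.05 and a most likely token at probability 0.9, logits below 0.045 are filtered out. (Default: 0.0)	float	min_p 0.05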
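As an illustration of how the FROM and PARAMETER instructions combine, the sketch below assembles Modelfile text in Python. The render_modelfile helper, the GGUF path, and the stop token are all hypothetical placeholders; in particular the stop tokens and any TEMPLATE must match the chat format the model was trained with, which this sketch does not attempt to handle.

```python
def render_modelfile(gguf_path: str, parameters: dict, stops: list) -> str:
    """Build Modelfile text from a GGUF path, PARAMETER values, and stop strings."""
    lines = [f"FROM {gguf_path}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    for stop in stops:
        lines.append(f'PARAMETER stop "{stop}"')
    return "\n".join(lines) + "\n"

text = render_modelfile(
    "./model.gguf",                          # hypothetical path to the exported file
    {"temperature": 0.7, "num_ctx": 4096},   # example values from the table above
    ["<|im_end|>"],                          # hypothetical stop token
)
print(text)
```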

In my tests, a Modelfile I wrote by hand did not seem to work: the model produced garbage output. The following Modelfile template can be used instead.

FROM llama2

TEMPLATE """下面是一项描述任务的说明，配有提供进一步背景信息的输入。写出一个适当完成请求的回应。{{ if .Prompt }}

### Instruction:
{{.Prompt}}

{{ end }}### Response:

{{ .Response }}<|end_of_text|>"""

PARAMETER stop "<|start_header_id|>"
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_header_id|>"
PARAMETER stop "<|end_of_text|>"
PARAMETER stop "<|reserved_special_token_"

Create the model with ollama

ollama create <model-name> -f /mnt/workspace/model/Modelfile
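Once the model is created, it can be queried from Python through Ollama's local HTTP API. The sketch below uses only the standard library and assumes the server is running on its default port; the model name is a placeholder for whatever name was passed to ollama create.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for the /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

request = build_generate_request("my-finetuned-model", "Hello")  # placeholder model name
print(request.full_url, request.get_method())
```

With the server running, generate("my-finetuned-model", "Hello") returns the model's reply as a string.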