Deploying and Fine-Tuning ChatGLM-6B on a Single Machine #

ChatGLM-6B #

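ChatGLM-6B can be deployed on consumer-grade GPUs. Deployment and inference need about 13 GB of VRAM at half precision (FP16), and about 8 GB after INT8 quantization, so the model fits on a single consumer card such as an RTX 2080 Ti.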

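INT8 quantization
INT8 quantization is a technique for shrinking a neural network and speeding up inference. It converts the model's floating-point values (FP32 or FP16) to 8-bit integers (INT8), which lowers both compute cost and memory requirements. Its main drawback is reduced numerical precision and, with it, some loss of output quality, especially for complex models.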
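
To make the float-to-INT8 conversion described above concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustration only, not the kernel ChatGLM-6B actually uses; the function names `quantize_int8` and `dequantize_int8` and the sample weights are assumptions made for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The rounding error per value is bounded by half a quantization step.
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(q, scale, max(errors))
```

Small values such as `0.003` collapse to zero here, which is the kind of precision loss the paragraph above refers to.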

Calling the ChatGLM-6B Model to Generate Dialogue #

pip install protobuf transformers==4.27.1 cpm_kernels "torch>=2.0" gradio mdtex2html sentencepiece accelerate

from transformers import AutoTokenizer, AutoModel

# trust_remote_code=True is needed because the checkpoint ships its own model code.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
# model = AutoModel.from_pretrained("THUDM/chatglm-6b-int8", trust_remote_code=True).half().cuda()
# Adjust as needed; 4-bit and 8-bit quantization are supported.
# model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(8).cuda()
# model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).quantize(4).cuda()
# Load the pre-quantized INT4 checkpoint and move it to the GPU.
model = AutoModel.from_pretrained("THUDM/chatglm2-6b-int4", trust_remote_code=True).half().cuda()
model = model.eval()  # inference mode

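# First turn: start with an empty history. The prompt means "Hello".
response, history = model.chat(tokenizer, "你好", history=[])
print(response)
# Second turn: pass the returned history back in. The prompt means "Please tell a short joke".
response, history = model.chat(tokenizer, "请讲一个精短的笑话", history=history)
print(response)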
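
The two calls above thread the returned `history` into the next turn. The same pattern can be wrapped in a small helper, sketched below. `run_dialogue` and `EchoModel` are hypothetical names introduced for this example. `EchoModel` is only a stand-in that mimics the `chat()` call shape so the loop can be tried without a GPU, and it assumes `history` is a list of `(query, response)` pairs.

```python
def run_dialogue(model, tokenizer, prompts):
    """Send prompts in order, feeding each returned history into the next call."""
    history = []
    responses = []
    for prompt in prompts:
        response, history = model.chat(tokenizer, prompt, history=history)
        responses.append(response)
    return responses, history


class EchoModel:
    """Hypothetical stand-in with the same chat() call shape; not the real model."""

    def chat(self, tokenizer, query, history=None):
        # Copy the incoming history so the caller's list is not mutated.
        history = list(history or [])
        response = f"echo: {query}"
        history.append((query, response))
        return response, history


responses, history = run_dialogue(
    EchoModel(), tokenizer=None, prompts=["Hello", "Tell me a short joke"]
)
for reply in responses:
    print(reply)
```

With the real model, `EchoModel()` would be replaced by the loaded `model` and `tokenizer=None` by the loaded `tokenizer`.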

A test chatbot can be put together quickly with gradio; details omitted…

ChatGLM-6B P-Tuning #

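ChatGLM-6B can be fine-tuned with P-Tuning v2. P-Tuning v2 cuts the number of parameters that need fine-tuning to 0.1% of the original. Combined with model quantization, gradient checkpointing, and similar techniques, fine-tuning can run with as little as 7 GB of VRAM (16 GB to 24 GB recommended).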
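
A rough back-of-the-envelope calculation shows why training only 0.1% of the parameters keeps memory low. Every figure in the sketch below other than the 0.1% ratio is an assumption for illustration: "6B" is taken as a round 6 billion parameters, the trainable parameters are assumed to be held in FP32 with an Adam-style optimizer (16 bytes each for weight, gradient, and two moment buffers), and the frozen weights are assumed to be stored at 4 bits. Activations and framework overhead are ignored.

```python
TOTAL_PARAMS = 6_000_000_000   # assumption: "6B" taken as a round 6 billion
TRAINABLE_FRACTION = 0.001     # the 0.1% figure quoted above

BYTES_PER_TRAINABLE = 16       # assumption: FP32 weight + gradient + two Adam moments
BYTES_PER_FROZEN = 0.5         # assumption: frozen weights stored at 4 bits


def gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3


n_train = round(TOTAL_PARAMS * TRAINABLE_FRACTION)
train_state_bytes = n_train * BYTES_PER_TRAINABLE
frozen_bytes = TOTAL_PARAMS * BYTES_PER_FROZEN

print(f"trainable parameters: {n_train:,}")
print(f"training state: {gib(train_state_bytes):.2f} GiB")
print(f"frozen weights: {gib(frozen_bytes):.2f} GiB")
```

Under these assumptions the training state is small next to the frozen weights, so quantizing those weights is what brings the total down the most.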

ChatGLM-6B-PT usage instructions

References #

https://huggingface.co/THUDM/chatglm-6b
https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/README.md
https://github.com/THUDM/P-tuning-v2