Deploying Qwen2.5-14B-Instruct with vLLM on a Single Node with Multiple GPUs

📅 2024-10-08

🔖 aigc

Environment #

OS: Ubuntu 24.04
Python: 3.11
GPU: 2 × NVIDIA GeForce RTX 4090
CUDA Version: 12.6

Installing vLLM #

See "Installing vLLM with pip".

Downloading the Model #

Download Qwen/Qwen2.5-14B-Instruct in advance with huggingface-cli.
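The download step can be sketched as a small helper. `download_model` and the `DRY_RUN` switch are hypothetical names introduced here for illustration. Only the `huggingface-cli download <repo_id>` command itself is the real CLI call. By default the helper only prints the command, so it can be checked before starting a download of several tens of gigabytes.

```shell
# Sketch: pre-download the model weights into the local Hugging Face cache.
# MODEL_ID comes from the article; DRY_RUN is a hypothetical switch added here
# so the command can be inspected without starting the actual download.
MODEL_ID="Qwen/Qwen2.5-14B-Instruct"

download_model() {
    # $1 = Hugging Face repository id
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (default): only show what would be executed.
        echo "huggingface-cli download $1"
    else
        huggingface-cli download "$1"
    fi
}

download_model "$MODEL_ID"
```

Running it with `DRY_RUN=0` set in the environment performs the real download.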

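Deploying Qwen2.5-14B-Instruct #

Start vLLM as an OpenAI-compatible API server.

On a single machine with two GPUs, set the CUDA_VISIBLE_DEVICES environment variable so that both cards are visible to vLLM: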

export CUDA_VISIBLE_DEVICES=0,1
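A quick sanity check, sketched under the assumption that the number of visible GPUs should be at least the tensor-parallel size used when starting the server (2 in this post). `count_visible_gpus` and `TP_SIZE` are hypothetical names, not part of vLLM.

```shell
# Count the GPU indices listed in a CUDA_VISIBLE_DEVICES-style string.
count_visible_gpus() {
    # Put each comma-separated entry on its own line, then count non-empty lines.
    printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

export CUDA_VISIBLE_DEVICES=0,1
TP_SIZE=2   # assumed to match the value passed to --tensor-parallel-size

if [ "$(count_visible_gpus "$CUDA_VISIBLE_DEVICES")" -lt "$TP_SIZE" ]; then
    echo "need at least $TP_SIZE visible GPUs" >&2
else
    echo "ok: $(count_visible_gpus "$CUDA_VISIBLE_DEVICES") GPUs visible"
fi
```

The check only inspects the environment variable. It does not verify that the listed devices actually exist, which `nvidia-smi` can confirm.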

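Setting HF_HUB_OFFLINE=1 prevents any HTTP calls to the Hugging Face Hub. This shortens model loading and is especially useful when the server has no external network access.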

export HF_HUB_OFFLINE=1

Start the server:

vllm serve Qwen/Qwen2.5-14B-Instruct \
  --served-model-name qwen2.5-14b-instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --max-model-len=32768 \
  --tensor-parallel-size 2 \
  --port 8000

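--tensor-parallel-size 2
--tensor-parallel-size 2 tells vLLM to use tensor parallelism to split the model across two GPUs.
Tensor parallelism is a distributed deep learning technique for handling large models.
With --tensor-parallel-size set to 2, the model's parameters and computation are split into two parts, each handled by one GPU.
This reduces memory use on each GPU, which makes it possible to load and run larger models.
It can also improve speed to some extent, because the GPUs process different parts of the model in parallel.
Tensor parallelism is particularly useful for large language models such as Qwen2.5-14B-Instruct: its 16-bit weights alone take roughly 30 GB, more than the 24 GB of memory on a single RTX 4090.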
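The memory argument can be made concrete with a back-of-the-envelope estimate. This is a rough sketch only. It assumes about 14.7 billion parameters (the figure on the model card), 2 bytes per parameter (BF16/FP16), and an even split of weights across GPUs. It ignores the KV cache, activations, and framework overhead. `weights_gb_per_gpu` is a hypothetical helper.

```shell
# Rough per-GPU weight footprint under tensor parallelism, in GB (10^9 bytes).
# Ignores KV cache, activations, and framework overhead.
weights_gb_per_gpu() {
    # $1 = parameters in billions, $2 = bytes per parameter, $3 = tensor-parallel size
    awk -v p="$1" -v b="$2" -v tp="$3" 'BEGIN { printf "%.1f\n", p * b / tp }'
}

weights_gb_per_gpu 14.7 2 1   # single GPU
weights_gb_per_gpu 14.7 2 2   # --tensor-parallel-size 2
```

Under these assumptions the weights need about 29.4 GB on one GPU and about 14.7 GB per GPU when split in two, which leaves room on each 24 GB card for the KV cache.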

Testing the OpenAI-Compatible API #

List the available models with curl:

curl -s http://localhost:8000/v1/models | jq .

{
  "object": "list",
  "data": [
    {
      "id": "qwen2.5-14b-instruct",
      "object": "model",
      "created": 1728454502,
      "owned_by": "vllm",
      "root": "Qwen/Qwen2.5-14B-Instruct",
      "parent": null,
      "max_model_len": 32768,
      "permission": [
        {
          "id": "modelperm-e269177fea994b4aa7364bfc40992219",
          "object": "model_permission",
          "created": 1728454502,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
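Loading a 14B model takes a while, so a script that talks to the server may want to wait until the endpoint responds. The following is a minimal sketch. `wait_for_server` and `WAIT_INTERVAL` are hypothetical names, and the URL assumes the `--port 8000` setting used above.

```shell
# Poll an HTTP endpoint until it answers successfully or the attempts run out.
wait_for_server() {
    url=$1               # endpoint to poll, e.g. the model list
    attempts=${2:-60}    # maximum number of tries
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl exit non-zero on HTTP errors; the body is discarded.
        if curl -sf "$url" >/dev/null 2>&1; then
            echo "ready"
            return 0
        fi
        i=$((i + 1))
        sleep "${WAIT_INTERVAL:-5}"
    done
    echo "timeout" >&2
    return 1
}

# Usage: wait_for_server http://localhost:8000/v1/models 120
```

With the defaults, the helper gives up after 60 attempts spaced 5 seconds apart.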

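Test the chat completions API with curl. The prompt asks, in Chinese, which of the decimals 9.11 and 9.8 is larger; the model's reply (also in Chinese) concludes that 9.8 is larger: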

curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen2.5-14b-instruct",
  "messages": [
    {"role": "system", "content": "你是一个数学家."},
    {"role": "user", "content": "9.11和9.8这两个小数谁比较大?"}
  ],
  "max_tokens": 512
}' | jq '.choices[0].message.content'

"比较两个小数9.11和9.8的大小，可以遵循以下步骤：\n\n1. **比较整数部分**：9.11和9.8的整数部分都是9，所以需要比较小数部分。\n2. **比较小数部分**：9.11的小数部分是0.11，而9.8的小数部分是0.8。\n\n为了更容易比较，可以将0.8写成0.80，这样两个数的小数部分就都有两位了。\n- 9.11的小数部分是0.11。\n- 9.8的小数部分是0.80。\n\n显然，0.80 > 0.11，因此9.8 > 9.11。\n\n所以，9.8比9.11大。"

Test tool calling with curl:

curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen2.5-14b-instruct",
  "messages": [
    { "role": "user", "content": "What is 3 * 12? Also, what is 11 + 49?" }
  ],
  "parallel_tool_calls": false,
  "tools": [
    {
      "type": "function",
      "function":  {
        "name": "add",
        "description": "Add two integers.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "integer"},
                "b": {"type": "integer"}
            },
            "required": ["a", "b"]
        }
      }
    },
    {
      "type": "function",
      "function":  {
        "name": "multiply",
        "description": "Multiply two integers.",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "integer"},
                "b": {"type": "integer"}
            },
            "required": ["a", "b"]
        }
      }
    }
  ]
}' | jq '.choices[0].message.tool_calls'

[
  {
    "id": "chatcmpl-tool-ef9f47970bbb40539df865e89fb6a347",
    "type": "function",
    "function": {
      "name": "multiply",
      "arguments": "{\"a\": 3, \"b\": 12}"
    }
  },
  {
    "id": "chatcmpl-tool-c37a4dadc5d94d0a9daa7fc4d9a3f7a4",
    "type": "function",
    "function": {
      "name": "add",
      "arguments": "{\"a\": 11, \"b\": 49}"
    }
  }
]
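The server only returns the tool calls. Executing them is the client's job. As an illustration, the two tools declared in the request could be implemented locally as below. `run_tool` is a hypothetical dispatcher, not part of vLLM or the OpenAI API.

```shell
# Hypothetical local implementations of the two tools declared in the request.
run_tool() {
    # $1 = tool name, $2 and $3 = the integer arguments a and b
    case "$1" in
        add)      echo $(( $2 + $3 )) ;;
        multiply) echo $(( $2 * $3 )) ;;
        *)        echo "unknown tool: $1" >&2; return 1 ;;
    esac
}

run_tool multiply 3 12
run_tool add 11 49
```

Note that each `arguments` field in the response is a JSON-encoded string rather than a nested object, so a client has to parse it a second time (for example with jq's `fromjson`) before passing `a` and `b` to such a dispatcher.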

Running as a systemd Service #

Use systemd to run the qwen2.5-14b-instruct deployment above as a system service.

/etc/systemd/system/qwen2.5-14b-instruct.service:

[Unit]
Description=qwen2.5-14b-instruct
After=network.target

[Service]
Type=simple
Environment="CUDA_VISIBLE_DEVICES=0,1"
Environment="HF_HUB_OFFLINE=1"
WorkingDirectory=/home/<theuser>/vllm
User=<theuser>
ExecStart=/bin/bash -c 'source .venv/bin/activate && \
    vllm serve Qwen/Qwen2.5-14B-Instruct \
        --served-model-name qwen2.5-14b-instruct \
        --enable-auto-tool-choice \
        --tool-call-parser hermes \
        --max-model-len=32768 \
        --tensor-parallel-size 2 \
        --port 8000'

Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target

Enable the service so that it starts at boot:

systemctl enable qwen2.5-14b-instruct

Start the service:

systemctl start qwen2.5-14b-instruct

Follow the startup log:

journalctl -u qwen2.5-14b-instruct -f