7. 大模型编译#

本章节适用于平台

AX650A/AX650N/AX8850
AX630C

已验证模型

DeepSeek-R1-Distill
Qwen2.5、Qwen3
MiniCPM4
InternVL2_5、InternVL3
ChatGLM3
OpenBuddy
SmolLM2
Llama3.2
Gemma2
Phi2、Phi3
TinyLlama

本章节介绍如何将 Huggingface 上的模型转换的基本操作, 使用 pulsar2 工具将从 Huggingface 下载的项目中 *.safetensor 或 pytorch_model.bin 模型编译成 axmodel 模型. 请先参考《开发环境准备》章节完成开发环境搭建. 本节示例模型为 Qwen2.5-0.5B-Instruct-GPTQ-Int8.

版本约束

本文档基于 Pulsar2 4.1 版本进行编写。

LLM ModelZoo

不定期更新业内关注度较高的大语言模型适配，包括预编译模型和上板运行示例。

Huggingface

关联项目 AX-LLM

该项目用于探索业界常用 LLM(Large Language Model) 在已有芯片平台上落地的可行性和相关能力边界，方便社区开发者进行快速评估和二次开发自己的 LLM 应用。

AX-LLM

7.1. 命令说明#

Pulsar2 工具链中使用 pulsar2 llm_build 命令来完成 LLM 模型的转换.

root@xxx:/data# pulsar2 llm_build --help
usage: pulsar2 llm_build [-h] [--input_path INPUT_PATH] [--output_path OUTPUT_PATH] [--prefill_len PREFILL_LEN] [--parallel PARALLEL] [--model_config MODEL_CONFIG]
                         [--kv_cache_len KV_CACHE_LEN] [--post_topk POST_TOPK] [--post_weight_type {bf16,s8,fp8_e5m2,fp8_e4m3}] [-t {fp16,bf16,fp32}]
                         [-w {fp16,bf16,fp32,s8,s4,fp8_e5m2,fp8_e4m3}] [-c CHECK_LEVEL] [--chip {AX620E,AX650,LAMBERT}] [--prompt PROMPT] [--image_size IMAGE_SIZE]
                         [--last_kv_cache_len LAST_KV_CACHE_LEN] [--tensor_parallel_size TENSOR_PARALLEL_SIZE]

options:
  -h, --help            show this help message and exit
  --input_path INPUT_PATH
                        path of model or npy path (default: )
  --output_path OUTPUT_PATH
                        path of dumpped ax_model (default: .)
  --prefill_len PREFILL_LEN
                        token length of prefill (default: 0)
  --parallel PARALLEL   build parallel (default: 1)
  --model_config MODEL_CONFIG
                        config file (default: )
  --kv_cache_len KV_CACHE_LEN
                        length of kv_cache (default: 127)
  --post_topk POST_TOPK
                        post model output indices and prob (default: 0)
  --post_weight_type {bf16,s8,fp8_e5m2,fp8_e4m3}
                        post weight type (default: s8)
  -t {fp16,bf16,fp32}, --hidden_state_type {fp16,bf16,fp32}
                        hidden_state dtype (default: bf16)
  -w {fp16,bf16,fp32,s8,s4,fp8_e5m2,fp8_e4m3}, --weight_type {fp16,bf16,fp32,s8,s4,fp8_e5m2,fp8_e4m3}
                        weight dtype (default: s8)
  -c CHECK_LEVEL, --check_level CHECK_LEVEL
                        check level 0:run 1:layer_check 2: cal 1+1 (default: 0)
  --chip {AX620E,AX650,LAMBERT}
                        chip (default: AX650)
  --prompt PROMPT       prompt for check_level==2 (default: 1+1=)
  --image_size IMAGE_SIZE
                        vlm vision_part input_size (default: 224)
  --last_kv_cache_len LAST_KV_CACHE_LEN
                        last kv cache len (default: None)
  --tensor_parallel_size TENSOR_PARALLEL_SIZE
                        tensor parallel size (default: 0)

7.2. 下载 ax-llm-build 项目#

git clone https://github.com/AXERA-TECH/ax-llm-build.git

7.3. 下载 Qwen2.5-0.5B-Instruct-GPTQ-Int8#

cd ax-llm-build
pip install -U huggingface_hub
huggingface-cli download --resume-download Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8 --local-dir Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8-ctx-ax650

7.4. 编译执行#

pulsar2 llm_build --input_path Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8/  --output_path Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8-ctx-ax650 --hidden_state_type bf16 --kv_cache_len 1023 --prefill_len 128 --last_kv_cache_len 128 --last_kv_cache_len 256 --last_kv_cache_len 384 --last_kv_cache_len 512  --chip AX650 -c 1 --parallel 8

7.4.1. log 参考信息#

pulsar2 llm_build --input_path Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8/  --output_path Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8-ctx-ax650 --hidden_state_type bf16 --kv_cache_len 1023 --prefill_len 128 --last_kv_cache_len 128 --last_kv_cache_len 256 --last_kv_cache_len 384 --last_kv_cache_len 512  --chip AX650 -c 1 --parallel 8
Config(
    model_name='Qwen2.5-0.5B-Instruct-GPTQ-Int8',
    model_type='qwen2',
    num_hidden_layers=24,
    num_attention_heads=14,
    num_key_value_heads=2,
    hidden_size=896,
    head_dim=0,
    intermediate_size=4864,
    vocab_size=151936,
    rope_theta=1000000.0,
    max_position_embeddings=32768,
    rope_partial_factor=1.0,
    rms_norm_eps=1e-06,
    norm_type='rms_norm',
    hidden_act='silu',
    hidden_act_param=0.03,
    scale_depth=1.4,
    scale_emb=1,
    dim_model_base=256,
    origin_model_type='',
    quant=True,
    quant_sym=True,
    quant_bits=8,
    quant_group_size=128,
    rs_factor=32,
    rs_high_freq_factor=4.0,
    rs_low_freq_factor=1.0,
    rs_original_max_position_embeddings=8192,
    rs_rope_type='',
    rs_mrope_section=[16, 24, 24]
)
2025-06-17 19:43:58.341 | SUCCESS  | yamain.command.llm_build:llm_build:179 - prepare llm model done!
building llm decode layers   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 0:01:57
building llm post layer   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:01:24
2025-06-17 19:47:20.855 | SUCCESS  | yamain.command.llm_build:llm_build:275 - build llm model done!

备注

该示例所运行的主机配置为:

Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz

Memory 32G

全流程耗时大约 6min , 不同配置的主机转换时间略有差异.

7.4.2. embed 提取和优化#

chmod +x ./tools/fp32_to_bf16
chmod +x ./tools/embed_process.sh
./tools/embed_process.sh Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8/ Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8-ctx-ax650/

7.4.3. 输出文件说明#

root@xxx:/data/ax-llm-build# tree Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8-ctx-ax650/
Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int8-ctx-ax650/
├── model.embed_tokens.weight.bfloat16.bin
├── model.embed_tokens.weight.float32.bin # 临时文件，可删掉
├── model.embed_tokens.weight.npy # 临时文件，可删掉
├── qwen2_p128_l0_together.axmodel
├── qwen2_p128_l10_together.axmodel
├── qwen2_p128_l11_together.axmodel
├── qwen2_p128_l12_together.axmodel
├── qwen2_p128_l13_together.axmodel
├── qwen2_p128_l14_together.axmodel
├── qwen2_p128_l15_together.axmodel
├── qwen2_p128_l16_together.axmodel
├── qwen2_p128_l17_together.axmodel
├── qwen2_p128_l18_together.axmodel
├── qwen2_p128_l19_together.axmodel
├── qwen2_p128_l1_together.axmodel
├── qwen2_p128_l20_together.axmodel
├── qwen2_p128_l21_together.axmodel
├── qwen2_p128_l22_together.axmodel
├── qwen2_p128_l23_together.axmodel
├── qwen2_p128_l2_together.axmodel
├── qwen2_p128_l3_together.axmodel
├── qwen2_p128_l4_together.axmodel
├── qwen2_p128_l5_together.axmodel
├── qwen2_p128_l6_together.axmodel
├── qwen2_p128_l7_together.axmodel
├── qwen2_p128_l8_together.axmodel
├── qwen2_p128_l9_together.axmodel
└── qwen2_post.axmodel

0 directories, 28 files

其中 model.embed_tokens.weight.bfloat16.bin, qwen2_p128_l0_together.axmodel ~ qwen2_p128_l23_together.axmodel, qwen_post.axmodel 文件是上板运行所需要

7.5. 开发板运行#

本章节介绍如何在 AX650 开发板上运行 LLM 模型.

7.5.1. 使用 ax-llm 运行大模型#

运行该实例相关文件已上传网盘，请自行下载和参考

Huggingface

先运行 tokenizer 解析器

root@ax650:/mnt/qtang/llm-test/qwen2.5-0.5b-ctx# python3 qwen2.5_tokenizer_uid.py
Server running at http://0.0.0.0:12345

再运行示例

root@ax650:/mnt/qtang/llm-test/qwen2.5-0.5b-ctx# ./run_qwen2.5_0.5b_gptq_int8_ctx_ax650.sh
[I][                            Init][ 110]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
[I][                            Init][  57]: uid: d9e84259-87a2-4c54-9b9b-7da266149e8b
bos_id: -1, eos_id: 151645
100% | ████████████████████████████████ |  27 /  27 [10.21s<10.21s, 2.65 count/s] init post axmodel ok,remain_cmm(11292 MB)
[I][                            Init][ 188]: max_token_len : 1023
[I][                            Init][ 193]: kv_cache_size : 128, kv_cache_num: 1023
[I][                            Init][ 201]: prefill_token_num : 128
[I][                            Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 205]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 205]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 205]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 205]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 209]: prefill_max_token_num : 512
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 1,
    "top_p": 0.8
}

[I][                            Init][ 218]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][          GenerateKVCachePrefill][ 271]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][          GenerateKVCachePrefill][ 308]: input_num_token:21
[I][                            main][ 230]: precompute_len: 21
[I][                            main][ 231]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> who are you?
[I][                      SetKVCache][ 531]: prefill_grpid:2 kv_cache_num:128 precompute_len:21 input_num_token:12
[I][                      SetKVCache][ 534]: current prefill_max_token_num:384
[I][                             Run][ 660]: input token num : 12, prefill_split_num : 1
[I][                             Run][ 686]: input_num_token:12
[I][                             Run][ 829]: ttft: 135.66 ms
I am Qwen, a large language model created by Alibaba Cloud. I am a language model that can generate human-like text based on the input I receive.
I am designed to assist with a wide range of tasks, from simple questions to complex research papers, and I can even generate creative writing and speech.
I am here to help you with your queries and provide you with the information you need.

[N][                             Run][ 943]: hit eos,avg 34.04 token/s

[I][                      GetKVCache][ 500]: precompute_len:113, remaining:399
prompt >> q
root@ax650:/mnt/qtang/llm-test/qwen2.5-0.5b-ctx#

板端运行程序编译流程，请参考我们在 github 上的开源项目 AX-LLM

7.5.2. Tokenizer 解析器说明#

ax-llm 项目中的 Tokenizer 解析器采用本地模块与 HTTP Server 两种方案，其中本地方案又尝试了 sentencepiece、tiktoken 两种方案。但是我们在实际调试过程中发现 sentencepiece 对于不同 LLM 模型的 special tokens 支持不友好，需要用户自行处理 special tokens 的拆分，容易导致板端 token id 与 transformers 库中的 AutoTokenizer 获得的 token id 存在差异，最终影响 LLM 的输出结果正确性。因此我们建议前期调试的时候使用 Tokenizer HTTP Server 的方式直接调用 transformers 库中的 AutoTokenizer 模块进行测试。

Tokenizer HTTP Server 的特点：

保证 token id 正确
方便添加 chat template
支持本地、远端部署
支持多用户接入

以在网盘中已提供基于 Qwen2.5 0.5B 的相关文件为例

root@ax650:/mnt/qtang/llm-test/qwen2.5-0.5b-ctx# tree -L 1
.
|-- main_ax650
|-- main_axcl_aarch64
|-- main_axcl_x86
|-- post_config.json
|-- qwen2.5-0.5b-gptq-int8-ctx-ax630c
|-- qwen2.5-0.5b-gptq-int8-ctx-ax650
|-- qwen2.5_tokenizer
|-- qwen2.5_tokenizer_uid.py
|-- run_qwen2.5_0.5b_gptq_int8_ctx_ax630c.sh
`-- run_qwen2.5_0.5b_gptq_int8_ctx_ax650.sh

qwen2.5_tokenizer：是 tokenizer 相关文件，从 Qwen/Qwen2.5-3B-Instruct/ 中提取
qwen2.5_tokenizer_uid.py：是用 python 实现的 Tokenizer HTTP Server

运行说明如下：

python qwen2.5_tokenizer_uid.py --host xxx.xxx.xxx.xxx --port 12345，其中 --host xxx.xxx.xxx.xxx 设置 tokenizer 解析服务器的 IP 地址，确保 AX650N 能正常访问该地址
可以在具备 python 环境的 AX650N 本地运行, 则直接运行 python qwen2.5_tokenizer_uid.py
修改 ./run_qwen2.5_0.5b_gptq_int8_ctx_ax650.sh 中 --filename_tokenizer_model 的 IP 信息和步骤1中的一致
运行 ./run_qwen2.5_0.5b_gptq_int8_ctx_ax650.sh 即可

root@ax650:/mnt/qtang/llm-test/qwen2.5-0.5b-ctx# cat run_qwen2.5_0.5b_gptq_int8_ctx_ax650.sh
./main_ax650 \
--system_prompt "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
--template_filename_axmodel "qwen2.5-0.5b-gptq-int8-ctx-ax650/qwen2_p128_l%d_together.axmodel" \
--axmodel_num 24 \
--tokenizer_type 2 \
--url_tokenizer_model "http://127.0.0.1:12345" \
--filename_post_axmodel "qwen2.5-0.5b-gptq-int8-ctx-ax650/qwen2_post.axmodel" \
--filename_tokens_embed "qwen2.5-0.5b-gptq-int8-ctx-ax650/model.embed_tokens.weight.bfloat16.bin" \
--tokens_embed_num 151936 \
--tokens_embed_size 896 \
--use_mmap_load_embed 0 \
--live_print 1

大模型编译

目录

7. 大模型编译#

7.1. 命令说明#

7.2. 下载 ax-llm-build 项目#

7.3. 下载 Qwen2.5-0.5B-Instruct-GPTQ-Int8#

7.4. 编译执行#

7.4.1. log 参考信息#

7.4.2. embed 提取和优化#

7.4.3. 输出文件说明#

7.5. 开发板运行#

7.5.1. 使用 ax-llm 运行大模型#

7.5.2. Tokenizer 解析器说明#