4. Large Model Compilation (Experimental Stage)
Applicable Platforms
AX650N
AX630C
Verified Models
Llama2, Llama3, Llama3.2
TinyLlama-1.1B
Qwen1.5, Qwen2, Qwen2.5
Phi2, Phi3
MiniCPM, MiniCPM-V 2.0
SmolLM
ChatGLM3
OpenBuddy
This chapter introduces basic operations for converting models (*.safetensors or pytorch_model.bin) from Hugging Face into axmodel using the pulsar2 tool. Please first refer to the "Development Environment Preparation" section to complete the setup of the development environment.
The example model in this section is Qwen2-0.5B-Instruct.
Version Constraints
This document is written based on Pulsar2 version 3.2.
LLM ModelZoo
Related Project AX-LLM
This project explores the feasibility and capability boundaries of deploying popular LLMs (Large Language Models) on existing chip platforms, making it easy for community developers to quickly evaluate and develop their own LLM applications.
4.1. Command Explanation
In the Pulsar2 toolchain, the pulsar2 llm_build command is used to convert LLM models.
root@xxx:/data# pulsar2 llm_build --help
usage: pulsar2 llm_build [-h] [--input_path INPUT_PATH] [--output_path OUTPUT_PATH] [--prefill_len PREFILL_LEN]
[--parallel PARALLEL] [--model_config MODEL_CONFIG] [--kv_cache_len KV_CACHE_LEN]
[--post_topk POST_TOPK] [--post_weight_type {bf16,s8}] [-t {fp16,bf16,fp32}]
[-w {fp16,bf16,fp32,s8,s4}] [-c CHECK_LEVEL] [--chip {AX620E,AX650}] [--prompt PROMPT]
optional arguments:
-h, --help show this help message and exit
--input_path INPUT_PATH
path of model or npy path
--output_path OUTPUT_PATH
path of dumpped ax_model
--prefill_len PREFILL_LEN
token length of prefill
--parallel PARALLEL build parallel
--model_config MODEL_CONFIG
config file
--kv_cache_len KV_CACHE_LEN
length of kv_cache
--post_topk POST_TOPK
post model output indices and prob
--post_weight_type {bf16,s8}
post weight type
-t {fp16,bf16,fp32}, --hidden_state_type {fp16,bf16,fp32}
hidden_state dtype
-w {fp16,bf16,fp32,s8,s4}, --weight_type {fp16,bf16,fp32,s8,s4}
weight dtype
-c CHECK_LEVEL, --check_level CHECK_LEVEL
check level 0:run 1:layer_check 2: cal 1+1
--chip {AX620E,AX650}
chip
--prompt PROMPT prompt for check_level==2
4.2. Download ax-llm-build
git clone https://github.com/AXERA-TECH/ax-llm-build.git
4.3. Download Qwen2-0.5B-Instruct
cd ax-llm-build
pip install -U huggingface_hub
huggingface-cli download --resume-download Qwen/Qwen2-0.5B-Instruct --local-dir Qwen/Qwen2-0.5B-Instruct
4.4. Compile the Model
pulsar2 llm_build --input_path Qwen/Qwen2-0.5B-Instruct/ --output_path Qwen/Qwen2-0.5B-w8a16/ --kv_cache_len 1023 --hidden_state_type bf16 --prefill_len 128 --chip AX650
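Here --kv_cache_len sets the KV cache length, i.e. the maximum context kept on the device (the run log later reports max_token_len : 1023); --prefill_len sets the number of tokens processed per prefill step; --hidden_state_type sets the dtype of the hidden states; and --chip selects the target platform. Weights are quantized to 8 bits (the logged invocation below passes --weight_type s8 explicitly), which is what the w8a16 output name denotes: 8-bit weights with 16-bit activations.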
4.4.1. Log Information
pulsar2 llm_build --input_path Qwen/Qwen2-0.5B-Instruct/ --output_path Qwen/Qwen2-0.5B-w8a16/ --kv_cache_len 1023 --model_config config/qwen2-0.5B.json --hidden_state_type bf16 --weight_type s8 --parallel 8
Config(
model_name='Qwen2-0.5B-Instruct',
model_type='qwen2',
num_hidden_layers=24,
num_attention_heads=14,
num_key_value_heads=2,
hidden_size=896,
intermediate_size=4864,
vocab_size=151936,
rope_theta=1000000.0,
max_position_embeddings=32768,
rope_partial_factor=1.0,
rms_norm_eps=1e-06,
norm_type='rms_norm',
hidden_act='silu',
hidden_act_param=0.03,
scale_depth=1.4,
scale_emb=1
)
2024-08-22 16:16:04.364 | SUCCESS | yamain.command.llm_build:llm_build:100 - prepare llm model done!
building llm decode layers ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 0:05:03
building llm post layer ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:01:25
2024-08-22 16:22:33.485 | SUCCESS | yamain.command.llm_build:llm_build:160 - build llm model done!
2024-08-22 16:22:47.861 | SUCCESS | yamain.command.llm_build:llm_build:337 - check llm model done!
Note
The example was run on a host with the following configuration:
Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
32 GB memory
The whole process takes about 6 minutes; conversion time varies slightly with host configuration.
4.4.2. Extract and optimize the embedding file
chmod +x ./tools/fp32_to_bf16
chmod +x ./tools/embed_process.sh
./tools/embed_process.sh Qwen/Qwen2-0.5B-Instruct/ Qwen/Qwen2-0.5B-w8a16/
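What embed_process.sh does is extract the embedding table from the original checkpoint and convert it from float32 to bfloat16 via the fp32_to_bf16 tool. Conceptually the conversion just keeps the upper 16 bits of each float32 value; a minimal Python sketch, assuming the temporary .npy file listed below (the shipped tools remain the authoritative implementation):

import numpy as np

# Load the float32 embedding table dumped by the script (temporary file).
emb = np.load("Qwen/Qwen2-0.5B-w8a16/model.embed_tokens.weight.npy").astype(np.float32)

# bfloat16 is the upper half of the IEEE-754 float32 bit pattern:
# reinterpret as uint32 and shift right 16 bits (simple truncation;
# the real tool may round instead).
bf16 = (emb.view(np.uint32) >> 16).astype(np.uint16)
bf16.tofile("Qwen/Qwen2-0.5B-w8a16/model.embed_tokens.weight.bfloat16.bin")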
4.4.3. Output files
root@xxx:/data/ax-llm-build# tree Qwen/Qwen2-0.5B-w8a16
Qwen/Qwen2-0.5B-w8a16
├── model.embed_tokens.weight.bfloat16.bin
├── model.embed_tokens.weight.float32.bin # temporary file, can be deleted
├── model.embed_tokens.weight.npy # temporary file, can be deleted
├── qwen2_p128_l0_together.axmodel
├── qwen2_p128_l10_together.axmodel
├── qwen2_p128_l11_together.axmodel
├── qwen2_p128_l12_together.axmodel
├── qwen2_p128_l13_together.axmodel
├── qwen2_p128_l14_together.axmodel
├── qwen2_p128_l15_together.axmodel
├── qwen2_p128_l16_together.axmodel
├── qwen2_p128_l17_together.axmodel
├── qwen2_p128_l18_together.axmodel
├── qwen2_p128_l19_together.axmodel
├── qwen2_p128_l1_together.axmodel
├── qwen2_p128_l20_together.axmodel
├── qwen2_p128_l21_together.axmodel
├── qwen2_p128_l22_together.axmodel
├── qwen2_p128_l23_together.axmodel
├── qwen2_p128_l2_together.axmodel
├── qwen2_p128_l3_together.axmodel
├── qwen2_p128_l4_together.axmodel
├── qwen2_p128_l5_together.axmodel
├── qwen2_p128_l6_together.axmodel
├── qwen2_p128_l7_together.axmodel
├── qwen2_p128_l8_together.axmodel
├── qwen2_p128_l9_together.axmodel
└── qwen2_post.axmodel
The files model.embed_tokens.weight.bfloat16.bin, qwen2_p128_l0_together.axmodel ~ qwen2_p128_l23_together.axmodel, and qwen2_post.axmodel are required for running on the board (p128 reflects --prefill_len 128; l0 ~ l23 are the 24 hidden layers).
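The embedding file is not an axmodel: the board-side runtime looks up token embeddings directly from it (the run log below shows LLaMaEmbedSelector use mmap). A conceptual sketch of such a lookup, assuming a row-major [vocab_size, hidden_size] layout:

import numpy as np

HIDDEN = 896  # hidden_size of Qwen2-0.5B, from the Config dump above

# Memory-map the bf16 table; one row per token id (layout assumed).
emb = np.memmap("Qwen/Qwen2-0.5B-w8a16/model.embed_tokens.weight.bfloat16.bin",
                dtype=np.uint16, mode="r").reshape(-1, HIDDEN)

def embed(token_id: int) -> np.ndarray:
    """Fetch one token's bf16 row, widened to float32 for inspection."""
    return (emb[token_id].astype(np.uint32) << 16).view(np.float32)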
4.5. Development board run
This section describes how to run the LLM model on the AX650 development board.
4.5.1. Run large models using ax-llm
The files needed to run the example have been uploaded to the netdisk; please download them for reference.
root@ax650:/mnt/qtang/llama_axera_cpp# ./run_qwen2_0.5B.sh
[I][ Init][ 128]: LLM init start
3% | ██ | 1 / 27 [0.27s<7.29s, 3.70 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 27 / 27 [6.88s<6.88s, 3.92 count/s] init post axmodel ok,remain_cmm(11317 MB)
[I][ Init][ 244]: max_token_len : 1023
[I][ Init][ 249]: kv_cache_size : 128, kv_cache_num: 1023
[I][ Init][ 257]: prefill_token_num : 128
[I][ Init][ 266]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you?
[I][ Run][ 464]: ttft: 129.16 ms
I am a large language model created by Alibaba Cloud. I am called Qwen.
[N][ Run][ 603]: hit eos,avg 27.22 token/s
For instructions on compiling the board-side program, please refer to our open-source project AX-LLM on GitHub.
4.5.2. Tokenizer Parser Explanation
The Tokenizer parser in the ax-llm project supports both local modules and an HTTP Server. For the local solution, two schemes have been tried: sentencepiece and tiktoken.
However, during actual debugging we found that sentencepiece does not handle the special tokens of different LLM models well, requiring users to split special tokens themselves. This easily leads to differences between the token ids produced on the board and those from the AutoTokenizer module in the transformers library, ultimately affecting the correctness of the LLM output.
Therefore, for initial debugging we recommend the Tokenizer HTTP Server method, which calls the AutoTokenizer module of the transformers library directly.
Features of the Tokenizer HTTP Server:
Ensures correct token ids
Facilitates the addition of chat templates
Supports local and remote deployment
Supports multi-user access
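As an illustration of the idea only (not the actual qwen2.5_tokenizer.py, whose request/response protocol is defined by the ax-llm project), a minimal server of this kind simply wraps AutoTokenizer behind HTTP:

# minimal_tokenizer_server.py -- illustrative sketch; the shipped
# qwen2.5_tokenizer.py defines the protocol that main_prefill expects.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from transformers import AutoTokenizer

# Load the tokenizer files extracted from the original model directory.
tokenizer = AutoTokenizer.from_pretrained("qwen2.5_tokenizer")

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        text = json.loads(body)["text"]
        # Same token ids as AutoTokenizer on the host -- the point of
        # the HTTP Server approach.
        ids = tokenizer.encode(text)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"token_ids": ids}).encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 12345), Handler).serve_forever()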
Example using the Qwen2.5-3B files provided on the netdisk:
root@xxx:/data/ax-llm-build# tree qwen2.5-3b-prefill-ax650/
qwen2.5-3b-prefill-ax650/
├── main_prefill
├── qwen2.5-3B-prefill-ax650
│ ├── model.embed_tokens.weight.bfloat16.bin
│ ├── qwen2_p128_l0_together.axmodel
...
│ ├── qwen2_p128_l12_together.axmodel
│ └── qwen2_post.axmodel
├── qwen2.5_tokenizer
│ ├── merges.txt
│ ├── tokenizer_config.json
│ ├── tokenizer.json
│ └── vocab.json
├── qwen2.5_tokenizer.py
├── qwen.tiktoken
├── readme.txt
└── run_qwen2.5_3B_prefill_ax650.sh
qwen2.5_tokenizer: tokenizer-related files, extracted from Qwen/Qwen2.5-3B-Instruct/
qwen2.5_tokenizer.py: Tokenizer HTTP Server implemented in Python
The running instructions are as follows:
Run python qwen2.5_tokenizer.py --host xxx.xxx.xxx.xxx --port 12345, where --host xxx.xxx.xxx.xxx sets the IP address of the tokenizer parsing server. Ensure that the AX650N can reach this address. The server can also run natively on the AX650N if a Python environment is available.
In run_qwen2.5_3B_prefill_ax650.sh, change the IP address in --filename_tokenizer_model to the one used in step 1.
Run run_qwen2.5_3B_prefill_ax650.sh.
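As a quick connectivity check from the board before launching the script, a probe like the following works against the sketch server shown above (hypothetical; the real protocol is whatever qwen2.5_tokenizer.py implements):

import json
import urllib.request

# Hypothetical probe matching the sketch server above; adjust the payload
# to whatever qwen2.5_tokenizer.py actually expects.
req = urllib.request.Request(
    "http://xxx.xxx.xxx.xxx:12345",
    data=json.dumps({"text": "who are you?"}).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.load(urllib.request.urlopen(req)))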
root@xxx:/data/ax-llm-build# cat qwen2.5-3b-prefill-ax650/run_qwen2.5_3B_prefill_ax650.sh
./main_prefill \
--template_filename_axmodel "qwen2.5-3B-prefill-ax650/qwen2_p128_l%d_together.axmodel" \
--axmodel_num 36 \
--tokenizer_type 2 \
--filename_tokenizer_model http://xxx.xxx.xxx.xxx:12345 \
--bos 0 --eos 0 \
--filename_post_axmodel "qwen2.5-3B-prefill-ax650/qwen2_post.axmodel" \
--filename_tokens_embed "qwen2.5-3B-prefill-ax650/model.embed_tokens.weight.bfloat16.bin" \
--tokens_embed_num 151936 \
--tokens_embed_size 2048 \
--use_mmap_load_embed 1 \
--live_print 1 \
--continue 1 \
--prompt "$1"
4.6. Other examples
4.6.1. MiniCPM-V 2.0
Download MiniCPM-V 2.0
cd ax-llm-build
pip install -U huggingface_hub
huggingface-cli download --resume-download openbmb/MiniCPM-V-2 --local-dir openbmb/MiniCPM-V-2
Get axmodel
pulsar2 llm_build --input_path openbmb/MiniCPM-V-2/ --output_path openbmb/MiniCPM-V-2-ax650 --kv_cache_len 1023 --hidden_state_type bf16 --prefill_len 128 --chip AX650
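Because MiniCPM-V 2.0 is a multimodal model, the build additionally compiles the vision part (note the "building vision model" stage in the log below and the vpm_resampler.axmodel file in the output).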
Log Information
pulsar2 llm_build --input_path openbmb/MiniCPM-V-2/ --output_path openbmb/MiniCPM-V-2-ax650 --kv_cache_len 1023 --hidden_state_type bf16 --prefill_len 128 --chip AX650 --parallel 8
Config(
model_name='openbmb/MiniCPM-V-2',
model_type='minicpmv',
num_hidden_layers=40,
num_attention_heads=36,
num_key_value_heads=36,
hidden_size=2304,
intermediate_size=5760,
vocab_size=122753,
rope_theta=10000.0,
max_position_embeddings=4096,
rope_partial_factor=1.0,
rms_norm_eps=1e-05,
norm_type='rms_norm',
hidden_act='silu',
hidden_act_param=0.03,
scale_depth=1.4,
scale_emb=12,
dim_model_base=256
)
2024-10-07 15:18:38.605 | SUCCESS | yamain.command.llm_build:llm_build:101 - prepare llm model done!
tiling op... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3287/3287 0:00:44
build op serially... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7610/7610 0:04:09
build op... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11485/11485 0:00:00
add ddr swap... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 253160/253160 0:00:42
calc input dependencies... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 289230/289230 0:00:31
calc output dependencies... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 289230/289230 0:00:42
assign eu heuristic ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 289230/289230 0:00:51
assign eu onepass ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 289230/289230 0:00:10
assign eu greedy ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 289230/289230 0:00:12
building vision model ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:14:51
building llm decode layers ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40/40 0:04:24
building llm post layer ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:02:19
2024-10-07 15:40:14.676 | SUCCESS | yamain.command.llm_build:llm_build:170 - build llm model done!
2024-10-07 15:40:48.246 | SUCCESS | yamain.command.llm_build:llm_build:349 - check llm model done!
Get embed file
chmod +x ./tools/fp32_to_bf16
chmod +x ./tools/embed_process.sh
chmod +x ./tools/embed_process_vl.sh
./tools/embed_process_vl.sh openbmb/MiniCPM-V-2 openbmb/MiniCPM-V-2-ax650
The output files are:
root@xxx: tree openbmb/MiniCPM-V-2-ax650/
openbmb/MiniCPM-V-2-ax650/
├── minicpmv_p128_l0_together.axmodel
├── minicpmv_p128_l10_together.axmodel
...
├── minicpmv_p128_l19_together.axmodel
├── minicpmv_p128_l1_together.axmodel
├── minicpmv_p128_l20_together.axmodel
...
├── minicpmv_p128_l29_together.axmodel
├── minicpmv_p128_l2_together.axmodel
├── minicpmv_p128_l30_together.axmodel
...
├── minicpmv_p128_l39_together.axmodel
├── minicpmv_p128_l3_together.axmodel
...
├── minicpmv_p128_l8_together.axmodel
├── minicpmv_p128_l9_together.axmodel
├── minicpmv_post.axmodel
├── model.embed_tokens.weight.bfloat16.bin
└── vpm_resampler.axmodel
Run on the development board
The board-side deployment of MiniCPM-V requires the minicpmv branch of the ax-llm project.

root@ax650:/llm-test/minicpm-v-2.0# ./run_minicpmv-2.sh
[I][ Init][ 125]: LLM init start
2% | █ | 1 / 44 [0.21s<9.11s, 4.83 count/s] tokenizer init ok
[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 44 / 44 [33.54s<33.54s, 1.31 count/s] init vpm axmodel ok,remain_cmm(8086 MB)
[I][ Init][ 284]: max_token_len : 1023
[I][ Init][ 289]: kv_cache_size : 2304, kv_cache_num: 1023
[I][ Init][ 297]: prefill_token_num : 128
[I][ Init][ 306]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述下图片 (describe the picture)
image >> ssd_dog.jpg
[I][ Encode][ 365]: image encode time : 728.507019 ms
[I][ Run][ 589]: ttft: 520.94 ms
This picture shows a large, furry dog, possibly a Labrador or a similar breed, sitting on a porch painted in yellow and red. The dog appears to be resting, its gaze directed toward the camera with a calm expression. Behind the dog, a red bicycle with black accents on the frame is parked on the porch. Several bags hang on the bicycle, suggesting it may be used for travel or transport. In the background, a white vehicle, possibly a car, is parked behind the porch. The overall scene suggests a home environment, probably in a residential area.
[N][ Run][ 728]: hit eos,avg 5.55 token/s
4.7. Debugging Instructions
pulsar2 llm_build supports precision debugging through the --check_level option of the compile command:
--check_level 1: checks the first layer by comparing its ground-truth and compiled outputs (reported as a cosine similarity in the log).
--check_level 2: simulates a run of the compiled model files with the prompt given by --prompt.
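The --prompt string for --check_level 2 must follow the chat format the model expects; for Qwen2-Instruct this is the ChatML format used in the example below. A quick way to print the expected string on the host (a sketch; note that Qwen2's default template may additionally prepend a system turn):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
msgs = [{"role": "user", "content": "1+1=?"}]
# Ends with the assistant header so generation continues as the assistant.
print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))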
4.7.1. --check_level 1
pulsar2 llm_build --check_level 1 --input_path Qwen/Qwen2-0.5B-Instruct/ --output_path Qwen/Qwen2-0.5B-w8a16/ --kv_cache_len 1023 --hidden_state_type bf16 --prefill_len 128 --chip AX650
LOG:
pulsar2 llm_build --check_level 1 --input_path Qwen/Qwen2-0.5B-Instruct/ --output_path Qwen/Qwen2-0.5B-w8a16/ --kv_cache_len 1023 --hidden_state_type bf16 --prefill_len 128 --chip AX650 --parallel 8
Config(
model_name='Qwen2-0.5B-Instruct',
model_type='qwen2',
num_hidden_layers=24,
num_attention_heads=14,
num_key_value_heads=2,
hidden_size=896,
intermediate_size=4864,
vocab_size=151936,
rope_theta=1000000.0,
max_position_embeddings=32768,
rope_partial_factor=1.0,
rms_norm_eps=1e-06,
norm_type='rms_norm',
hidden_act='silu',
hidden_act_param=0.03,
scale_depth=1.4,
scale_emb=1,
dim_model_base=256
)
2024-10-07 01:23:28.414 | SUCCESS | yamain.command.llm_build:llm_build:101 - prepare llm model done!
building llm decode layers ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 0:00:39
building llm post layer ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:01:26
2024-10-07 01:25:34.765 | SUCCESS | yamain.command.llm_build:llm_build:170 - build llm model done!
2024-10-07 01:25:38.740 | INFO | yamain.command.llm_build:llm_build:294 - decode layer0_gt layer0_got cos_sim is: 0.9986067835921196
2024-10-07 01:25:45.421 | INFO | yamain.command.llm_build:llm_build:325 - prefill layer0_gt layer0_got cos_sim is: 0.9986067835921196
2024-10-07 01:25:45.421 | SUCCESS | yamain.command.llm_build:llm_build:349 - check llm model done!
4.7.2. --check_level 2
pulsar2 llm_build --check_level 2 --prompt "<|im_start|>user\n1+1=?<|im_end|>\n<|im_start|>assistant\n" --input_path Qwen/Qwen2-0.5B-Instruct/ --output_path Qwen/Qwen2-0.5B-w8a16/ --kv_cache_len 1023 --hidden_state_type bf16 --prefill_len 128 --chip AX650
Because debugging information is printed for every hidden layer, the output is quite long; only the key parts are shown here.
pulsar2 llm_build --check_level 2 --prompt "<|im_start|>user\n1+1=?<|im_end|>\n<|im_start|>assistant\n" --input_path Qwen/Qwen2-0.5B-Instruct/ --output_path Qwen/Qwen2-0.5B-w8a16/ --kv_cache_len 1023 --hidden_state_type bf16 --prefill_len 128 --chip AX650
Config(
model_name='Qwen2-0.5B-Instruct',
model_type='qwen2',
num_hidden_layers=24,
num_attention_heads=14,
num_key_value_heads=2,
hidden_size=896,
intermediate_size=4864,
vocab_size=151936,
rope_theta=1000000.0,
max_position_embeddings=32768,
rope_partial_factor=1.0,
rms_norm_eps=1e-06,
norm_type='rms_norm',
hidden_act='silu',
hidden_act_param=0.03,
scale_depth=1.4,
scale_emb=1,
dim_model_base=256
)
2024-10-07 01:04:57.881 | SUCCESS | yamain.command.llm_build:llm_build:101 - prepare llm model done!
building llm decode layers ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 0:00:39
building llm post layer ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:01:26
2024-10-07 01:07:04.398 | SUCCESS | yamain.command.llm_build:llm_build:170 - build llm model done!
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
load Qwen/Qwen2-0.5B-w8a16/qwen2_p128_l0_together
load Qwen/Qwen2-0.5B-w8a16/qwen2_p128_l1_together
...
load Qwen/Qwen2-0.5B-w8a16/qwen2_p128_l22_together
load Qwen/Qwen2-0.5B-w8a16/qwen2_p128_l23_together
2024-10-07 01:07:05.499 | INFO | yasched.llm_utils:run:497 - simulate layer 0
2024-10-07 01:07:11.902 | INFO | yasched.llm_utils:run:503 - end simulate
[[[-0.24707 0.0883789 -0.232422 ... -0.294922 0.0644531 -0.65625]
[0.0649414 -0.183594 -0.251953 ... -0.248047 -0.0231934 -0.138672]
[0.0766602 -0.0961914 0.152344 ... -0.0125732 0.106445 0.15625]
...
[-0.0737305 -0.210938 -0.455078 ... -0.640625 0.0429688 -0.263672]
[-0.0737305 -0.210938 -0.455078 ... -0.640625 0.0429688 -0.263672]
[-0.0737305 -0.210938 -0.455078 ... -0.640625 0.0429688 -0.263672]]]
2024-10-07 01:07:11.903 | INFO | yasched.llm_utils:run:497 - simulate layer 1
...
2024-10-07 01:09:35.992 | INFO | yasched.llm_utils:run:497 - simulate layer 23
2024-10-07 01:09:42.591 | INFO | yasched.llm_utils:run:503 - end simulate
[[[-1.25 0.222656 2.375 ... 2.07812 -0.410156 1.84375]
[-0.289062 -1.08594 0.234375 ... 1.07812 -0.257812 -1.96094]
[-0.0839844 -0.542969 0.636719 ... 3.21875 -0.351562 -2.01562]
...
[-3.21875 -0.478516 1.42188 ... 4.8125 1.21875 -0.294922]
[-3.21875 -0.478516 1.42188 ... 4.8125 1.21875 -0.294922]
[-3.21875 -0.478516 1.42188 ... 4.8125 1.21875 -0.294922]]]
2
posibile ('\n', 0.0),('答案', 0.0),('Result', 0.0),('0', 0.0),('3', 0.0),('2', 1.0),('1', 0.0),('Answer', 0.0),('\\', 0.0),('4', 0.0)
load Qwen/Qwen2-0.5B-w8a16/qwen2_p128_l0_together
load Qwen/Qwen2-0.5B-w8a16/qwen2_p128_l1_together
load Qwen/Qwen2-0.5B-w8a16/qwen2_p128_l2_together
...
start_indice = 12
2024-10-07 01:10:37.005 | INFO | yasched.llm_utils:run:556 - simulate layer 23
2024-10-07 01:10:38.859 | INFO | yasched.llm_utils:run:562 - end simulate
[-0.310547 -2.21875 0.871094 -1.86719 -0.546875]
start_indice = 12
<|im_end|>
posibile ('\n', 0.0),('\\t', 0.0),('<|im_start|>', 0.0),(' \\', 0.0),('.', 0.0),('\n\n', 0.0),(' ', 0.0),('\\', 0.0),('<|im_end|>', 1.0),('\\n', 0.0)
====================================================================================================
<|im_start|>user\n1+1=?<|im_end|>\n<|im_start|>assistant\n2<|im_end|>
====================================================================================================
hit eos!
2024-10-07 01:10:51.637 | SUCCESS | yamain.command.llm_build:llm_build:349 - check llm model done!