7. LLM Build (Compile)

Supported platforms

  • AX650A/AX650N/AX8850 - SDK ≥ v3.6.2

  • AX630C - SDK ≥ v3.0.0

Verified models

  • Qwen3, Qwen2.5

  • DeepSeek-R1-Distill

  • MiniCPM4

  • InternVL2_5, InternVL3

  • ChatGLM3

  • OpenBuddy

  • SmolLM2

  • Llama3.2

  • Gemma2

  • Phi2, Phi3

  • TinyLlama

This chapter explains the basic workflow for converting models from Hugging Face. With the pulsar2 tool, you can compile the *.safetensors or pytorch_model.bin weights of a Hugging Face project into axmodel files. First set up the development environment by following the "Development environment preparation" chapter.

The example model in this chapter is Qwen3-0.6B.

Version constraint

This document is written based on Pulsar2 version 5.2.

LLM ModelZoo

We periodically adapt popular community LLMs and publish prebuilt models together with on-board run examples.

Related project: AX-LLM

This project explores what common LLMs (Large Language Models) can do on existing chip platforms, so developers can quickly evaluate and build their own LLM applications.

7.1. Command reference

In the Pulsar2 toolchain, use pulsar2 llm_build to convert an LLM model.

root@xxx:/data# pulsar2 llm_build --help
usage: pulsar2 llm_build [-h] [--input_path INPUT_PATH] [--output_path OUTPUT_PATH] [--prefill_len PREFILL_LEN] [--parallel PARALLEL]
                         [--model_config MODEL_CONFIG] [--model_type MODEL_TYPE] [--kv_cache_len KV_CACHE_LEN] [--post_topk POST_TOPK]
                         [--post_weight_type {bf16,s8,fp8_e5m2,fp8_e4m3}] [-t {fp16,bf16,fp32}] [-w {fp16,bf16,fp32,s8,s4,fp8_e5m2,fp8_e4m3}] [-c CHECK_LEVEL]
                         [--chip {AX620E,AX650,LAMBERT}] [--prompt PROMPT] [--image_size IMAGE_SIZE] [--last_kv_cache_len LAST_KV_CACHE_LEN]
                         [--tensor_parallel_size TENSOR_PARALLEL_SIZE] [--ret_postnorm] [--ld_param_opt] [--npu_mode {NPU1,NPU2,NPU3}]

options:
  -h, --help            show this help message and exit
  --input_path INPUT_PATH
                        path of model or npy path (default: )
  --output_path OUTPUT_PATH
                        path of dumpped ax_model (default: .)
  --prefill_len PREFILL_LEN
                        token length of prefill (default: 0)
  --parallel PARALLEL   build parallel (default: 1)
  --model_config MODEL_CONFIG
                        config file (default: )
  --model_type MODEL_TYPE
                        config file (default: )
  --kv_cache_len KV_CACHE_LEN
                        length of kv_cache (default: 127)
  --post_topk POST_TOPK
                        post model output indices and prob (default: 0)
  --post_weight_type {bf16,s8,fp8_e5m2,fp8_e4m3}
                        post weight type (default: s8)
  -t {fp16,bf16,fp32}, --hidden_state_type {fp16,bf16,fp32}
                        hidden_state dtype (default: bf16)
  -w {fp16,bf16,fp32,s8,s4,fp8_e5m2,fp8_e4m3}, --weight_type {fp16,bf16,fp32,s8,s4,fp8_e5m2,fp8_e4m3}
                        weight dtype (default: s8)
  -c CHECK_LEVEL, --check_level CHECK_LEVEL
                        check level 0:run 1:layer_check 2: cal 1+1 (default: 0)
  --chip {AX620E,AX650,LAMBERT}
                        chip (default: AX650)
  --prompt PROMPT       prompt for check_level==2 (default: 1+1=)
  --image_size IMAGE_SIZE
                        vlm vision_part input_size (default: 224)
  --last_kv_cache_len LAST_KV_CACHE_LEN
                        last kv cache len (default: None)
  --tensor_parallel_size TENSOR_PARALLEL_SIZE
                        tensor parallel size (default: 0)
  --ret_postnorm        weather to return post_norm value in post layer (default: False)
  --ld_param_opt        ld_param_opt (default: False)
  --npu_mode {NPU1,NPU2,NPU3}

7.2. Download the ax-llm-build project

If you want to compile a raw Hugging Face model into an axmodel file yourself, you can use the helper scripts in ax-llm-build to download the model, process embeddings, and so on. If you are running a prebuilt model from AXERA-TECH on the board directly, you can skip this step.

git clone https://github.com/AXERA-TECH/ax-llm-build.git

7.3. Download Qwen3-0.6B

cd ax-llm-build
pip install -U huggingface_hub
hf download Qwen/Qwen3-0.6B --local-dir Qwen/Qwen3-0.6B
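Before building, it can help to confirm that the download produced the weight files named at the start of this chapter (*.safetensors or pytorch_model.bin). The following is a minimal sketch; `find_weight_files` and `looks_buildable` are hypothetical helper names, not part of ax-llm-build, and the check for config.json assumes the usual Hugging Face repository layout.

```python
from pathlib import Path


def find_weight_files(model_dir):
    """Return the sorted weight file names found directly in model_dir."""
    root = Path(model_dir)
    names = {p.name for p in root.glob("*.safetensors")}
    if (root / "pytorch_model.bin").is_file():
        names.add("pytorch_model.bin")
    return sorted(names)


def looks_buildable(model_dir):
    """True if config.json and at least one weight file are present.

    Assumption: the build reads config.json from the same directory.
    """
    root = Path(model_dir)
    return (root / "config.json").is_file() and bool(find_weight_files(root))
```

For the download above, `looks_buildable("Qwen/Qwen3-0.6B")` would be the call to make from inside the ax-llm-build directory.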

7.4. Build (compile)

pulsar2 llm_build --input_path Qwen/Qwen3-0.6B/  --output_path Qwen/Qwen3-0.6B-ax650 --hidden_state_type bf16 --kv_cache_len 1023 --prefill_len 128 --last_kv_cache_len 128 --last_kv_cache_len 256 --last_kv_cache_len 384 --last_kv_cache_len 512  --chip AX650 -c 1 --parallel 8
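The command line above is long, and --last_kv_cache_len is repeated once per value. If you script the build, a small wrapper can make that explicit. This is only a sketch: `llm_build_cmd` is a hypothetical helper, it uses only flags listed in the help output in 7.1, and its defaults mirror the defaults printed there.

```python
import shlex


def llm_build_cmd(input_path, output_path, *, kv_cache_len, prefill_len,
                  last_kv_cache_lens=(), chip="AX650",
                  hidden_state_type="bf16", check_level=0, parallel=1):
    """Assemble the argument list for `pulsar2 llm_build`."""
    cmd = [
        "pulsar2", "llm_build",
        "--input_path", str(input_path),
        "--output_path", str(output_path),
        "--hidden_state_type", hidden_state_type,
        "--kv_cache_len", str(kv_cache_len),
        "--prefill_len", str(prefill_len),
    ]
    # The documented command repeats --last_kv_cache_len once per value.
    for n in last_kv_cache_lens:
        cmd += ["--last_kv_cache_len", str(n)]
    cmd += ["--chip", chip, "-c", str(check_level), "--parallel", str(parallel)]
    return cmd


cmd = llm_build_cmd(
    "Qwen/Qwen3-0.6B/", "Qwen/Qwen3-0.6B-ax650",
    kv_cache_len=1023, prefill_len=128,
    last_kv_cache_lens=(128, 256, 384, 512),
    check_level=1, parallel=8,
)
print(shlex.join(cmd))
# To run the build from Python: subprocess.run(cmd, check=True)
```

Building the list first and printing it with `shlex.join` lets you compare it with the documented command before anything is executed.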

7.4.1. Log example

pulsar2 llm_build --input_path Qwen/Qwen3-0.6B/  --output_path Qwen/Qwen3-0.6B-ax650 --hidden_state_type bf16 --kv_cache_len 1023 --prefill_len 128 --last_kv_cache_len 128 --last_kv_cache_len 256 --last_kv_cache_len 384 --last_kv_cache_len 512  --chip AX650 -c 1 --parallel 8
Config(
    model_name='Qwen3-0.6B',
    model_type='qwen3',
    num_hidden_layers=28,
    num_attention_heads=16,
    num_key_value_heads=8,
    hidden_size=1024,
    head_dim=128,
    intermediate_size=3072,
    vocab_size=151936,
    rope_theta=1000000,
    max_position_embeddings=40960,
    rope_partial_factor=1.0,
    rope_local_base_freq=None,
    rms_norm_eps=1e-06,
    norm_type='rms_norm',
    hidden_act='silu',
    hidden_act_param=0.03,
    scale_depth=1.4,
    scale_emb=1,
    dim_model_base=256,
    origin_model_type='',
    quant=False,
    quant_sym=False,
    quant_bits=4,
    quant_group_size=128,
    rs_factor=32,
    rs_high_freq_factor=4.0,
    rs_low_freq_factor=1.0,
    rs_original_max_position_embeddings=8192,
    rs_rope_type='',
    rs_alpha=None,
    rs_beta_fast=None,
    rs_beta_slow=None,
    rs_mscale=None,
    rs_mscale_all_dim=None,
    rs_mrope_section=[16, 24, 24],
    interleaved_mrope=False,
    use_qk_norm=False,
    qk_norm_after_rope=False,
    layer_types=[],
    kv_cache_len=1023
)
2026-03-23 21:05:42.252 | SUCCESS  | yamain.command.llm_build:llm_build:258 - prepare llm model done!
building llm decode layers   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 28/28     0:02:50
building llm post layer   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1     0:01:23
2026-03-23 21:09:56.131 | SUCCESS  | yamain.command.llm_build:llm_build:368 - build llm model done!
2026-03-23 21:10:01.591 | INFO     | yamain.command.llm_build:llm_build:519 - decode layer0_gt layer0_got cos_sim is: 1.0
2026-03-23 21:10:12.356 | INFO     | yamain.command.llm_build:llm_build:553 - prefill layer0_gt layer0_got cos_sim is: 1.0
2026-03-23 21:10:12.357 | SUCCESS  | yamain.command.llm_build:llm_build:578 - check llm model done!
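With -c 1, the log above ends with cos_sim lines for the decode and prefill checks. In an automated build you may want to read those values back from a saved log. The sketch below assumes the line format shown in this example; the helper names and the 0.99 threshold are illustrative assumptions, not values defined by pulsar2.

```python
import re

# Matches e.g. "decode layer0_gt layer0_got cos_sim is: 1.0"
COS_SIM_RE = re.compile(r"(decode|prefill) (\S+) (\S+) cos_sim is: ([0-9.eE+-]+)")


def parse_cos_sim(log_text):
    """Return a list of (stage, value) pairs found in llm_build log text."""
    return [(m.group(1), float(m.group(4))) for m in COS_SIM_RE.finditer(log_text)]


def check_passed(log_text, threshold=0.99):
    """True if at least one cos_sim line exists and all are >= threshold."""
    results = parse_cos_sim(log_text)
    return bool(results) and all(v >= threshold for _, v in results)


sample = (
    "... - decode layer0_gt layer0_got cos_sim is: 1.0\n"
    "... - prefill layer0_gt layer0_got cos_sim is: 1.0\n"
)
print(parse_cos_sim(sample))  # [('decode', 1.0), ('prefill', 1.0)]
```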

Note

The host configuration used in this example:

  • Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz

  • Memory: 32 GB

The whole process takes about 5 minutes. Conversion time varies with the host machine.

7.4.2. Extract and optimize embeddings

chmod +x ./tools/fp32_to_bf16
chmod +x ./tools/embed_process.sh
./tools/embed_process.sh Qwen/Qwen3-0.6B/ Qwen/Qwen3-0.6B-ax650/
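The file names produced by this step (float32.bin, then bfloat16.bin) suggest that the embedding table is narrowed from float32 to bfloat16. As background, the sketch below shows what that conversion looks like for a single value when done by plain truncation. This is an assumption for illustration only: how tools/fp32_to_bf16 rounds is not documented here, and it may differ.

```python
import struct


def fp32_to_bf16_bits(x):
    """Return the 16-bit bfloat16 pattern for x by truncating its float32 form."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits


def bf16_bits_to_fp32(b):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


print(hex(fp32_to_bf16_bits(1.0)))  # 0x3f80
```

Because bfloat16 keeps the full float32 exponent, the value range is preserved and only mantissa precision is lost, while the table takes half the storage.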

7.4.3. Output files

root@xxx:/data/ax-llm-build# tree Qwen/Qwen3-0.6B-ax650/
Qwen/Qwen3-0.6B-ax650/
├── model.embed_tokens.weight.bfloat16.bin
├── model.embed_tokens.weight.float32.bin   # Temporary file, can be deleted
├── model.embed_tokens.weight.npy           # Temporary file, can be deleted
├── qwen3_p128_l0_together.axmodel
├── qwen3_p128_l10_together.axmodel
├── qwen3_p128_l11_together.axmodel
├── qwen3_p128_l12_together.axmodel
├── qwen3_p128_l13_together.axmodel
├── qwen3_p128_l14_together.axmodel
├── qwen3_p128_l15_together.axmodel
├── qwen3_p128_l16_together.axmodel
├── qwen3_p128_l17_together.axmodel
├── qwen3_p128_l18_together.axmodel
├── qwen3_p128_l19_together.axmodel
├── qwen3_p128_l1_together.axmodel
├── qwen3_p128_l20_together.axmodel
├── qwen3_p128_l21_together.axmodel
├── qwen3_p128_l22_together.axmodel
├── qwen3_p128_l23_together.axmodel
├── qwen3_p128_l24_together.axmodel
├── qwen3_p128_l25_together.axmodel
├── qwen3_p128_l26_together.axmodel
├── qwen3_p128_l27_together.axmodel
├── qwen3_p128_l2_together.axmodel
├── qwen3_p128_l3_together.axmodel
├── qwen3_p128_l4_together.axmodel
├── qwen3_p128_l5_together.axmodel
├── qwen3_p128_l6_together.axmodel
├── qwen3_p128_l7_together.axmodel
├── qwen3_p128_l8_together.axmodel
├── qwen3_p128_l9_together.axmodel
└── qwen3_post.axmodel

0 directories, 32 files

Of these, model.embed_tokens.weight.bfloat16.bin, qwen3_p128_l0_together.axmodel through qwen3_p128_l27_together.axmodel, and qwen3_post.axmodel are required to run on the board.
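Before copying the output directory to the board, you can check that every required file is present. The sketch below derives the expected names from the listing above; the naming pattern (`<model_type>_p<prefill_len>_l<i>_together.axmodel`) is inferred from this one example and may not hold for other models. The helper names are hypothetical.

```python
from pathlib import Path


def required_runtime_files(num_layers=28, prefill_len=128, model_type="qwen3"):
    """List the file names this example needs on the board.

    Assumption: the naming pattern is inferred from the tree listing above.
    """
    names = ["model.embed_tokens.weight.bfloat16.bin", f"{model_type}_post.axmodel"]
    names += [
        f"{model_type}_p{prefill_len}_l{i}_together.axmodel"
        for i in range(num_layers)
    ]
    return names


def missing_runtime_files(out_dir, **kwargs):
    """Return the required file names that are absent from out_dir."""
    root = Path(out_dir)
    return [n for n in required_runtime_files(**kwargs) if not (root / n).is_file()]
```

An empty list from `missing_runtime_files("Qwen/Qwen3-0.6B-ax650")` means the 30 runtime files are in place; the two temporary embedding files are deliberately not checked.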

7.5. Run on the development board

This section shows how to run an LLM model on an AX650 development board.

7.5.1. Install axllm

We recommend using the installation script provided by the ax-llm project to install directly on the board:

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

After installation, run the following command to confirm axllm is installed successfully:

root@ax650:~/llm-test# axllm --help
Usage:
  axllm run <model_path> [options]    Run interactive chat mode
  axllm serve <model_path> [options]  Run HTTP API server mode

Arguments:
  model_path    Path to model directory containing config.json and model files

Serve options:
  --port <port> Server port (default: 8080)

Model directory structure:
  model_path/
    ├── config.json          # Model configuration
    ├── tokenizer.txt        # Tokenizer model
    ├── *.axmodel            # AXera model files
    └── post_config.json     # Post-processing config (optional)

If you want to learn how to build manually, see the AX-LLM project documentation.

7.5.2. Run the LLM with ax-llm

You can download all files for this example directly from Hugging Face. The current ModelZoo already includes the tokenizer files in each model repo, so you no longer need to run a separate tokenizer parser. Download the Hugging Face model directory and pass it to axllm run or axllm serve; the model is loaded and run automatically, which is simpler than in older versions.

Take AXERA-TECH/Qwen3-0.6B as an example:

pip install -U huggingface_hub
hf download AXERA-TECH/Qwen3-0.6B --local-dir Qwen3-0.6B

7.5.3. Run in CLI

root@ax650:~/llm-test# axllm run Qwen3-0.6B/
[I][                            Init][ 138]: LLM init start
tokenizer_type = 1
 96% | ███████████████████████████████   |  30 /  31 [3.25s<3.36s, 9.23 count/s] init post axmodel ok,remain_cmm(8662 MB)
[I][                            Init][ 199]: max_token_len : 2559
[I][                            Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2559
[I][                            Init][ 205]: prefill_token_num : 128
[I][                            Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 209]: grp: 2, prefill_max_kv_cache_num : 512
[I][                            Init][ 209]: grp: 3, prefill_max_kv_cache_num : 1024
[I][                            Init][ 209]: grp: 4, prefill_max_kv_cache_num : 1536
[I][                            Init][ 209]: grp: 5, prefill_max_kv_cache_num : 2048
[I][                            Init][ 214]: prefill_max_token_num : 2048
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [3.25s<3.25s, 9.54 count/s] embed_selector init ok
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 272]: LLM init ok
Type "q" to exit
Ctrl+c to stop current running
"reset" to reset kvcache
"dd" to remove last conversation.
"pp" to print history.
----------------------------------------
prompt >> who are you?
[I][                      SetKVCache][ 406]: prefill_grpid:2 kv_cache_num:512 precompute_len:0 input_num_token:23
[I][                      SetKVCache][ 408]: current prefill_max_token_num:2048
[I][                      SetKVCache][ 409]: first run
[I][                             Run][ 457]: input token num : 23, prefill_split_num : 1
[I][                             Run][ 497]: prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=23
[I][                             Run][ 519]: prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=0
[I][                             Run][ 627]: ttft: 173.71 ms
<think>
Okay, the user asked, "Who are you?" I need to respond appropriately. Let me start by acknowledging their question. I should mention that I'm an AI assistant designed to help with various tasks. It's important to keep the response friendly and open-ended so they feel comfortable sharing more. I should make sure to highlight that I'm here to assist and that I'm not a person. Let me check if there's any additional information I should include to make the response more helpful. Alright, that should cover it.
</think>

I'm an AI assistant designed to help with a wide range of tasks and questions. I'm here to assist you with anything you need! Let me know how I can help!

[N][                             Run][ 709]: hit eos,avg 15.68 token/s

[I][                      GetKVCache][ 380]: precompute_len:168, remaining:1880
prompt >> q
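Some of the numbers in this log can be related by simple arithmetic. The sketch below reproduces two of them. The relationships are inferred from this one log and are not taken from the ax-llm source, so treat them as a reading aid rather than a specification; the function names are hypothetical.

```python
import math


def prefill_split_num(input_tokens, prefill_token_num):
    """Number of prefill chunks needed for a prompt (assumed: ceiling division)."""
    return math.ceil(input_tokens / prefill_token_num)


def remaining_tokens(prefill_max_token_num, precompute_len):
    """Tokens left after the cached history (assumed: simple difference)."""
    return prefill_max_token_num - precompute_len


print(prefill_split_num(23, 128))   # 1, matches "prefill_split_num : 1"
print(remaining_tokens(2048, 168))  # 1880, matches "remaining:1880"
```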

7.5.4. Run as a service

axllm can serve a model directory directly as an OpenAI API-compatible service, which is convenient for integration and for building your own applications on top of it.

root@ax650:~/llm-test# axllm serve Qwen3-0.6B/
[I][                            Init][ 138]: LLM init start
tokenizer_type = 1
 96% | ███████████████████████████████   |  30 /  31 [2.65s<2.74s, 11.30 count/s] init post axmodel ok,remain_cmm(8662 MB)
[I][                            Init][ 199]: max_token_len : 2559
[I][                            Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2559
[I][                            Init][ 205]: prefill_token_num : 128
[I][                            Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 209]: grp: 2, prefill_max_kv_cache_num : 512
[I][                            Init][ 209]: grp: 3, prefill_max_kv_cache_num : 1024
[I][                            Init][ 209]: grp: 4, prefill_max_kv_cache_num : 1536
[I][                            Init][ 209]: grp: 5, prefill_max_kv_cache_num : 2048
[I][                            Init][ 214]: prefill_max_token_num : 2048
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [2.65s<2.65s, 11.68 count/s] embed_selector init ok
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3-0.6B'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3-0.6B

After the service starts, you can call it through the standard OpenAI-compatible API.
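The axllm help text in 7.5.1 lists 8080 as the default port, while the startup log above reports port 8000, so a client script may prefer to read the address from the log instead of hard-coding it. The sketch below does that under the assumption that the "OpenAI API Server starting on ..." line keeps the format shown here; `base_url_from_log` is a hypothetical helper.

```python
import re
from urllib.parse import urlsplit


def base_url_from_log(log_text, host="127.0.0.1"):
    """Build a client base URL from the server's startup log, or return None."""
    m = re.search(r"OpenAI API Server starting on (http://\S+)", log_text)
    if m is None:
        return None
    port = urlsplit(m.group(1)).port
    # 0.0.0.0 is a listen address, so substitute a host the client can reach.
    if port is None:
        return f"http://{host}/v1"
    return f"http://{host}:{port}/v1"


log = "OpenAI API Server starting on http://0.0.0.0:8000\nMax concurrency: 1\n"
print(base_url_from_log(log))  # http://127.0.0.1:8000/v1
```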

7.5.5. API call example

The simplest example is a chat completion request sent with curl:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"AXERA-TECH/Qwen3-0.6B","messages":[{"role":"user","content":"Hello"}]}'
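The same request can be sent from Python using only the standard library. This is a minimal sketch, not the project's own client: the function names are hypothetical, and the response path `choices[0].message.content` is assumed from the OpenAI chat completions format, which the service is described as compatible with.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Create a POST request for <base_url>/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Pull the assistant text out of a parsed chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(base_url, model, prompt, timeout=60):
    """Send one prompt and return the reply text (needs a running server)."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the service from 7.5.4 running, `chat("http://127.0.0.1:8000/v1", "AXERA-TECH/Qwen3-0.6B", "Hello")` would send the same request as the curl command.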

If you want to test with the example script in the ax-llm project, you can do the following:

root@ax650:~/llm-test# curl -sOL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/refs/heads/axllm/scripts/openai_demo.py

root@ax650:~/llm-test# python openai_demo.py --model AXERA-TECH/Qwen3-0.6B --api_url http://127.0.0.1:8000/v1
assistant:
<think>
Okay, the user just said "hello". I need to respond appropriately. Since they're greeting me, I should acknowledge their greeting. Maybe say "Hello!" in a friendly way. Let me check if there's any specific context I should consider, but the user didn't mention anything else. I should keep it simple and welcoming. Alright, time to send a response.
</think>

Hello! How can I assist you today? 😊

To customize the prompt, pass --prompt:

root@ax650:~/llm-test# python openai_demo.py --model AXERA-TECH/Qwen3-0.6B --api_url http://127.0.0.1:8000/v1 --prompt "Please introduce yourself."

Note: openai_demo.py is only an example for calling the API. In real applications, we recommend integrating directly according to the OpenAI API specification.
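The replies shown in this chapter contain a <think>...</think> block before the final answer. If your application only needs the answer, one option is to remove that block on the client side. The sketch below assumes the tags appear exactly as in the outputs above; `strip_think` is a hypothetical helper and not part of ax-llm.

```python
import re

# Non-greedy match so that each <think>...</think> block is removed separately.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strip_think(text):
    """Return text with any <think>...</think> blocks removed."""
    return THINK_RE.sub("", text).strip()


reply = "<think>\nThe user greeted me.\n</think>\n\nHello! How can I assist you today?"
print(strip_think(reply))  # Hello! How can I assist you today?
```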

For the board-side build flow of the runtime program, and more details about run / serve / API usage, see our open-source project on GitHub: AX-LLM