
ipex_llm_llama2_4bit_inference_on_intel_cpu.py


This quickstart demonstrates how to load a Llama-2 model with IPEX-LLM 4-bit optimizations and run inference on an Intel CPU.

import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# 1. Load the model with IPEX-LLM 4-bit optimizations
# You can replace "meta-llama/Llama-2-7b-chat-hf" with any compatible Hugging Face model
model_id = "meta-llama/Llama-2-7b-chat-hf"

# load_in_4bit=True is the key parameter that enables low-bit optimization
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# 2. Prepare the input
prompt = "What is AI?"
inputs = tokenizer(prompt, return_tensors="pt")

# 3. Generate a response
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print(output_str)
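One detail worth noting: Llama-2 *chat* checkpoints are trained on a specific `[INST]` prompt template, and a raw prompt like `"What is AI?"` may yield weaker answers than a properly wrapped one. A minimal formatting helper, sketched below, is a hypothetical addition (not part of the snippet above) that follows Meta's published Llama-2 chat format:

```python
def format_llama2_prompt(user_message, system_prompt=None):
    """Wrap a user message in the Llama-2 chat prompt template.

    Hypothetical helper: the [INST]/<<SYS>> markers follow Meta's
    documented Llama-2 chat format; adjust if your checkpoint differs.
    """
    if system_prompt:
        return ("[INST] <<SYS>>\n" + system_prompt + "\n<</SYS>>\n\n"
                + user_message + " [/INST]")
    return "[INST] " + user_message + " [/INST]"

# The formatted string can then be passed to the tokenizer in place of
# the raw prompt in the snippet above.
prompt = format_llama2_prompt("What is AI?")
print(prompt)  # [INST] What is AI? [/INST]
```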