Text-Generation Pipeline on Intel® Gaudi® 2 AI Accelerator

Published February 29, 2024

With the Generative AI (GenAI) revolution in full swing, text-generation with open-source transformer models like Llama 2 has become the talk of the town. AI enthusiasts as well as developers are looking to leverage the generative abilities of such models for their own use cases and applications. This article shows how easy it is to generate text with the Llama 2 family of models (7b, 13b and 70b) using Optimum Habana and a custom pipeline class – you'll be able to run the models with just a few lines of code!

This custom pipeline class has been designed to offer great flexibility and ease of use. Moreover, it provides a high level of abstraction and performs end-to-end text-generation, which involves pre-processing and post-processing. There are multiple ways to use the pipeline: you can run the run_pipeline.py script from the Optimum Habana repository, add the pipeline class to your own Python scripts, or initialize LangChain classes with it.

Prerequisites

Since the Llama 2 models are part of a gated repo, you need to request access if you haven't done so already. First, visit the Meta website and accept the terms and conditions. After you are granted access by Meta (this can take a day or two), request access to the model on Hugging Face, using the same email address you provided in the Meta form.

After you are granted access, please log in to your Hugging Face account by running the following command (you will need an access token, which you can get from your user profile page):

huggingface-cli login
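
If you prefer to authenticate from Python rather than the CLI, the huggingface_hub library provides an equivalent login helper. Here is a minimal sketch; the token string is a placeholder for your own read token.

# Programmatic alternative to `huggingface-cli login`.
# Replace the placeholder with your own read token from your user profile page.
from huggingface_hub import login

login(token="hf_xxx")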

You also need to install the latest version of Optimum Habana and clone the repo to access the pipeline script. Here are the commands to do so:

pip install optimum-habana==1.10.4
git clone -b v1.10-release https://github.com/huggingface/optimum-habana.git

If you plan to run distributed inference, install DeepSpeed according to your SynapseAI version. Here, I am using SynapseAI 1.14.0.

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.14.0

Now you are all set to perform text-generation with the pipeline!

Using the Pipeline

First, go to the following directory in your optimum-habana checkout where the pipeline scripts are located, and follow the instructions in the README to update your PYTHONPATH.

cd optimum-habana/examples/text-generation
pip install -r requirements.txt
cd text-generation-pipeline

If you wish to generate a sequence of text from a prompt of your choice, here is a sample command.

python run_pipeline.py --model_name_or_path meta-llama/Llama-2-7b-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "Here is my prompt"

You can also pass multiple prompts as input and change the temperature and top_p values for generation as follows.

python run_pipeline.py --model_name_or_path meta-llama/Llama-2-13b-hf --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --temperature 0.5 --top_p 0.95 --prompt "Hello world" "How are you?"

For generating text with large models such as Llama-2-70b, here is a sample command to launch the pipeline with DeepSpeed.

python ../../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py --model_name_or_path meta-llama/Llama-2-70b-hf --max_new_tokens 100 --bf16 --use_hpu_graphs --use_kv_cache --do_sample --temperature 0.5 --top_p 0.95 --prompt "Hello world" "How are you?" "Here is my prompt" "Once upon a time"

Usage in Python Scripts

You can use the pipeline class in your own scripts as shown in the example below. Run the following sample script from optimum-habana/examples/text-generation/text-generation-pipeline.

import argparse
import logging

from pipeline import GaudiTextGenerationPipeline
from run_generation import setup_parser

# Define a logger
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)

# Set up an argument parser
parser = argparse.ArgumentParser()
args = setup_parser(parser)

# Define some pipeline arguments. Note that --model_name_or_path is a required argument for this script
args.num_return_sequences = 1
args.model_name_or_path = "meta-llama/Llama-2-7b-hf"
args.max_new_tokens = 100
args.use_hpu_graphs = True
args.use_kv_cache = True
args.do_sample = True

# Initialize the pipeline
pipe = GaudiTextGenerationPipeline(args, logger)

# You can provide input prompts as strings
prompts = ["He is working on", "Once upon a time", "Far far away"]

# Generate text with pipeline
for prompt in prompts:
    print(f"Prompt: {prompt}")
    output = pipe(prompt)
    print(f"Generated Text: {repr(output)}")

You will have to run the above script as python <name_of_script>.py --model_name_or_path a_model_name, since --model_name_or_path is a required argument. However, the model name can be changed programmatically, as shown in the snippet above.

This shows that the pipeline class operates on string inputs and performs data pre-processing as well as post-processing for us.
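
To make the pre-processing and post-processing steps concrete, here is an illustrative sketch of the overall flow using plain transformers APIs. This is not the actual GaudiTextGenerationPipeline implementation, which additionally handles HPU graphs, the KV cache and other Gaudi-specific optimizations behind the same string-in, string-out interface.

# Illustrative sketch of an end-to-end text-generation flow
# (not the actual GaudiTextGenerationPipeline implementation).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate(prompt: str, max_new_tokens: int = 100) -> str:
    # Pre-processing: turn the prompt string into input token IDs
    inputs = tokenizer(prompt, return_tensors="pt")
    # Generation: sample new tokens from the model
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    # Post-processing: decode the generated token IDs back into a string
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate("Once upon a time"))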

LangChain Compatibility

The text-generation pipeline can be plugged into LangChain classes via the use_with_langchain constructor argument. You can install LangChain as follows.

pip install langchain==0.0.191

Here is a sample script that shows how the pipeline class can be used with LangChain.

import argparse
import logging

from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

from pipeline import GaudiTextGenerationPipeline
from run_generation import setup_parser

# Define a logger
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)

# Set up an argument parser
parser = argparse.ArgumentParser()
args = setup_parser(parser)

# Define some pipeline arguments. Note that --model_name_or_path is a required argument for this script
args.num_return_sequences = 1
args.model_name_or_path = "meta-llama/Llama-2-13b-chat-hf"
args.max_input_tokens = 2048
args.max_new_tokens = 1000
args.use_hpu_graphs = True
args.use_kv_cache = True
args.do_sample = True
args.temperature = 0.2
args.top_p = 0.95

# Initialize the pipeline
pipe = GaudiTextGenerationPipeline(args, logger, use_with_langchain=True)

# Create LangChain object
llm = HuggingFacePipeline(pipeline=pipe)

template = """Use the following pieces of context to answer the question at the end. If you don't know the answer,\
just say that you don't know, don't try to make up an answer.

Context: Large Language Models (LLMs) are the latest models used in NLP.
Their superior performance over smaller models has made them incredibly
useful for developers building NLP enabled applications. These models
can be accessed via Hugging Face's `transformers` library, via OpenAI
using the `openai` library, and via Cohere using the `cohere` library.

Question: {question}
Answer: """

prompt = PromptTemplate(input_variables=["question"], template=template)
llm_chain = LLMChain(prompt=prompt, llm=llm)

# Use LangChain object
question = "Which libraries and model providers offer LLMs?"
response = llm_chain(prompt.format(question=question))
print(f"Question 1: {question}")
print(f"Response 1: {response['text']}")

question = "What is the provided context about?"
response = llm_chain(prompt.format(question=question))
print(f"\nQuestion 2: {question}")
print(f"Response 2: {response['text']}")

The pipeline class has been validated for LangChain version 0.0.191 and may not work with other versions of the package.
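
If you are unsure which LangChain version is installed in your environment, a quick check before running the script can save some debugging time. Here is a minimal sketch.

# Warn if the installed LangChain version differs from the validated one
from importlib.metadata import version

installed = version("langchain")
if installed != "0.0.191":
    print(f"Warning: this pipeline was validated with langchain 0.0.191, but {installed} is installed.")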

Conclusion

We presented a custom text-generation pipeline on the Intel® Gaudi® 2 AI accelerator that accepts single or multiple prompts as input. This pipeline offers great flexibility in terms of model size as well as parameters affecting text-generation quality. Furthermore, it is very easy to use and to plug into your scripts, and it is compatible with LangChain.

Use of the pretrained model is subject to compliance with third-party licenses, including the “Llama 2 Community License Agreement” (LLAMAV2). For guidance on the intended use of the Llama 2 model, what is considered misuse and out-of-scope use, who the intended users are, and additional terms, please review and read the instructions at https://ai.meta.com/llama/license/. Users bear sole liability and responsibility to follow and comply with any third-party licenses, and Habana Labs disclaims and will bear no liability with respect to users’ use of or compliance with third-party licenses. To be able to run gated models like Llama-2-70b-hf, you need the following:

  • Have a Hugging Face account
  • Agree to the terms of use of the model in its model card on the HF Hub
  • Set a read access token
  • Log in to your account using the HF CLI: run huggingface-cli login before launching your script