metadata
license: gemma
library_name: keras-nlp
extra_gated_heading: Access PaliGemma on Hugging Face
extra_gated_prompt: >-
  To access PaliGemma on Hugging Face, you’re required to review and agree to
  Google’s usage license. To do this, please ensure you’re logged-in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
pipeline_tag: image-text-to-text

PaliGemma is a set of multi-modal large language models published by Google and based on the Gemma model. Both pre-trained and instruction-tuned models are available. See the model card below for benchmarks, data sources, and intended use cases.

Links

Installation

Keras and KerasNLP can be installed with:

pip install -U -q keras-nlp
pip install -U -q "keras>=3"

JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the Keras Getting Started page.
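
After installing, it can be useful to confirm that the environment really has Keras 3 or newer. The following is a minimal sketch using only the Python standard library; the helper names `major_version` and `keras_is_v3_or_newer` are hypothetical and not part of Keras or KerasNLP.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '3.4.1'."""
    return int(version_string.split(".")[0])


def keras_is_v3_or_newer():
    """Return True if an installed 'keras' distribution reports major version >= 3."""
    try:
        return major_version(metadata.version("keras")) >= 3
    except metadata.PackageNotFoundError:
        # Keras is not installed in this environment.
        return False


print("Keras 3 available:", keras_is_v3_or_newer())
```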

Presets

The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

Preset name                  Parameters  Description
paligemma-3b-224-mix-keras   2.92B       image size 224, mix fine-tuned, text sequence length is 256
paligemma-3b-448-mix-keras   2.92B       image size 448, mix fine-tuned, text sequence length is 512
paligemma-3b-224-keras       2.92B       image size 224, pre-trained, text sequence length is 128
paligemma-3b-448-keras       2.92B       image size 448, pre-trained, text sequence length is 512
paligemma-3b-896-keras       2.93B       image size 896, pre-trained, text sequence length is 512
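
To pick a preset programmatically, the table can be mirrored in a small lookup. This is only a sketch: `PRESETS`, `find_presets`, and `preset_handle` are hypothetical names, and the `hf://google/` prefix is assumed to apply to every preset because it is the form used for the 448 preset in the loading example of this card.

```python
# Values copied from the preset table.
PRESETS = {
    "paligemma-3b-224-mix-keras": {"image_size": 224, "mix": True, "sequence_length": 256},
    "paligemma-3b-448-mix-keras": {"image_size": 448, "mix": True, "sequence_length": 512},
    "paligemma-3b-224-keras": {"image_size": 224, "mix": False, "sequence_length": 128},
    "paligemma-3b-448-keras": {"image_size": 448, "mix": False, "sequence_length": 512},
    "paligemma-3b-896-keras": {"image_size": 896, "mix": False, "sequence_length": 512},
}


def find_presets(image_size=None, mix=None):
    """Return preset names matching the requested image size and/or mix flag."""
    return [
        name
        for name, spec in PRESETS.items()
        if (image_size is None or spec["image_size"] == image_size)
        and (mix is None or spec["mix"] == mix)
    ]


def preset_handle(name):
    """Build a handle for from_preset(). Assumes the hf://google/ namespace."""
    return f"hf://google/{name}"


print(find_presets(image_size=448, mix=True))
print(preset_handle("paligemma-3b-448-keras"))
```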

Prompts

The PaliGemma "mix" models can handle a number of prompting structures out of the box. It is important to stick exactly to these prompts, including the trailing newline. {lang} is a language code such as "en" or "fr". Support for languages other than English varies by prompt type.

  • "cap {lang}\n": very raw short caption (from WebLI-alt).
  • "caption {lang}\n": COCO-like short captions.
  • "describe {lang}\n": somewhat longer, more descriptive captions.
  • "ocr\n": optical character recognition.
  • "answer en {question}\n": question answering about the image contents.
  • "question {lang} {answer}\n": question generation for a given answer.
  • "detect {thing} ; {thing}\n": object detection, locating each named object in a scene.
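
Because the exact prompt text matters, including the trailing newline, it can help to assemble prompts with a small function instead of typing them by hand. The helper below is a hypothetical sketch, not part of KerasNLP; it simply joins the task word, an optional language code, and optional free text, then appends "\n".

```python
def build_prompt(task, lang=None, text=None):
    """Assemble a PaliGemma-style prompt string that always ends with a newline."""
    parts = [task]
    if lang is not None:
        parts.append(lang)
    if text is not None:
        parts.append(text)
    return " ".join(parts) + "\n"


# repr() makes the trailing newline visible when printing.
print(repr(build_prompt("caption", lang="en")))
print(repr(build_prompt("ocr")))
print(repr(build_prompt("answer", lang="en", text="where is the cow standing?")))
print(repr(build_prompt("detect", text=" ; ".join(["cow", "dog"]))))
```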

Presets that are not "mix" presets should be fine-tuned for a specific task.

!pip install -U -q keras-nlp

Pick a backend of your choice

import os
os.environ["KERAS_BACKEND"] = "jax"
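
The environment variable has to be set before Keras is imported for the first time, since the backend is chosen at import. A small guard against typos in the backend name could look like the sketch below; `select_backend` and `SUPPORTED_BACKENDS` are hypothetical names introduced for illustration.

```python
import os

# The three backends named in the installation notes.
SUPPORTED_BACKENDS = ("jax", "tensorflow", "torch")


def select_backend(name):
    """Set KERAS_BACKEND, rejecting names that are not a known backend."""
    if name not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown backend {name!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    os.environ["KERAS_BACKEND"] = name
    return name


# Call this before the first `import keras`.
select_backend("jax")
```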

Now we can load the PaliGemma "causal language model" from the Hugging Face Hub. A causal language model is an LLM that is ready for generation: it is trained with a causal mask, and it generates text one token at a time in a recurrent loop.
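
To make the two ideas in that description concrete, here is a toy sketch in plain Python. It is not how KerasNLP implements generation; `causal_mask`, `generate_greedy`, and `toy_next_token` are hypothetical names, and the "model" is a stand-in that just counts upward.

```python
def causal_mask(length):
    """Row i is True only at positions j <= i, so token i never sees the future."""
    return [[j <= i for j in range(length)] for i in range(length)]


def generate_greedy(next_token_fn, prompt_tokens, max_new_tokens, end_token):
    """Append one token per step until end_token appears or the budget runs out."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)  # the model only sees the tokens so far
        tokens.append(token)
        if token == end_token:
            break
    return tokens


def toy_next_token(tokens):
    # Hypothetical stand-in for a real model: always predicts "last token + 1".
    return tokens[-1] + 1


print(causal_mask(3))
print(generate_greedy(toy_next_token, [1, 2], max_new_tokens=10, end_token=5))
```

Each generation step feeds the whole sequence produced so far back into the model, which is the "recurrent loop" referred to above.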

import keras
import keras_nlp

keras.config.set_floatx("bfloat16")
pali_gemma_lm = keras_nlp.models.PaliGemmaCausalLM.from_preset(
    "hf://google/paligemma-3b-448-keras"
)

Function that reads an image from a given URL

import io

import numpy as np
import PIL.Image
import requests

def read_image(url):
    # Download the image and decode it into a NumPy array.
    contents = io.BytesIO(requests.get(url).content)
    image = PIL.Image.open(contents)
    image = np.array(image)
    # Remove the alpha channel if necessary.
    if image.ndim == 3 and image.shape[2] == 4:
        image = image[:, :, :3]
    return image

image_url = 'https://storage.googleapis.com/keras-cv/models/paligemma/cow_beach_1.png'
image = read_image(image_url)

Call generate() with a single image and prompt. The text prompt must end with \n.

prompt = 'answer en where is the cow standing?\n'
output = pali_gemma_lm.generate(
    inputs={
        "images": image,
        "prompts": prompt,
    }
)
print(output)

Call generate() with batched images and prompts.

prompts = [
    'answer en where is the cow standing?\n',
    'answer en what color is the cow?\n',
    'describe en\n',
    'detect cow\n',
    'segment cow\n',
]
images = [image, image, image, image, image]
outputs = pali_gemma_lm.generate(
    inputs={
        "images": images,
        "prompts": prompts,
    }
)
for output in outputs:
    print(output)
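
With batched inputs it is easy to pass mismatched lists or to forget the trailing newline on one prompt. The following sketch checks both before building the inputs dictionary; `make_generate_inputs` is a hypothetical helper, and the strings standing in for images are placeholders rather than real image arrays.

```python
def make_generate_inputs(images, prompts):
    """Pair images with prompts and verify every prompt ends with a newline."""
    if len(images) != len(prompts):
        raise ValueError(
            f"Got {len(images)} images but {len(prompts)} prompts; counts must match."
        )
    for prompt in prompts:
        if not prompt.endswith("\n"):
            raise ValueError(f"Prompt {prompt!r} must end with a newline.")
    return {"images": list(images), "prompts": list(prompts)}


# Placeholder values stand in for image arrays in this illustration.
inputs = make_generate_inputs(
    images=["image_a", "image_b"],
    prompts=["describe en\n", "detect cow\n"],
)
print(inputs["prompts"])
```

The returned dictionary has the same "images" and "prompts" keys that generate() is called with above.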

There are a few other prompt styles this model can handle out of the box:

  • "cap {lang}\n": very raw short caption (from WebLI-alt).
  • "caption {lang}\n": nice, COCO-like short captions.
  • "describe {lang}\n": somewhat longer, more descriptive captions.
  • "ocr\n": optical character recognition.
  • "answer en {question}\n": question answering about the image contents.
  • "question {lang} {answer}\n": question generation for a given answer.
  • "detect {thing} ; {thing}\n": object detection, locating each named object in a scene.

Call fit() on a single batch

import numpy as np
image = np.random.uniform(-1, 1, size=(224, 224, 3))
x = {
    "images": [image, image],
    "prompts": ["answer en Where is the cow standing?\n", "caption en\n"],
}
y = {
    "responses": ["beach", "A brown cow standing on a beach next to the ocean."],
}
pali_gemma_lm = keras_nlp.models.PaliGemmaCausalLM.from_preset(
    "hf://google/paligemma-3b-448-keras"
)
pali_gemma_lm.fit(x=x, y=y, batch_size=2)
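
Fine-tuning data usually starts out as one record per example rather than as parallel lists. As a sketch, the hypothetical helper below turns (image, prompt, response) triples into the x and y dictionaries used in the fit() call above, adding the trailing newline to any prompt that lacks it. The strings standing in for images are placeholders.

```python
def make_fit_batch(examples):
    """Split (image, prompt, response) triples into the x and y dicts for fit()."""
    images, prompts, responses = [], [], []
    for image, prompt, response in examples:
        if not prompt.endswith("\n"):
            prompt += "\n"  # prompts must end with a newline
        images.append(image)
        prompts.append(prompt)
        responses.append(response)
    x = {"images": images, "prompts": prompts}
    y = {"responses": responses}
    return x, y


# Placeholder values stand in for image arrays in this illustration.
x, y = make_fit_batch(
    [
        ("image_a", "answer en Where is the cow standing?", "beach"),
        ("image_b", "caption en\n", "A brown cow standing on a beach next to the ocean."),
    ]
)
print(x["prompts"])
print(y["responses"])
```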