"Diffusers Image Fill" guide

Community Article Published September 13, 2024

This guide is an idea I'd had for a while, but I was asked for it by pietrobolcato here, so I finally decided to write it before it gets too old or I forget about it.

The basic idea is to build a simple object remover, or to fill a selected part of an image that you want to change. For this we will use a controlnet and a few easy techniques.

To be able to do this we need two key models: the first is ControlNetPlus Promax, and the second is one of the lightning models. Since I want photorealism, I'll use RealVisXL 5.0 Lightning.

The controlnet is not part of the diffusers core, but the official repository has all the instructions to make it work; you'll need the StableDiffusionXLControlNetUnionPipeline.

I also set up a space as a proof of concept for this guide. For it, I wrote a custom pipeline with just what we need to make this work. You can test it here; if you run the app locally you can see the cool effect of the image being generated and filling the mask.

First we need an image. I downloaded some from unsplash.com; let's use the car in the mountains as the demo. The original image is here and was taken by jeffersonsees.

Since this guide uses a custom pipeline with a custom controlnet that is not part of the core, I can't post the full code or it would be too big, so I'll give the key parts needed to make it work. I will also simplify the process by assuming square 1024x1024 images, which is not ideal in a real-world scenario; this should be adapted to work with any image, aspect ratio, and resolution.
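If you do want to handle other aspect ratios, a minimal sketch of one possible approach (this is my own assumption, not part of the original code) is to keep the aspect ratio, scale the longest side to 1024, and snap both dimensions to a multiple of 8, which SDXL expects:

from PIL import Image
from diffusers.utils import load_image

image = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/diffusers_fill/jefferson-sees-OCQjiB4tG5c-unsplash.jpg"
)

# Scale so the longest side is 1024, then round both sides down to a multiple of 8.
scale = 1024 / max(image.size)
new_width = int(image.width * scale) // 8 * 8
new_height = int(image.height * scale) // 8 * 8
resized = image.resize((new_width, new_height), Image.LANCZOS)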

For the guide itself I'll use Pillow to avoid doing too many conversions between formats, so let's crop the image to a square:

from PIL import Image

from diffusers.utils import load_image


source_image = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/diffusers_fill/jefferson-sees-OCQjiB4tG5c-unsplash.jpg"
)

width, height = source_image.size
min_dimension = min(width, height)

left = (width - min_dimension) / 2
top = (height - min_dimension) / 2
right = (width + min_dimension) / 2
bottom = (height + min_dimension) / 2

final_source = source_image.crop((left, top, right, bottom))
final_source = final_source.resize((1024, 1024), Image.LANCZOS)

Then we need a mask. You can use any method to get one: SAM2, BiRefNet, or any of the newer models that let you mask objects, or you can do it manually. Since this guide isn't about masking, I'll use the inpaint mask maker to generate one.

(the mask generated with the inpaint mask maker)
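If you prefer making a rough mask by hand instead, a minimal sketch (purely illustrative, the shape and coordinates are made up) is to paint the area you want replaced in white on a black canvas:

from PIL import Image, ImageDraw

# Illustrative only: a hand-made mask is a black canvas with the area to replace painted white.
manual_mask = Image.new("L", (1024, 1024), 0)
draw = ImageDraw.Draw(manual_mask)
draw.ellipse((300, 450, 780, 820), fill=255)
manual_mask.save("manual_mask.png")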

Now that we have the two images, we need to delete the masked part from the original; the result is the image we're going to feed to the controlnet.

from PIL import ImageChops

mask = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/diffusers_fill/car_mask.png"
).convert("L")

# Invert the mask and use it as the alpha channel: the masked area becomes transparent.
inverted_mask = ImageChops.invert(mask)
cnet_image = final_source.copy()
cnet_image.putalpha(inverted_mask)

This is the first part of the technique. The second is to use a lightning model, which needs fewer steps than a non-distilled model, and to run the controlnet tile mode at full strength for all of the steps so it preserves as much of the original image as possible.

I'll assume for this part the following:

  • You downloaded the ControlNetModel_Union model python file and have it in the same directory as the script.
  • You have downloaded the controlnet model weights locally and renamed the files accordingly.

The reason for the second point is that the official repo doesn't provide an easy-to-use format for the Promax version of the model. If you want to see how to load it directly from the Hub, you can read the official repository or look at the app code in the space.

import torch
from controlnet_union import ControlNetModel_Union
from diffusers import AutoencoderKL, StableDiffusionXLControlNetPipeline, TCDScheduler

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to("cuda")

controlnet_model = ControlNetModel_Union.from_pretrained(
    "./controlnet-union-sdxl-1.0",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "SG161222/RealVisXL_V5.0_Lightning",
    torch_dtype=torch.float16,
    vae=vae,
    custom_pipeline="OzzyGT/pipeline_sdxl_fill",
    controlnet=controlnet_model,
    variant="fp16",
).to("cuda")
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

prompt = "high quality"
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(prompt, "cuda", True)

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    image=cnet_image,
)

With this, we get something like this image:

I deliberately used a bad mask that leaves some details of the original car, which makes the generation weird or bad: sometimes we get the borders of the car, or something like this one. I even got a buffalo!

So now that we know how much the mask affects the result, I'll make a more detailed one with GIMP that I know works. Since that mask won't be pure black and white, we need to convert it to a binary mask.

mask = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/diffusers_fill/car_mask_good.png"
).convert("L")

binary_mask = mask.point(lambda p: 255 if p > 0 else 0)
inverted_mask = ImageChops.invert(binary_mask)

My custom pipeline does a lot of the work you normally have to do under the hood: it sets the steps, the scales, and the appropriate mode and image for the controlnet. You can use it if you want, but keep in mind that it's very restrictive and is mostly useful for what I use it for in this guide.

Also note that I use the TCD scheduler, since it's what I think works best with lightning models. I also tried PAG, but it made the results worse for some reason.
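For reference, these are roughly the knobs the custom pipeline takes care of for you. The names match the standard diffusers ControlNet pipeline arguments, but the exact values below are my illustrative assumptions, not the custom pipeline's real defaults:

# Illustrative only: roughly what the custom pipeline sets under the hood.
generation_kwargs = dict(
    num_inference_steps=8,               # distilled/lightning models need few steps
    guidance_scale=1.5,                  # distilled models work best with a low CFG
    controlnet_conditioning_scale=1.0,   # full strength so the original is preserved
    control_guidance_start=0.0,
    control_guidance_end=1.0,            # keep the controlnet active for all the steps
)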

Now we get something like this:

The last step for this guide: if you look closely, the image still gets altered. If the original, like this one, has good quality, you can see how it loses quality and some of the smaller details get blurry. To fix this we simply paste the generation over the original alpha image, and the beauty of this technique is that it merges seamlessly; most people won't know it was inpainted if you don't tell them (I tested this).

(left: raw generation, right: merged result)

For this example, since the bushes have a lot of detail, you can see the transition if you look closely, so depending on your use case it might be better to skip the final paste; but again, most people won't even notice.
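In code, the final merge is just a masked paste of the generation back over the original alpha image (the same step appears in the full listing at the end):

# Paste the generation back only inside the mask; every pixel outside it stays untouched.
image = image.convert("RGBA")
cnet_image.paste(image, (0, 0), binary_mask)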

Here are some more examples (you can use the space to test them too):

(two examples, each showing the original on the left and the filled result on the right)

Credits for the images:

First one: original by Leonardo Iribe

Second one: original by Raymond Petrik

There's also an added benefit that I plan to use in a new outpainting guide: you can expand an image, which is ideal for generating a temporary background that we can use to add detail later.

(left: original, right: expanded result)
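A minimal sketch of that idea, assuming the same pipeline as above (the sizes and offsets here are illustrative): place the source on a larger transparent canvas, and the transparent border becomes the area that gets generated, exactly like the masked fill.

from PIL import Image

# Illustrative outpainting prep: the transparent border is what the model will fill.
canvas = Image.new("RGBA", (1024, 1024), (0, 0, 0, 0))
small_source = final_source.convert("RGBA").resize((768, 768), Image.LANCZOS)
offset = ((canvas.width - small_source.width) // 2, (canvas.height - small_source.height) // 2)
canvas.paste(small_source, offset)

# `canvas` now plays the role of `cnet_image` in the pipeline call above.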

To improve the results, I encourage you to use some more advanced techniques like:

  • Use differential diffusion to merge the seams with the original image
  • Upscale the masked final generation, use it with img2img to add more details, and then paste it back onto the original (a rough sketch of this idea follows this list).
  • Adapt this to better models (SD3 or Flux) when their controlnets get as good as this one.
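As a rough sketch of the second idea (the refiner model choice, strength, and step count here are my assumptions, not tested values):

import torch
from PIL import Image
from diffusers import StableDiffusionXLImg2ImgPipeline, TCDScheduler

# Assumption: reuse the same lightning checkpoint for a low-strength img2img detail pass.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "SG161222/RealVisXL_V5.0_Lightning", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
refiner.scheduler = TCDScheduler.from_config(refiner.scheduler.config)

upscaled = cnet_image.convert("RGB").resize((2048, 2048), Image.LANCZOS)
detailed = refiner(
    prompt="high quality",
    image=upscaled,
    strength=0.3,            # low strength so the composition stays the same
    num_inference_steps=10,
).images[0]
# `detailed` can then be pasted back onto the original, as in the final merge above.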

That's it for this guide. I hope it helps you learn how to use this awesome controlnet and gives you a head start on getting good-quality images you can use in your work.

The final full code with the final merge:

import torch
from PIL import Image, ImageChops

from controlnet_union import ControlNetModel_Union
from diffusers import AutoencoderKL, StableDiffusionXLControlNetPipeline, TCDScheduler
from diffusers.utils import load_image


source_image = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/diffusers_fill/jefferson-sees-OCQjiB4tG5c-unsplash.jpg"
)

width, height = source_image.size
min_dimension = min(width, height)

left = (width - min_dimension) / 2
top = (height - min_dimension) / 2
right = (width + min_dimension) / 2
bottom = (height + min_dimension) / 2

final_source = source_image.crop((left, top, right, bottom))
final_source = final_source.resize((1024, 1024), Image.LANCZOS).convert("RGBA")

mask = load_image(
    "https://huggingface.co/datasets/OzzyGT/testing-resources/resolve/main/diffusers_fill/car_mask_good.png"
).convert("L")

# The mask from GIMP isn't pure black and white, so threshold it into a binary mask first.
binary_mask = mask.point(lambda p: 255 if p > 0 else 0)
inverted_mask = ImageChops.invert(binary_mask)

# Make the masked area transparent; the transparent pixels are what the model will fill.
alpha_image = Image.new("RGBA", final_source.size, (0, 0, 0, 0))
cnet_image = Image.composite(final_source, alpha_image, inverted_mask)

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16).to("cuda")

controlnet_model = ControlNetModel_Union.from_pretrained(
    "./controlnet-union-sdxl-1.0",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "SG161222/RealVisXL_V5.0_Lightning",
    torch_dtype=torch.float16,
    vae=vae,
    custom_pipeline="OzzyGT/pipeline_sdxl_fill",
    controlnet=controlnet_model,
    variant="fp16",
).to("cuda")
pipe.scheduler = TCDScheduler.from_config(pipe.scheduler.config)

prompt = "high quality"
(
    prompt_embeds,
    negative_prompt_embeds,
    pooled_prompt_embeds,
    negative_pooled_prompt_embeds,
) = pipe.encode_prompt(prompt, "cuda", True)

image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    negative_pooled_prompt_embeds=negative_pooled_prompt_embeds,
    image=cnet_image,
)

# Paste the generation back only inside the mask so everything else stays untouched.
image = image.convert("RGBA")
cnet_image.paste(image, (0, 0), binary_mask)

cnet_image.save("final_generation.png")