---
license: openrail++
tags:
- stable-diffusion
- text-to-image
pinned: true
---
# Model Card for flex-diffusion-2-1
<!-- Provide a quick summary of what the model is/does. [Optional] -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios.
## TLDR:
### There are 2 models in this repo:
- One based on stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1), finetuned for 6k steps.
- One based on stable-diffusion-2-base (stabilityai/stable-diffusion-2-base), finetuned for 6k steps on the same dataset.

For usage, see [How to Get Started with the Model](#how-to-get-started-with-the-model).
### It aims to solve the following issues:
1. Generated images look like they are cropped from a larger image.
2. Generating non-square images produces weird results, because the model was trained on square images.
Examples:
| resolution | model | stable diffusion | flex diffusion |
|:---------------:|:-------:|:----------------------------:|:-----------------------------:|
| 576x1024 (9:16) | v2-1 | ![img](imgs/21-576-1024.png) | ![img](imgs/21f-576-1024.png) |
| 576x1024 (9:16) | v2-base | ![img](imgs/2b-576-1024.png) | ![img](imgs/2bf-576-1024.png) |
| 1024x576 (16:9) | v2-1 | ![img](imgs/21-1024-576.png) | ![img](imgs/21f-1024-576.png) |
| 1024x576 (16:9) | v2-base | ![img](imgs/2b-1024-576.png) | ![img](imgs/2bf-1024-576.png) |
### Limitations:
1. It was trained on a small dataset, so its improvements may be limited.
2. For each aspect ratio, it was trained at only one fixed resolution, so it may not generate well at other resolutions.
    - For the 1:1 aspect ratio, it was fine-tuned at 512x512, even though stable-diffusion-2-1 was last finetuned at 768x768.
### Potential improvements:
1. Train on a larger dataset.
2. Train on different resolutions even for the same aspect ratio.
3. Train on specific aspect ratios, instead of a range of aspect ratios.
# Table of Contents
- [Model Card for flex-diffusion-2-1](#model-card-for-flex-diffusion-2-1)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
  - [Results](#results)
- [Model Card Authors](#model-card-authors)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
# Model Details
## Model Description
<!-- Provide a longer summary of what this model is/does. -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for dynamic aspect ratios.
Finetuned resolutions:
| | width | height | aspect ratio |
|---:|--------:|---------:|:---------------|
| 0 | 512 | 1024 | 1:2 |
| 1 | 576 | 1024 | 9:16 |
| 2 | 576 | 960 | 3:5 |
| 3 | 640 | 1024 | 5:8 |
| 4 | 512 | 768 | 2:3 |
| 5 | 640 | 896 | 5:7 |
| 6 | 576 | 768 | 3:4 |
| 7 | 512 | 640 | 4:5 |
| 8 | 640 | 768 | 5:6 |
| 9 | 640 | 704 | 10:11 |
| 10 | 512 | 512 | 1:1 |
| 11 | 704 | 640 | 11:10 |
| 12 | 768 | 640 | 6:5 |
| 13 | 640 | 512 | 5:4 |
| 14 | 768 | 576 | 4:3 |
| 15 | 896 | 640 | 7:5 |
| 16 | 768 | 512 | 3:2 |
| 17 | 1024 | 640 | 8:5 |
| 18 | 960 | 576 | 5:3 |
| 19 | 1024 | 576 | 16:9 |
| 20 | 1024 | 512 | 2:1 |
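To generate at an aspect ratio that is not exactly one of the above, pick the row whose aspect ratio is closest. A small illustrative helper for that (an assumption for convenience, not code shipped in this repo):

```python
# Finetuned (width, height) pairs, copied from the table above.
FINETUNED_RESOLUTIONS = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]

def nearest_finetuned_resolution(aspect_ratio: float) -> tuple:
    """Return the finetuned (width, height) whose aspect ratio is closest."""
    return min(FINETUNED_RESOLUTIONS, key=lambda wh: abs(aspect_ratio - wh[0] / wh[1]))

print(nearest_finetuned_resolution(16 / 9))  # (1024, 576)
```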
- **Developed by:** Jonathan Chang
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s)**: English
- **License:** CreativeML Open RAIL++-M (openrail++)
- **Parent Model:** https://huggingface.co/stabilityai/stable-diffusion-2-1
- **Resources for more information:** More information needed
# Uses
- See https://huggingface.co/stabilityai/stable-diffusion-2-1 for intended uses and limitations.
# Training Details
## Training Data
- LAION-Aesthetics dataset, specifically the subset with an aesthetics score of 6+
  - https://laion.ai/blog/laion-aesthetics/
  - https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
  - I only used a small portion of it; see [Preprocessing](#preprocessing)
- most common aspect ratios in the dataset (before preprocessing):
| | aspect_ratio | counts |
|---:|:---------------|---------:|
| 0 | 1:1 | 154727 |
| 1 | 3:2 | 119615 |
| 2 | 2:3 | 61197 |
| 3 | 4:3 | 52276 |
| 4 | 16:9 | 38862 |
| 5 | 400:267 | 21893 |
| 6 | 3:4 | 16893 |
| 7 | 8:5 | 16258 |
| 8 | 4:5 | 15684 |
| 9 | 6:5 | 12228 |
| 10 | 1000:667 | 12097 |
| 11 | 2:1 | 11006 |
| 12 | 800:533 | 10259 |
| 13 | 5:4 | 9753 |
| 14 | 500:333 | 9700 |
| 15 | 250:167 | 9114 |
| 16 | 5:3 | 8460 |
| 17 | 200:133 | 7832 |
| 18 | 1024:683 | 7176 |
| 19 | 11:10 | 6470 |
- predefined aspect ratios: identical to the finetuned resolutions table in [Model Description](#model-description)
## Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
### Preprocessing
1. Download the files (URL & caption columns) from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
    - I only used the first file, `train-00000-of-00007-29aec9150af50f9f.parquet`
2. Use img2dataset to convert it to a webdataset
    - https://github.com/rom1504/img2dataset
    - I put `train-00000-of-00007-29aec9150af50f9f.parquet` in a folder called `first-file`
    - the output folder is `/mnt/aesthetics6plus`; change this to your own folder
```bash
# input parquet folder and webdataset output folder (adjust to your own paths)
INPUT_FOLDER=first-file
OUTPUT_FOLDER=/mnt/aesthetics6plus
img2dataset --url_list $INPUT_FOLDER --input_format "parquet"\
--url_col "URL" --caption_col "TEXT" --output_format webdataset\
--output_folder $OUTPUT_FOLDER --processes_count 3 --thread_count 6 --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
--save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True
```
3. The data-loading code does the preprocessing on the fly, so no further steps are needed. It is not optimized for speed (GPU utilization fluctuates between 80% and 100%), and it is not written for multi-GPU training, so use it with caution. The code does the following (a sketch of the resize/crop step follows this list):
    - use webdataset to load the data
    - calculate the aspect ratio of each image
    - find the closest predefined aspect ratio and its associated resolution: `argmin(abs(aspect_ratio - predefined_aspect_ratios))`. E.g. if the aspect ratio is 1:3, the closest predefined aspect ratio is 1:2, whose associated resolution is 512x1024.
    - keeping the aspect ratio, resize the image so that each side is greater than or equal to the associated resolution. E.g. resize to 512x(512*3) = 512x1536
    - random-crop the image to the associated resolution. E.g. crop to 512x1024
    - if more than 10% of the image is lost in the cropping, discard the example
    - batch examples by aspect ratio, so all examples in a batch have the same aspect ratio
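The resize/crop/filter step could look roughly like this. This is an illustrative sketch of the rules described above, not the actual training code (`resize_and_random_crop` is a made-up name):

```python
import random
from PIL import Image

def resize_and_random_crop(img: Image.Image, target_w: int, target_h: int, max_loss: float = 0.10):
    """Resize (keeping aspect ratio) so both sides cover the target resolution,
    then take a random crop. Returns None when more than `max_loss` of the
    resized image's area would be thrown away, so the example can be discarded."""
    # scale so that width >= target_w and height >= target_h
    scale = max(target_w / img.width, target_h / img.height)
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    resized = img.resize((new_w, new_h), Image.LANCZOS)
    # fraction of the resized area lost to the crop
    if 1 - (target_w * target_h) / (new_w * new_h) > max_loss:
        return None
    left = random.randint(0, new_w - target_w)
    top = random.randint(0, new_h - target_h)
    return resized.crop((left, top, left + target_w, top + target_h))

# target_w, target_h come from the nearest-aspect-ratio lookup over the
# predefined resolutions, as sketched under Model Description.
```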
### Speeds, Sizes, Times
- Dataset size: 100k image-caption pairs, before filtering.
  - I didn't wait for the whole dataset to finish downloading; I copied the first 10 tar files and their index files into a new folder called `aesthetics6plus-small`, which holds 100k image-caption pairs in total. The full dataset is much bigger.
- Hardware: 1x RTX 3090 GPU
- Optimizer: 8-bit Adam (see the sketch after this list)
- Batch size: 32 (effective)
  - actual batch size: 2
  - gradient_accumulation_steps: 16
- Learning rate: warmed up to 2e-6 over 500 steps, then kept constant at 2e-6
- Training steps: 6k
- Epochs (approximate): 32 * 6k / 100k = 1.92 (not accounting for filtering)
  - each example is seen about 1.92 times on average
- Training time: approximately 1 day
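A minimal sketch of the optimizer and learning-rate setup above, assuming `bitsandbytes` for 8-bit Adam and the warmup scheduler helper from `diffusers`; `unet`, `train_dataloader`, and `compute_loss` are placeholders, not code from this repo:

```python
import bitsandbytes as bnb
from diffusers.optimization import get_constant_schedule_with_warmup

# 8-bit Adam, constant 2e-6 learning rate after a 500-step warmup
optimizer = bnb.optim.AdamW8bit(unet.parameters(), lr=2e-6)  # unet: placeholder for the UNet being finetuned
lr_scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=500)

gradient_accumulation_steps = 16  # actual batch size 2 -> effective batch size 32

for step, batch in enumerate(train_dataloader):      # train_dataloader: placeholder
    loss = compute_loss(unet, batch)                 # placeholder for the usual noise-prediction loss
    (loss / gradient_accumulation_steps).backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```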
## Results
More information needed
# Model Card Authors
Jonathan Chang
# How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel

def use_DPM_solver(pipe):
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead
# pipe = StableDiffusionPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-2-base",
#     unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#     torch_dtype=torch.float16,
# )
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")
```
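Since the point of the finetune is non-square generation, pass a `width`/`height` pair from the finetuned-resolutions table in [Model Description](#model-description), e.g. 1024x576 for 16:9:

```python
# generate at one of the finetuned resolutions (16:9 here)
image = pipe(prompt, width=1024, height=576).images[0]
image.save("astronaut_rides_horse_16x9.png")
```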