---
license: openrail++
tags:
- stable-diffusion
- text-to-image
pinned: true
---

# Model Card for flex-diffusion-2-1

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios.

## TLDR:

### There are 2 models in this repo:
- One based on stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1), finetuned for 6k steps.
- One based on stable-diffusion-2-base (stabilityai/stable-diffusion-2-base), finetuned for 6k steps on the same dataset.

For usage, see [How to Get Started with the Model](#how-to-get-started-with-the-model).

### It aims to solve the following issues:
1. Generated images look like they are cropped from a larger image.
2. Generating non-square images gives poor results, because the base model was trained on square images.

Examples:

| resolution | model | stable diffusion | flex diffusion |
|:---------------:|:-------:|:----------------------------:|:-----------------------------:|
| 576x1024 (9:16) | v2-1 | ![img](imgs/21-576-1024.png) | ![img](imgs/21f-576-1024.png) |
| 576x1024 (9:16) | v2-base | ![img](imgs/2b-576-1024.png) | ![img](imgs/2bf-576-1024.png) |
| 1024x576 (16:9) | v2-1 | ![img](imgs/21-1024-576.png) | ![img](imgs/21f-1024-576.png) |
| 1024x576 (16:9) | v2-base | ![img](imgs/2b-1024-576.png) | ![img](imgs/2bf-1024-576.png) |

### Limitations:
1. It is trained on a small dataset, so its improvements may be limited.
2. Each aspect ratio is trained at only one fixed resolution, so the model may not generate well at other resolutions. For the 1:1 aspect ratio it is fine-tuned at 512x512, although flex-diffusion-2-1 was last finetuned at 768x768.

### Potential improvements:
1. Train on a larger dataset.
2. Train on multiple resolutions for the same aspect ratio.
3. Train on specific aspect ratios, instead of a range of aspect ratios.

# Table of Contents

- [Model Card for flex-diffusion-2-1](#model-card-for-flex-diffusion-2-1)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
  - [Results](#results)
- [Model Card Authors](#model-card-authors)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for dynamic aspect ratios.
Finetuned resolutions:

|    |   width |   height | aspect ratio |
|---:|--------:|---------:|:-------------|
|  0 |     512 |     1024 | 1:2          |
|  1 |     576 |     1024 | 9:16         |
|  2 |     576 |      960 | 3:5          |
|  3 |     640 |     1024 | 5:8          |
|  4 |     512 |      768 | 2:3          |
|  5 |     640 |      896 | 5:7          |
|  6 |     576 |      768 | 3:4          |
|  7 |     512 |      640 | 4:5          |
|  8 |     640 |      768 | 5:6          |
|  9 |     640 |      704 | 10:11        |
| 10 |     512 |      512 | 1:1          |
| 11 |     704 |      640 | 11:10        |
| 12 |     768 |      640 | 6:5          |
| 13 |     640 |      512 | 5:4          |
| 14 |     768 |      576 | 4:3          |
| 15 |     896 |      640 | 7:5          |
| 16 |     768 |      512 | 3:2          |
| 17 |    1024 |      640 | 8:5          |
| 18 |     960 |      576 | 5:3          |
| 19 |    1024 |      576 | 16:9         |
| 20 |    1024 |      512 | 2:1          |

- **Developed by:** Jonathan Chang
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** creativeml-openrail-m
- **Parent Model:** https://huggingface.co/stabilityai/stable-diffusion-2-1
- **Resources for more information:** More information needed

# Uses

- See https://huggingface.co/stabilityai/stable-diffusion-2-1

# Training Details

## Training Data

- LAION aesthetics dataset, the subset with an aesthetics rating of 6 or higher
  - https://laion.ai/blog/laion-aesthetics/
  - https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
- I only used a small portion of it; see [Preprocessing](#preprocessing)
- Most common aspect ratios in the dataset (before preprocessing):

|    | aspect_ratio |   counts |
|---:|:-------------|---------:|
|  0 | 1:1          |   154727 |
|  1 | 3:2          |   119615 |
|  2 | 2:3          |    61197 |
|  3 | 4:3          |    52276 |
|  4 | 16:9         |    38862 |
|  5 | 400:267      |    21893 |
|  6 | 3:4          |    16893 |
|  7 | 8:5          |    16258 |
|  8 | 4:5          |    15684 |
|  9 | 6:5          |    12228 |
| 10 | 1000:667     |    12097 |
| 11 | 2:1          |    11006 |
| 12 | 800:533      |    10259 |
| 13 | 5:4          |     9753 |
| 14 | 500:333      |     9700 |
| 15 | 250:167      |     9114 |
| 16 | 5:3          |     8460 |
| 17 | 200:133      |     7832 |
| 18 | 1024:683     |     7176 |
| 19 | 11:10        |     6470 |

- Predefined aspect ratios:

|    |   width |   height | aspect ratio |
|---:|--------:|---------:|:-------------|
|  0 |     512 |     1024 | 1:2          |
|  1 |     576 |     1024 | 9:16         |
|  2 |     576 |      960 | 3:5          |
|  3 |     640 |     1024 | 5:8          |
|  4 |     512 |      768 | 2:3          |
|  5 |     640 |      896 | 5:7          |
|  6 |     576 |      768 | 3:4          |
|  7 |     512 |      640 | 4:5          |
|  8 |     640 |      768 | 5:6          |
|  9 |     640 |      704 | 10:11        |
| 10 |     512 |      512 | 1:1          |
| 11 |     704 |      640 | 11:10        |
| 12 |     768 |      640 | 6:5          |
| 13 |     640 |      512 | 5:4          |
| 14 |     768 |      576 | 4:3          |
| 15 |     896 |      640 | 7:5          |
| 16 |     768 |      512 | 3:2          |
| 17 |    1024 |      640 | 8:5          |
| 18 |     960 |      576 | 5:3          |
| 19 |    1024 |      576 | 16:9         |
| 20 |    1024 |      512 | 2:1          |

## Training Procedure

### Preprocessing

1. Download the files with URLs & captions from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus.
   - I only used the first file, `train-00000-of-00007-29aec9150af50f9f.parquet`.
2. Use img2dataset (https://github.com/rom1504/img2dataset) to convert it to a webdataset.
   - I put `train-00000-of-00007-29aec9150af50f9f.parquet` in a folder called `first-file`.
   - The output folder is `/mnt/aesthetics6plus`; change this to your own folder.

```bash
INPUT_FOLDER=first-file
OUTPUT_FOLDER=/mnt/aesthetics6plus
img2dataset --url_list $INPUT_FOLDER --input_format "parquet"\
 --url_col "URL" --caption_col "TEXT" --output_format webdataset\
 --output_folder $OUTPUT_FOLDER --processes_count 3 --thread_count 6 --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
 --save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True
```

3. The data-loading code does the remaining preprocessing on the fly, so nothing else needs to be prepared.

The data-loading code is not optimized for speed (GPU utilization fluctuates between 80% and 100%), and it is not written for multi-GPU training, so use it with caution. It does the following (see the sketch after this list):

- use webdataset to load the data
- calculate the aspect ratio of each image
- find the closest aspect ratio and its associated resolution among the predefined resolutions, i.e. `argmin(abs(aspect_ratio - predefined_aspect_ratios))`. E.g. for an image with aspect ratio 1:3, the closest predefined aspect ratio is 1:2, whose associated resolution is 512x1024.
- keeping the aspect ratio, resize the image so that each side is larger than or equal to the associated resolution. E.g. resize the 1:3 image to 512x(512*3) = 512x1536.
- random-crop the image to the associated resolution, e.g. crop to 512x1024.
- if more than 10% of the image is lost in the cropping, discard the example.
- batch examples by aspect ratio, so all examples in a batch have the same aspect ratio.
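The actual dataloader is not included in this card, but the bucketing and cropping steps above can be sketched roughly as follows. All names here (`PREDEFINED_RESOLUTIONS`, `bucket_and_crop`) are hypothetical, and Pillow is used only for illustration:

```python
# Rough sketch of the on-the-fly preprocessing described above.
# NOT the actual training dataloader; names and structure are illustrative.
import random
from PIL import Image

# (width, height) pairs from the "predefined aspect ratios" table above
PREDEFINED_RESOLUTIONS = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]

def bucket_and_crop(img):
    """Resize and random-crop `img` to the predefined resolution whose aspect
    ratio is closest to the image's own; return None if the crop would discard
    more than 10% of the image."""
    ar = img.width / img.height
    # closest predefined aspect ratio: argmin(abs(aspect_ratio - predefined_aspect_ratios))
    tw, th = min(PREDEFINED_RESOLUTIONS, key=lambda wh: abs(ar - wh[0] / wh[1]))
    # keep the aspect ratio; make both sides >= the target resolution
    scale = max(tw / img.width, th / img.height)
    rw, rh = round(img.width * scale), round(img.height * scale)
    # discard the example if the crop would lose more than 10% of the image
    if (tw * th) / (rw * rh) < 0.9:
        return None
    resized = img.resize((rw, rh), Image.BICUBIC)
    # random crop to the target resolution
    x = random.randint(0, rw - tw)
    y = random.randint(0, rh - th)
    return resized.crop((x, y, x + tw, y + th))
```

Batching by aspect ratio (the last bullet) is then just a matter of grouping the surviving examples by their target resolution before collating each batch.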
### Speeds, Sizes, Times

- Dataset size: 100k image-caption pairs, before filtering.
  - I didn't wait for the whole dataset to download; I copied the first 10 tar files and their index files into a new folder called `aesthetics6plus-small`, with 100k image-caption pairs in total. The full dataset is much bigger.
- Hardware: 1 RTX 3090 GPU
- Optimizer: 8-bit Adam
- Batch size: 32
  - actual batch size: 2
  - gradient_accumulation_steps: 16
  - effective batch size: 32
- Learning rate: warmup to 2e-6 over 500 steps, then kept constant at 2e-6
- Training steps: 6k
- Epochs (approximate): 32 * 6k / 100k = 1.92 (not accounting for the filtering), so each example is seen about 1.92 times on average.
- Training time: approximately 1 day

## Results

More information needed

# Model Card Authors

Jonathan Chang

# How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel

def use_DPM_solver(pipe):
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead:
# pipe = StableDiffusionPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-2-base",
#     unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#     torch_dtype=torch.float16,
# )
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")
```
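To take advantage of the aspect-ratio finetuning, pass one of the finetuned resolutions explicitly through the pipeline's `width` and `height` arguments. For example, continuing from the code above (the 1024x576 values come from the finetuned-resolutions table):

```python
# generate at one of the finetuned resolutions (16:9 -> 1024x576);
# other (width, height) pairs from the "finetuned resolutions" table work the same way,
# while resolutions outside that table may give weaker results (see Limitations)
image = pipe(prompt, width=1024, height=576).images[0]
image.save("astronaut_rides_horse_16x9.png")
```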