---
license: openrail++
tags:
- stable-diffusion
- text-to-image
pinned: true
---

# Model Card for flex-diffusion-2-1

<!-- Provide a quick summary of what the model is/does. [Optional] -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios.

## TLDR:

### There are 2 models in this repo:
- One based on stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for 6k steps.
- One based on stable-diffusion-2-base (stabilityai/stable-diffusion-2-base) finetuned for 6k steps, on the same dataset.

For usage, see [How to Get Started with the Model](#how-to-get-started-with-the-model).

### It aims to solve the following issues:
1. Generated images look like they are cropped from a larger image.

2. Generating non-square images creates weird results, due to the model being trained on square images.
Examples:

| resolution      | model   |   stable diffusion           |   flex diffusion              |
|:---------------:|:-------:|:----------------------------:|:-----------------------------:|
| 576x1024 (9:16) | v2-1    | ![img](imgs/21-576-1024.png) | ![img](imgs/21f-576-1024.png) |
| 576x1024 (9:16) | v2-base | ![img](imgs/2b-576-1024.png) | ![img](imgs/2bf-576-1024.png) |
| 1024x576 (16:9) | v2-1    | ![img](imgs/21-1024-576.png) | ![img](imgs/21f-1024-576.png) |
| 1024x576 (16:9) | v2-base | ![img](imgs/2b-1024-576.png) | ![img](imgs/2bf-1024-576.png) |

### Limitations:
1. It's trained on a small dataset, so its improvements may be limited.
2. For each aspect ratio, it's trained at only one fixed resolution, so it may not be able to generate images at other resolutions.
   For the 1:1 aspect ratio, it's finetuned at 512x512, although stable-diffusion-2-1 was last finetuned at 768x768.

### Potential improvements:
1. Train on a larger dataset.
2. Train on different resolutions even for the same aspect ratio.
3. Train on specific aspect ratios, instead of a range of aspect ratios.


# Table of Contents

- [Model Card for flex-diffusion-2-1](#model-card-for-flex-diffusion-2-1)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
  - [Results](#results)
- [Model Card Authors](#model-card-authors)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)


# Model Details

## Model Description

<!-- Provide a longer summary of what this model is/does. -->
stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for dynamic aspect ratios.

Finetuned resolutions:

|    |   width |   height | aspect ratio   |
|---:|--------:|---------:|:---------------|
|  0 |     512 |     1024 | 1:2            |
|  1 |     576 |     1024 | 9:16           |
|  2 |     576 |      960 | 3:5            |
|  3 |     640 |     1024 | 5:8            |
|  4 |     512 |      768 | 2:3            |
|  5 |     640 |      896 | 5:7            |
|  6 |     576 |      768 | 3:4            |
|  7 |     512 |      640 | 4:5            |
|  8 |     640 |      768 | 5:6            |
|  9 |     640 |      704 | 10:11          |
| 10 |     512 |      512 | 1:1            |
| 11 |     704 |      640 | 11:10          |
| 12 |     768 |      640 | 6:5            |
| 13 |     640 |      512 | 5:4            |
| 14 |     768 |      576 | 4:3            |
| 15 |     896 |      640 | 7:5            |
| 16 |     768 |      512 | 3:2            |
| 17 |    1024 |      640 | 8:5            |
| 18 |     960 |      576 | 5:3            |
| 19 |    1024 |      576 | 16:9           |
| 20 |    1024 |      512 | 2:1            |

- **Developed by:** Jonathan Chang
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s)**: English
- **License:** creativeml-openrail-m
- **Parent Model:** https://huggingface.co/stabilityai/stable-diffusion-2-1
- **Resources for more information:** More information needed

# Uses

- see https://huggingface.co/stabilityai/stable-diffusion-2-1


# Training Details

## Training Data

- The subset of the LAION aesthetics dataset with a rating of 6+
  - https://laion.ai/blog/laion-aesthetics/
  - https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
- I only used a small portion of it; see [Preprocessing](#preprocessing)


- most common aspect ratios in the dataset (before preprocessing)

|    | aspect_ratio   |   counts |
|---:|:---------------|---------:|
|  0 | 1:1            |   154727 |
|  1 | 3:2            |   119615 |
|  2 | 2:3            |    61197 |
|  3 | 4:3            |    52276 |
|  4 | 16:9           |    38862 |
|  5 | 400:267        |    21893 |
|  6 | 3:4            |    16893 |
|  7 | 8:5            |    16258 |
|  8 | 4:5            |    15684 |
|  9 | 6:5            |    12228 |
| 10 | 1000:667       |    12097 |
| 11 | 2:1            |    11006 |
| 12 | 800:533        |    10259 |
| 13 | 5:4            |     9753 |
| 14 | 500:333        |     9700 |
| 15 | 250:167        |     9114 |
| 16 | 5:3            |     8460 |
| 17 | 200:133        |     7832 |
| 18 | 1024:683       |     7176 |
| 19 | 11:10          |     6470 |

- predefined aspect ratios

|    |   width |   height | aspect ratio   |
|---:|--------:|---------:|:---------------|
|  0 |     512 |     1024 | 1:2            |
|  1 |     576 |     1024 | 9:16           |
|  2 |     576 |      960 | 3:5            |
|  3 |     640 |     1024 | 5:8            |
|  4 |     512 |      768 | 2:3            |
|  5 |     640 |      896 | 5:7            |
|  6 |     576 |      768 | 3:4            |
|  7 |     512 |      640 | 4:5            |
|  8 |     640 |      768 | 5:6            |
|  9 |     640 |      704 | 10:11          |
| 10 |     512 |      512 | 1:1            |
| 11 |     704 |      640 | 11:10          |
| 12 |     768 |      640 | 6:5            |
| 13 |     640 |      512 | 5:4            |
| 14 |     768 |      576 | 4:3            |
| 15 |     896 |      640 | 7:5            |
| 16 |     768 |      512 | 3:2            |
| 17 |    1024 |      640 | 8:5            |
| 18 |     960 |      576 | 5:3            |
| 19 |    1024 |      576 | 16:9           |
| 20 |    1024 |      512 | 2:1            |
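The aspect ratio labels in the tables above are the width:height pair reduced to lowest terms. The following minimal sketch reproduces a few of them; the helper name `aspect_label` is made up for illustration and is not part of the training code.

```python
from fractions import Fraction


def aspect_label(width: int, height: int) -> str:
    """Return width:height reduced to lowest terms, e.g. 640x704 -> '10:11'."""
    ratio = Fraction(width, height)  # Fraction divides out the GCD automatically
    return f"{ratio.numerator}:{ratio.denominator}"


# A few rows from the predefined table
for width, height in [(512, 1024), (576, 1024), (640, 704), (1024, 576)]:
    print(f"{width}x{height} -> {aspect_label(width, height)}")
```

The same reduction explains the unusual labels in the dataset statistics: an 800x534 image, for example, reduces to 400:267 rather than 3:2.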


## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

### Preprocessing


1. Download the files with URLs & captions from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
    - I only used the first file, `train-00000-of-00007-29aec9150af50f9f.parquet`.
2. Use img2dataset to convert them to webdataset format.
    - https://github.com/rom1504/img2dataset
    - I put `train-00000-of-00007-29aec9150af50f9f.parquet` in a folder called `first-file`.
    - The output folder is `/mnt/aesthetics6plus`; change this to your own folder.

```bash
# Folder containing the parquet file with URLs and captions
INPUT_FOLDER=first-file
# Where the webdataset shards are written; change this to your own folder
OUTPUT_FOLDER=/mnt/aesthetics6plus
img2dataset --url_list "$INPUT_FOLDER" --input_format "parquet" \
        --url_col "URL" --caption_col "TEXT" --output_format webdataset \
        --output_folder "$OUTPUT_FOLDER" --processes_count 3 --thread_count 6 \
        --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
        --save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True
```

3. The data-loading code does the remaining preprocessing on the fly, so nothing else is needed. It isn't optimized for speed (GPU utilization fluctuates between 80% and 100%), and it isn't written for multi-GPU training, so use it with caution. The code will do the following:
- use webdataset to load the data
- calculate the aspect ratio of each image
- find the closest aspect ratio and its associated resolution from the predefined resolutions: `argmin(abs(aspect_ratio - predefined_aspect_ratios))`. E.g. if the aspect ratio is 1:3, the closest aspect ratio is 1:2, and its associated resolution is 512x1024.
- keeping the aspect ratio, resize the image so that each side is larger than or equal to the associated resolution. E.g. resize to 512x(512*3) = 512x1536
- randomly crop the image to the associated resolution. E.g. crop to 512x1024
- if more than 10% of the image is lost in the cropping, discard this example
- batch examples by aspect ratio, so all examples in a batch have the same aspect ratio
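
The bucket selection, resize, and discard steps above can be sketched as follows. This is not the actual data-loading code. It assumes that the aspect ratio is computed as width / height and that "lost" means the fraction of resized pixels removed by the crop. The function names are hypothetical.

```python
# (width, height) pairs copied from the "predefined aspect ratios" table
PREDEFINED = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]


def closest_resolution(width, height):
    """argmin(abs(aspect_ratio - predefined_aspect_ratios)), ratio = width / height (assumed)."""
    ratio = width / height
    return min(PREDEFINED, key=lambda wh: abs(ratio - wh[0] / wh[1]))


def plan_resize_and_crop(width, height, max_loss=0.10):
    """Return the target bucket, the resized size, and whether the example is kept."""
    target_w, target_h = closest_resolution(width, height)
    # Smallest uniform scale at which the image covers the target on both sides
    scale = max(target_w / width, target_h / height)
    resized_w = max(target_w, round(width * scale))
    resized_h = max(target_h, round(height * scale))
    # Fraction of the resized image that the crop throws away (assumed definition)
    lost = 1 - (target_w * target_h) / (resized_w * resized_h)
    return {
        "target": (target_w, target_h),
        "resized": (resized_w, resized_h),
        "lost": lost,
        "keep": lost <= max_loss,
    }


# The 1:3 example from the list above: bucket 512x1024, resized to 512x1536,
# about a third of the image is cropped away, so the example is discarded.
print(plan_resize_and_crop(400, 1200))
```

Under these assumptions, a 1920x1080 image maps exactly onto the 1024x576 bucket and loses nothing in the crop, so it is kept.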


### Speeds, Sizes, Times

- Dataset size: 100k image-caption pairs, before filtering.
  - I didn't wait for the whole dataset to download. Instead, I copied the first 10 tar files and their index files to a new folder called `aesthetics6plus-small`, which holds 100k image-caption pairs in total. The full dataset is a lot bigger.

- Hardware: 1 RTX 3090 GPU

- Optimizer: 8bit Adam

- Batch size: 32
  - actual batch size: 2
  - gradient_accumulation_steps: 16
  - effective batch size: 32

- Learning rate: 2e-6, warmed up over the first 500 steps and then kept constant
- Training steps: 6k
- Number of epochs (approximate): 32 * 6k / 100k = 1.92 (not accounting for the filtering)
  - Each example is seen 1.92 times on average.

- Training time: approximately 1 day
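
The batch-size and epoch arithmetic above, together with the learning-rate schedule, can be checked with a few lines of Python. The shape of the warmup is not stated here, so the linear ramp in `learning_rate` is an assumption, and the function itself is illustrative rather than taken from the training script.

```python
def learning_rate(step, peak_lr=2e-6, warmup_steps=500):
    """Warm up to peak_lr over warmup_steps (linear ramp is an assumption), then stay constant."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


per_step_batch = 2       # what fits on the GPU at once
grad_accum_steps = 16    # gradient_accumulation_steps
effective_batch = per_step_batch * grad_accum_steps

train_steps = 6_000
dataset_size = 100_000   # before filtering
examples_seen = effective_batch * train_steps
epochs = examples_seen / dataset_size

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen}")
print(f"approximate epochs: {epochs:.2f}")
print(f"lr at step 0: {learning_rate(0):.2e}, at step 5999: {learning_rate(5999):.2e}")
```

Because the filtering step discards some examples, the real number of passes over the remaining data is somewhat higher than this estimate.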

## Results

More information needed

# Model Card Authors

Jonathan Chang


# How to Get Started with the Model

Use the code below to get started with the model.


```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel

def use_DPM_solver(pipe):
    # Swap the default scheduler for the DPM-Solver multistep scheduler, reusing the existing config
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe

# Load the base pipeline, replacing its UNet with the finetuned one from this repo
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead
#pipe = StableDiffusionPipeline.from_pretrained(
#    "stabilityai/stable-diffusion-2-base",
#    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#    torch_dtype=torch.float16)
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")
```
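
To generate a non-square image, pass `width` and `height` to the pipeline call. The sketch below wraps that call in a hypothetical helper, `generate_at`, which warns when the requested size is not one of the finetuned resolutions listed in the Model Description, since other sizes may not work as well. The helper and its name are illustrative and not part of this repo.

```python
import warnings

# (width, height) pairs from the finetuned-resolutions table
FINETUNED_SIZES = {
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
}


def generate_at(pipe, prompt, width, height):
    """Generate one image at width x height, warning if the size was not finetuned."""
    # Stable Diffusion pipelines require both sides to be divisible by 8
    if width % 8 != 0 or height % 8 != 0:
        raise ValueError("width and height must be divisible by 8")
    if (width, height) not in FINETUNED_SIZES:
        warnings.warn(f"{width}x{height} is not one of the finetuned resolutions")
    return pipe(prompt, width=width, height=height).images[0]


# Usage, with `pipe` and `prompt` created as in the snippet above:
# image = generate_at(pipe, prompt, width=1024, height=576)  # 16:9
# image.save("astronaut_rides_horse_16_9.png")
```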