---
license: openrail++
tags:
- stable-diffusion
- text-to-image
pinned: true
---

# Model Card for flex-diffusion-2-1

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned with different aspect ratios.

## TLDR:

### There are 2 models in this repo:
- One based on stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1), finetuned for 6k steps.
- One based on stable-diffusion-2-base (stabilityai/stable-diffusion-2-base), finetuned for 6k steps on the same dataset.

For usage, see [How to Get Started with the Model](#how-to-get-started-with-the-model).

### It aims to solve the following issues:
1. Generated images look like they are cropped from a larger image.
Examples:

2. Generating non-square images produces distorted results, because the base model was trained on square images.
Examples:


### Limitations:
1. It was trained on a small dataset, so its improvements may be limited.
2. For each aspect ratio, it was trained at only one fixed resolution, so it may not be able to generate images at other resolutions.
For the 1:1 aspect ratio, it was fine-tuned at 512x512, although stable-diffusion-2-1 itself was last finetuned at 768x768.

### Potential improvements:
1. Train on a larger dataset.
2. Train on different resolutions even for the same aspect ratio.
3. Train on specific aspect ratios, instead of a range of aspect ratios.


# Table of Contents

- [Model Card for flex-diffusion-2-1](#model-card-for-flex-diffusion-2-1)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Uses](#uses)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
    - [Preprocessing](#preprocessing)
    - [Speeds, Sizes, Times](#speeds-sizes-times)
  - [Results](#results)
- [Model Card Authors](#model-card-authors)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)


# Model Details

## Model Description

stable-diffusion-2-1 (stabilityai/stable-diffusion-2-1) finetuned for dynamic aspect ratios.

finetuned resolutions:

| | width | height | aspect ratio |
|---:|--------:|---------:|:---------------|
| 0 | 512 | 1024 | 1:2 |
| 1 | 576 | 1024 | 9:16 |
| 2 | 576 | 960 | 3:5 |
| 3 | 640 | 1024 | 5:8 |
| 4 | 512 | 768 | 2:3 |
| 5 | 640 | 896 | 5:7 |
| 6 | 576 | 768 | 3:4 |
| 7 | 512 | 640 | 4:5 |
| 8 | 640 | 768 | 5:6 |
| 9 | 640 | 704 | 10:11 |
| 10 | 512 | 512 | 1:1 |
| 11 | 704 | 640 | 11:10 |
| 12 | 768 | 640 | 6:5 |
| 13 | 640 | 512 | 5:4 |
| 14 | 768 | 576 | 4:3 |
| 15 | 896 | 640 | 7:5 |
| 16 | 768 | 512 | 3:2 |
| 17 | 1024 | 640 | 8:5 |
| 18 | 960 | 576 | 5:3 |
| 19 | 1024 | 576 | 16:9 |
| 20 | 1024 | 512 | 2:1 |

- **Developed by:** Jonathan Chang
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **License:** CreativeML Open RAIL++-M
- **Parent Model:** https://huggingface.co/stabilityai/stable-diffusion-2-1
- **Resources for more information:** More information needed

# Uses

- see https://huggingface.co/stabilityai/stable-diffusion-2-1


# Training Details

## Training Data

- LAION Aesthetics dataset, the subset with an aesthetics rating of 6+
  - https://laion.ai/blog/laion-aesthetics/
  - https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
  - I only used a small portion of it; see [Preprocessing](#preprocessing)

- most common aspect ratios in the dataset (before preprocessing)

| | aspect_ratio | counts |
|---:|:---------------|---------:|
| 0 | 1:1 | 154727 |
| 1 | 3:2 | 119615 |
| 2 | 2:3 | 61197 |
| 3 | 4:3 | 52276 |
| 4 | 16:9 | 38862 |
| 5 | 400:267 | 21893 |
| 6 | 3:4 | 16893 |
| 7 | 8:5 | 16258 |
| 8 | 4:5 | 15684 |
| 9 | 6:5 | 12228 |
| 10 | 1000:667 | 12097 |
| 11 | 2:1 | 11006 |
| 12 | 800:533 | 10259 |
| 13 | 5:4 | 9753 |
| 14 | 500:333 | 9700 |
| 15 | 250:167 | 9114 |
| 16 | 5:3 | 8460 |
| 17 | 200:133 | 7832 |
| 18 | 1024:683 | 7176 |
| 19 | 11:10 | 6470 |

- predefined aspect ratios

| | width | height | aspect ratio |
|---:|--------:|---------:|:---------------|
| 0 | 512 | 1024 | 1:2 |
| 1 | 576 | 1024 | 9:16 |
| 2 | 576 | 960 | 3:5 |
| 3 | 640 | 1024 | 5:8 |
| 4 | 512 | 768 | 2:3 |
| 5 | 640 | 896 | 5:7 |
| 6 | 576 | 768 | 3:4 |
| 7 | 512 | 640 | 4:5 |
| 8 | 640 | 768 | 5:6 |
| 9 | 640 | 704 | 10:11 |
| 10 | 512 | 512 | 1:1 |
| 11 | 704 | 640 | 11:10 |
| 12 | 768 | 640 | 6:5 |
| 13 | 640 | 512 | 5:4 |
| 14 | 768 | 576 | 4:3 |
| 15 | 896 | 640 | 7:5 |
| 16 | 768 | 512 | 3:2 |
| 17 | 1024 | 640 | 8:5 |
| 18 | 960 | 576 | 5:3 |
| 19 | 1024 | 576 | 16:9 |
| 20 | 1024 | 512 | 2:1 |


## Training Procedure

### Preprocessing

1. Download files with url & caption from https://huggingface.co/datasets/ChristophSchuhmann/improved_aesthetics_6plus
   - I only used the first file, `train-00000-of-00007-29aec9150af50f9f.parquet`
2. Use img2dataset to convert to webdataset
   - https://github.com/rom1504/img2dataset
   - I put `train-00000-of-00007-29aec9150af50f9f.parquet` in a folder called `first-file`
   - the output folder is `/mnt/aesthetics6plus`; change this to your own folder

```bash
# input folder with the parquet file; output folder for the webdataset shards
INPUT_FOLDER=first-file
OUTPUT_FOLDER=/mnt/aesthetics6plus
img2dataset --url_list $INPUT_FOLDER --input_format "parquet"\
 --url_col "URL" --caption_col "TEXT" --output_format webdataset\
 --output_folder $OUTPUT_FOLDER --processes_count 3 --thread_count 6 --image_size 1024 --resize_only_if_bigger --resize_mode=keep_ratio_largest \
 --save_additional_columns '["WIDTH","HEIGHT","punsafe","similarity"]' --enable_wandb True
```

3. The data-loading code does preprocessing on the fly, so no further steps are needed. It is not optimized for speed, though (GPU utilization fluctuates between 80% and 100%), and it was not written for multi-GPU training, so use it with caution. The code does the following (a minimal sketch of this logic follows the list):
   - use webdataset to load the data
   - calculate the aspect ratio of each image
   - find the closest aspect ratio and its associated resolution among the predefined resolutions: `argmin(abs(aspect_ratio - predefined_aspect_ratios))`. E.g. if the aspect ratio is 1:3, the closest predefined aspect ratio is 1:2, and its associated resolution is 512x1024.
   - keeping the aspect ratio, resize the image so that each side is larger than or equal to the associated resolution. E.g. resize to 512x(512*3) = 512x1536
   - random crop the image to the associated resolution. E.g. crop to 512x1024
   - if more than 10% of the image is lost in the cropping, discard this example
   - batch examples by aspect ratio, so all examples in a batch have the same aspect ratio
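
For illustration, here is a minimal sketch of that bucketing logic. This is not the actual training code; `BUCKETS` is the predefined resolution table from this card, and the function names are hypothetical:

```python
import math
import random

from PIL import Image

# predefined (width, height) buckets from the resolution table above
BUCKETS = [
    (512, 1024), (576, 1024), (576, 960), (640, 1024), (512, 768),
    (640, 896), (576, 768), (512, 640), (640, 768), (640, 704),
    (512, 512), (704, 640), (768, 640), (640, 512), (768, 576),
    (896, 640), (768, 512), (1024, 640), (960, 576), (1024, 576),
    (1024, 512),
]

def closest_bucket(width, height):
    # argmin(abs(aspect_ratio - predefined_aspect_ratios))
    ar = width / height
    return min(BUCKETS, key=lambda wh: abs(ar - wh[0] / wh[1]))

def bucket_resize_crop(img, max_loss=0.10):
    # returns a crop of `img` at the bucket resolution, or None to discard
    bw, bh = closest_bucket(*img.size)
    # resize, keeping the aspect ratio, so both sides cover the bucket
    scale = max(bw / img.width, bh / img.height)
    rw, rh = math.ceil(img.width * scale), math.ceil(img.height * scale)
    # discard the example if cropping would lose more than 10% of the image
    if 1 - (bw * bh) / (rw * rh) > max_loss:
        return None
    img = img.resize((rw, rh), Image.LANCZOS)
    left, top = random.randint(0, rw - bw), random.randint(0, rh - bh)
    return img.crop((left, top, left + bw, top + bh))
```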

### Speeds, Sizes, Times

- Dataset size: 100k image-caption pairs, before filtering.
  - I didn't wait for the whole dataset to download; I copied the first 10 tar files and their index files to a new folder called `aesthetics6plus-small`, with 100k image-caption pairs in total. The full dataset is a lot bigger.
- Hardware: 1 RTX 3090 GPU
- Optimizer: 8-bit Adam (see the sketch after this list)
- Batch size: 32
  - actual batch size: 2
  - gradient_accumulation_steps: 16
  - effective batch size: 32
- Learning rate: warmup to 2e-6 over 500 steps, then kept constant at 2e-6
- Training steps: 6k
- Epoch size (approximate): 32 * 6k / 100k = 1.92 (not accounting for the filtering)
  - Each example is seen 1.92 times on average.
- Training time: approximately 1 day
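
The training script is not included in this repo, but an optimizer and schedule setup matching the numbers above might look like the following sketch (assuming `bitsandbytes` for 8-bit Adam and the `transformers` scheduler helper; `unet` stands for the model being finetuned):

```python
import bitsandbytes as bnb
from transformers import get_constant_schedule_with_warmup

# 8-bit Adam at lr=2e-6, warmed up over 500 steps and then held constant
optimizer = bnb.optim.Adam8bit(unet.parameters(), lr=2e-6)
lr_scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=500)

# effective batch size 32 = actual batch size 2 * 16 accumulation steps
gradient_accumulation_steps = 16
```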

## Results

More information needed

# Model Card Authors

Jonathan Chang

# How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler, UNet2DConditionModel

def use_DPM_solver(pipe):
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    return pipe

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-1/unet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
)
# for v2-base, use the following lines instead
# pipe = StableDiffusionPipeline.from_pretrained(
#     "stabilityai/stable-diffusion-2-base",
#     unet=UNet2DConditionModel.from_pretrained("ttj/flex-diffusion-2-1", subfolder="2-base/unet", torch_dtype=torch.float16),
#     torch_dtype=torch.float16)
pipe = use_DPM_solver(pipe).to("cuda")

prompt = "a professional photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")
```
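
Since the model was finetuned on multiple aspect ratios, you can pass any width/height pair from the resolution table above when calling the pipeline, e.g.:

```python
# generate at a non-square finetuned resolution, e.g. the 3:2 bucket (768x512)
image = pipe(prompt, width=768, height=512).images[0]
image.save("astronaut_rides_horse_3x2.png")
```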