Isaak Carter Augustus committed on
Commit adb2018
1 Parent(s): bf0fed4

Update README.md

Files changed (1):
  1. README.md +705 -1
README.md CHANGED
@@ -105,4 +105,708 @@ I welcome contributions from you! To contribute to J.O.S.I.E., please fork t

 ## License

- J.O.S.I.E. is licensed under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more details.
+ J.O.S.I.E. is licensed under the Apache 2.0 License. See the [LICENSE](LICENSE) file for more details.
+
+ # Big Updates!
+
+ I have finally trained the vision and audio encoder part. Big thanks to Facebook Research for the ImageBind model, which is what I built on top of.
+
+ What I did was copy the weights from the original ImageBind model into a second, 'downcycled' ImageBindVisionAudioHuge model (sketched below).
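+
+ As an illustration only, here is a minimal sketch of that kind of weight copy, assuming a local `ImageBindModelAudioVision` class and a hypothetical checkpoint path (neither name is confirmed by this repo):
+
+ ```python
+ import torch
+
+ from models import ImageBindModelAudioVision  # assumption: local model definition
+
+ def downcycle_from_imagebind(ckpt_path: str) -> ImageBindModelAudioVision:
+     """Copy every shape-compatible tensor from a full ImageBind
+     checkpoint into the smaller, fewer-block vision+audio model."""
+     target = ImageBindModelAudioVision()
+     source_sd = torch.load(ckpt_path, map_location="cpu")
+     target_sd = target.state_dict()
+     copied = {k: v for k, v in source_sd.items()
+               if k in target_sd and v.shape == target_sd[k].shape}
+     target_sd.update(copied)          # overwrite matching entries, keep the rest
+     target.load_state_dict(target_sd)
+     print(f"copied {len(copied)}/{len(target_sd)} tensors")
+     return target
+ ```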
+ After that, I continued training the model on a custom vision and audio dataset, using the contrastive learning algorithm introduced by Google with PaliGemma, against the text embeddings from the original ImageBind model (see the loss sketch below).
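+
+ PaliGemma's vision encoder was pretrained with the SigLIP sigmoid contrastive loss; assuming that is the algorithm meant here, a minimal sketch of it, with the ImageBind text embeddings kept frozen, might look like this (all names and defaults are illustrative):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def sigmoid_contrastive_loss(media_emb, text_emb, temperature=10.0, bias=-10.0):
+     """SigLIP-style loss: each (media, text) pair is an independent binary
+     classification; matched pairs on the diagonal are the positives."""
+     media_emb = F.normalize(media_emb, dim=-1)
+     text_emb = F.normalize(text_emb, dim=-1)  # frozen ImageBind text embeddings
+     logits = media_emb @ text_emb.t() * temperature + bias
+     labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 / -1
+     return -F.logsigmoid(labels * logits).mean()
+
+ # usage sketch: only the new encoder trains, the text tower stays frozen
+ # loss = sigmoid_contrastive_loss(encoder(batch), text_tower(captions).detach())
+ ```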
+
+ After merging the encoder with the test reasoner (Qwen2-0.5B-Instruct), I got successful inference on video, image, and audio (a sketch of the merge follows).
+ I will now slowly start writing the training script, creating the new dataset, and optimizing the model and inference code a little more, and lastly train the model.
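+
+ The merge code itself is not shown in this commit; a common pattern, and only an assumption about this model, is a small projector that maps the 1024-dim encoder outputs into the reasoner's hidden size (896 for Qwen2-0.5B) and prepends them as soft tokens:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class MultimodalProjector(nn.Module):
+     """Hypothetical glue layer between the encoder and the text reasoner."""
+     def __init__(self, encoder_dim: int = 1024, llm_dim: int = 896):
+         super().__init__()
+         self.proj = nn.Sequential(
+             nn.Linear(encoder_dim, llm_dim),
+             nn.GELU(),
+             nn.Linear(llm_dim, llm_dim),
+         )
+
+     def forward(self, media_emb: torch.Tensor, token_emb: torch.Tensor) -> torch.Tensor:
+         # media_emb: (batch, n_media, encoder_dim); token_emb: (batch, seq, llm_dim)
+         soft_tokens = self.proj(media_emb)
+         return torch.cat([soft_tokens, token_emb], dim=1)  # media tokens first
+ ```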
123
+
124
+ Here are the actual model layers:
125
+
126
+ ```txt
+ ImageBindModelAudioVision(
+   (modality_preprocessors): ModuleDict(
+     (vision): RGBDTPreprocessor(
+       (cls_token): tensor((1, 1, 1280), requires_grad=True)
+       (rgbt_stem): PatchEmbedGeneric(
+         (proj): Sequential(
+           (0): PadIm2Video()
+           (1): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
+         )
+       )
+       (pos_embedding_helper): SpatioTemporalPosEmbeddingHelper(
+         (pos_embed): tensor((1, 257, 1280), requires_grad=True)
+       )
+     )
+     (audio): AudioPreprocessor(
+       (cls_token): tensor((1, 1, 768), requires_grad=True)
+       (rgbt_stem): PatchEmbedGeneric(
+         (proj): Conv2d(1, 768, kernel_size=(16, 16), stride=(10, 10), bias=False)
+         (norm_layer): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
+       )
+       (pos_embedding_helper): SpatioTemporalPosEmbeddingHelper(
+         (pos_embed): tensor((1, 229, 768), requires_grad=True)
+       )
+     )
+   )
+   (modality_trunks): ModuleDict(
+     (vision): SimpleTransformer(
+       (pre_transformer_layer): Sequential(
+         (0): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
+         (1): EinOpsRearrange()
+       )
+       (blocks): Sequential(
+         (0): BlockWithMasking(
+           (attn): MultiheadAttention(
+             (out_proj): NonDynamicallyQuantizableLinear(in_features=1280, out_features=1280, bias=True)
+           )
+           (drop_path): Identity()
+           (norm_1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
+           (mlp): Mlp(
+             (fc1): Linear(in_features=1280, out_features=5120, bias=True)
+             (act): GELU(approximate='none')
+             (fc2): Linear(in_features=5120, out_features=1280, bias=True)
+             (drop): Dropout(p=0.0, inplace=False)
+           )
+           (norm_2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
+         )
+         ... (1)-(31): 31 further BlockWithMasking blocks, identical to (0)
+       )
+       (post_transformer_layer): EinOpsRearrange()
+     )
+     (audio): SimpleTransformer(
+       (pre_transformer_layer): Sequential(
+         (0): Identity()
+         (1): EinOpsRearrange()
+       )
+       (blocks): Sequential(
+         (0): BlockWithMasking(
+           (attn): MultiheadAttention(
+             (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
+           )
+           (drop_path): Identity()
+           (norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
+           (mlp): Mlp(
+             (fc1): Linear(in_features=768, out_features=3072, bias=True)
+             (act): GELU(approximate='none')
+             (fc2): Linear(in_features=3072, out_features=768, bias=True)
+             (drop): Dropout(p=0.0, inplace=False)
+           )
+           (norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
+         )
+         ... (1)-(11): 11 further BlockWithMasking blocks, identical to (0) except
+             (drop_path): DropPath(drop_prob=0.009 ... 0.100), increasing linearly per block
+       )
+       (post_transformer_layer): EinOpsRearrange()
+     )
+   )
+   (modality_heads): ModuleDict(
+     (vision): Sequential(
+       (0): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
+       (1): SelectElement()
+       (2): Linear(in_features=1280, out_features=1024, bias=False)
+     )
+     (audio): Sequential(
+       (0): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
+       (1): SelectElement()
+       (2): Linear(in_features=768, out_features=1024, bias=False)
+     )
+   )
+   (modality_postprocessors): ModuleDict(
+     (vision): Normalize()
+     (audio): Sequential(
+       (0): Normalize()
+       (1): LearnableLogitScaling(logit_scale_init=20.0, learnable=False, max_logit_scale=100)
+     )
+   )
+ )
+ ```