Context Size of RNNT variants of Parakeet

#1
by puvvadasaikiran - opened

Are there any particular bounds on the context size of Parakeet, like its RNNT counterparts (e.g. Meta's Emformer RNNT)?

NVIDIA org

Parakeet uses a Fast Conformer encoder. In limited context mode, we've run inference on audio samples of up to 8 hours (synthetically concatenated) in a single forward pass on a single 80 GB A100.

Hi @smajumdar94 ,

Awesome work with the Parakeet and Canary models. In fact, awesome work with NeMo ASR in general; it's been invaluable.

I had a couple of quick questions I was hoping I could get your help on:

I checked the config of this model, and it shows that it is using the regular att_context_style with unlimited context size ([-1, -1]); see below:

```
{'_target_': 'nemo.collections.asr.modules.ConformerEncoder',
 'feat_in': 80, 'feat_out': -1,
 'n_layers': 24, 'd_model': 1024,
 'subsampling': 'dw_striding', 'subsampling_factor': 8,
 'subsampling_conv_channels': 256, 'causal_downsampling': False,
 'reduction': None, 'reduction_position': None, 'reduction_factor': 1,
 'ff_expansion_factor': 4,
 'self_attention_model': 'rel_pos', 'n_heads': 8,
 'att_context_size': [-1, -1], 'att_context_style': 'regular',
 'xscaling': True, 'untie_biases': True, 'pos_emb_max_len': 5000,
 'conv_kernel_size': 9, 'conv_norm_type': 'batch_norm',
 'conv_context_size': None,
 'dropout': 0.1, 'dropout_pre_encoder': 0.1,
 'dropout_emb': 0.0, 'dropout_att': 0.1}
```

Am I misunderstanding what you mean by this model using limited context mode? Is limited context mode not the same as the streaming cache-aware family of models, i.e. using a limited context size with chunk-limited style: 'att_context_size': [-70, 13], 'att_context_style': 'chunk-limited'?
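To make the chunk-limited numbers above concrete, here is a rough back-of-the-envelope sketch of how many seconds of audio each side of the attention window covers. It assumes the standard 10 ms mel-spectrogram hop and the subsampling_factor of 8 from the posted config; `context_seconds` is a hypothetical helper for illustration, not a NeMo API.

```python
FEATURE_HOP_SEC = 0.01   # 10 ms per mel-spectrogram frame (assumption)
SUBSAMPLING_FACTOR = 8   # from the posted config

def context_seconds(att_context_size):
    """Convert [left, right] attention context (in post-subsampling encoder
    frames) into seconds of audio. -1 means unlimited, returned as None.
    Left context is sometimes written with a negative sign in configs,
    so we take the absolute value."""
    def frames_to_sec(n):
        if n == -1:
            return None  # unlimited context
        return abs(n) * SUBSAMPLING_FACTOR * FEATURE_HOP_SEC
    left, right = att_context_size
    return frames_to_sec(left), frames_to_sec(right)

# The chunk-limited example from the question: 70 left frames, 13 right frames
left_sec, right_sec = context_seconds([-70, 13])
print(left_sec, right_sec)  # roughly 5.6 s left, about 1.04 s right

# This model's config: regular style with unlimited context on both sides
print(context_seconds([-1, -1]))  # (None, None)
```

So a [-70, 13] chunk-limited window corresponds to only a few seconds of audio, whereas [-1, -1] lets every frame attend to the whole utterance.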

Also, do you have smaller versions of the parakeet-tdt models that you could share, e.g. parakeet-tdt-0.6b and parakeet-tdt-0.1b (the XL and L variants)? These would be a great resource for those with limited compute.

Thanks again for all the work!
