---
library_name: transformers
license: apache-2.0
---

## Using Caduceus
To load the pre-trained model for masked language modeling, use the following snippet:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# See the `Caduceus` collection page on the hub for a list of available models.
model_name = "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
```
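
As a quick sanity check, the loaded model can be run directly on a short DNA string. This is a minimal sketch: the toy sequence below is illustrative, and it assumes the tokenizer accepts raw nucleotide strings (Caduceus tokenizes at the character level).
```python
import torch

# Toy input; the tokenizer operates at the nucleotide (character) level.
sequence = "ACGTACGTACGT"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-position logits over the token vocabulary: (batch, seq_len, vocab_size).
print(outputs.logits.shape)
```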

Alternatively, you can instantiate a model from scratch to train on your own data as follows:
```python
from transformers import AutoConfig, AutoModelForMaskedLM

# Add any config overrides here; see the `config.json` file on the hub for details.
config_overrides = {}
# See the `Caduceus` collection page on the hub for a list of available models.
config = AutoConfig.from_pretrained(
    "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16",
    **config_overrides,
)
model = AutoModelForMaskedLM.from_config(config)
```
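
For example, a smaller variant for quick experimentation could be configured by overriding a couple of fields. The field names below (`d_model`, `n_layer`) are inferred from the checkpoint name and should be checked against the `config.json` on the hub:
```python
# Hypothetical overrides for a smaller model; verify these keys against config.json.
config_overrides = {"d_model": 128, "n_layer": 4}
```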

## Model Details

This is the Caduceus-Ph model with hidden dimension 256 and 16 MambaDNA layers.
This model is not inherently reverse complement (RC) equivariant.
Rather, it was pre-trained using RC data augmentation.
Its intended usage is as follows: for downstream tasks, the model should be trained with RC data augmentation.
At inference time on the downstream task, the model should be run twice, once on a sequence and once on its RC.
The outputs of these two passes should be combined (averaged) to form the downstream task prediction, as sketched below.
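
The two-pass protocol can be sketched as follows. This is a minimal illustration, not the fine-tuning recipe from the paper: `reverse_complement` is a helper defined here, and the mean-pooled masked-LM logits in `predict` stand in for whatever downstream head and pooling a given task uses.
```python
import torch

def reverse_complement(seq: str) -> str:
    # String-level reverse complement, assuming an A/C/G/T alphabet.
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))

def predict(seq: str) -> torch.Tensor:
    # Placeholder for a fine-tuned downstream model; here we simply return
    # mean-pooled masked-LM logits to illustrate the two-pass protocol.
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.mean(dim=1)

sequence = "ACGTACGTACGT"  # hypothetical example sequence
# Average the prediction on the sequence and on its reverse complement.
prediction = (predict(sequence) + predict(reverse_complement(sequence))) / 2
```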

This model was pre-trained on the human reference genome with sequence length 131,072 for 50k steps (each step contained ~1M base pairs / tokens).

For more details, please see our paper: [Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling](https://arxiv.org/abs/2403.03234).

## Citation

Please cite our work using the BibTeX below:

**BibTeX:**
```
@article{schiff2024caduceus,
  title={Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling},
  author={Schiff, Yair and Kao, Chia-Hsiang and Gokaslan, Aaron and Dao, Tri and Gu, Albert and Kuleshov, Volodymyr},
  journal={arXiv preprint arXiv:2403.03234},
  year={2024}
}
```

## Model Card Contact

Yair Schiff ([email protected])