Manli committed
Commit 7d0f9ec
1 Parent(s): ff00ecb

update readme

Files changed (1): README.md (+11 / -12)
README.md CHANGED
@@ -7,9 +7,7 @@ pipeline_tag: image-text-to-text
 
 
 # Model description
-We are excited to announce the continuation and rebranding of our **BLIP series** into **XGen-MM**, to be better aligned with Salesforce's unified XGen initiative for large foundation models! This rebranding marks a significant step in our ongoing development of cutting-edge multimodal technologies.
-
-`XGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.
+`xGen-MM` is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the `BLIP` series, incorporating fundamental enhancements that ensure a more robust and superior foundation. These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.
 
 In the v1.5 (08/2024) release, we present a series of XGen-MM models including:
 - [🤗 xGen-MM-base](https://huggingface.co/Salesforce/xgen-mm-phi3-mini-base-r-v1.5): `xgen-mm-phi3-mini-base-r-v1.5`
@@ -23,7 +21,7 @@ In addition to the models, we are also releasing a series of datasets for multi-
 - [🤗 BLIP3-GROUNDING-50M](https://huggingface.co/datasets/Salesforce/blip3-grounding-50m): a dataset for enhancing the ability to ground semantic concepts in images.
 - BLIP3-KALE-300M (stay tuned): a large-scale curated high-quality caption dataset.
 
-For more details, check out our [tech report]() and project page (coming soon).
+For more details, check out our [tech report](https://arxiv.org/pdf/2408.08872) and project page (coming soon).
 
 # Data
 The base model is pre-trained on a mixture of data sources described above, with around 100 billion image-text tokens in total.
@@ -61,7 +59,7 @@ Below are some qualitative examples below of the mutli-modal in-context learning
 
 # How to use
 
-Please check out our [inference notebook](demo.ipynb) for example code to use our model. We also provide example script for [batch inference](batch_inference.ipynb).
+Please check out our [inference notebook](demo.ipynb) for example code to use our model.
 
 # Reproducibility:
 
@@ -77,7 +75,7 @@ We strongly recommend users assess safety and fairness before applying to downst
 
 # License
 
-Our code and weights are released under the Creative Commons Attribution Non Commercial 4.0 [LICENSE](LICENSE.txt). Please fill out a form at [here](https://forms.gle/ffPc9oZC2ZGeJ1N68) to consult the commercial use of model weights.
+Our code and weights are released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) license.
 
 # Code acknowledgement
 Our training code is based on [OpenFlamingo: An open-source framework for training large multimodal models.](https://github.com/mlfoundations/open_flamingo), and part of our data preprocessing code is adapted from [LLaVA](https://github.com/haotian-liu/LLaVA).
@@ -88,15 +86,16 @@ We thank the authors for their open-source implementations.
 
 # Citation
 ```
-@misc{xgen_mm_phi3_mini,
-    title={xgen-mm-phi3-mini-base Model Card},
-    url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1},
-    author={Salesforce AI Research},
-    month={May},
-    year={2024}
+@article{blip3-xgenmm,
+  author = {Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu},
+  title = {xGen-MM(BLIP-3): A Family of Open Large Multimodal Models},
+  journal = {arXiv preprint},
+  month = {August},
+  year = {2024},
 }
 ```
 
+
 # Troubleshoot
 
 1. If you missed any packages, please consider the following
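
Since the "How to use" section in the diff above now points only at the inference notebook, here is a minimal loading sketch for the base checkpoint named in the release list. It is an assumption that the checkpoint loads through the standard `transformers` auto classes with `trust_remote_code=True` (the usual pattern for models that ship custom modeling code); prompt formatting, image preprocessing, and generation are model-specific, so [demo.ipynb](demo.ipynb) remains the authoritative reference.

```python
# Hedged sketch, not the official example: see demo.ipynb for the supported usage.
from transformers import AutoImageProcessor, AutoModelForVision2Seq, AutoTokenizer

# Base checkpoint from the v1.5 release listed in the README.
model_id = "Salesforce/xgen-mm-phi3-mini-base-r-v1.5"

# trust_remote_code=True is assumed to be needed because the repo ships custom modeling code.
model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

# From here, the prompt template and the generate() inputs are model-specific;
# follow the notebook for how images and text are combined before generation.
```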