czczup committed
Commit 73c0b81
Parent: 9db32d9

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +11 -19
  2. modeling_intern_vit.py +6 -12
README.md CHANGED
@@ -247,7 +247,7 @@ tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast

# set the max number of tiles in `max_num`
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
- generation_config = dict(max_new_tokens=1024, do_sample=False)
+ generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation (纯文本对话)
question = 'Hello, who are you?'
@@ -399,7 +399,7 @@ for new_text in streamer:

## Finetune

- SWIFT from ModelScope community has supported the fine-tuning (Image/Video) of InternVL, please check [this link](https://github.com/modelscope/swift/blob/main/docs/source_en/Multi-Modal/internvl-best-practice.md) for more details.
+ Many repositories now support fine-tuning of the InternVL series models, including [InternVL](https://github.com/OpenGVLab/InternVL), [SWIFT](https://github.com/modelscope/ms-swift), [XTurner](https://github.com/InternLM/xtuner), and others. Please refer to their documentation for more details on fine-tuning.

## Deployment

@@ -408,7 +408,7 @@ SWIFT from ModelScope community has supported the fine-tuning (Image/Video) of I
LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams.

```sh
- pip install lmdeploy
+ pip install lmdeploy==0.5.3
```

LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
@@ -416,14 +416,12 @@ LMDeploy abstracts the complex inference process of multi-modal Vision-Language
#### A 'Hello, world' example

```python
- from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL-Chat-V1-5'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
- chat_template_config = ChatTemplateConfig('internvl-internlm2')
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=TurbomindEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))
response = pipe(('describe this image', image))
print(response.text)
```
@@ -437,14 +435,12 @@ When dealing with multiple images, you can put them all in one list. Keep in min
> Warning: Due to the scarcity of multi-image conversation data, the performance on multi-image tasks may be unstable, and it may require multiple attempts to achieve satisfactory results.

```python
- from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN

model = 'OpenGVLab/InternVL-Chat-V1-5'
- chat_template_config = ChatTemplateConfig('internvl-internlm2')
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=TurbomindEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image_urls=[
    'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg',
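The hunk ends mid-snippet. In the full model card, this multi-image example typically continues along the following lines; the prompt wording and image numbering below are illustrative and not part of this commit:

```python
images = [load_image(img_url) for img_url in image_urls]
# Numbering the images with IMAGE_TOKEN lets the model refer to each one by index
response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
print(response.text)
```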
@@ -462,13 +458,11 @@ print(response.text)
Conducting inference with batch prompts is quite straightforward; just place them within a list structure:

```python
- from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL-Chat-V1-5'
- chat_template_config = ChatTemplateConfig('internvl-internlm2')
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=TurbomindEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image_urls=[
    "https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg",
@@ -484,13 +478,11 @@ print(response)
There are two ways to do the multi-turn conversations with the pipeline. One is to construct messages according to the format of OpenAI and use above introduced method, the other is to use the `pipeline.chat` interface.

```python
- from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig, GenerationConfig
+ from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig
from lmdeploy.vl import load_image

model = 'OpenGVLab/InternVL-Chat-V1-5'
- chat_template_config = ChatTemplateConfig('internvl-internlm2')
- pipe = pipeline(model, chat_template_config=chat_template_config,
-                 backend_config=TurbomindEngineConfig(session_len=8192))
+ pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/demo/resources/human-pose.jpg')
gen_config = GenerationConfig(top_k=40, top_p=0.8, temperature=0.8)
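The rest of the multi-turn example lies outside the hunk. As a sketch of how the `pipeline.chat` interface carries conversation state in a session object (the follow-up question is illustrative only):

```python
sess = pipe.chat(('describe this image', image), gen_config=gen_config)
print(sess.response.text)
# Pass the previous session back in to continue the same conversation
sess = pipe.chat('What is the woman doing?', session=sess, gen_config=gen_config)
print(sess.response.text)
```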
 
modeling_intern_vit.py CHANGED
@@ -20,18 +20,12 @@ from transformers.utils import logging
from .configuration_intern_vit import InternVisionConfig

try:
-     try:  # v1
-         from flash_attn.flash_attn_interface import \
-             flash_attn_unpadded_qkvpacked_func
-     except:  # v2
-         from flash_attn.flash_attn_interface import \
-             flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
-
    from flash_attn.bert_padding import pad_input, unpad_input
-
+     from flash_attn.flash_attn_interface import \
+         flash_attn_varlen_qkvpacked_func
    has_flash_attn = True
except:
-     print('FlashAttention is not installed.')
+     print('FlashAttention2 is not installed.')
    has_flash_attn = False

logger = logging.get_logger(__name__)
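For context, flash-attn v1 exposed this kernel as `flash_attn_unpadded_qkvpacked_func`, which v2 renamed to `flash_attn_varlen_qkvpacked_func`; this commit drops the v1 fallback and imports the v2 name directly. If support for both versions were ever needed again, a minimal guard might look like the following sketch (not part of the commit):

```python
try:
    # flash-attn >= 2.x exposes the qkv-packed varlen kernel under this name
    from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func
except ImportError:
    # flash-attn 1.x used the older name for the same kernel
    from flash_attn.flash_attn_interface import \
        flash_attn_unpadded_qkvpacked_func as flash_attn_varlen_qkvpacked_func
```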
@@ -74,7 +68,7 @@ class FlashAttention(nn.Module):
max_s = seqlen
cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
                          device=qkv.device)
- output = flash_attn_unpadded_qkvpacked_func(
+ output = flash_attn_varlen_qkvpacked_func(
    qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
    softmax_scale=self.softmax_scale, causal=causal
)
@@ -84,7 +78,7 @@ class FlashAttention(nn.Module):
x = rearrange(qkv, 'b s three h d -> b s (three h d)')
x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
- output_unpad = flash_attn_unpadded_qkvpacked_func(
+ output_unpad = flash_attn_varlen_qkvpacked_func(
    x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
    softmax_scale=self.softmax_scale, causal=causal
)
@@ -93,7 +87,7 @@ class FlashAttention(nn.Module):
    'b s (h d) -> b s h d', h=nheads)
else:
    assert max_s is not None
-     output = flash_attn_unpadded_qkvpacked_func(
+     output = flash_attn_varlen_qkvpacked_func(
        qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
        softmax_scale=self.softmax_scale, causal=causal
    )
 