Why does Chinese image OCR return garbled text?

#4
by Viking714 - opened

Hello, I recently used this model to run OCR on a Chinese image, but the output contains the wrong words. The code I used is below:

from PIL import Image
img_pil = Image.open('/kaggle/input/timuimage/timu.jpg')
image = img_pil.convert("RGB")

from transformers import LayoutXLMProcessor
processor = LayoutXLMProcessor.from_pretrained("Microsoft/layoutlmv3-base-chinese")
feature_extractor = processor.feature_extractor

# Preprocess the image to text (the feature extractor runs OCR)

encoded_inputs = feature_extractor(image)
words = encoded_inputs.words

# Concatenate the recognized words and print them

text = ""
for word in words[0]:
    text = text + word
print(text)

The output is as below:
re\1AlltTTiani|iete44si)ii"eahi|WAiL“4HNHHAilKtintteersNaaiftyUeawliditieeaHuseuay1he‘4LrLHauiiiasiliatififiaigMtiiarecuaEtaaii!t~BCpecaaOaeeiyfnaeipiesaoriyeae4raBiia4aiaei{thiulEiuaadlfh,aeaatteateeileweypakPotHsae

The image I used is from https://www.kaggle.com/datasets/viking714/timuimage; it is public, so anyone can view it.
I used the same method to OCR English images with the LayoutXLM and LayoutLMv2 models, and both worked fine.
Thank you very much.

You need to set the OCR language to Chinese + English, i.e. 'chi_sim+eng'.

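from transformers import LayoutLMv3ImageProcessor, LayoutLMv3Processor, XLMRobertaTokenizer

model_name = "microsoft/layoutlmv3-base-chinese"
# ocr_lang is passed through to Tesseract: Simplified Chinese plus English
image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name, ocr_lang='chi_sim+eng')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
processor = LayoutLMv3Processor(image_processor=image_processor, tokenizer=tokenizer, apply_ocr=True)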

Hello, I was trying to use it in the same way. But I got this error:


ValueError Traceback (most recent call last)
in <cell line: 4>()
2 image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name,ocr_lang='chi_sim+eng')
3 tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
----> 4 processor = LayoutLMv3Processor(image_processor=image_processor,tokenizer=tokenizer,apply_ocr=True)

ValueError: Received XLMRobertaTokenizer for argument tokenizer, but a ('LayoutLMv3Tokenizer', 'LayoutLMv3TokenizerFast') was expected.

What could be wrong? Thanks.

Find the source code of LayoutLMv3Processor and change
tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast")
to
tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast", "XLMRobertaTokenizer", "XLMRobertaTokenizerFast", "LayoutXLMTokenizer")

Hello, has this been solved? I followed the method above, but the output still contains only English.

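Following the earlier answers, the code below produces Chinese results. If it does not work for you, check whether your tesseract-ocr installation is missing the chi_sim.traineddata file, which is usually stored in /usr/share/tesseract-ocr/4.00/tessdata/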

from transformers import XLMRobertaTokenizer, AutoModel, AutoProcessor, LayoutLMv3ImageProcessor, LayoutLMv3Processor
model_name = "Microsoft/layoutlmv3-base-chinese"
image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name, ocr_lang='chi_sim+eng')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name)
processor = LayoutLMv3Processor(image_processor=image_processor,tokenizer=tokenizer,apply_ocr=True)
feature_extractor = processor.feature_extractor
inputs = feature_extractor(image)
inputs['words']
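To check for the chi_sim.traineddata file mentioned above, a small helper like the following could be used. It is only a sketch: the function name missing_languages is made up for illustration, and every directory other than the one quoted in this thread is an assumption that depends on your distribution and Tesseract version.

```python
import os
from pathlib import Path

# The first path is the one quoted in the thread; the others are assumptions
# and may not match your system.
DEFAULT_TESSDATA_DIRS = [
    "/usr/share/tesseract-ocr/4.00/tessdata",
    "/usr/share/tesseract-ocr/5/tessdata",
    "/usr/share/tessdata",
]


def missing_languages(ocr_lang, tessdata_dirs=None):
    """Return the codes in an ocr_lang string that have no .traineddata file.

    ocr_lang uses Tesseract's '+'-separated form, e.g. 'chi_sim+eng'.
    """
    if tessdata_dirs is None:
        tessdata_dirs = list(DEFAULT_TESSDATA_DIRS)
        # TESSDATA_PREFIX, if set, may point at the tessdata directory itself
        # or at its parent, so both are searched.
        prefix = os.environ.get("TESSDATA_PREFIX")
        if prefix:
            tessdata_dirs = [prefix, os.path.join(prefix, "tessdata")] + tessdata_dirs

    missing = []
    for code in ocr_lang.split("+"):
        found = any(
            (Path(d) / f"{code}.traineddata").is_file() for d in tessdata_dirs
        )
        if not found:
            missing.append(code)
    return missing
```

Calling missing_languages("chi_sim+eng") returns a list of the language codes for which no file was found in the searched directories. If "chi_sim" is in that list, Tesseract most likely cannot recognize Chinese, which would fit the English-only output reported above.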

I don't understand why you would modify the source code. You only need to define your own subclass, LayoutLMv3ChineseProcessor.

model_name="microsoft/layoutlmv3-base-chinese"
image_processor = LayoutLMv3ImageProcessor.from_pretrained(model_name,ocr_lang='chi_sim+eng')
tokenizer = XLMRobertaTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)

class LayoutLMv3ChineseProcessor(LayoutLMv3Processor):
    tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast", "XLMRobertaTokenizer", "XLMRobertaTokenizerFast", "LayoutXLMTokenizer")

processor = LayoutLMv3ChineseProcessor(image_processor=image_processor,tokenizer=tokenizer,apply_ocr=True)
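Why overriding a class attribute is enough can be shown with a toy model. The classes below are simplified stand-ins written for this illustration, not the transformers implementation. They only assume that the processor compares the tokenizer's class name against the names listed in tokenizer_class, which is what the error message earlier in this thread suggests.

```python
class FakeProcessor:
    """Simplified stand-in for a processor that whitelists tokenizer class names."""

    tokenizer_class = ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast")

    def __init__(self, tokenizer):
        # Collect the class names of the tokenizer and all of its parents,
        # then require at least one of them to be in the whitelist.
        names = {cls.__name__ for cls in type(tokenizer).__mro__}
        if not names & set(self.tokenizer_class):
            raise ValueError(
                f"Received {type(tokenizer).__name__} for argument tokenizer, "
                f"but a {self.tokenizer_class} was expected."
            )
        self.tokenizer = tokenizer


class XLMRobertaTokenizer:
    """Hypothetical placeholder with the same name as the real tokenizer class."""


class FakeChineseProcessor(FakeProcessor):
    # Only the whitelist changes; __init__ is inherited unchanged.
    tokenizer_class = FakeProcessor.tokenizer_class + ("XLMRobertaTokenizer",)
```

With these stand-ins, FakeProcessor(XLMRobertaTokenizer()) raises a ValueError, while FakeChineseProcessor(XLMRobertaTokenizer()) is constructed normally. The subclass gets the same effect as editing the library source without touching the installed package.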
 

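None of the answers above gets to the core of the problem. There is no need to change any code at all. Do it my way, and training and inference are done in fewer than 59 lines of code: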

tokenizer = LayoutXLMTokenizer.from_pretrained(
    "./layoutlmv3-base-chinese"
)

image_processor = LayoutLMv3ImageProcessor.from_pretrained(
    "./layoutlmv3-base-chinese", apply_ocr=False
)

processor = LayoutLMv3Processor(tokenizer=tokenizer, image_processor=image_processor, apply_ocr=False)
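With apply_ocr=False the image processor no longer runs Tesseract, so the words and their bounding boxes have to come from your own OCR step. LayoutLMv3 expects boxes scaled to a 0-1000 range. The sketch below shows that scaling only; the words, pixel boxes, and image size are made-up sample values, and the answer above does not say which OCR engine was used.

```python
def normalize_box(box, width, height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# Hypothetical output of an external OCR engine for a 200x100 pixel image.
words = ["你好", "world"]
pixel_boxes = [(10, 20, 60, 40), (70, 20, 150, 40)]
image_width, image_height = 200, 100

boxes = [normalize_box(b, image_width, image_height) for b in pixel_boxes]

# The processor would then be called with the image, words, and boxes, e.g.
# encoding = processor(image, words, boxes=boxes, return_tensors="pt")
```

Because the OCR step is fully under your control here, the quality of the Chinese text depends on the external engine you choose, not on the ocr_lang setting discussed earlier.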
