pytesser - issue #23

How to get whitelist to work with pytesseract


Posted on Jun 9, 2015 by Helpful Panda

What steps will reproduce the problem?

I'm trying to set a character whitelist for Tesseract using code like the following:

import tesseract
ocr = tesseract.TessBaseAPI()
ocr.SetVariable("tessedit_char_whitelist", "0123456789;")
ocr.SetPageSegMode(tesseract.PSM_AUTO)
ocr.Init("C:\\Program Files (x86)\\Tesseract-OCR\\", "eng", tesseract.OEM_DEFAULT)

What is the expected output? What do you see instead?

The expected output is that only the characters "0123456789;" are recognized when calling image_to_string(). With code like the above, image_to_string() ignores the whitelist and returns whatever characters it finds.

What version of the product are you using? On what operating system?

pytesseract-0.1, Python 2.7, Windows 8.1

Please provide any additional information below.

I've tried the approaches people use with Tesseract-OCR directly, but none of them work with pytesseract. I haven't been able to find any way to apply a whitelist through image_to_string(), which would greatly improve the accuracy of the function.

Thanks in advance for any help on the matter.
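
For reference, here is a rough sketch of one possible approach, not a confirmed fix. pytesseract runs the tesseract command-line program rather than the TessBaseAPI bindings, so variables set on a separate TessBaseAPI object would not be expected to reach image_to_string(). The sketch assumes a pytesseract release whose image_to_string() accepts a config argument (this may not be true of 0.1) and a Tesseract build that honours "-c VAR=VALUE" on the command line. The helper names whitelist_config and read_whitelisted are made up for illustration.

```python
def whitelist_config(chars):
    """Build a Tesseract command-line option that restricts recognition
    to the given characters (hypothetical helper, not part of pytesseract)."""
    if not chars or any(c.isspace() for c in chars):
        # Whitespace would split the option into separate arguments.
        raise ValueError("whitelist must be non-empty and contain no whitespace")
    return "-c tessedit_char_whitelist=" + chars


def read_whitelisted(image_path, chars="0123456789;"):
    """Run OCR on image_path with a character whitelist.

    Assumes pytesseract and Pillow are installed and that this
    pytesseract version supports the `config` keyword.
    """
    # Imported here so whitelist_config() can be used without these packages.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    return pytesseract.image_to_string(image, config=whitelist_config(chars))
```

Calling read_whitelisted("numbers.png") (the file name is a placeholder) would pass "-c tessedit_char_whitelist=0123456789;" through to the tesseract program. Whether the whitelist is actually honoured depends on the installed Tesseract version and OCR engine mode.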

Status: New

Labels:
Type-Defect Priority-Medium