Parse Text From Images

mafrosis

Hey all,

Does anyone know of a component which could retrieve text from an image?

We have lots (about 5000) of bitmaps with text in them which we would like to strip out - every image has the text part in the same place, with the same contrast, so it seems like a real possibility to automate. Alternatively some poor fellow is going to have to type them by hand.. Fortunately not me, but I'd like to save them the task.. :)

Searching Google always brings up adding text to an image, which is why I posted here.

Cheers
mafro
 
This will be very difficult, probably very very difficult. The only chance I can think of for this to work is if the string is in a color that is completely different from any other color in the image. You could then probably use GetPixel/SetPixel to loop through the pixels of the image and, wherever one is an exact match for the color of the text, draw it somewhere else. This is dodgy at best.
 
Agreed vis, that's why I didn't dive in and try to code it myself. The images are actually screen grabs of single windows, and we are retrieving the title from the window. This means there is a good (and uniform) contrast between the two colours.

So, I could do the pixel thing you talk about, but we really need this information saved as ASCII text! The only way I can see is a text-recognition type thing, which is a whole can of worms.

Which leads me back to trying to find a component that can already do it.. Any ideas? :confused:
 
Sounds tricky. I haven't seen any components that will do it, but then I've never looked. I'll check around and get back if I come up with something.
 
A few tips here http://www.vbdotnetforums.com/showpost.php?p=23395&postcount=2
I haven't tried any of them and don't know how good they are.

Another idea, maybe too late now that you've got 5000 screenshots... If you can screenshot a particular window, you can also use Win32 methods to get the window title text directly, without going through the graphics first. Depending on how you get hold of the window, you may also be able to retrieve it with Process.MainWindowTitle. Getting text from an image is very difficult and unreliable even with well-known OCR (Optical Character Recognition) tools, so anything you can do to avoid going that route is better.
 
Given that your text is very regular it shouldn't be too bad..

Consider the menu bar as a 2d array of colours
Each item of text has pixels in a uniform colour
The background may be a gradient colour hence not fixed
Set your software up to consider just a portion of the image, i.e. perform a conceptual crop

Working rightwards across the columns, and down each column, crop the letters out according to the following algorithm. We assume your text is white:

Set a start integer to 0 - this will be your pixel "offset"
For a column of pixels, investigate them all to find out if any are white
If no white pixel is found, there is no letter here yet - increment the offset and move to the next column
If at least one white pixel is found, then move to the next column
When a column is found that is completely devoid of white pixels then it means we have reached the edge of the letter
"Crop" this letter out - the height should be fixed and we just found the width
Replace all non-white pixels with black
Convert the array to a single-dimensional string (or something)
Use this string as a key to a hashtable in which all possible letters are stored - look up which letter this string is
Continue until the next column of pixels that contains a white pixel - your new starting point
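
The column scan in the steps above might look roughly like this. It is only a sketch, written in Python for brevity rather than VB.NET, and it assumes the cropped strip is already in memory as a list of rows. The names `split_letters` and `is_text` are made up for illustration, and the demo uses single characters as pixels (':' for a white text pixel, '#' for anything else).

```python
def split_letters(rows, is_text):
    """Return (start, end) column ranges, end exclusive, one per letter."""
    width = len(rows[0]) if rows else 0
    spans = []
    start = None  # column where the current letter began, or None between letters
    for x in range(width):
        has_text = any(is_text(row[x]) for row in rows)
        if has_text and start is None:
            start = x                  # first column of a new letter
        elif not has_text and start is not None:
            spans.append((start, x))   # blank column: the letter just ended
            start = None
    if start is not None:              # letter touching the right-hand edge
        spans.append((start, width))
    return spans


# Demo strip: ':' stands for a white (text) pixel, '#' for anything else.
strip = [
    "##:::###:#",
    "#::#::#::#",
    "#:###:##:#",
    "#:###:##:#",
    "#::#::##:#",
    "##:::###:#",
]
letter_spans = split_letters(strip, lambda pixel: pixel == ":")
print(letter_spans)  # [(1, 6), (7, 9)]
```

One caveat with this approach: letters that touch each other with no blank column in between would come out as a single span, so it relies on the font leaving a gap between characters.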


For the letter O (# is a non-white pixel, : is a white pixel):
VB.NET:
0123456
##:::##
#::#::#
#:###:#
#:###:#
#::#::#
##:::##
Starting at offset 0, the column has no white pixels (it is all #). Increment the offset to 1
Column 1 contains a white pixel, move on
Columns 2, 3, 4 and 5 contain white pixels, move on past each
Column 6 is full of non-white pixels. Crop width = (6 - offset) = 5
Crop the letter out into an array:
VB.NET:
12345
#:::#
::#::
:###:
:###:
::#::
#:::#

Convert this to a string, or some other object that can serve as a key based on its content. White pixels become "1", others become "0":

VB.NET:
01110
11011
10001
10001
11011
01110
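
Joining that grid row by row gives one 30-character key. A minimal sketch of the conversion, again in Python and again with characters standing in for pixels (`letter_key` and `is_text` are illustrative names, not part of any library):

```python
def letter_key(rows, start, end, is_text):
    """Flatten columns start..end-1 of every row into a string of 1s and 0s."""
    return "".join(
        "1" if is_text(pixel) else "0"
        for row in rows
        for pixel in row[start:end]
    )


# The letter O, with ':' as a white (text) pixel and '#' as anything else.
letter_o = [
    "##:::##",
    "#::#::#",
    "#:###:#",
    "#:###:#",
    "#::#::#",
    "##:::##",
]
key = letter_key(letter_o, 1, 6, lambda pixel: pixel == ":")
print(key)  # 011101101110001100011101101110
```

Because every non-text pixel collapses to "0", a gradient background does not affect the key, which is the point of the "replace all non-white pixels with black" step.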

As a single string:
"011101101110001100011101101110"

Now look this up:
VB.NET:
stringbuilder.Append(charactersHashTable("011101101110001100011101101110"))

Move on..

You could modify the program to be self-learning. If the "011101101110001100011101101110" string is not found in the hashtable, show the cropped image to the user and ask "what is this letter/number?" - whatever they enter, stick it in the hashtable (and save it upon program exit).

Ahh... it should take maybe a day to write software like this.. with luck! :)
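
The self-learning lookup could be sketched like this. It is an illustration only: the `LetterTable` class, the JSON file and the `ask_user` callback are all assumptions of mine (in a real tool the callback would show the cropped image and read a keystroke), and the table is saved explicitly rather than on program exit.

```python
import json
import os
import tempfile


class LetterTable:
    """Dictionary of key string -> character, persisted as JSON."""

    def __init__(self, path):
        self.path = path
        self.letters = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                self.letters = json.load(handle)

    def lookup(self, key, ask_user):
        """Return the character for key, asking the user once if it is unknown."""
        if key not in self.letters:
            self.letters[key] = ask_user(key)
        return self.letters[key]

    def save(self):
        with open(self.path, "w", encoding="utf-8") as handle:
            json.dump(self.letters, handle)


# Demo: the "user" is a stand-in function so the example runs unattended.
questions = []

def fake_user(key):
    questions.append(key)
    return "O"

table_path = os.path.join(tempfile.mkdtemp(), "letters.json")
table = LetterTable(table_path)
o_key = "011101101110001100011101101110"
first = table.lookup(o_key, fake_user)    # unknown key: the user is asked
second = table.lookup(o_key, fake_user)   # known key: no question this time
table.save()
reloaded = LetterTable(table_path).lookup(o_key, fake_user)
```

With a fixed font such as a window title bar, the table should fill up after a few dozen questions, and the rest of the images can then run unattended.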
 
This means there is a good (and uniform) contrast between the two colours.

Of course, GetPixel/SetPixel deal in exact values (a Color, as from Color.FromArgb, in VB.NET terms, or an integer in Win32 terms), so any deviation will cause a pixel not to match. You may have to combine this with the Graphics.GetNearestColor method. A day to write this, yes; a fortnight to get it working!

The other thing to remember is that on 5000 images this is going to be really quite slow. GetPixel/SetPixel are not the most efficient of APIs; DIBSections may be a better way to go.
 
I thought bitmap images were just a simple binary representation of the colours involved?
i.e. it's a huge byte array in serial form.. maybe with some header guff. I wasn't planning on reading the image as an image, more like skipping the header and reading the image bytes into a 2D byte array. Sorry if that wasn't clear!
 
The problem of the colors is still very apparent..

RGB(255,255,254)

and

RGB(255,255,255)

Look exactly the same, but as I said, the match has to be exact. Whether you read pixel colors from the image directly or from the bytes that make up the image, if the color deviates by even the slightest fraction it won't work.
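
One way round the exact-match problem is to compare each channel within a tolerance instead of testing for equality. This is a different technique from GetNearestColor, and the tolerance of 8 below is an arbitrary value picked for the sketch, not something measured from the real screenshots. `is_near` is a made-up helper name.

```python
def is_near(pixel, target, tolerance=8):
    """True if every channel of pixel is within tolerance of target."""
    return all(abs(p - t) <= tolerance for p, t in zip(pixel, target))


WHITE = (255, 255, 255)
almost_white = is_near((255, 255, 254), WHITE)   # off by one in blue
mid_grey = is_near((128, 128, 128), WHITE)       # nowhere near white
print(almost_white, mid_grey)  # True False
```

A predicate like this can be used as the "is this pixel white?" test in the column scan, so a slightly off-white text pixel still counts as text.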
 
Here is the hex content of a 10x10 bitmap image of a grey 10x10 box on a white background:

VB.NET:
00000000h: 42 4D 76 01 00 00 00 00 00 00 36 00 00 00 28 00 ; BMv.......6...(.
00000010h: 00 00 0A 00 00 00 0A 00 00 00 01 00 18 00 00 00 ; ................
00000020h: 00 00 40 01 00 00 00 00 00 00 00 00 00 00 00 00 ; ..@.............
00000030h: 00 00 00 00 00 00 80 80 80 80 80 80 80 80 80 80 ; ......€€€€€€€€€€
00000040h: 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 ; €€€€€€€€€€€€€€€€
00000050h: 80 80 80 80 00 00 80 80 80 FF FF FF FF FF FF FF ; €€€€..€€€ÿÿÿÿÿÿÿ
00000060h: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ; ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
00000070h: FF 80 80 80 00 00 80 80 80 FF FF FF FF FF FF FF ; ÿ€€€..€€€ÿÿÿÿÿÿÿ
00000080h: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ; ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
00000090h: FF 80 80 80 00 00 80 80 80 FF FF FF FF FF FF FF ; ÿ€€€..€€€ÿÿÿÿÿÿÿ
000000a0h: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ; ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
000000b0h: FF 80 80 80 00 00 80 80 80 FF FF FF FF FF FF FF ; ÿ€€€..€€€ÿÿÿÿÿÿÿ
000000c0h: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ; ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
000000d0h: FF 80 80 80 00 00 80 80 80 FF FF FF FF FF FF FF ; ÿ€€€..€€€ÿÿÿÿÿÿÿ
000000e0h: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ; ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
000000f0h: FF 80 80 80 00 00 80 80 80 FF FF FF FF FF FF FF ; ÿ€€€..€€€ÿÿÿÿÿÿÿ
00000100h: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ; ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
00000110h: FF 80 80 80 00 00 80 80 80 FF FF FF FF FF FF FF ; ÿ€€€..€€€ÿÿÿÿÿÿÿ
00000120h: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ; ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
00000130h: FF 80 80 80 00 00 80 80 80 FF FF FF FF FF FF FF ; ÿ€€€..€€€ÿÿÿÿÿÿÿ
00000140h: FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ; ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
00000150h: FF 80 80 80 00 00 80 80 80 80 80 80 80 80 80 80 ; ÿ€€€..€€€€€€€€€€
00000160h: 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 ; €€€€€€€€€€€€€€€€
00000170h: 80 80 80 80 00 00                               ; €€€€..
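
Reading the dump above: the pixel data starts at offset 0x36, each pixel is stored as blue, green, red, each 30-byte row is padded with 00 00 to a multiple of 4 bytes, and the bottom row of the image comes first. A sketch of "skip the header and read into a 2D array" that allows for those details follows. It only handles uncompressed 24-bit files like this one, and `read_bmp_24` is an illustrative name.

```python
import struct


def read_bmp_24(data):
    """Return rows (top row first) of (r, g, b) tuples from a 24-bit BMP."""
    if data[:2] != b"BM":
        raise ValueError("not a BMP file")
    pixel_offset = struct.unpack_from("<I", data, 10)[0]   # 0x36 in the dump
    width, height = struct.unpack_from("<ii", data, 18)
    bits_per_pixel = struct.unpack_from("<H", data, 28)[0]
    compression = struct.unpack_from("<I", data, 30)[0]
    if bits_per_pixel != 24 or compression != 0 or height <= 0:
        raise ValueError("only uncompressed bottom-up 24-bit BMPs are handled")
    stride = (width * 3 + 3) & ~3      # each row is padded to 4 bytes
    rows = []
    for y in range(height):
        base = pixel_offset + y * stride
        row = []
        for x in range(width):
            b, g, r = data[base + 3 * x: base + 3 * x + 3]  # stored as B, G, R
            row.append((r, g, b))
        rows.append(row)
    rows.reverse()                     # the file stores the bottom row first
    return rows


# Rebuild the 374-byte file from the hex dump: grey border, white inside.
file_header = b"BM" + struct.pack("<IHHI", 0x176, 0, 0, 0x36)
info_header = struct.pack("<IiiHHIIiiII", 40, 10, 10, 1, 24, 0, 0x140, 0, 0, 0, 0)
grey, white, pad = b"\x80" * 3, b"\xff" * 3, b"\x00\x00"
border_row = grey * 10 + pad
middle_row = grey + white * 8 + grey + pad
bmp = file_header + info_header + border_row + middle_row * 8 + border_row

pixels = read_bmp_24(bmp)
print(len(bmp), pixels[0][0], pixels[1][1])
```

Forgetting the row padding is the usual way this goes wrong: with 10-pixel rows every row after the first would be shifted by two bytes.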
 
Hopefully, even if it wasn't an exact match, at least some component of the image would be distinct, i.e. the red channel would be consistently above or below 0x80. Maybe the OP should post a crop of one of the pictures, then we can take a look! :)
 
All very interesting guys! Thanks for all the interest.

The solution we found in the end (which incidentally is yet to be fully tested.. fingers crossed) was using ImageMagick to crop each image and then a tool called GOCR (http://jocr.sourceforge.net/) to do the actual character recognition.

We ran some command line stuff on Linux and it seemed to work well - I'm going to knock up a .NET wrapper for the Win32 versions and then set it off this afternoon. I'll post any results afterwards.
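
For anyone wanting to try the same route, a wrapper around the two tools might look something like the sketch below. It is not the actual script used here: it assumes ImageMagick's `convert` and `gocr` are both on the PATH, that GOCR is given a PNM file, and the crop region passed in is whatever rectangle holds the title text in your own screenshots. The function names are made up.

```python
import subprocess


def crop_command(src, dst, width, height, x, y):
    """ImageMagick command that crops a WxH region at (x, y) out of src."""
    geometry = f"{width}x{height}+{x}+{y}"
    return ["convert", src, "-crop", geometry, "+repage", dst]


def ocr_command(pnm_path):
    """GOCR command that reads pnm_path and prints the text to stdout."""
    return ["gocr", "-i", pnm_path]


def read_title(src, tmp_pnm, region):
    """Crop region (width, height, x, y) from src and return the OCR text."""
    subprocess.run(crop_command(src, tmp_pnm, *region), check=True)
    result = subprocess.run(
        ocr_command(tmp_pnm), check=True, capture_output=True, text=True
    )
    return result.stdout.strip()
```

Calling `read_title("grab0001.bmp", "title.pnm", (300, 20, 4, 4))` in a loop over the files would then give one line of text per image; the file names and the region in that call are placeholders.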
 