Friday, Dec 26, 2008 - 03:12 SGT
Posted By: Gilbert

Ho Ho Ho

It's late on a cool and comfortable Christmas Eve - perfect for a spot of leisurely coding. With a few hours to burn, I thought, why not try a spot of Captcha breaking?

I covered Captchas a bit a year and a half ago, but let's do the summary all over again. Captchas usually take the form of distorted text, which a human must type back in, in order to prevent computer "bots" from (repeatedly) accessing some resource, whether it be signing up for a new email account, posting a comment on a blog, or whatever.

Why does it work? Well, the thing is that computers can do some things very easily - say adding up humongous columns of figures - but at other stuff humans are (still) far superior, and should continue to be so for quite some time. Indeed, I daresay the person who creates an artificial intelligence that can convincingly impersonate a human to the extent of displaying understanding would not only win a Turing Award (and other assorted prizes), but would probably have opened the door to the most monumental accomplishment bar none. Once the first machine that is even marginally more intelligent than a human is created, it would be able to improve and propagate itself at a far greater pace than humans could hope to do, and attain a technological singularity. Or so the theory goes.

Suffice to say that recognizing symbols, although a tiny, tiny subset of what computers would need to be capable of to be considered intelligent, is no easy task at all. It is true that the problem of optical character recognition can be considered to be solved, when confronted with properly scanned and reasonably regular fonts. But reading a handwritten cursive script (especially a doctor's), easy as it may be for most humans (other than the doctor's, maybe), still cannot be done with appreciable accuracy by computers.

Which is strange, if you think about it - yes, maybe humans are helped along by context, but give a human isolated words from a cursive pen, of differing styles from different people, and he would likely still be able to read them (well, generally better than a computer anyway). This could enter into a digression on graphology and the merits thereof, but the basic point of such a seemingly simple task being so badly flunked by crazily powerful processors still stands.


Perhaps not always... (Source: Graphic Insight)

Astute readers at this point might ask, why not just use handwritten words as a Captcha then? Well, the answer has to do with reusability and cost. Non-text based (and often contextual) Captchas have been implemented, which may ask one to name, say, the third object to the left of the hamster in a picture. However, these types of Captchas often suffer from a relative lack of objects (which may be the abovementioned hamster, or handwriting samples), which opens them up to another line of attack: simply storing the objects in a database for future use. It is easy to see that if the number of objects is small compared to the number of queries (and remember, a single popular website may need to serve millions of Captchas a day), this form of protection would quickly be defeated. Moreover, random objects look quite messy.
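To put a rough number on that last point, here is a back-of-envelope sketch in Python. It assumes (my simplification, not a description of any real Captcha service) that each query shows one object drawn uniformly at random from a fixed pool; the function name is made up for illustration.

```python
def expected_fraction_seen(pool_size: int, queries: int) -> float:
    """Expected fraction of a fixed object pool that an attacker has
    observed (and could have stored) after `queries` uniform random draws.

    Any single object is missed by one draw with probability 1 - 1/pool_size,
    so it is still unseen after all draws with probability
    (1 - 1/pool_size) ** queries.
    """
    return 1.0 - (1.0 - 1.0 / pool_size) ** queries
```

Under that assumption, calling `expected_fraction_seen(100_000, 1_000_000)` shows how much of a 100,000-object pool a bot would expect to have catalogued after a million requests, which is a single day's traffic for a popular site.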

Text, in contrast, offers many combinations just from combining different letters and is easy on the eye. It is also pretty simple to generate an image of a word (or a random string of letters, to counter dictionary attacks) on some background and mess it up a little, to prevent simple OCR from being able to decipher it. Let us walk through a practical example:


A Captcha check by Mousehunt

We are supposed to type in the five letters displayed on the cheese to continue. Not particularly hard to grab that part of the screenshot since it appears in much the same position each time (no one moved the cheese here), but just for fun I used a probabilistic function (using Perl to call ImageMagick) that tries to recognize the cheese. It seems to work, and I got:


Cheese Identified

What next? The cheesy background seems to complicate things, but in reality it is not much of an obstacle - I simply strip the oranges and yellows out using their RGB values. What if the Captcha designers then use these very colours for the text? Well, they could, but the Captcha would then be infuriatingly hard (and maybe even impossible) to read (remember some of us are colour blind, so certain other combinations would be bad enough already). Not good when facing potential customers. Indeed, in general I guess it wouldn't be hard to recognize and strip the background.


Black and White
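For the curious, colour stripping of this sort can be sketched in a few lines of Python. This is not the original Perl/ImageMagick script: the image is assumed to be a plain list of rows of (R, G, B) tuples, and the thresholds are guesses at what "orange or yellow" means, not the values actually used.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def is_cheese_coloured(pixel: Pixel, min_red: int = 150,
                       min_green: int = 100, max_blue: int = 120) -> bool:
    """Heuristic: oranges and yellows have strong red, moderate-to-strong
    green and weak blue. The thresholds are illustrative assumptions."""
    r, g, b = pixel
    return r >= min_red and g >= min_green and b <= max_blue


def strip_background(image: List[List[Pixel]]) -> List[List[int]]:
    """Return a binary image: 1 where the pixel is kept (not orange or
    yellow), 0 where it is treated as background."""
    return [[0 if is_cheese_coloured(p) else 1 for p in row] for row in image]
```

Everything that is not cheese-coloured survives as a 1, which is why some extra noise is still left at this stage.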

Looking good, but there's still some extra noise; I therefore wrote another heuristic to clear away extraneous text (N.B. Actually, in most Captchas the Captcha is in its own image, so this step and the first would not apply). By iteratively clearing away small concentrations of pixels, we can get an idea of where the relevant text is:


Can't Run Can't Hide
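One way to approximate this clean-up step is connected-component filtering: find each blob of touching pixels and erase the ones below a size threshold. I am naming that technique as a stand-in, since the heuristic described above is only loosely specified; the functions below are a Python sketch on a binary image (list of rows of 0/1), with `min_size` as an arbitrary parameter.

```python
from collections import deque
from typing import List, Optional, Tuple


def remove_small_blobs(binary: List[List[int]], min_size: int) -> List[List[int]]:
    """Erase 4-connected groups of set pixels smaller than `min_size`."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill to collect one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < min_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 0
    return cleaned


def bounding_box(binary: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of the set pixels, or None if empty."""
    coords = [(x, y) for y, row in enumerate(binary) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)
```

Running `bounding_box` on the cleaned image gives the rectangle to crop, which corresponds to grabbing the part of the image where the relevant text sits.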

The code then grabs that part of the image. But now comes the hard part. Clearing away the background gunk is, as previously noted, easy; identifying individual (even distorted) characters is likewise quite achievable. The trouble for now is breaking up a word into those individual characters. Note that some Captchas (e.g. Blogger's) have their letters clearly separated, which makes things quite easy. Mousehunt (and indeed the reCAPTCHA system used on this blog) isn't quite so nice, and messes stuff up by generating lines over the letters.

To the best of my knowledge, solving this in general is still an open problem. Humans can effortlessly filter out the lines automatically, but to a naive program, they are just so many more pixels, same as the pixels actually making up the letters. I suppose someone out there will have a very smart pattern matching algorithm that does decently for the general case, but for this Captcha in particular I applied a simple horizontal line detection heuristic for a start:


Lines Removed. Sort Of.
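A horizontal line detector in this spirit might look like the following Python sketch. The rule is my own guess at a reasonable heuristic, not the one actually used: any horizontal run of set pixels at least `min_run` long is treated as line noise, except for pixels that have ink directly above and below them, since those are probably where the line crosses a letter stroke.

```python
from typing import List


def remove_horizontal_lines(binary: List[List[int]], min_run: int = 8) -> List[List[int]]:
    """Clear long horizontal runs of set pixels from a binary image,
    keeping pixels that are sandwiched vertically between other ink."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    for y in range(height):
        x = 0
        while x < width:
            if not binary[y][x]:
                x += 1
                continue
            start = x
            while x < width and binary[y][x]:
                x += 1
            if x - start < min_run:
                continue  # run too short to be line noise
            for cx in range(start, x):
                above = y > 0 and binary[y - 1][cx]
                below = y + 1 < height and binary[y + 1][cx]
                if not (above and below):
                    cleaned[y][cx] = 0
    return cleaned
```

A rule like this shares the weakness noted in the post: a thick blotch has ink above and below its middle rows, so those rows survive.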

Note that the thick blotches aren't recognized as line noise, and remain. This is the part that admits the most research, and to be honest a good general solution would probably be sufficient to support a whole Ph.D. thesis. So we'll put this aside for now, and move on.

Though the text is extracted, it is not horizontal yet, and this would probably pose unnecessary problems for our OCR process. There are likely many ways to correct this, for example identifying the baseline (which may not be straightforward with lowercase and distorted characters), but for now I adopted the simple method of taking each pixel as a point and using the line of best fit, which is about as general as it gets:


Thin Red Line

It's far from perfect, but an improvement nevertheless. The script then rotates the image such that the line of best fit is horizontal, and we are ready to OCR:


Gogogo
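The fit-and-rotate idea can be sketched in Python with an ordinary least-squares regression of y on x over the set pixels. This is an illustration under my own assumptions (binary image as a list of rows, rotation applied to point coordinates rather than to a real bitmap), not the original script.

```python
import math
from typing import List, Tuple


def best_fit_angle(binary: List[List[int]]) -> float:
    """Angle in degrees of the least-squares line through all set pixels.
    Image coordinates are used, so y grows downwards."""
    points = [(x, y) for y, row in enumerate(binary) for x, v in enumerate(row) if v]
    n = len(points)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        return 0.0  # single column of pixels: slope undefined, leave as-is
    return math.degrees(math.atan(sxy / sxx))


def rotate_points(points: List[Tuple[float, float]], angle_degrees: float,
                  centre: Tuple[float, float]) -> List[Tuple[float, float]]:
    """Rotate points about `centre` by `angle_degrees`."""
    theta = math.radians(angle_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = centre
    return [(cx + (x - cx) * cos_t - (y - cy) * sin_t,
             cy + (x - cx) * sin_t + (y - cy) * cos_t) for x, y in points]
```

Rotating by the negative of `best_fit_angle` brings the fitted line to horizontal. With a real image one would hand that angle to an image library's rotate operation instead of moving points by hand.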

I chose the open-source Tesseract engine (maintained by Google), and proceeded to install it on my hosted web server. In the process, I learnt that one could avoid the (protected) default directories by supplying the right arguments to the configure script (Linux noob, sorry), and that one could simulate a shell console with Perl statements (just open a filehandle piping the Linux commands suffixed with 2>&1 to merge the stdout and stderr messages).
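The same trick of merging stderr into stdout, which the post does in Perl with a piped filehandle, has a direct equivalent in Python's standard library. The helper name `run_merged` is made up for this sketch.

```python
import subprocess
from typing import List, Tuple


def run_merged(command: List[str]) -> Tuple[int, str]:
    """Run a command and return (exit code, combined stdout and stderr).

    stderr=subprocess.STDOUT plays the role of the shell's 2>&1.
    """
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return completed.returncode, completed.stdout
```

Seeing the error stream alongside normal output is what makes this usable as a poor man's console on a host without shell access.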

There was a bit of a sticky situation as the default TIFF files produced by ImageMagick weren't compatible with Tesseract, but a bit of Googling revealed that running the ImageMagick convert executable on those files with -compress None -density 300 -strip -depth 8 -monochrome -normalize -endian MSB would return just about what is needed:


N.B. Displayed in GIF format here
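Put together, the conversion and OCR steps amount to two command lines. The Python sketch below only builds the argument lists; the ImageMagick options are the ones quoted above, while the file names and the position of the options between the input and output file are my assumptions. The Tesseract command-line tool takes an input image and an output base name, and writes its result to that base name plus `.txt`.

```python
from typing import List

# Options quoted in the post for producing a Tesseract-friendly TIFF.
CONVERT_OPTIONS = [
    "-compress", "None", "-density", "300", "-strip", "-depth", "8",
    "-monochrome", "-normalize", "-endian", "MSB",
]


def convert_command(source: str, target_tiff: str) -> List[str]:
    """Argument list for ImageMagick's convert (option order is assumed)."""
    return ["convert", source, *CONVERT_OPTIONS, target_tiff]


def tesseract_command(tiff: str, output_base: str) -> List[str]:
    """Argument list for Tesseract; text lands in output_base + '.txt'."""
    return ["tesseract", tiff, output_base]
```

Each list can be passed straight to `subprocess.run`, for example `subprocess.run(convert_command("captcha.png", "captcha.tif"), check=True)`, where the file names are placeholders.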

All that is left is to run it through Tesseract, and we get:

HVMYR


Four out of five ain't too bad for a first try, I guess. Interestingly, this might even pass a reCAPTCHA check.

Remember that Captchas were never meant to be anything more than a preliminary challenge, however, as the cost of reliably "breaking" them would be no more than what it costs to employ a kid at minimum wage - the need for Captchas to be solvable by everybody, regardless of their background, caps the level of complexity that can realistically be used.

Once more, Merry Christmas!







2 comments


glenlowell said...

please send me the application,


September 10, 2010 - 22:13 SGT     

darkangel said...

that's nice of your work.....keep up the good work and share the blessings...share it to me.


September 10, 2010 - 22:21 SGT     




Copyright © 2006-2025 GLYS. All Rights Reserved.