bert's blog
- programming -

It's late on a cool and comfortable Christmas Eve - perfect for a spot of leisurely coding. With a few hours to burn, I thought, why not try a spot of Captcha breaking? I covered a bit about Captchas a year and a half ago, but let's do the summary all over again.

Captchas usually take the form of distorted text, which a human has to read and type back. The aim is to prevent computer "bots" from (repeatedly) accessing some resource, whether it be signing up for a new email account, posting a comment on a blog, or whatever.

Why does it work? Well, the thing is that computers can do some things very easily - say, adding up humongous amounts of figures - but in other stuff humans are (still) far superior, and should continue to be so for quite some time. Indeed, I daresay the person who creates an artificial intelligence that can convincingly impersonate a human, to the extent of displaying understanding, would not only win a Turing Award (and other assorted prizes), but would probably have opened the door to the most monumental accomplishment bar none - for once the first machine that is even marginally more intelligent than a human is created, it would be able to improve and propagate itself at a far greater pace than humans could hope to do, and attain a technological singularity. Or so the theory goes.

Suffice to say that recognizing symbols, although a tiny, tiny subset of what computers would need to be capable of to be considered intelligent, is no easy task at all. It is true that the problem of optical character recognition can be considered solved for properly scanned text in reasonably regular fonts. But reading a handwritten cursive script (especially a doctor's), easy as it may be for most humans (other than the doctor's, maybe), still cannot be done with appreciable accuracy by computers.
Which is strange, if you think about it - yes, maybe humans are helped along by context, but give a human isolated words in cursive, in differing styles from different people, and he would likely still be able to read them (well, generally better than a computer anyway). This could turn into a digression on graphology and the merits thereof, but the basic point still stands: such a seemingly simple task is badly flunked by crazily powerful processors.

[Image: Perhaps not always... (Source: Graphic Insight)]

Astute readers at this point might ask, why not just use handwritten words as a Captcha then? Well, the answer has to do with reusability and cost. Non-text-based (and often contextual) Captchas have been implemented, which may ask one to name, say, the third object to the left of the hamster in a picture. However, these types of Captchas often suffer from a relative lack of objects (which may be the abovementioned hamster, or handwriting samples), which opens them up to another line of attack: simply storing the objects in a database for future use. It is easy to see that if the number of objects is small compared to the number of queries (and remember, a single popular website may need to serve millions of Captchas a day), this form of protection would quickly be defeated. Moreover, random objects look quite messy.

Text, in contrast, offers many combinations just from combining different letters, and is easy on the eye. It is also pretty simple to generate an image of a word (or a random string of letters, to counter dictionary attacks) on some background and mess it up a little, to prevent simple OCR from being able to decipher it.

Let us walk through a practical example:

[Image: A Captcha check by Mousehunt]

We are supposed to type in the five letters displayed on the cheese to continue.
It is not particularly hard to grab that part of the screenshot, since it appears in much the same position each time (no one moved the cheese here), but just for fun I used a probabilistic function (using Perl to call ImageMagick) that tries to recognize the cheese. It seems to work, and I got:

[Image: Cheese Identified]

What next? The cheesy background seems to complicate things, but in reality it is not much of an obstacle - I simply strip the oranges and yellows out using their RGB values. What if the Captcha designers then use these very colours for the text? Well, they could, but the Captcha would then be infuriatingly hard (and maybe even impossible) to read (remember some of us are colour blind, so certain other combinations would be bad enough already). Not good when facing potential customers. Indeed, in general I guess it wouldn't be hard to recognize and strip the background.

[Image: Black and White]

Looking good, but there's still some extra noise; I therefore wrote another heuristic to clear away extraneous text (N.B. Actually, in most cases the Captcha is in its own image, so this step and the first would not apply). By iteratively clearing away small concentrations of pixels, we can get an idea of where the relevant text is:

[Image: Can't Run Can't Hide]

The code then grabs that part of the image.

But now comes the hard part. Clearing away the background gunk is, as previously noted, easy; identifying individual (even distorted) characters is likewise quite achievable. The trouble for now is breaking up a word into those individual characters. Note that some Captchas (e.g. Blogger's) have their letters clearly separated, which makes things quite easy. Mousehunt (and indeed the reCAPTCHA system used on this blog) aren't quite so nice, and mess stuff up by generating lines over the letters. To the best of my knowledge, solving this in general is still an open problem.
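The background-stripping and speckle-clearing steps described above can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the Perl/ImageMagick script used for this post (which is not shown). The colour thresholds in `is_cheese_colour`, the function names, and the use of connected-component sizes in place of the iterative clearing heuristic are all assumptions made for the example.

```python
from collections import deque


def is_cheese_colour(rgb):
    """Hypothetical orange/yellow test: strong red, fairly strong green, weak blue."""
    r, g, b = rgb
    return r > 180 and g > 120 and b < 120


def to_binary(pixels):
    """Turn a grid of (r, g, b) tuples into 1 (ink) / 0 (background).

    Orange/yellow and near-white pixels count as background (assumed thresholds).
    """
    return [
        [0 if is_cheese_colour(p) or min(p) > 200 else 1 for p in row]
        for row in pixels
    ]


def remove_small_blobs(grid, min_size):
    """Clear 4-connected groups of ink pixels that contain fewer than min_size pixels."""
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    result = [row[:] for row in grid]
    for y in range(height):
        for x in range(width):
            if grid[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search to collect one connected blob of ink.
            blob, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(blob) < min_size:
                for by, bx in blob:
                    result[by][bx] = 0
    return result
```

Calling `remove_small_blobs(to_binary(pixels), min_size)` on the cropped cheese would leave only the larger strokes. A sensible `min_size` depends on the image resolution and has to be tuned by eye.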
Humans can effortlessly filter out the lines automatically, but to a naive program, they are just so many more pixels, same as the pixels actually making up the letters. I suppose someone out there will have a very smart pattern matching algorithm that does decently for the general case, but for this Captcha in particular I applied a simple horizontal line detection heuristic for a start:

[Image: Lines Removed. Sort Of.]

Note that the thick blotches aren't recognized as line noise, and remain. This is the part that admits the most research, and to be honest a good general solution would probably be sufficient to support a whole Ph.D. thesis. So we'll put this aside for now, and move on.

Though the text is extracted, it is not horizontal yet, and this would probably pose unnecessary problems for our OCR process. There are likely many ways to correct this, for example identifying the baseline (which may not be straightforward with lowercase and distorted characters), but for now I adopted the simple method of taking each pixel as a point and using the line of best fit, which is as general as it goes:

[Image: Thin Red Line]

It's far from perfect, but an improvement nevertheless. The script then rotates the image such that the line of best fit is horizontal, and we are ready to OCR:

[Image: Gogogo]

I chose Google's open-source Tesseract engine, and proceeded to install it on my hosted web server. In the process, I learnt that one could avoid the (protected) default directories by supplying the right arguments to the configure script (Linux noob, sorry), and that one could simulate a shell console with Perl statements (just open a filehandle piping the Linux commands, suffixed with 2>&1 to merge stdout and stderr messages).
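The line-of-best-fit deskewing described a few paragraphs up can be sketched as below. This is an illustrative reconstruction in plain Python, not the script used for this post. It assumes an ordinary least-squares fit of y on x over the ink pixels, and all function names are made up for the example.

```python
import math


def ink_points(grid):
    """List the (x, y) coordinates of every ink pixel (non-zero cell) in the grid."""
    return [(x, y) for y, row in enumerate(grid) for x, v in enumerate(row) if v]


def best_fit_slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all points share one x value; slope is undefined")
    return sxy / sxx


def skew_angle_degrees(grid):
    """Angle of the fitted line relative to the x axis, in image coordinates."""
    return math.degrees(math.atan(best_fit_slope(ink_points(grid))))


def rotate_points(points, angle_degrees):
    """Rotate points about their centroid by -angle, flattening a line at that angle."""
    a = math.radians(-angle_degrees)
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [
        (cx + (x - cx) * math.cos(a) - (y - cy) * math.sin(a),
         cy + (x - cx) * math.sin(a) + (y - cy) * math.cos(a))
        for x, y in points
    ]
```

In practice the angle from `skew_angle_degrees` would be handed to whatever image library does the actual rotation. `rotate_points` is only there to show that rotating by the negative of the fitted angle brings the fitted line level.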
There was a bit of a sticky situation as the default TIFF files produced by ImageMagick weren't compatible with Tesseract, but a bit of Googling revealed that running the ImageMagick convert executable on those files with -compress None -density 300 -strip -depth 8 -monochrome -normalize -endian MSB would return just about what is needed:

[Image: N.B. Displayed in GIF format here]

All that is left is to run it through Tesseract, and we get:

Four out of five ain't too bad for a first try, I guess. Interestingly, this might even pass a reCAPTCHA check.

Remember, however, that Captchas were never meant to be anything more than a preliminary challenge, as the cost of reliably "breaking" them would be no more than what it costs to employ a kid at minimum wage - the need for Captchas to be solvable by everybody, regardless of their background, caps the level of complexity that can realistically be used.

Once more, Merry Christmas!
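For reference, the TIFF conversion call and the merged-output trick (`2>&1`) mentioned above might be wrapped like this. It is a sketch in Python rather than the Perl used for this post. The flag list is copied from the text, but the helper names and the placement of the flags between the input and output filenames are assumptions.

```python
import subprocess
import sys

# Flags quoted in the post for making ImageMagick's TIFF output acceptable to Tesseract.
TIFF_FLAGS = [
    "-compress", "None", "-density", "300", "-strip", "-depth", "8",
    "-monochrome", "-normalize", "-endian", "MSB",
]


def build_convert_command(src, dst):
    """Build the argument list for the convert call (argument order is an assumption)."""
    return ["convert", src, *TIFF_FLAGS, dst]


def run_merged(argv):
    """Run a command and return (exit code, stdout and stderr merged), like 2>&1."""
    done = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return done.returncode, done.stdout


if __name__ == "__main__":
    # Harmless demonstration that does not need ImageMagick installed:
    # a child Python process writes to both streams, and both are captured together.
    demo = [sys.executable, "-c", "import sys; print('out'); sys.stderr.write('err')"]
    print(run_merged(demo))
    print(build_convert_command("captcha.tif", "captcha_clean.tif"))
```

With ImageMagick installed, `run_merged(build_convert_command(...))` would perform the conversion and hand back any error messages in the same string, which is handy on a hosted server with no real console.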
glenlowell said... please send me the application,
darkangel said... that's nice of your work.....keep up the good work and share the blessings...share it to me.
Copyright © 2006-2025 GLYS. All Rights Reserved.