Sunday, May 13, 2012 - 21:52 SGT
Posted By: Gilbert

Finding Words

It's the final day of the Premier League, and the current punting standings are:

3198.5 seeds for FAKEBERT
3207 seeds for Mr. Ham!
(both from 3100 seeds wagered)


Mr. Ham: Useless human is going to lose after I gave him a thousand-seed headstart! Nyeah nyeah!

Me: I'm not certain that FAKEBERT is human...

Mr. Ham: I'm not letting that stand in the way of my declaring hamster superiority. Final wager: what better way to finish with a flourish than by having hams share the spoils? Tottenham to draw Fulham! (at 4.25)

FAKEBERT: I'll take Manchester City to beat Queen's Park Rangers (at 1.10)

Me: With that bit of business out of the way, may I draw the focus back to one of my part-time diversions: the study of machine recognition in general, and CAPTCHA solving in particular, which I gave some attention to not too long ago.

However, while that more application-oriented attempt was centered on a relatively simple target, it is time to widen our horizons; and what better way than to set the Google CAPTCHA in our sights? Less than half a year ago, Stanford researchers reported that they were unable to break a single one of Google's CAPTCHAs, yet were able to crack 13 of 15 other CAPTCHA types with accuracies that are usable in practice (mostly over 25%) [see academic paper].

It should be noted that a researcher with links to a local lab had reported [see paper] the ability to solve 68% of Google's CAPTCHAs back in mid-2011. However, this success may be tempered by the revelation in Section 3 that they initially divided the CAPTCHAs into "user-unfriendly" and "usable" classes, and developed (and presumably evaluated) their method on only the easier "usable" class, using 100 samples. Clearly, the conclusions may not then be applicable to the wider set of all CAPTCHAs, which is what users actually face.

To this end, I requested Mr. Robo to collect the data for an independent experiment.

Mr. Robo: Aye, aye. I have obtained the sum of 2000 samples from the former Google Accounts UnlockCaptcha page back in February. They've since closed this source of CAPTCHAs, but it should be valid for our purposes.
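
(Technical aside: the harvesting itself was nothing fancy - a minimal sketch of the sort of fetch-and-save loop involved, assuming the now-closed page embedded a fresh CAPTCHA image on each visit; the URL pattern and extraction regex here are purely illustrative, not the actual ones used:)

    import re
    import time
    import urllib.request

    # Illustrative only: the UnlockCaptcha page has since been closed
    PAGE_URL = "https://accounts.google.com/UnlockCaptcha"

    def grab_captcha(index):
        # Fetch the page and pick out the CAPTCHA image URL (regex is a
        # stand-in; a relative src would also need urljoin-ing, omitted here)
        html = urllib.request.urlopen(PAGE_URL).read().decode("utf-8", "ignore")
        match = re.search(r'<img[^>]+src="([^"]+[Cc]aptcha[^"]+)"', html)
        if match is None:
            return False
        image = urllib.request.urlopen(match.group(1)).read()
        with open("captcha_%04d.jpg" % index, "wb") as out:
            out.write(image)
        return True

    for i in range(2000):
        grab_captcha(i)
        time.sleep(5)  # space out requests rather than hammering the server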

Me: Nice work, Mr. Robo. I'll see you get that promotion to code bunny soon. Well, the obvious next step is to create a ground truth for these CAPTCHAs. And how might we go about that?

*Looks at Mr. Robo, together with Mr. Ham*

Mr. Robo: Wai... wait a minute! It's written in my contract that I don't have to do data entry!

Esquire Pants: *appears out of nowhere and inspects contract* Looks legit to me.

*Looks at Mr. Ham*

*Mr. Ham stares back blankly, slowly keels over*

Mr. Robo: My word! Is he dead?!

Me: *pokes the motionless Mr. Ham with a chopstick* He's certainly getting better at it, he's even getting the turning black part right.

Oh well, I suppose it can't be helped...

*Hours later*

Me: That's... that's all two thousand CAPTCHAs sol... solved. I... I don't want to... look... at another CAPTCHA for the rest of... my life.

Mr. Robo: Erm, but how would you know how confident you should be about your answers?

Me: Tha.. that's right... so... how do... we... resolve th... this?

Mr. Robo: Um, there's intra-rater reliability. Which basically means you've got to do it all over again, and see how different the results are.


[GIF: My reaction to this realisation (source: gifbin.com)]


*More hours later*

Me: *crawling up from under the desk* D... done.

Mr. Robo: But wait, what are you gonna do if they still don't agree? You had better do it a third time so that you can apply a majority voting rule for the disputed cases.

Me: I... will... personally... strangle... anybody... who... suggests... that...

Mr. Robo: Um. Okay. Never mind then. Let me analyse the results first. *codes away* Lookie what we have here: only 143 of the 2000 answers disagree between the two passes, giving you an intra-rater reliability of 92.85%. For some perspective, a 2010 study, again by Stanford, found 92.1% agreement between two of three human solvers recruited through Amazon's Mechanical Turk, but only 66.72% unanimous agreement among all three, on Google's CAPTCHAs.
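
(The reliability figure is just percent agreement between the two passes - a minimal sketch, assuming each pass is saved one answer per line; pass1.txt and pass2.txt are hypothetical filenames:)

    # Intra-rater reliability as simple percent agreement between two passes
    with open("pass1.txt") as f1, open("pass2.txt") as f2:
        pass1 = [line.strip() for line in f1]
        pass2 = [line.strip() for line in f2]

    assert len(pass1) == len(pass2)
    disputes = sum(a != b for a, b in zip(pass1, pass2))
    print("%d disagreements, %.2f%% agreement"
          % (disputes, 100 * (1 - disputes / len(pass1))))
    # 143 disagreements out of 2000 gives the 92.85% quoted above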

The bottom line is that solving state-of-the-art CAPTCHAs isn't that easy, and may even have some correlation with general intelligence - the paper noted that solving speed is weakly correlated with education level, at 9.6 seconds on average for those with no formal education against 7.64 seconds for those with doctorates, at least for image CAPTCHAs.

Some have even declared CAPTCHAs dead (back in 2008), and even the better ones are often regarded as near-unreadable (see comments here, and feedback like this, which makes a good point about international users and Latin letters); an obvious solution is to re-engineer sites such that the onus of stopping bots falls on the server side (e.g. using spam filters) rather than on the client.

It should also be noted that the results obtained from our dataset might not be directly comparable with prior research (again!), since Google likely updates their generation algorithm every now and then. For example, from our 2000-sample dataset, the average length of a CAPTCHA is 8.98 characters, with a median of 8, a minimum of 5 and a maximum of 11; this is a departure from the paper covered previously, which states that the string lengths vary between 5 and 8 characters only, at least for the "usable" class.
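
(Those length statistics fall out of a few lines over the answer strings - a sketch, reusing the hypothetical one-answer-per-line file from before:)

    from statistics import mean, median

    with open("pass1.txt") as f:  # hypothetical: one solved answer per line
        lengths = [len(line.strip()) for line in f]

    print("mean %.2f, median %s, min %d, max %d"
          % (mean(lengths), median(lengths), min(lengths), max(lengths)))
    # our 2000 samples: mean 8.98, median 8, min 5, max 11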

[Chart: distribution of CAPTCHA string lengths in the 2000-sample dataset]

In fact, Google CAPTCHAs of length 7 or less were extremely rare in our dataset, with the vast majority being of length 8 to 10. Google might well have decided to bolster security by increasing length, since each additional character multiplies machine solvability by a factor of x - equivalently, cuts it by a fraction (1-x) - where x is the average per-character recognition rate, assuming characters are recognised independently and a perfect answer is demanded.
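
(To put numbers on that: under the independence assumption, a length-n CAPTCHA is fully solved with probability x^n, so the solve rate decays geometrically with length - a quick illustration with an assumed per-character rate of 0.9:)

    # Whole-CAPTCHA solve probability, assuming each character is recognised
    # independently at rate x and a perfect answer is demanded
    x = 0.9  # assumed per-character recognition rate, for illustration only
    for n in range(5, 12):
        print("length %2d: %4.1f%% solve rate" % (n, 100 * x ** n))
    # length 8 gives about 43%, length 10 about 35%, length 11 about 31%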

As it turns out, CAPTCHAs for which there was intra-rater disagreement tended to be slightly longer, but not significantly. The trouble, as I will show later, lies elsewhere.

Me: The problem remains. How do we proceed?

Mr. Robo: Well, it would be best to collect some more data first, so we don't depend on answers from just you. Do you have any favours you can call in?

Me: I think Mr. Ham is the go-to guy for that. *looks over* Nope, he's still dead.

Mr. Robo: Oh, he was up and about munching on some peanuts just now.

Me: Forget it, I'll put in a plea to my friends, especially those who are currently pursuing or may in the future pursue further studies, since they are more likely to be sympathetic. I can't quite ask them to go through all 2000, so please select a hundred examples for me, split evenly between my intra-rater agreement and disagreement classes.
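
(The selection amounts to a stratified random draw - a sketch, with made-up filename lists standing in for the real agreement/disagreement partition:)

    import random

    # Stand-in partition: 1857 agreement and 143 disagreement filenames
    agreed   = ["captcha_%04d.jpg" % i for i in range(1857)]
    disputed = ["captcha_%04d.jpg" % i for i in range(1857, 2000)]

    sample = random.sample(agreed, 50) + random.sample(disputed, 50)
    random.shuffle(sample)  # mix the classes so volunteers can't tell them apart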

*A couple of days pass*

Me: Alright, it seems like I got three responses. So how did they do?

Mr. Robo: More detailed results are available, but I will give a summary. Of the 50 disagreement images, 13 (26%) were agreed upon by all three volunteers - for these, I suspect the disagreement was due more to fatigue and other factors on your part than to any inherent indecipherability of the CAPTCHA itself.
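
(The tally itself is trivial - per image, just check whether all three volunteer answers coincide; a sketch, with stand-in data in place of the real answer lists:)

    # Stand-in data: three volunteers' answers, one list per volunteer,
    # aligned so that position i refers to the same image throughout
    volunteer_answers = [
        ["abc", "def", "ghi"],  # volunteer 1
        ["abc", "dcf", "ghi"],  # volunteer 2
        ["abc", "def", "ghi"],  # volunteer 3
    ]

    unanimous = sum(len(set(ans)) == 1 for ans in zip(*volunteer_answers))
    print("%d of %d images agreed by all three"
          % (unanimous, len(volunteer_answers[0])))
    # prints "2 of 3 images agreed by all three" for the stand-in data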

As a whole, the agreement class did still obtain much better consensus among everyone, as is to be expected. Fully 66% of the images in that class were agreed upon by all four participants, as opposed to 26% for the disagreement class (counting either of your two answers as valid for that class). The distribution is as follows:

[Chart: number of participants in agreement, for the agreement and disagreement classes]

It therefore does seem that your intra-rater disagreement is a pretty good predictor of how difficult the CAPTCHA actually is, and the 60-odd percent concurrence tentatively supports the Stanford findings. Four of the disagreement CAPTCHAs (8%) received a different solution each of the five times they were evaluated, and here they are:

[Images: the four disagreement CAPTCHAs that drew five different solutions]

Given Solutions (clockwise from top left):
urwledsh - vawledsh - urveledsh - uawledsh - vorveldsh
pnoatemon - proatemon - prxtemon - pnatemon - pratrnon
immelxino - immelixino - immerxino - immeixino - immerxno
unmvanevo - unmvnero - uraminero - uamwnew - uvaraunevo


It therefore seems near-certain that some CAPTCHAs are, indeed, basically unsolvable, or more technically, that their deformation has destroyed critical information. From observation, these "bad" CAPTCHAs are especially wavy and squashed, making it difficult to tell loopy (e.g. m,w) or thin-type characters (e.g. i,l) apart.

Some tangential statistics: Of the 50 agreement CAPTCHAs, you have an average of 84.67% agreement with the volunteers (92%, 86% and 76% respectively). Volunteers 1 and 2 agreed 84% of the time, but their agreement with Volunteer 3 was only about 70% each.

On to the speed of solving - the volunteers were told to solve the CAPTCHAs as per normal, as if they had encountered them in the ordinary course of web surfing. Since they could well have taken breaks in between (which you can doubtless understand), I report the median time taken. Note that since individual times were recorded only to the second, and internet speeds may have had an effect, this is a very rough estimate:

[Chart: per-volunteer CAPTCHA solving times]

It is observed that Volunteer 2 [Median time: 4s/Mean time disregarding outliers: 4.94s] was the fastest, followed by Volunteer 3 [7s/8.28s], while Volunteer 1 took slightly longer [8s/8.36s] (but also achieved the highest agreement with yours truly); the average of about 7.2s is slightly faster than the 7.64s reported for solvers with a doctorate, even considering the skewed nature of the 100-image dataset, which contains harder CAPTCHAs than might be expected.
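
(For the record, the two summary statistics are computed along these lines - a sketch; the 3x-median outlier cutoff here is just one reasonable choice, labelled as such:)

    from statistics import mean, median

    times = [4, 5, 3, 6, 4, 5, 62, 4, 5, 4]  # hypothetical solving times (seconds)

    med = median(times)
    # Crude outlier rule for illustration: drop anything beyond 3x the median
    kept = [t for t in times if t <= 3 * med]
    print("median %ss, trimmed mean %.2fs" % (med, mean(kept)))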

Somewhat surprising is that the collated results do not show more time being spent on the trickier disagreement CAPTCHAs than on the more straightforward agreement ones; I'm sure you can cook up several hypotheses for this.

Me: Enough with this, anything actually useful?

Mr. Robo: About that, one realisation is that the strings are far from truly random - some substrings, in particular, cropped up every so often: "ssess" appeared about 22 times, for example, although were the characters really selected at random, it would be expected in no more than one in a million CAPTCHAs; even "glys" came up about five times!
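
(The back-of-the-envelope check, assuming uniformly random lowercase characters:)

    # Expected occurrences of a fixed 5-letter substring ("ssess") by chance
    n_captchas = 2000
    avg_len = 9                  # roughly the dataset average
    positions = avg_len - 5 + 1  # starting positions per CAPTCHA
    p_position = 26 ** -5        # uniform lowercase letters assumed

    expected = n_captchas * positions * p_position
    print("expected %.5f occurrences by chance; observed about 22" % expected)
    # about 0.00084 expected - the observed count is tens of thousands of
    # times above the chance level, so the strings are clearly not random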

We can very quickly confirm this through, you guessed it, frequency analysis, which despite some errors in the ground truth should give the big picture (17967 characters from the 2000-CAPTCHA data; Concise Oxford corpus data from Wikipedia):

[Chart: letter frequencies, CAPTCHA data vs. the Concise Oxford corpus]

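(The tally is one Counter away - a sketch over the hypothetical answers file:)

    from collections import Counter

    with open("pass1.txt") as f:  # hypothetical: one solved answer per line
        text = "".join(line.strip() for line in f)

    counts = Counter(text)
    total = sum(counts.values())  # 17967 characters in our dataset
    for char, count in counts.most_common(10):
        print("%s: %5d (%.2f%%)" % (char, count, 100 * count / total))
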
It seems safe to say that the CAPTCHAs were generated from a distribution similar to an English corpus. This may well aid CAPTCHA recognition for humans - instead of having to identify each character purely on its own merits, a solver can lean on neighbouring characters for difficult characters or substrings, if subwords are incorporated; a trick which many other CAPTCHA implementations may have missed. Indeed, an update on the baboon word recognition study mentions that the baboons may well be making their decisions based on bigrams (letter pairs) or trigrams (or not)!

Does the CAPTCHA data in fact conform to common bigram distributions as well?

Rank   Bigram   Count   Frequency   Std rank
  1    es       484     0.03031     12th
  2    in       369     0.02311     3rd
  3    er       343     0.02148     4th
  4    se       284     0.01779     31st
  5    st       277     0.01735     13th
  6    ti       253     0.01584     23rd
  7    ss       240     0.01503     -
  8    re       222     0.01390     6th
  9    ed       208     0.01303     15th
 10    te       203     0.01271     25th
 11    at       188     0.01177     8th
 12    ng       187     0.01171     27th
 13    en       184     0.01152     14th
 14    on       183     0.01146     9th
 15    ne       175     0.01096     -
 16    ra       175     0.01096     37th
 17    le       172     0.01077     32nd
 18    an       170     0.01065     5th
 19    ri       170     0.01065     -
 20    is       168     0.01052     21st
 21    si       159     0.00996     34th
 22    li       147     0.00921     -
 23    al       145     0.00908     29th
 24    or       129     0.00808     22nd
 25    co       128     0.00802     -
 26    di       115     0.00720     -
 27    de       111     0.00695     30th
 28    nt       111     0.00695     10th
 29    ar       109     0.00683     35th
 30    la       107     0.00670     -

(Std rank is the bigram's rank in a standard English frequency table; "-" marks bigrams falling outside the listed ranks.)


As it happens, not quite. For example, "th", the most common English bigram by some distance, is not represented within the top 30 CAPTCHA bigrams at all (it comes in 86th); still, there remains quite a bit of overlap, with 23 of the top 30 CAPTCHA bigrams ranking among the 39 most common English bigrams, which again suggests the use of an English-like corpus.
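
(The bigram tally is the same idea one step up - a sketch, again over the hypothetical answers file:)

    from collections import Counter

    with open("pass1.txt") as f:  # hypothetical: one solved answer per line
        answers = [line.strip() for line in f]

    # Count adjacent letter pairs within each answer (never across answers)
    bigrams = Counter(ans[i:i + 2] for ans in answers
                      for i in range(len(ans) - 1))
    total = sum(bigrams.values())
    for bg, count in bigrams.most_common(10):
        print("%s: %4d (%.5f)" % (bg, count, count / total))
    # the top of this list should reproduce the table above (es, in, er, ...)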

Alright, I think that's enough for today.

Me: Hey, you haven't even begun to crack the CAPTCHAs proper!

Mr. Robo: Patience, Hamopolis wasn't built in a day.


