Sunday, May 13, 2012 - 21:52 SGT
Posted By: Gilbert

Finding Words

It's the final day of the Premier League, and the current punting standings are:

3198.5 seeds for FAKEBERT
3207 seeds for Mr. Ham!
(both from 3100 seeds wagered)


Mr. Ham: Useless human is going to lose after I gave him a thousand-seed headstart! Nyeah nyeah!

Me: I'm not certain that FAKEBERT is human...

Mr. Ham: I'm not letting that stand in the way of my declaring hamster superiority. Final wager: what better way to finish with a flourish than by having hams share the spoils? Tottenham to draw Fulham! (at 4.25)

FAKEBERT: I'll take Manchester City to beat Queen's Park Rangers (at 1.10)

Me: With that bit of business out of the way, may I draw the focus back to one of my part-time diversions: the study of machine recognition in general, and CAPTCHA solving in particular, which I gave some attention to not too long ago.

However, while that more application-oriented attempt was centered on a relatively simple target, it is time to widen our horizons; and what better way than to set the Google CAPTCHA in our sights? Less than half a year ago, Stanford researchers reported that they were unable to break a single one of Google's CAPTCHAs, yet were able to crack 13 of 15 other CAPTCHA types with accuracies that are usable in practice (mostly over 25%) [see academic paper].

It should be noted that a researcher with links to a local lab had reported [see paper] the ability to solve 68% of Google's CAPTCHAs back in mid-2011. However, this success may be tempered by the revelation in Section 3 that they initially divided the CAPTCHAs into "user-unfriendly" and "usable" classes, and developed (and presumably evaluated) their method on only the easier "usable" class, using 100 samples. Clearly, the conclusions may not then be applicable to the wider set of all CAPTCHAs, which is what users actually face.

To this end, I requested Mr. Robo to collect the data for an independent experiment.

Mr. Robo: Aye, aye. I have obtained the sum of 2000 samples from the former Google Accounts UnlockCaptcha page back in February. They've since closed this source of CAPTCHAs, but it should be valid for our purposes.
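
(Technical aside: the harvesting itself was nothing fancy - a minimal sketch of the sort of fetch-and-save loop involved, assuming the now-closed page embedded a fresh CAPTCHA image on each visit; the URL pattern and extraction regex here are purely illustrative, not the actual ones used:)

    import re
    import time
    import urllib.request

    # Illustrative only: the UnlockCaptcha page has since been closed
    PAGE_URL = "https://accounts.google.com/UnlockCaptcha"

    def grab_captcha(index):
        # Fetch the page and pick out the CAPTCHA image URL (regex is a
        # stand-in; a relative src would also need urljoin-ing, omitted here)
        html = urllib.request.urlopen(PAGE_URL).read().decode("utf-8", "ignore")
        match = re.search(r'<img[^>]+src="([^"]+[Cc]aptcha[^"]+)"', html)
        if match is None:
            return False
        image = urllib.request.urlopen(match.group(1)).read()
        with open("captcha_%04d.jpg" % index, "wb") as out:
            out.write(image)
        return True

    for i in range(2000):
        grab_captcha(i)
        time.sleep(5)  # space out requests rather than hammering the server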

Me: Nice work, Mr. Robo. I'll see you get that promotion to code bunny soon. Well, the obvious next step is to create a ground truth for these CAPTCHAs. And how might we go about that?

*Looks at Mr. Robo, together with Mr. Ham*

Mr. Robo: Wai... wait a minute! It's written in my contract that I don't have to do data entry!

Esquire Pants: *appears out of nowhere and inspects contract* Looks legit to me.

*Looks at Mr. Ham*

*Mr. Ham stares back blankly, slowly keels over*

Mr. Robo: My word! Is he dead?!

Me: *pokes the motionless Mr. Ham with a chopstick* He's certainly getting better at it, he's even getting the turning black part right.

Oh well, I suppose it can't be helped...

*Hours later*

Me: That's... that's all two thousand CAPTCHAs sol... solved. I... I don't want to... look... at another CAPTCHA for the rest of... my life.

Mr. Robo: Erm, but how would you know how confident you should be about your answers?

Me: Tha.. that's right... so... how do... we... resolve th... this?

Mr. Robo: Um, there's intra-rater reliability. Which basically means you've got to do it all over again, and see how different the results are.


[GIF: My reaction to this realisation (source: gifbin.com)]


*More hours later*

Me: *crawling up from under the desk* D... done.

Mr. Robo: But wait, what are you gonna do if they still don't agree? You had better do it a third time so that you can apply a majority voting rule for the disputed cases.

Me: I... will... personally... strangle... anybody... who... suggests... that...

Mr. Robo: Um. Okay. Never mind then. Let me analyse the results first. *codes away* Lookie what we have here: only 143 of the 2000 answers disagree between the two passes, giving you an intra-rater reliability of 92.85%. For some perspective, a 2010 study, again by Stanford, found 92.1% agreement between two of three human solvers recruited through Amazon's Mechanical Turk, but only 66.72% unanimous agreement among all three, on Google's CAPTCHAs.
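
(The reliability figure is just percent agreement between the two passes - a minimal sketch, assuming each pass is saved one answer per line; pass1.txt and pass2.txt are hypothetical filenames:)

    # Intra-rater reliability as simple percent agreement between two passes
    with open("pass1.txt") as f1, open("pass2.txt") as f2:
        pass1 = [line.strip() for line in f1]
        pass2 = [line.strip() for line in f2]

    assert len(pass1) == len(pass2)
    disputes = sum(a != b for a, b in zip(pass1, pass2))
    print("%d disagreements, %.2f%% agreement"
          % (disputes, 100 * (1 - disputes / len(pass1))))
    # 143 disagreements out of 2000 gives the 92.85% quoted above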

The bottom line is that solving state-of-the-art CAPTCHAs isn't that easy, and may even have some correlation with general intelligence - the paper noted that solving speed is weakly correlated with education level, at 9.6 seconds on average for those with no formal education against 7.64 seconds for those with doctorates, at least for image CAPTCHAs.

Some have even declared CAPTCHAs dead (back in 2008), and even the better ones are often regarded as near-unreadable (see comments here, and feedback like this, which makes a good point about international users and Latin letters); an obvious solution is to re-engineer sites such that the onus of stopping bots falls on the server side (e.g. using spam filters) rather than on the client.

It should also be noted that the results obtained from our dataset might not be directly comparable with prior research (again!), since Google likely updates their generation algorithm every now and then. For example, from our 2000-sample dataset, the average length of a CAPTCHA is 8.98 characters, with a median of 8, a minimum of 5 and a maximum of 11; this is a departure from the paper covered previously, which states that the string lengths vary between 5 and 8 characters only, at least for the "usable" class.
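
(Those length statistics fall out of a few lines over the answer strings - a sketch, reusing the hypothetical one-answer-per-line file from before:)

    from statistics import mean, median

    with open("pass1.txt") as f:  # hypothetical: one solved answer per line
        lengths = [len(line.strip()) for line in f]

    print("mean %.2f, median %s, min %d, max %d"
          % (mean(lengths), median(lengths), min(lengths), max(lengths)))
    # our 2000 samples: mean 8.98, median 8, min 5, max 11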

[Chart: distribution of CAPTCHA string lengths in the 2000-sample dataset]

In fact, Google CAPTCHAs of length 7 or less were extremely rare in our dataset, with the vast majority being of length 8 to 10. Google might well have decided to bolster security by increasing length, since each additional character multiplies machine solvability by a factor of x - equivalently, cuts it by a fraction (1-x) - where x is the average per-character recognition rate, assuming characters are recognised independently and a perfect answer is demanded.
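
(To put numbers on that: under the independence assumption, a length-n CAPTCHA is fully solved with probability x^n, so the solve rate decays geometrically with length - a quick illustration with an assumed per-character rate of 0.9:)

    # Whole-CAPTCHA solve probability, assuming each character is recognised
    # independently at rate x and a perfect answer is demanded
    x = 0.9  # assumed per-character recognition rate, for illustration only
    for n in range(5, 12):
        print("length %2d: %4.1f%% solve rate" % (n, 100 * x ** n))
    # length 8 gives about 43%, length 10 about 35%, length 11 about 31%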

As it turns out, CAPTCHAs for which there was intra-rater disagreement tended to be slightly longer, but not significantly. The trouble, as I will show later, lies elsewhere.

Me: The problem remains. How do we proceed?

Mr. Robo: Well, it would be best to collect some more data first, so we don't depend on answers from just you. Do you have any favours you can call in?

Me: I think Mr. Ham is the go-to guy for that. *looks over* Nope, he's still dead.

Mr. Robo: Oh, he was up and about munching on some peanuts just now.

Me: Forget it, I'll put in a plea to my friends, especially those who are currently pursuing or may in the future pursue further studies, since they are more likely to be sympathetic. I can't quite ask them to go through all 2000, so please select a hundred examples for me, split evenly between my intra-rater agreement and disagreement classes.
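
(The selection amounts to a stratified random draw - a sketch, with made-up filename lists standing in for the real agreement/disagreement partition:)

    import random

    # Stand-in partition: 1857 agreement and 143 disagreement filenames
    agreed   = ["captcha_%04d.jpg" % i for i in range(1857)]
    disputed = ["captcha_%04d.jpg" % i for i in range(1857, 2000)]

    sample = random.sample(agreed, 50) + random.sample(disputed, 50)
    random.shuffle(sample)  # mix the classes so volunteers can't tell them apart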

*A couple of days pass*

Me: Alright, it seems like I got three responses. So how did they do?

Mr. Robo: More detailed results are available, but I will give a summary. Of the 50 disagreement images, 13 (26%) were agreed upon by all three volunteers - for these, I suspect the disagreement was due more to fatigue and other factors on your part than to any inherent indecipherability of the CAPTCHA itself.
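
(The tally itself is trivial - per image, just check whether all three volunteer answers coincide; a sketch, with stand-in data in place of the real answer lists:)

    # Stand-in data: three volunteers' answers, one list per volunteer,
    # aligned so that position i refers to the same image throughout
    volunteer_answers = [
        ["abc", "def", "ghi"],  # volunteer 1
        ["abc", "dcf", "ghi"],  # volunteer 2
        ["abc", "def", "ghi"],  # volunteer 3
    ]

    unanimous = sum(len(set(ans)) == 1 for ans in zip(*volunteer_answers))
    print("%d of %d images agreed by all three"
          % (unanimous, len(volunteer_answers[0])))
    # prints "2 of 3 images agreed by all three" for the stand-in data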

As a whole, the agreement class did still obtain much better consensus among everyone, as is to be expected. Fully 66% of the images in that class were agreed upon by all four participants, as opposed to 26% for the disagreement class (counting either of your two answers as valid for that class). The distribution is as follows:

[Chart: number of participants in agreement, for the agreement and disagreement classes]

It therefore does seem that your intra-rater disagreement is a pretty good predictor of how difficult the CAPTCHA actually is, and the 60-odd percent concurrence tentatively supports the Stanford findings. Four of the disagreement CAPTCHAs (8%) received a different solution each of the five times they were evaluated, and here they are:

[Images: the four disagreement CAPTCHAs that drew five different solutions]

Given Solutions (clockwise from top left):
urwledsh - vawledsh - urveledsh - uawledsh - vorveldsh
pnoatemon - proatemon - prxtemon - pnatemon - pratrnon
immelxino - immelixino - immerxino - immeixino - immerxno
unmvanevo - unmvnero - uraminero - uamwnew - uvaraunevo


It therefore seems near-certain that some CAPTCHAs are, indeed, basically unsolvable, or more technically, that their deformation has destroyed critical information. From observation, these "bad" CAPTCHAs are especially wavy and squashed, making it difficult to tell loopy (e.g. m,w) or thin-type characters (e.g. i,l) apart.

Some tangential statistics: Of the 50 agreement CAPTCHAs, you have an average of 84.67% agreement with the volunteers (92%, 86% and 76% respectively). Volunteers 1 and 2 agreed 84% of the time, but their agreement with Volunteer 3 was only about 70% each.

On to the speed of solving - the volunteers were told to solve the CAPTCHAs as per normal, as if they had encountered them in the ordinary course of web surfing. Since they could well have taken breaks in between (which you can doubtless understand), I report the median time taken. Note that since individual times were recorded only to the second, and internet speeds may have had an effect, this is a very rough estimate:

[Chart: per-volunteer CAPTCHA solving times]

It is observed that Volunteer 2 [Median time: 4s/Mean time disregarding outliers: 4.94s] was the fastest, followed by Volunteer 3 [7s/8.28s], while Volunteer 1 took slightly longer [8s/8.36s] (but also achieved the highest agreement with yours truly); the average of about 7.2s is slightly faster than the 7.64s reported for solvers with a doctorate, even considering the skewed nature of the 100-image dataset, which contains harder CAPTCHAs than might be expected.
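
(For the record, the two summary statistics are computed along these lines - a sketch; the 3x-median outlier cutoff here is just one reasonable choice, labelled as such:)

    from statistics import mean, median

    times = [4, 5, 3, 6, 4, 5, 62, 4, 5, 4]  # hypothetical solving times (seconds)

    med = median(times)
    # Crude outlier rule for illustration: drop anything beyond 3x the median
    kept = [t for t in times if t <= 3 * med]
    print("median %ss, trimmed mean %.2fs" % (med, mean(kept)))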

Somewhat surprising is that the collated results do not show more time being spent on the trickier disagreement CAPTCHAs than on the more straightforward agreement ones; I'm sure you can cook up several hypotheses for this.

Me: Enough with this, anything actually useful?

Mr. Robo: About that, one realisation is that the strings are far from truly random - some substrings, in particular, cropped up every so often: "ssess" appeared about 22 times, for example, although were the characters really selected at random, it would be expected in no more than one in a million CAPTCHAs; even "glys" came up about five times!
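
(The back-of-the-envelope check, assuming uniformly random lowercase characters:)

    # Expected occurrences of a fixed 5-letter substring ("ssess") by chance
    n_captchas = 2000
    avg_len = 9                  # roughly the dataset average
    positions = avg_len - 5 + 1  # starting positions per CAPTCHA
    p_position = 26 ** -5        # uniform lowercase letters assumed

    expected = n_captchas * positions * p_position
    print("expected %.5f occurrences by chance; observed about 22" % expected)
    # about 0.00084 expected - the observed count is tens of thousands of
    # times above the chance level, so the strings are clearly not random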

We can very quickly confirm this through, you guessed it, frequency analysis, which despite some errors in the ground truth should give the big picture (17967 characters from the 2000-CAPTCHA data; Concise Oxford corpus data from Wikipedia):

[Chart: letter frequencies, CAPTCHA data vs. the Concise Oxford corpus]

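(The tally is one Counter away - a sketch over the hypothetical answers file:)

    from collections import Counter

    with open("pass1.txt") as f:  # hypothetical: one solved answer per line
        text = "".join(line.strip() for line in f)

    counts = Counter(text)
    total = sum(counts.values())  # 17967 characters in our dataset
    for char, count in counts.most_common(10):
        print("%s: %5d (%.2f%%)" % (char, count, 100 * count / total))
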
It seems safe to say that the CAPTCHAs were generated from a distribution similar to an English corpus. This may well aid CAPTCHA recognition for humans - instead of having to identify each character purely on its own merits, a solver can lean on neighbouring characters for difficult characters or substrings, if subwords are incorporated; a trick which many other CAPTCHA implementations may have missed. Indeed, an update on the baboon word recognition study mentions that the baboons may well be making their decisions based on bigrams (letter pairs) or trigrams (or not)!

Does the CAPTCHA data in fact conform to common bigram distributions as well?

Rank   Bigram   Count   Frequency   Std rank
  1    es       484     0.03031     12th
  2    in       369     0.02311     3rd
  3    er       343     0.02148     4th
  4    se       284     0.01779     31st
  5    st       277     0.01735     13th
  6    ti       253     0.01584     23rd
  7    ss       240     0.01503     -
  8    re       222     0.01390     6th
  9    ed       208     0.01303     15th
 10    te       203     0.01271     25th
 11    at       188     0.01177     8th
 12    ng       187     0.01171     27th
 13    en       184     0.01152     14th
 14    on       183     0.01146     9th
 15    ne       175     0.01096     -
 16    ra       175     0.01096     37th
 17    le       172     0.01077     32nd
 18    an       170     0.01065     5th
 19    ri       170     0.01065     -
 20    is       168     0.01052     21st
 21    si       159     0.00996     34th
 22    li       147     0.00921     -
 23    al       145     0.00908     29th
 24    or       129     0.00808     22nd
 25    co       128     0.00802     -
 26    di       115     0.00720     -
 27    de       111     0.00695     30th
 28    nt       111     0.00695     10th
 29    ar       109     0.00683     35th
 30    la       107     0.00670     -

(Std rank is the bigram's rank in a standard English frequency table; "-" marks bigrams falling outside the listed ranks.)


As it happens, not quite. For example, "th", the most common English bigram by some distance, is not represented within the top 30 CAPTCHA bigrams at all (it comes in 86th); still, there remains quite a bit of overlap, with 23 of the top 30 CAPTCHA bigrams ranking among the 39 most common English bigrams, which again suggests the use of an English-like corpus.
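
(The bigram tally is the same idea one step up - a sketch, again over the hypothetical answers file:)

    from collections import Counter

    with open("pass1.txt") as f:  # hypothetical: one solved answer per line
        answers = [line.strip() for line in f]

    # Count adjacent letter pairs within each answer (never across answers)
    bigrams = Counter(ans[i:i + 2] for ans in answers
                      for i in range(len(ans) - 1))
    total = sum(bigrams.values())
    for bg, count in bigrams.most_common(10):
        print("%s: %4d (%.5f)" % (bg, count, count / total))
    # the top of this list should reproduce the table above (es, in, er, ...)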

Alright, I think that's enough for today.

Me: Hey, you haven't even begun to crack the CAPTCHAs proper!

Mr. Robo: Patience, Hamopolis wasn't built in a day.


