[bert's blog - In Relation]

TCHS 4O 2000 [4o's nonsense]

alvinny [2] - csq - edchong
jenming - joseph - law
meepok - mingqi - pea
pengkian [2] - qwergopot - woof
xinghao - zhengyu

HCJC 01S60 [understated sixzero]

andy - edwin - jack
jiaqi - peter - rex
serena

SAF 21SA

khenghui - jiaming - jinrui [2]
ritchie - vicknesh - zhenhao

Others
Lwei [2] - shaowei

- website links -

Alien Loves Predator
BloggerSG
Cute Overload!
Cyanide and Happiness
Daily Bunny
Hamleto
Hattrick
Magic: The Gathering
The Onion
The Order of the Stick
Perry Bible Fellowship
PvP Online
Soccernet
Sluggy Freelance
The Students' Sketchpad
Talk Rock
Talking Cock.com
Tom the Dancing Bug
Wikipedia
Wulffmorgenthaler

bert's blog v1.21
Powered by glolg
Programmed with Perl 5.6.1
on Apache/1.3.27 (Red Hat Linux)

best viewed at 1024 x 768 resolution
on Internet Explorer 6.0+
or Mozilla Firefox 1.5+

entry views: 2805
today's page views: 361 (108 mobile)
all-time page views: 3400346

most viewed entry: 18739 views
most commented entry: 14 comments
number of entries: 1228

page created Sat Jul 12, 2025 17:42:12

- tagcloud -

academics [70]
art [8]
changelog [49]
current events [36]
cute stuff [12]
gaming [11]
music [8]
outings [16]
philosophy [10]
poetry [4]
programming [15]
rants [5]
reviews [8]
sport [37]
travel [19]
work [3]

miscellaneous [75]

- category tags -

academics art changelog current events cute stuff gaming miscellaneous music outings philosophy poetry programming rants reviews sport travel work

tags in total: 386

i am now probably: surfing the web [?]
(status updated every 30 minutes)

name: Lim Yong San, Gilbert -

gender: Male
nationality: Singaporean
race: Chinese
dob: 25^th January 1984

height: 1.74m (5'8½")
weight: 67kg (147 pounds)
blood type: A+

Download full resume [PDF] [DOC]
currently: National University of Singapore
(studying Computer Science & Economics)
tertiary: Hwa Chong Junior College*
secondary: The Chinese High School*
* merged into Hwa Chong Institution in 2005
primary: Shuqun Primary School
pre: Jurong Christian Church Kindergarten

fav colour: Green
fav soccer clubs: Manchester United,
Brighton and Hove Albion,
English National Team
hobbies: Many (a few in no particular order:)
reading & writing
programming (sometimes)
webgame timesin ks
DotA
kicking ping-pong balls
all manner of sports involving balls
sleeping

Tuesday, Sep 01, 2009 - 04:11 SGT
Posted By: Gilbert

- changelog -
In Relation
changelog v1.15
---------------
* Related Posts functionality fixed, improved.
* Minor issue with long strings not breaking in trackbacks fixed.

Thought it was the right time to clean up the extremely crude related posts algorithm written in February, making it slightly less crude, and at the same time provide some insight into its workings.

The idea for this function came from browsing forums and current affairs webzines, where it seems to be quite popular - reading a thread (or article, or post) to the end shows that the visitor has some interest in the topic, so a good way to hold his attention (and advertising revenue) would be to direct him to relevant stuff elsewhere on the site with a few well-placed suggestions. Go on, try it by checking out the direct link to individual posts, and scrolling to the bottom :P

We then ask the question: Given any individual blog post, which of the other posts (369 now, for this blog) are the most similar to it?

A brief survey back then found that search engines aren't particularly forthcoming with their proprietary algorithms, while the closest academic discipline regarding this area is probably text clustering or maybe neural maps, but I didn't want to muck about with matrices and eigenvectors for a half-day spit-and-polish event.

The initial implementation was thus based on a very simple concept: Each blog post contains words (well, most of them do), and similar blog posts probably contain similar words. So all we have to do is to pick out the most-used words in any post to get a general idea of what the post is about, and search all the other posts to get the best matches (this idea is partially indebted to word sense disambiguation from my Natural Language Processing module).

Obviously, some words are far more common than others, and matching up a bunch of "the"-s doesn't necessarily mean much. I therefore used a static list of such words for a start, but that combined with a sneaky bug that let the space character slip through as a valid word, returned poor results. Not to mention regex expressions failing completely if certain special characters were found in words, or the unacceptable performance hit from recalculating the matches each and every time a post was accessed.

Each of these issues were corrected in the current release. The space character and regex bugs were direct, and the performance issue mitigated by caching the results - the related posts of any particular post should not change if there aren't new posts that have been added since the last request, which doesn't happen that often.

But what about the common words? Well, all posts so far were downloaded, stripped of HTML tags and analysed - this gave just over 260 000 words in total, or around 700 words per post. "the" was by far the most commonly used word, appearing 13614 times, almost double "to" which had 7154 occurrences. Furthermore, over ten thousand words show up just once in the blog's history!

And lack of imagination

Evidence of poor vocabulary

This corresponds closely to known language behaviour, and though the data may be imperfect due to simplistic word parsing (e.g. is "señor" a word?) and the omission of reduction to base forms (by Porter's algorithm, for instance), it should still be plenty good enough for my purposes.

I then generate a search string from the ten or so most-popular and relatively uncommon words (i.e. which do not appear more than about 0.2% of the time) from each blog post. Each of these terms is also weighed accordingly to their rarity with exponential decay (e.g. a word that appears five times is taken to be twice as significant as one that appears maybe twenty times), and also their prevalence in the post (so the term that appears most often is [slightly] more significant that one that appears second most often).

Of course, I could have used the entire blog post as the search string and weighed every single word by inverse proportion (thus "the" would count for almost nothing), but the above approximation should behave reasonably closely.

Then, the algorithm runs through each other blog post [and thus is O(n) in practice] and computes a similarity score based on how the text in each post matches up with the search string. In this manner, all posts are ranked for relatedness, and the five most relevant items displayed at the bottom of individual posts.

Always easier with pictures

How it all works

How does the current version of the related posts algorithm perform? Empirically, not too badly - posts that were written as part of a loose series were generally classed within the top five most related of each other. Obtaining an objective evaluation might not be possible, given as the "relatedness" of posts is partly subjective, and humans may not be able to provide a "gold standard" in actual deployments (try getting a volunteer to read and recall hundreds of posts!).

Some weaknesses are easily predicted - one of them is that the exact words must match. Having one post talk at length about apples, and another about oranges and other fruit, may result in zero similarity being detected between them, despite most people likely agreeing that they have more in common than random posts. This could possibly be mitigated by deriving some form of inter-word distance, perhaps based on synonyms, or even how often words appear close to each other in a large corpus. Computationally potentially very expensive, clearly, and thus filed under Future Work.

Another problem is that the meaning of the word is not considered. Should a post talking about the apple fruit be related to one that talks about Apple's overpriced PCs? Probably not, but the existing situation is not as bad as it seems - there are other terms to help disambiguate, and if there are very few posts, the apple fruit may well be the most relevant analogue to the Apple Mac anyway.

Relativity in football - soon after Eduardo went down all too easily against Celtic (video), Arsenal got a taste of their own medicine when Rooney toppled for a penalty (video, probably by an Arsenal fan) to help United to a probably undeserved victory. Not the first time for Rooney either, as he did it to end Arsenal's 49 game unbeaten run back in 2004.

Diving's one of those things; nobody admits to it, some denounce it, others do it more than most, but nearly every player does it if the price is right (or even when it's not quite right, as Eboue did and found out to his consternation later in the same game). I feel Eduardo deserves some sympathy, though. Given that he has just recovered from a horrific tackle that broke his leg in three places, it's understandable that he tries to shun contact (if slightly too much in this case).

Taking impacts head-on would be the honest (and macho) thing to do, but can't bode well for a player's long-term career. Unfortunately, getting out of the way for the sake of safety often loses possession of the ball, since the referee usually doesn't care about players being scared away. The lure of reaping the benefits of a foul without suffering the foul itself is obvious, and the players will continue to maintain the charade as long as it is profitable to do so - bookings for simulation are very rare after all. That's part of what makes football so addictive to watch...

And finally, relativity in barber prices - I finally made a change from the uncle who had cut my hair since my primary school days (and maybe even before that), visiting the newly-opened Snip Avenue at the local market. Speed aside, it was ridiculously cheap at S$3.80 for my basic cut, compared to the S$8 my usual barber charges. Spent the savings at school on a Scholes keychain almost immediately.

With complimentary sucker