Powered by glolg
Display Preferences Most Recent Entries Chatterbox Blog Links Site Statistics Category Tags About Me, Myself and Gilbert XML RSS Feed
Tuesday, Sep 01, 2009 - 04:11 SGT
Posted By: Gilbert

- -
In Relation

changelog v1.15
---------------
* Related Posts functionality fixed, improved.
* Minor issue with long strings not breaking in trackbacks fixed.


Thought it was the right time to clean up the extremely crude related posts algorithm written in February, making it slightly less crude, and at the same time provide some insight into its workings.

The idea for this function came from browsing forums and current affairs webzines, where it seems to be quite popular - reading a thread (or article, or post) to the end shows that the visitor has some interest in the topic, so a good way to hold his attention (and advertising revenue) would be to direct him to relevant stuff elsewhere on the site with a few well-placed suggestions. Go on, try it by checking out the direct link to individual posts, and scrolling to the bottom :P

We then ask the question: Given any individual blog post, which of the other posts (369 now, for this blog) are the most similar to it?

A brief survey back then found that search engines aren't particularly forthcoming with their proprietary algorithms, while the closest academic discipline regarding this area is probably text clustering or maybe neural maps, but I didn't want to muck about with matrices and eigenvectors for a half-day spit-and-polish event.

The initial implementation was thus based on a very simple concept: Each blog post contains words (well, most of them do), and similar blog posts probably contain similar words. So all we have to do is to pick out the most-used words in any post to get a general idea of what the post is about, and search all the other posts to get the best matches (this idea is partially indebted to word sense disambiguation from my Natural Language Processing module).

Obviously, some words are far more common than others, and matching up a bunch of "the"-s doesn't necessarily mean much. I therefore used a static list of such words for a start, but that combined with a sneaky bug that let the space character slip through as a valid word, returned poor results. Not to mention regex expressions failing completely if certain special characters were found in words, or the unacceptable performance hit from recalculating the matches each and every time a post was accessed.

Each of these issues were corrected in the current release. The space character and regex bugs were direct, and the performance issue mitigated by caching the results - the related posts of any particular post should not change if there aren't new posts that have been added since the last request, which doesn't happen that often.

But what about the common words? Well, all posts so far were downloaded, stripped of HTML tags and analysed - this gave just over 260 000 words in total, or around 700 words per post. "the" was by far the most commonly used word, appearing 13614 times, almost double "to" which had 7154 occurrences. Furthermore, over ten thousand words show up just once in the blog's history!


Evidence of poor vocabulary


This corresponds closely to known language behaviour, and though the data may be imperfect due to simplistic word parsing (e.g. is "señor" a word?) and the omission of reduction to base forms (by Porter's algorithm, for instance), it should still be plenty good enough for my purposes.

I then generate a search string from the ten or so most-popular and relatively uncommon words (i.e. which do not appear more than about 0.2% of the time) from each blog post. Each of these terms is also weighed accordingly to their rarity with exponential decay (e.g. a word that appears five times is taken to be twice as significant as one that appears maybe twenty times), and also their prevalence in the post (so the term that appears most often is [slightly] more significant that one that appears second most often).

Of course, I could have used the entire blog post as the search string and weighed every single word by inverse proportion (thus "the" would count for almost nothing), but the above approximation should behave reasonably closely.

Then, the algorithm runs through each other blog post [and thus is O(n) in practice] and computes a similarity score based on how the text in each post matches up with the search string. In this manner, all posts are ranked for relatedness, and the five most relevant items displayed at the bottom of individual posts.


How it all works


How does the current version of the related posts algorithm perform? Empirically, not too badly - posts that were written as part of a loose series were generally classed within the top five most related of each other. Obtaining an objective evaluation might not be possible, given as the "relatedness" of posts is partly subjective, and humans may not be able to provide a "gold standard" in actual deployments (try getting a volunteer to read and recall hundreds of posts!).

Some weaknesses are easily predicted - one of them is that the exact words must match. Having one post talk at length about apples, and another about oranges and other fruit, may result in zero similarity being detected between them, despite most people likely agreeing that they have more in common than random posts. This could possibly be mitigated by deriving some form of inter-word distance, perhaps based on synonyms, or even how often words appear close to each other in a large corpus. Computationally potentially very expensive, clearly, and thus filed under Future Work.

Another problem is that the meaning of the word is not considered. Should a post talking about the apple fruit be related to one that talks about Apple's overpriced PCs? Probably not, but the existing situation is not as bad as it seems - there are other terms to help disambiguate, and if there are very few posts, the apple fruit may well be the most relevant analogue to the Apple Mac anyway.

Relativity in football - soon after Eduardo went down all too easily against Celtic (video), Arsenal got a taste of their own medicine when Rooney toppled for a penalty (video, probably by an Arsenal fan) to help United to a probably undeserved victory. Not the first time for Rooney either, as he did it to end Arsenal's 49 game unbeaten run back in 2004.

Diving's one of those things; nobody admits to it, some denounce it, others do it more than most, but nearly every player does it if the price is right (or even when it's not quite right, as Eboue did and found out to his consternation later in the same game). I feel Eduardo deserves some sympathy, though. Given that he has just recovered from a horrific tackle that broke his leg in three places, it's understandable that he tries to shun contact (if slightly too much in this case).

Taking impacts head-on would be the honest (and macho) thing to do, but can't bode well for a player's long-term career. Unfortunately, getting out of the way for the sake of safety often loses possession of the ball, since the referee usually doesn't care about players being scared away. The lure of reaping the benefits of a foul without suffering the foul itself is obvious, and the players will continue to maintain the charade as long as it is profitable to do so - bookings for simulation are very rare after all. That's part of what makes football so addictive to watch...

And finally, relativity in barber prices - I finally made a change from the uncle who had cut my hair since my primary school days (and maybe even before that), visiting the newly-opened Snip Avenue at the local market. Speed aside, it was ridiculously cheap at S$3.80 for my basic cut, compared to the S$8 my usual barber charges. Spent the savings at school on a Scholes keychain almost immediately.


Like they say, everything happens for a reason




comments (0) - email - share - print - direct link
trackbacks (2) - trackback url


Next: Much To Do About Nothing


Related Posts:
The Washing Machine
Gah
Semester 7 Post Mortem
Midmidterm
Economics Thus Far

Back to top




2 trackbacks


Linkback by LwEi's World

....glys.com/favicon.ico'/bert's blogIn Relation4 days agoinput type='hidden' value='http://ww...


September 5, 2009 - 06:36 SGT     

Linkback by LwEi's World

....glys.com/favicon.ico'/bert's blogIn Relation5 days agoinput type='hidden' value='http://ww...


September 6, 2009 - 11:12 SGT     


Copyright © 2006-2025 GLYS. All Rights Reserved.