TCHS 4O 2000 [4o's nonsense] alvinny [2] - csq - edchong jenming - joseph - law meepok - mingqi - pea pengkian [2] - qwergopot - woof xinghao - zhengyu HCJC 01S60 [understated sixzero] andy - edwin - jack jiaqi - peter - rex serena SAF 21SA khenghui - jiaming - jinrui [2] ritchie - vicknesh - zhenhao Others Lwei [2] - shaowei - website links - Alien Loves Predator BloggerSG Cute Overload! Cyanide and Happiness Daily Bunny Hamleto Hattrick Magic: The Gathering The Onion The Order of the Stick Perry Bible Fellowship PvP Online Soccernet Sluggy Freelance The Students' Sketchpad Talk Rock Talking Cock.com Tom the Dancing Bug Wikipedia Wulffmorgenthaler |
bert's blog v1.21 Powered by glolg Programmed with Perl 5.6.1 on Apache/1.3.27 (Red Hat Linux) best viewed at 1024 x 768 resolution on Internet Explorer 6.0+ or Mozilla Firefox 1.5+ entry views: 2662 today's page views: 240 (65 mobile) all-time page views: 3207966 most viewed entry: 18739 views most commented entry: 14 comments number of entries: 1203 page created Sun Jan 19, 2025 17:49:13 |
- tagcloud - academics [70] art [8] changelog [49] current events [36] cute stuff [12] gaming [11] music [8] outings [16] philosophy [10] poetry [4] programming [15] rants [5] reviews [8] sport [37] travel [19] work [3] miscellaneous [75] |
- category tags - academics art changelog current events cute stuff gaming miscellaneous music outings philosophy poetry programming rants reviews sport travel work tags in total: 386 |
|
- changelog - changelog v1.15 --------------- * Related Posts functionality fixed, improved. * Minor issue with long strings not breaking in trackbacks fixed. Thought it was the right time to clean up the extremely crude related posts algorithm written in February, making it slightly less crude, and at the same time provide some insight into its workings. The idea for this function came from browsing forums and current affairs webzines, where it seems to be quite popular - reading a thread (or article, or post) to the end shows that the visitor has some interest in the topic, so a good way to hold his attention (and advertising revenue) would be to direct him to relevant stuff elsewhere on the site with a few well-placed suggestions. Go on, try it by checking out the direct link to individual posts, and scrolling to the bottom :P We then ask the question: Given any individual blog post, which of the other posts (369 now, for this blog) are the most similar to it? A brief survey back then found that search engines aren't particularly forthcoming with their proprietary algorithms, while the closest academic discipline regarding this area is probably text clustering or maybe neural maps, but I didn't want to muck about with matrices and eigenvectors for a half-day spit-and-polish event. The initial implementation was thus based on a very simple concept: Each blog post contains words (well, most of them do), and similar blog posts probably contain similar words. So all we have to do is to pick out the most-used words in any post to get a general idea of what the post is about, and search all the other posts to get the best matches (this idea is partially indebted to word sense disambiguation from my Natural Language Processing module). Obviously, some words are far more common than others, and matching up a bunch of "the"-s doesn't necessarily mean much. I therefore used a static list of such words for a start, but that combined with a sneaky bug that let the space character slip through as a valid word, returned poor results. Not to mention regex expressions failing completely if certain special characters were found in words, or the unacceptable performance hit from recalculating the matches each and every time a post was accessed. Each of these issues were corrected in the current release. The space character and regex bugs were direct, and the performance issue mitigated by caching the results - the related posts of any particular post should not change if there aren't new posts that have been added since the last request, which doesn't happen that often. But what about the common words? Well, all posts so far were downloaded, stripped of HTML tags and analysed - this gave just over 260 000 words in total, or around 700 words per post. "the" was by far the most commonly used word, appearing 13614 times, almost double "to" which had 7154 occurrences. Furthermore, over ten thousand words show up just once in the blog's history! Evidence of poor vocabulary This corresponds closely to known language behaviour, and though the data may be imperfect due to simplistic word parsing (e.g. is "señor" a word?) and the omission of reduction to base forms (by Porter's algorithm, for instance), it should still be plenty good enough for my purposes. I then generate a search string from the ten or so most-popular and relatively uncommon words (i.e. which do not appear more than about 0.2% of the time) from each blog post. Each of these terms is also weighed accordingly to their rarity with exponential decay (e.g. a word that appears five times is taken to be twice as significant as one that appears maybe twenty times), and also their prevalence in the post (so the term that appears most often is [slightly] more significant that one that appears second most often). Of course, I could have used the entire blog post as the search string and weighed every single word by inverse proportion (thus "the" would count for almost nothing), but the above approximation should behave reasonably closely. Then, the algorithm runs through each other blog post [and thus is O(n) in practice] and computes a similarity score based on how the text in each post matches up with the search string. In this manner, all posts are ranked for relatedness, and the five most relevant items displayed at the bottom of individual posts. How it all works How does the current version of the related posts algorithm perform? Empirically, not too badly - posts that were written as part of a loose series were generally classed within the top five most related of each other. Obtaining an objective evaluation might not be possible, given as the "relatedness" of posts is partly subjective, and humans may not be able to provide a "gold standard" in actual deployments (try getting a volunteer to read and recall hundreds of posts!). Some weaknesses are easily predicted - one of them is that the exact words must match. Having one post talk at length about apples, and another about oranges and other fruit, may result in zero similarity being detected between them, despite most people likely agreeing that they have more in common than random posts. This could possibly be mitigated by deriving some form of inter-word distance, perhaps based on synonyms, or even how often words appear close to each other in a large corpus. Computationally potentially very expensive, clearly, and thus filed under Future Work. Another problem is that the meaning of the word is not considered. Should a post talking about the apple fruit be related to one that talks about Apple's overpriced PCs? Probably not, but the existing situation is not as bad as it seems - there are other terms to help disambiguate, and if there are very few posts, the apple fruit may well be the most relevant analogue to the Apple Mac anyway. Relativity in football - soon after Eduardo went down all too easily against Celtic (video), Arsenal got a taste of their own medicine when Rooney toppled for a penalty (video, probably by an Arsenal fan) to help United to a probably undeserved victory. Not the first time for Rooney either, as he did it to end Arsenal's 49 game unbeaten run back in 2004. Diving's one of those things; nobody admits to it, some denounce it, others do it more than most, but nearly every player does it if the price is right (or even when it's not quite right, as Eboue did and found out to his consternation later in the same game). I feel Eduardo deserves some sympathy, though. Given that he has just recovered from a horrific tackle that broke his leg in three places, it's understandable that he tries to shun contact (if slightly too much in this case). Taking impacts head-on would be the honest (and macho) thing to do, but can't bode well for a player's long-term career. Unfortunately, getting out of the way for the sake of safety often loses possession of the ball, since the referee usually doesn't care about players being scared away. The lure of reaping the benefits of a foul without suffering the foul itself is obvious, and the players will continue to maintain the charade as long as it is profitable to do so - bookings for simulation are very rare after all. That's part of what makes football so addictive to watch... And finally, relativity in barber prices - I finally made a change from the uncle who had cut my hair since my primary school days (and maybe even before that), visiting the newly-opened Snip Avenue at the local market. Speed aside, it was ridiculously cheap at S$3.80 for my basic cut, compared to the S$8 my usual barber charges. Spent the savings at school on a Scholes keychain almost immediately. Like they say, everything happens for a reason Next: Much To Do About Nothing
Linkback by LwEi's World
Linkback by LwEi's World
|
||||||||||||
Copyright © 2006-2025 GLYS. All Rights Reserved. |