Thursday, Apr 26, 2012 - 00:17 SGT
Posted By: Gilbert

Midweek Update On Data Cells

Since the organizers very kindly extended the deadline again (which I suspect is something of a custom in academic circles), I got another weekend to tinker with the cell data (not Excel!), and here's the overview:



I can't even begin to express my appreciation for datasets which can be quickly processed (which my main line of research, alas, is not), and since LIBSVM was blindingly fast too, I produced the above graph by adjusting the training and test set sizes.

Recall that about a fortnight ago, I reported an accuracy of about 88%, obtained by splitting the entire set into disjoint training and test sets of (approximately) equal sizes, training an SVM model on the training set, computing the accuracy on the test set, and averaging over ten runs. In the above chart, the equivalent accuracy on the updated feature set can be read off by following the line vertically upwards from 0.5 on the x-axis (i.e. the training set size divided by the size of the entire set is 0.5) until it hits the red curve, which gives an improved accuracy of 94.26%, averaged over a hundred runs this time (the accuracy for individual runs tends to lie within ±3%).
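For anyone curious how such a curve can be produced, here is a minimal sketch of the repeated-split evaluation, assuming the features sit in a NumPy array X (one row per cell image) and the class labels in an array y; scikit-learn's SVC wraps LIBSVM internally, and the kernel and parameters below are placeholders rather than the ones actually used here.

    # Minimal sketch: average test accuracy over repeated random splits.
    # X (n_samples x n_features) and y are assumed to already exist;
    # the SVC parameters are library defaults, not the ones used in the post.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    def average_accuracy(X, y, train_fraction=0.5, runs=100):
        scores = []
        for seed in range(runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=train_fraction, stratify=y, random_state=seed)
            scores.append(SVC().fit(X_tr, y_tr).score(X_te, y_te))
        return np.mean(scores)

    # Sweeping train_fraction from, say, 0.1 to 0.9 traces out a curve
    # like the one in the chart above.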

It can be observed that the larger the training set size in relation to the test set, the better the accuracy. This is not too surprising, since a larger training set intuitively contains more information, allowing better generalizations to be drawn. Note however that the cell classes are distinct enough that even with only a handful of samples (about five to ten) from each class, an overall accuracy of over 70% can still be obtained.

The rightmost datapoint does not actually touch the line x=1, since it was obtained using the leave-one-out method - for each cell image, all the other images were used to train the SVM model, and since there is only one possible combination, no averaging is required. The accuracy achieved is 96.39%.
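A sketch of the leave-one-out run, under the same assumptions as the snippet above (features in X, labels in y, unspecified SVC parameters); collecting the per-sample predictions is also what yields the confusion matrix discussed next.

    # Leave-one-out: each image is predicted by a model trained on all the
    # others. X and y are assumed as before; parameters are placeholders.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneOut
    from sklearn.metrics import confusion_matrix

    def leave_one_out(X, y):
        preds = np.empty_like(y)
        for train_idx, test_idx in LeaveOneOut().split(X):
            model = SVC().fit(X[train_idx], y[train_idx])
            preds[test_idx] = model.predict(X[test_idx])
        return np.mean(preds == y), confusion_matrix(y, preds)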

It might then be instructive to consider the confusion matrix for the leave-one-out case (zeroes omitted):

True class         Misclassified cells
homogeneous        6 (as fine speckled), 1
coarse speckled    1, 1
fine speckled      8 (as homogeneous), 1, 1
nucleolar          1
centromere         3, 2
cytoplasmic        1


From the matrix, six homogeneous cells were incorrectly classified as fine speckled cells (and so on). As can be seen, the bulk of the remaining confusion exists between homogeneous and fine speckled cells, which together account for 14 of the 26 errors. Just for completeness' sake, let us examine these intransigent lads (rescaled to identical size):


For the most part, I frankly can't see s**t, captain


I suppose this shall have to do for now, having run out of time, though some of the misclassifications remain disappointing. This may also serve to illustrate that by far the greatest gains in such tasks (if they are not too complicated) are usually obtained at the beginning, and the final drops of incremental progress (see for instance the Netflix Prize) can be notoriously difficult to squeeze out.

So, what features were used? Once again, there is nothing particularly devious here, with the final collection of 22 features chosen empirically. I suppose attempting to select them via a more principled procedure is possible too, but that will have to wait.

The features used can roughly be put into seven categories:

  1. The (given) overall intensity value; some images are naturally far darker than others (1 feature)
  2. The basic dimensions of the image (3 features)
  3. The statistical distribution of image pixel intensities (3 features)
  4. The texture "roughness" features (3 features)
  5. Composite features about detected distinct, large bright blobs (6 features)
  6. Composite features about other bright blobs (5 features)
  7. Composite features about dark blobs (1 feature)

If anybody's wondering why there's only one dark blob feature, blame my experiments (at the 50-50 split level).
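The exact 22 features won't be listed out here, but to give a flavour of the simpler categories, here is a rough sketch of what categories 1 to 4 might look like for a single grayscale cell image held in a 2-D NumPy array img; the gradient-based "roughness" measures are stand-ins, not necessarily the descriptors actually used.

    # Illustrative only: rough analogues of feature categories 1-4 for one
    # grayscale cell image; the blob-based categories 5-7 are omitted.
    import numpy as np

    def basic_features(img, overall_intensity):
        h, w = img.shape
        gy, gx = np.gradient(img.astype(float))
        roughness = np.hypot(gx, gy)          # gradient magnitude as a
                                              # stand-in texture measure
        return [
            overall_intensity,                # 1: given overall intensity
            h, w, h * w,                      # 2: basic dimensions
            img.mean(), img.std(),            # 3: intensity distribution
            np.median(img),
            roughness.mean(),                 # 4: texture "roughness"
            roughness.std(),
            roughness.max(),
        ]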

So how important is each of these categories of features? Running leave-one-out on various selected subsets of the full feature set (the average accuracies for equal-sized splits are generally a few points lower) gives:

Categories                  Accuracy
{Guessing largest class}    28.85%
{1}                          8.04%
{1,2}                       37.86%
{1,2,3}                     86.27%
{1,2,3,4}                   96.53%*
{1,2,3,4,5}                 95.98%
{1,2,3,4,5,6}               96.39%
{1,2,3,4,5,6,7}             96.39%


*In hindsight, proper transformation of the images coupled with fairly good texture descriptors appears sufficient; however, as there were some particularly bad nucleolar cell misclassifications without the additional features, this did not sit very well with me, and I therefore opted for the larger feature set, which should still sit within the expected error range. Actually, I had also forgotten the basic mean feature of the statistical distribution category at one stage - that cost several percentage points.
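For completeness, a sketch of how such a table could be generated: run the same leave-one-out evaluation on cumulative column subsets of the feature matrix. The column indices per category below are purely illustrative, since the grouping of the 22 columns is not spelled out here.

    # Illustrative column groupings only; reuses leave_one_out() from the
    # earlier sketch to score cumulative subsets of feature categories.
    category_columns = {
        1: [0],                    # overall intensity
        2: [1, 2, 3],              # basic dimensions
        3: [4, 5, 6],              # intensity distribution
        4: [7, 8, 9],              # texture "roughness"
        5: list(range(10, 16)),    # large bright blob composites
        6: list(range(16, 21)),    # other bright blob composites
        7: [21],                   # dark blob feature
    }

    def cumulative_subset_accuracies(X, y):
        cols = []
        for cat in sorted(category_columns):
            cols += category_columns[cat]
            acc, _ = leave_one_out(X[:, cols], y)
            print("categories {1..%d}: %.2f%%" % (cat, 100 * acc))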

All this seems to imply that the problem might be less complex than I had expected - hopefully I have not made any dreadfully embarrassingly gross mistakes in the methodology, but in any case I'll find out in due time, when the official results on the real test set are released. Not too bad a learning experience for two weekends...


