Thursday, Apr 26, 2012 - 00:17 SGT
Posted By: Gilbert

Midweek Update On Data Cells

Since the organizers very kindly extended the deadline again (which I suspect is something of a custom in academic circles), I got another weekend to tinker with the cell data (not Excel!), and here's the overview:



I can't even begin to express my appreciation for datasets which can be quickly processed (which my main line of research, alas, is not), and since LIBSVM was blindingly fast too, I produced the above graph by adjusting the training and test set sizes.

Recall that about a fortnight ago, I reported an accuracy of about 88%, obtained by splitting the entire set into disjoint training and test sets of (approximately) equal sizes, training an SVM model on the training set, computing the accuracy on the test set, and averaging over ten runs. In the above chart, the equivalent accuracy on the updated feature set can be read off by following the line vertically upwards from 0.5 on the x-axis (i.e. the training set size divided by the size of the entire set is 0.5) until it hits the red curve, which gives an improved accuracy of 94.26%, averaged over a hundred runs this time (the accuracy for individual runs tends to lie within ±3%).
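For anyone curious how such a curve can be produced, here is a minimal sketch of the repeated-split evaluation, assuming the features sit in a NumPy array X (one row per cell image) and the class labels in an array y; scikit-learn's SVC wraps LIBSVM internally, and the kernel and parameters below are placeholders rather than the ones actually used here.

    # Minimal sketch: average test accuracy over repeated random splits.
    # X (n_samples x n_features) and y are assumed to already exist;
    # the SVC parameters are library defaults, not the ones used in the post.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    def average_accuracy(X, y, train_fraction=0.5, runs=100):
        scores = []
        for seed in range(runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=train_fraction, stratify=y, random_state=seed)
            scores.append(SVC().fit(X_tr, y_tr).score(X_te, y_te))
        return np.mean(scores)

    # Sweeping train_fraction from, say, 0.1 to 0.9 traces out a curve
    # like the one in the chart above.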

It can be observed that the larger the training set size in relation to the test set, the better the accuracy. This is not too surprising, since a larger training set intuitively contains more information, allowing better generalizations to be drawn. Note however that the cell classes are distinct enough that even with only a handful of samples (about five to ten) from each class, an overall accuracy of over 70% can still be obtained.

The rightmost datapoint does not actually touch the line x=1, since it was obtained using the leave-one-out method - for each cell image, all the other images were used to train the SVM model, and since there is only one possible combination, no averaging is required. The accuracy achieved is 96.39%.
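A sketch of the leave-one-out run, under the same assumptions as the snippet above (features in X, labels in y, unspecified SVC parameters); collecting the per-sample predictions is also what yields the confusion matrix discussed next.

    # Leave-one-out: each image is predicted by a model trained on all the
    # others. X and y are assumed as before; parameters are placeholders.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneOut
    from sklearn.metrics import confusion_matrix

    def leave_one_out(X, y):
        preds = np.empty_like(y)
        for train_idx, test_idx in LeaveOneOut().split(X):
            model = SVC().fit(X[train_idx], y[train_idx])
            preds[test_idx] = model.predict(X[test_idx])
        return np.mean(preds == y), confusion_matrix(y, preds)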

It might then be instructive to consider the confusion matrix for the leave-one-out case (zeroes omitted):

True class         Misclassified cells
homogeneous        6 (as fine speckled), 1
coarse speckled    1, 1
fine speckled      8 (as homogeneous), 1, 1
nucleolar          1
centromere         3, 2
cytoplasmic        1


From the matrix, six homogeneous cells were incorrectly classified as fine speckled cells (and so on). As can be seen, the bulk of the remaining confusion exists between homogeneous and fine speckled cells, which together account for 14 of the 26 errors. Just for completeness' sake, let us examine these intransigent lads (rescaled to identical size):


For the most part, I frankly can't see s**t, captain


I suppose this shall have to do for now, having run out of time, though some of the misclassifications remain disappointing. This may also serve to illustrate that by far the greatest gains in such tasks (if they are not too complicated) are usually obtained at the beginning, and the final drops of incremental progress (see for instance the Netflix Prize) can be notoriously difficult to squeeze out.

So, what features were used? Once again, there is nothing particularly devious here, with the final collection of 22 features chosen empirically. I suppose attempting to select them via a more principled procedure is possible too, but that will have to wait.

The features used can roughly be put into seven categories:

  1. The (given) overall intensity value; some images are naturally far darker than others (1 feature)
  2. The basic dimensions of the image (3 features)
  3. The statistical distribution of image pixel intensities (3 features)
  4. The texture "roughness" features (3 features)
  5. Composite features about detected distinct, large bright blobs (6 features)
  6. Composite features about other bright blobs (5 features)
  7. Composite features about dark blobs (1 feature)

If anybody's wondering why there's only one dark blob feature, blame my experiments (at the 50-50 split level).
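The exact 22 features won't be listed out here, but to give a flavour of the simpler categories, here is a rough sketch of what categories 1 to 4 might look like for a single grayscale cell image held in a 2-D NumPy array img; the gradient-based "roughness" measures are stand-ins, not necessarily the descriptors actually used.

    # Illustrative only: rough analogues of feature categories 1-4 for one
    # grayscale cell image; the blob-based categories 5-7 are omitted.
    import numpy as np

    def basic_features(img, overall_intensity):
        h, w = img.shape
        gy, gx = np.gradient(img.astype(float))
        roughness = np.hypot(gx, gy)          # gradient magnitude as a
                                              # stand-in texture measure
        return [
            overall_intensity,                # 1: given overall intensity
            h, w, h * w,                      # 2: basic dimensions
            img.mean(), img.std(),            # 3: intensity distribution
            np.median(img),
            roughness.mean(),                 # 4: texture "roughness"
            roughness.std(),
            roughness.max(),
        ]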

So how important is each of these categories of features? Running leave-one-out on various selected subsets of the full feature set (the average accuracies for equal-sized splits are generally a few points lower) gives:

Categories                  Accuracy
{Guessing largest class}    28.85%
{1}                          8.04%
{1,2}                       37.86%
{1,2,3}                     86.27%
{1,2,3,4}                   96.53%*
{1,2,3,4,5}                 95.98%
{1,2,3,4,5,6}               96.39%
{1,2,3,4,5,6,7}             96.39%


*In hindsight, proper transformation of the images coupled with fairly good texture descriptors appears sufficient; however, as there were some particularly bad nucleolar cell misclassifications without the additional features, this did not sit very well with me, and I therefore opted for the larger feature set, which should still sit within the expected error range. Actually, I had also forgotten the basic mean feature of the statistical distribution category at one stage - that cost several percentage points.
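For completeness, a sketch of how such a table could be generated: run the same leave-one-out evaluation on cumulative column subsets of the feature matrix. The column indices per category below are purely illustrative, since the grouping of the 22 columns is not spelled out here.

    # Illustrative column groupings only; reuses leave_one_out() from the
    # earlier sketch to score cumulative subsets of feature categories.
    category_columns = {
        1: [0],                    # overall intensity
        2: [1, 2, 3],              # basic dimensions
        3: [4, 5, 6],              # intensity distribution
        4: [7, 8, 9],              # texture "roughness"
        5: list(range(10, 16)),    # large bright blob composites
        6: list(range(16, 21)),    # other bright blob composites
        7: [21],                   # dark blob feature
    }

    def cumulative_subset_accuracies(X, y):
        cols = []
        for cat in sorted(category_columns):
            cols += category_columns[cat]
            acc, _ = leave_one_out(X[:, cols], y)
            print("categories {1..%d}: %.2f%%" % (cat, 100 * acc))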

All this seems to imply that the problem might be less complex than I had expected - hopefully I have not made any dreadfully embarrassingly gross mistakes in the methodology, but in any case I'll find out in due time, when the official results on the real test set are released. Not too bad a learning experience for two weekends...


