Friday, Jan 04, 2013 - 23:43 SGT
Posted By: Gilbert

Food For Thought

Warning: This is a sketch of my thoughts on a problem I found absorbing, which might not exactly be everyone's cup of tea; that stated, let's dive into it.


The intuition behind data compression should be readily graspable - as there often exist patterns, or redundancies, in interesting information, we often do not have to exhaustively mention every detail. To simplify the following discussion, we assume that all data has been converted into bits (as is in fact the case in computers), or collections of zeroes and ones only - for example, the symbol "A" can be represented by "1000001", the symbol "Z" by "1011010", and so on.

It might then be expected that, for an alphabet of 2^N or fewer unique symbols, we can assign each symbol a corresponding bit string of length N. However, a little further thought reveals that one does not actually need N bits for each symbol, which happens to be the basic idea behind the Huffman and other coding/compression schemes.
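As a quick sanity check, the fixed-length scheme is only a couple of lines (a minimal Python sketch; the symbols here are arbitrary):

```python
from math import ceil, log2

def fixed_length_code(symbols):
    """Assign every symbol a distinct bit string of length ceil(log2(n))."""
    n = len(symbols)
    width = ceil(log2(n)) if n > 1 else 1
    return {s: format(i, f"0{width}b") for i, s in enumerate(symbols)}

# five symbols need ceil(log2(5)) = 3 bits apiece
code = fixed_length_code(["A", "B", "C", "D", "E"])
```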

[N.B. Huffman coding probably deserves a short diversion on its own - it works by assigning shorter bit strings to more frequently-occurring symbols. For example, with "e" the most common character in English, a Huffman encoder for general English text could assign it the bit string "1", of length one.

This does however mean that only two of the four possible bit strings of length two - "00" and "01" - remain available, since "10" and "11" would have to be interpreted as "e" upon reading the first bit. This is because symbol boundaries have to be inferred purely from the stream of bits itself, a point often lost on students completely new to the field. A common naïve response would be to designate a bit, such as "0", to be the symbol separator, but this would mean that the N symbols would have to be mapped to the N bit strings of lengths 1 to N composed wholly of "1"s, which is clearly inefficient... or is it?

Perhaps slightly surprisingly, this state of affairs may sometimes correspond to the outcome of Huffman coding. More generally, however, there are cases in the construction of the code where the objects with the lowest probabilities are groups of symbols, rather than single symbols, especially towards the end of the construction procedure, which implies that the shortest bit strings are in a sense "reserved". Intuitively, Huffman coding does not compress by much when all symbols are about equally probable, but this is getting somewhat out of scope already.]
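For the curious, the construction can be expressed quite compactly (a rough Python sketch - it carries partial codebooks on a heap rather than building the usual explicit tree, and the frequencies are made up purely for illustration):

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman code: more frequent symbols get shorter bit strings."""
    tiebreak = count()  # unique tag so tied frequencies never compare dicts
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least probable groups...
        f2, _, c2 = heapq.heappop(heap)
        # ...are merged, prefixing "0" to one side's codes and "1" to the other's
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

code = huffman_code({"e": 12, "t": 9, "a": 8, "o": 7, "i": 7, "n": 6})
```

Note that the result is automatically prefix-free - no symbol's bit string is a prefix of another's - which is exactly what lets symbol boundaries be inferred from the raw bit stream.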

And, uh, back on track (and avoiding the temptation of lossy compression for another day). It seems patently obvious that a universal lossless compression algorithm that is guaranteed to reduce any bit string into a shorter one, and reverse the operation perfectly, cannot exist. The simplest explanation would be that if such an algorithm existed, all data would be (eventually) compressible into a single bit by repeated applications, which is plainly ridiculous.
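The counting behind this pigeonhole argument can even be verified exhaustively for small lengths - there are 2^n bit strings of length n, but only 2^n - 1 strings of length strictly less than n (counting the empty string), so no injective, i.e. lossless, compressor can shorten them all:

```python
from itertools import product

def strings_up_to(n):
    """All bit strings of length strictly less than n (including the empty one)."""
    return [b for k in range(n) for b in map("".join, product("01", repeat=k))]

n = 8
shorter = strings_up_to(n)  # 2^n - 1 candidate outputs
originals = 2 ** n          # 2^n inputs of length exactly n
# a lossless compressor must be injective, so it cannot squeeze
# 2^n inputs into only 2^n - 1 strictly shorter outputs
assert len(shorter) == originals - 1
```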

Ignorance of this concept has drawn the ire of a Mr. Nelson, who issued an open challenge, about a decade ago, to compress a million random digits (that birthed some of the most hilarious reviews ever) - or equivalently, to produce a program that operates on some given data to reproduce the million digits, where the bit string length of program and data is shorter than the bit string length of the million digits.

Now, while this has not been achieved, it is evident (as acknowledged in an update, and probably elsewhere) that the accepters of this challenge do have at least one advantage - their program does not have to be a good general compressor at all. It just has to be (barely) good at one particular (if long) input, and to make it even easier, the length of standard libraries is excluded (which I think should not be the case, if the definition of compression is taken strictly).


What Is Random?


Begs to differ
(Source: dotafire.com)


But why is the challenge starter so confident that such a program does not exist, or is at least very unlikely to? This, I believe, stems from the belief that the data (from RAND, no less) is random, i.e. there are no exploitable patterns within.

This does beg the question of what exactly makes data "random". There are several possible approaches, beginning with some form of circular definition (it is random if it cannot be [significantly?] compressed, it cannot be compressed because it is random). In practice, what is often done is to run the candidate data through a battery of tests (e.g. Diehard), and declare the data random if it passes.
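As a taste of what such tests look like, here is perhaps the simplest of them, a frequency ("monobit") test in the style of the NIST suite - real batteries like Diehard run many far more discerning checks than this:

```python
from math import erfc, sqrt

def monobit_p_value(bits):
    """Frequency (monobit) test: p-value for the ones/zeroes balance."""
    n = len(bits)
    s = sum(1 if b == "1" else -1 for b in bits)  # excess of ones over zeroes
    return erfc(abs(s) / sqrt(2 * n))  # small p-value => suspiciously unbalanced

# perfectly balanced bits sail through this (weak) test...
assert monobit_p_value("01" * 500) > 0.01
# ...while a heavily biased stream fails it
assert monobit_p_value("1" * 900 + "0" * 100) < 0.01
```

Of course, "01" repeated is anything but random - which is precisely why a battery of different tests, not any single one, is needed.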

It should be acknowledged that "randomness" can be a fairly tricky concept, with humans often unable to simulate it well, and it is sometimes nonintuitive to boot (e.g. the sequence of ten coin flips "HHHHHHHHHH" is actually exactly as probable as say "HTTTHTHHTH" for a fair coin).

[N.B. A fun diversion: ability probabilities in DotA are not independently random, in the sense that a Bash with a stated 25% chance to occur per attack does not actually have a true 25% chance each attack.

Rather, the distribution is weighted such that although Bash does happen on 25% of attacks in the long run, multiple consecutive Bashes (or non-Bashes) are less common than might be expected. For example, a true flat 25% per-attack chance would give a 6.25% chance that two consecutive attacks both produce Bashes, whereas with the modified distribution used in practice, the actual chance is only about 0.7%; this is compensated for by long streaks without Bashes likewise occurring less often than expected, reducing the impact of luck (if possibly being slightly exploitable)]
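This weighting is easy to simulate. The sketch below assumes the commonly-described implementation, where the proc chance starts at some constant C and grows by C after every non-proc, resetting on a proc; C ≈ 0.085 for a displayed 25% is a commonly-cited figure, not something I have verified against the game itself:

```python
import random

C = 0.085  # assumed PRD constant for a displayed 25% chance (commonly cited)

def prd_stream(n, c=C, seed=1):
    """Pseudo-random distribution: proc chance grows by c per miss, resets on proc."""
    rng = random.Random(seed)
    k, procs = 1, []
    for _ in range(n):
        hit = rng.random() < k * c
        procs.append(hit)
        k = 1 if hit else k + 1
    return procs

procs = prd_stream(1_000_000)
rate = sum(procs) / len(procs)
pairs = sum(a and b for a, b in zip(procs, procs[1:])) / (len(procs) - 1)
# long-run rate lands near 25%, yet back-to-back procs come up
# noticeably less often than the 6.25% a flat 25% coin would give
```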

Coming back, one oft-stated property of pseudo-random number generators (PRNGs) is that they are periodic, or more precisely that if the generator contains N bits in its internal state, it will repeat itself (forever) after producing no more than 2^N numbers. This makes sense if we consider each state (which includes the seed, if any) to deterministically lead to some other state, and anyhow 2^N is large enough in practice for many purposes even with fairly small N.
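A toy illustration with an 8-bit state (the generator and its constants are made up for demonstration, though they happen to satisfy the classic full-period conditions for linear congruential generators):

```python
def lcg8(state):
    """A toy linear congruential generator with an 8-bit internal state."""
    return (5 * state + 3) % 256

def period(step, seed):
    """Iterate from the seed until a state recurs; return the cycle length."""
    seen = {}
    s, i = seed, 0
    while s not in seen:
        seen[s] = i
        s, i = step(s), i + 1
    return i - seen[s]  # length of the cycle eventually entered

# an N-bit state admits at most 2^N distinct values, hence period <= 2^N
assert period(lcg8, 1) <= 256
```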


Have Some Pi On Us


(Source: flickr.com)


Consider now the case of π=3.14159... . It is known that π is irrational, i.e. its digits have no repeating period (note that this does not by itself imply that all finite digit sequences must eventually occur in π - indeed, the longest known runs of the same digit are relatively short - even if this is in fact suspected to be the case). It is slightly more questionable as to whether it is random, which has, perhaps unexpectedly, remained under debate in recent years.

A 2005 paper by the physicist Fischbach of Purdue claimed that π is a fairly good, if not the best, source of random numbers. The methodology was duly criticized and the conclusion disagreed with by other researchers, however, so it seems reasonable to suppose that the digits of π are about as random as sequences produced by many existing PRNGs, for what it's worth.

What ties this in is that many algorithms for computing π are known, including the fairly recent BBP formula, which remarkably manages to compute a (hexadecimal) digit at a given location in π without having to compute all the digits before it. Then, if π is considered random, each of these may be thought of as a non-periodic PRNG.

[N.B. BBP still takes a longer time the deeper the desired digit is within π, so it's not completely magic. Notably, it was discovered mechanically using the PSLQ integer relation finding algorithm, and only afterwards had a traditional proof constructed for it. And before I forget, let's throw in an intuitive explanation of Euler's formula.]
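For the interested, the digit-extraction trick fits in a few lines - the key is that whole-number parts of the terms can be discarded up front via modular exponentiation, keeping only fractional parts (a rough Python sketch using plain floats, fine for modest positions but not for digits millions deep):

```python
def pi_hex_digit(n):
    """BBP digit extraction: the nth hexadecimal digit of pi after the point."""
    def S(j):
        # fractional part of the sum over k of 16^(n-1-k) / (8k + j)
        total = 0.0
        for k in range(n):  # head terms: non-negative powers, taken modularly
            total = (total + pow(16, n - 1 - k, 8 * k + j) / (8 * k + j)) % 1.0
        k = n  # tail terms: negative powers, shrinking geometrically
        while True:
            term = 16.0 ** (n - 1 - k) / (8 * k + j)
            if term < 1e-17:
                break
            total += term
            k += 1
        return total % 1.0

    frac = (4 * S(1) - 2 * S(4) - S(5) - S(6)) % 1.0
    return int(16 * frac)

# in hexadecimal, pi = 3.243F6A88...
```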

At this point, it could be argued that π is not an ideal PRNG since, well, it's π! It would hardly be random if an observer notes that the first digits are the famous "1,4,1,5,9,2,6,5", and infers that "3,5,8,..." follows, right?

Some counters would be that there is no reason why some other schema for extracting digits of π, such as say advancing five digits before retreating two (among other possibly far more convoluted methods), could not be used, and the inference might not necessarily apply in theory anyway (since the sequence should occur at countless other positions in π, though to be fair this is a weak refutation).

Perhaps most importantly, there is no call to use a π generator over other irrational number generators, which one suspects could be constructed out of arbitrary BBP-like spigot formulae. Fine, they would be kinda slow if many numbers are required, but you get what you pay for.


No Lunch Tabs Accepted



So back to the challenge, the creator does accept that there is no proof that the million seemingly-random digits do not actually comprise some special irrational number that can be described by a relatively compact formula, just that it appears to be unlikely. However, since the compressibility standard is so generous (one single byte less than the original suffices), one might suspect that a small distributional bias (within limits of randomness tests) might be enough - and yet no-one has managed.

To illustrate the difficulty, consider the very simple strategy of identifying sequences of bit strings in arithmetic progression. Let the length of the bit string be L, the address of the first occurrence be A, the (constant) gap between occurrences be G, and the number of repeats be R.

In this scheme, for it to be "profitable" to compress this sequence, it must be that L*R > L+F(A)+F(G)+F(R), where F() represents a reserved fixed length for that variable, for simplicity. Typical values could be F(A)=20 (since 2^20 is just more than a million, this would be enough to represent the address of all the digits), F(G)=12 (assuming we want to search only in jumps of up to 4096 digits) and F(R)=6 (assuming we are more than happy enough to find a progression that holds up to 64 times). If so, we have L*R > L+38, or equivalently L > 38/(R-1).
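Plugging in the numbers (with the field widths assumed above):

```python
F_A, F_G, F_R = 20, 12, 6   # fixed field widths for address, gap and repeats
OVERHEAD = F_A + F_G + F_R  # 38 bits of bookkeeping per compressed sequence

def profit(L, R):
    """Bits saved by replacing R occurrences of an L-bit string with one record."""
    return L * R - (L + OVERHEAD)

assert profit(4, 11) == 2   # eleven repeats of a 4-bit string barely nets 2 bits
assert profit(4, 10) < 0    # one repeat fewer and the scheme loses bits
```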

But do such profitably-compressible sequences exist? Well, let us imagine that the string "1010" is found at some positions A, A+12 and A+24 (so in this case L=4 and G=12); while this may seem promising, it is quickly realised that to make even a two-bit profit, R has to be 11, and the probability of the eight further occurrences is about (1/2^4)^8 = 1/2^32, or less than one in four billion if the bits are indeed (more-or-less) random, as compared to the fewer than four million bits in the given data.
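That probability is quick to check (assuming independent, unbiased bits):

```python
# eight further 4-bit windows must each match "1010": (1/2^4)^8 = 2^-32
p = (1 / 2 ** 4) ** 8
assert p == 2.0 ** -32
assert p < 1 / 4_000_000_000  # less than one in four billion
```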

We could try to skimp on any of F(A), F(G) and F(R), but this would either reduce the number of potential matches, or cap the potential benefit of matches. And if this were not enough, we need to make sufficient "profit" to cover the cost of the code, which crazy optimizations notwithstanding should still run into the thousands of bits. Aww.









Copyright © 2006-2025 GLYS. All Rights Reserved.