Friday, Jan 04, 2013 - 23:43 SGT
Posted By: Gilbert

Food For Thought

Warning: This is a sketch of my thoughts on a problem I found absorbing, which might not exactly be everyone's cup of tea; that stated, let's dive into it.


The intuition behind data compression should be readily graspable - as there often exist patterns, or redundancies, in interesting information, we often do not have to exhaustively mention every detail. To simplify the following discussion, we assume that all data has been converted into bits (as is in fact the case in computers), or collections of zeroes and ones only - for example, the symbol "A" can be represented by "1000001", the symbol "Z" by "1011010", and so on.

It might then be expected that, for an alphabet of 2^N or fewer unique symbols, we can assign each symbol a corresponding bit string of length N. However, a little further thought reveals that one does not actually need N bits for each symbol, which happens to be the basic idea behind the Huffman and other coding/compression schemes.
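As a quick sanity check, the fixed-length scheme is only a couple of lines (a minimal Python sketch; the symbols here are arbitrary):

```python
from math import ceil, log2

def fixed_length_code(symbols):
    """Assign every symbol a distinct bit string of length ceil(log2(n))."""
    n = len(symbols)
    width = ceil(log2(n)) if n > 1 else 1
    return {s: format(i, f"0{width}b") for i, s in enumerate(symbols)}

# five symbols need ceil(log2(5)) = 3 bits apiece
code = fixed_length_code(["A", "B", "C", "D", "E"])
```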

[N.B. Huffman coding probably deserves a short diversion on its own - it works by assigning shorter bit strings to more frequently-occurring symbols. For example, with "e" the most common character in English, a Huffman encoder for general English text could assign it the bit string "1", of length one.

This does however mean that only two of the four possible bit strings of length two - "00" and "01" - remain available, since "10" and "11" would have to be interpreted as "e" upon reading the first bit. This is because symbol boundaries have to be inferred purely from the stream of bits itself, a point often lost on students completely new to the field. A common naïve response would be to designate a bit, such as "0", to be the symbol separator, but this would mean that the N symbols would have to be mapped to the N bit strings of lengths 1 to N composed wholly of "1"s, which is clearly inefficient... or is it?

Perhaps slightly surprisingly, this state of affairs may sometimes correspond to the outcome of Huffman coding. More generally, however, there are cases in the construction of the code where the objects with the lowest probabilities are groups of symbols, rather than single symbols, especially towards the end of the construction procedure, which implies that the shortest bit strings are in a sense "reserved". Intuitively, Huffman coding does not compress by much when all symbols are about equally probable, but this is getting somewhat out of scope already.]
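For the curious, the construction can be expressed quite compactly (a rough Python sketch - it carries partial codebooks on a heap rather than building the usual explicit tree, and the frequencies are made up purely for illustration):

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Build a Huffman code: more frequent symbols get shorter bit strings."""
    tiebreak = count()  # unique tag so tied frequencies never compare dicts
    heap = [(f, next(tiebreak), {s: ""}) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least probable groups...
        f2, _, c2 = heapq.heappop(heap)
        # ...are merged, prefixing "0" to one side's codes and "1" to the other's
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

code = huffman_code({"e": 12, "t": 9, "a": 8, "o": 7, "i": 7, "n": 6})
```

Note that the result is automatically prefix-free - no symbol's bit string is a prefix of another's - which is exactly what lets symbol boundaries be inferred from the raw bit stream.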

And, uh, back on track (and avoiding the temptation of lossy compression for another day). It seems patently obvious that a universal lossless compression algorithm that is guaranteed to reduce any bit string into a shorter one, and reverse the operation perfectly, cannot exist. The simplest explanation would be that if such an algorithm existed, all data would be (eventually) compressible into a single bit by repeated applications, which is plainly ridiculous.
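The counting behind this pigeonhole argument can even be verified exhaustively for small lengths - there are 2^n bit strings of length n, but only 2^n - 1 strings of length strictly less than n (counting the empty string), so no injective, i.e. lossless, compressor can shorten them all:

```python
from itertools import product

def strings_up_to(n):
    """All bit strings of length strictly less than n (including the empty one)."""
    return [b for k in range(n) for b in map("".join, product("01", repeat=k))]

n = 8
shorter = strings_up_to(n)  # 2^n - 1 candidate outputs
originals = 2 ** n          # 2^n inputs of length exactly n
# a lossless compressor must be injective, so it cannot squeeze
# 2^n inputs into only 2^n - 1 strictly shorter outputs
assert len(shorter) == originals - 1
```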

Ignorance of this concept has drawn the ire of a Mr. Nelson, who issued an open challenge, about a decade ago, to compress a million random digits (that birthed some of the most hilarious reviews ever) - or equivalently, to produce a program that operates on some given data to reproduce the million digits, where the bit string length of program and data is shorter than the bit string length of the million digits.

Now, while this has not been achieved, it is evident (as acknowledged in an update, and probably elsewhere) that the accepters of this challenge do have at least one advantage - their program does not have to be a good general compressor at all. It just has to be (barely) good at one particular (if long) input, and to make it even easier, the length of standard libraries is excluded (which I think should not be the case, if the definition of compression is taken strictly).


What Is Random?


Begs to differ
(Source: dotafire.com)


But why is the challenge starter so confident that such a program does not exist, or is at least very unlikely to? This, I believe, stems from the belief that the data (from RAND, no less) is random, i.e. there are no exploitable patterns within.

This does beg the question of what exactly makes data "random". There are several possible approaches, beginning with some form of circular definition (it is random if it cannot be [significantly?] compressed, it cannot be compressed because it is random). In practice, what is often done is to run the candidate data through a battery of tests (e.g. Diehard), and declare the data random if it passes.
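As a taste of what such tests look like, here is perhaps the simplest of them, a frequency ("monobit") test in the style of the NIST suite - real batteries like Diehard run many far more discerning checks than this:

```python
from math import erfc, sqrt

def monobit_p_value(bits):
    """Frequency (monobit) test: p-value for the ones/zeroes balance."""
    n = len(bits)
    s = sum(1 if b == "1" else -1 for b in bits)  # excess of ones over zeroes
    return erfc(abs(s) / sqrt(2 * n))  # small p-value => suspiciously unbalanced

# perfectly balanced bits sail through this (weak) test...
assert monobit_p_value("01" * 500) > 0.01
# ...while a heavily biased stream fails it
assert monobit_p_value("1" * 900 + "0" * 100) < 0.01
```

Of course, "01" repeated is anything but random - which is precisely why a battery of different tests, not any single one, is needed.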

It should be acknowledged that "randomness" can be a fairly tricky concept, with humans often unable to simulate it well, and it is sometimes nonintuitive to boot (e.g. the sequence of ten coin flips "HHHHHHHHHH" is actually exactly as probable as say "HTTTHTHHTH" for a fair coin).

[N.B. A fun diversion: ability probabilities in DotA are not independently random, in the sense that a Bash with a stated 25% chance to occur per attack does not actually have a true 25% chance each attack.

Rather, the distribution is weighted such that although Bash does happen on 25% of attacks in the long run, multiple consecutive Bashes (or non-Bashes) are less common than might be expected. For example, a true flat 25% per-attack chance would give a 6.25% chance that two consecutive attacks both produce Bashes, whereas with the modified distribution used in practice, the actual chance is only about 0.7%; this is compensated for by long streaks without Bashes likewise occurring less often than expected, reducing the impact of luck (if possibly being slightly exploitable)]
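This weighting is easy to simulate. The sketch below assumes the commonly-described implementation, where the proc chance starts at some constant C and grows by C after every non-proc, resetting on a proc; C ≈ 0.085 for a displayed 25% is a commonly-cited figure, not something I have verified against the game itself:

```python
import random

C = 0.085  # assumed PRD constant for a displayed 25% chance (commonly cited)

def prd_stream(n, c=C, seed=1):
    """Pseudo-random distribution: proc chance grows by c per miss, resets on proc."""
    rng = random.Random(seed)
    k, procs = 1, []
    for _ in range(n):
        hit = rng.random() < k * c
        procs.append(hit)
        k = 1 if hit else k + 1
    return procs

procs = prd_stream(1_000_000)
rate = sum(procs) / len(procs)
pairs = sum(a and b for a, b in zip(procs, procs[1:])) / (len(procs) - 1)
# long-run rate lands near 25%, yet back-to-back procs come up
# noticeably less often than the 6.25% a flat 25% coin would give
```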

Coming back, one oft-stated property of pseudo-random number generators (PRNGs) is that they are periodic, or more precisely that if the generator contains N bits in its internal state, it will repeat itself (forever) after producing no more than 2^N numbers. This makes sense if we consider each state (which includes the seed, if any) to deterministically lead to some other state, and anyhow 2^N is large enough in practice for many purposes even with fairly small N.
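A toy illustration with an 8-bit state (the generator and its constants are made up for demonstration, though they happen to satisfy the classic full-period conditions for linear congruential generators):

```python
def lcg8(state):
    """A toy linear congruential generator with an 8-bit internal state."""
    return (5 * state + 3) % 256

def period(step, seed):
    """Iterate from the seed until a state recurs; return the cycle length."""
    seen = {}
    s, i = seed, 0
    while s not in seen:
        seen[s] = i
        s, i = step(s), i + 1
    return i - seen[s]  # length of the cycle eventually entered

# an N-bit state admits at most 2^N distinct values, hence period <= 2^N
assert period(lcg8, 1) <= 256
```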


Have Some Pi On Us


(Source: flickr.com)


Consider now the case of π=3.14159... . It is known that π is irrational, i.e. its digits have no repeating period (note that this does not by itself imply that all finite digit sequences must eventually occur in π - indeed, the longest known runs of the same digit are relatively short - even if this is in fact suspected to be the case). It is slightly more questionable as to whether it is random, which has, perhaps unexpectedly, remained under debate in recent years.

A 2005 paper by the physicist Fischbach of Purdue claimed that π is a fairly good, if not the best, source of random numbers. The methodology was duly criticized and the conclusion disagreed with by other researchers, however, so it seems reasonable to suppose that the digits of π are about as random as sequences produced by many existing PRNGs, for what it's worth.

What ties this in is that many algorithms for computing π are known, including the fairly recent BBP formula, which remarkably manages to compute a (hexadecimal) digit at a given location in π without having to compute all the digits before it. Then, if π is considered random, each of these may be thought of as a non-periodic PRNG.

[N.B. BBP still takes a longer time the deeper the desired digit is within π, so it's not completely magic. Notably, it was discovered mechanically using the PSLQ integer relation finding algorithm, and only afterwards had a traditional proof constructed for it. And before I forget, let's throw in an intuitive explanation of Euler's formula.]
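For the interested, the digit-extraction trick fits in a few lines - the key is that whole-number parts of the terms can be discarded up front via modular exponentiation, keeping only fractional parts (a rough Python sketch using plain floats, fine for modest positions but not for digits millions deep):

```python
def pi_hex_digit(n):
    """BBP digit extraction: the nth hexadecimal digit of pi after the point."""
    def S(j):
        # fractional part of the sum over k of 16^(n-1-k) / (8k + j)
        total = 0.0
        for k in range(n):  # head terms: non-negative powers, taken modularly
            total = (total + pow(16, n - 1 - k, 8 * k + j) / (8 * k + j)) % 1.0
        k = n  # tail terms: negative powers, shrinking geometrically
        while True:
            term = 16.0 ** (n - 1 - k) / (8 * k + j)
            if term < 1e-17:
                break
            total += term
            k += 1
        return total % 1.0

    frac = (4 * S(1) - 2 * S(4) - S(5) - S(6)) % 1.0
    return int(16 * frac)

# in hexadecimal, pi = 3.243F6A88...
```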

At this point, it could be argued that π is not an ideal PRNG since, well, it's π! It would hardly be random if an observer notes that the first digits are the famous "1,4,1,5,9,2,6,5", and infers that "3,5,8,..." follows, right?

Some counters would be that there is no reason why some other schema for extracting digits of π, such as say advancing five digits before retreating two (among other possibly far more convoluted methods), could not be used, and the inference might not necessarily apply in theory anyway (since the sequence should occur at countless other positions in π, though to be fair this is a weak refutation).

Perhaps most importantly, there is no call to use a π generator over other irrational number generators, which one suspects could be constructed out of arbitrary BBP-like spigot formulae. Fine, they would be kinda slow if many numbers are required, but you get what you pay for.


No Lunch Tabs Accepted



So back to the challenge, the creator does accept that there is no proof that the million seemingly-random digits do not actually comprise some special irrational number that can be described by a relatively compact formula, just that it appears to be unlikely. However, since the compressibility standard is so generous (one single byte less than the original suffices), one might suspect that a small distributional bias (within limits of randomness tests) might be enough - and yet no-one has managed.

To illustrate the difficulty, consider the very simple strategy of identifying sequences of bit strings in arithmetic progression. Let the length of the bit string be L, the address of the first occurrence be A, the (constant) gap between occurrences be G, and the number of repeats be R.

In this scheme, for it to be "profitable" to compress this sequence, it must be that L*R > L+F(A)+F(G)+F(R), where F() represents a reserved fixed length for that variable, for simplicity. Typical values could be F(A)=20 (since 2^20 is just more than a million, this would be enough to represent the address of all the digits), F(G)=12 (assuming we want to search only in jumps of up to 4096 digits) and F(R)=6 (assuming we are more than happy enough to find a progression that holds up to 64 times). If so, we have L*R > L+38, or equivalently L > 38/(R-1).
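Plugging in the numbers (with the field widths assumed above):

```python
F_A, F_G, F_R = 20, 12, 6   # fixed field widths for address, gap and repeats
OVERHEAD = F_A + F_G + F_R  # 38 bits of bookkeeping per compressed sequence

def profit(L, R):
    """Bits saved by replacing R occurrences of an L-bit string with one record."""
    return L * R - (L + OVERHEAD)

assert profit(4, 11) == 2   # eleven repeats of a 4-bit string barely nets 2 bits
assert profit(4, 10) < 0    # one repeat fewer and the scheme loses bits
```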

But do such profitably-compressible sequences exist? Well, let us imagine that the string "1010" is found at some positions A, A+12 and A+24 (so in this case L=4 and G=12); while this may seem promising, it is quickly realised that to make even a two-bit profit, R has to be 11, and the probability of the eight further occurrences is about (1/2^4)^8 = 1/2^32, or less than one in four billion if the bits are indeed (more-or-less) random, as compared to the fewer than four million bits in the given data.
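That probability is quick to check (assuming independent, unbiased bits):

```python
# eight further 4-bit windows must each match "1010": (1/2^4)^8 = 2^-32
p = (1 / 2 ** 4) ** 8
assert p == 2.0 ** -32
assert p < 1 / 4_000_000_000  # less than one in four billion
```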

We could try to skimp on any of F(A), F(G) and F(R), but this would either reduce the number of potential matches, or cap the potential benefit of matches. And if this were not enough, we need to make sufficient "profit" to cover the cost of the code, which crazy optimizations notwithstanding should still run into the thousands of bits. Aww.









Copyright © 2006-2025 GLYS. All Rights Reserved.