Friday, October 23, 2009

A silent death of the Horizontal no Free Lunch theorem

The only reply I got to my concerns regarding W. Dembski's and R. Marks's article The Search for a Search - and its Horizontal No Free Lunch theorem - was an email from R. Marks, stating that he'd look into it.
Now, two weeks later, R. Marks has removed the article from his list of publications. The Evolutionary Informatics lab still has an announcement, basically the abstract of the article:
Many searches are needle-in-the-haystack problems, looking for small targets in large spaces. In such cases, blind search can stand no hope of success. Success, instead, requires an assisted search. But whence the assistance required for a search to be successful? To pose the question this way suggests that successful searches do not emerge spontaneously but need themselves to be discovered via a search. The question then naturally arises whether such a higher-level “search for a search” is any easier than the original search. We prove two results: (1) The Horizontal No Free Lunch Theorem, which shows that average relative performance of searches never exceeds unassisted or blind searches. (2) The Vertical No Free Lunch Theorem, which shows that the difficulty of searching for a successful search increases exponentially compared to the difficulty of the original search.
The given link no longer leads to the article, but to their earlier opus Conservation of Information in Search - neither the term vertical nor the term horizontal appears in that article.

In his first podcast with Casey Luskin, W. Dembski promised that this article would be something like a nail in the coffin of evolution. I assume that W. Dembski will explain how he was wrong with this assessment in his next interview...

One if by land, and two if by sea...

Or, in the Intelligent Design version: Two lanterns contain the same information as one lantern, ergo, they come by land...

Thursday, October 22, 2009

Horizontal No Free Lunch Theorem

You keep using that word. I do not think it means what you think it means
Inigo Montoya


In their upcoming paper The Search for a Search: Measuring the Information Cost of Higher Level Search, W. Dembski and R. Marks introduce a Horizontal No Free Lunch Theorem.

At Uncommon Descent, I tried to reach W. Dembski with the following post which is - of course - awaiting moderation at the moment.

The Horizontal No Free Lunch Theorem doesn't work for a search of length m > 1 for a target T ⊆ Ω:

Let Ω be a finite search space and T ⊆ Ω the target - a non-empty subset of Ω. A search is a (finite) sequence of Ω-valued random variables (φ1, φ2, ..., φm). A search is successful if φn ∈ T for some n, 1 ≤ n ≤ m.

I suppose we do agree here. Now, we look at a search Φ as an Ωm-valued random variable, i.e., Φ := (φ1, φ2, ..., φm).

When is it successful? If we are still looking for a T ⊆ Ω we can say that we found T during our search if

Φ ∈ Ωm \ (Ω \ T)m
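This identity is easy to check mechanically for a tiny example; here is a quick sketch in R (my own illustration, with Ω = {1,2,3}, T = {1} and m = 2 chosen arbitrarily):

```r
# Sequences of length m over Omega that hit the target at least once
# are exactly the elements of Omega^m \ (Omega \ T)^m.
Omega <- 1:3; Tgt <- 1; m <- 2
grid <- expand.grid(rep(list(Omega), m))          # all of Omega^m
hit  <- apply(grid, 1, function(s) any(s %in% Tgt))
sum(hit)                                          # 5 sequences hit the target
length(Omega)^m - (length(Omega) - length(Tgt))^m # 5 as well
```

The five hitting sequences are (1,1), (1,2), (1,3), (2,1), (3,1).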

Let's define Θ as the subspace of Ωm which consists of the representations of targets in Ω, i.e.,

Θ := {Ωm \ (Ω \ T)m|T non-empty subset of Ω}

Obviously, Θ is much smaller than the set of all non-empty subsets of Ωm.
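To put numbers on this (my own back-of-envelope calculation, with |Ω| = 5 and m = 3 picked arbitrarily): there are only 2^|Ω| − 1 feasible targets in Θ, one per non-empty T ⊆ Ω, but 2^(|Ω|^m) − 1 non-empty subsets of Ωm:

```r
# Feasible stationary targets vs. arbitrary subsets of Omega^m
n <- 5; m <- 3
2^n - 1   # 31 feasible targets (one per non-empty subset of Omega)
n^m       # 125 points in Omega^m, hence 2^125 - 1 non-empty subsets
```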

But this Θ is the space of feasible targets. And if you take an exhaustive partition of Θ instead of Ωm in Theorem III.1 Horizontal No Free Lunch, you'll find that you can indeed have positive values for the active entropy as defined in the same theorem.

But that's not much of a surprise, as random sampling without repetition works better than random sampling with repetition.

But if you allow T to be any subset of Ωm, your results become somewhat trivial, as you are now looking at m independent searches of length 1 for different targets.

The searches which you state as examples in this paper and the previous one all work with a fixed target, i.e., elements of Θ. You never mention the possibility that the target changes between the steps of the search (one possible interpretation of taking arbitrary subsets of Ωm into account).

So, I'm faced with two possibilities:
  1. You didn't realize the switch from stationary targets to moving ones when you introduced searching for an arbitrary subset of Ωm
  2. You realized this switch to a very different concept, but chose not to stress the point.

Tuesday, October 20, 2009

W. Dembski's and R. Marks's The Search for a Search and my major problem with it

William Dembski and Robert Marks announced a new paper The Search for a Search: Measuring the Information Cost of Higher Level Search (draft) which should appear in a peer-reviewed journal at any moment.

My problem: When I read the term search for a target without any qualifier, I imagine that
  1. the target is fixed
  2. the search is over when the target is found for the first time


A model for this: a game of Battleship against a random generator, with these additional rules:
  1. all ships are of size one
  2. you don't know how many ships there are
  3. the game is over with the first hit


So, if your search space is Ω then there are 2^|Ω| − 1 different targets.

But with this kind of target, the math of W. Dembski's and R. Marks's paper doesn't work: They claim that you can't find a strategy that works better on average than random search. In the Battleship example: there is no better strategy than shooting at random each turn until your first hit. But this is obviously wrong: a better strategy is not to hit a square twice.
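The advantage is easy to quantify for a single one-square ship hidden somewhere on an n-square board (my own illustration): shooting with repetition takes n shots on average (a geometric distribution), shooting without repetition takes (n+1)/2, since the hit time is then uniform on 1..n:

```r
# Expected number of shots to hit one hidden square on an n-square board
n <- 64
shots.with.repetition    <- n            # geometric with p = 1/n
shots.without.repetition <- (n + 1) / 2  # hit time uniform on 1..n
c(shots.with.repetition, shots.without.repetition)  # 64 vs 32.5
```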

But W. Dembski and R. Marks proved their claim - how so? Well, they redefined the concept of target. The altered rules for the Battleship:
  1. the position and the number of the ships change each round
  2. to succeed, you have to hit a ship each round


For this version - and not for the one above - they proved all their results. The problem: that's not search as we know it...

The technical details: In section III, B, they formalize a search
Suppose now Ω denotes an arbitrary search space. In general, assume Ω is a compact metric space with metric D, though for most theorems and calculations assume Ω is finite (in which case D is a discrete metric). Let X1, X2,...,Xm denote Ω-valued random variables that are independent and identically distributed with uniform probability distribution U. X1, X2,...,Xm therefore represents a blind search of Ω. Next, let Y1, Y2,...,Ym denote Ω-valued random variables that are arbitrarily distributed. Y1, Y2,...,Ym therefore characterizes an assisted search of Ω.


That's our first version of Battleship - I've no problem with this section. But the next paragraph states
If we now let Ωm denote the m-fold Cartesian product of Ω, then X1, X2,...,Xm and Y1, Y2,...,Ym are Ωm-valued random variables. Because the uniform probability on Ωm is just the m-fold product measure of U, X~ := (X1, X2,...,Xm) is uniformly distributed on Ω~ = Ωm. Thus, by attending solely to the probabilistic structure of searches, we represent the blind search X~ := (X1, X2,...,Xm) as having an m-dimensional uniform distribution U~ with respect to a metric D~ that suitably extends to the metric D on Ω. Moreover, we represent the assisted search Y~ := (Y1, Y2,...,Ym), in searching for an arbitrary subset T~ of Ω~, as having distribution φ~ on Ω~.
But all these tildes are superfluous. Dropping them, we see that the probabilistic connection between a blind search and an assisted search can be entirely characterized in terms of a search space Ω, a metric D that induces a compact topology on Ω, a target T included in Ω, a uniform probability measure U on Ω induced by D, and an arbitrary probability measure φ on Ω. U represents the probabilistic structure of a blind search, and φ represents the probabilistic structure of an assisted search.


Now, T~ is a subset of Ωm - and there are 2^(|Ω|^m) − 1 non-empty subsets! A successful search means that Y~ ∈ T~ - for a product target T~ = T1×...×Tm, that Yn ∈ Tn for all 1 ≤ n ≤ m. So, a stationary target would not be represented by

Tm

as one may naively expect, but by

T×Ω×...×Ω
∪ Ω×T×Ω×...×Ω
∪ Ω×Ω×T×Ω×...×Ω
∪ ...
∪ Ω×Ω×...×Ω×T

but most of the targets don't fit this pattern.

Of course you can claim that knowing that your target doesn't move is additional information. But I haven't encountered a definition of search which expects you to be right on target each time...

P.S.: I tried to reach R. Marks via email - and I left an entry at W. Dembski's blog...

P.P.S.: some details.

Saturday, October 17, 2009

I wrote a letter to W. Dembski and R. Marks

After reading their article Conservation of Information in Search: Measuring the Cost of Success in some detail, I decided to write to W. Dembski and R. Marks, just to point them to some minor miscalculations - so they have the chance to get at least the math straight in their peer-reviewed paper.

At uncommondescent.com, I tried to draw the attention of William Dembski to two minor errors in your IEEE paper ''Conservation of Information in Search – Measuring the Cost of Success'':

Could you correct the sign errors in equation (27) and on page 1057, left column, for Q?

More problematic is the next line: you speak about the active information per query, but you calculate the active information per generation, as you drop the factor -2 from Q ~ -2log(L)/log(1-2µ) and so reach

I+ ~ L/log(L) * log(1-2µ) instead of
I+ ~ L/(2*log(L)) * log(1-2µ).

So, I+ should be µ*L/ln(L) and not 2*µ*L/ln(L).

(I stumbled upon this when I calculated my approximation I+ ~ µ*L/H_((1-b)L) for the average number of queries Q ~ H_((1-b)L)/µ )

And there is a little problem with Fig. 2: all the paths seem to start with no correct bits at all - while the math in ch. III-2 assumes that half of the bits are correct.

Let L=100, µ=0.00005:

1. if you use b(eta) = .5 (as in your calculation), Q ~ 92,100 and I+ ~ 0.00109. Then you should take other paths for your pic

2. if you use b(eta) = 0 (as in your figure), Q ~ 106,000 and I+ ~ 0.000943

The number in your text is I+ ~ 0.0022. That's wrong in any case.


Yours
Di***** Eb**


Don't get me wrong: I don't want to be just nit-picking. One sign error is no big deal - an even number of sign errors cancels out :-)
And the factor two? It's at least the same order of magnitude. (And no one knows what magnitude is to be expected for active information, so even a factor of 100 would not have been easily spotted.) And I didn't mention all the inaccuracies in their references section....
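For the record, the numbers in the letter can be reproduced from the approximations Q ~ H_((1−b)L)/µ and I+ ~ µ·L/ln(L) (my own quick check; H_n denotes the nth harmonic number, b the initial fraction of correct bits):

```r
# Quick check of the approximations used in the letter, L = 100, mu = 0.00005
harmonic <- function(n) sum(1 / (1:n))
L <- 100; mu <- 0.00005

# b = 0.5: roughly 90,000 queries, I+ per query about 0.00109
Q.half <- harmonic(0.5 * L) / mu
I.half <- mu * L / log(L)

# b = 0: roughly 104,000 queries
Q.zero <- harmonic(L) / mu

round(c(Q.half, I.half, Q.zero), 5)
```

These approximations land close to, but not exactly on, the exact Markov-chain values (92,100 and 106,000) quoted in the letter.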

Monday, October 12, 2009

Code I'm not proud of...

I was asked for the code I used to make my little pics. Well, it's not pretty, it's just a few functions in R. For example:

weasel.matrix.solo <- function(target.length = 5, alphabet.length = 27,
                               mut.prob = 0.04, mut.effect = FALSE) {
  # if mut.effect is TRUE, rescale mut.prob so that every mutation
  # actually changes the letter
  if (mut.effect) mut.prob <- alphabet.length * mut.prob / (alphabet.length - 1)
  M <- matrix(0, nrow = target.length + 1, ncol = target.length + 1)
  for (k in 1:target.length) {
    verschlechtert <- numeric(2 * target.length + 2)   # "worsened" letters
    verbessert <- verschlechtert                       # "improved" letters
    verschlechtert[1:k + target.length + 1] <-
      dbinom(0:(k - 1), k - 1, mut.prob * (alphabet.length - 1) / alphabet.length)
    verbessert[1:(target.length + 2 - k) + target.length + 1] <-
      dbinom(0:(target.length + 1 - k), target.length + 1 - k, mut.prob / alphabet.length)
    for (kk in 1:(target.length + 1)) {
      M[k, kk] <- sum(verbessert[(0:target.length) + target.length + 3 - k] *
                      verschlechtert[(0:target.length) + target.length + 3 - kk])
    }
  }
  M[target.length + 1, target.length + 1] <- 1   # absorbing state: target found
  return(M)
}


This creates a transition matrix for Dawkins's algorithm with a single child. target.length is the length of the target phrase, alphabet.length is self-explanatory, and mut.prob is the probability with which a letter is mutated: if mut.effect==FALSE, such a mutation has no effect with probability 1/alphabet.length, i.e., it's a change to the same letter :-)


weasel.matrix.cum <- function(M, pop.size = 10) {
  target.length <- dim(M)[1] - 1
  M1 <- M
  # cumulative distribution in each transient row
  for (k in 1:target.length) {
    M1[k, ] <- cumsum(M[k, ])
  }
  # distribution of the best of pop.size independent children
  M1 <- M1^pop.size
  # back from cumulative to pointwise probabilities
  for (k in 1:target.length)
    M1[k, ] <- M1[k, ] - c(0, M1[k, -target.length - 1])
  return(M1)
}

creates a transition matrix for a whole population of size pop.size, given a transition matrix M for a single-child process.

The last thing which is needed is this little snippet:
erw.exakt <- function(M, start.vektor) {
  M.size <- dim(M)[1]
  # drop the absorbing state
  M <- M[-M.size, -M.size]
  # fundamental matrix (I - Q)^(-1)
  M <- solve(diag(M.size - 1) - M)
  # expected number of steps from each transient state
  M <- apply(M, 1, sum)
  return(sum(start.vektor[1:(M.size - 1)] * M))
}

Here, the expected number of generations for a transition matrix M is calculated, using an initial distribution start.vektor (mostly Bin(target.length, mut.prob)-distributed - or just the first unit vector).
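The idea behind erw.exakt is the standard fundamental-matrix formula for absorbing Markov chains: the vector of expected steps to absorption from the transient states is (I − Q)^(−1)·1, where Q is the transient part of the transition matrix. A minimal self-contained sketch (a toy chain of my own, not one of the weasel matrices):

```r
# Toy absorbing chain: one transient state that stays put with
# probability 0.5 and is absorbed with probability 0.5.
# Expected steps to absorption = 1/0.5 = 2 (mean of a geometric).
Q <- matrix(0.5, nrow = 1, ncol = 1)      # transient part
steps <- solve(diag(1) - Q) %*% rep(1, 1) # fundamental matrix times 1
steps[1, 1]  # 2
```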

If I ever prettify this code, I'll take Tom English up on his generous offer to put it somewhere on the net.

Another approach

I completed this little project here.

I thought about different ways to present my ideas on W. Dembski's and R. Marks's article Conservation of Information in Search: Measuring the Cost of Success, and tried a few out in the last posts. The section which I'm interested in is the description of three evolutionary algorithms they give on pp. 1055-1057. Here's another approach for III E - G:

Original Text / Annotations
E. Partitioned Search
Partitioned search [12] is a “divide and conquer” procedure best introduced by example.
The name partitioned search seems to be an invention of R. Marks and W. Dembski. The reference is to Dawkins's book The Blind Watchmaker, in which the phrase can't be found. (See Tom English's blog)
Consider the L = 28 character phrase
METHINKS∗IT∗IS∗LIKE∗A∗WEASEL. (19)
This is indeed a phrase which is used in Dawkins's book - in an algorithm with which Dawkins explained the idea of cumulative selection.
Suppose that the result of our first query of L =28 characters is

SCITAMROFNI∗YRANOITULOVE∗SAM. (20)
An example of the egregious humorous skills of Dembski and Marks: backwards, we get:
mas*evolutionary*informatics
That's no problem, as the first phrase of the algorithm can be arbitrarily chosen
Two of the letters {E, S} are in the correct position. They are
shown in a bold font. In partitioned search, our search for these
letters is finished.
At least, that's the way Dembski's and Marks's search works
For the incorrect letters, we select 26 new letters and obtain

OOT∗DENGISEDESEHT∗ERA∗NETSIL. (21)
listen*are*thesedesigned*too
hilarious. And a sign that we don't see the output of an actual program, but something imagined to be a run of their algorithm. BTW, the fitness function would have to encode the positions of the correct letters, and the possible outcomes of the function wouldn't form a totally ordered set, but only a partially ordered one (what's better: METHINKS*IT*ISXXXXXXXXXXXXXX or XXXXXXXXXXXXXX*LIKE*A*WEASEL?). That's at least unusual, and perhaps a reason that no one else uses partitioned search.
Five new letters are found, bringing the cumulative tally of discovered characters to {T, S,E, ∗,E, S,L}. All seven characters are ratcheted into place. The 19 new letters are chosen, and the process is repeated until the entire target phrase is found.
This ratcheting into place is special to this algorithm: the algorithm described in Dawkins's book doesn't show it.
Assuming uniformity, the probability of successfully identifying a specified letter with sample replacement at least once in Q queries is 1 − (1 − 1/N)^Q, and the probability of identifying all L characters in Q queries is
q = (1 − (1 − 1/N)^Q)^L (22)
Yep, that's true. But why not do a little math and give us the expected number of queries? That would be a bit less banal. And it's helpful if you want to compare different algorithms.
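The missing calculation is easy enough: using (22), the expected number of queries is Σ_Q (1 − q(Q)), summed over Q ≥ 0 (my own computation, not from the paper):

```r
# Expected number of queries for partitioned search, N = 27, L = 28:
# success within Q queries has probability q(Q) = (1 - (1 - 1/N)^Q)^L,
# so E[#queries] = sum over Q >= 0 of (1 - q(Q)).
N <- 27; L <- 28
Q <- 0:10000                  # the tail beyond 10,000 is negligible
q <- (1 - (1 - 1/N)^Q)^L
EQ <- sum(1 - q)
round(EQ)                     # roughly a hundred queries
```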
For the alternate search using purely random queries of the entire phrase, a sequence of L letters is chosen. The result is either a success and matches the target phrase, or does not. If there is no match, a completely new sequence of letters is chosen. To compare partitioned search to purely random queries, we can rewrite (5) as
p = 1 − (1 − (1/N)^L)^Q (23)
yup.
For L = 28 and N = 27 and moderate values of Q, we have p << q, corresponding to a large contribution of active information. The active information is due to knowledge of partial solutions of the target phrase. Without this knowledge, the entire phrase is tagged as “wrong” even if it differs from the target by one character.
So, how big is this active information? For p, it was calculated in section III, A as I+(p) = log(Q), and using the same approximation, we get I+(q) = L log(Q) (that's only true-ish for small values of Q and large alphabets...)
The enormous amount of active information provided by partitioned search is transparently evident when the alphabet is binary. Then, independent of L, convergence can always be performed in two steps. From the first query, the correct and incorrect bits are identified. The incorrect bits are then complemented to arrive at the correct solution. Generalizing to an alphabet of N characters, a phrase of arbitrary length L can always be identified in, at most, N − 1 queries. The first character is offered, and the matching characters in the phrase are identified and frozen in place. The second character is offered, and the process is repeated. After repeating the procedure N − 1 times, any phrase characters not yet identified must be the last untested element in the alphabet.
Wow, the hangman's game. In detail.
Partitioned search can be applied at different granularities.
We can, for example, apply partitioned search to entire words
rather than individual letters. Let there be W words each with
L/W characters each. Then, partitioned search probability of
success after Q queries is
pW = (1 − (1 − N^(−L/W))^Q)^W. (24)
What's that all about? Imagine an alphabet of 32 letters - including {A,B,...,Z,*} and our weasel-phrase. Then the phrase could also be encoded by 28 5-bit words. One 5-bit word is only correct, if all 5 bits are correct. Therefore, we get the same expressions for N=32, L=28, W=28 and N=2, L=140 and W=28.
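The claimed equivalence is easy to verify numerically (my own check of equation (24), with an arbitrary Q):

```r
# Eq. (24): pW = (1 - (1 - N^(-L/W))^Q)^W.
# A 32-letter alphabet with 28 one-letter words and a binary alphabet
# with 28 five-bit words give identical per-word hit probabilities:
# 32^(-1) = 2^(-5) = 1/32.
pW <- function(N, L, W, Q) (1 - (1 - N^(-L / W))^Q)^W
Q <- 1000
pW(32, 28, 28, Q)   # one 32-letter word per position
pW(2, 140, 28, Q)   # five bits per word: the same value
```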
Equations (22) and (23) are special cases for W = L and W = 1. If N^(−L/W) << 1, we can make the approximation pW ≈ Q^W N^(−L), from which it follows that the active information is
I+ ≈ W log2 Q. (25)
With reference to (6), the active information is that of W individual searches: one for each word.
So, for W = L, we get I+(q) = L log(Q). My interpretation: virtually, this algorithm provides you with L fitness functions, one for each letter, each indicating whether that letter is correct or not. BTW: I think it is surprising how the fitness functions get ignored throughout the whole article...

Sunday, October 11, 2009

Evolutionary Algorithms in Conservation of Information in Search

work in progress :-)

In their paper Conservation of Information in Search: Measuring the Cost of Success, William A. Dembski, Senior Member, IEEE, and Robert J. Marks II, Fellow, IEEE present three algorithms which could loosely be described as evolutionary. Idiosyncratically, they are labeled
  1. Partitioned Search
  2. Optimization by Mutation, and
  3. Optimization by Mutation With Elitism
These procedures are discussed in section III, parts E and F (pp. 1055-1057), and some additional math is done in the appendix.

The algorithms and the examples given



The three algorithms are introduced or illustrated with an example, for which the parameters can be found in this table:
Algorithm | class | alphabet | length of target | mutation | fitness function
Partitioned Search | none | {*,A,B,...,Z} (27 el.) | 28 | every incorrect letter is mutated | position of correct letters
Optimization by Mutation | (1,2) ES | {0,1} (2 el.) | 100 | each letter is flipped with probability µ=0.00005 | number of correct letters
Optimization by Mutation With Elitism | (1+1) ES | {0,1} (2 el.) | 131* | exactly one letter is flipped | number of correct letters

*Why 131 bits for the last algorithm? I could understand 133 ~ 28 · log2 27, but 131?
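For what it's worth, the 133 is just the information content of a 28-character phrase over a 27-letter alphabet (my own arithmetic):

```r
# Bits needed to specify 28 characters from a 27-letter alphabet
28 * log2(27)   # about 133.1 bits - which is why 131 is puzzling
```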

References


Four references are given in the section under discussion:
  • [3] M. Aigner, Discrete Mathematics. Providence, RI: Amer. Math. Soc., 2007.
  • [12] R. Dawkins, The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe Without Design. New York: Norton, 1996.
  • [32] D. J. C.MacKay, Information Theory, Inference and Learning Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2002.
  • [34] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd ed. New York: McGraw-Hill, 1991, pp. 537–542.
The first one is unproblematic - though superfluous: it is given as a reference for the Lth harmonic number. The three other references are either problematic or at least indicate sloppy work:

1. Partitioned search [12] is a “divide and conquer” procedure best introduced by example. Richard Dawkins doesn't use the expression "partitioned search" in his book.

2. When μ << 1, the search is a simple Markov birth process [34] that converges to the target. The page numbers of the reference don't fit: it should be pp. 647-650, not pp. 537–542. But at least, page numbers are given!

3. To illustrate, consider a single-parent version of a simple mutation algorithm analyzed by McKay [32]. David MacKay explicitly analyzes multi-parent algorithms; there is not much of a resemblance to Dembski's and Marks's procedure. And annoyingly, page numbers are again omitted. MacKay's book has over 600 pages; only a few, in Chapter III, 19: Why have Sex? Information Acquisition and Evolution, are dedicated to evolutionary algorithms.

But one has to be grateful for small things: At least these references weren't garbled like
[7] S. Christensen and F. Oppacher, “What can we learn from no free lunch? A first attempt to characterize the concept of a searchable,” in Proc. Genetic Evol. Comput., 2001, pp. 1219–1226.

Of course, it's about the concept of a searchable function. Not the first time R. Marks and W. Dembski got that wrong.
Ann.: Tom English explained that errors in references are common in the engineering literature. So, I'll stop classifying the references of Dembski's and Marks's article as either superfluous or erroneous for now.

Saturday, October 10, 2009

I'm annoyed.

Perhaps I'm easily annoyed. But every time I read the paper Conservation of Information in Search - Measuring the Cost of Success, I get, well, pretty distressed. I wanted to write a concise rebuke of this article, but you can't be concise: there is so much in it that is disturbing, often little things. Take for instance this innocuous little sentence
To illustrate, consider a single-parent version of a simple mutation algorithm analyzed by McKay [32].
The reader may be interested in McKay's analysis of a simple mutation algorithm, and follows the reference:
D. J. C.MacKay, Information Theory, Inference and Learning Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2002.
Thankfully, David J.C. MacKay (he seems to prefer this spelling) put his book online. And yes, it is a book, a textbook with over 600 pages. It covers a wide range, so at the very least, Dembski and Marks should have given the pages which they thought to be relevant. Which they thought to be relevant is a key phrase, because skimming the book doesn't turn up anything resembling an analysis of a simple mutation algorithm like the one in the article. Yes, there is Chapter III, 19: Why have Sex? Information Acquisition and Evolution - but this is explicitly about algorithms with multiple parents; it just doesn't fit a single-parent version. MacKay assumes, for instance, that the fitness of the parents is normally distributed, something you just can't have for a single parent. So why do Marks and Dembski give MacKay as a reference? I have no idea.

And it's going on


Just checking the next reference: [34] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd ed. New York: McGraw-Hill, 1991, pp. 537–542 - for this little piece of marginal information
When μ << 1, the search is a simple Markov birth process [34] that converges to the target.
You may say: at least they are giving a page number. But it's the wrong one! Take out your 3rd edition of Papoulis's very good introduction to probability theory, and you'll find Markov birth processes on pp. 647-650. Why can't they get at least the details right?

Saturday, October 3, 2009

Which parameter should Dembski have used?

In the previous post I talked about Dawkins's weasel algorithm as understood by virtually everyone. Even William Dembski and Robert Marks describe something similar in their paper Conservation of Information in Search: Measuring the Cost of Success. But it's not the algorithm Partitioned Search introduced on p. 1055 (and claimed by William Dembski to represent Dawkins's weasel), it's an algorithm called Random Search on p. 1056. The differences from Dawkins's weasel:
  1. the alphabet consists of two letters, not 27.
  2. the target length is 100, not 28
  3. the size of a generation is quite low: it's two
  4. the mutation probability per letter is very small: µ = 0.00005
The values S=2 for the population size and µ = 0.00005 for the effective mutation rate are chosen to be able to do a little math - but they result in quite a large number of generations (~55,000) and queries (~110,000) on average.
Which values would a real programmer have used? Fooling around with his program, tweaking the parameters S and µ a little bit, he would soon have found out that he should alter the original values considerably:

The greater the size of the population, the greater the optimal mutation rate.


The smallest number of queries (1576) on average is achieved with S = 9 and µ = 0.009. Any practitioner would have come up with values near these, as they make the program considerably faster. Sadly, these values don't allow for an application of Dembski's and Marks's elegant estimation of the number of correct letters as a function of generations.

What's interesting: While Dawkins's weasel had a range of population sizes which preferred virtually the same mutation probability, this doesn't happen in this realization. We'll see that this effect depends mainly on the size of the alphabet...

Friday, October 2, 2009

Which parameters did Dawkins use for his weasel?

If R. Dawkins used a weasel algorithm as understood by virtually everyone who read his book (and tried to program it), he has two parameters to play with:
  1. the size of the population S and
  2. the probability µ to change a letter
So, which values did he use? That, I can't answer. But I can say which values he should have used:

Obviously, the greater the size of the population for a fixed µ, the smaller the number of generations it takes to find the target. On the other hand, for a fixed population size, a similar rule for µ can't be given: one probably wants to keep the parent in the next generation, so µ has to be quite small for small populations, and can be bigger for large ones.

Here, the coloured lines represent the expected number of generations for different population sizes as a function of µ. At each curve, I added the minimal expected number - and I joined these minima by a black line. So, a population size of 200 results at best in 49 generations on average - and this minimum is reached for µ ~ 0.044.

But if you want to have a fast algorithm, you'd look at the number of queries, not the number of generations, i.e., the total number of children, or how often the fitness function has to be calculated:

(Yep, I like this pic :-) Here, you can see that for the best results, the size of the population is surprisingly small: 20 individuals. But though it takes over 325 generations on average to reach the target, the fitness value has to be calculated for only ~6514 individuals - that is, if you choose µ ~ 0.041.
Another interesting effect: for population sizes from 20 to 200, the optimal µ doesn't change very much, it's just a little bit over 0.04.

Of course, no one does these calculations before running the program: you write your weasel and play with it. But the natural behaviour of this algorithm will lead you to population sizes from 20 to 200, and a µ of 0.04 - 0.05.

Sunday, September 27, 2009

The Original Weasels?

W. Dembski introduces two programs in his post The original weasels at his blog Uncommon Descent. These programs were sent to him by someone called Oxfordensis, and W. Dembski states
These are by far the best candidates [for being Dawkins's original program] we have received to date.

Let's have a short look:

Weasel 1


Here's how this program works for the target "METHINKS IT IS LIKE A WEASEL":
  1. Choose a random string
  2. copy the string 100 times, each time changing exactly one letter at random
  3. select the string with the most correct letters in place
  4. is the number of correct letters 28? Then STOP. Otherwise:
  5. goto 2
The mutation procedure is interesting: While most programs I've seen mutate the parent string by changing each letter with a certain probability, here, exactly one letter is changed - but I've seen implementations of this version, too. The letter-based mutation seems to be more natural, while the other one has mathematical advantages: the neighbourhood of a string s, i.e., the strings s can mutate into, becomes small: it isn't the whole set of size 27^28 ≈ 1.20 · 10^40 anymore, but only of size 27 · 28 = 756...
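The two neighbourhood sizes compare as follows (my own arithmetic):

```r
# Strings over a 27-letter alphabet, length 28
27^28     # all strings: about 1.2 * 10^40
27 * 28   # strings reachable by mutating exactly one letter: 756
```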
Another advantage: you only have to call your random generator twice for each child, while a straight-forward implementation of the letter-mutating process takes (1+μ)·28 of these costly calls per child - though this number can be reduced to (1+28·μ) calls per child....
So, one parameter less to tweak: though the number of children is set to 100 in this program, this can be changed easily.

But which value to choose for the remaining parameter - the size of the population? The bigger the population, the fewer generations (black line) it takes to complete the search. So, make it big? No, not necessarily: commonly, you want your program to be fast. Therefore, you should try to reduce the number of queries (blue line), i.e., the number of mutated strings - which is also the number of evaluations of the fitness function. Mutating at random and calculating the fitness function: that's what takes time.
I suppose if you fool around with the program for a while, you'll quickly find out that the optimal generation size in the sense above is about 50 - and an exact calculation leads to an average of 3931 queries for a population size of 49 as the optimum (σ = 648.04).

Weasel 2


Well, this beast is boring. BORING. Look how it works for the target "METHINKS IT IS LIKE A WEASEL":
  1. Choose a random string
  2. make a new string by copying the old string once, changing exactly one letter at random
  3. choose from the old and new string the one with the most correct letters in place
  4. is the number of correct letters 28? Then STOP. Otherwise:
  5. goto 2

It's just a randomized version of the hangman game, taking 2969 steps on average (σ = 959.2). And it's what W. Dembski and R. Marks call Optimization by Mutation With Elitism in their current paper:
3) Optimization by Mutation With Elitism: Optimization by
mutation with elitism is the same as optimization by mutation
in Section III-F2 with the following change. One mutated
child is generated. If the child is better than the parent, it
replaces the parent. If not, the child dies, and the parent
tries again. Typically, this process gives birth to numerous
offspring, but we will focus attention on the case of a single
child. We will also assume that there is a single bit flip per
generation.

Conclusions


Could these be the programs used by R. Dawkins in his book The Blind Watchmaker? The first one would fit his algorithm, and with a population size of 100, it would take 47.6 generations on average (σ = 11.7). And what about the BBC video? If the second version was used for this, I'd regard it at best as a visual effect, as it doesn't match Dawkins's algorithm. And yes, it is (implicitly) latching :-)

Wednesday, September 23, 2009

Random Mutation

In their paper Conservation of Information in Search: Measuring the Cost of Success, William Dembski and Robert Marks present an algorithm which resembles Dawkins's weasel (the algorithm I talked about before - at length :-)
It isn't the algorithm Partitioned Search introduced on p. 1055, it's the algorithm Random Mutation on p. 1056. The differences:
  1. the alphabet consists of two letters, not 27.
  2. the target length is 100, not 28
  3. the size of a generation is quite low: it's two
  4. the mutation probability per letter is very small: µ = 0.00005
Many programmers have implemented weasel-like programs, and the runs I've seen usually use a µ between 0.04 and 0.05, while the generation size is greater than 20.

So, why do W. Dembski and R. Marks use their values? Well, in the appendix of their paper, they calculate an approximation for the number of correct letters as a function of the number of generations. This approximation discards terms quadratic in µ, so for it to apply, not only must µ << 1 hold, but the size of a generation has to be small, too.

But should they use these values? I don't think so: As usual, I modeled their algorithm as a Markov chain and could calculate some exact values. The expected number of generations to get to the target for their choice of parameters is 55,500. (see pic 1)

Some fooling around with their program should have shown that this number can be reduced dramatically by using a more appropriate value for µ. The optimum lies near µ ~ 0.0005, which takes only 10,600 generations.
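A quick simulation illustrates the point. This is a minimal sketch of the Random Mutation setup (two-letter alphabet, target length 100, two children per generation), run with the faster µ = 0.0005; I assume - my reading of the paper - that the fitter of the two children always becomes the next parent, with no elitism:

```python
import random

L = 100           # target length
MU = 0.0005       # per-letter mutation probability (the faster value)
TARGET = [1] * L  # any fixed bit string will do as target

def mutate(parent, rng):
    # flip each bit independently with probability MU
    return [1 - b if rng.random() < MU else b for b in parent]

def random_mutation_search(rng=random, max_gens=300_000):
    parent = [rng.randrange(2) for _ in range(L)]
    for gen in range(1, max_gens + 1):
        # two children per generation; the fitter child is the new parent,
        # even if it is worse than the old parent (no elitism)
        children = [mutate(parent, rng), mutate(parent, rng)]
        parent = max(children, key=lambda c: sum(a == b for a, b in zip(c, TARGET)))
        if parent == TARGET:
            return gen
    return None
```

On average a run needs on the order of 10,000 generations with this µ - compared to roughly 55,000 with the published µ = 0.00005.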


And though their approximation doesn't work as impressively for this µ as it does for the value ten times smaller, the numbers are not more than 2.5% off. (see pic 2)

Saturday, September 19, 2009

A New Weasel



At Uncommon Descent, W. Dembski presents us with a new weasel: someone by the name of Oxfordensis programmed two Pascal versions years ago.
The interesting aspect: in contrast to the usual weasels - as discussed in the earlier posts - this program doesn't change every letter with a certain probability, but changes exactly one letter in each child. Here, the word change is used loosely, as with a probability of 1/27 no letter is changed at all.
Mathematically, this is easier to handle: the transition matrix of the corresponding Markov process gets sparser...
And you have only one parameter which can be changed: the size of the generation.
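A sketch of this one-letter mutation scheme (a hypothetical reimplementation, not Oxfordensis's Pascal code): the best child becomes the new parent even if it is worse than the old parent - which is exactly why the algorithm can show changes to the worse:

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # 27 symbols
TARGET = "METHINKS IT IS LIKE A WEASEL"

def fitness(s):
    return sum(a == b for a, b in zip(s, TARGET))

def one_letter_weasel(pop_size, rng=random):
    parent = [rng.choice(ALPHABET) for _ in range(len(TARGET))]
    generations = 0
    while fitness(parent) < len(TARGET):
        children = []
        for _ in range(pop_size):
            child = parent[:]
            # exactly one position is redrawn; with probability 1/27
            # the new letter equals the old one, so nothing changes
            child[rng.randrange(len(TARGET))] = rng.choice(ALPHABET)
            children.append(child)
        parent = max(children, key=fitness)  # best child, even if worse than parent
        generations += 1
    return generations
```

The generation size is indeed the only knob to turn here.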
If I had any reader he may ask himself: Does this algorithm latch? Well, the pic provides the answer: it depends :-)
And here are some values for the probability that the program will show a change to the worse during a run of the program, depending on the size of the population:

  population  10: 99.999 %
  population  20: 98.485 %
  population  30: 83.879 %
  population  40: 55.944 %
  population  50: 30.008 %
  population  60: 14.071 %
  population  70:  6.168 %
  population  80:  2.649 %
  population  90:  1.137 %
  population 100:  0.491 %
  population 110:  0.214 %
  population 120:  0.094 %

So, even with a generation size of one hundred children, one in two hundred runs will show that this algorithm doesn't latch - it's not what W. Dembski and R. Marks describe as a partitioned search.

Wednesday, August 26, 2009

On Oracles and Weasels

It's once again about Richard Dawkins's weasel algorithm. In his book The Blind Watchmaker, Dawkins describes his algorithm in the following way:

It again begins by choosing a random sequence of 28 letters, just as before:

WDLTMNLT DTJBKWIRZREZLMQCO P

It now 'breeds from' this random phrase. It duplicates it repeatedly, but with a certain chance of random error - 'mutation' - in the copying. The computer examines the mutant nonsense phrases, the 'progeny' of the original phrase, and chooses the one which, however slightly, most resembles the target phrase, METHINKS IT IS LIKE A WEASEL. [...] the procedure is repeated.


In my previous posts, I explained how this can be transformed into a program:

  1. Choose a random string
  2. Copy the string n times with mutation. NOTE: at this step you don't know which letters are correct in place, so no letter is safe from being mutated!
  3. Choose the best fitting string. NOTE: best fitting seems to be generally understood as the string with the most correct letters; the fitness is expressed by a number between 0 and 28
  4. Stop if the number of correct letters is 28, otherwise
  5. goto 2
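The five steps above, sketched in Python - the population size n = 50 and the per-letter mutation probability p = 0.04 are plausible assumptions of mine, not values given by Dawkins:

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
TARGET = "METHINKS IT IS LIKE A WEASEL"

def fitness(s):
    return sum(a == b for a, b in zip(s, TARGET))

def mutate(s, p, rng):
    # every letter may mutate - no letter is safe (see the NOTE in step 2)
    return [rng.choice(ALPHABET) if rng.random() < p else c for c in s]

def weasel(n=50, p=0.04, rng=random):
    parent = [rng.choice(ALPHABET) for _ in range(len(TARGET))]
    generations = 0
    while fitness(parent) < len(TARGET):
        offspring = [mutate(parent, p, rng) for _ in range(n)]
        parent = max(offspring, key=fitness)  # the best child - even if worse
        generations += 1
    return generations
```

With these parameters a run typically finishes after one to two hundred generations.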
Some people (including W. Dembski, I think) are baffled by step 3: seemingly, the program has to know the best string and can extract information about it. Therefore, I restate this algorithm slightly:

  1. Choose a random string
  2. Copy the string n times with mutation.
  3. Ask the oracle which string it likes best
  4. Stop if the number of correct letters is 28, otherwise
  5. goto 2

Evaluating a fitness function for an algorithm is like questioning an oracle. Even Robert Marks's and William Dembski's web site says so - though they don't think it through.

So, how does the oracle work in Dawkins's case? You present it with a list of strings and it just tells you which string is best ("The computer examines the mutant nonsense phrases, the 'progeny' of the original phrase, and chooses the one which, however slightly, most resembles the target phrase, METHINKS IT IS LIKE A WEASEL."). You don't have to know how the oracle deems one string better than another: it could prefer the string with the most letters correctly in place, or the one which shares the longest substring with its (hidden) target, or perhaps it looks at how many letters are right from the beginning of the phrase, preferring METHINKS IT IS FROGLIKE XXX to BETHINKS IT IS LIKE A WEASEL, or it could prefer correct vowels to correct consonants; it could be a combination of these on Fridays, and something else on Saturdays...
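Such a black-box oracle is easy to model: it keeps its criterion to itself and reveals nothing but the index of the winning string. A sketch (all names hypothetical):

```python
def make_oracle(target, score=None):
    # the scoring rule is hidden inside the closure - callers never see it
    if score is None:
        # default criterion: letters correct in place
        score = lambda s: sum(a == b for a, b in zip(s, target))
    def oracle(candidates):
        # the only information flowing out: which candidate is best
        return max(range(len(candidates)), key=lambda i: score(candidates[i]))
    return oracle

oracle = make_oracle("METHINKS IT IS LIKE A WEASEL")
best = oracle(["BETHINKS IT IS LIKE A WEASEL",
               "METHINKS IT IS FROGLIKE XXXX"])
```

With the letters-in-place rule the first string wins (27 correct letters against 15); swap in a longest-correct-prefix rule via the score argument and the verdict flips - without the caller learning anything about how the oracle decides.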

Could Dembski's algorithm work with such an oracle? No. Here is his algorithm:
  1. Choose a random string
  2. Ask the oracle which letters are in place
  3. If all letters are in place: STOP. Otherwise:
  4. Mutate the letters which are not in place
  5. goto 2
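A sketch of this partitioned search - note how much chattier its oracle has to be: it reveals the positions of all correct letters, not just a winner:

```python
import random

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
TARGET = "METHINKS IT IS LIKE A WEASEL"

def position_oracle(s):
    # this oracle leaks far more information: the set of correct positions
    return {i for i, (a, b) in enumerate(zip(s, TARGET)) if a == b}

def partitioned_search(rng=random):
    phrase = [rng.choice(ALPHABET) for _ in range(len(TARGET))]
    queries = 1
    correct = position_oracle(phrase)
    while len(correct) < len(TARGET):
        for i in range(len(TARGET)):
            if i not in correct:          # letters in place are ratcheted
                phrase[i] = rng.choice(ALPHABET)
        correct = position_oracle(phrase)
        queries += 1
    return queries
```

On average this needs only about 105 queries - the price is the extra information handed out with every answer.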


What Dembski exploits is knowledge about the inner workings of the oracle: knowing how it works allows for a deterministic approach - the hangman game - which terminates in a few steps...

White Screen & Denise O'Leary

My edits at Uncommon Descent obviously aren't welcome anymore. What a pity: there are now two active threads over there discussing Dawkins's and Dembski's weasel algorithms (as described earlier in A Story of Two Weasels, The Latching Weasel, and Dawkins's weasel: it isn't that bad).

Denise O'Leary asked:

Make the code for the program public. Perhaps Richard Dawkins himself or his friends at RichardDawkins.net can finally provide this code (apparently a program written in BASIC).

I tried to enlighten her. After all, the program is now over 20 years old, and may be rotting in a heap of 5 1/4" floppies...

Dawkins describes his algorithm in the following way:

It again begins by choosing a random sequence of 28 letters, just as before:

WDLTMNLT DTJBKWIRZREZLMQCO P

It now 'breeds from' this random phrase. It duplicates it repeatedly, but with a certain chance of random error - 'mutation' - in the copying. The computer examines the mutant nonsense phrases, the 'progeny' of the original phrase, and chooses the one which, however slightly, most resembles the target phrase, METHINKS IT IS LIKE A WEASEL. [...] the procedure is repeated.

This is enough to replicate his program:
1. Choose a random string
2. Copy the string n times with mutation. NOTE: at this step you don't know which letters are correct in place, so no letter is safe from being mutated!
3. Choose the best fitting string. NOTE: best fitting seems to be generally understood as the string with the most correct letters; the fitness is expressed by a number between 0 and 28
4. Stop if the number of correct letters is 28, otherwise
5. goto 2

The parameters which you can choose are the number of copies n and the probability p that a letter in a string is mutated. You may even choose another procedure of mutation - but keep in mind the NOTE of step 2.

It's really basic to realize steps 1 - 5 in the programming language of your choice...

While the edit above made it through Uncommon Descent's gates, I'm not allowed to post the follow-up:

Just to elaborate: the interesting part is what Atom describes as an oracle, i.e., the application of the fitness function. In Dawkins's description we read:

The computer examines the mutant nonsense phrases, the 'progeny' of the original phrase, and chooses the one which, however slightly, most resembles the target phrase

An oracle - a black box - which accomplishes this just points to one string when presented with a population of strings. No further information has to be exchanged. It is not necessary to know how the oracle determines the best string: it could simply point to the one with the greatest number of correct letters - or perhaps the one which shares the longest substring with the target phrase.


And that's rather a pity: W. Dembski is a guy who is interested in information, and in the amount of information which is transferred from an environment into an algorithm. He should be interested in the flow of this information: how is it exchanged?
The answer: via the conversation with the oracle. Could his partitioned search algorithm work with an oracle as described above? No, as it is not enough to name the best string - the positions of the correct letters have to be named, too.
Does Dawkins's algorithm work with such an oracle? Obviously...

"Your comment is awaiting moderation...."

Perhaps I should be grateful that I'm not blocked at Uncommon Descent. Update: Perhaps I should have been grateful that I was not blocked at Uncommon Descent.
But it's rather annoying to wait a couple of hours to get your comment through moderation while a discussion is ongoing: my edit contributed at 2:52 am today (Aug 26, 2009) has still not appeared over six hours later, though fifteen other comments appeared in the meantime. What is so hard about moderating the following?
I took the string
SCITAMROFN*IYRANOITULOVE*SAM
and calculated a next generation using Dawkins's algorithm with populations of 10, 50, and 100 - and mutation rates of .04, .05, and .1. The tenth string in the list is the second generation given in the paper of Marks and Dembski. The differences from the first generation are in bold face:

1. SCITAMROFN*IYRANOIEULOVE*SAM
2. SCITAMROFN*IYRANOITULOGE*SAM
3. ECITAMRI*N*IYZANOITULOVE*SAM
4. SCITAMROFN*IYRANOITUL*VE*SAM
5. SCITAMROFN*IYRANOITULOVE*SEM
6. SCITAMOOLNOIYRAMOITULOVE*SEM
7. SCITANROFN*IYYANOITULOVE*SAM
8. SCITIMROFN*JYRANOITULOVE*SAM
9. SCITAMROFN*ICRHNOITSLOWE*SAV
10. OOT*DENGISEDESEHT*ERA*NETSIL

Can anyone spot a difference in the design of the strings? Anyone? KF? Anyone?

I'll stay tuned :-)

Update: Now I've two comments waiting in the queue:
232 - DiEb - 08/26/2009 - 9:39 am Your comment is awaiting moderation.
I try to get involved in the discussion, but my last edit (#213) is now in moderation for nearly seven hours…


Update: After eight hours, the comment got through. But I seem to be blocked from now on....

Tuesday, August 25, 2009

The Latching Weasel

I described Dawkins's ''weasel algorithm'' in a previous post. Some claim that in the process the correct letters are ''explicitly or implicitly'' latched. That's not true: the number of correct letters isn't necessarily monotonically non-decreasing - a change to the worse from one generation to the next at least once during a run of the program is possible.
I calculated the probabilities for two cases: a population of ten and a population of fifty - both times with a mutation rate of 4%.

10: 95.7 %
50: 0.0000026 %

With a small population, the effect will most probably be observed. But even with the bigger population, a programmer may spot the event - if he is lucky enough.

Why did I choose these populations? A population of ten seems to have been used by R. Dawkins in his video of 1987, while a population size of fifty was probably used for his book in 1986.
In the first case, it takes 1306 generations on average (σ=924); in the second, 139 generations (σ=48). So it is quite probable to observe such a reversal, as seen in the video, too. For the bigger population, such an occurrence is at least more likely than winning a lottery...

Here is a complete view for various rates of mutation ( 1% - 10%) and populations from 1 to 100:

Monday, August 24, 2009

Dawkins's weasel - it isn't that bad...


Richard Dawkins writes about his weasel algorithm:
Although the monkey/Shakespeare model is useful for explaining the distinction between single-step selection and cumulative selection, it is misleading in important ways.

But it is useful in yet another way: it helps to illustrate the importance of a sensible choice of the rate of mutation!
I calculated the expected number of generations for ten different rates of mutation per letter: 1%, 2%,..., 10%. As the graph - and the table - shows, the algorithm works best with a moderate rate of 4% or 5%.

       size of the population
            25      50     100
  1 %      534     274     145
  2 %      335     175      96
  3 %      280     148      83
  4 %      263     139      79
  5 %      266     140      79
  6 %      296     146      82
  7 %      434     159      88
  8 %     1394     179      96
  9 %    15313     229     107
 10 %  398,256     429     123


That's not much of a surprise, but it contrasts with the algorithm W. Dembski declared to be the weasel: his version works best with a maximal rate of mutation...

A story of two weasels

In his book ''The Blind Watchmaker'', Richard Dawkins presents a small program which became known as ''Dawkins's Weasel''. He wanted to show the mechanisms of mutation and selection. William Dembski has criticized Dawkins's program, claiming to have done so in his Conservation of Information in Search: Measuring the Cost of Success: though he doesn't name Dawkins in this essay, in his blog he states:

In it we critique, for instance, Richard Dawkins METHINKS*IT*IS*LIKE*A*WEASEL (p. 1055). Question: When Dawkins introduced this example, was he arguing pro-Darwinism? Yes he was. In critiquing his example and arguing that information is not created by unguided evolutionary processes, we are indeed making an argument that supports ID.

But does he even use the algorithm which Richard Dawkins had proposed? This is very doubtful...

Description of the algorithms


Here is the lengthy description which Dawkins gave in his book ''The Blind Watchmaker'':

So much for single-step selection of random variation. What about cumulative selection; how much more effective should this be? Very very much more effective, perhaps more so than we at first realize, although it is almost obvious when we reflect further. We again use our computer monkey, but with a crucial difference in its program. It again begins by choosing a random sequence of 28 letters, just as before:

WDLTMNLT DTJBKWIRZREZLMQCO P

It now 'breeds from' this random phrase. It duplicates it repeatedly, but with a certain chance of random error - 'mutation' - in the copying. The computer examines the mutant nonsense phrases, the 'progeny' of the original phrase, and chooses the one which, however slightly, most resembles the target phrase, METHINKS IT IS LIKE A WEASEL. In this instance the winning phrase of the next 'generation' happened to be:

WDLTMNLT DTJBSWIRZREZLMQCO P

Not an obvious improvement! But the procedure is repeated, again mutant 'progeny' are 'bred from' the phrase, and a new 'winner' is chosen. This goes on, generation after generation. After 10 generations, the phrase chosen for 'breeding' was:

MDLDMNLS ITpSWHRZREZ MECS P

After 20 generations it was:

MELDINLS IT ISWPRKE Z WECSEL

By now, the eye of faith fancies that it can see a resemblance to the target phrase. By 30 generations there can be no doubt:

METHINGS IT ISWLIKE B WECSEL

Generation 40 takes us to within one letter of the target:

METHINKS IT IS LIKE I WEASEL

And the target was finally reached in generation 43. A second run of the computer began with the phrase:

Y YVMQKZPFfXWVHGLAWFVCHQXYOPY,

passed through (again reporting only every tenth generation):

Y YVMQKSPFTXWSHLIKEFV HQYSPY

YETHINKSPITXISHLIKEFA WQYSEY

METHINKS IT ISSLIKE A WEFSEY

METHINKS IT ISBLIKE A WEASES

METHINKS IT ISJLIKE A WEASEO

METHINKS IT IS LIKE A WEASEP

and reached the target phrase in generation 64. In a third run the computer started with:

GEWRGZRPBCTPGQMCKHFDBGW ZCCF

and reached METHINKS IT IS LIKE A WEASEL in 41 generations of selective 'breeding'.

And this is Dembski's weasel-algorithm:

Partitioned search is a "divide and conquer" procedure best introduced by example. Consider the L = 28 character phrase

METHINKS ∗ IT ∗ IS ∗ LIKE ∗ A ∗ WEASEL. (19)

Suppose that the result of our first query of L =28 characters is

SCITAMROFN ∗ IYRANOITULOVE ∗ SAM. (20)

Two of the letters {E, S} are in the correct position. They are shown in a bold font. In partitioned search, our search for these letters is finished. For the incorrect letters, we select 26 new letters and obtain

OOT ∗ DENGISEDESEHT ∗ ERA∗NETSIL. (21)

Five new letters are found, bringing the cumulative tally of discovered characters to {T, S,E, ∗,E, S,L}. All seven characters are ratcheted into place. The 19 new letters are chosen, and the process is repeated until the entire target phrase is found.

Well, at least it is short.

Obvious differences



  1. While Dawkins mentioned a couple of strings (it duplicates it repeatedly), Dembski only mentions one string.
  2. The mutation rate of Dawkins's example seems to be much lower than Dembski's, at least judging from the first two generations:
    Dawkins:
    1) WDLTMNLT*DTJBKWIRZREZLMQCO*P
    2) WDLTMNLT*DTJBKSIRZREZLMQCO*P

    Dembski:
    1) SCITAMROFN*IYRANOITULOVE*SAM
    2) OOT*DENGISEDESEHT*ERA*NETSIL

    (changes in bold font)
  3. Dawkins gives the number of generations for three experiments: 43, 64, and 41. The expected number of generations G, i.e., E(G), for Dembski's algorithm is 104.55 ( = Σ_{k≥0} (1 − (1 − (1 − 1/N)^k)^L) ), where N is the size of the alphabet and L the number of letters of the phrase, i.e., N = 27, L = 28.
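The sum converges quickly, so E(G) is easy to check numerically - each letter behaves like an independent geometric trial with success probability 1/N per query, and G is the maximum over the L letters:

```python
# expected number of queries E(G) for partitioned search:
# E(G) = sum over k >= 0 of P(G > k), where
# P(G > k) = 1 - (1 - (1 - 1/N)^k)^L
N, L = 27, 28
EG = sum(1 - (1 - (1 - 1 / N) ** k) ** L for k in range(5000))
# EG comes out near 104.55
```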


Claimed differences


Dembski has always stressed that in his algorithm a letter once found is ''latched'' or ''ratcheted''. And the example strings given in Dawkins's book don't exclude that ''latching'' is happening. But Dawkins has stated repeatedly that in his algorithm latching doesn't take place.

Conclusion


Even this superficial comparison of the two algorithms shows that they are very different: so to say, two kinds of weasels.