What do you mean it’s not a valid statistical test!?

Yesterday I briefly mentioned (in the “Rapid fire ID news” segment) a post by “niwrad” on Uncommon Descent about the apparent dissimilarities between the human and chimpanzee genomes. I was wary about commenting on it, since I have little statistical training and didn’t want to make a fool of myself more than I usually do, but a post by Joe Felsenstein over on The Panda’s Thumb about the analysis “niwrad” conducted has given me the confidence (read: basic understanding) to address its very fundamental flaw.

There’s no point going into detail about the various statistical tests you can perform on genomes to determine sequence similarity – I don’t know enough about them and most of you probably don’t want to read about advanced statistics. But the problem with what “niwrad” did is very straightforward and can be illustrated simply with an example.

In the well-known paper Initial sequence of the chimpanzee genome and comparison with the human genome, by The Chimpanzee Sequencing and Analysis Consortium (Nature, Vol. 437, 1 September 2005, doi:10.1038/nature04072), the researchers found that

Single-nucleotide substitutions occur at a mean rate of 1.23% between copies of the human and chimpanzee genome

meaning that for the parts of the chimpanzee and human genomes that are directly comparable, the differences between them are, on average, 1.23% (adding in insertions and deletions raises the differences to about 4%). The chimpanzee genome and the human genome are approximately 99% identical for large-scale conserved sequences.

The analysis performed by “niwrad”, however, finds something different:

On average the overall 30BPM similarity, when all chromosomes are taken into consideration, is approximately 62%.

Whaaaat? 62% similarity? 38% difference? How can this be? Something must be wrong here.

Indeed, something is wrong, and it’s all to do with the “30BPM” technique that “niwrad” uses. Instead of randomly picking single nucleotides from one genome and seeing how often they don’t match the nucleotide in the same position on the other genome, “niwrad” used 1000 30-nucleotide-long sequences to compare between both genomes. Why is this a problem?

Consider two sequences of 100 playing cards each, Deck A and Deck B. They are almost identical, except that B differs from A at 10 single-card positions. Now, if you did a test and chose a number of single cards to compare between the two decks, on average you would find the difference between A and B to be 10%, the true difference (A and B share 90 cards out of 100), because 10% of the single-card samples would not match the card in the same position in the other deck.

However, if you increased the size of the samples used from single cards to 10 cards, then the reported difference between the decks A and B would increase to higher than 10%. This is because you’re now measuring whether or not larger sequences match, and if a single mismatched card happens to be in that sequence then the whole sequence is deemed different even though only one of the cards was different – the other 9 cards were identical between the decks.

This effect becomes even more pronounced as the sample sequences increase in size – in fact, if you chose a sample sequence length of 100 cards then the test would conclude that the decks A and B are 100% different! Using sample lengths larger than one clearly produces artificially high results on the differences between two sequences.
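For anyone who wants to see the card example in numbers, here is a minimal Python sketch. It is my own illustration, not the procedure “niwrad” ran: the function name `window_difference`, the particular mismatch positions, and the choice to check every aligned window (rather than a random sample of them) are all assumptions made to keep the example short and repeatable.

```python
def window_difference(deck_a, deck_b, k):
    """Fraction of aligned length-k windows that are not identical."""
    n = len(deck_a)
    starts = range(n - k + 1)  # every possible window start position
    mismatched = sum(deck_a[s:s + k] != deck_b[s:s + k] for s in starts)
    return mismatched / len(starts)


# Deck A: 100 distinct "cards". Deck B: the same, except at 10 positions.
# These positions are arbitrary, chosen only for illustration.
deck_a = list(range(100))
deck_b = list(deck_a)
for pos in (3, 17, 28, 41, 42, 59, 66, 74, 88, 95):
    deck_b[pos] = -1  # a card that appears nowhere in Deck A

# Report the measured difference for growing window sizes.
for k in (1, 10, 30, 100):
    print(f"window size {k:>3}: {window_difference(deck_a, deck_b, k):.0%} different")
```

With a window size of 1 the sketch reports the true 10% difference, and with a window size of 100 it reports 100%, even though the decks themselves never change. The sizes in between land somewhere above 10%, with the exact figure depending on where the mismatches happen to fall.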

This was the problem with “niwrad”’s analysis. Instead of single nucleotide sample sequences, 30-nucleotide-long sequences were used, which produced a higher calculated difference between the human and chimpanzee genomes than the real difference – 38% as opposed to the real value of 1.23%.
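A rough back-of-envelope calculation shows how large the effect can be. It rests on simplifying assumptions of my own: that each nucleotide differs independently with probability d = 0.0123 (the substitution rate quoted above), that insertions and deletions are ignored, and that a window of k = 30 nucleotides counts as a match only if all 30 agree. Under those assumptions the chance that a whole window matches is

$$
P(\text{window matches}) = (1 - d)^{k} = (1 - 0.0123)^{30} \approx 0.69
$$

So even this crude model predicts that only about 69% of 30-nucleotide windows would match exactly, and insertions and deletions, which it leaves out, would push the figure lower still. That is far closer to the 62% that “niwrad” reported than to 99%, without the underlying per-nucleotide difference being anything other than 1.23%.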

Intelligent design proponents and creationists can’t get enough of attempts to discredit the idea of common descent, even though none of those attempts has ever been successful. While I don’t know if “niwrad” botched the test on purpose, it’s still shoddy statistics backing up the rejection of a well-established fact in mammalian genomics.