## Wednesday, September 17, 2014

### Baez on Information Geometry

I think I've struck some gold here concerning information geometry in a series by John Carlos Baez.

Start with part 8, where Baez gets into the relationship to evolution. He begins with a reminder about thermodynamic models:
> Physicists love to think about systems that take only a little information to describe. So when they get a system that takes a lot of information to describe, they use a trick called 'statistical mechanics', where you try to ignore most of this information and focus on a few especially important variables. For example, if you hand a physicist a box of gas, they'll try to avoid thinking about the state of each atom, and instead focus on a few macroscopic quantities like the volume and total energy. Ironically, the mathematical concept of information arose first here—although they didn't call it information back then; they called it 'entropy'. The entropy of a box of gas is precisely the amount of information you've decided to forget when you play this trick of focusing on the macroscopic variables. Amazingly, remembering just this—the sheer amount of information you've forgotten—can be extremely useful... at least for the systems physicists like best.
He goes on to say that in biology there is much less information in the system that can be safely forgotten. This ties back to the different uses of "entropy" for different kinds of information, such as the (average) loss of uncertainty in Shannon information. He then talks about alleles as rival hypotheses.
> The analogy is mathematically precise, and fascinating. In rough terms, it says that the process of natural selection resembles the process of Bayesian inference. A population of organisms can be thought of as having various 'hypotheses' about how to survive—each hypothesis corresponding to a different allele. (Roughly, an allele is one of several alternative versions of a gene.) In each successive generation, the process of natural selection modifies the proportion of organisms having each hypothesis, according to Bayes' rule!
It appears that this approach measures information as a distance from a destination state of stability, so in that sense it is relative information.
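The quoted analogy can be checked in a few lines of arithmetic: one generation of the discrete replicator dynamic is the same computation as a Bayes update, with current allele frequencies as the prior and each allele's fitness playing the role of a likelihood. A minimal sketch (the frequencies and fitness values here are invented for illustration):

```python
import numpy as np

# Current allele frequencies act as the Bayesian prior.
p = np.array([0.5, 0.3, 0.2])

# Fitness of each allele plays the role of the likelihood
# (these numbers are illustrative, not from Baez's post).
f = np.array([1.2, 1.0, 0.8])

# One generation of the discrete replicator dynamic:
# p_i' = p_i * f_i / (mean fitness)
replicator_step = p * f / np.dot(p, f)

# The same arithmetic as Bayes' rule: posterior ∝ prior × likelihood.
bayes_posterior = p * f / np.sum(p * f)

assert np.allclose(replicator_step, bayes_posterior)
print(replicator_step)  # fitter alleles gain frequency
```

The mean fitness in the denominator is exactly the normalizing constant (the "evidence") of Bayes' rule, which is what makes the two updates coincide.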
> But what does all this have to do with information? . . . first discovered by Ethan Atkin. Suppose evolution as described by the replicator equation brings the whole list of probabilities $p_i$ — let's call this list $p$ — closer and closer to some stable equilibrium, say $q$. Then if a couple of technical conditions hold, the entropy of $q$ relative to $p$ keeps decreasing, and approaches zero. Remember what I told you about relative entropy. In Bayesian inference, the entropy of $q$ relative to $p$ is how much information we gain if we start with $p$ as our prior and then do an experiment that pushes us to the posterior $q$. So, in simple rough terms: as it approaches a stable equilibrium, the amount of information a species has left to learn keeps dropping, and goes to zero! . . . You can find [precise details] in Section 3.5, which is called "Kullback-Leibler Divergence is a Lyapunov function for the Replicator Dynamic". . . . 'Kullback-Leibler divergence' is just another term for relative entropy. 'Lyapunov function' means that it keeps dropping and goes to zero. And the 'replicator dynamic' is the replicator equation I described above. . . . [This approach] uses information geometry to make precise the sense in which evolution is a process of acquiring information.
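This Lyapunov behavior is easy to see numerically. Below is a sketch, not taken from Baez's post: a Hawk-Dove game (payoffs chosen so the interior equilibrium is $q = (1/2, 1/2)$), integrated with Euler steps, where the relative entropy of $q$ from the current population $p(t)$ only ever decreases:

```python
import numpy as np

# Hawk-Dove payoff matrix (V=2, C=4); both strategies earn equal
# payoff at the interior equilibrium q = (1/2, 1/2).
A = np.array([[-1.0, 2.0],
              [0.0, 1.0]])
q = np.array([0.5, 0.5])

def kl(q, p):
    """Relative entropy (Kullback-Leibler divergence) of q from p."""
    return np.sum(q * np.log(q / p))

p = np.array([0.9, 0.1])   # start far from equilibrium
dt = 0.01
divergences = []
for _ in range(2000):
    f = A @ p                      # fitness of each strategy
    p = p + dt * p * (f - p @ f)   # Euler step of the replicator equation
    divergences.append(kl(q, p))

# KL(q || p(t)) acts as a Lyapunov function: it never increases.
assert all(a >= b for a, b in zip(divergences, divergences[1:]))
print(divergences[0], divergences[-1])
```

The "amount of information the species has left to learn," kl(q, p), dwindles toward zero as the population approaches the equilibrium mix.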
Baez offers some background to this in Gavin E. Crooks' *Measuring Thermodynamic Length* and in part 1 of his series.
> But when we’ve got lots of observables, there’s something better than the variance of each one. There’s the covariance matrix of the whole lot of them! Each observable $X_i$ fluctuates around its mean value $x_i$… but these fluctuations are not independent! They’re correlated, and the covariance matrix says how.
>
> All this is very visual, at least for me. If you imagine the fluctuations as forming a blurry patch near the point $(x_1, \dots, x_n)$, this patch will be ellipsoidal in shape, at least when all our random fluctuations are Gaussian. And then the shape of this ellipsoid is precisely captured by the covariance matrix! In particular, the eigenvectors of the covariance matrix will point along the principal axes of this ellipsoid, and the eigenvalues will say how stretched out the ellipsoid is in each direction!

As I recall, the eigenvalues give the variance of the fluctuations along each principal axis, so they come in the squared units of the parameters; their square roots are the spread in the parameters' own units.
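This ellipsoid picture can be verified directly. A quick sketch (the mean and covariance here are invented): draw Gaussian fluctuations, estimate the covariance matrix from the samples, and read off the principal axes and their variances from its eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up covariance: fluctuations stretched along the direction (1, 1).
true_cov = np.array([[2.0, 1.5],
                     [1.5, 2.0]])
mean = np.array([10.0, 20.0])

# Gaussian fluctuations around the mean form an ellipsoidal blur.
samples = rng.multivariate_normal(mean, true_cov, size=100_000)
est_cov = np.cov(samples, rowvar=False)

# Eigenvectors = principal axes of the ellipsoid;
# eigenvalues = variance along each axis (squared parameter units).
eigvals, eigvecs = np.linalg.eigh(est_cov)
print(eigvals)   # close to 0.5 and 3.5, the eigenvalues of true_cov
print(eigvecs)   # columns near ±(1,-1)/√2 and ±(1,1)/√2
```

The long axis of the blur lies along $(1, 1)$ with variance about 3.5, the short axis along $(1, -1)$ with variance about 0.5, matching the correlated off-diagonal entries of the covariance matrix.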