# Population analysis I

Having recorded the genomes for 128 organisms per generation for 1920 generations (nearly a quarter of a million genomes), I thought it might be interesting to look for some patterns in how genes emerge and spread through a population.

## Cells of the final generation

I first considered ways to visualise a population of cells from a single generation. I thought the final generation would be most interesting since its cells are most evolved. In the image below, each circle represents a cell from the final generation. Cells are ordered by fitness (left to right, top to bottom). Cell fitness is also represented by the area of each circle. I attempted to show the similarity of cell proteomes by the circle colour. To this end, I calculated the distance between every pair of cells by comparing the number of each type of protein they had. I coloured the two most different cells pure blue and pure green, then gave every other cell a greenness and blueness based on its relative distance from these two extreme cells.

The problem with this graph is that I'm not really sure what it shows. By the 1920th generation, the population is probably quite homogeneous and unsurprisingly the most different organisms are those that are least fit. I suspect they have large chunks of their genomes duplicated or have odd genes that don't do very much, hence the difference. I quite like the colours though.
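For anyone wanting to try the same colouring scheme, here is a minimal sketch of how it could work. It is not the code I used. It assumes each cell is summarised as a `Counter` of protein types, that the distance between two cells is the Manhattan distance between their counts, and that closeness to an extreme cell maps linearly onto a colour channel. The names `distance` and `colour_cells` are made up for this illustration.

```python
from collections import Counter
from itertools import combinations


def distance(a: Counter, b: Counter) -> int:
    """Manhattan distance between two protein-count tables (an assumed metric)."""
    return sum(abs(a[k] - b[k]) for k in set(a) | set(b))


def colour_cells(cells):
    """Return one (red, green, blue) tuple per cell, each channel in 0..1."""
    # Find the two most different cells; they become pure blue and pure green.
    i_blue, i_green = max(
        combinations(range(len(cells)), 2),
        key=lambda pair: distance(cells[pair[0]], cells[pair[1]]),
    )
    d_max = distance(cells[i_blue], cells[i_green])
    if d_max == 0:
        # Every cell is identical, so there is nothing to distinguish.
        return [(0.0, 1.0, 1.0)] * len(cells)

    colours = []
    for cell in cells:
        # The closer a cell is to an extreme cell, the more of that colour it gets.
        blue = 1 - distance(cell, cells[i_blue]) / d_max
        green = 1 - distance(cell, cells[i_green]) / d_max
        colours.append((0.0, green, blue))
    return colours


# Three toy cells with hypothetical protein names.
cells = [Counter({"A": 3}), Counter({"A": 1, "B": 2}), Counter({"B": 5})]
colours = colour_cells(cells)
```

With this mapping the two extreme cells always come out as pure blue and pure green, and everything else lands somewhere in between.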

## Cells of the first generations

Since the fitness of cells increases most in the first 200 generations then plateaus, I decided it might be more interesting to look at the early generations when novel, useful mutations start to appear. I also decided to restrict my analysis to the fittest cells, since the least fit cells generally have fatal or useless mutations.

The graph below shows the fittest 16 cells of the first 16 generations. Again, cell area represents relative fitness and colour represents how similar cells are. In the first generation there is one cell that is significantly more successful than the others, which are so small as to be almost invisible. This mutation spreads through the population, so by generation 7 all the cells in the top 16 are at least as fit. The next big evolutionary breakthrough is at generation 14, which then begins to spread. In this graph the colours do seem to highlight different species of cell reasonably well (e.g. the dark blue species that takes hold in generation 13, but loses its top spot the very next generation).

## Colouring cells with PCA

I was pretty pleased with how the last graph turned out, but still wasn't happy with the colours. The two most extreme cells seemed like a somewhat arbitrary basis for colouring all the cells, especially as they were only so different because they had weird mutations which weren't representative of the population as a whole. The range of colours was also limited: a cell couldn't be identical to both extreme cells at once, so it could never have maximum blue and green values together.

Over two years later, I again considered the problem of converting a genome with a variable number of different genes into two values (degrees of blue and green), in such a way that the similarity of the colours was related to the similarity of the genomes, and realised that principal component analysis (PCA) could help. PCA allows you to convert a set of vectors into a set of vectors with fewer dimensions, while keeping as much of the variation between them as possible. If I could represent an organism's genome by a vector (list of numbers), then I could reduce that vector to two dimensions and use those to colour the cell.

## Converting genomes to vectors

For this analysis, I used the fittest 64 organisms (half the population) from the first 200 generations, so had 12,800 genomes to consider. I converted the genomes into vectors by first going through all these genomes and getting a list of all the different protein functions encoded and how often they occurred. I used protein functions rather than genes, so that mutations with no effect on function were ignored. I found 2286 different protein functions encoded in the genome set, which is quite surprising given how limited the chemistry is. I filtered these down to a more manageable 329 by removing any function that occurred fewer than 10 times. The most common function was the F-driven EHase, which occurred 40,835 times, so more than 3 times per cell.
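The filtering described above, together with the counting step that turns each genome into a vector, can be sketched roughly as follows. This is an illustration rather than my original code: it assumes each genome has already been reduced to a list of protein-function names, and the function name `genomes_to_vectors` and the toy function names in the example are made up.

```python
from collections import Counter


def genomes_to_vectors(genomes, min_count=10):
    """Turn genomes (lists of protein-function names) into count vectors.

    Returns the list of kept functions and one vector per genome, where
    entry j is the number of copies of functions[j] in that genome.
    """
    # Count how often each protein function occurs across all genomes.
    totals = Counter()
    for genome in genomes:
        totals.update(genome)

    # Keep only functions that occur at least min_count times overall,
    # ordered from most to least common.
    functions = [f for f, n in totals.most_common() if n >= min_count]
    index = {f: j for j, f in enumerate(functions)}

    vectors = []
    for genome in genomes:
        vector = [0] * len(functions)
        for f in genome:
            j = index.get(f)
            if j is not None:  # rare functions were filtered out above
                vector[j] += 1
        vectors.append(vector)
    return functions, vectors


# Toy example with a lower threshold so that something survives the filter.
toy_genomes = [["EHase", "EHase", "pump"], ["EHase", "pump", "rare"]]
functions, vectors = genomes_to_vectors(toy_genomes, min_count=2)
```

In the toy example, `rare` only occurs once so it is dropped, leaving a two-dimensional vector per genome.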

Each genome was converted into a vector of 329 dimensions by counting the number of copies of each of the 329 most frequent proteins it contained. This resulted in a 12,800 x 329 matrix of organisms, which I then subjected to PCA; this took a few minutes. The first two principal components were then extracted, so that each organism was converted into a vector of two dimensions and then into a colour made of blues and greens. The result is shown below, with each organism filling a 2x2 pixel square.
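A minimal sketch of the PCA colouring step, assuming NumPy is available and the count matrix has at least two rows and two columns. It is not the code I used at the time. Here PCA is done with a singular value decomposition of the centred matrix, and the choice of which component drives green and which drives blue is arbitrary. The name `pca_colours` is made up for this example.

```python
import numpy as np


def pca_colours(counts):
    """Map each row of a count matrix to an RGB colour via its first two PCs."""
    X = np.asarray(counts, dtype=float)
    X = X - X.mean(axis=0)  # centre each column before PCA

    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ vt[:2].T  # shape: (n_organisms, 2)

    # Rescale each component to the range 0..1 so it can be used as a channel.
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero
    scaled = (scores - lo) / span

    rgb = np.zeros((X.shape[0], 3))
    rgb[:, 1] = scaled[:, 0]  # first component -> green (arbitrary choice)
    rgb[:, 2] = scaled[:, 1]  # second component -> blue (arbitrary choice)
    return rgb


# Toy matrix: 4 organisms x 3 protein functions.
toy_counts = [[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 5]]
rgb = pca_colours(toy_counts)
```

The sign of each principal component is arbitrary, so a different library or run could swap which end of the population comes out bright and which comes out dark. The pattern of similar organisms getting similar colours is unaffected.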

I think this image quite nicely shows successful mutations appearing at the top of each generation, and then flowing through the population. The colours seem to work quite well and there is a clear, gradual change over time. I would like to see what it looked like with a population of two species, but I'm not sure I can model enough organisms to make that stable.