I think the reason I'm interested in artificial evolution is that I want a record of every stage of evolution so I can trace the changes and see how complex systems can be built up. However, I've found that even with a complete record of every genome in a system, it can be quite a challenge to untangle evolutionary pathways. I hope that in trying to create programs to analyse the information from my artificial evolution, I might create something that can be used to analyse evolution in the real world.
These pages describe how my simulation works. I have also written about older versions of it:
I've started new cell simulations several times now and got quite far each time before hitting a wall. I've decided to start yet again, building up from the simplest building blocks and recording my progress on this blog (mainly so I can remember how everything works). I'll also be using Git and putting my code for this project on GitHub for all to see and edit.
To keep my program flexible and simple to use, I've tried to write a program module that contains only very simple, relatively self-explanatory commands; these commands call functions in imported modules where all the computations are hidden (essentially an API). For the most part, I'll explain only the commands in the top-level module; if you want to see more detail then you can look at the code on Github.
The overall aim of this simulation is to evolve networks: networks of reactions (i.e. metabolic pathways) and networks of transcription factors and repressors (i.e. regulatory pathways). In biology we look at these as complete systems, so I think it will be interesting to see how such complex interlocking networks can arise bit-by-bit through evolution. I also hope this will be a more open-ended simulation. Below are some of the hopes I have for evolution in my simulation and how they might be encouraged.
In contrast to my previous evolution simulations, I want this one to simulate evolution by natural selection. In all my previous simulations of evolution there has been a specific measure of fitness (a fitness function). For example, in my simulation of heterocysts, cells were selected based on the time it took them to produce a filament of a specific, arbitrary length. This simply led to the evolution of cells that could divide fastest, even if cells used up all their resources by the end of the simulation; once they'd reached the target length it didn't matter how healthy they were.
In this simulation, the cells that can reproduce will reproduce, and those that can't, won't. Evolution should therefore be more open-ended and allow multiple strategies to be explored. For example, some cells may evolve to reproduce quickly, while others may evolve to divide more slowly to ensure their daughter cells have larger reserves of energy.
This is really part of natural selection. Whereas my previous simulations have simulated individuals in isolation and measured their fitness at the end of a 'generation', in this simulation I want to simulate organisms simultaneously. This means that cells will be able to interact with one another directly or indirectly by influencing the environment (see below). I hope this will drive the evolution of ecosystems in which organisms become dependent upon each other's existence.
To further encourage diverse evolutionary strategies, and thus speciation, I aim to simulate an environment that is heterogeneous both spatially and temporally. Spatial differences will allow cells to adapt to different niches, while temporal changes will force cells to respond to the environment. In fact, one of the main aims of the simulation is to evolve regulatory networks, and in order for a regulatory network to evolve, there must be changes in the environment to respond to.
The simulation of the pool of water in which the cells will live is effectively a separate project, which I am writing about here.
Whilst all genetics must be constrained to some degree (by which I mean not all phenotypes are possible; at the very least the laws of physics provide some limits), my previous simulations have been very limited. In all cases genes have represented relatively simple and well-defined characteristics, and there has been a set number of genes. In this simulation, I hope to create a sufficiently realistic and open-ended system by creating a function that converts a sequence of amino acids into protein functions (i.e. cellular functions). I suspect that getting this mapping correct will be the most important part of building the simulation. Furthermore, there will be no limit to how long a genome can be other than my computer's memory and how much energy it takes for an organism to synthesise the genome.
Evolution can create a huge diversity of organisms, but they are all constrained by physics and chemistry. Organisms can never evolve to synthesise novel elements, to generate energy from nothing or to travel faster than light. One of the (few) things that disappointed me about the game Creatures was that, because of the way that enzymes were defined, one of my creatures evolved to synthesise starch from nothing, so never had to eat.
In my simulation, I want cells to be constrained by a realistic chemistry. However, I don't intend to attempt to simulate real-life chemistry; that would be too computationally complex and not necessarily useful for a simulation of evolution. Instead I intend to invent my own simple chemistry, defining all the chemicals that can exist and all the reactions that can inter-convert between them.
With chemistry defined, organisms will have to evolve ways to exploit it to their own ends. I want to avoid defining specific chemicals as 'food' or sugars and instead allow organisms to metabolise whatever suits them. For example, if a large quantity of a waste product is produced by one organism, another might evolve to exploit that concentration gradient.
In my virtual universe there are eight elements, which I've given unimaginative one-letter names [N.B. I use different letters later in my blog and am in the process of updating everything] and arranged in a simple 4 by 2 periodic table.
The elements could represent atomic elements, but they could equally represent functional groups, such as an organic acid group; it's not really important. I arranged them in a table to indicate that elements in the same column are chemically similar. The properties of elements, such as their stability, are also related to their position in the table.
As in our universe, elements can combine to form compounds, but not all combinations of elements are possible. In my universe, there are 12 two-element compounds, formed either by combining elements from column 1 with elements from column 4 (e.g. AH) or by combining elements from column 2 with elements from column 2 or 3 (e.g. FC or BB). The relative stability of compounds is a function of the properties of the elements.
Since this is a biological simulation, all reactions will occur in water. Water is not defined in the same way as other chemicals in the simulation, but instead a Solution object is defined. This object has a volume property, which defines how many units of water it consists of, and a dictionary of the chemicals it contains. For example, we create a pool of 64 000 units of water like so:
from solutions import Solution

pool = Solution(volume=64000)
We can then add 1000 units of chemical AD:
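The real Solution class lives in the project's solutions module; as a sketch of its likely shape (the addChemical method name and signature are my guesses, not the project's actual API), adding the chemical might look like:

```python
# Minimal stand-in for the Solution object described above.
# addChemical() is an assumed method name, not the real API.

class Solution:
    def __init__(self, volume):
        self.volume = volume     # units of water
        self.chemicals = {}      # chemical name -> amount

    def addChemical(self, name, amount):
        self.chemicals[name] = self.chemicals.get(name, 0) + amount

pool = Solution(volume=64000)
pool.addChemical('AD', 1000)     # 1000 units of chemical AD
```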
The Cell object inherits from the Solution object because a cell is essentially just a bag of water containing some chemicals (plus DNA and proteins, which are defined separately in this simulation).
Having defined all the possible chemicals in the universe (elements and compounds), I defined all the ways to convert one into another. There are three classes of reaction and they are all reversible.
In these reactions a compound is split into its constituent elements (or two elements combine to form a compound). For example, AD ⇌ A + D
In these reactions an element reacts with a compound and replaces one of the constituent elements of that compound. For example, AD + H ⇌ AH + D
In these reactions two compounds react and swap constituent elements. For example, AD + EH ⇌ AH + ED
Below is a diagram showing the four possible reactions that compound AD can undergo.
Each reaction (forward and backward) has a rate constant which is calculated based on the relative stability of the chemicals involved. The rate constant determines how fast a reaction occurs, and the ratio of the rate constants for the forward and backward reactions implicitly determines where the equilibrium of a reaction lies. I'll explain more about how this works in the section about enzymes, since I've decided to allow reactions to occur only in the presence of an enzyme that can catalyse them (i.e. no reaction occurs spontaneously).
As I mentioned in the section about reactions, in order for any reaction to occur in my simulation, an enzyme is required to catalyse it. This potentially reduces the possibilities for interesting chemistry to evolve, but simplifies computation. Enzymes are added to Solution objects (including Cell objects) with the addProtein() function. For example, continuing from the code in the previous post:
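The snippet itself isn't reproduced here, but based on the description it's presumably a call along these lines (sketched with a minimal stand-in class; the exact signature is my assumption):

```python
# Stand-in illustrating the addProtein() call described in the text;
# only the sequence 'MLMMMM' and the amount come from the post.

class Solution:
    def __init__(self, volume):
        self.volume = volume
        self.proteins = {}   # amino acid sequence -> units of protein

    def addProtein(self, sequence, amount):
        self.proteins[sequence] = self.proteins.get(sequence, 0) + amount

pool = Solution(volume=64000)
pool.addProtein('MLMMMM', 2)   # two units of the ADase 'MLMMMM'
```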
Here I've added two units of a protein with the amino acid sequence 'MLMMMM' to the pool. The 15 amino acids in my simulation are represented by the letters L to Z. The amino acid sequence 'MLMMMM' codes for a type of enzyme I call ADases because they catalyse the hydrolysis/synthesis of chemical AD to/from chemicals A and D. In a later post I'll explain exactly how amino acid sequences define protein functions.
Once we have set up a pool of water, we can run the simulation for multiple ticks like so:
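The original snippet isn't reproduced here; presumably it's a loop over a per-tick update method (the method name is my guess), along the lines of:

```python
class Solution:
    # stub standing in for the real class from the solutions module
    def __init__(self, volume):
        self.volume = volume
        self.ticks = 0

    def update(self):
        self.ticks += 1   # one tick of reaction chemistry would run here

pool = Solution(volume=64000)
for _ in range(1000):
    pool.update()
```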
The graph below shows how the concentrations change over 1000 ticks in a solution of 1% AD, 0.5% A and 0% D when 0.01% ADase is added.
The net reaction is that AD is hydrolysed until equilibrium is reached. Equilibrium is reached when AD hydrolysis occurs at the same rate as AD synthesis. In this case, equilibrium is reached when approximately 97% of the AD has been broken down. The relative rates of the forward and reverse reactions depend on the concentrations of the chemicals involved and two rate constants. These rate constants (k1 and k2) are defined by the relative stabilities of the chemicals on either side of the equation.
The hydrolysis of AD into A and D has a rate constant (k1) of 0.6. If the reaction were to occur spontaneously, this would result in 60% of the AD in the pool breaking down in each tick of the simulation. However, since this reaction requires an enzyme to catalyse it, the rate is multiplied by the amount of AD bound to the enzyme. I've defined the amount of AD bound to the enzyme at any given moment as:

[AD] / ([AD] + km)
Where [AD] is the concentration of AD, i.e. the amount of AD divided by the volume of the solution. The km is a constant associated with each enzyme, which determines its affinity for its substrates. The amount of AD bound is actually slightly less in the simulation, as AD competes for the enzyme's binding site with A and D.
The reverse reaction is more complicated as it depends on two substrates. The equation is essentially the same, but is limited by whichever substrate is the least bound to the enzyme. This is a bit of a cheat, but simplifies things and stops reaction rates becoming vanishingly small when many substrates are involved.
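Putting the forward and reverse rules together, one tick of the AD ⇌ A + D reaction can be sketched like this. Only k1 = 0.6 and the km of 0.0625 come from the text; k2, and exactly how the enzyme amount scales the rate, are illustrative guesses:

```python
def bound(amount, volume, km):
    # fraction of the enzyme bound by a substrate at this concentration
    conc = amount / volume
    return conc / (conc + km)

def tick(chems, volume, enzyme, k1=0.6, k2=0.02, km=0.0625):
    # forward rate: k1 scaled by how much AD is bound to the enzyme
    fwd = k1 * enzyme * bound(chems['AD'], volume, km)
    # reverse rate: limited by the least-bound substrate (the 'cheat')
    rev = k2 * enzyme * min(bound(chems['A'], volume, km),
                            bound(chems['D'], volume, km))
    net = fwd - rev
    chems['AD'] -= net
    chems['A'] += net
    chems['D'] += net

# 1% AD, 0.5% A and 0% D in a 64,000-unit pool with 0.01% ADase
chems = {'AD': 640.0, 'A': 320.0, 'D': 0.0}
for _ in range(1000):
    tick(chems, volume=64000, enzyme=6.4)
```

Mass is conserved each tick (the totals AD + A and AD + D stay constant), and the reaction settles wherever the forward and reverse rates balance.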
The km of an enzyme in my simulation, and in reality, determines how the concentration of a substrate affects the reaction rate. From the equation above, you can see that when the concentration of a substrate is equal to the km, half the enzyme will be bound to the substrate. A small km therefore means that an enzyme binds a substrate tightly and saturates at a relatively low concentration. A large km means that an enzyme binds the substrate less tightly and saturates at a higher concentration. Enzymes with a low km will catalyse reactions faster at low substrate concentrations, but will be less responsive to changes in substrate concentration.
The graph below shows how substrate concentration affects the reaction rate of four ADase enzymes with different kms. As the substrate concentration increases, the reaction rate asymptotically approaches the maximum rate of 0.6. Only the initial reaction rate is measured, so there is no reverse reaction or product inhibition.
Comparison of reaction rates
As I mentioned above, rate constants depend on the relative stability of the chemicals involved in a reaction. AD is relatively unstable so equilibrium is reached when most of it (~97%) has been broken down. On the other hand, chemical FG is relatively stable, so breaks down slowly and reaches equilibrium at a higher concentration (only ~20% broken down). EH is somewhere in between.
Previously I described how enzymes can catalyse a reaction. If this were all enzymes could do then cells would be very limited - they could only ever move subsets of reactions towards equilibrium. Although enzymes are limited to moving reactions to equilibrium, they can expand the repertoire of reactions available by coupling together two or more reactions, effectively creating a new reaction. For example, real-life cells couple the synthesis of DNA to the hydrolysis of GTP. The addition of a nucleotide to a strand of DNA would normally be energetically unfavourable (the equilibrium favours removing a nucleotide), but the hydrolysis of GTP is more favourable. The coupled reaction therefore favours DNA synthesis.
In order to drive energetically unfavourable reactions, such as DNA synthesis, the simulated cells therefore require enzymes that can couple reactions. If we add two separate enzymes, ADase and EHase, to a solution containing 1% AD and EH, and 0.5% A, D, E and H, then both AD synthesis and EH synthesis are unfavourable, so both compounds are hydrolysed.
(The concentrations of D and H are equal to the concentrations of A and E respectively, so are not shown. And for reference, I used a 0.01% solution of each enzyme, and the km of each was 0.0625.)
However, if we add a single enzyme that couples the ADase reaction to the reverse EHase reaction, we create, in effect, a new reaction:
AD + E + H ⇌ A + D + EH
Since the AD hydrolysis reaction is more favourable in our solution, it drives the reaction in the direction of EH synthesis. Our enzyme thus functions as an ADase-dependent EH synthase.
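One way to see why coupling works: the equilibrium constant of the combined reaction is the ratio of the two hydrolysis constants. The numbers below are made up for illustration (they are not the simulation's real values); only the direction of the inequalities reflects the text:

```python
# Illustrative rate constants (not the simulation's actual numbers)
K_AD = 0.6 / 0.02        # AD <=> A + D: hydrolysis strongly favoured
K_EH = 0.5 / 0.05        # EH <=> E + H: hydrolysis favoured, but less so
K_coupled = K_AD / K_EH  # AD + E + H <=> A + D + EH
# K_coupled > 1, so the coupled reaction favours EH synthesis
```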
Arguably the defining feature of a cell is its membrane, which determines what is part of a cell and what isn't. The membrane is vital for creating spaces in which the distribution of chemicals is not in equilibrium. In particular, the DNA, RNA and proteins must be at a concentration much higher than that found outside of a cell. A disequilibrium is achieved by selectively allowing chemicals to cross the membrane.
A pore is a protein that allows a substance to passively diffuse across a membrane. We can consider a pore to be an enzyme that converts a chemical inside the cell into the same chemical outside of the cell. For example, an A-pore catalyses the reaction:
Acell ⇌ Asolution
In these pore "reactions" the equilibrium constant is always 1, i.e. equilibrium is reached when the concentration of the chemical inside the cell equals the concentration outside.
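A pore tick can therefore be sketched as movement down the concentration gradient, stopping when the two concentrations are equal (the rate law here is illustrative, not the simulation's actual one):

```python
def pore_tick(a_in, a_out, v_in, v_out, k=0.1):
    # net amount leaving the cell, proportional to the concentration
    # difference; zero (equilibrium) when the concentrations match
    net = k * (a_in / v_in - a_out / v_out)
    return a_in - net, a_out + net

# 10 units of A inside a 100-unit cell, none in a 100-unit solution
a_in, a_out = 10.0, 0.0
for _ in range(20000):
    a_in, a_out = pore_tick(a_in, a_out, v_in=100.0, v_out=100.0)
# the amounts (and so the concentrations) converge to 5 and 5
```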
Despite the fact that pores only allow passive diffusion of a substrate, they can still be used to concentrate metabolites within a cell. For example, the cell below has an A-pore, an AD-pore and the enzyme ADase.
If we put this cell, containing no metabolites, in a solution of AD, then AD will diffuse in and be broken down by ADase into A and D. Since the cell has an A-pore, substrate A will diffuse out of the cell, but substrate D cannot, so it is concentrated. Since ADase strongly favours the AD hydrolysis reaction, equilibrium is only reached when D is very, very concentrated.
In the same way that enzymes can be coupled together, as described on the previous page, pores can also be coupled, creating transporters. For example, we can create an A-D symporter by combining an A-pore with a D-pore:
Acell + Dcell ⇌ Asolution + Dsolution
By combining an A-pore with a D-pore in reverse directions, we can create an antiporter:
Acell + Dsolution ⇌ Asolution + Dcell
Furthermore, since we treat pores as a type of enzyme, we can couple a pore to an enzyme to create an active transporter. For example, by coupling an ADase to a B-pore, we create an ADase-driven B-pump, which can use the favourable AD hydrolysis reaction to pump B into the cell against a concentration gradient:
AD + Bsolution ⇌ A + D + Bcell
A couple of posts ago I showed how an unfavourable reaction, EH synthesis, could be driven by the hydrolysis of AD by coupling the two reactions, because the reaction AD + E + H ⇌ A + D + EH was favourable overall (with the given concentrations). Now that we have defined transporters, we can actually use the EH hydrolysis reaction to drive AD synthesis, despite the overall reaction being unfavourable. We do this by making use of a chemical gradient to drive a reaction, which is effectively chemiosmosis. Note that this is an emergent property of the simulation and not something I specifically coded into it.
The cell diagrammed has an EH-pore which allows it to take up EH and hydrolyse it with an EHase. This results in a build up of E and H; the concentration gradient of H is then used to drive AD synthesis by an H-dependent AD synthase.
If the cell is initialised with 1% AD and EH, and 0.5% A, D, E and H, and put in a solution of 1% EH, and 0.5% A and D, the AD hydrolysis reaction is more favourable than the EH hydrolysis reaction, but the concentration gradient of H can be used as an intermediate to drive the ADase reaction in reverse.
The reason this cell is able to use the EHase reaction to drive the more favourable ADase reaction in reverse is that, by going through the intermediate of the H concentration gradient, it can alter the stoichiometry of the reactions. The graph shows that over the course of 50 000 time units, the concentration of chemical E increases from 0.5% to over 2%. In this simulation, that corresponds to ~1000 units of E, meaning that the net EHase reaction, EH → E + H, has occurred ~1000 times. The concentration of AD increases from 1% to ~1.4%, which corresponds to ~250 units. The overall reaction within the cell is therefore roughly:
A + D + 4 EH ⇌ AD + 4 E + 4 H
Biology is the search for the chemistry that works -- R. J. P. Williams.
I gave myself the aim of synthesising compound EH, the most energetically unfavourable synthesis reaction. I added some EL and FK to the pool, which cells could use to power synthesis reactions (using the concentration gradients of these compounds as well as the chemical energy released in their hydrolysis). I imagine that EH is equivalent to ATP or DNA, which cells must synthesise in order to replicate, and that EL and FK are equivalent to sugars that cells metabolise to power reactions.
The system that I created is shown below. Black arrows indicate the transport (active or passive) of chemicals into and out of the cell; red arrows indicate reactions.
This metabolism makes use of the concentration gradient and chemical energy of EL to drive EH formation. EL enters the cell by passive diffusion and is broken down into E and L. E can exit the cell, while L accumulates to a high concentration, so that it can drive EH formation as it leaves the cell. I later realised that the E pore is unnecessary, but I left it in the ancestral cell to see if evolution would also discover this. In the future, I'll probably just start with a random genome.
In order to get cells to evolve, I needed to create a way to mutate the proteins in the cell. I did this by giving cells a genome which defines the proteins they contain. The genome is a linear sequence of nucleotides, which I represent with the letters A, B, C and D (equivalent to A, G, C and T in real cells). This sequence can easily be mutated, as described below.
The genome is split into genes with the boundaries between genes indicated by the sequence DDAA. Genes are then split into two letter codons, which are interpreted thus:
The two-letter codon xx defines a chemical, X, in the case of transporters, and a reaction, X, in the case of enzymes. For example, BA-AA encodes an E-transporter and BC-AA encodes an EH synthase/hydrolase. The forward and reverse directions of reactions are not important unless reactions are coupled. For example:
If a codon doesn't (yet) have an interpretation, it is ignored, so CA-BA-AA-CB-CC-BA-AB-CD is an E/F symporter as above. Later on I'd like to create a more efficient system whereby the arrangement of two binding sites determines how reactions are coupled.
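The parsing described above can be sketched in a few lines: split the genome on the DDAA delimiter, then read each gene two letters at a time (dropping any trailing odd letter is my assumption):

```python
def parse_genome(genome):
    # genes are separated by the sequence DDAA
    genes = [g for g in genome.split('DDAA') if g]
    # each gene is read as two-letter codons
    return [[gene[i:i + 2] for i in range(0, len(gene) - 1, 2)]
            for gene in genes]

# two genes: BA-AA (an E-transporter) and BC-AA (an EH synthase/hydrolase)
parse_genome('BAAADDAABCAA')   # [['BA', 'AA'], ['BC', 'AA']]
```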
Each time a cell is created, its DNA is synthesised using a template. Bases are copied from the template one at a time with a 0.5% chance of mutation per base. 75% of mutations are single base changes, while 25% cause the DNA to continue replicating from a random position on the template, thus creating insertions and deletions and potentially duplicating the entire genome.
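A sketch of that copying scheme (the length cap, added to stop runaway duplication loops, is my own addition):

```python
import random

def replicate(template, p_mut=0.005, max_len=None):
    bases = 'ABCD'
    if max_len is None:
        max_len = 4 * len(template)   # my addition: cap runaway copying
    daughter = []
    i = 0
    while i < len(template) and len(daughter) < max_len:
        if random.random() < p_mut:
            if random.random() < 0.75:
                daughter.append(random.choice(bases))   # single base change
                i += 1
            else:
                # resume copying from a random template position,
                # creating insertions, deletions or duplications
                i = random.randrange(len(template))
        else:
            daughter.append(template[i])
            i += 1
    return ''.join(daughter)
```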
I started a simulation with a pool with a volume of 1 000 000 units and the same concentration of chemicals as above (default concentrations of elements plus extra EL and FK). I then added 64 cells to the pool, each containing the default concentration of elements. The DNA in each cell was based on the genome of my ancestral cell, mutated as described above. The genomes were interpreted and an amount of each protein was added to cells, such that the total amount of protein in each cell was 16 units. This prevented cells simply producing more proteins by duplicating genes. It also meant that cells producing useless proteins were penalised by reducing the amount of other proteins.
Cells were run for 20 000 time units and ranked by the concentration of EH they accumulated during this time. The top 16 cells were selected and allowed to replicate. The selection method allowed more successful cells to have more offspring in a fairly arbitrary manner:
In addition, four "wildcard" cells were picked at random and allowed to produce a daughter. This gives 64 daughter cells to form the next generation. Each generation was added to a fresh pool with the same distribution of chemicals as before, and run for 20 000 time units again.
After setting up the simulation as described in the previous update, I left my cells to evolve for a couple of days, during which time 2800 generations passed. Each generation consisted of 20 000 units of time and the fitness of each cell was defined as the amount of chemical EH it could accumulate in this time. For each generation, I recorded the genome of the fittest cell and the amount of EH it had accumulated.
In the first generation, the top organism accumulated 6.2 units of EH; the highest amount of EH accumulated was 71.4 units, in generation 263. The graph below shows the amount of EH in the top cell of each of the first 300 generations.
The graph demonstrates an important difference between this simulation and my heterocyst simulation or my genetic algorithm - despite the cells with the highest fitness being selected each generation, the fitness does not go up in every generation. In fact, beyond generation 263, the maximum amount of EH shows no net increase, but bounces up and down between ~55 and ~65 units.
The reason is that the environment changes every generation. While the amount of each chemical in the pool at the beginning of each generation was the same, it was altered over time by the cells' metabolism. Cell fitness therefore becomes dependent on context; what is advantageous in one generation may not be successful in the next. It also means that cells can interact with one another, albeit indirectly. This will create opportunities for evolutionary arms races and co-evolution.
It may also be possible to have two populations of cells, one using say metabolite X, another using metabolite Y. If cells metabolising X start to dominate, then the amount of X in the pool will reduce, allowing cells metabolising Y to prosper, and the situation to reverse. However, I suspect that a population of 64 cells is too small for two separate populations to be stable. Sadly, I only recorded the genome of the top organism in this run of evolution so I can't analyse populations yet.
The ancestral cell in my simulation had six genes, which encoded five different proteins. Their functions are described in the previous update.
By the end of the 2800 generations, the fittest cell had 18 genes, which encoded four different proteins:
Since the total amount of protein in all cells was fixed at 16 units, increasing the number of a particular gene increases the amount of protein encoded by that gene, but decreases the amount of protein encoded by all other genes. The ancestral cell therefore actually has the same amount of EL pore as the final cell (16/6 units).
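As a quick arithmetic check (the final cell having three EL-pore gene copies out of its 18 genes is my inference from the stated equality, not a number given above):

```python
# protein per gene = 16 units shared in proportion to gene copy number
ancestral_EL_pore = 16 * 1 / 6    # 1 copy among the ancestor's 6 genes
final_EL_pore = 16 * 3 / 18       # 3 copies among the final cell's 18 genes
# both come to 16/6, i.e. roughly 2.67 units of EL pore
```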
Below is a graph showing the number of genes in the fittest cell and the amount of EH accumulated by that cell. The graph shows that the big jump in fitness around generation 250 is actually associated with a reduction in gene number (and a couple of gene gains). Then, between generation ~300 and generation ~2300, there is no change in the gene number of the fittest organism. Interestingly, the gain in genes around generation 2300 appears to be associated with a reduction in the fitness of the cell.
While I didn't record the genomes of cells other than the fittest in the population, the peaks in gene number suggest that there were two competing species in the last 100 generations; the genome of the top organism flips between these two variants. One genome is the final genome shown above; the other contained 4 EL pores, 4 EL hydrolases, 8 L/H antiporters and 9 L-driven EH synthases. In an isolated pool, this cell is only slightly less efficient than the final cell. I would predict that under conditions of low H, the increased number of L/H antiporters gives this cell an edge.
As well as changes in gene copy number, two proteins were lost and a novel protein appeared over the course of evolution. As I mentioned in the previous update, the E pore is effectively useless and was very quickly lost (in generation 7, resulting in quite a big jump in fitness). The other big change is that rather than using an H pore to allow chemical H to passively diffuse into the cell, the final cell uses an L/H antiporter, making use of the concentration gradient of L already present to drive H into the cell. The loss of the H-pore (in generation 225) and the gain of an L/H antiporter (in generation 255) resulted in the two big jumps in fitness shown in the graph above.
I was a bit surprised that the cell evolved to drive H into the cell, especially at the expense of its L gradient, which it needs to drive EH synthesis. However, I later realised that if a cell actively takes up H, then it lowers the concentration of H in the pool, causing H to flow out of competing cells that don't actively transport it. If the final evolved cell is put in a pool with the ancestral cell, the ancestral cell effectively dies; it is unable to synthesise EH because H flows out of it.
Even in this simple simulation cells evolved aggressive tactics which I hadn't thought possible.
I think this explains the reduction in the maximum fitness around generation 2300; the reduction is associated with an increase in the number of L/H antiporters (and some other genes, which I guess are required to maintain metabolite balance). I suspect that all the cells in later generations are using their L gradient to take up more H, which results in no net increase in H uptake. When I ran evolution again for 1250 generations, the L/H antiporter was again the major evolutionary breakthrough.
Running this simulation has given me lots of ideas for changes:
I wanted to re-run my simulation with a couple of changes. First, I wanted to see whether different 'species' could co-exist, so I recorded the genome of every organism, rather than just the fittest organism. I also recorded the concentration of metabolites in the solution at the end of each generation, so I can see which are used most.
Second, in order to encourage different species to evolve I doubled the population size to 128 cells and simplified the way the next generation is selected. Now, the fittest half of a population produces two daughter cells for the next generation. As a result, mutations now take longer to spread through the population, so a single species is less likely to dominate.
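The simplified selection step can be sketched as:

```python
def next_generation(population, fitness, replicate):
    # rank by fitness; each cell in the fittest half leaves two daughters
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:len(population) // 2]
    return [replicate(cell) for cell in survivors for _ in range(2)]

# toy example: 'cells' are numbers and fitness is the number itself
cells = list(range(128))
daughters = next_generation(cells, fitness=lambda c: c,
                            replicate=lambda c: c)
# 128 daughters, all descended from the fittest 64 cells
```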
To further encourage different species to evolve, and in preparation for later updates, I created a more diverse and challenging metabolic environment. The first change was to create two potential sources of energy, metabolites FG and FK. These provide energy both in the form of a concentration gradient (they are more concentrated outside the cell than inside) and because they release energy when broken down (the reaction is far from equilibrium). I imagine FG and FK are the equivalents of two sugars (such as glucose and fructose). Since they both contain chemical F, F is likely to become a waste product.
While I decided to keep EH as an equivalent to ATP (because its hydrolysis reaction is the most favourable), I decided to make IH the equivalent to DNA, and changed the measure of fitness: now cells must maximise the amount of IH they accumulate. In the ancestral cell, IH formation is driven by EH hydrolysis. I also decided to make KG the equivalent to amino acids, though in this simulation, this still has no effect.
With twice the population size and about twice the number of genes in the ancestral cell, this simulation was a bit slower to run, but after a few days it still managed to reach 1920 generations. Below is a graph showing the maximum (light blue) and median (dark blue) fitness of cells in the population, where fitness is defined as the concentration of IH a cell accumulates by the end of a generation.
As in the previous simulation, the fitness increases for the first 200 or so generations before levelling off. In generation 0, the fittest cell accumulated 1.99 units of IH; by generation 200 it accumulated 21.1 units, and by generation 1920, 20.9 units. After the first 200 generations, there was little difference between the fittest and the median organism, suggesting that successful genes spread quickly through the population.
Below is a graph showing the concentration of various metabolites at the end of each generation, relative to the concentration in the 0th generation. Note that the concentration of metabolites is reset each generation, so all changes are due to changes in the metabolism of the cells. The metabolites not shown did not change significantly, and there were no big changes after the 1000th generation.
As predicted, F was a major waste product, and as the cells evolved they released more of it into the environment. Similarly, K and G, which are also waste products of the "sugars", increased. Conversely, E was an important metabolite for synthesising EH, so more of it was taken up from the environment as the cells evolved. Surprisingly, IL, which was not part of the original metabolic pathway, also became an important metabolite; later we'll see why.
There are a few large changes in metabolite concentrations around generations 80 and 200, which suggest that some important mutations occurred and took hold at these points.
Since I recorded the genome of each organism in this run of evolution (1920 generations, each with 128 cells, so nearly 250 000 genomes), I can also plot the fitness of, for example, the 64th fittest organism in each generation, which gives a measure of the median fitness. As you can see, the graph is very similar in general; only the position of the fluctuations is different.
A quick look at the proteomes suggests that the advantageous mutation in generation 1 is for an enzyme that catalyses the reaction EH + IL ⇌ EL + IH. This reaction is a more efficient way of using the EH gradient to drive IH production, and it uses the small amount of IL in the cell. The improved fitness of a cell in generation 14 is likely due to a mutation that creates a G/IL antiporter, which allows the cell to take up IL using the G gradient.
Having recorded the genomes for 128 organisms per generation for 1920 generations (nearly a quarter of a million genomes), I thought it might be interesting to look for some patterns in how genes emerge and spread through a population.
I first considered ways to visualise a population of cells from a single generation. I thought the final generation would be most interesting since its cells are most evolved. In the image below, each circle represents a cell from the final generation. Cells are ordered by fitness (left to right, top to bottom). Cell fitness is also represented by the area of each circle. I attempted to show the similarity of cell proteomes by the circle colour. To this end, I calculated the distance between every pair of cells by comparing the number of each type of protein they had. I coloured the two most different cells pure blue and pure green, then gave each other cell a greenness and blueness based on their relative distance from these two extreme cells.
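The colouring scheme could be sketched like this, assuming each proteome is a simple count of protein functions (the proteomes and function names below are made up for illustration):

```python
from collections import Counter
from itertools import combinations

# Hypothetical proteomes: protein-function name -> copy count per cell.
proteomes = [
    Counter({"EHase": 3, "F-pump": 1}),
    Counter({"EHase": 2, "F-pump": 2, "G/IL antiporter": 1}),
    Counter({"EHase": 3, "odd-gene": 4}),
    Counter({"F-pump": 5}),
]

def distance(a, b):
    """Manhattan distance over protein copy numbers."""
    return sum(abs(a[k] - b[k]) for k in set(a) | set(b))

# The two most different cells anchor pure blue and pure green.
blue, green = max(combinations(range(len(proteomes)), 2),
                  key=lambda p: distance(proteomes[p[0]], proteomes[p[1]]))
span = distance(proteomes[blue], proteomes[green])

colours = []
for p in proteomes:
    # The closer a cell is to an extreme, the more of that colour it gets
    # (clamped at 0 in case a cell is farther from an extreme than the span).
    blueness = max(0.0, 1 - distance(p, proteomes[blue]) / span)
    greenness = max(0.0, 1 - distance(p, proteomes[green]) / span)
    colours.append((0, int(255 * greenness), int(255 * blueness)))
```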
The problem with this graph is that I'm not really sure what it shows. By the 1920th generation, the population is probably quite homogeneous and unsurprisingly the most different organisms are those that are least fit. I suspect they have large chunks of their genomes duplicated or have odd genes that don't do very much, hence the difference. I quite like the colours though.
Since the fitness of cells increases most in the first 200 generations then plateaus, I decided it might be more interesting to look at the early generations when novel, useful mutations start to appear. I also decided to restrict my analysis to the fittest cells, since the least fit cells generally have fatal or useless mutations.
The graph below shows the fittest 16 cells of the first 16 generations. Again, cell area represents relative fitness and colour represents how similar cells are. In the first generation there is one cell that is significantly more successful than the others, which are so small as to be almost invisible. This mutation spreads through the population, so by generation 7 all the cells in the top 16 are at least as fit. The next big evolutionary breakthrough is at generation 14, which then begins to spread. In this graph the colours do seem to highlight different species of cell reasonably well (e.g. the dark blue species that takes hold in generation 13, but loses its top spot the very next generation).
I was pretty pleased with how the last graph turned out, but still wasn't happy with the colours. The two most extreme cells seemed like a somewhat arbitrary basis for colouring all the cells, especially as the reason they were so different was that they had weird mutations which weren't representative of the population as a whole. The range of colours was also limited: since a cell couldn't be identical to both extreme cells at once, it could never have maximum blue and green values simultaneously.
Over two years later, I again considered the problem of converting a genome with a variable number of different genes into two values (degrees of blue and green), such that the similarity of the colours reflects the similarity of the genomes, and realised that principal component analysis (PCA) could help. PCA converts a set of vectors into a set of lower-dimensional vectors while preserving as much of the variation as possible. If I could represent an organism's genome as a vector (a list of numbers), then I could reduce that vector to two dimensions and use those to colour the cell.
For this analysis, I used the fittest 64 organisms (half the population) from the first 200 generations, so had 12,800 genomes to consider. I converted the genomes into vectors by first going through all these genomes and getting a list of all the different protein functions encoded and how often they occurred. I used protein functions rather than genes, so that genes whose mutations had no effect on function were ignored. I found 2286 different protein functions encoded in the genome set, which is quite surprising given how limited the chemistry is. I filtered these down to a more manageable 329 by removing any function that occurred fewer than 10 times. The most common function was the F-driven EHase, which occurred 40,835 times, so more than 3 copies per cell on average.
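The counting and filtering step is straightforward with a `Counter`. A minimal sketch, with made-up genomes represented as lists of protein-function labels (the real simulation derives functions from gene sequences, and the threshold there was 10):

```python
from collections import Counter

# Hypothetical genomes: each is a list of protein-function labels.
genomes = [
    ["F-driven EHase", "F-driven EHase", "G/IL antiporter"],
    ["F-driven EHase", "broken"],
    ["F-driven EHase", "G/IL antiporter"],
]

# Count how often each function appears across the whole genome set...
function_counts = Counter()
for genome in genomes:
    function_counts.update(genome)

# ...and keep only functions above a frequency threshold.
THRESHOLD = 2
common_functions = sorted(f for f, n in function_counts.items() if n >= THRESHOLD)
```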
Each genome was converted into a vector of 329 dimensions by counting the number of copies of each of the 329 most frequent proteins it contained. This resulted in a 12,800 × 329 matrix of organisms, which was then subjected to PCA (which took a few minutes). The two principal components were then extracted, each organism converted into a two-dimensional vector, and each vector turned into a colour made of blues and greens. The result is shown below, with each organism filling a 2×2 pixel.
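The reduction-to-colour step can be sketched with plain numpy (PCA here is done by hand via the singular value decomposition; the count matrix below is random stand-in data, much smaller than the real 12,800 × 329 one):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical count matrix: rows are organisms, columns are the
# retained protein functions.
counts = rng.poisson(2.0, size=(200, 40)).astype(float)

# PCA by hand: centre the data, then project onto the top two
# right singular vectors (the two principal components).
centred = counts - counts.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
coords = centred @ vt[:2].T          # each organism as a 2-D point

# Rescale each component to [0, 255] for the green and blue channels.
lo, hi = coords.min(axis=0), coords.max(axis=0)
channels = (255 * (coords - lo) / (hi - lo)).astype(int)
colours = [(0, g, b) for g, b in channels]
```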
I think this image quite nicely shows successful mutations appearing at the top of each generation, and then flowing through the population. The colours seem to work quite well and there is a clear, gradual change over time. I would like to see what it looked like with a population of two species, but I'm not sure I can model enough organisms to make that stable.