What is a "codon" in grammatical evolution?

Question

The term codon is used in the context of grammatical evolution (GE), sometimes, without being explicitly defined. For example, it is used in this paper, which introduces and describes PonyGE 2, a Python library for GE, but it's not clearly defined. So, what is a codon?

nbro · Accepted Answer · 2020-12-29T00:03:06.360

Grammatical evolution

To understand what a codon is, we need to understand what GE is, so let me first provide a brief description of this approach.

Grammatical evolution (GE) is an approach to genetic programming where the genotypes are binary (or integer) arrays, which are mapped to the phenotypes (i.e. the actual solutions, which can be represented as trees, which, in turn, represent programs or functions), using a grammar (for example, expressed in Backus-Naur form). So, the genotypes (i.e. what is mutated, combined, or searched) and the phenotypes (the actual solutions, which are programs) are different in GE, and the genotype needs to be mapped to the phenotype to get the actual solution (or program), but this is not the case in all GP approaches (for example, in tree-based GP, the genotype and the phenotype are the same, i.e. trees, which represent functions).

Codons

In GE, a codon is a subsequence of $m$ bits of the genotype (assuming that genotypes are binary arrays). For example, let's say that we have only two symbols in our grammar, i.e. a (a number) and b (another number). In this case, we only need 2 bits to differentiate the two. If we had 3 symbols, a, b and + (the addition operator), we would need at least a sequence of 2 bits to define each symbol. So, in this case, we could have the following mapping

a is represented by 00 (or the integer 0),
b is represented by 01 (or 1), and
+ is represented by 10 (or 2)

The operation a+b could then be represented by the binary sequence 001001 (or the integer sequence 021). The 2-bit subsequences 00, 01 and 10 (or their integer counterparts) are the codons.

What do we need codons for?

In GE, codons are used to index the specific choice of a production rule. To understand this, let's define a simple grammar, which is composed of

a set of non-terminals (e.g. functions) $N = \{ \langle \text{expr} \rangle, \langle \text{op} \rangle, \langle \text{operand} \rangle, \langle \text{var} \rangle \}$ ,
a set of terminals (e.g. specific numbers or letters) $\mathrm{T}=\{1,2,3,4,+,-, /, *, \mathrm{x}, \mathrm{y}\}$,
a set of production rules $P$, and
an initial production rule $S = \langle \text{expr} \rangle $.

In this case, the set of production rules $P$ is defined as follows

\begin{align} \langle \text{expr} \rangle & ::= \langle \text{expr} \rangle \langle \text{op} \rangle \langle \text{expr} \rangle \; | \; \langle \text{operand} \rangle \\ \langle \text{op} \rangle & ::= + \; | \; - \; | \; * \; | \; / \\ \langle \text{operand} \rangle & ::= 1 \; | \; 2 \; | \; 3 \; | \; 4 \; | \; \langle \text{var} \rangle \\ \langle \text{var} \rangle & ::= \mathrm{x} \; | \; \mathrm{y} \end{align} So, there are four productions rules. To be clear, $\langle \text{var} \rangle ::= \mathrm{x} \; | \; \mathrm{y}$ is a production rule. The symbol $|$ means "or", so the left-hand side of each production is a non-terminal (and note that all non-terminals are denoted with angle brackets $\langle \cdot \rangle$), which is defined as (or can be replaced with) one of the right-hand choices, which can be a non-terminal or terminal. The first choice of each production rule is at index $0$. The second choice at index $1$, and so on. So, for example, in the case of the production $\langle \text{var} \rangle ::= \mathrm{x} \; | \; \mathrm{y}$, $\mathrm{x}$ is the choice at index $0$ and $\mathrm{y}$ is the choice at index $1$.

The codons are the indices that we use to select the production rule's choice while transforming (or mapping) the genotype into a phenotype (an actual program).

So, we start with the first production, in this case, $S = \langle \text{expr} \rangle$. If it's a non-terminal, then we replace it with one of its right-hand side choices. In this case, there are two choices

$\langle \text{expr} \rangle \langle \text{op} \rangle \langle \text{expr} \rangle$ (choice at index $0$)
$\langle \text{operand} \rangle$ (choice at index $1$)

If our genotype (integer representation) is, for example, $01$ (note that this is a sequence of integers), we would replace $\langle \text{expr} \rangle$ with $\langle \text{expr} \rangle \langle \text{op} \rangle \langle \text{expr} \rangle$, then we would replace the first $\langle \text{expr} \rangle$ in $\langle \text{expr} \rangle \langle \text{op} \rangle \langle \text{expr} \rangle$ with $\langle \text{operand} \rangle$, so we would get $\langle \text{operand} \rangle \langle \text{op} \rangle \langle \text{expr} \rangle$, and so on and so forth, until we get an expression that only contains terminals, or the genotype is terminated. There can be other ways of mapping the genotype to the phenotype, but this is the purpose of codons.

Codons in biology

The term codon has its origins in biology: a subsequence of 3 nucleotides is known as a codon, which is mapped to an amino acid in order to produce the proteins. The set of all mappings from codons to amino acids is known as genetic code. Take a look at this article for a gentle introduction to the subject. So, in GE, codons also have a similar role to the role of codons in biology, i.e. they are used to build the actual phenotypes (in biology, the phenotypes would be the proteins or, ultimately, the organism).

Notes

Codons do not have to be 2-bit subsequences, but they can be $m$-bit subsequences, for some arbitrary $m$. The term codon is similar to the term gene, which is also often used to refer to specific subsequences of the genotype (for example, in genetic algorithms), although they may not be synonymous (at least in biology, genes are made up of sequences of codons, so they are not synonymous). Moreover, the binary $m$-bit codons can first be mapped to integers, so codons can also just be integers, as e.g. used here or illustrated in figure 2.2 of this chapter.