
A Formal Introduction to the Amino Acids

When you Google “amino acids,” one of the first images you’ll see is this colorful diagram, showing the skeletal formulas of twenty molecules grouped by the chemical character of their side chains. Throughout many biochemistry courses, I have found myself memorizing this exact chart.

The chart of the 20 standard amino acids I have memorized too many times. Source: Technology Networks

I’ve often glanced over these and moved on to higher-level concepts like enzyme mechanisms or structure analysis, but this time I wanted to appreciate the amino acids a bit more and use this as an excuse to try the blog feature of this website.

A brief aside on the amino acid synthetases

Yeast aspartyl-tRNA synthetase (blue) bound to tRNA (orange/red). The synthetase recognizes both the tRNA and the amino acid, ensuring the correct pairing. Each organism inherits 20 of these enzymes, and that set determines which amino acids get loaded onto which tRNAs and therefore mapped to codons. If you change the synthetases, you change the genetic code itself. PDB 1ASY

Over 800 amino acids have been found in nature, but only 20 are directly encoded by the standard genetic code and incorporated into proteins during translation. These are the proteinogenic amino acids. What distinguishes them from the other 780+ is that each one is recognized by one of the 20 aminoacyl-tRNA synthetases (aaRS), each of which loads a specific amino acid onto its corresponding transfer RNAs (tRNAs), whose anticodons pair with that amino acid's codons.

In my experience, the synthetases are often overlooked in introductory courses, but are arguably among the most important enzymes in molecular biology. They are among the oldest enzymes in cells, predating the divergence of the domains of life. The 20 synthetases fall into two structurally unrelated classes (Class I and Class II), each with 10 members, that likely evolved independently. Despite having completely different protein folds, both classes solve the same problem: recognizing one amino acid out of twenty and attaching it to the right tRNA accurately enough to keep the overall translation error rate below 1 in 10,000. Consider that some amino acids are chemically near-identical: valine and isoleucine differ by a single methylene group, yet the synthetases reliably distinguish them. Several synthetases have a dedicated proofreading domain, a second active site that hydrolyzes incorrectly charged tRNAs before they can reach the ribosome. Not all synthetases have a dedicated editing domain; those that lack one compensate through highly specific binding and activation of their cognate amino acid. The accuracy of the synthetases is so exceptional that it has earned its own term, “superspecificity”. Interestingly, the accuracy of these attachments also depends on stoichiometry: cells must maintain a balanced ratio of each synthetase to its cognate tRNAs, because overproduction of a synthetase leads to increased misacylation (an amino acid is attached to a noncognate tRNA).

61 sense codons (out of $4^3=64$ total, with 3 stop codons) map to 20 amino acids, mediated by roughly 45 distinct tRNA species and exactly 20 synthetases, one for each amino acid (plus a 21st, selenocysteine, which we’ll get to in the next section).
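To make the bookkeeping concrete, here is a minimal Python sketch that rebuilds the standard codon table and counts its entries. The names (`BASES`, `AMINO_ACIDS`, `CODON_TABLE`) are my own; the 64-letter string is the standard code written in the conventional UCAG order, with `*` marking a stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, ordered by first, second, then third base (UCAG each).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each RNA codon (e.g. "AUG") to its one-letter amino acid code.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
encoded = {aa for aa in CODON_TABLE.values() if aa != "*"}

print(f"{len(CODON_TABLE)} codons: {len(sense_codons)} sense, "
      f"{len(stop_codons)} stop, {len(encoded)} amino acids")
# prints "64 codons: 61 sense, 3 stop, 20 amino acids"
```

Selenocysteine does not appear in this table because it has no codon of its own in the standard code.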

The $21^*$ Proteinogenic Amino Acids

Relative abundance of the 20 standard amino acids in proteins, grouped by side chain property. Frequencies from UniProtKB/Swiss-Prot release 2026_01 (574,627 entries, 208M amino acids).

Hydrophobic Side Chains

The largest group contains eight amino acids whose side chains are made mostly or entirely of carbon and hydrogen, making them nonpolar and hydrophobic. In a folded protein, these residues tend to cluster in the interior, away from water, forming the hydrophobic core that drives protein folding.

Alanine

Alanine structure
Ala | A | MW: 89.09 Da
Charge Neutral
Heavy R atoms 1
Frequency ~8.3%
Alanine was one of the first amino acids detected in meteorites. It is the simplest chiral amino acid: its side chain is just a methyl group, making it a small, inert building block that fits almost anywhere in a protein. Its simplicity makes it the residue of choice in alanine scanning mutagenesis, where residues are systematically replaced with alanine to identify which side chains are functionally important.*

Valine

Valine structure
Val | V | MW: 117.15 Da
Charge Neutral
Heavy R atoms 3
Frequency ~6.9%
Valine is one of three branched-chain amino acids (BCAAs) metabolized in muscle rather than the liver. Its bulky, forked side chain makes it a common resident of hydrophobic cores. The single-nucleotide mutation that replaces glutamate with valine at position 6 of the hemoglobin β-chain causes sickle cell disease, one of the most well-known examples of how a single amino acid substitution can have dramatic consequences.

Isoleucine

Isoleucine structure
Ile | I | MW: 131.17 Da
Charge Neutral
Heavy R atoms 4
Frequency ~5.9%
Isoleucine is one of only two standard amino acids (with threonine) that have two chiral centers. It is an isomer of leucine, with the same molecular formula but a different branching pattern, and only the (2S,3S) form is incorporated into proteins. Distinguishing isoleucine from valine is one of the classic challenges for the aminoacyl-tRNA synthetases, since the two differ by just a single methylene group.

Leucine

Leucine structure
Leu | L | MW: 131.17 Da
Charge Neutral
Heavy R atoms 4
Frequency ~9.6%
Leucine is the most abundant amino acid in proteins and the strongest activator of mTOR, the master regulator of cell growth and protein synthesis. This dual role as both building block and growth signal makes it a key amino acid in nutrition and muscle biology.

Methionine

Methionine structure
Met | M | MW: 149.21 Da
Charge Neutral
Heavy R atoms 4
Frequency ~2.4%
AUG (Met) is the universal start codon, so nearly every protein begins its life as a methionine (though it is often cleaved off later). Paradoxically, methionine is one of the rarest amino acids at just ~2.4% frequency (among the hydrophobic residues, only tryptophan is rarer), partly because the sulfur in its side chain makes it metabolically expensive to produce. Methionine is also the precursor to S-adenosylmethionine (SAM), the cell's universal methyl donor.

Phenylalanine

Phenylalanine structure
Phe | F | MW: 165.19 Da
Charge Neutral
Heavy R atoms 7
Frequency ~3.9%
Phenylalanine is the reason diet soda cans say "contains phenylalanine": the artificial sweetener aspartame is a dipeptide of aspartate and phenylalanine, which is dangerous for people with phenylketonuria (PKU), a genetic disorder in which phenylalanine cannot be properly metabolized and accumulates to toxic levels.

Tyrosine

Tyrosine structure
Tyr | Y | MW: 181.19 Da
Charge Neutral
Heavy R atoms 8
Frequency ~2.9%
Named from the Greek tyros (cheese), tyrosine is the precursor to dopamine, adrenaline, and thyroid hormones. Structurally it is phenylalanine with a hydroxyl group on the ring, placing it at the boundary between hydrophobic and polar. That hydroxyl is a key target for phosphorylation by tyrosine kinases, and dysregulated tyrosine kinase signaling is implicated in many cancers, making tyrosine kinase inhibitors (like imatinib) among the most successful targeted cancer therapies.

Tryptophan

Tryptophan structure
Trp | W | MW: 204.23 Da
Charge Neutral
Heavy R atoms 10
Frequency ~1.1%
Tryptophan is the rarest standard amino acid and the biosynthetic precursor to serotonin and melatonin. Its indole ring system absorbs UV light at 280 nm, which is why protein concentration is routinely measured by UV absorbance at that wavelength. Despite popular belief, turkey does not contain unusually high levels of tryptophan; post-Thanksgiving drowsiness is more likely from overeating carbohydrates, which increase tryptophan transport across the blood-brain barrier.

Electrically Charged Side Chains

Five amino acids have side chains that can carry a charge near physiological pH. Arginine and lysine are positively charged, aspartate and glutamate are negatively charged, and histidine (pKa ~6.0) is mostly neutral at pH 7.4 but is readily protonated. These residues are usually found on protein surfaces, where they interact with water, form salt bridges with oppositely charged residues, and participate in catalysis.

Arginine

Arginine structure
Arg | R | MW: 174.20 Da
Charge +1
Heavy R atoms 7
Frequency ~5.5%
Arginine's guanidinium group (pKa ~12.5) is almost always protonated and can form up to five hydrogen bonds simultaneously, making it the amino acid most frequently found interacting with phosphate groups in DNA-binding proteins. Arginine is also the precursor to nitric oxide (NO), whose discovery as a signaling molecule won the 1998 Nobel Prize.

Histidine

Histidine structure
His | H | MW: 155.16 Da
Charge ~0 (pKa ~6.0)
Heavy R atoms 6
Frequency ~2.3%
Histidine's imidazole side chain has a pKa (~6.0) near physiological pH, making it the only amino acid that can readily toggle between protonated and deprotonated states under biological conditions. This is why histidine appears in more enzyme active sites than any other residue relative to its abundance, acting as both a proton donor and acceptor in catalysis.

Lysine

Lysine structure
Lys | K | MW: 146.19 Da
Charge +1
Heavy R atoms 5
Frequency ~5.8%
Lysine's long, flexible side chain ending in an ε-amino group (pKa ~10.5) makes it the primary target for ubiquitination (the tag that marks proteins for degradation) and for histone acetylation/methylation, which regulate gene expression. The versatility of lysine's post-translational modifications makes it arguably the most heavily regulated residue in epigenetics.

Aspartate

Aspartate structure
Asp | D | MW: 133.10 Da
Charge -1
Heavy R atoms 4
Frequency ~5.5%
Aspartate racemization (slow conversion from L to D form) accumulates over a human lifetime and is used as a molecular clock for forensic age estimation from teeth and eye lens proteins, which do not turn over. Aspartate is also the shorter of the two negatively charged amino acids, making it a common ligand for metal ions in enzyme active sites.

Glutamate

Glutamate structure
Glu | E | MW: 147.13 Da
Charge -1
Heavy R atoms 5
Frequency ~6.7%
Glutamate is the most abundant excitatory neurotransmitter in the brain and the source of "umami," the fifth basic taste, discovered by Kikunae Ikeda in 1908 when he isolated monosodium glutamate (MSG) from kelp broth. In proteins, glutamate's extra methylene group compared to aspartate gives it more conformational flexibility for forming salt bridges.

Polar Uncharged Side Chains

These four amino acids have side chains that can form hydrogen bonds with water and other polar groups, but carry no net charge at physiological pH. This makes them common on protein surfaces and at active sites, where hydrogen bonding is critical.

Serine

Serine structure
Ser | S | MW: 105.09 Da
Charge Neutral
Heavy R atoms 2
Frequency ~6.7%
Serine is the most commonly phosphorylated amino acid in eukaryotic cells: roughly 86% of all protein phosphorylation events occur on serine residues (vs. ~12% threonine, ~2% tyrosine). Its small hydroxyl group also makes it a key nucleophile in the active sites of serine proteases, one of the largest enzyme families.

Threonine

Threonine structure
Thr | T | MW: 119.12 Da
Charge Neutral
Heavy R atoms 3
Frequency ~5.4%
Threonine was the last of the 20 standard amino acids to be discovered (by William Rose in 1935), and it was named after threose, the four-carbon sugar it resembles. Like isoleucine, threonine has two chiral centers. Its discovery led Rose to define the concept of essential amino acids.

Asparagine

Asparagine structure
Asn | N | MW: 132.12 Da
Charge Neutral
Heavy R atoms 4
Frequency ~4.1%
Asparagine was the very first amino acid to be isolated (from asparagus juice in 1806 by Vauquelin and Robiquet). It is also the attachment site for N-linked glycosylation, one of the most important post-translational modifications, in which sugar chains are attached to the side-chain nitrogen of asparagine residues.

Glutamine

Glutamine structure
Gln | Q | MW: 146.15 Da
Charge Neutral
Heavy R atoms 5
Frequency ~3.9%
Glutamine is the most abundant free amino acid in human blood plasma (~500-900 μM), serving as a nitrogen shuttle between organs. Rapidly dividing cells, including immune cells and cancer cells, consume glutamine in enormous quantities, a phenomenon called "glutamine addiction" that is now a target for cancer therapy.

Special Cases

These three amino acids have unusual structural properties that set them apart from the other groups.

Cysteine

Cysteine structure
Cys | C | MW: 121.16 Da
Charge Neutral
Heavy R atoms 2
Frequency ~1.4%
Two cysteines can form a disulfide bond (cystine), which acts like a molecular staple holding protein structures together. This is also the chemistry behind hair perms: breaking and reforming disulfide bonds in keratin reshapes the hair. Cysteine's thiol group (pKa ~8.3) also makes it a key catalytic nucleophile in many enzymes.

Glycine

Glycine structure
Gly | G | MW: 75.07 Da
Charge Neutral
Heavy R atoms 0
Frequency ~7.1%
Glycine is the only achiral standard amino acid (no stereocenters) and the smallest. Because its side chain is just a hydrogen atom, glycine is uniquely flexible. It is also essential to collagen's triple helix, where every third residue must be glycine to fit inside the tightly packed helix (the Gly-X-Y repeat). Glycine was also among the amino acids found in meteorites and has been detected in comets.

Proline

Proline structure
Pro | P | MW: 115.13 Da
Charge Neutral
Heavy R atoms 3
Frequency ~4.7%
Proline is the only standard amino acid with a secondary amine: its side chain cyclizes back onto the backbone nitrogen, locking the backbone into a rigid conformation. This rigidity makes proline a "helix breaker," and proline is unusual in readily adopting a cis peptide bond (~5% of the time vs. less than 0.1% for other residues). The cis-trans isomerization is so slow that dedicated enzymes (prolyl isomerases) exist to catalyze it, and it is sometimes the rate-limiting step in protein folding.

The 21st Proteinogenic Amino Acid

Selenocysteine (Sec, U) is a structural analog of cysteine with a selenium atom in place of sulfur. It is found across all three domains of life but is not universal (fungi and higher plants have lost it, for example). What makes it remarkable is how it ends up in proteins. Every other proteinogenic amino acid is specified by one or more sense codons, delivered by a dedicated tRNA, and charged by its own aminoacyl-tRNA synthetase. Selenocysteine breaks this pattern.

Selenocysteine has neither a codon of its own nor a dedicated aminoacyl-tRNA synthetase. It is encoded by UGA, normally one of three stop codons, and the specialized tRNA^Sec that carries it is first charged with serine by seryl-tRNA synthetase, then converted to selenocysteine directly on the tRNA by selenocysteine synthase. The ribosome only recodes UGA from “stop” to “selenocysteine” when the mRNA contains a downstream stem-loop called a SECIS element (selenocysteine insertion sequence). A specialized elongation factor (SelB in bacteria, which binds the SECIS directly; EFSec in eukaryotes, which is recruited by the SECIS-binding protein SBP2) then delivers the charged tRNA^Sec to the UGA codon.

Selenocysteine

Selenocysteine structure
Sec | U | MW: 168.05 Da
Charge Neutral
Heavy R atoms 2
Frequency < 0.01%
Selenocysteine's selenol group is a much stronger nucleophile than cysteine's thiol (pKa ~5.2 vs. ~8.3), so at physiological pH it is already deprotonated and reactive. It sits in the active sites of roughly 25 human selenoproteins, including glutathione peroxidases that protect cells from oxidative damage and thyroid hormone deiodinases that regulate thyroid function.
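The pKa comparison can be turned into numbers with the Henderson-Hasselbalch relation. This is a rough sketch: the function name is my own, the pKa values are the approximate ones quoted above, and pH 7.4 is an assumed typical physiological value. Real pKa values shift with the local protein environment.

```python
def deprotonated_fraction(pka, ph=7.4):
    """Henderson-Hasselbalch: fraction of an acid group in its deprotonated form."""
    return 1.0 / (1.0 + 10 ** (pka - ph))

# Approximate side chain pKa values quoted in the text.
selenol = deprotonated_fraction(5.2)  # selenocysteine
thiol = deprotonated_fraction(8.3)    # cysteine

print(f"Sec deprotonated at pH 7.4: {selenol:.1%}")  # about 99%
print(f"Cys deprotonated at pH 7.4: {thiol:.1%}")    # about 11%
```

Under these assumptions, nearly all selenocysteine side chains are in the reactive selenolate form, compared to roughly one in nine cysteine thiols.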

Substitution Matrices

Proteins evolve, and when we compare homologous proteins from different species we find that some amino acids substitute for each other constantly while others almost never do. The pattern is not random; it reflects which substitutions the protein can tolerate, which in turn reflects the physicochemical similarity of the amino acids involved.

The standard way to quantify this is a substitution matrix: a 20×20 table where each entry is a score for how likely it is that one amino acid replaces another in evolution. Positive scores mean “more common than chance” (the substitution is tolerated), and negative scores mean “less common than chance” (the substitution is avoided). The most widely used is BLOSUM62, derived in 1992 by Steven and Jorja Henikoff from conserved blocks of homologous protein sequences. The “62” refers to the clustering threshold: before counting substitutions, sequence pairs with ≥62% identity were clustered together so that closely-related sequences wouldn’t dominate the statistics. BLOSUM62 is the default matrix in BLAST and in most protein alignment tools.

BLOSUM scores aren’t arbitrary numbers; they’re log-odds ratios. The score for substituting amino acid $i$ with amino acid $j$ is:

\[S_{ij} = \frac{1}{\lambda} \log_2 \frac{p_{ij}}{q_i q_j}\]

where $p_{ij}$ is the observed probability that $i$ and $j$ appear aligned in conserved blocks of homologous proteins, $q_i$ and $q_j$ are the background frequencies of each amino acid in the dataset, and $\lambda$ is a scaling factor chosen to make the final scores convenient integers.

The ratio $p_{ij} / (q_i q_j)$ compares the observed substitution frequency to what you’d expect if the two amino acids paired up purely by chance. A positive $S_{ij}$ means the substitution is more common than chance (the pair co-occurs in conserved positions more often than random pairing would predict), a negative $S_{ij}$ means it’s rarer than chance (evolution avoids it), and a score of zero means observed matches expected. Taking the log turns this ratio into an additive score, so that when you score an alignment of two sequences, you can simply sum the per-position scores to get a total log-odds score for the alignment.
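Here is a small worked example of the formula in Python. The frequencies are made up for illustration (they are not BLOSUM62 data), and `lam=0.5` is an assumption that puts the scores in half-bit units.

```python
import math

def log_odds_score(p_ij, q_i, q_j, lam=0.5):
    """S_ij = (1/lambda) * log2(p_ij / (q_i * q_j)), rounded to an integer."""
    return round(math.log2(p_ij / (q_i * q_j)) / lam)

# Hypothetical background frequencies: both residues make up 5% of the dataset.
q_i = q_j = 0.05
expected = q_i * q_j  # pairing frequency expected by chance

favored = log_odds_score(2 * expected, q_i, q_j)  # seen twice as often as chance -> 2
neutral = log_odds_score(expected, q_i, q_j)      # seen exactly as often as chance -> 0
avoided = log_odds_score(expected / 4, q_i, q_j)  # seen a quarter as often -> -4

# Because scores are logs of ratios, an alignment's score is the sum over positions.
alignment_total = favored + neutral + avoided     # -2
print(favored, neutral, avoided, alignment_total)
```

Summing the three scores is equivalent to multiplying the three odds ratios (2 × 1 × 0.25 = 0.5) and taking the scaled log once, which is why log-odds scores are convenient for alignments.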

BLOSUM62 substitution scores for all 20 standard amino acids, grouped by side chain property and sorted within each group by number of side chain heavy atoms (shown in parentheses). Red = favored substitution (positive score); blue = avoided substitution (negative score). The diagonal is masked in black because self-substitution scores are always the largest.

A few patterns jump out. Within each property group, most substitutions score near zero or positive, especially between amino acids of similar size: the branched-chain hydrophobic residues (Val, Ile, Leu) interchange readily, as do Asp and Glu (the two negatively charged residues) and Lys and Arg (two of the three positively charged ones). The hydrophobic aromatic residues (Phe, Tyr, Trp) also cluster together. Between groups, the scores turn negative: substituting a small hydrophobic residue for a charged one almost never happens in conserved positions, because the physicochemical mismatch is too large.

The most striking row in the matrix belongs to tryptophan. Trp is the largest and rarest amino acid, and its substitution scores are among the most negative in the matrix. When a tryptophan appears in a conserved position in a protein, it’s almost always doing something specific; evolution rarely allows it to be swapped out.

The Genetic Code

Substitution matrices tell us which amino acids are exchangeable in protein space, but there’s a second layer we haven’t touched: the DNA itself. Every substitution that shows up in a protein first has to arise as a mutation in the gene, which means it has to be reachable by a small number of point mutations in the underlying codon. The mapping from nucleotide triplets to amino acids, the genetic code, is what determines which substitutions are “close” in mutational space and which are far away.

The standard RNA codon wheel. Read from the center outward: the first base (center), then the second base, then the third base, to find the encoded amino acid on the outer ring. Public domain via Wikimedia Commons.

Three things stand out from the standard code. First, it is redundant: 61 sense codons map to 20 amino acids, so most amino acids are encoded by multiple codons (leucine, serine, and arginine each get six; only methionine and tryptophan get exactly one). Second, the redundancy is concentrated at the third position of the codon; changing the third base often leaves the amino acid unchanged. This is the wobble position, and it acts as a buffer against point mutations in DNA: a random mutation at the third position is the mutation most likely to be silent. Third, even when a point mutation does change the amino acid, the code is structured so that the replacement is usually chemically similar. Mutations at the first position tend to swap hydrophobic residues for other hydrophobic residues; mutations at the second position are the most likely to cause a radical change in physicochemical properties, but even then the code is biased toward minimizing the damage.
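The first two observations are easy to check. The sketch below counts codons per amino acid and then, for each codon position, the fraction of single-base changes to a sense codon that leave the amino acid unchanged. All names are my own, and counting every possible base change equally is a simplifying assumption (real mutation rates are not uniform).

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Redundancy: number of codons per amino acid (stop codons excluded).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
print({aa: degeneracy[aa] for aa in "LSRMW"})  # L, S, R have 6 codons; M, W have 1

def silent_fraction(position):
    """Fraction of single-base changes at `position` (0, 1, or 2) of a sense
    codon that leave the encoded amino acid unchanged."""
    silent = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            silent += CODON_TABLE[mutant] == aa
    return silent / total

for position in range(3):
    print(f"position {position + 1}: {silent_fraction(position):.1%} silent")
```

Under this equal-weight counting, the third position accounts for the large majority of silent changes, while changes at the second position never preserve the amino acid.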

We can quantify this directly. For each pair of amino acids, we can count how many ways a single-nucleotide substitution in any of their codons converts one into the other. This gives us a 20×20 connectivity matrix that reflects which amino acids are “mutational neighbors.”

Single-nucleotide substitution paths between amino acids. Each cell counts the number of ways a single point mutation in any codon for one amino acid can produce a codon for the other. Amino acids are grouped by property (same colors as before) so that the block-diagonal structure jumps out: most single-nucleotide paths stay within a property group.
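A minimal sketch of how such a matrix could be computed is below. This is my own reconstruction of the counting rule described above, not necessarily the exact code behind the figure, and all names in it are hypothetical. Paths to or from stop codons are skipped.

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# paths[(a, b)] = number of single-base changes turning a codon for a into one for b.
paths = defaultdict(int)
for codon, aa in CODON_TABLE.items():
    if aa == "*":
        continue
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            target = CODON_TABLE[mutant]
            if target != "*" and target != aa:
                paths[(aa, target)] += 1

print(paths[("E", "V")])  # 2: GAA -> GUA and GAG -> GUG
print(paths[("M", "W")])  # 0: AUG and UGG differ at two positions
```

The Glu to Val entry includes the GAG to GUG change behind the sickle cell substitution mentioned earlier. Because every point mutation is reversible, the matrix is symmetric.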

The connectivity matrix has a clear block-diagonal structure. Most single-nucleotide paths stay within a property group, which means the genetic code has been sculpted (by natural selection, or by the ancient history of which tRNAs matched which codons) to minimize the physicochemical impact of a random mutation. This “error-minimizing” property is sometimes called the genetic code’s robustness, and it is widely considered one of the strongest pieces of evidence that the code is not arbitrary; it has been optimized.

The alignment between substitution matrices and mutational proximity is not accidental. The substitution matrices we saw in the previous section are measured from real protein evolution, which is the combined outcome of (a) which mutations occur at the DNA level and (b) which mutations survive selection at the protein level. The genetic code’s error-minimizing layout means that (a) already pre-filters for physicochemical similarity, and selection imposes the remaining constraint.


* Why alanine and not glycine, the simplest amino acid? Glycine's lack of any side chain gives it unusual backbone flexibility, so substituting glycine would change the protein's conformational dynamics, not just remove the side chain's chemistry. Alanine's methyl group constrains the backbone like a normal amino acid while contributing almost nothing chemically.
