Paleontology Meets Computing


Hello. After seven-year-old Philip
saw the Jurassic Park movie, he had a dream of seeing
a real dinosaur one day. The world’s most famous paleontologist,
Jack Horner, is now 70 years old, but
he has the same dream. But how can we recreate a dinosaur
if we don’t know its genome?

Jack Horner was shy and
introverted when he was growing up. He progressed so slowly in reading and mathematics that other
kids called him dumb. Still, his high school project on
dinosaurs won the science fair, and he was admitted to
the University of Montana. After failing five
consecutive quarters, however, he dropped out. Fortunately for Horner,
he eventually found his calling. After working as a truck driver, he accepted a job as
a technician at Princeton, where he quickly established
a reputation as a brilliant researcher. He would go on to become
an advisor to Steven Spielberg for the Jurassic Park movie. By that time, Horner had learned
that he has dyslexia, a learning disorder characterized
primarily by difficulty with reading. Mathematics had been a struggle for him as well, but he was able to succeed
because paleontologists hardly ever use mathematics. However, Horner’s own
student would show that even paleontology is not immune to computing.

Fifteen years ago, Horner was exploring
his favorite dinosaur graveyard and discovered a 68-million-year-old
T-Rex fossil. He gave a chunk of this fossil
to his student, Mary Schweitzer, who de-mineralized it and
sent it to mass spectrometrist John Asara. In 2007, after analyzing thousands of
spectra, Asara, Horner, and Schweitzer published a paper in Science announcing
the discovery of T-Rex peptides. Amazingly, these T-Rex peptides were
nearly identical to chicken peptides, thus supporting the controversial
hypothesis that birds evolved from dinosaurs. Horner even published a book
called “How to Build a Dinosaur” to explain how to genetically modify
a chicken to re-create a dinosaur. Yet some scientists remain skeptical. While previous dinosaur studies
did not require much computation, the T-Rex analysis was powered by
a complex and error-prone algorithm. So how can we know which side is correct? Today, we will investigate the T-Rex
peptides by developing an algorithm for sequencing proteins rather than DNA.

We have already talked
about Frederick Sanger and his invention of DNA sequencing
technology four decades ago. Yet Sanger had already won his first
Nobel Prize six decades ago for determining the amino
acid sequence of insulin. Similar to how scientists sequence genomes,
Sanger broke multiple molecules of insulin into short peptides,
sequenced those peptides, and then assembled them into
the amino acid sequence of insulin. Protein sequencing was difficult in
the 1950s, and DNA sequencing was impossible. Today, DNA sequencing is routine, but
protein sequencing remains difficult. That is why most proteins are discovered
by first sequencing a genome and then predicting all of the genes
that this genome encodes. By translating the nucleotide sequence
of each protein coding gene into an amino acid sequence, biologists
derive a putative proteome of a species. However, different cells in an organism
express different proteins. For example, brain cells express
proteins giving rise to neuropeptides, but kidney cells do not. An important problem
in proteomics, the study of proteins, is to identify which specific proteins are present in each biological tissue
and how they interact.

Today, instead of using Sanger’s
old protein sequencing approach, biologists use mass spectrometers: expensive and very accurate
molecular scales. But modern mass spectrometers
cannot read individual amino acids. Instead, they generate a cryptic
fingerprint of each peptide, called a “mass spectrum”. Our goal today is to
decode these fingerprints.

To analyze proteins, we need to
break them into pieces and measure the masses of
the resulting fragments, using mass spectrometers, of course. Recall that different amino
acids have different masses. For example, the mass of glycine is 57 daltons,
but the mass of alanine is 71 daltons. A mass spectrometer generally breaks
each protein molecule into two parts, which we call “prefix” and “suffix” fragments, and measures their masses. It’s important to realize that, when
biologists analyze samples, there are millions of molecules of
the same peptide in the sample, and each of these molecules may break
at a different bond. Mass spectrometers measure the
masses of all these fragments. This simple scenario is a little
bit more complicated in practice because
most mass spectrometers can only measure masses of relatively short fragments,
maybe 30-40 amino acids long. To bypass this limitation,
biologists usually use proteases, such as trypsin, to break proteins
into smaller pieces called “peptides”. Afterwards, a mass spectrometer
breaks each peptide into even smaller charged fragment ions and measures the mass-to-charge ratio and
intensity of each ion. Intensity is a proxy for
the number of copies of each fragment observed in the experiment. Please note that, for
simplicity in this chapter, we will assume that all masses
are integers and all charges are one. Our goal is to reconstruct the peptide
from this rather complex fingerprint. Here is one of the T-Rex
spectra published in 2007. Try to figure out what peptide
generated this spectrum, and you will learn something
about T-Rex proteins.
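
To make the fragmentation model above concrete, here is a minimal Python sketch of the fingerprint an idealized mass spectrometer would produce. It assumes integer masses and a charge of one, as stated above, and it assumes that every molecule breaks at exactly one bond, so each break yields one prefix and one suffix fragment. The function name `fragment_masses` and the table `INTEGER_MASS` are made up for illustration. Only the glycine and alanine masses come from the text; the other entries are the standard rounded amino acid masses. Real instruments add ion-type offsets and noise that this sketch ignores.

```python
# Integer masses in daltons. Glycine (G) = 57 and alanine (A) = 71 are
# quoted in the text; the remaining entries are standard rounded values.
INTEGER_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def fragment_masses(peptide):
    """Return the sorted masses of all prefix and suffix fragments.

    Simplifying assumption: each molecule breaks at exactly one bond,
    giving one prefix fragment and one suffix fragment per bond.
    """
    masses = [INTEGER_MASS[aa] for aa in peptide]
    total = sum(masses)
    spectrum = []
    prefix = 0
    # Walk over every bond: the break falls after each amino acid
    # except the last one.
    for mass in masses[:-1]:
        prefix += mass
        spectrum.append(prefix)          # prefix fragment
        spectrum.append(total - prefix)  # matching suffix fragment
    return sorted(spectrum)


print(fragment_masses("GAS"))  # [57, 87, 128, 158]
```

For the peptide GAS, the two possible breaks give the prefixes G (57) and GA (128) and the suffixes AS (158) and S (87). Decoding a spectrum is the reverse problem: given only the list of masses, find a peptide that could have produced it.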
