James Lee 0:04
I see some people here. If I had known they were going to be here, I would have told you to have them give this talk. Anyway, I'll be talking about some basics of genetics and genetic prediction.
The human genome is an object that can be viewed from various perspectives, ranging from the molecular to that of information theory. Physically speaking, the human genome is split across 23 chromosomes; as you can see, they're numbered roughly in order of decreasing length.
Now let's zoom in on just one of these chromosomes, chromosome six. For most of our purposes, we can take a very barren, stylized view of the genome, neglect all these molecules and so on that make up its constituents, and view the genome as a string of symbols, discrete slots where only one of a few possibilities can appear.
So it turns out that this is actually a real stretch of chromosome six. There are some positions in the genome that we're especially concerned with because they're what we call polymorphic, which means that the gene content that can fill up that slot can vary across individuals.
In this stretch of chromosome six, the fifth slot is one of these polymorphic sites. We call them SNPs; it rhymes with hips. You can see that there are two possible kinds of genes that can appear at the site: C and A. You can't see the text on the bottom of the slide, but this particular individual has the genotype CA at this particular SNP, while other individuals might be AA or CC. That's where we get this term polymorphic from.
Now you can see that the genome in some sense is inherited by each individual in duplicate. One half of your genetic material comes from one of your parents, the other half comes from the other parent. If both copies were passed on intact to all of your offspring, then the genome would double in size in each generation, which doesn't happen.
The question naturally arises from the genome's duplicate status: when it comes time for an organism to become a parent, which of the two genes present at any given site is transmitted to the offspring? The answer to this question was given by Mendel back in the 19th century. Mendel's first law states that upon becoming a parent, an organism passes on a randomly selected member of each gene pair to its offspring.
So the point arises when an organism becomes a parent in its own turn: which of the two genes present at any one of its sites is the one that goes on into the offspring? The answer given by Mendel is that it's essentially decided by a coin flip; each one is equally likely, and you cannot predict in advance which one it will be. In this example, it was the C allele that was transmitted, but it equally well could have been A. That's Laplace's doctrine of all possibilities equally likely, and that's Mendel's first law.
Now there's a question of what exactly is the unit that is randomly selected within each pair. It turns out that it's not quite the entire chromosome, because in this example, what is passed on to this individual's offspring is in some sense a hybrid of the chromosome sixes of the two grandparents in this pedigree. We see that from sites one through five, this DNA was contributed by one grandparent; sites six through nine were contributed by the other grandparent. The technical term is that there was a recombination event between the fifth and sixth positions on this particular chromosome in the transmission of the DNA from this parent to its offspring.
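The transmission just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a single crossover at a position you supply, and a fair coin flip for which grandparental copy comes first. The function name `make_gamete` and the nine-letter haplotypes are hypothetical and are not the actual sequence on the slide.

```python
import random

def make_gamete(hap_a, hap_b, crossover_after=None, rng=random):
    """Return one transmitted haplotype from a parent's two haplotypes.

    hap_a, hap_b: equal-length strings, one inherited from each grandparent.
    crossover_after: number of leading sites copied from the starting
        haplotype before switching to the other; None means no recombination.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have the same length")
    # Mendel's first law: which copy we start from is a fair coin flip.
    first, second = (hap_a, hap_b) if rng.random() < 0.5 else (hap_b, hap_a)
    if crossover_after is None:
        return first
    # Recombination: sites before the crossover come from one grandparent,
    # the remaining sites from the other.
    return first[:crossover_after] + second[crossover_after:]

# Hypothetical nine-site haplotypes; site 5 is the C/A SNP.
from_grandparent_1 = "GTACCTGAT"
from_grandparent_2 = "GAACATCAT"
print(make_gamete(from_grandparent_1, from_grandparent_2, crossover_after=5))
```

With `crossover_after=5`, sites one through five come from one grandparent and sites six through nine from the other, matching the recombination event between the fifth and sixth positions.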
Here are some important consequences of Mendel's laws for causal inference in genetic studies. Importantly, any genetic differences between parents and their offspring, or between different offspring of the same parents, are randomly determined. This means that the different genotypes within a family are equivalent to levels of a treatment in a randomized experiment. For example, in a GWAS with a within-family design, if there is a correlation between a SNP and a phenotype, that SNP or one very close to it must have a true causal effect on the phenotype.
Ronald Fisher, who actually introduced randomization in experimental design, drew inspiration, even for some of the terminology such as factorial, from his earlier work on genetics. In a later paper in the 1950s, when he was looking back on this, he pointed out that geneticists have been blessed by Providence with a shield against confounding factors. A more perfect control of conditions is scarcely possible than that of different genotypes appearing in the same litter.
There is another law that Mendel gave: genes at distinct positions are transmitted independently of each other. This basically says that if you have two different SNPs, what a parent transmits at one has no bearing on what the parent transmits at the other. Now it turns out that this is one of those scientific laws with exceptions. In fact, Mendel was very lucky he didn't run into any of these exceptions. Some people think he did some exploratory data analysis. Anyway, this law is true for SNPs that are far apart, for example on different chromosomes. However, for correlated SNPs that are close together, the coupling tends to persist across generations: because they are close together, it's not very likely that a recombination event will fall between them and randomize which allele is paired with which on these particular chromosomes.
That's why earlier I said that when one SNP shows a significant correlation with a phenotype, we can't actually be sure that that is the causal SNP; it might be one that is very close by and highly correlated with it. In these so-called Manhattan plots, the x-axis is just position along the genome, and the y-axis is the strength of association with a particular phenotype. This happens to be schizophrenia. You see that whenever there's a strong signal, there's actually a cluster of highly correlated signals next to it. For example, in chromosome six, you have hundreds of these signals rising up. That means there's a block of SNPs in there that are highly correlated with each other, and they all show similar correlations with the phenotype you're studying.
GWAS is pretty simple. It consists of genotyping or imputing many individuals, often in the millions now, at many SNPs, again in the millions, and testing the SNPs for a correlation with some phenotype. Imputing means taking advantage of this correlational structure to infer genotypes that you did not directly assay with your genotyping chip. In practice, for various logistical reasons, we do the testing one by one, just regressing the phenotype on each of the SNPs in turn.
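As a rough sketch of the one-SNP-at-a-time regression, the following pure-Python example regresses a simulated phenotype on each SNP in turn. The data are simulated, the function names `ols_slope` and `gwas` are hypothetical, and a real GWAS would also include covariates, standard errors, and significance testing, none of which are shown here.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x (simple regression with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def gwas(genotypes, phenotype):
    """Regress the phenotype on each SNP in turn; return one slope per SNP.

    genotypes: list of per-SNP lists, each holding the 0/1/2 allele counts
        of every individual.
    """
    return [ols_slope(snp, phenotype) for snp in genotypes]

# Toy data: 2,000 individuals and 3 SNPs; only SNP 0 has a true effect (0.5).
rng = random.Random(42)
n = 2000
genotypes = [[rng.randint(0, 1) + rng.randint(0, 1) for _ in range(n)]
             for _ in range(3)]
phenotype = [0.5 * genotypes[0][i] + rng.gauss(0, 1) for i in range(n)]
print([round(b, 2) for b in gwas(genotypes, phenotype)])
```

The first slope should land near the simulated effect and the other two near zero, up to sampling noise.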
If you have what we call a hit, which means a SNP that clears a certain threshold of statistical significance, and you graph the conditional means, this is sort of what it looks like. We designate one allele as the increasing allele or plus allele. With each additional copy of it you have in your genotype, going from zero to two, if the phenotypic conditional average increases by enough to clear the threshold of statistical significance, we call that a hit. Notice the scale on the y-axis is somewhat realistic. Usually in GWAS, we have very small effect sizes.
To answer the question: what if the three means are clearly different from each other but they're not on a line? There are methods to test for that. You could add an extra degree of freedom to your statistical test of association to address that possibility. That's all I'll say about that now.
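One simple way to look at whether the three conditional means lie on a line is to compare the heterozygote mean with the midpoint of the two homozygote means. The sketch below does that with a large-sample standard error. It is an illustrative stand-in, not the specific test used in any particular GWAS; the function name `dominance_deviation` and the simulated data are hypothetical.

```python
import random
from statistics import mean, variance

def dominance_deviation(genotypes, phenotype):
    """Deviation of the heterozygote mean from the midpoint of the two
    homozygote means, with a large-sample standard error."""
    groups = {0: [], 1: [], 2: []}
    for x, y in zip(genotypes, phenotype):
        groups[x].append(y)
    group_mean = {g: mean(v) for g, v in groups.items()}
    # Variance of each conditional mean: sample variance over group size.
    mean_var = {g: variance(v) / len(v) for g, v in groups.items()}
    deviation = group_mean[1] - (group_mean[0] + group_mean[2]) / 2
    std_error = (mean_var[1] + (mean_var[0] + mean_var[2]) / 4) ** 0.5
    return deviation, std_error

# Simulated purely additive SNP: the deviation should be close to zero.
rng = random.Random(7)
x = [rng.randint(0, 1) + rng.randint(0, 1) for _ in range(5000)]
y = [0.3 * g + rng.gauss(0, 1) for g in x]
deviation, std_error = dominance_deviation(x, y)
print(round(deviation, 3), round(std_error, 3))
```

A deviation that is large relative to its standard error would suggest the three means are not on a straight line.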
Although we call GWAS a genome-wide study, there are some SNPs that are typically not included. At a given SNP, the two alleles have what we call allele frequencies. An allele's frequency is just the number of genes in the population at this particular SNP that fall into that allelic type, divided by the total number of genes in the population at that SNP. Say your population is 100 people and the alleles at this SNP are G and C. In 100 people there are 200 genes, because each person has two genes, one from each parent. If you count 120 G and 80 C, we say the frequency of G is 0.6 and the frequency of C is 0.4.
In GWAS, we mostly focus on what we call common SNPs, that is, SNPs where both alleles have a frequency of at least 0.01, or equivalently where the minor allele frequency is at least 0.01. There are various reasons for that, including statistical power. When a SNP is not a common SNP, we call it a rare SNP, and the statistical power to detect an association can go down precipitously.
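The allele frequency arithmetic can be checked with a short Python sketch. The breakdown into 40 GG, 40 GC, and 20 CC individuals is a hypothetical one chosen only because it yields the 120 G and 80 C counts from the example; the function name `allele_frequencies` is likewise illustrative.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Count every allele across all genotypes and divide by the total.

    genotypes: iterable of two-letter strings such as "GC", one per person.
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}

# One hypothetical way to get 120 G and 80 C among 100 people.
population = ["GG"] * 40 + ["GC"] * 40 + ["CC"] * 20
print(allele_frequencies(population))  # {'G': 0.6, 'C': 0.4}
```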
I talked earlier about Mendel's first law and causal inference. In fact, most GWAS do not employ families. Logistically, it is more difficult to sample family units, trios or sibling pairs. Most GWAS instead employ unrelated individuals. However, often we can test or estimate these aggregate quantities. Here we have these ratios of within-family to standard polygenic score estimates, which roughly can be thought of as the average ratio over all SNPs of the unconfounded within-family estimate to the possibly confounded estimate derived from unrelated individuals. You can see that years of education is a trait with this ratio being somewhat low, which implies that there is some kind of confounding in the population GWAS. Interestingly, we can estimate these quantities with a pretty small standard error.
Let me talk about some of the mechanisms through which genetic variation might bring about phenotypic variation. Certain regions of the genome serve as what we might call templates or recipes for the construction of special molecules called proteins. Nowadays we call these special regions genes, although that terminology is a bit confusing since before Watson and Crick we used to use the word genes to mean any token of genetic material in any region of the genome. A more useful term in this context is cistron, which I've actually never heard anyone say, but it's still in the glossaries of textbooks, so why don't we just keep using it. Cistrons code for proteins. Proteins are the basic functional and structural units of your body. They include enzymes that speed up critical biochemical reactions, such as degradation of neurotransmitters no longer needed, and basic components of cellular machinery, such as neurotransmitter receptors.
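To make the idea of a within-family to population ratio concrete, here is a toy single-SNP simulation in Python. It is a sketch under loudly stated assumptions, not the analysis behind the slide: two hypothetical subpopulations differ both in allele frequency and in an environmental shift, which confounds the estimate from unrelated individuals, while sibling differences remove everything the siblings share. All names and parameter values are made up for illustration.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x (simple regression with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def simulate_sibling_pairs(n_families, true_effect, rng):
    """Return [(x1, y1), (x2, y2)] per family for one SNP."""
    families = []
    for _ in range(n_families):
        # Hypothetical confounding: group B has a higher allele frequency
        # and a higher environmental mean.
        in_group_b = rng.random() < 0.5
        freq = 0.7 if in_group_b else 0.3
        env_shift = 1.0 if in_group_b else 0.0
        mother = [int(rng.random() < freq), int(rng.random() < freq)]
        father = [int(rng.random() < freq), int(rng.random() < freq)]
        siblings = []
        for _ in range(2):
            # Each parent transmits one of its two alleles at random.
            x = rng.choice(mother) + rng.choice(father)
            y = true_effect * x + env_shift + rng.gauss(0, 1)
            siblings.append((x, y))
        families.append(siblings)
    return families

rng = random.Random(2024)
families = simulate_sibling_pairs(20_000, true_effect=0.2, rng=rng)

# Population estimate: one sibling per family, treated as unrelated individuals.
population = ols_slope([f[0][0] for f in families], [f[0][1] for f in families])
# Within-family estimate: regress sibling phenotype differences on
# sibling genotype differences.
within = ols_slope([f[0][0] - f[1][0] for f in families],
                   [f[0][1] - f[1][1] for f in families])
print(round(population, 3), round(within, 3), round(within / population, 3))
```

In this setup the within-family slope should sit near the simulated effect of 0.2 while the population slope is inflated, so the ratio falls well below one.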
Most of the cells in your body have their own copy of your genome. You can think of each cell as a factory that uses its copy of the genome as a blueprint for cranking out whatever proteins it needs to do its job. If two people have different genotypes at a SNP that is inside a cistron, they might end up with different variants of the same protein. If you actually string out the protein, you see all its components; the two versions are actually different, and those two versions might have different properties. However, cistrons actually make up less than 10 percent of the human genome. Most SNPs do not fall inside one of them, and so most SNPs almost certainly do not act by qualitatively altering the encoded protein. Most SNPs probably act through some kind of regulatory mechanism, which just means some kind of subtle effect on the abundance or timing of the protein's expression inside the cell.
We only look at common SNPs typically in GWAS. There are about eight million of these in Europeans. If you go all the way down to rare SNPs where the minor allele frequency (MAF) is down to 10 to the negative 3, 10 to the negative 4, 10 to the negative 6, we're talking about more than 100 million SNPs. Common SNPs are basically a subset of all the SNPs we have. This is why when we estimate heritability from DNA-level data using techniques such as GREML, it is not necessarily the case, and empirically it is not the case, that these heritability estimates you get are equal to those we get from other designs using, for example, twins or other kinships. In fact, typically when you use these DNA-based methods to estimate heritability, you get back numbers that are between a third and two-thirds of what you get from twin studies. It's somewhat of an open issue whether we can actually close that gap by just including more SNPs in our studies.
When we do a GWAS, we often output the results in what we call summary statistics. Typically a summary statistic file has way more columns than this, but these are probably the most important ones. You get a unique identifier for the SNP, its minor allele frequency, which allele is the minor allele that you're counting in that column, some kind of statistic that shows how strongly associated the SNP is with the phenotype, and so on. This is what we use, 99 percent of the time, to do various downstream analyses, including the construction of polygenic scores, which I suppose is the most important thing for me to define.
A polygenic score: if you have an individual in some data set and that individual has genetic data, you have this equation that says the phenotype is the sum of a genetic part and a residual. The genetic part itself is a sum over all SNPs. If you actually have the X's for some individual and you also have the GWAS data, so you convert the Z statistics from a table like the one I just showed you into alpha hats, what you get will be an estimated polygenic score for that individual. Then you can use those polygenic scores for further analysis. That's all I have prepared.
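The construction can be sketched as follows: the estimated polygenic score is the sum over SNPs of each allele count X multiplied by its estimated effect alpha hat. The talk does not say how the Z statistics are converted into alpha hats; the conversion below is one common approximation that assumes a standardized phenotype and small per-SNP effects, and it is offered as an assumption rather than the speaker's method. The summary statistics, sample size, and function names are hypothetical.

```python
import math

def z_to_alpha_hat(z, maf, n):
    """Approximate per-allele effect from a GWAS Z statistic.

    Assumes a standardized phenotype and a small per-SNP effect, so that
    alpha_hat ~= z / sqrt(2 * maf * (1 - maf) * (n + z**2)).
    """
    return z / math.sqrt(2.0 * maf * (1.0 - maf) * (n + z * z))

def polygenic_score(allele_counts, alpha_hats):
    """Sum of allele counts (0, 1, or 2) weighted by estimated effects."""
    if len(allele_counts) != len(alpha_hats):
        raise ValueError("need one weight per SNP")
    return sum(x * a for x, a in zip(allele_counts, alpha_hats))

# Hypothetical summary statistics: (Z statistic, minor allele frequency).
summary_stats = [(5.2, 0.30), (-4.1, 0.12), (3.3, 0.45)]
gwas_sample_size = 100_000
alpha_hats = [z_to_alpha_hat(z, maf, gwas_sample_size)
              for z, maf in summary_stats]

individual = [2, 0, 1]  # copies of the counted allele at each SNP
print(polygenic_score(individual, alpha_hats))
```

This naive sum ignores correlation between nearby SNPs; practical pipelines usually adjust the weights for that, which is outside the scope of this sketch.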
The W variable then, if you have the three SNPs or not. What are the values that X takes? X typically can take on the value zero, one, or two. You designate one of the two alleles at the SNP to be counted. It doesn't matter which one, although usually there's a consistent convention, say the minor allele or the derived allele versus the ancestral allele. As long as there's no ambiguity, you just designate an allele to be counted, and so X basically means how many alleles of the counted type that individual has in his genotype: zero, one, or two.
Thank you, James. First of all, don't panic. I know we're a little bit behind, but we have 20 minutes built in at the end of the schedule for exactly this, so we're absolutely fine. Everything gets shifted up by 20 minutes. Let's take any clarifying, nuts-and-bolts, basic molecular genetics questions from the audience. One question: they're using zero, one, and two. Isn't that a good linear approximation given that we know that some alleles are... it's appropriate, it's proportional to the number of this.
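The coding of X can be illustrated with a tiny Python helper. The function name `allele_count` is hypothetical, and the choice of A as the counted allele is arbitrary, in keeping with the point that any unambiguous convention works.

```python
def allele_count(genotype, counted_allele):
    """Number of copies (0, 1, or 2) of the counted allele in a genotype."""
    if len(genotype) != 2:
        raise ValueError("expected a two-allele genotype such as 'CA'")
    return sum(1 for allele in genotype if allele == counted_allele)

# Counting the A allele at the C/A SNP from earlier in the talk.
for genotype in ("CC", "CA", "AA"):
    print(genotype, allele_count(genotype, "A"))
```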
I'll give two answers. Even if the true relation between the genotypes and the phenotype is not linear, not additive, there are still theoretically important properties of the least squares approximation. It turns out that the variance in the least squares polygenic scores is this thing that we call the additive genetic variance, and that is important for various purposes in evolutionary biology and agricultural genetics. There are certain family designs that are basically estimating the additive genetic variance. So if you want to know whether your GWAS results are converging or coinciding with other ways of estimating something like the heritability of a trait, it's important to keep in mind that we're doing this linear model.
The second answer I would give is that empirically, when we test the adequacy of just an additive or linear model, it turns out to be remarkably good. For example, we've just done a GWAS where we tested for each SNP whether the three data points in fact lie on a line or not, and in a sample size of close to 3 million, we cannot detect any statistically significant deviations from the three conditional means lying on a straight line. This is for the phenotype years of education, which is kind of strange, but those are the results we have.
There's a really questionable interactions question. Interactions: for the next one could be an offer on Facebook. I would say that we could worry about non-additivity within a single SNP. We could also think about non-additivity between SNPs. Geneticists have jargon for those two cases: they call non-additivity within a SNP dominance, and they call non-additivity between SNPs epistasis. We could statistically test for epistasis. It becomes a problem. We could maybe do pairs of SNPs, but could we do triplets, quadruplets? There's a combinatorial explosion, so it's an interesting thing to think about.
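The combinatorial explosion is easy to quantify. Taking the roughly eight million common SNPs mentioned earlier in the talk as a round, assumed figure, this short Python sketch counts how many distinct pairs, triplets, and quadruplets of SNPs there would be to test.

```python
from math import comb

n_snps = 8_000_000  # rough count of common SNPs quoted earlier in the talk

for k, label in [(2, "pairs"), (3, "triplets"), (4, "quadruplets")]:
    # comb(n, k) is the number of ways to choose k SNPs out of n.
    print(f"{label}: {comb(n_snps, k):.2e}")
```

Even the pairwise count is in the tens of trillions, which is why exhaustive testing for epistasis quickly becomes impractical.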
Another thing to say is that we can look at correlations between different kinships: siblings, half-siblings, cousins, parents and offspring, and identical twins, who are genetically identical. If there were strong interactions between different SNPs, what that would predict is that identical twins will be much more similar than all other types of relatives. The basic idea is that through the Mendelian randomizing process that I showed you, even siblings within the same family will be quite genetically different unless they are monozygotic twins, and those genetic differences will break up all these effects that depend on precise configurations of alleles at different SNPs.
Empirically, however, that is not observed. Usually, parents and offspring or ordinary siblings are half as similar as monozygotic twins, and then half-siblings are half as similar again. We just don't observe the pattern where most relative kinship types are not very similar in a particular trait, but then the monozygotic twins are correlated like 0.5 or 0.6 or 0.8 or something like that. This leads us to expect that these non-additive possible interactions between different SNPs do not account for a large portion of the total variance.
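The halving pattern is what a purely additive model predicts. As a minimal sketch, assuming no shared environment, no assortative mating, and no non-additive effects, the expected phenotypic correlation between two relatives is their coefficient of relatedness times the heritability. The heritability value below is hypothetical.

```python
# Coefficient of relatedness: expected fraction of segregating genetic
# material shared by descent.
RELATEDNESS = {
    "monozygotic twins": 1.0,
    "full siblings": 0.5,
    "parent and offspring": 0.5,
    "half-siblings": 0.25,
    "first cousins": 0.125,
}

def expected_correlation(relatedness, heritability):
    """Expected phenotypic correlation under a purely additive model with
    no shared environment and random mating (simplifying assumptions)."""
    return relatedness * heritability

h2 = 0.6  # hypothetical heritability, for illustration only
for kinship, r in RELATEDNESS.items():
    print(f"{kinship}: {expected_correlation(r, h2):.3f}")
```

Strong epistasis would instead show up as monozygotic twin correlations far exceeding twice the sibling correlation.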