What is ancestry?

This post is by Dr. Joe Pickrell (@joe_pickrell)

Anyone who has used commercial genetic testing products like those offered by 23andMe or AncestryDNA will be familiar with the idea of “genetic ancestry”. After you mail in a saliva kit, these companies return a report with seemingly precise numbers that tell you “what percent of your DNA” (to quote from the 23andMe report) can be traced back to different populations around the world.

At a superficial level, it seems like getting this estimate should be straightforward: look at someone’s genome, apply some fancy statistics, and out pop numbers like “20.7% British and Irish” or “40% Great Britain, 6% Ireland”. (These are numbers from my own 23andMe and AncestryDNA reports, respectively. An astute reader might be wondering: wait, shouldn’t those numbers be the same? Hold that thought.)

Once you state the problem of “ancestry inference” in more precise terms, however, you very quickly find yourself in the realm of sociology and psychology rather than statistics and genetics. In this series of blog posts, I’m going to discuss the approach to ancestry inference we’ve taken in DNA.Land, and in the process answer some frequently asked questions (as well as some infrequently asked ones) about our estimates.

But in this first post, I want to start at the beginning, with a discussion that’s broader than DNA.Land: what exactly is the goal of “ancestry inference” anyway?

What is “ancestry”?

A useful question for anyone working on algorithms for learning about ancestry from genetic data is: “How would you describe your ancestry?”. Try to answer the question yourself. Ask your friends. Bug some strangers on the Internet.

If the people you talk to are anything like the people I’ve talked to, the answers will generally break down into two broad categories:

Many people use geographic labels to describe their ancestry, often based on current political borders. E.g. “French” or “Chinese”.
Many people use ethnic labels to describe their ancestry. E.g. “Jewish” or “Caucasian”[1].

Let’s take it for granted that the “correct” definition of “ancestry” is something that aligns with these intuitive responses. This suggests that people expect a genetic “ancestry test” to predict the geographic and/or ethnic labels of their ancestors.

Unfortunately, if you sit down and try to write an algorithm to do this, you will immediately come across two huge and mostly intractable problems.

Problem #1: What time depth are we talking about?

Obviously we all have ancestors that lived at different times. You had maybe 8 ancestors living 100 years ago, but many thousands that lived 500 years ago. So whose geographic and/or ethnic labels should we try to guess — those of your ancestors living 100 years ago, or those living 500 years ago? (Or 1,000 years ago? Or…?).
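To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes about 30 years per generation (my assumption, not a measured value) and simply doubles the number of ancestors each generation. The function name is made up for illustration. The result is an upper bound: distant ancestors show up in your family tree more than once, so the number of distinct people is smaller.

```python
def ancestor_slots(years_ago: float, years_per_generation: float = 30.0) -> int:
    """Upper bound on the ancestors in the generation living about `years_ago`.

    Assumes a fixed generation length (30 years by default, an assumption)
    and two parents per person, so g generations back gives 2**g slots.
    The same person can fill several slots, so this overcounts distinct people.
    """
    generations = max(1, round(years_ago / years_per_generation))
    return 2 ** generations


for years in (100, 500, 1000):
    print(f"{years} years ago: up to {ancestor_slots(years):,} ancestors")
```

Under these assumptions you get 8 ancestors at 100 years and roughly 130,000 slots at 500 years. At 1,000 years the count runs into the billions, far more than the number of people alive at the time, which is exactly why it can only be read as an upper bound.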

A reasonable first guess is that when people talk about their ancestry they’re generally talking about recent ancestors, such that the “correct” answer to this question is something like 100 years ago. But this isn’t satisfying: in the United States there are many people whose ancestors immigrated to the country hundreds of years ago but who think of their ancestry as (for example) “British” or “Chinese” rather than “Michigander” or “Californian”.

So it’s not totally clear what time depth people generally think of when they think about their ancestry. Indeed, it seems plausible that the “correct” time depth to report in an ancestry test depends on a user’s…ancestry. This should be a hint that this is not something that can be objectively read from DNA.

Problem #2: Ancestry identifiers are influenced by social and political factors

This becomes even clearer when you notice the fundamental problem that some of these labels that we think of as “ancestry” are strongly influenced (and indeed sometimes determined) by social and political factors. Obviously no genetic markers change when someone converts to Judaism, or when the territory where someone lives is annexed by a neighboring country. But these events often have dramatic influences on how the descendants of these individuals think of their ancestry, via cultural transmission of things like languages and traditions.

Indeed, construction of a shared ancestral identity was (and remains) a method for consolidation of political power over diverse cultures (see e.g. Franco in Spain). This is largely invisible to genetics, except after hundreds or thousands of years (if shared identities influence subsequent marriage and/or migration patterns).

A solution

To get around all of these problems, what you would ideally like to have is a detailed list of your ancestors at different time depths, each labeled with their geographic location and any ethnic self-identifiers. You could then say, for example, that 100 years ago 25% of your ancestors lived in Illinois and identified as Jewish, while 500 years ago 5% of your ancestors lived in present-day Andalucia and identified as Muslim [2].
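As a toy illustration of what that ideal list would let you compute, here is a minimal sketch. The record type, the function name, and the data are all made up for the example; nothing here reflects how DNA.Land or any company actually stores or infers ancestry.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AncestorRecord:
    """One ancestor in the idealized list (hypothetical structure)."""
    years_ago: int        # time depth at which this ancestor was living
    location: str         # geographic label
    self_identifier: str  # ethnic self-identifier, if any


def label_shares(records, years_ago):
    """Fraction of ancestors at one time depth with each (location, identifier) pair."""
    at_depth = [r for r in records if r.years_ago == years_ago]
    if not at_depth:
        return {}
    counts = Counter((r.location, r.self_identifier) for r in at_depth)
    return {label: n / len(at_depth) for label, n in counts.items()}


# Invented data: 8 ancestors living 100 years ago, 2 of them in Illinois.
# "Elsewhere" and "unspecified" are placeholders, not real labels.
records = (
    [AncestorRecord(100, "Illinois", "Jewish")] * 2
    + [AncestorRecord(100, "Elsewhere", "unspecified")] * 6
)

shares = label_shares(records, 100)
print(shares[("Illinois", "Jewish")])  # 0.25
```

The point of the sketch is that the answer depends on which time depth you ask about: the same list queried at 500 years would give a completely different set of labels and proportions.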

Unfortunately, genetic tests are about as useful as Ouija boards for obtaining much of this information, so we’re going to have to compromise with some dramatic approximations [3]. Specifically, the approach taken by all of the commercial companies (and that we take as well) is to try to estimate the general geographic regions where your ancestors lived (and in a select small number of cases their ethnic identifiers) some indeterminate time in the past, probably something like a few hundred years ago.

Does this all sound a bit vague? It should, because it is. The precision suggested by these reports is an illusion: there’s plenty of wiggle room in the definition of “general geographic regions” and “some indeterminate time in the past” to allow for very different interpretations [4].

But the key is this: if we replace an impossible goal of perfectly understanding the geography and ethnicity of your ancestors with the more realistic goal of getting a general understanding about some of them, we can now make some progress. This might seem a bit disappointing, in that we’ve abandoned the exactness and objectivity that seem promised by a “genetic test”.

But in many cases even an approximate understanding can be quite meaningful. Millions of people around the world have purchased these tests. Some have uncovered aspects of their family history that were kept secret for fear of discrimination (indeed I’m one of them). Some have discovered hospital mixups that led to puzzling mismatches between their cultural and genetic ancestries. Still others have confronted the genetic legacy of slavery in their own genomes.

So these types of inference, despite their important limitations, are nothing to scoff at. In the next post, I’ll discuss the exact methodologies used for ancestry inference by the commercial companies, and then the software we developed for DNA.Land.

References:

[1] Though the Caucasus is a geographic region, the word “Caucasian” is used in the United States as an ethnic identifier approximately synonymous with “white”.

[2] You might also be interested in whether you actually inherited any genetic material from each ancestor, but let’s avoid opening that can of worms for now and assume the properties of your genealogical ancestors are the same as those of your genetic ancestors.

[3] It’s plausible that by including things like written records you might be able to do reasonably well for recent ancestors, as is done by Ancestry.com without DNA. But this of course relies on all your ancestors having lived in places where written records are available and reliable.

[4] Note the different ancestry proportions reported to me by 23andMe and AncestryDNA. Most people think of these differences as different algorithmic solutions to the same question, but it’s entirely possible that the algorithms used by the two companies are answering slightly different questions! For example, the 23andMe algorithm may be looking at slightly more recent ancestry on average than the AncestryDNA algorithm (I actually think this is the case, for what it’s worth). On this general topic it’s worth reading this great post by Debbie Kennett comparing ancestry composition results across companies.