New insights into the origin of the Indo-European languages


2023-07-27 Max Planck Institute

A hybrid hypothesis for the origin and spread of the Indo-European languages. The language family began to diverge from around 8100 years ago, out of a homeland immediately south of the Caucasus. One migration reached the Pontic-Caspian and Forest Steppe around 7000 years ago, and from there subsequent migrations spread into parts of Europe around 5000 years ago.




Language trees with sampled ancestors support a hybrid model for the origin of Indo-European languages

Paul Heggarty, Cormac Anderson, Matthew Scarborough, Benedict King, Remco Bouckaert, Lechosław Jocz, Martin Joachim Kümmel, Thomas Jügel, Britta Irslinger, Roland Pooth, Henrik Liljegren, Richard F. Strand, Geoffrey Haig, Martin Macák, Ronald I. Kim, Erik Anonby, Tijmen Pronk, Oleg Belyaev, Tonya Kim Dewey-Findell, Matthew Boutilier, Cassandra Freiberg, Robert Tegethoff, Matilde Serangeli, Nikos Liosis, Krzysztof Stroński, Kim Schulte, Ganesh Kumar Gupta, Wolfgang Haak, Johannes Krause, Quentin D. Atkinson, Simon J. Greenhill, Denise Kühnert, Russell D. Gray

Science, published 28 Jul 2023


Editor’s summary

Languages of the Indo-European family are spoken by almost half of the world’s population, but their origins and patterns of spread are disputed. Heggarty et al. present a database of 109 modern and 52 time-calibrated historical Indo-European languages, which they analyzed with models of Bayesian phylogenetic inference. Their results suggest an emergence of Indo-European languages around 8000 years before present. This is a deeper root date than previously thought, and it fits with an initial origin south of the Caucasus followed by a branch northward into the Steppe region. These findings lead to a “hybrid hypothesis” that reconciles current linguistic and ancient DNA evidence from both the eastern Fertile Crescent (as a primary source) and the steppe (as a secondary homeland). —SNV

Structured Abstract


Almost half the world’s population speaks a language of the Indo-European language family. It remains unclear, however, where this family’s common ancestral language (Proto-Indo-European) was initially spoken and when and why it spread through Eurasia. The “Steppe” hypothesis posits an expansion out of the Pontic-Caspian Steppe, no earlier than 6500 years before present (yr B.P.), and mostly with horse-based pastoralism from ~5000 yr B.P. An alternative “Anatolian” or “farming” hypothesis posits that Indo-European dispersed with agriculture out of parts of the Fertile Crescent, beginning as early as ~9500 to 8500 yr B.P. Ancient DNA (aDNA) is now bringing valuable new perspectives, but these remain only indirect interpretations of language prehistory. In this study, we tested between the time-depth predictions of the Anatolian and Steppe hypotheses, directly from language data. We report a new framework for the chronology and divergence sequence of Indo-European, using Bayesian phylogenetic methods applied to an extensive new dataset of core vocabulary across 161 Indo-European languages.


Previous phylolinguistic analyses have produced conflicting results. We diagnosed and resolved the causes of this discrepancy, two in particular. First, the datasets used had limited language sampling and widespread coding inconsistency. Second, some analyses enforced the assumption that modern spoken languages derive directly from ancient written languages rather than from parallel spoken varieties. Together, these methodological problems distorted branch-length estimates and date inferences. We present a new dataset of cognacy (shared word origins) across Indo-European. This dataset eliminates past inconsistencies and provides a fuller and more balanced language sample, including 52 nonmodern languages for a denser set of time-calibration points. We applied ancestry-enabled Bayesian phylogenetic analysis to test rather than enforce direct ancestry assumptions.
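As an illustration of what a cognacy dataset can look like in practice, the sketch below turns cognate-set assignments into a binary presence/absence matrix, a common input format for Bayesian phylogenetic software. It is not the authors' pipeline: the function name `binary_cognate_matrix`, the set labels "A" and "B", and the three-language, two-meaning toy data are assumptions made for this example only.

```python
def binary_cognate_matrix(cognates):
    """Convert {meaning: {language: cognate_set_id}} into binary characters.

    Returns (columns, rows): `columns` lists one (meaning, set_id) pair per
    character, and `rows` maps each language to a string of '1', '0' or '?'
    ('?' marks a meaning for which the language has no entry).
    """
    languages = sorted({lang for sets in cognates.values() for lang in sets})
    columns = sorted(
        {(meaning, set_id)
         for meaning, sets in cognates.items()
         for set_id in sets.values()}
    )
    rows = {}
    for lang in languages:
        chars = []
        for meaning, set_id in columns:
            if lang not in cognates[meaning]:
                chars.append("?")  # missing data, not absence
            else:
                chars.append("1" if cognates[meaning][lang] == set_id else "0")
        rows[lang] = "".join(chars)
    return columns, rows


# Toy data (hand-made, not taken from the paper's database):
# 'two' is inherited in all three languages; for 'dog', German "Hund" and
# Latin "canis" share an origin, while English "dog" does not.
toy = {
    "two": {"English": "A", "German": "A", "Latin": "A"},
    "dog": {"English": "A", "German": "B", "Latin": "B"},
}
columns, rows = binary_cognate_matrix(toy)
```

Each column answers a yes/no question ("does this language use a word from cognate set X for meaning Y?"), so inconsistent coding of the kind described above would directly change the characters the tree is inferred from.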


Few ancient written languages are returned as direct ancestors of modern clades. We find a median root age for Indo-European of ~8120 yr B.P. (95% highest posterior density: 6740 to 9610 yr B.P.). Our chronology is robust across a range of alternative phylogenetic models and sensitivity analyses that vary data subsets and other parameters. Indo-European had already diverged rapidly into multiple major branches by ~7000 yr B.P., without a coherent non-Anatolian core. Indo-Iranic has no close relationship with Balto-Slavic, weakening the case for it having spread via the steppe.
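The root age above is reported as a median with a 95% highest posterior density (HPD) interval. For readers unfamiliar with that summary, here is a minimal sketch of how both can be computed from a list of posterior samples: the HPD interval is taken as the narrowest window of sorted samples that covers the requested share of them. The helper names and the sample values are assumptions for illustration; the numbers are made up and are not the paper's posterior.

```python
import math


def median(samples):
    """Middle value of the samples (mean of the two middle values if even)."""
    xs = sorted(samples)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2


def hpd_interval(samples, mass=0.95):
    """Narrowest interval that contains at least `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one sample")
    k = max(1, math.ceil(mass * n))  # how many samples the interval must cover
    best_lo, best_hi = xs[0], xs[k - 1]
    # Slide a window of k consecutive sorted samples; keep the narrowest one.
    for i in range(1, n - k + 1):
        lo, hi = xs[i], xs[i + k - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi


# Made-up, skewed sample values, used only to show the calculation.
fake_samples = [0, 10, 11, 12, 13, 14, 50, 60, 70, 80]
print(median(fake_samples))             # 13.5
print(hpd_interval(fake_samples, 0.5))  # (10, 14)
```

Unlike an equal-tailed interval, the HPD interval need not be symmetric around the median, which is why the reported bounds (6740 to 9610 yr B.P.) sit at unequal distances from ~8120 yr B.P.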


Our results are not entirely consistent with either the Steppe hypothesis or the farming hypothesis. Recent aDNA evidence suggests that the Anatolian branch cannot be sourced to the steppe but rather to south of the Caucasus. For other branches, potential candidate expansion(s) out of the Yamnaya culture are detectable in aDNA, but some had only limited genetic impact. Our results reveal that these expansions from ~5000 yr B.P. onward also came too late for the language chronology of Indo-European divergence. They are consistent, however, with an ultimate homeland south of the Caucasus and a subsequent branch northward onto the steppe, as a secondary homeland for some branches of Indo-European entering Europe with the later Corded Ware–associated expansions. Language phylogenetics and aDNA thus combine to suggest that the resolution to the 200-year-old Indo-European enigma lies in a hybrid of the farming and Steppe hypotheses.

A DensiTree showing the probability distribution of tree topologies for the Indo-European language family.

The time axis shows the estimated chronology of the family’s geographical expansion and divergence, calibrated on 52 nonmodern written languages. Annotations add chronological context relative to selected archaeological cultures and expansions of significant ancestry components in the aDNA record. CHG, Caucasus hunter-gatherers; EHG, Eastern (European) hunter-gatherers; BMAC, Bactria-Margiana Archaeological Complex.


The origins of the Indo-European language family are hotly disputed. Bayesian phylogenetic analyses of core vocabulary have produced conflicting results, with some supporting a farming expansion out of Anatolia ~9000 years before present (yr B.P.), while others support a spread with horse-based pastoralism out of the Pontic-Caspian Steppe ~6000 yr B.P. Here we present an extensive database of Indo-European core vocabulary that eliminates past inconsistencies in cognate coding. Ancestry-enabled phylogenetic analysis of this dataset indicates that few ancient languages are direct ancestors of modern clades and produces a root age of ~8120 yr B.P. for the family. Although this date is not consistent with the Steppe hypothesis, it does not rule out an initial homeland south of the Caucasus, with a subsequent branch northward onto the steppe and then across Europe. We reconcile this hybrid hypothesis with recently published ancient DNA evidence from the steppe and the northern Fertile Crescent.