In order to place historical scholarship at the service of shared, transnational identities, our traditional emphasis on states must be replaced with a focus on the development of larger themes or topics situated within broader European landscapes over longer periods of time. These themes must be documented by mature datasets.
The available datasets form a continuum that ranges from small, highly curated (often manually curated) sets for qualitative comparative research (A-type datasets) to large, structured datasets that serve libraries and archives as authority files and are fit for quantitative research (B-type datasets). Both are rich; they simply serve different purposes. A-type and B-type datasets are especially valuable because their contents are confirmed either through editorial curation or by reference to other knowledge.
Separate from these two types are sources that result from the automated extraction of entities from full-text sources by means of data mining (C-type datasets). These local interpretation processes generate noisy but potentially still rich data, which need a further phase to mature. We refer to this maturing process as crystallisation. For each text source, a context can be generated from the knowledge available in the A-type and B-type datasets. On this basis, extracted entities can be resolved into named entities. For example, a string of text contains a name, and from its context we can establish in a semi-automated manner, with a sufficient level of certainty, that the entity is the Winston Churchill who lived from 1620 to 1688 and not the statesman who led the United Kingdom during World War II. As more and more data from different sources crystallise, the uncrystallised data from previously mined texts are revisited for a new attempt at resolution.
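The resolve-and-revisit loop described above can be sketched as follows. This is only a minimal illustration under strong assumptions: the context is reduced to the year of the source, a mention is resolved only when exactly one authority record fits, and all class names, function names and identifiers are hypothetical. The person "Jan Voorbeeld" is an invented placeholder, not a historical record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Person:
    ident: str   # hypothetical authority identifier
    name: str
    born: int
    died: int


@dataclass
class Mention:
    text: str                       # name string mined from a full-text source
    source_year: int                # context: the year the source was written
    resolved: Optional[str] = None  # authority identifier once crystallised


def candidates(mention: Mention, authority: List[Person]) -> List[Person]:
    """Authority records with a matching name who were alive in the source year."""
    return [p for p in authority
            if p.name == mention.text and p.born <= mention.source_year <= p.died]


def crystallise(mentions: List[Mention], authority: List[Person]) -> int:
    """Resolve each open mention that has exactly one candidate.

    Returns the number of mentions newly resolved in this pass.
    """
    newly_resolved = 0
    for mention in mentions:
        if mention.resolved is None:
            found = candidates(mention, authority)
            if len(found) == 1:
                mention.resolved = found[0].ident
                newly_resolved += 1
    return newly_resolved


authority = [
    Person("churchill-1620", "Winston Churchill", 1620, 1688),
    Person("churchill-1874", "Winston Churchill", 1874, 1965),
]
mentions = [
    Mention("Winston Churchill", 1661),
    Mention("Jan Voorbeeld", 1661),  # invented placeholder, unknown at first
]

first_pass = crystallise(mentions, authority)

# New knowledge arrives from another source; earlier mentions are revisited.
authority.append(Person("voorbeeld-1", "Jan Voorbeeld", 1630, 1690))
second_pass = crystallise(mentions, authority)
```

A production pipeline would presumably weigh many more contextual signals (place, occupation, correspondents) and keep a confidence score rather than a yes/no decision; the sketch shows only the control flow of revisiting unresolved mentions.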
The crystallised C-type datasets are combined with the A-type and B-type datasets to form a body of knowledge. Identified data objects specifying the agents (who), objects (what), times (when) and places (where) constitute one part, while the links and relationships between these entities, whether already present in the original datasets or newly inferred, complete it. The full body is known as a Knowledge Graph, capable of semantically characterizing historical observations: ‘who did what, when and where?’ For example, ‘<Rembrandt van Rijn> painted the <Night Watch> in <1642> in <Amsterdam, the Dutch Republic>’. Successive observations of this type help form the basis for interpretative, historical narratives.
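One way to picture such an observation is as a small record that can be flattened into graph edges. The sketch below is illustrative only: the class name, field names and edge labels are assumptions made for this example, not a CommonPlace data model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Observation:
    who: str    # agent
    did: str    # action linking the agent to the object
    what: str   # object
    when: str   # time
    where: str  # place

    def edges(self, obs_id: str) -> List[Tuple[str, str, str]]:
        """Flatten the observation into (subject, predicate, object) edges."""
        return [
            (obs_id, "agent", self.who),
            (obs_id, "action", self.did),
            (obs_id, "object", self.what),
            (obs_id, "time", self.when),
            (obs_id, "place", self.where),
        ]

    def sentence(self) -> str:
        """Render the observation in the bracketed form used in the text."""
        return (f"<{self.who}> {self.did} the <{self.what}> "
                f"in <{self.when}> in <{self.where}>")


night_watch = Observation(
    who="Rembrandt van Rijn",
    did="painted",
    what="Night Watch",
    when="1642",
    where="Amsterdam, the Dutch Republic",
)
graph_edges = night_watch.edges("obs:1")
```

Giving each observation its own node (here the hypothetical identifier `obs:1`) keeps the who, what, when and where together, so that several observations about the same painting or painter can later be linked through shared values.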
Agents are data objects that answer ‘who’-questions. An agent can be a person or a group of persons, such as a family, guild, institution or enterprise. To illustrate the complexity of dealing with historical people, consider the early modern period. Early modern European proper names are combinations of finite quantities of given names, surnames, and titles. This produces ambiguity in two main ways. One is when multiple people share similar names. For instance, the Thesaurus compiled by the Consortium of European Research Libraries (CERL) includes 111 different people between 1500 and 1700 by the name of ‘Johannes Fabricius’ (the Latin translation of ‘John Smith’). The other form of ambiguity occurs when the same person is referred to by many different names, sometimes in each of several different languages. For instance, within the letters of Hugo Grotius, the Swedish general, Lennart Torstensson, is referred to under 39 different name forms.
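Both kinds of ambiguity can be made visible with a pair of lookup tables built from the same attestations. The following is a minimal sketch: the person identifiers are hypothetical, and the second Torstensson spelling is an illustrative variant, not one taken from the Grotius correspondence.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (person identifier, name form as attested in a source)
attestations = [
    ("person:torstensson", "Lennart Torstensson"),
    ("person:torstensson", "Torstensonius"),       # illustrative spelling
    ("person:fabricius-a", "Johannes Fabricius"),  # hypothetical identifier
    ("person:fabricius-b", "Johannes Fabricius"),  # hypothetical identifier
]


def build_indexes(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Index attestations by normalised name form and by person."""
    by_name: Dict[str, Set[str]] = defaultdict(set)
    by_person: Dict[str, Set[str]] = defaultdict(set)
    for person_id, name_form in pairs:
        by_name[name_form.casefold()].add(person_id)
        by_person[person_id].add(name_form)
    return by_name, by_person


by_name, by_person = build_indexes(attestations)

# Ambiguity 1: one name form shared by several people.
shared_names = {name for name, ids in by_name.items() if len(ids) > 1}

# Ambiguity 2: one person attested under several name forms.
many_named = {pid for pid, forms in by_person.items() if len(forms) > 1}
```

Case folding is the only normalisation applied here; real name matching would also have to handle spelling variation, abbreviations and translation between languages.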
Disambiguating such references requires recourse to biographical authority files. But the leading authority databases (such as the CERL Thesaurus, VIAF, and their national equivalents) are typically derived from national libraries, bibliographies, and biographical dictionaries, which focus on relatively small numbers of relatively well-known individuals. Integrating far larger numbers of less famous individuals requires far larger authority files and a means of assembling all the available scraps of biographical data needed to identify and disambiguate them. The more we know about two historical John Smiths, the more confident we can be about attributing new data to one rather than the other. The CommonPlace authority files must therefore incorporate a prosopographical model capable of capturing biographical data in an organised, structured form suited to the lives of many kinds of people throughout Europe.
Objects are data objects that are typically the result of actions performed by agents. Often these are tangible artefacts from archive or museum collections, such as works of art or archaeological evidence. For innumerable objects no longer extant, we possess merely the documentation that they once existed: ships mentioned in trade logs, for instance, or paintings recorded in probate inventories. The leading European authorities (above all Europeana) are limited to existing tangible objects derived from museum collections, libraries and archives. The great majority of documented objects are therefore not represented in these repositories. Also problematic are abstract and immaterial objects, like laws or privileges. CommonPlace cannot aspire to catalogue all the tangible and intangible objects that have ever been documented, but it must offer a means of capturing selections of these data in an organised, structured form, based on a variety of data models or ontologies.
Time is represented by data objects that answer ‘when’ questions. CommonPlace covers two notions of time: (1) dates and (2) events. Both involve computational problems which the CommonPlace infrastructure will resolve efficiently.
Dates. European timekeeping before the 20th century is complicated by the simultaneous use of many different calendars. The basic difficulty is the staggered transition in different parts of Europe from the ‘old style’ Julian calendar (which began the year on March 25th) via the ‘new style’ Julian calendar (which began the year on January 1st) to the Gregorian calendar. For instance, Spain, Poland and Portugal adopted the Gregorian calendar as soon as it was introduced on 15 October 1582; Prussia followed in 1610, Alsace in 1648, Denmark in 1700, Great Britain in 1752 (with England transitioning from old style Julian while Scotland moved from new style Julian) and Russia only in 1918. Moreover, during this long period, the difference between the new style Julian and the Gregorian calendar grew from ten days initially to eleven days after 28 February 1700 and to twelve days after 28 February 1800. To complicate matters even further, many documents in the medieval and early modern periods are dated using (a) ecclesiastical calendars that reference major feasts that were not celebrated in every region at the same time, or (b) Roman and Latin calendars through which scholars displayed their prowess in classical knowledge, not to mention (c) exchanges where Jewish, Islamic or Coptic calendars were the norm, (d) calendars imposed by political regimes, like the Calendrier Républicain Français of the revolutionary years in France, or (e) those used by specific movements and societies, such as the Freemasons’ calendar. The resulting complexities greatly encumber the conversion of pre-modern to modern dates. But dates, being ultimately derived from the regular movement of celestial bodies, are regular enough to be modelled by a tool drawing on converters and uncertainty flags. Automated and semi-automated tools for such calendar conversion are indispensable if multiple, international datasets are to be rigorously reduced to a uniform chronology.
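As an illustration of the Julian cases only, the sketch below converts a Julian date to the Gregorian calendar by way of the Julian Day Number, using a standard integer formula. The function names, the result type and the three-valued `year_starts_mar25` parameter are assumptions of this example; treating an unknown year style as January 1st and raising a flag is one possible convention, not an established CommonPlace rule. Feast-based, Republican and other calendars are out of scope here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Offset between the Julian Day Number and Python's proleptic Gregorian ordinal.
JDN_TO_ORDINAL = 1721425


@dataclass(frozen=True)
class Converted:
    gregorian: date
    year_uncertain: bool  # True when the year style could shift the year by one


def julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a Julian-calendar date (year reckoned from 1 January)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def convert_julian(year: int, month: int, day: int,
                   year_starts_mar25: Optional[bool]) -> Converted:
    """Convert a Julian date as written in a source to a Gregorian date.

    year_starts_mar25: True for 'old style' years beginning on 25 March,
    False for years beginning on 1 January, None when the style is unknown.
    """
    before_lady_day = (month, day) < (3, 25)
    if year_starts_mar25 and before_lady_day:
        year += 1  # 1 January to 24 March fall in the next January-based year
    uncertain = year_starts_mar25 is None and before_lady_day
    jdn = julian_to_jdn(year, month, day)
    return Converted(date.fromordinal(jdn - JDN_TO_ORDINAL), uncertain)


# The first day skipped at the 1582 reform maps to the first Gregorian day.
reform_day = convert_julian(1582, 10, 5, False)

# 30 January 1648 in an English source (old style, year from 25 March).
old_style = convert_julian(1648, 1, 30, True)

# The gap has grown to eleven days once 28 February 1700 has passed.
after_1700 = convert_julian(1700, 3, 1, False)

# Unknown year style for a date before 25 March: converted, but flagged.
flagged = convert_julian(1648, 1, 30, None)
```

Which calendar and year style applies to a given source depends on where and when it was written, so a full converter would also need a table of regional adoption dates such as those listed above.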
Events. Events are, first, the ‘described’ events that sources invoke as points of reference, such as “after the fire of London”. Events are also historical observations in and of themselves, made up of who did what, when and where. Moreover, many large datasets consist of innumerable repetitions of the same basic event (matriculating at a university, publishing a book, or posting a letter). Rigorously modelling and populating an expanding knowledge graph of interlinked basic event types is therefore the precondition for generating new transnational narratives from masses of granular data. By ‘basic events’ we do not mean the great movements which make up the grand narrative of European history (e.g. the Protestant Reformation), with their uncertain or contested boundaries, but rather the unique dates and periods within those grand narratives (e.g. the posting of Luther’s Ninety-Five Theses), and the elemental, routinized actions central to the life of a community, repeated and documented countless times (e.g. the writing of a ledger, the granting of land by charter, marriages, births and deaths, or the printing and reprinting of a pamphlet).
Places are data objects that answer ‘where’ questions. A good example of the inherent difficulty of dealing with places is provided by the city known today by the Polish name of Wrocław. Wrocław is also known as Breslau (in German), Vratislav (in Czech), and Vratislavia (in Latin), not to mention its name in Hungarian (Boroszló), Hebrew (ורוצלב or Vrotsláv), Yiddish (ברעסלוי or Bresloi), or Silesian German (Brassel). More ambiguity results from the fact that Breslau also gave its name to one of the duchies of Silesia.
A second kind of complexity concerns the shifting location of both city and duchy within larger political entities. From 1469 to 1490, Breslau was subject to the king of Hungary. From 1490 to 1526, it was incorporated (along with Bohemia, Moravia, and Lusatia) into the lands of the Bohemian crown, which in turn formed part of the Holy Roman Empire. In 1526, the crown of Bohemia (and with it Silesia) passed to the Austrian Habsburgs. After occupation during the War of the Austrian Succession, Breslau was formally ceded with most of Silesia to the Kingdom of Prussia in 1742. With the unification of Germany in 1871, it became part of the second German Reich. After the First World War, it became the capital of the Prussian Province of Lower Silesia. After the Second World War, most of Silesia, including Breslau, was transferred to Poland, where the city remains today.
Modern gazetteers (e.g. GeoNames, Getty TGN, and even Wikidata), on which most linked data projects rely, are typically restricted to modern usage and current administrative and political arrangements. The analysis of transnational sets of historical data is impossible without taking account of shifting historical usage and political contexts.
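The Wrocław example suggests what a historical gazetteer entry needs beyond a modern one: several names for the same place, and affiliations that are valid only for a period. The sketch below is a simplified illustration with hypothetical names throughout. It records one overlord per period in half-open year intervals, ignores nested entities such as the Holy Roman Empire or the Prussian provinces, and assumes 1945 as the year of the transfer to Poland, which the text gives only as "after the Second World War".

```python
from typing import Dict, List, NamedTuple, Optional


class Affiliation(NamedTuple):
    start: int            # first year of validity (inclusive)
    end: Optional[int]    # first year no longer valid (exclusive); None = ongoing
    entity: str


# Name forms by language code, as listed in the text.
NAMES: Dict[str, str] = {
    "pl": "Wrocław",
    "de": "Breslau",
    "cs": "Vratislav",
    "la": "Vratislavia",
    "hu": "Boroszló",
}

AFFILIATIONS: List[Affiliation] = [
    Affiliation(1469, 1490, "Kingdom of Hungary"),
    Affiliation(1490, 1526, "Lands of the Bohemian Crown"),
    Affiliation(1526, 1742, "Austrian Habsburgs"),
    Affiliation(1742, 1871, "Kingdom of Prussia"),
    Affiliation(1871, 1945, "German Reich"),  # 1945 is an assumed end year
    Affiliation(1945, None, "Poland"),
]


def affiliation_in(year: int) -> Optional[str]:
    """Return the entity the city belonged to in a given year, if recorded."""
    for aff in AFFILIATIONS:
        if aff.start <= year and (aff.end is None or year < aff.end):
            return aff.entity
    return None


def is_same_place(name_a: str, name_b: str) -> bool:
    """True when both strings are known name forms of this one place."""
    known = {name.casefold() for name in NAMES.values()}
    return name_a.casefold() in known and name_b.casefold() in known


in_1500 = affiliation_in(1500)
in_1742 = affiliation_in(1742)
```

With such a record, a letter dated from "Vratislavia" in 1500 and a modern dataset entry for "Wrocław" can be linked to the same place while each retains its own political context.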