In order to place historical scholarship at the service of shared, transnational identities, our traditional emphasis on states must be replaced with a focus on the development of larger themes or topics situated within broader European landscapes over longer periods of time. These themes must be documented by mature datasets.

The available datasets form a continuum. At one end are small, highly curated (often manually curated) sets built for qualitative comparative research (A-type datasets); at the other are large, structured datasets that serve libraries and archives as authority files and are suited to quantitative research (B-type datasets). Both are rich; they simply serve different purposes. A-type and B-type datasets are especially valuable because their contents are confirmed either through editorial curation or by reference to other knowledge.

Separate from these two types are datasets produced by the automated extraction of entities from full-text sources by means of data mining (C-type datasets). These local interpretation processes generate noisy but potentially still rich data that needs a further phase to mature. We refer to this maturing process as crystallisation. For each text source, a context can be generated from the knowledge available in the A-type and B-type datasets. On this basis, extracted entities can be resolved into named entities. For example, a string of text contains a name, and based on this context we can establish in a semi-automated manner, with a sufficient level of certainty, that the entity is the Winston Churchill who lived from 1620 to 1688 and not the statesman who led the United Kingdom during World War II. As more and more data from different sources crystallises, the uncrystallised data from earlier mined texts is revisited for a new attempt at resolution.
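The crystallisation loop described above can be illustrated with a small sketch. It is a toy model under stated assumptions, not the project's actual pipeline: the record schema, the identifiers (`p1`, `p2`, `p3`), the scoring weights, the threshold and the `year_hint` field are all hypothetical, chosen only to show how a mention that cannot be resolved at first is revisited once another mention in the same text has crystallised.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthorityRecord:
    """A confirmed person from an A-type or B-type dataset (hypothetical schema)."""
    ident: str
    name: str
    born: int
    died: int
    associates: frozenset = frozenset()  # idents of related records


@dataclass
class Mention:
    """A name string mined from a full-text source (C-type data)."""
    text: str
    source: str                       # id of the text the mention came from
    year_hint: Optional[int] = None   # a year found near the mention, if any
    resolved_to: Optional[str] = None


THRESHOLD = 0.5  # assumed level of "sufficient certainty"


def score(mention, record, context):
    """Score one candidate using date fit and already-crystallised context."""
    value = 0.0
    if mention.year_hint is not None and record.born <= mention.year_hint <= record.died:
        value += 0.5
    if record.associates & context:
        value += 0.5
    return value


def crystallise(mentions, authority):
    """Repeat resolution passes until a pass resolves nothing new."""
    progress = True
    while progress:
        progress = False
        for m in mentions:
            if m.resolved_to is not None:
                continue
            # Context: entities already resolved in the same source text.
            context = {o.resolved_to for o in mentions
                       if o.source == m.source and o.resolved_to is not None}
            scores = sorted(((score(m, r, context), r.ident)
                             for r in authority if r.name == m.text),
                            reverse=True)
            if not scores:
                continue
            best, best_id = scores[0]
            runner_up = scores[1][0] if len(scores) > 1 else 0.0
            # Accept only a sufficiently certain, unambiguous best candidate.
            if best >= THRESHOLD and best > runner_up:
                m.resolved_to = best_id
                progress = True
    return mentions


authority = [
    AuthorityRecord("p1", "Winston Churchill", 1620, 1688, frozenset({"p3"})),
    AuthorityRecord("p2", "Winston Churchill", 1874, 1965),
    AuthorityRecord("p3", "John Churchill", 1650, 1722, frozenset({"p1"})),
]
mentions = [
    Mention("Winston Churchill", source="text-1"),                # no date evidence
    Mention("John Churchill", source="text-1", year_hint=1704),   # illustrative hint
]
crystallise(mentions, authority)
print([m.resolved_to for m in mentions])  # ['p1', 'p3']
```

In the first pass the undated "Winston Churchill" mention stays unresolved because both candidates score equally. Once "John Churchill" has crystallised, the second pass revisits the earlier mention and the shared context tips it towards the seventeenth-century record. A real system would presumably use far richer context and keep a human in the loop for borderline scores.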
