Data Transformations

This page presents the results of a series of data transformation tests carried out at ICS-FORTH in the framework of the Harmony + CIMI Collaboration: Interoperability and metadata vocabularies. Data formats from different organisations were mapped to a simple XML DTD compatible with the CIDOC CRM. Sample data were then transformed automatically, using commercial tools, into instances of this DTD. These instances are best viewed with a style sheet or XSL file like those provided here.

Alternatively, instead of a DTD we could have used an RDFS formulation of the CIDOC CRM. The respective RDFS instances could be formally validated against the CIDOC CRM, as RDFS allows the IsA hierarchies of the CIDOC CRM to be implemented. The logic of the transformation is, however, identical for this DTD and for a proper RDF Schema; only the output syntax would differ. We have chosen a DTD here because it is easier to display and to explain as an example.

The CIDOC CRM Entity DTD can be regarded as a new approach to the definition of data transport formats. Let us assume that all properties of the CIDOC CRM were declared at the highest entity in the IsA hierarchy. Via inheritance, all properties would then appear at the entities they were intended for, and at all others as well. Since the CIDOC CRM properties are optional, they need not be used, in particular not those that now appear at entities which should not have such a property. Therefore a correct instance of the full CIDOC CRM is also a correct instance of this simplified model. Moreover, the entities and properties it instantiates are consistent with respect to both models. The simplified model, however, does not provide any guidance on the correct combination of properties and entities. It is therefore a "transport model" rather than a "validation model".
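The difference between the two models can be illustrated with a minimal Python sketch. The class names, property names and domains below are a hypothetical miniature, not copied from the CIDOC CRM definition; the point is only that an instance accepted by the strict model is also accepted once every property is moved to the root, while the reverse does not hold.

```python
# Hypothetical miniature IsA hierarchy (child -> parent); illustrative only.
PARENT = {
    "Entity": None,
    "Event": "Entity",
    "Acquisition": "Event",
    "Physical Object": "Entity",
}

# Strict ("validation") model: each property is declared at one class.
STRICT_DOMAIN = {
    "has note": "Entity",
    "took place at": "Event",
    "transferred title to": "Acquisition",
}

# Simplified ("transport") model: every property is declared at the root.
TRANSPORT_DOMAIN = {prop: "Entity" for prop in STRICT_DOMAIN}


def is_a(cls, ancestor):
    """Return True if cls equals ancestor or inherits from it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT[cls]
    return False


def is_valid(instance, domain):
    """Check that every property used is allowed at the instance's class."""
    cls, props = instance
    return all(p in domain and is_a(cls, domain[p]) for p in props)


good = ("Acquisition", ["took place at", "transferred title to"])
odd = ("Physical Object", ["transferred title to"])

# A strictly valid instance stays valid in the transport model.
print(is_valid(good, STRICT_DOMAIN), is_valid(good, TRANSPORT_DOMAIN))
# The transport model also accepts a combination the strict model rejects.
print(is_valid(odd, STRICT_DOMAIN), is_valid(odd, TRANSPORT_DOMAIN))
```

The second case shows why the simplified model can transport data but cannot by itself guarantee that properties and entities are combined correctly.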

Based on these considerations, we have defined a recursive DTD in which the elements correspond to properties rather than entities. All properties are allowed for the "root" entity and within each element representing a property, except for properties with a primitive value (string, time, number). The "forward" and "backward" uses of a property are defined as different elements. The classification of a node is described by another element called "in class". The DTD can thus be used to transport valid instances of the CIDOC CRM, e.g. of an RDFS formulation of the CIDOC CRM.
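As a rough illustration of this recursive structure, the following Python sketch builds a small instance with the standard xml.etree.ElementTree module. All element names (entity, in_class, was_acquired_by, carried_out_by, has_note) are assumptions made for the example; the actual element names of the DTD are not reproduced on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; elements in PRIMITIVE hold a plain string
# and therefore may not contain further property elements.
PRIMITIVE = {"in_class", "has_note"}


def add_property(parent, tag, in_class=None, text=None):
    """Append a property element; nest a classified node or store a primitive."""
    if parent.tag in PRIMITIVE:
        raise ValueError(f"<{parent.tag}> holds a primitive value and cannot nest")
    elem = ET.SubElement(parent, tag)
    if tag in PRIMITIVE:
        elem.text = text
    else:
        # The value of the property is a node, classified by "in_class".
        ET.SubElement(elem, "in_class").text = in_class
    return elem


root = ET.Element("entity")
add_property(root, "in_class", text="Physical Object")
# "Backward" direction of a property, written as an element of its own.
event = add_property(root, "was_acquired_by", in_class="Acquisition")
# "Forward" direction of another property, nested recursively.
add_property(event, "carried_out_by", in_class="Actor")
add_property(event, "has_note", text="Acquired in the field")

print(ET.tostring(root, encoding="unicode"))
```

Every non-primitive property element can again carry any property, which is what makes the structure recursive; only the primitive-valued elements terminate the nesting.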

The DTD is based on the test CIDOC CRM version 3.0. As these results are quite new, we apologize for possible minor inconsistencies between the latest versions of the CIDOC CRM, the DTD, and the transformed data. We shall try to eliminate those in the days to come.

For presentation purposes we have created a first XSL and CSS presentation format. It simulates the format of the "CIDOC CRM example" and leads to a readable form, which can certainly be improved. The CSS form necessarily relies on ":before" pseudo-elements to show the name of an element. This feature is not supported by Microsoft Internet Explorer, though it is by some XML editors; therefore we also provide the XSL form. Interestingly, with these simple means any data transformed into a CIDOC CRM-compatible form can be displayed automatically in a readable form!

Science Museum of London

Transformations of sample data of the Science Museum of London to the CIDOC CRM are described in the following document:
Martin Doerr & Iraklis Karvasonas, Converting object documentation into a CRM-compatible XML form using Data Junction 7.5, ICS-FORTH, Heraklion, Greece, May 2001. Available: word file (128 Kb).

Harmony + CIMI Tests

In the framework of the Harmony + CIMI Collaboration: Interoperability and metadata vocabularies, the extended CIDOC CRM has been the basis for data transfer experiments mapping museum data of 4 organisations to the CIDOC CRM (National Museum of Denmark, AMOL, RLG, The John Clayton Herbarium of The Natural History Museum, London). ICS-FORTH is assisting Harmony in this collaboration with respect to the use of the CIDOC CRM.

Data sample from the National Museum of Denmark (NMD)
The current schema of the NMD database GENREG is shown graphically on the following images created with ACCESS: 

This file shows the ACCESS representation of the data dictionary of the NMD database with comments about the mapping to the CIDOC CRM (word file 682 Kb). The GENREG model is event-centric. As is reasonable in a relational implementation, it keeps the number of tables small. Fine-grained distinctions between events, as made in the CIDOC CRM, are therefore expressed by types of events and types of roles. Naturally, there is no built-in mechanism to constrain event types to the allowed roles. Such a service could be implemented using the CIDOC CRM. In this mapping we have not analyzed in depth all types of events used in the NMD database in order to achieve an optimal mapping to the corresponding CIDOC CRM subclasses of Event; the aim of this test was to demonstrate feasibility. It shows that, in general, a mapping is based both on the input schema and on the type definitions used in the input data.
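A type-driven mapping of this kind might look like the following Python sketch. The event-type codes, class labels and role tables are hypothetical placeholders, not taken from GENREG or from the mapping document; the sketch only shows how a type value in the data selects the target class, and how a class-level table could supply the role constraint that the relational schema lacks.

```python
# Hypothetical event-type codes and class labels, for illustration only.
EVENT_TYPE_TO_CLASS = {
    "acquisition": "Acquisition",
    "production": "Production",
}
FALLBACK_CLASS = "Event"  # generic class when no finer mapping was analysed

# Hypothetical constraint table: which role types each class accepts.
ALLOWED_ROLES = {
    "Acquisition": {"acquirer", "former owner"},
    "Production": {"maker"},
    "Event": {"participant"},
}


def map_event(event_type, role_types):
    """Map one event row to a class and report roles the class does not allow."""
    crm_class = EVENT_TYPE_TO_CLASS.get(event_type.strip().lower(), FALLBACK_CLASS)
    rejected = sorted(set(role_types) - ALLOWED_ROLES[crm_class])
    return crm_class, rejected


print(map_event("Acquisition", ["acquirer", "maker"]))
print(map_event("Exhibition", ["participant"]))
```

Event types without an entry fall back to the generic class, which mirrors the situation described above in which not every event type was analyzed in depth.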

A peculiarity of the NMD data is the default event of classification and measurement: unless otherwise specified, classification is implied in the "use event" (Brug), and measurement in the acquisition event. We have traced these cases and interpreted those events as multiple instantiations of both implied CIDOC CRM classes. Note that we regard a collection as a physical object, similar to a set of chessmen, a bikini, or a set of plates. The argument is that a collection has a total weight, can be destroyed, and shares a common life-cycle. The coming and going of parts is not unusual for other objects either; just look at your computer.
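The handling of the default events can be sketched in Python as follows. Only "Brug" is a name from the NMD data; the other event-type code and all class labels are hypothetical, and reading "unless otherwise specified" as "no explicit event of the implied class exists for the record" is an assumption of this sketch.

```python
# "Brug" (the use event) is named in the NMD data; the class labels and the
# "acquisition" code are hypothetical placeholders.
BASE_CLASS = {"Brug": "Activity", "acquisition": "Acquisition"}
IMPLIED_CLASS = {"Brug": "Classification", "acquisition": "Measurement"}


def classes_for(event_type, explicit_event_classes):
    """Return the classes an event instantiates.

    The implied default class is added unless the record already contains
    an explicit event of that class (assumed reading of the default rule).
    """
    classes = [BASE_CLASS[event_type]]
    implied = IMPLIED_CLASS.get(event_type)
    if implied is not None and implied not in explicit_event_classes:
        classes.append(implied)
    return classes


# No explicit classification recorded: the use event is instantiated twice.
print(classes_for("Brug", set()))
# An explicit classification exists: the use event keeps its base class only.
print(classes_for("Brug", {"Classification"}))
```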

Here is the result of the transformation: data from the ethnographic collection of the NMD, with embedded images. This file must be viewed with an XSL-enabled viewer.

NMD sample in CIDOC CRM form: (xml file 273 Kb)* 

Data sample from the Australian Museums On-Line (AMOL)
The schema of the Australian Museums On-Line (AMOL) database is a flat list of attributes shown in: AMOL schema (word file 59 KB).

The fields of the AMOL schema have only a loose semantic connection to the data in them. They are more on the level of a document structure than of a conceptual model. Therefore a direct mapping of AMOL field semantics to CIDOC CRM notions is not possible, except for a few fields. The Clayton example below is just the opposite: all data fields can be interpreted with high precision, but they provide little structure. They do, however, allow for a completely automatic data transformation to the CIDOC CRM. In a first step we have mapped all such structuring fields of the AMOL data to the CIDOC CRM "has note" property, and the interpretable fields to the respective CIDOC CRM properties.
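This first step can be pictured with a short Python sketch. The field names and the two specific target properties are hypothetical, since the AMOL schema is not reproduced here; only the "has note" fallback is taken from the description above.

```python
# Hypothetical field names and target properties; only "has note" is taken
# from the text above.
FIELD_TO_PROPERTY = {
    "title": "has title",
    "object_number": "is identified by",
}


def map_record(record):
    """Map interpretable fields to properties and all others to "has note"."""
    statements = []
    for field, value in record.items():
        if not value:
            continue  # skip empty fields
        prop = FIELD_TO_PROPERTY.get(field)
        if prop is not None:
            statements.append((prop, value))
        else:
            # Keep the field name so the structure of the source is not lost.
            statements.append(("has note", f"{field}: {value}"))
    return statements


sample = {
    "title": "Model steam engine",
    "object_number": "A-17",
    "description": "Brass, made by a local workshop",
    "remarks": "",
}
for statement in map_record(sample):
    print(statement)
```

Prefixing the note with the source field name is a design choice of this sketch: it preserves the document structure of the record even though its semantics are not interpreted.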

Here is the result of the automatic transformation: data from the Australian Museums On-Line (AMOL), with embedded images. This file must be viewed with an XSL-enabled viewer.

  AMOL sample in CIDOC CRM form, part 1: (xml file 83 Kb)*
  AMOL sample in CIDOC CRM form, part 2: (xml file 86 Kb)*
  AMOL sample in CIDOC CRM form, part 3: (xml file 95 Kb)*
  AMOL sample in CIDOC CRM form, part 4: (xml file 95 Kb)*

In a second step we have analyzed one AMOL record by hand in order to demonstrate that the meanings referred to in these records are completely covered by the CIDOC CRM. A satisfactory automatic transformation of the AMOL data to the CIDOC CRM could be achieved by the use of text parsers based on heuristics and on comparison with place name and person name authorities, as is usual in data mining and citation index generation. This was, however, beyond the resources we could assign to this test. The complexity of such an analysis could be greatly reduced if a certain discipline of separating person names, organisation names and place names were applied. The field "subject" exhibits a certain polysemy depending on object type, which we could have analyzed better. How to model "subject" in the librarians' sense is an issue still under discussion in the CIDOC CRM.
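A very reduced sketch of the authority-based approach, which was not carried out in this test, is given below in Python. The two authority lists are tiny hypothetical stand-ins for real person and place name authorities, and plain whole-word matching is a deliberately naive heuristic chosen for the example.

```python
import re

# Tiny hypothetical stand-ins for real name authorities (lower-cased).
PERSON_AUTHORITY = {"john clayton"}
PLACE_AUTHORITY = {"london", "heraklion"}


def tag_names(text):
    """Return (kind, name) pairs for authority entries found in free text."""
    found = []
    lowered = text.lower()
    for kind, authority in (("person", PERSON_AUTHORITY), ("place", PLACE_AUTHORITY)):
        for name in sorted(authority):
            # Match whole words only, so "london" does not match "londoner".
            if re.search(r"\b" + re.escape(name) + r"\b", lowered):
                found.append((kind, name))
    return found


print(tag_names("Collected by John Clayton, later held in London"))
```

A real parser would additionally need heuristics for name variants and for ambiguous strings, which is where most of the effort mentioned above would go.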

Here is the result of the transformation by hand of one record from the Australian Museums On-Line (AMOL), with embedded image. This file must be viewed with an XSL-enabled viewer.

 AMOL sample in CIDOC CRM form, complete mapping: (xml file 4 Kb)*

Data sample from the John Clayton Herbarium of The Natural History Museum, London
The schema of the Clayton database is a flat list of attributes shown in: "Clayton schema".

This transformation constitutes the first test of natural history data with the CIDOC CRM. The Clayton example consists of reasoning about object types, their names and types, classification and the prototypicality of specimens. The only aspect we could identify as not yet covered by the CIDOC CRM is the "Type Specimen", which could be generalized as a property: E55 Type: has prototype (is prototype of). Otherwise, the events of classification, the distinction between names and types, and the recent (Agios Pavlos Extensions) subordination of E55 Type to CIDOC CRM Entity seemed to us satisfactory to capture this reasoning. We kindly ask natural history experts, and particularly the curators of the Clayton collection, to provide us with feedback on our interpretation of these data.

Here is the first result of the transformation: data from the John Clayton Herbarium, with embedded images. This file must be viewed with an XSL-enabled viewer.

  Clayton sample in CIDOC CRM form, part 1: (xml file 77 Kb)*
  Clayton sample in CIDOC CRM form, part 2: (xml file 76 Kb)*
  Clayton sample in CIDOC CRM form, part 3: (xml file 77 Kb)*
  Clayton sample in CIDOC CRM form, part 4: (xml file 81 Kb)*

* To view the XML files, MSIE 5+ or Netscape 6 is recommended.

A full report of the transformations will be shown on this page soon.

Martin Doerr
Chair, CIDOC CRM SIG,
June 2001