In the HUMBOLDT project, considerable effort was spent in mapping the landscape of tools and methods that are used to harmonize spatial data or that might be applied to this process. One area of focus was the process of schema transformation. We thus conducted studies on tools in 2007 and in 2010/2011, and continued this work afterwards. In these studies, we used a framework to classify these approaches, since we felt we were comparing apples and oranges all the time! This post defines the core classification categories for schema transformation approaches, as I also presented them at the INSPIRE KEN Schema Transformation Workshop.
There are multiple aspects or dimensions we can use to classify different approaches for schema transformation. Note that my use of the term “approach” is meant to abstract from a language, a method or its implementation in a tool.
In schema transformation projects, several phases are characteristic, much like in software development or engineering projects. These include the following:
- Design: Defining the correspondences and functions to use, independently of implementation details, e.g. in matching tables or using UML
- Development: Coding in a programming language such as XQuery, or building a pipes-and-filters graph visually as in FME or Talend
- Debugging: Analysing the schema transformation’s behaviour
- Validation: Testing with the full range of data, quality assurance
- Documentation: Documenting the parameters, the process, and the limitations and assumptions of the schema transformation, and providing lineage information for the transformed dataset
- Maintenance: Keeping track of changes, iterating through the other activities for updated/new datasets, new schemas…
Different approaches put their focus on different phases. As an example, a matching table is a good design and documentation tool, but has very limited use in transformation development. We furthermore differentiate between explicit support and implicit support. Explicit support means that the approach has facilities designed to support the phase, while implicit support means the approach has facilities that can be (mis)used to support the phase. As an example of implicit support, consider XSLT: since it is plain text, the maintenance and documentation facilities familiar to programmers, such as version control and code comments, can be applied to it.
Originally used to classify computer programming languages, paradigms can help us understand what kind of patterns to use in the development phase. We differentiated two major paradigms:
- Declarative: Describe the logic of a computation without describing its control flow. Leave optimization and actual execution order to the runtime engine.
- Examples: XSLT, EDOAL/gOML
- Procedural: Describe a computation by giving its control flow through a series of functions.
- Examples: Python GeoProcessing Tool, FME
Of course, there are other approaches that don’t fit into these two, such as Aspect-Oriented Programming or Agent-based Programming. Furthermore, there are approaches that contain elements of both paradigms, offering a procedural as well as a declarative sublanguage. XQuery as a rule-based approach, for example, has a declarative and a procedural part.
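To make the paradigm distinction concrete, here is a minimal Python sketch (a hypothetical illustration, not taken from any of the tools named above): the declarative version only states which target attribute derives from which source attribute and leaves execution to a small engine, while the procedural version spells out the control flow step by step. All attribute names are invented.

```python
# Declarative: state WHAT maps to what; a generic engine decides how and
# in which order the rules are applied.
MAPPING = {
    "name": lambda f: f["NAM"],                 # rename an attribute
    "population": lambda f: int(f["POP_EST"]),  # convert string to int
}

def run_declarative(feature):
    """Minimal 'engine': apply each rule; rule order is irrelevant."""
    return {target: rule(feature) for target, rule in MAPPING.items()}

# Procedural: spell out HOW, with explicit, ordered control flow.
def run_procedural(feature):
    result = {}
    result["name"] = feature["NAM"]
    pop = feature["POP_EST"]
    result["population"] = int(pop)
    return result

source = {"NAM": "Berlin", "POP_EST": "3645000"}
assert run_declarative(source) == run_procedural(source)
```

Both produce the same target feature; the difference lies in who controls execution order, the author (procedural) or the runtime engine (declarative).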
A classic property of schema transformation approaches is the abstraction level they work on – the meta-model, the conceptual model, the logical model or the physical model.
In practical terms, each level focuses on different aspects of the transformation – conceptual/semantic integrity at the top level, adherence to structural rules on the logical level and value transformation at the physical level. Consequently, higher-level transformation definitions do not focus on minutiae such as the format of a date string. In a true model-driven architecture, the availability of vertical mappings means that you only have to define the schema transformation on the conceptual level, and the necessary transformations for the logical and physical levels are derived automatically. In most cases, the number of decisions or statements that a user needs to make increases significantly from the conceptual level to the physical level.
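The growth in the number of decisions towards the physical level can be sketched as follows (a hypothetical Python illustration; the class and attribute names are invented and not from any real INSPIRE schema): a single conceptual statement implies several logical-level attribute correspondences, and each of those may in turn require physical-level value handling such as date string formats.

```python
# Conceptual level: one decision - a class correspondence.
conceptual = [("Road", "RoadLink")]

# Logical level: several structural decisions per class correspondence
# (invented attribute names).
logical = [
    ("Road.id", "RoadLink.inspireId"),
    ("Road.built", "RoadLink.beginLifespanVersion"),
]

# Physical level: value-level minutiae, e.g. the format of a date string.
def to_iso_date(value: str) -> str:
    """Convert a 'DD.MM.YYYY' date string to ISO 'YYYY-MM-DD'."""
    day, month, year = value.split(".")
    return f"{year}-{month}-{day}"

print(to_iso_date("31.12.2007"))  # 2007-12-31
```

In a model-driven setup, only the single conceptual statement would be authored by hand; the logical and physical pieces would be derived from vertical mappings.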
Instance- or Schema-Driven Execution
In this classification, there are two categories:
- Instance-driven, where the execution of a schema transformation is driven by properties of a (set of) features
- Schema-driven, where execution of the schema transformation is driven by properties of the schema elements
Furthermore, especially in semantic web research, more and more approaches are being developed that combine the two. As an example, consider EDOAL/OML and its implementation in HALE: HALE sets up a transformation graph based on the schema, but then modifies it during execution when it encounters individual features with specific properties that make this necessary, e.g. because of varying cardinalities or formats of string-to-date conversions. From a practitioner’s perspective, the main difference between schema-driven and instance-driven is that only schema-driven approaches can be “complete”, i.e. cover all possible kinds of data valid according to a particular schema. However, with instance-driven methods you often save development time, since focus is put on the part of the schema that actually contains the data.
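A small Python sketch of the distinction (hypothetical, not HALE’s actual implementation): the schema-driven plan is fixed up front from the schema alone, so it covers every schema element whether or not the data uses it, while the instance-driven part inspects each value and adapts, here to the varying string-to-date formats mentioned above.

```python
from datetime import datetime

# Schema-driven: the plan is derived from the schema alone and therefore
# "complete" - one step per schema element (invented schema).
SCHEMA = {"name": "string", "built": "date"}

def schema_driven_plan(schema):
    return list(schema)  # covers all schema elements, used or not

# Instance-driven: inspect each value and adapt execution to it,
# e.g. to whichever date format this particular feature carries.
def parse_date_instance_driven(value):
    for fmt in ("%Y-%m-%d", "%d.%m.%Y"):  # formats observed in the data
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {value!r}")

assert schema_driven_plan(SCHEMA) == ["name", "built"]
assert parse_date_instance_driven("2007-12-31") == parse_date_instance_driven("31.12.2007")
```

A combined approach, as described for HALE above, would build the schema-driven plan first and then patch in instance-driven branches like the format probing only where the data requires them.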
Another means of classifying a schema transformation approach is to look at its primary representation form – textual, graphical or a combination of the two. Textual forms have several advantages, such as versioning (and merging) and the fact that they tend to be less tool-bound. You can open an XSLT file in any old text editor, after all. Graphical forms such as the transformation graphs we have become accustomed to from Talend, FME or GeoKettle emphasize data flow and often offer a more intuitive syntax than textual forms.
The final criterion that is typically used is the actual expressivity of the approach – can I do everything that I need to with the language or tool? Is it powerful enough, in other words? Some approaches such as XSLT are effectively general-purpose programming languages and have been shown to be Turing-complete. For assessing suitability for spatial data schema transformation, I use Matt Beare’s classification from the 2010 INSPIRE Schema Transformation Network Service Pilot project. This classification has six levels of functions:
- 1 – Renaming classes and attributes
- 2 – Simple attribute derivation
- 3 – Aggregating input records
- 4 – Complex derivation and dynamic type selection
- 5 – Deriving values based on multiple features
- 6 – Conflation and model generalisation
In total, the classification lists 25 functions that an approach would need to support to be considered complete for the given spatial data schema transformation use cases. As an example, the following functions are listed under level 2:
- Transforming data types (e.g. numbers into text or strings into timestamps)
- Transformation based on basic geometric functions (e.g. bounding box, convex hull, area)
- Transformation based on non-spatial functions (e.g. uppercase, truncate, substring, round, regular expression)
- Transforming units of measure
- Setting default values where data is not supplied
- Replacing values based on lookup tables (e.g. code lists)
- Entering identifiers for referenced objects (e.g. based on GML xlink or relational database foreign key in the source data).
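A few of these level-2 functions are simple enough to sketch directly in Python (hypothetical helpers for illustration; the code list values are invented):

```python
# Non-spatial string function: truncate.
def truncate(value: str, length: int) -> str:
    return value[:length]

# Unit-of-measure transformation: kilometres to metres.
def km_to_m(value: float) -> float:
    return value * 1000.0

# Lookup-table replacement (e.g. a code list), with a default value
# where data is not supplied.
CODE_LIST = {"1": "residential", "2": "industrial"}

def map_code(value, default="unknown"):
    return CODE_LIST.get(value, default)

assert truncate("Watercourse", 5) == "Water"
assert km_to_m(2.5) == 2500.0
assert map_code("1") == "residential"
assert map_code(None) == "unknown"  # default where no data is supplied
```

Functions at the higher levels – aggregation over multiple features, dynamic type selection, conflation – require access to the whole feature set and to geometry operations, which is where the expressivity of an approach is really tested.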
I have listed six criteria that can be used to assess a specific approach to schema transformation. These are not necessarily all that matter – there are others, such as maturity and verbosity. Furthermore, the actual classification within each criterion is often a subject of discussion. As an example, RIF and 1Spatial use a rule-based paradigm that has elements of both the declarative and the procedural paradigms, but it could be argued that it warrants a category of its own.