INSPIRE KEN Schema Transformation Workshop in Marne-la-Vallee – Day 1 report

On 8 and 9 October, around 50 people gathered for the joint EuroSDR/INSPIRE Knowledge Exchange Network (KEN) Workshop on Schema Transformation. The workshop gave all participants the opportunity to get an overview of virtually all approaches available on the market to help complete schema transformation projects.

For the full program, all slides and video recordings of the workshop, please go to the Eurogeographics website. What follows is not a detailed report of every presentation, but rather an account of my personal highlights, including the two discussion sessions that completed each day. Marie-Lise Vautier (IGN France) and I started with presentations that set the frame by defining what schema transformation is and what general approaches are available. Morten then continued with experiences highlighting the schema matching methods originally developed by ESDIN and now widely used by Cadastral Agencies and other LMOs. He also noted that matching tables in particular can become hard to create and maintain.

Marie-Lise Vautier gives the first presentation for the INSPIRE KEN Schema Transformation Workshop

After a break, Just van den Broecke opened the block on Open Source schema transformation software. He has developed a streaming ETL framework called STETL, which is based on GDAL/OGR, XSLT and other libraries and ties everything together using Python. Python support throughout the world of geospatial tools is very good – you can see it becoming a lingua franca for scripting in GIS. Just, like me in my earlier presentation, made it clear that schema transformation projects are essentially programming projects and thus have a certain level of complexity. I fully agree, but see it as a disadvantage that using STETL requires learning multiple languages. Consequently, I see STETL mostly as a tool for programmers who want to use the underlying tools anyhow and need a rich “boilerplate”. Just also held up the flag for Open Source as a major community enabler, which I see as especially important for INSPIRE.
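To illustrate the streaming pipes-and-filters idea behind such ETL frameworks, here is a minimal Python sketch. Note that this is not the actual STETL API – the component names, the record structure and the chaining style are invented for illustration only.

```python
# A minimal pipes-and-filters sketch of a streaming ETL chain, in the
# spirit of frameworks like STETL (component names are hypothetical).

def read_records(rows):
    """Input component: yields raw records one at a time (streaming)."""
    for row in rows:
        yield row

def rename_fields(records, mapping):
    """Filter component: renames attributes according to a mapping table."""
    for rec in records:
        yield {mapping.get(k, k): v for k, v in rec.items()}

def write_records(records):
    """Output component: collects results (a real chain would write GML etc.)."""
    return list(records)

# Wire the components together, much as an ETL config file would.
source = read_records([{"naam": "Utrecht", "inwoners": 350000}])
renamed = rename_fields(source, {"naam": "name", "inwoners": "population"})
result = write_records(renamed)
print(result)  # [{'name': 'Utrecht', 'population': 350000}]
```

Because each component is a generator, records flow through the chain one at a time, which is what makes this style suitable for large datasets.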

I was also very interested in the presentations on GeoKettle and Talend Spatial Data Integrator, which at first glance seem to have similar capabilities. Both presentations were given by users who had completed transformation projects with them. In both cases, I like that the tools are derived from general-purpose, non-GIS software, which demonstrates tool reusability. Talend was showcased by Jean-Loup Delaveau of CERTU. He explained how to create INSPIRE Planned Land Use data by setting up a workflow in Talend that used components such as XSLT translators. An interesting note from his side was that GML should really be treated as a machine-to-machine exchange format, and that providers and users should not see much of it.

Jean-Loup Delaveau of CERTU explains the plan4all workflow to create PLU data

Edith Vautard of IGN France explained how her group evaluated GeoKettle for INSPIRE Administrative Units generation. One thing that really impresses me is that IGN France is very open and trying out many approaches and tools to build up rich internal knowledge. On GeoKettle, I made a note that I’d like to investigate their workspace format a bit. Edith ended with an overall positive assessment of GeoKettle, citing from her slides:

  • + It’s intuitive and easy to use
  • + powerful and performant
  • + provides a sufficient diversity of functions
  • + reads the schema from the data
  • – Transformations are only stored in the internal XML format and cannot be exported as executable files (e.g. XSLT)
  • – INSPIRE complex structures are not supported, nor can you create non-simple GML 3.2.1
  • – There is no help in the software, and documentation is light; however, there is good support.

The first day was then completed by an update on the model-driven WFS work done by TU Munich, presented by Tatjana Kutzner. She highlighted findings of her recent research, published under the title “Critical Remarks on the Use of Conceptual Schemas in Geospatial Data Modelling — A Schema Translation Perspective” (Kutzner, Donaubauer 2012). The core question they researched was what a core model of all UML profiles in use would look like, and how to provide encoding rules for conceptual models in machine-readable formats.

After Tatjana’s presentation, only the discussion round stood between us and dinner – and everybody stayed for an interesting, engaged discussion, with these core findings on the subject “what are the main drivers to choose methods and tools for schema transformation” (citing from Dominique Laurent’s summary):

  • Maintenance and documentation of tools are significant criteria
  • Choice of tools depends on the business models of data providers: some want the best tool for each step (even if using many tools increases complexity), some want only a single supplier (or at least a small number of tool suppliers) and tender accordingly
  • Choice of tools also depends on national policy; there may be a mandate to use open-source tools
  • Skills also influence the choice of methods and tools: with limited skills, it is better to choose a tool that is simple to use and/or to plan for training
  • Choice of tools and methods will depend on the existing systems already in place (tools, data, …) and on the organization (e.g. one or several data producers)

Another item of discussion started from my earlier presentation on schema transformation approach classification: “To be able to choose our tools and methods, we need [a framework] to analyse the potential ones, to get an overview”. Meanwhile, I have posted a more extensive description of the framework presented in Paris here. The day then really ended with a very nice dinner :).

Join HALE on Joinup.eu

HALE is now an approved open source project on the joinup platform. Created by the European Commission as part of the ISA (Interoperability Solutions for public Administrations) programme, joinup is about sharing and reusing interoperability solutions for public administrations. It is specifically targeted at open source software solutions and knowledge repositories. A tool such as HALE that ensures semantic interoperability seems like a good fit for this platform.

If you’re interested in HALE’s profile over there, want to rate it or recommend it, want to tell others you are using it or even want to describe how you are using HALE so that others in public administration can benefit from your experience, follow this link.

So, what happened at INTERGEO 2013?

… that is the question I asked Christian Malewski (Fraunhofer IGD), one of the developers of HALE and active in the EU research project plan4business. He had given presentations and shown demos of the software to the interested audience and provided me with the following answer:

As indicated by the large number of software providers and presentations that dwelt on the topic, the focus was on geodata interoperability and data transformation. These themes were particularly addressed within the scope of the 2nd National INSPIRE Conference, which was held as part of the INTERGEO exhibition. Likewise, INSPIRE conformity for diverse geodata infrastructure components was addressed by several talks in the Open Source Park. The presentations on data harmonisation that I’d like to point you to are (note that some are in German):

  • MeTaDor – Metadata management for GDI-DE and INSPIRE (slides)
  • Inspider, the user-friendly software stack to implement INSPIRE and other SDIs (link to project, slides)
  • Geodata management and harmonisation with GeoKettle (slides)

HALE on a 55″ Multitouch Table

As a special treat for INTERGEO visitors, HALE was available on a 55″ multitouch table at the Fraunhofer IGD booth. In addition, the EU project plan4business presented its portal for the transformation of regional land use plans into the INSPIRE land use theme. The portal is built on HALE and other open source components such as OpenLayers. To use the portal, municipal land use data is uploaded together with the corresponding mapping rules, which are specified using HALE. These mapping rules are interpreted and executed on the server side to produce transformed and reclassified INSPIRE-conformant land use data. Access to the data is provided through a Web Map Service.

Following our HALE presentation with the title “Make your data ready for INSPIRE with HALE” (Slides available here) there were interesting questions and discussion, some of which I’ll document here as well:

Q: Is HALE an extension to FME?
A: HALE is a free and open source standalone tool, but a free plugin for FME 2014 has been developed and will be distributed officially soon.

Q: What about a direct database connection?
A: This is currently the number one user wish and its implementation is underway. You will be able to write to a PostGIS database via a JDBC driver; reading of PostGIS databases will be added with the 2.7.1 or the 2.8.0 release.

Q: What are the next steps for HALE?
A: We are putting major effort into the upcoming collaboration platform, so that data communities finally get a hub to exchange best practices and experiences, but also challenges. The user interface is also dear to us, so expect a big update with the next major release. In addition, several research projects use HALE as a basis to integrate more semantic web technology, which HALE was originally built upon. Check out the HALE Roadmap if you are curious about our detailed plans!

Q: What kind of community develops HALE, who is responsible for it?
A: It’s a community-driven Open Source project under the LGPL license. The main drivers are the partners in the data harmonisation panel under the lead of Fraunhofer IGD, where most of the development still happens, plus individual members such as Thorsten Reitz (of the Esri R&D Center Zurich), Andrea Antonello and Silvia Franceschi (of Hydrologis).

Q: How flexible is HALE? Can you transform to schemas other than INSPIRE ones?
A: Yes, it can transform to any schema, and is particularly good at transforming to complex XML, GML or Database schemas.

Classifying Schema Transformation Approaches and Tools

In the HUMBOLDT project, considerable effort was spent in mapping the landscape of tools and methods that are used to harmonize spatial data or that might be applied to this process. One area of focus was the process of schema transformation. We thus conducted studies on tools in 2007 and in 2010/2011, and continued this work afterwards. In these studies, we used a framework to classify these approaches, since we felt we were comparing apples and oranges all the time! This post defines the core classification categories for schema transformation approaches, as I also presented them at the INSPIRE KEN Schema Transformation Workshop.

There are multiple aspects or dimensions we can use to classify different approaches to schema transformation. Note that my use of the term “approach” is meant to abstract from a language, a method or its implementation in a tool.

Activity/Phase

In schema transformation projects, several phases are characteristic, very much like in software development or engineering projects. These phases include:

  • Design: Define the correspondences and functions to use, independent of implementation details, e.g. in matching tables or using UML
  • Development: Coding in a programming language such as XQuery, or building a pipes-and-filters graph visually as in FME or Talend
  • Debugging: Analysing the Schema Transformation’s behaviour
  • Validation: Testing with the full range of data, quality assurance
  • Documentation: Documenting parameters, the process, limitations and assumptions of the schema transformation, and providing lineage for the transformed dataset
  • Maintenance: Keep track of changes, iterate through the other activities for updated/new datasets, new schemas…

Different approaches put their focus on different phases. As an example, a matching table is a good design and documentation tool, but has very limited use in transformation development. We furthermore differentiate between explicit and implicit support. Explicit support means that the approach has facilities designed to support the phase, while implicit means the approach has facilities that can be (mis)used to support the phase. As an example of implicit support, consider XSLT: since it is plain text, a programmer’s usual maintenance and documentation tools (version control, comments) can be used.
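The limitation of matching tables as a development tool can be made concrete with a small Python sketch. The table layout, attribute names and the single supported “copy” function below are all invented for illustration; real matching tables vary widely.

```python
# Sketch: a matching table (the design/documentation artifact discussed
# above) can also drive a very simple transformation directly.

import csv
import io

# A matching table as it might look when exported from a spreadsheet:
matching_table_csv = """source_attribute,target_attribute,function
GEMEINDE_NAME,name,copy
FLAECHE_HA,area,copy
"""

rules = list(csv.DictReader(io.StringIO(matching_table_csv)))

def transform(feature, rules):
    """Apply only the trivial 'copy' rules; anything more complex is
    exactly where matching tables stop being useful for development."""
    return {r["target_attribute"]: feature[r["source_attribute"]]
            for r in rules if r["function"] == "copy"}

src = {"GEMEINDE_NAME": "Darmstadt", "FLAECHE_HA": 12222}
print(transform(src, rules))  # {'name': 'Darmstadt', 'area': 12222}
```

As soon as a correspondence needs conditions, aggregation or value conversion, the table row can only describe the intent in prose, and the actual development has to happen elsewhere.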

Paradigm

Originally used to classify computer programming languages, paradigms can help us understand what kind of patterns to use in the development phase. We differentiated two major paradigms:

  • Declarative: Describe the logic of a computation without describing its control flow. Leave optimization and actual execution order to the runtime engine.
    • Examples: XSLT, EDOAL/gOML
  • Procedural: Describe a computation by giving its control flow through a series of functions.
    • Examples: Python GeoProcessing Tool, FME

Of course, there are other approaches that cannot be fit into these two, such as Aspect-Oriented Programming or Agent-based Programming. Furthermore, there are approaches that contain elements of both. RIF, for example, has a procedural and a declarative sublanguage. XQuery as a rule-based approach also has a declarative and a procedural part.
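The contrast between the two paradigms can be shown with the same tiny attribute mapping written both ways. This is illustrative Python, not the syntax of any specific tool, and the attribute names are invented.

```python
# The same attribute mapping expressed in both paradigms.

feature = {"NAAM": "Leiden", "OPP_KM2": 23.3}

# Declarative: state WHAT the target should look like as a mapping;
# the evaluation order is left to the runtime.
mapping = {"name": "NAAM", "area_km2": "OPP_KM2"}
declarative = {target: feature[source] for target, source in mapping.items()}

# Procedural: state HOW to compute the result, step by step,
# with an explicit control flow.
procedural = {}
procedural["name"] = feature["NAAM"]
procedural["area_km2"] = feature["OPP_KM2"]

assert declarative == procedural
```

The declarative form is data that an engine can analyse, optimise or invert; the procedural form is a fixed recipe.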

Model Level

A classic property of schema transformation approaches is the abstraction level they work on – the meta-model, the conceptual model, the logical model or the physical model.

Schema Transformation - Abstraction levels from the Model Driven Architecture Approach

Schema Transformation – Abstraction levels from the Model Driven Architecture Approach

In practical terms, each level focuses on different aspects of the transformation – conceptual/semantic integrity at the top level, adherence to structural rules at the logical level, and value transformation at the physical level. Consequently, higher-level transformation definitions do not focus on minutiae such as the format of a date string. In a true model-driven architecture, the availability of vertical mappings means that you only have to define the schema transformation on the conceptual level, and the necessary transformations for the logical and physical levels are derived automatically. In most cases, the number of decisions or statements that a user needs to make increases significantly from the conceptual level to the physical level.

Instance- or Schema-Driven Execution

In this classification, there are two categories:

  • Instance-driven, where the execution of a schema transformation is driven by properties of a (set of) features
  • Schema-driven, where execution of the schema transformation is driven by properties of the schema elements

Furthermore, especially in semantic web research, more and more approaches are being developed that combine both. As an example, consider EDOAL/OML and its implementation in HALE: HALE sets up a transformation graph based on the schema, but then modifies it during execution when it encounters individual features with specific properties that make this necessary, e.g. because of varying cardinalities or varying formats in string-to-date conversions. From a practitioner’s perspective, the main difference is that only schema-driven approaches can be “complete”, i.e. cover all possible kinds of data valid according to a particular schema. With instance-driven methods, however, you often save development time, since the focus is put on the part of the schema that actually contains data.

Representation

Another means of classifying a schema transformation approach is to look at its primary representation form – textual, graphical or a combination of both. Textual forms have several advantages, such as versioning (and merging) and the fact that they tend to be less tool-bound. You can open an XSLT file in any old text editor, after all. Graphical forms such as the transformation graphs we have become accustomed to from Talend, FME or GeoKettle emphasize data flow and often provide a more intuitive syntax than textual forms.

Graphical (in FME) and Textual (XQuery) Representations of Schema Transformation Languages

Expressivity

The final criterion that is typically used is the actual expressivity of the approach – can I do everything that I need to with the language or tool? Is it powerful enough, in other words? Some approaches such as XSLT are effectively general-purpose programming languages and have been shown to be Turing-complete. For assessing suitability for spatial data schema transformation, I use Matt Beare’s classification from the 2010 INSPIRE Schema Transformation Network Service Pilot project. This classification has six levels of functions:

  • 1 – Renaming classes and attributes
  • 2 – Simple attribute derivation
  • 3 – Aggregating input records
  • 4 – Complex derivation and dynamic type selection
  • 5 – Deriving values based on multiple features
  • 6 – Conflation and model generalisation

In total, the classification lists 25 functions that an approach would need to support in order to be considered complete for the given spatial data schema transformation use cases. As an example, the following functions are listed under level 2:

  • Transforming data types (e.g. numbers into text or strings into timestamps)
  • Transformation based on basic geometric functions (e.g. bounding box, convex hull, area)
  • Transformation based on non-spatial functions (e.g. uppercase, truncate, substring, round, regular expression)
  • Transforming units of measure
  • Setting default values where data is not supplied
  • Replacing values based on lookup tables (e.g. code lists)
  • Entering identifiers for referenced objects (e.g. based on GML xlink or relational database foreign key in the source data)
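Several of these level-2 functions are simple enough to sketch in plain Python. The attribute names, the code list values and the unit conversion below are invented for illustration, not taken from any INSPIRE schema.

```python
# A few level-2 functions from the list above, sketched in plain Python.

def transform_level2(src):
    codelist = {"1000": "residential", "2000": "industrial"}  # lookup table
    return {
        "population": int(src["POP"]),             # data type transformation
        "name": src["NAME"].upper(),               # non-spatial function
        "area_km2": src["AREA_HA"] / 100.0,        # unit-of-measure conversion
        "landUse": codelist.get(src["USE_CODE"], "unknown"),  # code list lookup
        "validFrom": src.get("VALID", "2013-01-01"),          # default value
    }

src = {"POP": "4711", "NAME": "Altstadt", "AREA_HA": 250, "USE_CODE": "1000"}
print(transform_level2(src))
```

The higher levels (aggregation, multi-feature derivation, conflation) are precisely the ones that cannot be written as a per-feature function like this, which is what makes them harder for many tools.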

Conclusion

I have listed six criteria that can be used to assess a specific approach to schema transformation. These are not necessarily all that are important – there are others, such as maturity and verbosity. Furthermore, the actual classification within each criterion is often a subject of discussion. As an example, RIF and 1Spatial use a rule-based paradigm that has elements of both the declarative and the procedural paradigms, but it could be argued that it merits a category of its own.

HALE 2.7.0 with focus on Collaboration features

Originally we had planned our Autumn release to be a minor release – 2.6.1 – but this time we actually got far more done than originally expected. In the end, the new HALE version has so much to offer that it also deserves an intermediate version number increase – so we’re now at 2.7.0 (downloads, documentation). These are the things that you can now do with HALE:

Project Templates

To make your start with HALE easier, it now offers pre-configured project templates, e.g. for mapping to the INSPIRE Application Schemas. This saves you steps such as loading the schemas and setting up code lists. You can share your own projects online as templates, examples, or even reference mappings, to let others in your community profit from them:

http://hale.igd.fraunhofer.de/templates/

To select a template and load it in HALE, use the New project from template option in the File menu, or copy the project URL from the template website and use Open Alignment project with From URL.

HALE Web Templates

The Join Retype operation

Joins were on the backlog for a long time, since they are complex to resolve in our declarative approach. Now, however, HALE offers attribute-based joins of different feature classes – to an arbitrary depth. To create a join, select multiple source types and one target type and choose the Join function.
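Conceptually, an attribute-based join of two source types into one target type looks like the following Python sketch. The feature classes, attribute names and join key are invented; in HALE the same thing is configured visually via the Join function rather than coded.

```python
# Sketch: joining two source types on a shared attribute to populate
# a single target type.

parcels = [{"parcel_id": "p1", "geometry": "POLYGON(...)"},
           {"parcel_id": "p2", "geometry": "POLYGON(...)"}]
owners = [{"parcel_id": "p1", "owner": "Alice"},
          {"parcel_id": "p2", "owner": "Bob"}]

# Index the second type on the join attribute, then merge per feature.
owners_by_id = {o["parcel_id"]: o for o in owners}
joined = [{**p, "owner": owners_by_id[p["parcel_id"]]["owner"]}
          for p in parcels if p["parcel_id"] in owners_by_id]

print(joined[0]["owner"])  # Alice
```

Joining “to an arbitrary depth” then means repeating this index-and-merge step across a chain of types, each joined on its own key.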

Configure a Join Retype in HALE

Project and resources view

The new Project view lists all resources associated with the project, such as source schemas, target schemas and code lists. You can also edit basic project information in combination with the Properties view. Some types of resources, like code lists and lookup tables, can be removed from the project here using the context menu.

Export to JSON/GeoJSON

Transformed data can now be exported to JSON or GeoJSON, independently of the schema the data is associated with. Objects are generically encoded as JSON/GeoJSON according to their structure.
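The generic encoding idea can be illustrated with a small Python sketch: the geometry property becomes the GeoJSON "geometry" member and all remaining attributes go into "properties". This is a simplified illustration of the general pattern, not HALE's actual exporter, and the feature content is invented.

```python
# Sketch: generically encoding a feature as a GeoJSON Feature object
# based purely on its structure.

import json

def to_geojson_feature(obj, geometry_key="geometry"):
    return {
        "type": "Feature",
        "geometry": obj.get(geometry_key),
        "properties": {k: v for k, v in obj.items() if k != geometry_key},
    }

feature = {"name": "Marne-la-Vallee", "population": 300000,
           "geometry": {"type": "Point", "coordinates": [2.64, 48.84]}}
print(json.dumps(to_geojson_feature(feature)))
```

Nested structures from complex schemas would simply end up as nested JSON objects under "properties", which is what makes the encoding schema-independent.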

Improved support for INSPIRE

HALE now supports the new code list XML format introduced recently by the INSPIRE registry. These code lists are relevant for the latest versions of the Annex II and III Application Schemas. In addition, transformed INSPIRE compliant features can now be saved to GML directly as an INSPIRE SpatialDataSet instead of the deprecated GML FeatureCollection.

For some of these features, such as the Join functionality and the different types of template projects, we will post separate workflow descriptions in the coming days. Enjoy your work with HALE 2.7.0!

INTERGEO 2013 – Fraunhofer IGD represents the DHP

INTERGEO is a central event for the geospatial community in Europe. This year it takes place in Essen, from Tuesday 8 October to Thursday 10 October. The data harmonisation panel is represented by Simon Templer, Eva Klien and Joachim Rix from the Fraunhofer Institute for Computer Graphics IGD. You can find them at the AED SICAD booth in hall 1 / booth B1.030. There you can see the newest releases of HALE and CST in action, as well as what other things we have up our sleeves. Furthermore, Simon will be presenting HALE to the fair visitors, in German:

Simon Templer: Machen Sie Ihre Daten bereit für INSPIRE mit HALE
Wednesday, 09.10.2013 and Thursday 10.10.2013, at 12:40 – 13:00.
at the OpenSource Park in hall 1 / booth H1.033.

Enjoy your visit to INTERGEO!