Thoughts about Git for IFC (3)

In my previous posts I described why we need a versioning system for IFC, and what the current drawbacks are when it comes to using IFC and Git.

In this post I'll describe my thoughts on how we could implement a version tracking system for IFC by defining a new way to write IFC data to disk.

First, let's define some starting points.

Starting points

I've taken the following starting points into account:
  • Stick to the principles of the IFC graph as much as possible: keep data linked and separate, don't store entities in entities (which is tempting in hierarchical file-structures), since that results in data duplication.
  • Use the hash of an entity as the identification method of the entity.
  • Entity hashing and storage should be such that no data duplication occurs.
  • Entity hashing should be such that entity uniqueness remains, even on non-rooted entities.
  • Human readability would be nice but is not a necessity. Speed is more important: the repository should reflect a database model as much as possible.
The last point is important to understand, especially when it comes to deciding how to store IFC entities in folders. I believe that the IFC Git repository will never be the actual live schema of the BIM editor (e.g. Revit). The file system is just too slow for that. I think that in the end, such a repository will be read by the BIM editor and stored internally in a graph database, and when saving, the data will be dumped back into the repository. Ideally, commits, branches, etc. could even be stored inside the graph database and later be exported to the repository.

To me, human readability of the repository would be nice, but in the end, we can always make a GUI on top of the repository to help us in analyzing the data.

Uniqueness and Hashing

Defining uniqueness of entities is essential in versioning entities. IFC already recognizes this importance, since rooted entities all have a GlobalId property. However, this is not enough. Take the following items for example:

IFCWALLSTANDARDCASE('1zsFittbH2hBUSaLfoZE9n',#41,'A name',$,'The type',#118,#168,'A tag');
IFCWALLSTANDARDCASE('1zsFittbH2hBUSaLfoZE9n',#41,'A new name',$,'The type',#118,#168,'A tag');

The above example shows that the name of the entity changed, but (of course) the GlobalId did not, since the object itself remained the same. However, when it comes to versioning, this minor change is actually essential to detect.

This is where hashing comes in. If we took the SHA-1 hash of each line, the two hashes would be completely different. In Git, if we replaced the contents of a file holding the first line with the second line, Git would take care of the versioning, and the Git tools could easily be used for detecting and showing the differences.
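
As a small illustration of the point above, the two wall lines can be hashed with Python's standard hashlib module. This is only a sketch: hashing the raw UTF-8 text of the line is my assumption here, not a settled serialization rule.

```python
import hashlib

old = "IFCWALLSTANDARDCASE('1zsFittbH2hBUSaLfoZE9n',#41,'A name',$,'The type',#118,#168,'A tag');"
new = "IFCWALLSTANDARDCASE('1zsFittbH2hBUSaLfoZE9n',#41,'A new name',$,'The type',#118,#168,'A tag');"


def sha1_of_line(line: str) -> str:
    """Return the SHA-1 hex digest of the UTF-8 encoded line (assumed encoding)."""
    return hashlib.sha1(line.encode("utf-8")).hexdigest()


old_hash = sha1_of_line(old)
new_hash = sha1_of_line(new)

print(old_hash)
print(new_hash)
print("changed:", old_hash != new_hash)
```

The GlobalId is identical in both lines, yet the digests differ, which is exactly the signal a versioning system needs.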

However, the examples above still contain the file-scoped unique entity ids (#). We need to find a way to get rid of those in order to be able to start versioning our IFC files.

Hashes and links

We need to find a way to step away from the internal, file-scoped link ids. One trick could be to replace the integers with the GlobalId codes. Since data storage limitations are no longer an issue, this could be a solution. However, this trick fails when it comes to linking to Non-Rooted entities. Since these entities do not have a GlobalId, there is nothing to link to!

I believe the best solution would be to use the unique hash as the link identifier. This would mean that the above example would look like:
IFCWALLSTANDARDCASE('1zsFittbH2hBUSaLfoZE9n',#356a192b7913b04c54574d18c28d46e6395428ab,'A name',$,'The type',#17ba0791499db908433b80f37c5fbc89b870084b,#fa35e192121eabf3dabf9f5ea6abdbcbc107ac3b,'A tag');

Great! We could easily traverse the IFC graph from bottom to top, starting with the entities with no links (like IFCCARTESIANPOINT), and work our way up. This order is necessary because the hash of an entity can only be calculated once the hashes of all the entities it links to are known.

This is also where the drawback is. A minor change in the bottom of the graph could lead to a large number of changes in the whole graph.

One more real-life example is the fact that IFC exporters in Revit use the current Revit user as the IFCPERSON. If I were to export a model, and my colleague exported the exact same model, huge differences would be detected between the IFC data, since the change in IFCPERSON would cause a hash change in IFCOWNERHISTORY, which in turn would cause a hash change in all Rooted entities! A big problem.
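
The bottom-up hashing and the ripple effect described above can be sketched as follows. This is a rough sketch under several assumptions of mine: the miniature model is hypothetical and its attribute lists are simplified (not schema-valid), links are found with a plain regular expression (which would also match a "#12" inside a string attribute), and the graph is assumed to have no cycles.

```python
import hashlib
import re

# Hypothetical miniature model: file-scoped id -> entity text (simplified).
MODEL = {
    1: "IFCPERSON($,'Doe','John',$,$,$,$,$)",
    2: "IFCOWNERHISTORY(#1,$,$,.ADDED.,$,$,$,0)",
    3: "IFCWALLSTANDARDCASE('1zsFittbH2hBUSaLfoZE9n',#2,'A name',$,'The type',$,$,'A tag')",
    4: "IFCCARTESIANPOINT((0.,0.,0.))",
}

REF = re.compile(r"#(\d+)")  # naive: matches every '#<integer>' in the text


def hash_model(model):
    """Return {file-scoped id: SHA-1 hash}, hashing linked entities first."""
    hashes = {}

    def resolve(entity_id):
        if entity_id not in hashes:
            # Replace each '#<id>' with '#<hash of the linked entity>'.
            linked = REF.sub(lambda m: "#" + resolve(int(m.group(1))), model[entity_id])
            hashes[entity_id] = hashlib.sha1(linked.encode("utf-8")).hexdigest()
        return hashes[entity_id]

    for entity_id in model:
        resolve(entity_id)
    return hashes


before = hash_model(MODEL)

# Same model, exported by a different user.
changed = dict(MODEL)
changed[1] = "IFCPERSON($,'Roe','Jane',$,$,$,$,$)"
after = hash_model(changed)

for entity_id in sorted(MODEL):
    print(entity_id, "changed:", before[entity_id] != after[entity_id])
```

Only the person was edited, but the owner history and the wall get new hashes as well. The cartesian point, which links to nothing that changed, keeps its hash.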

Introducing IFCLINK

The example above shows that we need to separate the link definition from the entity definition: the link should not be stored in the entity file.

I believe one option would be to create a separate folder in the IFC repository called IFCLINK. This is where links between the entities are stored, again using hashes.

How would this work?

Let's say that the above IFCWALLSTANDARDCASE has a hash "A", and is stored as /IFCWALLSTANDARDCASE/[GlobalId].json. Now let's say that the IFCOWNERHISTORY has a hash "B" and is stored as /IFCOWNERHISTORY/B.json.

A link file would be stored as /IFCLINK/[GlobalId]-B.json and would have the following contents:
{
   "from"       : "[GlobalId]",
   "from_class" : "IFCWALLSTANDARDCASE",
   "to"         : "B",
   "to_class"   : "IFCOWNERHISTORY",
   "index"      : 0
}

The classes would be necessary to quickly retrieve the right links and files when processing the repository. The index is important when it comes to handling LIST and ARRAY as defined in the IFC Express specification.

By using IFCLINK, a change in IFCOWNERHISTORY would lead to a lot of changes in the IFCLINK folder, but not to changes in the linked entities. This is exactly how it would work in a graph database, where the edges from entities to the IFCOWNERHISTORY node would have to be redefined.
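
A link file as described above could be generated along these lines. The function name make_link and the new hash "C" are hypothetical, introduced only for illustration; the file name and JSON keys follow the example earlier in this section.

```python
import json


def make_link(from_id, from_class, to_id, to_class, index):
    """Return (path, contents) of one link file in the IFCLINK folder."""
    path = f"IFCLINK/{from_id}-{to_id}.json"
    contents = json.dumps(
        {
            "from": from_id,
            "from_class": from_class,
            "to": to_id,
            "to_class": to_class,
            "index": index,
        },
        indent=3,
    )
    return path, contents


WALL_ID = "1zsFittbH2hBUSaLfoZE9n"

# The owner history changes from hash "B" to a hypothetical new hash "C".
old_path, old_link = make_link(WALL_ID, "IFCWALLSTANDARDCASE", "B", "IFCOWNERHISTORY", 0)
new_path, new_link = make_link(WALL_ID, "IFCWALLSTANDARDCASE", "C", "IFCOWNERHISTORY", 0)

print(old_path)
print(new_path)
print(new_link)
```

The wall file itself is not touched in this scenario: only the link file is replaced, which mirrors redefining an edge in a graph database.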

Also, by separating the links from the entity, we could now store the entity as:
IFCWALLSTANDARDCASE('1zsFittbH2hBUSaLfoZE9n',,'A name',$,'The type',,,'A tag');
Or we could decide to store it as JSON of course. However, removing the links doesn't work in all cases.

Rooted vs Non-Rooted entities

In the above paragraph I showed that we could store entities without their links when storing the links in a separate folder. While this does work for Rooted entities, this fails for most Non-Rooted entities.

Take for example the following definition:
IFCPOLYLINE((#1,#2,#3));
This defines a polyline with 3 points. If we were to replace the links with IFCLINKs, we would end up with the following definition:
IFCPOLYLINE((,,));
Now, this fails, since:
IFCPOLYLINE((#2,#3,#1));
would also lead to:
IFCPOLYLINE((,,));
which generates the same hash, while the polyline is completely different!

In Non-Rooted entities, there is no compulsory unique tag or unique identifier. So in the case of Non-Rooted entities I believe we need to stick with the trick of storing entities such that the links are the hashes of the entities we link to. This would solve the IFCPOLYLINE problem mentioned above.
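
The IFCPOLYLINE problem and its fix can be checked with a few lines of Python. The point coordinates are made up for the example, and hashing the raw entity text with SHA-1 is again my assumption.

```python
import hashlib


def sha1(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Hypothetical points behind the file-scoped ids #1, #2 and #3.
POINTS = {
    "#1": "IFCCARTESIANPOINT((0.,0.))",
    "#2": "IFCCARTESIANPOINT((1.,0.))",
    "#3": "IFCCARTESIANPOINT((1.,1.))",
}


def polyline_without_links(refs):
    """Strip the links, as we would do for a Rooted entity."""
    return "IFCPOLYLINE((" + ",".join("" for _ in refs) + "))"


def polyline_with_hash_links(refs):
    """Replace each file-scoped id by the hash of the point it refers to."""
    links = ",".join("#" + sha1(POINTS[ref]) for ref in refs)
    return f"IFCPOLYLINE(({links}))"


first = ["#1", "#2", "#3"]
second = ["#2", "#3", "#1"]

print(polyline_without_links(first) == polyline_without_links(second))
print(sha1(polyline_with_hash_links(first)) == sha1(polyline_with_hash_links(second)))
```

With the links stripped, both polylines collapse to the same text and therefore the same hash. With hash links, the order of the points is part of the content, so the two polylines stay distinct.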

The IFC versioning rules

Let's try to define some IFC versioning rules for us to reflect upon:
  • Each entity is to be stored in a single file;
  • The entity file holds an IFC entity in JSON;
  • Rooted entities have empty attributes where normally links would reside;
  • Non-rooted entities use the hash of the linked entities as the link definition;
  • There is a separate folder called IFCLINK which holds the link files as described above;
  • IFCLINK files are named such that they reflect the entity they link from and the entity they link to;
  • There is a folder for each entity type (possibly with subfolders consisting of the first two characters of the filenames for optimization);
  • Rooted entity filenames are formed using the GlobalId attribute;
  • Non-rooted entity filenames are formed using the hash of the file contents;
  • All hashes are calculated in the same way Git calculates its hashes.
The above versioning rules imply that the software which generates the IFC repository calculates the hashes of (non-rooted) entities in a predefined way. This is necessary because we need to use these hashes as links.
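
To make the naming rules a bit more concrete, here is a minimal sketch of how file paths could be derived. Git hashes a file (a "blob") as the SHA-1 of the header "blob <size>" plus a NUL byte, followed by the contents. Everything else is an assumption of mine: the JSON layout with "class" and "attributes" keys, the compact sorted-keys serialization (needed so that the same entity always yields the same bytes), and the helper names.

```python
import hashlib
import json


def git_blob_hash(content: bytes) -> str:
    """SHA-1 as Git computes it for a blob: header, NUL byte, then contents."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def serialize(entity: dict) -> bytes:
    """Deterministic JSON bytes (assumed canonical form: sorted keys, no spaces)."""
    return json.dumps(entity, sort_keys=True, separators=(",", ":")).encode("utf-8")


def entity_path(entity_class, content, global_id=None, shard=True):
    """Rooted entities are named by GlobalId, non-rooted ones by content hash."""
    name = global_id if global_id is not None else git_blob_hash(content)
    folder = f"{entity_class}/{name[:2]}" if shard else entity_class
    return f"{folder}/{name}.json"


# Hypothetical JSON layout for a non-rooted and a rooted entity.
point = serialize({"class": "IFCCARTESIANPOINT", "attributes": [[0.0, 0.0, 0.0]]})
wall = serialize({
    "class": "IFCWALLSTANDARDCASE",
    "attributes": ["1zsFittbH2hBUSaLfoZE9n", None, "A name", None, "The type", None, None, "A tag"],
})

print(entity_path("IFCCARTESIANPOINT", point))
print(entity_path("IFCWALLSTANDARDCASE", wall, global_id="1zsFittbH2hBUSaLfoZE9n"))
```

Because the hash matches what Git itself would compute for the file, the filename of a non-rooted entity can be verified with the standard Git tooling.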

Wrap up

I believe that, with the rules defined above, we could be using Git as a versioning system for IFC data. We would then be able to version the IFC data on entity level, create branches, work in parallel on design options, and merge changes to the master branch. Also, we would end up with a chain of commits (changes), which would be a great thing to have in the construction industry.

Before I forget: these are just early thoughts. I have yet to implement the ideas mentioned above to fully test all of this. And for sure there is room for discussion (folder structure, IFCLINK, JSON) and optimization. Feel free to comment and share your own thoughts!
