RDF Dataset as a Log-Compacted Topic
Pieter Colpaert and Piotr Sowiński recently proposed RDF Messages, where an RDF Message is defined as “an RDF Dataset that is intended to be interpreted atomically as a single communicative act.”
But the proposal frames the problem narrowly: how to exchange RDF as a “discrete unit of communication.” I think the more interesting problem is how to describe an entire RDF Dataset as a persistent topic that can be exchanged and replayed.
In UDA, this is exactly what we built. Our message is a named graph, and the same name that identifies the graph also identifies the message. This is the design I want to put on record here. The reframing it makes possible is much bigger than the proposal acknowledges.
The Stream/Dataset Duality
If you have worked with Kafka, you have probably come across the stream/table duality: a stream is the changelog of a table, and a table is the materialization of a stream. Jay Kreps made the case for this years ago.
The same duality applies to RDF Datasets. Each named graph can be seen as a keyed message on a log-compacted Kafka topic, with the graph name as the key. Log compaction keeps only the latest message for each key, so the topic always carries the current value of every named graph.
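The duality can be made concrete with a small sketch. This is not Kafka code, just a hypothetical in-memory model: the log is an ordered list of `(graph_name, triples)` records, and compaction keeps only the newest record per graph name, yielding the materialized dataset.

```python
# Hypothetical sketch of log compaction over named graphs:
# the graph name is the message key, the graph's triples are the value.

def compact(log):
    """Materialize the dataset: keep only the latest value per graph name."""
    dataset = {}
    for graph_name, triples in log:
        if triples is None:
            dataset.pop(graph_name, None)  # a tombstone deletes the graph
        else:
            dataset[graph_name] = triples  # a newer message replaces the graph
    return dataset

# A stream of keyed messages: each message replaces one named graph.
log = [
    ("ex:alice", {("ex:alice", "foaf:name", '"Alice"')}),
    ("ex:bob",   {("ex:bob", "foaf:name", '"Bob"')}),
    ("ex:alice", {("ex:alice", "foaf:name", '"Alice A."')}),  # newer version wins
]

dataset = compact(log)
```

Replaying the full log and compacting it always reproduces the current dataset, which is exactly the stream/table duality: the log is the changelog, `compact` is the materialization.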
What you get is the RDF Dataset itself, materialized as a topic and shared between producers and consumers. You can replay it from any point in time and sink it into any RDF Store. The proposal stops at the message-level dataset and never considers the topic itself as a dataset.
Where RDF Messages Fits
The proposal identifies a real case where a single keyed message is not enough: atomic updates across several named graphs. The RDF Dataset we stream in UDA is only eventually consistent, and SHACL validation across graphs cannot happen in-stream, only after the fact against the consolidated dataset. That is an honest limitation, and an RDF Message grouping those graphs would be exactly the right unit for it.
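To illustrate why a multi-graph RDF Message is the right unit here, a hedged sketch (hypothetical structure, not the proposal's wire format): the message groups several named graphs, and the consumer applies all of them in one step, so cross-graph invariants hold before and after the update, never in between.

```python
# Hypothetical model: an RDF Message as a dict of named graphs,
# applied to the materialized dataset atomically.

def apply_message(dataset, message):
    """Apply every named graph in one RDF Message as a single atomic step."""
    staged = dict(dataset)                 # stage the update on a copy
    for graph_name, triples in message.items():
        staged[graph_name] = triples
    return staged                          # swap in only the fully applied copy

dataset = {"ex:inventory": {("ex:item1", "ex:stock", '"5"')}}
message = {                                # one communicative act, two graphs
    "ex:orders":    {("ex:order9", "ex:item", "ex:item1")},
    "ex:inventory": {("ex:item1", "ex:stock", '"4"')},
}
dataset = apply_message(dataset, message)
```

A stream of size-1 keyed messages could only deliver the two graphs one after the other, leaving a window where the order exists but the stock is stale; the grouped message closes that window.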
In my opinion, keyed messages should be the starting point. An RDF Message in the proposal is a message-level dataset that may contain several named graphs, so it has no natural key. This forecloses log compaction, and with it the duality.
In our experience, the most common streaming primitive is a size-1 RDF Message: a message-level dataset with a single named graph and nothing else. The proposal does not give it that status; it is just one configuration of the general dataset abstraction, sharing the same structural limits.
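The special status of the size-1 message is easy to state in code. A sketch, assuming the same dict-of-graphs model as above: a size-1 RDF Message has exactly one named graph, so its compaction key falls out for free; a multi-graph message has no single natural key.

```python
# Hypothetical helper: derive the log-compaction key of an RDF Message.

def key_of(message):
    """Return the natural key of a size-1 RDF Message, or None otherwise."""
    if len(message) == 1:
        (graph_name,) = message.keys()  # the single graph name is the key
        return graph_name
    return None  # multi-graph messages have no single natural key

size_one = {"ex:alice": {("ex:alice", "foaf:name", '"Alice"')}}
multi = {"ex:a": set(), "ex:b": set()}
```

This is the structural point: only the size-1 configuration keys cleanly onto a compacted topic, which is why it deserves to be the primitive rather than one configuration among many.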
On Standardization
In my experience, an RDF Dataset works much better as a log-compacted topic than as a sequence of communicative acts. The keyed message is the right starting point, with RDF Messages as the natural layer for atomic grouping on top. I hope this framing finds its way into the proposal.