bertails.orghttps://bertails.org/2015-06-17T00:00:00-05:00An RDF abstraction for the JVM2015-06-17T00:00:00-05:00Alexandre Bertailstag:bertails.org,2015-06-17:2015/06/17/an-rdf-abstraction-for-the-jvm/<p><a href="http://commonsrdf.incubator.apache.org/">Commons RDF</a> is an effort from the <a href="https://jena.apache.org/">Jena</a> and <a href="http://rdf4j.org/">Sesame</a> communities <cite>to define a common library for RDF 1.1 on the JVM</cite>. In my opinion, the current proposal suffers from design issues which seriously limit interoperability despite the stated objective. In this article, I will explain the limits of the current design and discuss alternatives to address the flaws.</p>
<p>This article is as much about RDF on the JVM as it is about API design and abstractions in Java. No prior knowledge of RDF is required, as I will introduce the RDF model itself. So you might end up learning what RDF is as a side-effect :-)</p>
<h2 id="the-problem">the problem RDF Commons wants to solve</h2>
<p>For a long time now, if you wanted to do RDF (and SPARQL) stuff in Java, you basically had the choice between <a href="https://jena.apache.org/">Jena</a> and <a href="http://rdf4j.org/">Sesame</a>. Those two libraries were developed independently and didn’t share much, despite the fact that they are both implementations of <a href="http://www.w3.org/standards/techs/rdf#w3c_all">well-defined Web standards</a>.</p>
<p>So people have come up with ways to go back-and-forth between those two worlds: object adapters, meta APIs, ad-hoc APIs, etc. For example, let’s say you wanted to use that awesome asynchronous parser library for <a href="http://www.w3.org/TR/turtle/">Turtle</a>. It returns a Jena graph while your stack is mainly Sesame? Well it’s too bad for you. So you use an adapter which wraps every single object composing your graph.</p>
<p>So let’s say you have the opportunity to solve those problems. What would you do? If you have done software development for a while, especially if it was in Java, your first thought might be about defining a common <a href="http://math.hws.edu/javanotes/c5/s5.html#OOP.5.2">class hierarchy</a> coupled with an <a href="http://en.wikipedia.org/wiki/Abstract_factory_pattern">abstract factory</a>. Then you could go back to the author of the Turtle library with a Pull Request using the new common interfaces, and everybody is happy, right?</p>
<p>Let’s see how this pans out in the case of Commons RDF 0.1.</p>
<h2 id="commons-rdf">Commons RDF</h2>
<p><a href="http://commonsrdf.incubator.apache.org/">Commons RDF</a> closely follows <a href="http://www.w3.org/TR/rdf11-concepts/">the concepts defined in RDF 1.1</a>, including the terms used. It specifically targets plain RDF (as opposed to <a href="http://www.w3.org/TR/rdf11-concepts/#section-generalized-rdf">Generalized RDF</a>) and wants to be as type-safe as possible, e.g. only IRIs and blank nodes are accepted in the subject position of a triple.</p>
<p>Here is an overview of the design of Commons RDF:</p>
<div style="float: right; margin-left: 4em; margin-right: 2em;">
<a href="class-diagram.png">
<img src="class-diagram.png" alt="Commons RDF class diagram" style="height: 30em" />
</a>
<p style="text-align: center; margin-top: 0px;">link to <a href="http://commonsrdf.incubator.apache.org/images/class-diagram.png">original image</a></p>
</div>
<ul>
<li>
<p>each RDF concept is mapped onto a Java interface: <code>Graph</code>, <code>Triple</code>, <code>RDFTerm</code>, <code>IRI</code>, <code>BlankNode</code>, <code>Literal</code></p>
</li>
<li>
<p>there is an additional concept: <code>BlankNodeOrIRI</code></p>
</li>
<li>
<p>there are sub-type relationships between <code>RDFTerm</code>, <code>BlankNodeOrIRI</code>, <code>IRI</code>, <code>BlankNode</code>, and <code>Literal</code></p>
</li>
<li>
<p>the interfaces expose methods to access their components</p>
</li>
<li>
<p>the factory <code>RDFTermFactory</code> knows how to create concrete instances of the interfaces</p>
</li>
</ul>
<div style="clear: both;"></div>
<p>Here is a quick look at what RDF <em>actually is</em> in the Commons RDF world (this is basically copied from the <a href="https://git-wip-us.apache.org/repos/asf?p=incubator-commonsrdf.git;a=tree;f=api/src/main/java/org/apache/commons/rdf/api;h=8f2db3110cd11d18e6128eac40cdf597372b73d0;hb=9ee66b0078da61fed85b5fe0b6d5481e9300b140">source code</a>):</p>
<pre><code class="language-java">package org.apache.commons.rdf.api;

public interface Graph {
  void add(Triple triple);
  boolean contains(Triple triple);
  Stream<? extends Triple> getTriples();
  ...
}

public interface Triple {
  BlankNodeOrIRI getSubject();
  IRI getPredicate();
  RDFTerm getObject();
}

public interface RDFTerm {
  String ntriplesString();
}

public interface BlankNodeOrIRI extends RDFTerm { }

public interface IRI extends BlankNodeOrIRI {
  String getIRIString();
}

public interface BlankNode extends BlankNodeOrIRI {
  String uniqueReference();
}

public interface Literal extends RDFTerm {
  String getLexicalForm();
  IRI getDatatype();
  Optional<String> getLanguageTag();
}

public interface RDFTermFactory {

  default Graph createGraph() throws UnsupportedOperationException { ... }

  default IRI createIRI(String iri)
    throws IllegalArgumentException, UnsupportedOperationException { ... }

  /** The returned blank node MUST NOT be equal to any existing */
  default BlankNode createBlankNode()
    throws UnsupportedOperationException { ... }

  /** All `BlankNode`s created with the given `name` MUST be equivalent */
  default BlankNode createBlankNode(String name)
    throws UnsupportedOperationException { ... }

  default Literal createLiteral(String lexicalForm)
    throws IllegalArgumentException, UnsupportedOperationException { ... }

  default Literal createLiteral(String lexicalForm, IRI dataType)
    throws IllegalArgumentException, UnsupportedOperationException { ... }

  default Literal createLiteral(String lexicalForm, String languageTag)
    throws IllegalArgumentException, UnsupportedOperationException { ... }

  default Triple createTriple(BlankNodeOrIRI subject, IRI predicate, RDFTerm object)
    throws IllegalArgumentException, UnsupportedOperationException { ... }
}
</code></pre>
<p>Everything actually looks good and pretty standard, right? So you might be wondering why I am not that thrilled by this approach. Keep on reading then :-)</p>
<h2 id="class-based-design">class-based design</h2>
<p>As a reminder, in most static languages, <strong>types are only compile-time information</strong>. In Java, classes and interfaces are just a <a href="http://stackoverflow.com/a/5315433/1057315">reified</a> version of types (up to generics, <a href="https://docs.oracle.com/javase/tutorial/java/generics/erasure.html">which get erased by the JVM</a>), meaning that they are an (incomplete <em>by design</em>) abstraction for types that can be manipulated at runtime.</p>
<p>RDF Commons decided to model the RDF types using interfaces. In Java, interfaces and classes rely on what we call <a href="http://en.wikipedia.org/wiki/Nominal_type_system#Nominal_subtyping">nominal subtyping</a>. It means that a concrete implementation is required to <em>explicitly</em> extend (or implement) an interface to be considered a subtype.</p>
<p>In other words, despite <code>java.lang.UUID</code> being a perfectly acceptable candidate for being a <code>BlankNode</code>, it is impossible to use it <em>directly</em> because <code>UUID</code> does not implement <code>BlankNode</code>, so <code>UUID</code> has to be wrapped. There are actually many other cases like that: <code>java.net.URI</code> or <code>akka.http.model.Uri</code> are acceptable candidates for <code>IRI</code>, <code>java.lang.String</code> or <code>java.lang.Integer</code> for <code>Literal</code>, etc.</p>
<p>So here is my first and main complaint about Commons RDF: <strong>it forces implementers to coerce their types into its own class hierarchy</strong> and there is no good reason for doing so.</p>
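<p>To make the cost concrete, here is a minimal sketch, not actual Commons RDF code: the single-method <code>IRI</code> interface below is a simplified stand-in. Because <code>java.net.URI</code> does not implement the interface, every instance has to be wrapped before it can enter the API.</p>

```java
import java.net.URI;

public class WrappingCost {

    // simplified stand-in for the Commons RDF IRI interface
    interface IRI {
        String getIRIString();
    }

    // the unavoidable adapter: one extra allocation per term,
    // just to satisfy the nominal subtyping requirement
    static IRI wrap(URI uri) {
        return uri::toString;
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://example.com/Alice");
        IRI iri = wrap(uri);  // uri itself can never be used as an IRI
        System.out.println(iri.getIRIString());
    }
}
```

<p>Multiply this by every term of every triple of a large graph, and the adapter layer becomes a real cost, both in allocations and in code to maintain.</p>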
<h2 id="generics">generics</h2>
<p>How can we define abstract types and operations on them without relying on class/interface inheritance? You already know the answer, as it is the same story as with <code>java.util.Comparator<T></code> and <code>java.lang.Comparable<T></code>.</p>
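<p>To recall the analogy: <code>Comparable</code> relies on nominal subtyping (the class itself must implement it), while a <code>Comparator</code> can be supplied externally for any existing type, without touching it. A small sketch:</p>

```java
import java.util.Arrays;
import java.util.Comparator;

public class ExternalOps {
    public static void main(String[] args) {
        String[] names = { "Charlie", "alice", "Bob" };

        // String cannot be retrofitted to implement a new interface,
        // but an ordering can still be defined for it from the outside:
        Comparator<String> caseInsensitive = (a, b) -> a.compareToIgnoreCase(b);

        Arrays.sort(names, caseInsensitive);
        System.out.println(Arrays.toString(names));  // [alice, Bob, Charlie]
    }
}
```

<p>The <code>RDF</code> module below plays exactly the role of <code>Comparator</code>: the operations live outside the types they operate on.</p>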
<p>Let’s see what the factory would look like with this approach:</p>
<pre><code class="language-java">public interface RDFTermFactory<Graph,
                                Triple,
                                RDFTerm,
                                BlankNodeOrIRI extends RDFTerm,
                                IRI extends BlankNodeOrIRI,
                                BlankNode extends BlankNodeOrIRI,
                                Literal extends RDFTerm> {

  /* same factory functions as before go here */

}
</code></pre>
<p>Instead of referring to Java interfaces, we now refer to the newly introduced generics. In a way, <strong>generics are more abstract than interfaces</strong>. Also, generics let you express the subtype relationships using <code>extends</code>.</p>
<p>As you have probably already noticed, that only gives us a way to <em>create inhabitants for those types</em>. We also need a way to <em>access their components</em>.</p>
<h2 id="rdf-module">RDF module</h2>
<p>Accessing components was the role of the methods defined on the interfaces. So we just have to move them into the factory and make them <em>functions</em> instead. And because the factory is now made of all the operations actually defining the RDF model, we can refer to it as the <strong><code>RDF</code> module</strong>.</p>
<pre><code class="language-java">public interface RDF<Graph,
                     Triple,
                     RDFTerm,
                     BlankNodeOrIRI extends RDFTerm,
                     IRI extends BlankNodeOrIRI,
                     BlankNode extends BlankNodeOrIRI,
                     Literal extends RDFTerm> {

  // from org.apache.commons.rdf.api.RDFTermFactory
  BlankNode createBlankNode();
  BlankNode createBlankNode(String name);
  Graph createGraph();
  IRI createIRI(String iri) throws IllegalArgumentException;
  Literal createLiteral(String lexicalForm) throws IllegalArgumentException;
  Literal createLiteral(String lexicalForm, IRI dataType) throws IllegalArgumentException;
  Literal createLiteral(String lexicalForm, String languageTag) throws IllegalArgumentException;
  Triple createTriple(BlankNodeOrIRI subject, IRI predicate, RDFTerm object) throws IllegalArgumentException;

  // from org.apache.commons.rdf.api.Graph
  Graph add(Graph graph, BlankNodeOrIRI subject, IRI predicate, RDFTerm object);
  Graph add(Graph graph, Triple triple);
  Graph remove(Graph graph, BlankNodeOrIRI subject, IRI predicate, RDFTerm object);
  boolean contains(Graph graph, BlankNodeOrIRI subject, IRI predicate, RDFTerm object);
  Stream<? extends Triple> getTriplesAsStream(Graph graph);
  Iterable<Triple> getTriplesAsIterable(Graph graph, BlankNodeOrIRI subject, IRI predicate, RDFTerm object);
  long size(Graph graph);

  // from org.apache.commons.rdf.api.Triple
  BlankNodeOrIRI getSubject(Triple triple);
  IRI getPredicate(Triple triple);
  RDFTerm getObject(Triple triple);

  // from org.apache.commons.rdf.api.RDFTerm
  <T> T visit(RDFTerm t,
              Function<IRI, T> fIRI,
              Function<BlankNode, T> fBNode,
              Function<Literal, T> fLiteral);

  // from org.apache.commons.rdf.api.IRI
  String getIRIString(IRI iri);

  // from org.apache.commons.rdf.api.BlankNode
  String uniqueReference(BlankNode bnode);

  // from org.apache.commons.rdf.api.Literal
  IRI getDatatype(Literal literal);
  Optional<String> getLanguageTag(Literal literal);
  String getLexicalForm(Literal literal);
}
</code></pre>
<p>We are doing exactly the same thing as <a href="http://stackoverflow.com/questions/2709821/what-is-the-purpose-of-self-in-python">Python does with <code>self</code></a>: class methods are just functions where the first argument used to be the receiver (aka the object) of the method.</p>
<p>For the sake of brevity, I am actually showing you the final result for the <code>RDF</code> module. Let’s discuss the other issues that were fixed at the same time.</p>
<h2 id="visitor">visitor</h2>
<p>In Commons RDF 0.1, an <code>RDFTerm</code> is either an <code>IRI</code> or a <code>BlankNode</code> or a <code>Literal</code>. It is not clear to me how a user can dispatch a function over an <code>RDFTerm</code> based on its actual nature.</p>
<p>My best guess is that one is expected to use <code>instanceof</code> to discriminate between the possible interfaces. In practice, this cannot really work. As a counter-example, consider this implementation of <code>RDFTerm</code> which relies on the <a href="http://www.w3.org/TR/n-triples/">N-Triples</a> encoding of the term:</p>
<pre><code class="language-java">public class NTriplesBased implements RDFTerm, IRI, BlankNode, Literal {
  private String ntriplesRepresentation;
  ...
}
</code></pre>
<p>So how does one visit a class-hierarchy in Java? By using the <a href="http://en.wikipedia.org/wiki/Design_Patterns">Gang of Four</a>’s <a href="http://en.wikipedia.org/wiki/Visitor_pattern">Visitor Pattern</a> of course! Ah ah, just kidding. It’s 2015, we can now have a <a href="http://logji.blogspot.com/2012/02/correcting-visitor-pattern.html">stateless and polymorphic version of the visitor pattern</a>. Actually, we can do even better using Java 8’s lambdas.</p>
<p>The <code>RDF#visit</code> function defined above in the <code>RDF</code> module is a <a href="https://en.wikipedia.org/wiki/Catamorphism">visitor on steroids</a>:</p>
<pre><code class="language-java"><T> T visit(RDFTerm t,
            Function<IRI, T> fIRI,
            Function<BlankNode, T> fBNode,
            Function<Literal, T> fLiteral);
</code></pre>
<p>The contract for <code>RDF#visit</code> is pretty simple: dispatch the right function – <code>fIRI</code> or <code>fBNode</code> or <code>fLiteral</code> – by case, depending on what the <code>RDFTerm t</code> actually is. Note that the function itself is parameterized on the return type, so that any computation can be defined. And finally, as explained before, the <a href="http://en.wikipedia.org/wiki/Visitor_pattern#Details"><em>element</em> part of the visitor</a> – the <code>RDFTerm</code> itself – has become the first argument of the function, instead of being the receiver of a method.</p>
<p>Finally, here is what it looks like on the user side:</p>
<pre><code class="language-java">RDFTerm term = ???;

String someString = rdf.visit(term,
                              iri -> rdf.getIRIString(iri),
                              bnode -> rdf.uniqueReference(bnode),
                              literal -> rdf.getLexicalForm(literal));
</code></pre>
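<p>For completeness, here is a sketch of what an <em>implementer</em> might do to provide <code>visit</code> for one concrete binding of the generics. The term classes below are hypothetical, but they show that the <code>instanceof</code> checks stay hidden inside the module, on the implementer’s side only:</p>

```java
import java.util.function.Function;

public class VisitSketch {

    // hypothetical concrete term classes of one particular implementation
    static class Iri   { final String value; Iri(String v)   { value = v; } }
    static class BNode { final String id;    BNode(String i) { id = i; } }
    static class Lit   { final String lex;   Lit(String l)   { lex = l; } }

    // the catamorphism: dispatch by case on the runtime type,
    // parameterized on the return type T
    static <T> T visit(Object term,
                       Function<Iri, T> fIRI,
                       Function<BNode, T> fBNode,
                       Function<Lit, T> fLiteral) {
        if (term instanceof Iri)   return fIRI.apply((Iri) term);
        if (term instanceof BNode) return fBNode.apply((BNode) term);
        if (term instanceof Lit)   return fLiteral.apply((Lit) term);
        throw new IllegalArgumentException("not an RDF term: " + term);
    }

    public static void main(String[] args) {
        Object term = new Lit("Alice");
        String s = visit(term, i -> i.value, b -> b.id, l -> l.lex);
        System.out.println(s);  // prints "Alice"
    }
}
```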
<h2 id="downcasting">downcasting</h2>
<p><code>RDFTermFactory</code> follows the <a href="http://en.wikipedia.org/wiki/Abstract_factory_pattern">Abstract Factory pattern</a>, which is very limited in practice. Pretty often, seeing only the generic interface is just not enough, and people end up <a href="http://stackoverflow.com/questions/380813/downcasting-in-java">downcasting</a> anyway because other functionalities may need to be exposed from the sub-types.</p>
<p>In my opinion, this is a big issue in something like Commons RDF and <a href="https://www.artima.com/interfacedesign/PreferPoly.html">we can do better</a>. In fact, it comes for free in the <code>RDF</code> module defined above, as <strong>the user sees the types that were actually bound to the generics</strong>.</p>
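<p>Here is a sketch of why the downcasting problem disappears: once the generics are bound to concrete types, the user manipulates those types directly. The <code>MyIri</code> class and its extra <code>scheme()</code> method below are hypothetical, standing in for whatever extra functionality a real implementation exposes:</p>

```java
public class BindingSketch {

    // a hypothetical implementation's own IRI type, with extra
    // functionality beyond any common interface
    static class MyIri {
        final String value;
        MyIri(String v) { value = v; }
        String scheme() { return value.substring(0, value.indexOf(':')); }
    }

    // a one-method slice of the generic factory
    interface Factory<IRI> {
        IRI createIRI(String iri);
    }

    public static void main(String[] args) {
        // IRI is bound to MyIri, so the factory returns MyIri directly:
        Factory<MyIri> factory = MyIri::new;
        MyIri iri = factory.createIRI("http://example.com/Alice");
        // no downcast needed to reach the implementation-specific method
        System.out.println(iri.scheme());  // http
    }
}
```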
<h2 id="immutable-graph">immutable graph</h2>
<p>If you look at <code>Graph#add(Triple)</code> you’ll see that it returns <code>void</code>: graphs in Commons RDF 0.1 <em>have to be mutated in place and there is no way around it</em>. This is wrong, but do not expect me to use this post to make the case for allowing immutable graphs: it’s 2015 and I should not have to do that.</p>
<p>Especially when the fix is very simple: <strong>just make <code>add</code> return a new <code>Graph</code></strong>. That’s actually what <code>Graph RDF#add(Graph,Triple)</code> does.</p>
<p>Note that with this approach, one can still manipulate mutable graphs. It’s just that code using <code>RDF#add</code> should always use the returned <code>Graph</code>, even if it happens to have been mutated in place.</p>
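<p>Here is a minimal sketch of the immutable-friendly contract, using plain strings as triples for brevity (none of this is actual Commons RDF code): <code>add</code> returns a fresh <code>Graph</code> and leaves its argument untouched.</p>

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class ImmutableGraphSketch {

    static class Graph {
        final Set<String> triples;  // triples as plain strings, for brevity
        Graph(Set<String> ts) { triples = Collections.unmodifiableSet(ts); }
    }

    // returns a NEW graph; the original graph is never mutated
    static Graph add(Graph g, String triple) {
        Set<String> copy = new HashSet<>(g.triples);
        copy.add(triple);
        return new Graph(copy);
    }

    public static void main(String[] args) {
        Graph g0 = new Graph(new HashSet<>());
        Graph g1 = add(g0, "<#alice> <#name> \"Alice\"");
        // g0 still has 0 triples, g1 has 1
        System.out.println(g0.triples.size() + " " + g1.triples.size());
    }
}
```

<p>A mutable implementation could instead mutate the underlying store and return the same graph; caller code written against <code>RDF#add</code> works unchanged either way, as long as it always uses the returned graph.</p>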
<h2 id="stateless-bnode-generator">stateless blank node generator</h2>
<p>This is how one can create new blank nodes in Commons RDF 0.1 (<a href="https://git-wip-us.apache.org/repos/asf?p=incubator-commonsrdf.git;a=blob;f=api/src/main/java/org/apache/commons/rdf/api/RDFTermFactory.java;h=2801814832b5961769c6d2fbde02d1e494db1124;hb=9ee66b0078da61fed85b5fe0b6d5481e9300b140#l42">full javadoc here</a>):</p>
<pre><code class="language-java">/** The returned blank node MUST NOT be equal to any existing */
default BlankNode createBlankNode()
  throws UnsupportedOperationException { ... }

/** All `BlankNode`s created with the given `name` MUST be equivalent */
default BlankNode createBlankNode(String name)
  throws UnsupportedOperationException { ... }
</code></pre>
<p>The contract on the second <code>createBlankNode</code> is problematic, as <em>a map from names to previously allocated <code>BlankNode</code>s has to be maintained somewhere</em>. Of course, I am ruling out strategies relying on hashes, e.g. <code>UUID#nameUUIDFromBytes</code>, because the <code>BlankNode</code>s would no longer be scoped: two different blank nodes <code>_:b1</code> from two different <a href="http://www.w3.org/TR/turtle/">Turtle</a> documents would map to the same “equivalent <code>BlankNode</code>”. So that means that <strong><code>RDFTermFactory</code> is not stateless</strong>, and whether the state lives inside the factory or somewhere shared is not relevant.</p>
<p><strong>I believe that this is outside of the RDF model and that it has no place in the framework.</strong> The mapping from <code>name</code> to <code>BlankNode</code> can always be maintained on the user side, using the strategy that fits best. Still, you can see that I defined <code>BlankNode RDF#createBlankNode(String)</code>. It’s because I think another contract can be useful here, where a <code>String</code> can be passed as a hint to be retrieved later, e.g. when using <code>RDF#uniqueReference</code>. But it’s only a hint; it has no impact on the model itself.</p>
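<p>A stateless implementation of that hint-based contract could look like the following sketch (the class and the naming scheme are mine, not part of any proposal): the name is only embedded in the unique reference, and no name-to-node map is kept anywhere.</p>

```java
import java.util.UUID;

public class BNodeHintSketch {

    static class BlankNode {
        final String ref;
        BlankNode(String ref) { this.ref = ref; }
    }

    // stateless: two calls with the same hint still yield distinct nodes,
    // but the hint can be recovered later from the unique reference
    static BlankNode createBlankNode(String nameHint) {
        return new BlankNode(nameHint + "-" + UUID.randomUUID());
    }

    public static void main(String[] args) {
        BlankNode b1 = createBlankNode("b1");
        BlankNode b2 = createBlankNode("b1");
        System.out.println(b1.ref.equals(b2.ref));  // false: distinct nodes
    }
}
```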
<h2 id="UnsupportedOperationException">UnsupportedOperationException</h2>
<p>I just do not understand the value of specifying methods that can throw an <code>UnsupportedOperationException</code> in the context of Commons RDF. I mean, how am I expected to recover from such an exception? Does it make sense to allow for partial implementations?</p>
<p>Until I see a good use case for that, I have simply removed those exception declarations from the functions defined in the <code>RDF</code> module.</p>
<h2 id="user-side">user side</h2>
<p>Finally, let’s see how a library user could define a parser/serializer using the <code>RDF</code> module:</p>
<pre><code class="language-java">public class WeirdTurtle<Graph,
                         Triple,
                         RDFTerm,
                         BlankNodeOrIRI extends RDFTerm,
                         IRI extends BlankNodeOrIRI,
                         BlankNode extends BlankNodeOrIRI,
                         Literal extends RDFTerm> {

  private RDF<Graph, Triple, RDFTerm, BlankNodeOrIRI, IRI, BlankNode, Literal> rdf;

  WeirdTurtle(RDF<Graph, Triple, RDFTerm, BlankNodeOrIRI, IRI, BlankNode, Literal> rdf) {
    this.rdf = rdf;
  }

  /* a very silly parser */
  public Graph parse(String input) {
    Triple triple =
      rdf.createTriple(rdf.createIRI("http://example.com/Alice"),
                       rdf.createIRI("http://example.com/name"),
                       rdf.createLiteral("Alice"));
    Graph graph = rdf.createGraph();
    return rdf.add(graph, triple);
  }

  /* a very silly serializer */
  public String serialize(Graph graph) {
    Triple triple = rdf.getTriplesAsIterable(graph, null, null, null).iterator().next();
    RDFTerm o = rdf.getObject(triple);
    return rdf.visit(o,
                     iri -> rdf.getIRIString(iri),
                     bn -> rdf.uniqueReference(bn),
                     lit -> rdf.getLexicalForm(lit));
  }
}
</code></pre>
<h2 id="summary">summary</h2>
<p>Please allow me to be harsh: <strong>I believe that Commons RDF is mostly useless</strong> in its current form as it suffers from the many flaws I have described in this article.</p>
<p>As you can expect, <a href="http://mail-archives.apache.org/mod_mbox/commonsrdf-dev/201505.mbox/%3CCANvn8kzfnDoA=sgZLmHyN=bE2Gmba=1oNO0Haq72mi_GFkLRng@mail.gmail.com%3E">I have already shared those concerns on the Commons RDF mailing list</a> but I was told that it would be <cite href="http://mail-archives.apache.org/mod_mbox/commonsrdf-dev/201505.mbox/%3CCAOfJQJ0Bnm0Z+J1F6NZMU8oDk0O7J1H+rM1fLL2kUpTtp_9ECQ@mail.gmail.com%3E">much more valuable to see a patch about your proposal than a quick hack from scratch</cite>. Sadly this is no “quick hack” and there is no small patch.</p>
<p>The good news is that <strong>the approach described here works with any RDF implementation on the JVM</strong>, including Jena, Sesame, or <a href="https://github.com/w3c/banana-rdf">banana-rdf</a>. And more importantly, <strong>it works today!</strong></p>
<p>So if you are interested in a classless – but still classy – and immutable-friendly RDF abstraction for the JVM, I invite you to get in touch with me so that we can define that abstraction together.</p>
How to read a PGP-encrypted email from the command-line2015-02-21T00:00:00-05:00Alexandre Bertailstag:bertails.org,2015-02-21:2015/02/21/decrypt-pgp-email-from-command-line/<p>I received a PGP-encrypted email a couple days ago with confidential information. As I use GMail, I do not have direct support for PGP. Here are the steps I followed in order to extract the message and the files in it, from the command-line.</p>
<p>Disclaimer: I already knew the <a href="https://www.gnupg.org/gph/en/manual.html">big picture for PGP</a> but it was the first time I had to effectively use it.</p>
<p>The message body was empty, with just a <code>msg.asc</code> file attached:</p>
<pre><code>$ head msg.asc
-----BEGIN PGP MESSAGE-----
Version: GnuPG v1
hQEMA5Rm9tOuXUEGAQgAlcrBh++K7tBf6UhLPR3MM1S3N94xfSRamHWLXMBj5dp6
9fg+a2GuQDRnta+QRgmlkgXha/6vU9eFzqx9Fh7neeFOC2aOc+8wq7KSNXjUaX0o
wRdm1Jbh7fKy9ygNKGcTkikrpuVtYj1GrLjKD5CJ0gdGvv9vQIr8bUuVE+WwKgOr
hIv4sWDXChiWahDtY8A/LktfAWd0eVZ47FzQQ/LKo89v8POxvqPACmyzDRNKkNhy
AJSu2kjA44k/f79n880lMKZ89GMYjzKISxkxWYi4ccZPOmXgYFIrx5SFDhJNPhaw
1gd3InrLpBdTYGuJZxwRcZ1SpY4v5siDLoXQHnuHONLtAZh02Viq/F0cwWuRyMk5
km2lb3OREW2bHEzHTL5U4/Vb71cup0U7js7J7WvxOR7TCzizShX4w+uRAbfuLmH+
</code></pre>
<p>So I basically knew this had been encrypted using PGP with <a href="http://keyserver.ubuntu.com/pks/lookup?op=vindex&search=bertails&fingerprint=on">my public PGP key</a>. I already had one because I need to <a href="http://www.scala-sbt.org/release/docs/Using-Sonatype.html#First+-+PGP+Signatures">sign artifacts when publishing on Sonatype</a>.</p>
<p>First, I install the GPG toolkit for Ubuntu:</p>
<pre><code>$ sudo apt-get install pgpgpg
</code></pre>
<p>Then I get the sender’s public key from his website and add it to my keyring (<a href="http://www.math.utah.edu/~beebe/PGP-notes.html">source</a>):</p>
<pre><code>$ pgp -ka send.pubkey
</code></pre>
<p>And I can finally extract the content.</p>
<pre><code>$ pgp msg.asc
</code></pre>
<p>The resulting file <code>msg</code> is a typical email attachment:</p>
<pre><code>Content-Type: multipart/mixed; boundary="a8Wt8u1KmwUX3Y2C"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
--a8Wt8u1KmwUX3Y2C
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
...
--a8Wt8u1KmwUX3Y2C
Content-Type: application/pdf
Content-Disposition: attachment; filename="the-file.pdf"
Content-Transfer-Encoding: base64
...
--a8Wt8u1KmwUX3Y2C--
</code></pre>
<p>I still need to extract the PDF. For that, I used <code>munpack</code>:</p>
<pre><code>$ sudo apt-get install mpack
$ munpack msg
the-file.pdf (application/pdf)
</code></pre>
<p>Et voilà!</p>
<h2>Bonus</h2>
<p>If you have forgotten your passphrase to unlock your PGP key, you can use this command <a href="http://stackoverflow.com/a/11484411">found on stackoverflow</a>:</p>
<pre><code>$ echo "1234" | gpg --no-use-agent -o /dev/null --local-user 'Alexandre Bertails <alexandre@bertails.org>' --no-greeting -as - && echo "The correct passphrase was entered for this key"
</code></pre>
<p>You can also use the <code>--passphrase</code> parameter if needed.</p>
Abstract Algebraic Data Type2015-02-15T00:00:00-05:00Alexandre Bertailstag:bertails.org,2015-02-15:2015/02/15/abstract-algebraic-data-type/<p>Scala’s sealed class hierarchies (aka. <a href="https://en.wikipedia.org/wiki/Algebraic_data_type">Algebraic Data Types</a>) are for sure one of its most praised features. Yet, they have one downside: they don’t let us abstract over the type hierarchy as <code>trait</code>s and <code>class</code>es are all about constructing new <strong>concrete</strong> types.</p>
<p>In this post, we will explore how we can relax this constraint so that we can get an abstracted version of <code>scala.Option</code>, which would allow us to switch implementations.</p>
<h2 id="deconstructing">Deconstructing Scala's algebraic data types</h2>
<p>As a reminder, here is <a href="https://github.com/scala/scala/blob/2.11.x/src/library/scala/Option.scala">how Scala’s <code>Option</code>s are implemented</a>:</p>
<pre><code class="language-scala">sealed abstract class Option[+A]
final case class Some[+A](x: A) extends Option[A]
case object None extends Option[Nothing]
</code></pre>
<p>Here is the corresponding scaladoc diagram:</p>
<p><a href="scala-option.png"><img src="scala-option.png" alt="scala.Option" /></a></p>
<p>There are quite a few things happening here:</p>
<ul>
<li>we need to be able to speak about the <strong>types and their relationships</strong>;</li>
<li>then we need a way to <strong>inject</strong> values in those types;</li>
<li>finally we need a way to inspect the values for those types to <strong>extract</strong> their content.</li>
</ul>
<h2 id="type-and-relationships">On types and subtyping</h2>
<p>There is a <strong>subtyping relationship</strong> between <code>Some</code>/<code>None</code> and <code>Option</code>.</p>
<p>Actually, <code>None</code> itself is not a type but a value, whose type is <code>None.type</code>, a subtype of <code>Option</code>. Also, <code>Option</code> and <code>Some</code> are not technically types, but <strong>type constructors</strong> (aka. <a href="https://stackoverflow.com/questions/6246719/what-is-a-higher-kinded-type-in-scala">higher kinded types</a>): we need to provide a type <code>A</code> to produce an <code>Option[A]</code> type.</p>
<p>Finally, <code>Option</code> is covariant in its parameterized type <code>A</code>, so that <code>Option[Nothing]</code> is a subtype of <code>Option[A]</code> because <a href="https://stackoverflow.com/questions/1728541/if-the-nothing-type-is-at-the-bottom-of-the-class-hierarchy-why-can-i-not-call"><code>Nothing</code> is a subtype of any type</a>.</p>
<h2 id="injectors">On injectors</h2>
<p>We have two (here somewhat equivalent) ways of <strong>injecting</strong> a value of type <code>A</code> into a <code>Some[A]</code>:</p>
<ul>
<li>we can do that through the class constructor, eg. <code>new Some(42)</code></li>
<li>or more naturally through <code>Some.apply</code>, eg. <code>Some(42)</code></li>
</ul>
<p><code>None</code> is a singleton object, therefore it is the <em>only inhabitant</em> of <code>None.type</code>.</p>
<h2 id="extractors">On extractors</h2>
<p>Given an <code>Option[A]</code>, we can reason by cases using <strong>pattern matching</strong>.</p>
<p>This is achieved through the <a href="http://docs.scala-lang.org/tutorials/tour/extractor-objects.html"><code>unapply</code> extractor methods</a> on the <code>Option</code> companion object. And because <code>Option</code> is sealed, the type checker will be able to check for exhaustiveness.</p>
<p>Finally, given a <code>Some[A]</code>, we can retrieve its content through the <code>x</code> field accessor, or again using the <code>Some.unapply</code> extractor.</p>
<h2 id="abstracting-over-types">Abstracting over types</h2>
<p>My colleague <a href="https://twitter.com/dwhjames">Dan</a> explored <a href="http://io.pellucid.com/blog/scalas-modular-roots">how to encode modules in Scala</a> in a previous blog article. If you haven’t read it yet, I warmly recommend you do so, even if it is not strictly necessary for understanding what is going on here. Here, I will choose yet another encoding, using a <a href="https://en.wikipedia.org/wiki/Type_class">typeclass</a> approach.</p>
<p>First, let’s define the entire type hierarchy in one place:</p>
<pre><code class="language-scala">import scala.language.higherKinds

trait OptionSig {
  type Option[+_]
  type Some[+A] <: Option[A]
  type None <: Option[Nothing]
}
</code></pre>
<p>We used the <code>Sig</code> suffix as if <code>OptionSig</code> were an <a href="http://caml.inria.fr/pub/docs/oreilly-book/html/book-ora131.html">ML module signature</a>, but this is <strong>not the complete signature</strong>, as there are no functions defined in this trait.</p>
<p>This is just a convenient way to gather several types into a single one, a bit like a record, but for types. Given an <code>OptionSig</code>, we can now speak about one of the types it contains using a type projection, eg. <code>OptionSig#Option[A]</code>.</p>
<h2 id="abstracting-over-operations">Abstracting over operations</h2>
<p>Now that we have a type hierarchy, we can complete our signature with the operations that must be defined over it:</p>
<pre><code class="language-scala">abstract class OptionOps[Sig <: OptionSig] {
  def some[A](x: A): Sig#Some[A]
  def none: Sig#None
  def fold[A, B](opt: Sig#Option[A])(ifNone: => B, ifSome: A => B): B
}
</code></pre>
<p>You might be wondering why we need this <code>Sig</code> as a subtype for <code>OptionSig</code>, as this is usually not needed for typeclasses. It’s because we need to be able to project its inner types.</p>
<p><code>some[A]</code> is the injector for <code>Sig#Some[A]</code>. <code>none</code> doesn’t take any parameter, so it really acts as a singleton value for <code>Sig#None</code>.</p>
<p><code>fold[A, B]</code> is the essence of the <code>Sig#Option[A]</code> type: given the two passed functions, it can react on the actual type for <code>opt</code> at runtime:</p>
<ul>
<li>if <code>opt</code> was a <code>Sig#None</code>, then the value for <code>ifNone</code> is returned (notice that it is a <em>lazy parameter</em> which is only computed if needed)</li>
<li>if <code>opt</code> was a <code>Sig#Some[A]</code>, then the <code>ifSome</code> function has access to the contained value to compute its result</li>
</ul>
<p>By the way, an algebra defined through a <code>fold</code> is called a <a href="https://en.wikipedia.org/wiki/Catamorphism">catamorphism</a>!</p>
<p>Finally, we can define a helper to retrieve an instance of <code>OptionOps[Sig]</code> given a signature, if it is available:</p>
<pre><code class="language-scala">object OptionOps {
  def apply[Sig <: OptionSig](implicit ops: OptionOps[Sig]): OptionOps[Sig] = ops
}
</code></pre>
<h2 id="functor">Functions over <code>OptionSig</code>/<code>OptionOps</code></h2>
<p>We now want to define new structures that depend on our module. For this, we need something similar to an <a href="http://caml.inria.fr/pub/docs/oreilly-book/html/book-ora131.html">ML functor</a>.</p>
<p>For example, let’s define a functor that can construct instances of <code>scalaz.Show</code> for us:</p>
<pre><code class="language-scala">import scalaz.Show

class OptionShow[Sig <: OptionSig : OptionOps] {

  def optionShow[A : Show]: Show[Sig#Option[A]] = {
    // retrieving the typeclass instances
    val showA = Show[A]
    val ops = OptionOps[Sig]
    val instance = new Show[Sig#Option[A]] {
      override def shows(opt: Sig#Option[A]): String = ops.fold(opt)(
        "none",
        x => s"some(${showA.shows(x)})"
      )
    }
    instance
  }

}

object OptionShow {
  implicit def apply[Sig <: OptionSig : OptionOps]: OptionShow[Sig] = new OptionShow[Sig]
}
</code></pre>
<p>That is a lot of weird Scala notations that you may not be familiar with. Let’s decompose them.</p>
<p><code>OptionShow[Sig <: OptionSig : OptionOps]</code> means that <code>OptionShow</code> is parameterized by a <code>Sig</code>, which is required to be a subtype of <code>OptionSig</code>. Also an instance of <code>OptionOps[Sig]</code> must be <strong>implicitly</strong> available.</p>
<p><code>def optionShow[A : Show]: Show[Sig#Option[A]]</code> means that if we can provide an instance of <code>Show[A]</code>, then <code>optionShow</code> can construct an instance of <code>Show[Sig#Option[A]]</code> for us.</p>
<p><code>scalaz.Show</code> is a simple yet powerful typeclass from Scalaz. It simply provides a <code>shows</code> function for instances of the provided type (here <code>Sig#Option[A]</code>). The trick here is that unlike <code>Object#toString()</code>, our <code>Show</code> instances are <strong>driven by types</strong>, so we can rely on a <code>Show[A]</code> being available.</p>
<h2 id="simple-implementation">A simple implementation</h2>
<p>We almost have everything we need in place. We just need to provide an implementation for our module.</p>
<p><code>scala.Option</code> looks like a good candidate for a first implementation; after all, that’s where we started from:</p>
<pre><code class="language-scala">trait ScalaOption extends OptionSig {
  type Option[+A] = scala.Option[A]
  type Some[+A] = scala.Some[A]
  type None = scala.None.type
}

object ScalaOption {
  implicit object ops extends OptionOps[ScalaOption] {
    def some[A](x: A): ScalaOption#Some[A] = scala.Some(x)
    val none: ScalaOption#None = scala.None
    def fold[A, B](opt: ScalaOption#Option[A])(ifNone: => B, ifSome: A => B): B =
      opt match {
        case scala.None    => ifNone
        case scala.Some(x) => ifSome(x)
      }
  }
}
</code></pre>
<p>Nothing fancy here. We just plugged (i.e. <a href="http://docs.scala-lang.org/tutorials/tour/abstract-types.html">aliased</a>) our abstract types onto the concrete ones. <code>some</code> and <code>none</code> respectively delegate to the <code>Some.apply</code> function and the <code>None</code> singleton. Finally, the <code>fold</code> implementation relies on pattern matching.</p>
<p>Just note that the typeclass instance for <code>OptionOps[ScalaOption]</code> is made available in the companion object for <code>ScalaOption</code> so that <a href="http://eed3si9n.com/implicit-parameter-precedence-again">it will <strong>always</strong> be picked up by Scala when looking for such an implicit</a>.</p>
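<p>That resolution rule is easy to check in isolation: when looking for a <code>TC[X]</code>, the compiler always searches the companion objects of both <code>TC</code> and <code>X</code>, with no import required. A minimal sketch with hypothetical names (<code>TC</code>, <code>Foo</code>):</p>
<pre><code class="language-scala">trait TC[X] { def label: String }

trait Foo
object Foo {
  // lives in the companion object of Foo, hence in the
  // implicit scope of TC[Foo]: always found, never imported
  implicit object fooTC extends TC[Foo] { val label = "TC[Foo] from companion" }
}

def summonTC[X](implicit tc: TC[X]): TC[X] = tc
</code></pre>
<p>Calling <code>summonTC[Foo]</code> anywhere in the program resolves to <code>Foo.fooTC</code>, which is exactly why <code>OptionOps[ScalaOption]</code> is placed in <code>ScalaOption</code>'s companion.</p>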
<h2 id="program">Using our option</h2>
<p>Finally, we can write a program using our shiny abstractions :-)</p>
<pre><code class="language-scala">class Program[Sig <: OptionSig : OptionOps] extends App {

  val ops = OptionOps[Sig]
  import ops._

  // a little dance to derive our Show instance
  import scalaz.std.anyVal.intInstance
  val showOptOptInt = {
    implicit val showOptInt = OptionShow[Sig].optionShow[Int]
    OptionShow[Sig].optionShow[Sig#Option[Int]]
  }

  // scalaz's syntax tricks are awesome
  import showOptOptInt.showSyntax._

  val optOpt = some(some(42))
  println("optOpt: " + optOpt.shows)

  val optNone = some(none)
  println("optNone: " + optNone.shows)

}
</code></pre>
<p>And we plug everything together:</p>
<pre><code class="language-scala">scala> object MainWithScalaOption extends Program[ScalaOption]
defined object MainWithScalaOption
scala> MainWithScalaOption.main(Array())
optOpt: some(some(42))
optNone: some(none)
</code></pre>
<h2 id="custom-implementation">Our own module implementation</h2>
<p>Turns out there are many ways to implement our module.</p>
<p>Here is a version of our module where we provide our own classes:</p>
<pre><code class="language-scala">object MyOption extends OptionSig {

  sealed abstract class Option[+A]
  final case class Some[+A](x: A) extends Option[A]
  sealed abstract class None extends Option[Nothing]
  case object None extends None

  implicit object ops extends OptionOps[MyOption.type] {
    def some[A](x: A): MyOption.type#Some[A] = Some(x)
    val none: MyOption.type#None = None
    def fold[A, B](opt: MyOption.type#Option[A])(ifNone: => B, ifSome: A => B): B =
      opt match {
        case None    => ifNone
        case Some(x) => ifSome(x)
      }
  }

}
</code></pre>
<p>Notice that our signature lies in the singleton type <code>MyOption.type</code>. Scala will have no issue finding the implicit instance in itself because <em>the companion object for a singleton object is itself</em>!</p>
<p>We have introduced an <code>abstract class None</code> so that we don’t need to define a type alias <code>type None = None.type</code>. It is also interesting to see that Scala doesn’t require us to define our classes outside of <code>MyOption</code> only to alias them later: we just do everything at once.</p>
<h2 id="java8-implementation">Java8-based implementation</h2>
<p>Now, let’s reuse Java 8 <code>java.util.Optional</code>!</p>
<pre><code class="language-scala">import java.util.Optional

trait Java8Option extends OptionSig {
  type Option[+A] = Optional[_ <: A]
  type Some[+A] = Optional[_ <: A]
  type None = Optional[Nothing]
}

object Java8Option {
  implicit object ops extends OptionOps[Java8Option] {
    def some[A](x: A): Java8Option#Some[A] = Optional.of(x)
    val none: Java8Option#None = Optional.empty()
    def fold[A, B](opt: Java8Option#Option[A])(ifNone: => B, ifSome: A => B): B = {
      import java.util.function.{ Function => F, Supplier }
      def f = new F[A, B] { def apply(a: A): B = ifSome(a) }
      def supplier = new Supplier[B] { def get(): B = ifNone }
      opt.map[B](f).orElseGet(supplier)
    }
  }
}
</code></pre>
<p><a href="http://docs.oracle.com/javase/8/docs/api/java/util/Optional.html">Java 8’s <code>Optional</code></a> has only one class for the two cases, and it was made invariant. Still, we can easily fix that on the Scala side with <code>[_ <: A]</code>.</p>
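<p>The variance trick deserves a closer look. Here is a hedged, standalone sketch showing how the existential upper bound makes the invariant <code>Optional</code> usable covariantly; <code>CovarianceDemo</code> and <code>CovariantOptional</code> are illustrative names only:</p>
<pre><code class="language-scala">import java.util.Optional

object CovarianceDemo {

  // Optional itself is invariant, but `Optional[_ <: A]` varies with A
  type CovariantOptional[+A] = Optional[_ <: A]

  val noneOfNothing: Optional[Nothing] = Optional.empty()

  // compiles because Nothing <: String: the single empty value
  // can be used wherever a CovariantOptional[String] is expected
  val asString: CovariantOptional[String] = noneOfNothing

  // the non-empty case also widens as expected
  val someHello: CovariantOptional[String] = Optional.of("hello")
}
</code></pre>
<p>This is the same shape as <code>Java8Option#Option[+A]</code> above: a single <code>Optional[Nothing]</code> value plays the role of <code>None</code> for every element type.</p>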
<h2 id="any-implementation"><code>Any</code>-based implementation</h2>
<p>Remember all the flame wars over <code>Option</code> vs <code>null</code>? Or the problem with boxing? Look at this:</p>
<pre><code class="language-scala">trait NullOption extends OptionSig {
  type Option[+A] = Any
  type Some[+A] = Any
  type None = Null
}

object NullOption {
  implicit object ops extends OptionOps[NullOption] {
    def some[A](x: A): NullOption#Some[A] = x
    val none: NullOption#None = null
    def fold[A, B](opt: NullOption#Option[A])(ifNone: => B, ifSome: A => B): B = {
      if (opt == null) ifNone
      else ifSome(opt.asInstanceOf[A])
    }
  }
}
</code></pre>
<p>Yes, that’s right, we are relying on <code>null</code> for the <code>None</code> case while the <code>Some</code> case is the value itself :-)</p>
<p>But this is <strong>completely typesafe</strong> as it never leaks outside of the abstraction. The trick is that <code>Null</code> is a subtype of <code>Any</code>. And you can note that there is <strong>no wrapping involved</strong>.</p>
<h2 id="final">Back to our program</h2>
<p>We now have four implementations of our option module, behaving almost identically. The one visible difference is that <code>NullOption</code> prints <code>none</code> for <code>some(none)</code>, since both are represented by <code>null</code>:</p>
<pre><code class="language-scala">scala> object MainWithScalaOption extends Program[ScalaOption]
defined object MainWithScalaOption
scala> MainWithScalaOption.main(Array())
optOpt: some(some(42))
optNone: some(none)
scala> object MainWithJava8Option extends Program[Java8Option]
defined object MainWithJava8Option
scala> MainWithJava8Option.main(Array())
optOpt: some(some(42))
optNone: some(none)
scala> object MainWithMyOption extends Program[MyOption.type]
defined object MainWithMyOption
scala> MainWithMyOption.main(Array())
optOpt: some(some(42))
optNone: some(none)
scala> object MainWithNullOption extends Program[NullOption]
defined object MainWithNullOption
scala> MainWithNullOption.main(Array())
optOpt: some(some(42))
optNone: none
</code></pre>
<p>How cool is that?</p>
<h2 id="summary">Summary</h2>
<p>In the process, we have shown that typeclasses are a great alternative to the cake pattern when it comes to encoding modules in Scala.</p>
<p>In practice, some variations are possible. For example, we could have ignored the subtyping relationships altogether. We would have ended up with something closer to what happens in OCaml or Haskell, as the constructors would both return an <code>OptionSig#Option[A]</code> instead of a subtype. Also, it would be easy to define some syntax enhancement, so that one could directly write something like <code>myOption.fold("42", x => x.toString)</code>.</p>
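<p>Such a syntax enhancement could look like the following hedged sketch. To stay self-contained it uses a plain higher-kinded parameter rather than the <code>Sig#</code> projections, and the operation is named <code>cata</code> instead of <code>fold</code> so that it does not clash with <code>scala.Option</code>'s built-in <code>fold</code>; <code>OptionAlg</code> and <code>CataOps</code> are illustrative names:</p>
<pre><code class="language-scala">trait OptionAlg[F[_]] {
  def fold[A, B](opt: F[A])(ifNone: => B, ifSome: A => B): B
}

object OptionAlg {
  // an instance for scala.Option, found via the companion object
  implicit object scalaOptionAlg extends OptionAlg[Option] {
    def fold[A, B](opt: Option[A])(ifNone: => B, ifSome: A => B): B =
      opt match {
        case None    => ifNone
        case Some(x) => ifSome(x)
      }
  }
}

object syntax {
  // pimp `cata` onto any F[A] that has an OptionAlg instance
  implicit class CataOps[F[_], A](val self: F[A]) {
    def cata[B](ifNone: => B, ifSome: A => B)(implicit alg: OptionAlg[F]): B =
      alg.fold(self)(ifNone, ifSome)
  }
}
</code></pre>
<p>After <code>import syntax._</code>, an expression like <code>(Some(42): Option[Int]).cata("none", x => x.toString)</code> works directly (the ascription to <code>Option[Int]</code> helps inference pick <code>F = Option</code>).</p>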
<p>Finally, if you are interested in a more complex example using the techniques described here, have a look at <a href="https://github.com/w3c/banana-rdf">Banana-RDF</a> and its <a href="https://github.com/w3c/banana-rdf/blob/master/rdf/common/src/main/scala/org/w3/banana/RDF.scala">data model for RDF</a>. The project provides five different implementations: (1) Jena and (2) Sesame, two competing Java libraries for RDF, (3) a pure Scala implementation that compiles down to JVM bytecode as well as (4) to Javascript through <a href="http://www.scala-js.org/">Scala-js</a>, and finally (5) a pure Javascript implementation bound to <a href="https://github.com/antoniogarrote/rdfstore-js">rdfstore-js</a>, again using Scala-js.</p>
Available for hire2015-02-04T00:00:00-05:00Alexandre Bertailstag:bertails.org,2015-02-04:2015/02/04/for-hire/<p>Yesterday, the whole engineering team at Pellucid got laid off. <strong>I am now looking for new adventures</strong>.</p>
<p><a href="http://bertails.org/resume">My resume is available here on my website</a>. Please contact me at <a href="mailto:alexandre@bertails.org">alexandre@bertails.org</a>.</p>
<p>I am open to relocation (almost) anywhere in the world, especially for interesting projects relying on Scala <strong>and</strong> RDF.</p>
<h2>why you should hire me</h2>
<p>I am a <strong>strong Scala developer</strong> with many years of experience and a good background in Computer Science. I love learning new skills and I get involved in <a href="https://github.com/betehess">Open Source projects</a>, leading <a href="https://github.com/w3c/banana-rdf">banana-rdf</a>. I frequently give presentations (my next talk is at <a href="http://event.scaladays.org/scaladays-sanfran-2015#eventid-6548">Scala Days San Francisco</a> in just a few weeks) and organize conferences (<a href="http://nescala.org/">nescala</a>).</p>
<p>I am a <strong>Linked Data expert</strong> who worked at the W3C closely with Director Tim Berners-Lee, <a href="http://www.w3.org/People/Berners-Lee/">inventor of the World Wide Web</a>, and gained a thorough and practical knowledge of HTTP and REST APIs. I am also the editor of two major <a href="http://www.w3.org/TR/tr-editor-all#tr_Alexandre_Bertails">Web standards: RDB2RDF Direct Mapping and Linked Data Patch Format</a>.</p>
<p>Bonus: I have a lovely French accent :-)</p>
<h2>what might make a difference</h2>
<p>My first interest is in your product and the technologies you use. I am looking for a position where I will bring my expertise to the table but also where I will be challenged.</p>
<p>Then I will look at how you work as a team. I have learned over the years how culture can shape teamwork and I am eager to discuss with you what values and attitudes you encourage and nurture in your workplace.</p>
Scala.JS will be for Javascript what Scala is for Java2015-02-01T00:00:00-05:00Alexandre Bertailstag:bertails.org,2015-02-01:2015/02/01/scala-js-prediction/<p>I am writing this article on my way back to New York, after a wonderful <a href="http://www.nescala.org/">nescala 2015</a> in Boston. Definitely a <em>grand cru</em>. One of the hot topics there was <a href="http://www.scala-js.org/">Scala.JS</a>, which is a technology we have recently started to use in <a href="https://github.com/w3c/banana-rdf">banana-rdf</a>. The various discussions and interactions I had during the conference made me realize this:</p>
<blockquote class="twitter-tweet tw-align-center" lang="en"><p>Prediction: Scala.JS will be for Javascript what Scala is for Java/JVM.</p>— Alexandre Bertails (@bertails) <a href="https://twitter.com/bertails/status/561765670019674113">February 1, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>As one could have expected when someone makes such a prediction about programming languages, this sparked an <a href="https://twitter.com/bertails/status/561765670019674113">interesting thread on Twitter</a> :-) So let me try to refine what I think the value proposition is for Scala.JS and how I base it on what happened to Scala.</p>
<p>I don’t know many people who got interested in Scala for its own merits (I am not sure I know any…). In fact, we hear many voices pointing out its quirks, and they are real, but that misses the point: I don’t think that Scala would have become as mainstream as it is today if it was not for Java. Many of us came to Scala from Java because it hit a sweet spot: <strong>1)</strong> it enables serious <strong>functional programming</strong> (no, lambdas are not enough…), <strong>2)</strong> it gives us a richer and more robust <strong>static type system</strong>, and <strong>3)</strong> it remains completely <strong>interoperable with Java</strong>. About that last point: we could code in Scala as if it were Java and easily interact with existing libraries.</p>
<p>My claim is that Scala.JS is doing something similar for Javascript, so let’s see how the previous points apply to it.</p>
<p><strong>1)</strong> Functional programming has become more prevalent in the IT industry. Developers not only know it exists, they learn its merits and are trained to practice it. Actually, we have seen this trend in Javascript itself and two examples come to my mind: imperative callbacks are being replaced by more composable <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Promise">Promises</a> and <a href="http://facebook.github.io/immutable-js/">immutable datastructures are now a thing</a>. Now, despite the fact that Javascript is becoming more functional, I don’t think it feels very natural yet for FP practitioners, while Scala is already offering a better solution in that area, both in the language itself and in its standard library.</p>
<p><strong>2)</strong> I would claim that functional programming becomes interesting only when you are given a way to speak <strong>statically</strong> about the things you manipulate. This is why a robust and powerful type system is so important for so many people. <a href="https://github.com/milessabin/shapeless">Scala shines in that area</a>. Look at projects like <a href="https://github.com/scala-js/scala-js-jquery">scala-js-jquery</a> and imagine how easy it becomes to write jQuery code, being guided by the types while having the compiler checking for you that you are using the library correctly.</p>
<p><strong>3)</strong> Scala is extremely versatile and captures Javascript’s specificities surprisingly well. At the language level, everything you can do in Javascript can be mapped to Scala almost 1-to-1, and Scala’s <a href="http://www.scala-js.org/api/scalajs-library/0.6.0-RC2/#scala.scalajs.js.Dynamic">dynamic capabilities</a> even let you interact with the lack of types when working with Javascript libraries. Based on my experience, writing typed facades for existing libraries is straightforward, and the main challenge is actually figuring out how to properly use the underlying libraries because there are no types to guide you.</p>
<p>Then there is the <em>obvious stuff</em>: all of a sudden, plenty of efficient immutable datastructures and libraries from the Scala world become available in the browser; tools like IDEs finally become usable with code completion and type checking; the code can be optimized because the types are statically known; and finally, Scala.JS being just Scala, it comes with a rich ecosystem and community.</p>
<blockquote class="twitter-tweet tw-align-center" data-conversation="none" lang="en"><p><a href="https://twitter.com/bertails">@bertails</a> Isn't that what Coffee was supposed to be?</p>— Robin Berjon (@robinberjon) <a href="https://twitter.com/robinberjon/status/561790252671844352">February 1, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<blockquote class="twitter-tweet tw-align-center" data-conversation="none" lang="en"><p><a href="https://twitter.com/bertails">@bertails</a> <a href="https://twitter.com/mandubian">@mandubian</a> Clojure(Script) looks to me a better candidate on top of JS than Scala. much more close (loosely typed, functional)</p>— Gaëtan Renaudeau (@greweb) <a href="https://twitter.com/greweb/status/561866804763820032">February 1, 2015</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Just like with Java, many people will be happy to write plain Javascript for possibly quite a long time. But let’s say you disagree with one or more of my points above: you still have <a href="https://github.com/jashkenas/coffeescript/wiki/list-of-languages-that-compile-to-js">plenty of contenders to choose from</a>. But my gut feeling is that there is a huge community out there waiting for a compelling alternative that would bring the triptych functional-programming/static-typing/good-js-interop. Scala.JS just hits that sweet spot and the most exciting times are ahead!</p>
Why LD Patch2014-09-20T00:00:00-05:00Alexandre Bertailstag:bertails.org,2014-09-20:2014/09/20/why-ldpatch/<p>The <a href="http://www.w3.org/2012/ldp/">LDP Working Group</a> recently published <a href="http://www.w3.org/TR/ldpatch/">LD Patch</a>, <cite>a format for describing changes to apply to Linked Data. It is suitable for use with <a href="http://tools.ietf.org/html/rfc5789">HTTP PATCH</a>, a method to perform partial modifications to Web resources.</cite></p>
<p>After explaining the need for a PATCH format for Linked Data, I will go through all the <a href="http://www.w3.org/TR/2014/WD-ldpatch-20140918/#alternative-designs">other candidate technologies that the group considered</a>, before explaining the rationale behind LD Patch. It is fair to remind the reader that the group is still eager for feedback, and that <strong>not all the group participants would agree with the views expressed in this post</strong>.</p>
<h2 id="genesis">Genesis</h2>
<p>Despite strong interest from the group participants in a way to partially update LDP Resources with <a href="http://tools.ietf.org/html/rfc5789">HTTP PATCH</a>, settling on which format to use proved to be more difficult than expected. The group could only agree on standardising the use of PATCH over POST, and decided to wait for concrete proposals while allowing the main specification to reach completion.</p>
<p>Work on a PATCH format for LDP was in limbo for a while, and concretely resumed during the <a href="http://www.w3.org/2012/ldp/wiki/F2F5#Day_3_-_Thursday_April_17">5th LDP face-to-face in Boston, MA</a>, where I presented all the proposals <a href="https://www.w3.org/2012/ldp/wiki/LDP_PATCH_Proposals">the group had gathered so far</a>. I had completed the implementations of both <a href="http://www.w3.org/People/Eric/">Eric</a>’s <a href="http://www.w3.org/2001/sw/wiki/SparqlPatch">SparqlPatch</a> and <a href="http://liris.cnrs.fr/~pchampin/en/">Pierre-Antoine</a>’s <a href="https://github.com/pchampin/rdfpatch">rdfpatch</a> in <a href="https://github.com/w3c/banana-rdf">banana-rdf</a> at that time. Those two proposals were for me the only two serious challengers.</p>
<h2 id="patch-format">A PATCH format for LDP</h2>
<p>Enough talking. What do we even mean by a <em>PATCH format for LDP</em>? Consider the following RDF graph:</p>
<pre><code>$ GET -S -H 'Accept: text/turtle' http://www.w3.org/People/Berners-Lee/card
200 OK
@prefix schema: <http://schema.org/> .
@prefix profile: <http://ogp.me/ns/profile#> .
@prefix ex: <http://example.org/vocab#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://www.w3.org/People/Berners-Lee/card#i> a schema:Person ;
  schema:alternateName "TimBL" ;
  profile:first_name "Tim" ;
  profile:last_name "Berners-Lee" ;
  schema:workLocation [ schema:name "W3C/MIT" ] ;
  schema:performerIn _:b1, _:b2 ;
  ex:preferredLanguages ( "en" "fr" ) .

_:b1 schema:name "F2F5 - Linked Data Platform" ;
  schema:url <https://www.w3.org/2012/ldp/wiki/F2F5> .

_:b2 a schema:Event ;
  schema:name "TED 2009" ;
  schema:startDate "2009-02-04" ;
  schema:url <http://conferences.ted.com/TED2009/> .
</code></pre>
<p>Even if you are not well-versed in <a href="http://www.w3.org/TR/rdf11-primer/">RDF</a> and <a href="http://www.w3.org/TR/ldp/">Turtle</a>, I bet you can still understand that this piece of data is about a person named Tim Berners-Lee, identified by the URI <code><http://www.w3.org/People/Berners-Lee/card#i></code>. Also, TimBL seems to have been a participant in two events, each of them having some data attached to them. Also, do you see how those <code>_:b1</code> and <code>_:b2</code> identifiers give you more flexibility than plain JSON? They are <strong>identifiers local to this graph</strong> and are called <a href="http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#section-blank-nodes">blank nodes</a>.</p>
<p>Other blank nodes get handled by the Turtle syntax, as you can see if you click on the following graph for a full-size visual representation of the data:</p>
<p><a href="timbl-card.png"><img src="timbl-card.png" alt="TimBL's card" /></a></p>
<p>As a side note, let me draw your attention to the URIs being used here: they all resolve to actual documents on the Web, including the vocabularies from <a href="https://schema.org/">schema.org</a> and Facebook’s <a href="http://ogp.me/">Open Graph Protocol</a>.</p>
<p>Now, let’s imagine that TimBL wants to add some geo coordinates to the TED event.</p>
<h2 id="simple-patch">RDF Patch / TurtlePatch</h2>
<p>Here is what TimBL could do with RDF Patch:</p>
<pre><code>$ cat query.rdfp
Add _:b2 <http://schema.org/location> _:loc .
Add _:loc <http://schema.org/name> "Long Beach, California" .
Add _:loc <http://schema.org/geo> _:geo .
Add _:geo <http://schema.org/latitude> "33.7817" .
Add _:geo <http://schema.org/longitude> "-118.2054" .
$ cat query.rdfp | PATCH -S -c 'Content-Type: application/rdf-patch' http://www.w3.org/People/Berners-Lee/card
204 No Content
</code></pre>
<p>Well, this actually does not work.</p>
<p>Remember when I said that the blank node <code>_:b2</code> was a <strong>local identifier</strong> for the graph? This means that TimBL <strong>cannot refer directly</strong> to the TED event from outside the document. That would require the server and the client to agree on a stable identifier for that blank node. That process is called <a href="http://www.w3.org/wiki/BnodeSkolemization">skolemization</a>. It puts a significant burden on the server, which has to manage those stable identifiers. Also, while the use of blank nodes is mostly transparent in Turtle and <a href="http://www.w3.org/TR/json-ld/">JSON-LD</a> as they are hidden in the syntax, skolemization would break the syntax.</p>
<p><a href="http://www.w3.org/2001/sw/wiki/TurtlePatch">TurtlePatch</a> has similar expressive power compared to RDF Patch, but it is defined as a subset of SPARQL Update. It also defines <a href="http://www.w3.org/2001/sw/wiki/TurtlePatch#Handling_Blank_Nodes">skolemization as being part of the protocol</a>, where the client can ask for a skolemized version of the graph, which would then be required before PATCHing.</p>
<p>Because <a href="http://www.websemanticsjournal.org/index.php/ps/article/view/365">blank nodes occur very frequently</a> and skolemization was a no-go for several participants of the group, the <a href="http://www.w3.org/2013/meeting/ldp/2014-04-17#line0243">results</a> of one of the strawpolls we had on this subject were welcomed with surprise:</p>
<blockquote>
<p><strong>STRAWPOLL:</strong> I’d rather have a solution that (a) doesn’t address certain pathological graphs, or (b) requires the server to maintain Skolemization maps</p>
</blockquote>
<p>The participants were largely in favor of (a), while (b) had basically no support. Knowing that, the group could now focus on alternative proposals, such as SparqlPatch.</p>
<h2 id="sparqlpatch">SparqlPatch</h2>
<p><a href="http://www.w3.org/2001/sw/wiki/SparqlPatch">SparqlPatch</a> was proposed by <a href="http://www.w3.org/People/Eric/">Eric Prud’hommeaux</a>, one of the editors for the <a href="http://www.w3.org/TR/sparql11-query/">SPARQL query language</a>. SparqlPatch is a profile for SPARQL Update, as it is defined as a subset of it: a valid SparqlPatch query will always be a valid SPARQL Update query, sharing the same semantics.</p>
<p>Why not full SPARQL Update? Well, SPARQL Update comes with a complex machinery for matching nodes in a graph store. Complexity is not a bad thing when it is justified, which is the case for most SPARQL applications. But it is definitely overkill in the context of LDP, hence Eric’s proposal.</p>
<p>With SparqlPatch, TimBL would be able to update his profile using the following query:</p>
<pre><code>$ cat query.sparql-patch
PREFIX schema: <http://schema.org/>
INSERT {
  ?ted <http://schema.org/location> _:loc .
  _:loc <http://schema.org/name> "Long Beach, California" .
  _:loc <http://schema.org/geo> _:geo .
  _:geo <http://schema.org/latitude> "33.7817" .
  _:geo <http://schema.org/longitude> "-118.2054" .
}
WHERE {
  ?ted schema:url <http://conferences.ted.com/TED2009/>
}
$ cat query.sparql-patch | PATCH -S -c 'Content-Type: text/sparqlpatch' http://www.w3.org/People/Berners-Lee/card
204 No Content
</code></pre>
<p>The <code>WHERE</code> clause binds the variable <code>?ted</code> to the node that satisfies the <code>schema:url</code> constraint, and that variable can now be used to <code>INSERT</code> new triples.</p>
<p>This is definitely better and worth considering, as we now have a way to PATCH graphs with blank nodes. But this is still not perfect…</p>
<p>The runtime complexity for matching nodes in a graph is known to be extremely bad in some cases. While SparqlPatch is better than SPARQL Update in that regard, there are still some issues, which become apparent only when you start implementing and thinking about the runtime semantics. The main data structure in the SPARQL semantics is the <a href="http://www.w3.org/TR/2013/REC-sparql11-query-20130321/#sparqlSolutions">Solution Mapping</a>, which keeps track of which concrete nodes from the graph can be mapped to which variables, applying to each clause in the <code>WHERE</code> statement. So the <a href="http://www.w3.org/TR/2013/REC-sparql11-query-20130321/#BGPsparql">semantics of the Basic Graph Pattern</a> (i.e. all the clauses in the SparqlPatch’s <code>WHERE</code>) involves a lot of costly cartesian products.</p>
<p>Also, it would be nice to change the evaluation semantics of the Basic Graph Pattern such that the evaluation order of the clauses is <strong>exactly</strong> the one from the query. It makes a lot of sense to let the client have some control over the evaluation order in the context of a PATCH.</p>
<p><span id="confusing-semantics">SPARQL Update can also be confusing</span> in that <strong>if a graph pattern doesn’t match anything, the query still succeeds with no effect on the graph</strong>. I have seen many engineers get puzzled by this (perfectly well defined) behaviour, because they were expecting the query to fail: this would happen every time a predicate gets typoed. I am jumping a bit ahead but that is one reason why <strong>LD Patch cannot be compiled down to SPARQL Update while preserving the semantics</strong>.</p>
<p>Finally, SparqlPatch has no support for <code>rdf:list</code>s. On one hand, SPARQL is heavily triple-focused and has never played very well with <code>rdf:list</code>. List matching improved in SPARQL 1.1 with <a href="http://www.w3.org/TR/sparql11-query/#propertypaths">Property Paths</a> but their support is not native, in that <strong>common operations such as slice manipulation, update, or even a simple append, need to be encoded in the query</strong>.</p>
<p>On the other hand, lists are a common data structure in all applications. They come with native support in syntaxes like Turtle or JSON-LD. Append is a very common operation and the user should not have to think about the <a href="http://www.w3.org/2006/Talks/0524-Edinburgh-IH/#(54)">RDF list encoding</a> for such simple operations.</p>
<p>Limited node matching capabilities and native <code>rdf:list</code> support are two <strong>features</strong> of LD Patch.</p>
<h2 id="ld-patch">LD Patch</h2>
<p>LD Patch was originally proposed by <a href="http://liris.cnrs.fr/~pchampin/en/">Pierre-Antoine Champin</a>. The format described in the <a href="http://www.w3.org/TR/2014/WD-ldpatch-20140918/">First Public Working Draft</a> is very close to his original proposal. I became an editor for the specification to make some syntactical enhancements and to make sure that we could provide a <a href="https://github.com/w3c/banana-rdf/blob/2fb79a94c9cb52201daab4bc8608ea819706b5c1/ldpatch/src/main/scala/Semantics.scala#L13">clean formal semantics</a> for it.</p>
<p>Pierre-Antoine maintains a Python implementation. On my side, I have a Scala implementation working with <a href="https://jena.apache.org/">Jena</a>, <a href="http://www.openrdf.org/">Sesame</a>, and plain Scala. <a href="https://deiu.rww.io/profile/card#me">Andrei Sambra</a>, the third editor, is working on <a href="http://golang.org/">Go</a> and Javascript implementations.</p>
<p>A potential drawback for LD Patch is that some RDF graphs cannot be patched. They are deemed <a href="http://www.w3.org/TR/ldpatch/#pathological-graph">pathological</a> and are <a href="http://www.websemanticsjournal.org/index.php/ps/article/view/365">very rare in practice</a>: Linked Data applications should never be concerned. This may not be true for some SPARQL applications, but this is not our use-case here.</p>
<p>Let’s see what TimBL’s query would look like using LD Patch:</p>
<pre><code>$ cat query.ld-patch
@prefix schema: <http://schema.org/> .
Bind ?ted <http://conferences.ted.com/TED2009/> / ^schema:url .
Add ?ted schema:location _:loc .
Add _:loc schema:name "Long Beach, California" .
Add _:loc schema:geo _:geo .
Add _:geo schema:latitude "33.7817" .
Add _:geo schema:longitude "-118.2054" .
$ cat query.ld-patch | PATCH -S -c 'Content-Type: text/ldpatch' http://www.w3.org/People/Berners-Lee/card
204 No Content
</code></pre>
<p>Unlike SparqlPatch, the <code>Bind</code> statement does not operate on triples. Instead, an <a href="http://www.w3.org/TR/ldpatch/#path-expression">LD Path expression</a> (<code>/ ^schema:url</code>) is evaluated against a concrete starting node (<code><http://conferences.ted.com/TED2009/></code>). The result node gets bound to a variable (<code>?ted</code>) which can then be used in the following statements. That is the main difference when compared to SparqlPatch semantics.</p>
<p><span id="similarities">Note</span>: LD Path expressions are very similar to the <a href="http://tools.ietf.org/html/rfc6901">JSON Pointers</a> used in <a href="http://tools.ietf.org/html/rfc6902">JSON Patch</a>, and to the <a href="http://tools.ietf.org/html/rfc5261#ref-W3C.REC-xpath-19991116">XPath selectors</a> used in <a href="http://tools.ietf.org/html/rfc5261">XML Patch</a>.</p>
<p>The runtime semantics for LD Path expressions only rely on a node set. The final set must contain a unique value to be successfully bound to the variable, <strong>otherwise it results in an error</strong>. A path expression is processed from left to right, and can have nested paths for filtering nodes.</p>
<p>Given that semantics, you can imagine that it is 1. easy to reason about, 2. <a href="https://github.com/w3c/banana-rdf/blob/2fb79a94c9cb52201daab4bc8608ea819706b5c1/ldpatch/src/main/scala/Semantics.scala#L158-L192">easy to implement</a>, and 3. very efficient. I would even argue that you cannot remove functionalities from the path expressions without throwing away a whole class of interesting RDF graphs that LD Patch is able to patch.</p>
<p>Writing a parser for LD Patch proved to be of similar difficulty to writing one for SparqlPatch, as they share most of their respective grammars with Turtle. Most of the code for the engine itself actually lies in the support for <code>rdf:list</code>, which basically encodes what users would have to do in their queries if they didn’t have native support for list manipulations. So this ends up being done in one place, once and for all, and that is indeed a very good thing.</p>
<p>The <code>UpdateList</code> operation is very similar to <a href="https://docs.python.org/3/reference/expressions.html#slicings">how slicing is done in Python</a>. I invite you to read the <a href="http://www.w3.org/TR/ldpatch/#update-list-statement">corresponding section in the specification</a> for more examples. LD Patch slicing is very intuitive and so far it has met no resistance in the Working Group.</p>
<h2 id="subjectivity">Subjectivity</h2>
<p>It took a very long time before the group was able to publish LD Patch. I still regret that <em>any opportunity</em> would be taken by a few people to challenge the whole technology, often without even stating which requirements they would like to see addressed.</p>
<p>For example, the main criticism seems to be about the syntax. Yes, it is a new one, even though 68% of the grammar is shared with Turtle. In particular, it is different from the SPARQL Update syntax. But apparently, it doesn’t matter to some folks that the semantics are not the same.</p>
<p>I have given my list of requirements many, many times on the LDP mailing list (it is not only mine: those requirements are of course shared by others), but somehow they were never really challenged, and the arguments about syntax keep coming back. So, for the record, here <span id="requirements">they</span> are:</p>
<ul>
<li>the context is Linked Data, and especially the Linked Data Platform</li>
<li>bare minimum for LDP Resource diff, that is, no high-level features</li>
<li>support for blank nodes, but pathological graphs are ok</li>
<li>no skolemization</li>
<li>first-class citizen <code>rdf:list</code> manipulations</li>
<li>reasonable runtime complexity</li>
<li>easy to implement <strong>without</strong> the need for an existing SPARQL Update implementation</li>
<li>not being able to bind a node is a failure</li>
<li>being a reasonable alternative for the <a href="https://web-payments.org/specs/source/identity-credentials/#h2_accessing-the-identity">JSON-LD folks using JSON Patch</a>, because they have nothing better</li>
</ul>
<p>If you want to make counter proposals, please make sure that those requirements are addressed. Also, you should accept the fact that if you have a different set of requirements, then LD Patch is probably not what you want. Finally, if you think that the above requirements are <strong>wrong in the context of LDP</strong>, then you should make an official complaint to the group explaining your reasoning.</p>
<p>I would like to emphasize that <strong>relying on an existing syntax (such as SPARQL) was never a requirement for me</strong>. While reusing bits of SPARQL Update in LD Patch whenever it makes sense is reasonable, it should be done sparingly. For example, <a href="http://lists.w3.org/Archives/Public/public-ldp-wg/2014Jul/thread.html#msg81">I argued on the LDP mailing list</a> that shared syntax with different (runtime) semantics could break some user expectations.</p>
<h2 id="faq">Frequently Asked Questions</h2>
<p><span id="dbooth-questions">Thanks to <a href="http://dbooth.org/">David Booth</a></span> for <a href="http://lists.w3.org/Archives/Public/public-ldp/2014Sep/0014.html">providing me with well-formulated questions and concerns</a>. Here are some answers; they merely complement the arguments made in the other sections of this post.</p>
<p><em>Are there any concerns about inventing a new syntax?</em> What if SPARQL, or a profile of it, could <strong>not</strong> address <a href="#requirements">all the requirements</a>? What if a subset of the syntax was no longer aligned with the superset semantics?</p>
<p><em>Isn’t this yet another syntax similar to SPARQL, which ends up confusing newcomers?</em> Of course they are similar: exactly 68% of the grammar rules for LD Patch come directly from Turtle, and SPARQL made a similar choice.</p>
<p><em>Would using a single language decrease development and maintenance costs?</em> I would like to see actual evidence of that claim. Some people actually have <a href="http://martinfowler.com/bliki/OneLanguage.html">a more nuanced opinion</a> on that subject, and I tend to agree, as I find myself using whichever language or framework is best suited to a given job.</p>
<p><em>Can implementers simply plug in an existing general-purpose SPARQL engine to get a new system up and running quickly?</em> Not so easy. You still need to reject the valid SPARQL Update queries that are not valid LD Patch queries. And you can be sure that the test suite will include tests for exactly that :-) Also, because I have done it, I can claim that unlike full SPARQL, LD Patch is quick and easy to implement.</p>
<p><em>Would implementers have the option of supporting additional SPARQL 1.1 Update operations?</em> There is definitely a use-case for querying data in LDP Containers using SPARQL, or using a <a href="http://blog.pellucid.com/post/95282190715/exposing-resources-in-datomic-using-linked-data">more ad-hoc query language with support for ordering, filtering, and aggregation</a>. And it is true that bulk updating could be addressed with SPARQL Update. But those use-cases are different from PATCH.</p>
<h2 id="next-steps">Next steps</h2>
<p>The First Public Working Draft just got published. As expected, the document is getting reviewed by experts, who have already started to provide feedback to the group.</p>
<p>In the meantime, the editors are working on completing the semantics section of the document. A proposed approach was to provide a translation from LD Patch to SPARQL Update. While this is definitely useful for people with a SPARQL background, <a href="#confusing-semantics">this cannot be used as a formal semantics</a>. We are trying to find a good trade-off between the usual tooling from formal semantics theory, and something that could be read by people without such a theoretical background.</p>
<p>And finally, after the specification gets completed, we will focus on providing a test suite. The plan is to make it part of the <a href="https://github.com/w3c/ldp-testsuite">LDP one</a>.</p>
<p>That’s all, folks! (and thanks to Andrei for reviewing drafts of this post)</p>
<h2 id="finally-my-own-blog">Finally my own blog (Alexandre Bertails, 2014-09-16)</h2>
<p>I have finally found the time and the motivation to put together my own blog \o/ I had actually planned to do so for about 10 years, basically ever since I have owned <code>bertails.org</code>… It is not completely ready yet, but I prefer to release it now and work out the issues later. Otherwise it would never happen.</p>
<p>Sooo, how does this work? I wanted something as easy to use as possible, so I have settled on <a href="http://docs.getpelican.com/">Pelican</a>, at least for now. As I don’t want to pollute my environment with Python dependencies, I am using <a href="https://www.docker.com/">Docker</a> to generate the static version of this website. The <a href="https://github.com/betehess/my-pelican/">project</a> started as a clone of <a href="https://github.com/jderuere/docker-pelican">https://github.com/jderuere/docker-pelican</a>, but I quickly rewrote everything, including the <a href="https://github.com/betehess/my-pelican/blob/master/Dockerfile">Dockerfile</a>. I run Pelican within the container but against the mounted <code>website</code> directory, and I propagate my user from the host to avoid permission issues (Docker runs as root by default). So I do something like</p>
<pre><code class="language-bash">docker run --name=pelican -d -v `pwd`/website:/srv/pelican-website \
-p 8000:8000 betehess/pelican
</code></pre>
<p>The theme is directly based on <a href="http://paulrouget.com/">Paul Rouget’s website</a>, with a few adaptations. The most important ones are the fixed-width font (<a href="https://www.google.com/fonts/specimen/Ubuntu+Mono">Ubuntu Mono</a>) and the greenish colour for the links. I only use 2 templates from Pelican: <code>index.html</code> and <code>article.html</code>. The blog now becomes the main entry point for <a href="http://bertails.org">http://bertails.org</a>. The previous index page has moved to <a href="http://bertails.org/alex">http://bertails.org/alex</a> as I intend to use <a href="http://bertails.org/alex#me">http://bertails.org/alex#me</a> as my <a href="http://www.w3.org/wiki/WebID">WebID</a>.</p>
<p>My mugshot <a href="https://www.flickr.com/photos/amyvdh/5837280596/">was taken by</a> my friend and ex-<a href="http://www.w3.org">W3C</a> colleague <a href="https://twitter.com/amyvdh">Amy van der Hiel</a>. There are very few pictures of me on the Web :-) I cannot remember where the font icons come from though :-/</p>
<p>There are no comments at the moment. The reason is that I couldn’t find anything that I liked. I have the markup and the CSS ready though. So it should land here in just a few weeks.</p>
<p>What will you find on this blog? Mainly articles about <a href="http://www.scala-lang.org/">Scala</a> and <a href="http://en.wikipedia.org/wiki/Linked_data">Linked Data</a>. I will maintain RSS feeds for those subjects when the time comes.</p>
<p>Stay tuned!</p>