Hola 👋🏽
In the past 5 years, I've worked at and with 4 companies that relied heavily on metadata.
Only one of them had acknowledged it, and they had efforts in place to leverage their humongous knowledge graph. Another was a startup where we saw the need for it ourselves. The other 2 hadn't really gotten around to thinking about it yet, but I helped raise awareness and eventually something got done about it.
I keep seeing companies struggle with data modeling because it is delegated to developers who don't understand the domain, and it kinda bugs me because we can do better.
It also bugs me that the go-to tool for domain modeling is SQL. SQL databases can be notoriously stiff to evolve. The more constraints you add to your data, the harder it gets to change it.
They are also notoriously hard to scale. But there is just so much written about them and so many frameworks built to work with them that it's hard to blame anyone for thinking that This Is The Way.
I think domain knowledge has never been more relevant, and for startups especially, it changes at a ridiculously fast pace.
When we started Doorling, a real-estate intelligence startup, we had an idea of what a Home was. Turned out that what we thought was a Home was really more of a "Raw WebScraped Home". In fact, most of the places we got data from had different views of what a Home was, so we needed to be able to see a Home from the perspective of several scrapers, realtor listings, and websites, including our own "canonical" view of a Home.
And we had to evolve our schemas too! Hopefully without breaking any of our existing data, since all of our ML pipelines relied on those schemas. And the ML models were where the money came from, so we had to be careful not to break them in the process.
When you're running lean, you can't afford to do that big a refactor every 2 weeks, regardless of the tech you use (yes, we used OCaml + Erlang, so we were very well prepared for safe, large refactors), because refactors still take time! Time you should spend on product instead.
So we wrote our schemas in RDF/Turtle, and did some code generation to keep different parts of the system in sync. Our metadata database was little more than a K/V store running on Erlang's mnesia. Instead of writing down entire rows in tables, we wrote down tiny facts about entities.
A Fact is roughly a quintuple that says "who" said "what" (a field and its value) about "whom", and "when".
In other words (not actual code):
let fact = {
  source: "doorling:scraper:hem-world-10211";
  entity: "doorling:home:10000000001";
  field: "doorling:home:price";
  value: "1.2MSEK";
  stated_at: <some timestamp>
}
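To make the "several perspectives on a Home" point a bit more concrete, here's a tiny OCaml sketch (made-up names, not our actual code): once everything is a fact, a view of a Home from one source's perspective is just the subset of facts that source stated about it.

(* Hedged illustration only: the record layout and the helpers
   (facts_about, view_from) are invented for this sketch. *)
type fact = {
  source : string;     (* who stated it, e.g. a scraper id          *)
  entity : string;     (* whom it is about, e.g. a home id          *)
  field : string;      (* which property, e.g. doorling:home:price  *)
  value : string;      (* the stated value                          *)
  stated_at : float;   (* when it was stated, as a unix timestamp   *)
}

(* every fact anyone has stated about one entity *)
let facts_about ~entity facts =
  List.filter (fun f -> f.entity = entity) facts

(* the same entity, seen only through what a single source said about it *)
let view_from ~source ~entity facts =
  facts_about ~entity facts |> List.filter (fun f -> f.source = source)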
Accidental complexity aside (mnesia, ttl files, codegen), this was a pretty sweet setup. We could evolve our schemas after a quick chat with our PO, tag new fields as experimental, and keep building.
Another company I worked with had a similar need: integrating data from a large number of sources, in a large number of formats, was a prerequisite. We opted for something lighter weight there, but at the end of the day it had the same benefits.
Add a message queue with updates for every newly stated fact and you've got yourself a framework for real-time semantic data pipelines. Consume those updates and you can build features like "search anything" in a day.
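Something like this, in a hand-wavy OCaml sketch (the handler and the index are invented, and the queue plumbing is left out entirely; imagine on_new_fact being subscribed to fact updates):

(* naive inverted index: token -> ids of entities whose facts mention it *)
let index : (string, string list) Hashtbl.t = Hashtbl.create 1024

let tokenize s = String.split_on_char ' ' (String.lowercase_ascii s)

(* called for every newly stated fact that comes off the queue *)
let on_new_fact ~entity ~value =
  List.iter
    (fun token ->
      let ids = Option.value (Hashtbl.find_opt index token) ~default:[] in
      Hashtbl.replace index token (entity :: ids))
    (tokenize value)

(* "search anything": entities that have a fact mentioning the query token *)
let search query =
  Option.value (Hashtbl.find_opt index (String.lowercase_ascii query)) ~default:[]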
In fact, I'd argue that all no-code tools would benefit massively from a Metadata Store to allow people to model whatever the hell they need in a way that's interoperable with other no-code tools.
Anyway, I'm not here to convince you SQL is bad. It isn't. It's great, and you should use whatever you're productive in.
I'm just wondering what it'd take to get this tech to a mass-adoption stage. I think tooling is a good step forward.
I can imagine other people on the verge of adopting RDF saying "yeah, no" when they see this:
### https://abstractmachines.dev/wittgenstein/Dota2#Attribute
:Attribute rdf:type owl:Class ;
    rdfs:subClassOf :DotaThing .

### https://abstractmachines.dev/wittgenstein/Dota2#hasAffinityWith
:hasAffinityWith rdf:type owl:ObjectProperty ;
    rdfs:subPropertyOf :DotaObjectProperty ;
    rdf:type owl:SymmetricProperty ;
    rdfs:domain :Attribute ;
    rdfs:range :Attribute .

### https://abstractmachines.dev/wittgenstein/Dota2#MovementSpeed
:MovementSpeed rdf:type owl:NamedIndividual, :Attribute ;
    :hasAffinityWith :LinkBreakDistance .
Writing Turtle doesn't come very naturally to me, to be honest. Once I get into the mental model I can sort of roll along, but it has rather strange aesthetics to my eyes.
JSON-LD doesn't cut it either, as we end up with more and more structural noise the more complex our constraints become.
So my wish for making Semantic Data more accessible is a rather small and pretty language that can be compiled to Turtle for backward interop. (And generated from Turtle too).
Here's a sketch of a language I'm calling Datalang (how original):
# Kinds define new owl:Class'es
kind Venue
  @en "Location of the bootcamp"

# Attributes define owl:DatatypeProperty's
attr Rooms in Venue
  range: integer
  cardinality: 1 or more

kind Participant
  @en "A student at a collaborating university"

# Links define owl:ObjectProperty's, or relations between kinds
link HasMentor
  @en "all participants have a mentor, there is one mentor per 6 participants"
  from: Participant
  to: Mentor

# And you can use anything before it has been defined!
kind Mentor
  @en "Support for the teams"
Anyway, that's just a sketch. I'd like to see a language like this that can make no-code tools easier for domain experts without sacrificing the great properties that Semantic Web technology has.
There are plenty of tools out there for doing code generation based on schemas: Avro, Protobuf, ATD. But I haven't found one that is aimed at graph-like data with a heavy semantic load.
If I could grab some of the above Datalang and spit out abstract datatypes with idiomatic APIs for different languages, plus a common serialization/deserialization format, then I could literally ask a PO to make a change to the schemas in GitHub and let CI guide them on how to keep the changes backward compatible, or follow whatever evolution rules are preferred.
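To make that concrete, here's a hypothetical slice of what such a codegen pass could emit in OCaml for the Datalang sketch above. Nothing like this exists yet; the field names and the id fields are my own invention.

(* generated from: kind Venue / attr Rooms / link HasMentor / kind Mentor *)
type mentor = {
  id : string;
}

type participant = {
  id : string;
  has_mentor : mentor;   (* link HasMentor, from Participant to Mentor *)
}

type venue = {
  id : string;
  rooms : int list;      (* attr Rooms, range integer, cardinality 1 or more *)
}

(* the generator could also enforce cardinality rules at construction time *)
let make_venue ~id ~rooms =
  if rooms = [] then Error "Venue.rooms: expected 1 or more values"
  else Ok { id; rooms }

A CI job could then diff these generated types against the previous version and flag anything that breaks backward compatibility.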
I've worked with and written Protobuf, GraphQL, Erlang, OCaml, and JavaScript codegen from Turtle files before, but there isn't a cohesive vision for how all of this should work together.
Not yet at least!
I'd also like to talk about other gaps in the tooling, like a semantic data platform where you just drop your schemas and hook up some consumers to build business features on, but this issue would get too long otherwise.
There's also the whole can of worms of data modeling being hard. We'll leave that for future issues too.
Let me know: why do you think we don't do more Semantic Data? And what's stopping you from getting into it? I'd like to know more 🙌🏽
Hope you're having a good morning and will see you next week!
👋🏽
/ Leandro