Why Data Needs a Standard

tl;dr

This post attempts to explain why data standardization 1) is necessary, 2) is possible, and 3) can be implemented in a relatively straightforward way via open source tools.

A Short Story

In 2016, I sat down for lunch (at the exquisite Falafel King, no less) with a friend, a Putnam Economist/Analyst. As we ate, my friend pitched me on a data idea he’d been thinking about. Admittedly, at first I half rolled my eyes, thinking, great, another startup idea from a non-tech guy to a tech guy. But at the end of the conversation, I was totally hooked.

Falafel King on Summer and Otis Street in Boston.

As he put it, he spent about 95% of his time gathering data from various sources (political, demographic, economic, etc.), and about 5% of his time actually running the data. “Running the data is a solved problem. What isn’t solved is getting it easily from one format into another.”

The table where the conversation took place.

Long story short, over the next couple years we spent a good deal of time discussing what a potential data standards solution might look like, how to possibly build such a platform, and which early audiences to swing for.

The Problem

Let’s say for a moment that—whatever your line of work—you have a bunch of internal data that’s just fantastic. You’ve worked hard to acquire it over the years, or you’ve generated it somehow. Doesn’t really matter how, but you’ve got it. And it’s neat. And tidy. And it fits your system perfectly.

Good data. Hey! There's a Tetris row!

Then the day comes when you have to go get data somewhere else. Sources run the gamut. Wikipedia. Some random hobbyist’s website. A third-party API with a different schema than your data. External data rarely just fits the mold.

Data in the wild is, in essence:

  • Inconsistent
  • Inaccurate
  • Incomplete
  • Undocumented, and
  • Unversioned

Can't win Tetris with bad data.

And machines are terrible at fixing this. It’s a costly problem, and it takes real engineering overhead. Sure, computers do exactly what you program them to do, but arbitrarily matching data column A to data column X requires deep human contextualization. Machine learning is barely up to the task, if at all. There is a certain kind of cognitive work that humans are simply much, much faster at.
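
To make that concrete, here’s a quick throwaway sketch in plain Python (the column names are invented for illustration) of what naive, machine-driven column matching looks like, and why it only gets you part of the way:

```python
# A rough sketch (plain Python; column names invented for illustration) of
# naive, machine-driven column matching based on name similarity.
from difflib import SequenceMatcher

internal_columns = ["state_fips", "median_income_usd", "population_2020"]
external_columns = ["FIPS Code", "Income (median, $)", "Population"]

def similarity(a: str, b: str) -> float:
    """Crude string similarity between two column names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pair each internal column with its best-scoring external column.
for ours in internal_columns:
    best = max(external_columns, key=lambda theirs: similarity(ours, theirs))
    print(f"{ours!r} -> {best!r} (score {similarity(ours, best):.2f})")

# The top-scoring pairs happen to be the right ones here, but only because
# the names share a few characters. Nothing in a column name tells a machine
# the units, the vintage (population as of when?), or the encoding, and a
# slightly less friendly source ("ST", "HHInc", "POP_EST") breaks it entirely.
```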

In short, it’s hard to find and maintain good data.

A Tried (and Tired) Solution

Originally, our solution centered on building a platform to organize and share the world’s data, a sort of social network for searchable, shareable data (think “GitHub for data”). And all the bells and whistles around that. Data entry. API access. Documentation & metadata. Commenting. Rating. Private and public datasets. Versioning. Schemas. Inheritance. Formulas. And a focus on as close to perfect UI/UX as possible.

Basically, such a platform would make data:

  • Accurate
  • Complete
  • Consistent
  • Documented
  • Transferable, and
  • Versioned

The basic vision and value proposition: focus on building a vibrant community of open data contributors and a thriving marketplace of open source and retail data.

Just one problem: if you dig just below the surface, you’ll find a myriad of somewhat-known companies (and a few very well known ones) trying to fulfill just this vision, and failing. This post isn’t meant to cover all the existing data platforms in this space, but I’ll name a few in passing to note the diversity of efforts and, in my view, missing-the-target attempts.

At one end of the spectrum, on the more open side, is RapidAPI, which raised $25M in Series B funding this year and basically helps users “find and connect to thousands of APIs.” It’s a great service, and perhaps closer to the target than most, but there’s a missing secret sauce (hint: it’s detailed briefly below). At the other end of the spectrum are highly secretive data-swamp-collecting companies like Palantir, where the data is only useful to high-paying clients with (sometimes controversial) ambitions. Then there are the sort-of “usable” middleware companies, such as Google, whose stated mission is to organize the world’s information. But that usability is hindered by Google’s own inability to execute against that mission, because it tries to control the data flow. A topic for another day.

Data standardization companies.

Even internet-father Tim Berners-Lee, whose ambitions appear every bit pure, has been trying (unsuccessfully, in my book) to reinvent the internet as not just an information highway, but also a data highway. (I’ll speak briefly to this below.)

All of this to say: the internet wants very, very badly to standardize data, but since data is a new gold rush, incentives are mostly misaligned with opening it up.

Tim Berners-Lee is Both Right and Wrong

So, after several years of tinkering and toying with various iterations (I did at one point attempt to build a POC website), the idea somehow wormed its way into my brain (and I think correctly) that the first incarnation of a thriving data marketplace needs to be 100% open source.

I spoke of Berners-Lee earlier. This is his vision. He believes in a data highway. And he’s right. But also wrong.

I often frame both the problem and the solution in the context of his internet. In essence, we’ve built an information superhighway, but not (in the same fluid paradigm) a data highway. Let me emphasize that, because it’s critical. The internet is not a data highway. Some of you will balk at this, but let me illustrate what Berners-Lee (and I do think very fondly of him) has wrong. Even though he has been beating down this door with a Linked Data initiative since at least 2006, I don’t think he’s ever actually framed a correct solution.

But just so you know I have respect for him, here’s Tim Berners-Lee at Gov 2.0 Expo 2010 eating your lunch.

Don't mess with the Berners-Lee.

The information highway is not an analogue. It’s just a different beast altogether. Humans store information in their brains vastly differently than machines store contextualized data. This isn’t to say the internet hasn’t become a transport service for APIs. It most certainly has, and JSON seems to be the winning convention in most cases. But Linked JSON doesn’t make any sense. Linking one piece of data to another doesn’t make a highway. The information highway has been built for humans to easily parse. Hypertext, however, is just completely the wrong language for data, and the wrong metaphor.
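
To make the distinction concrete, here’s a rough sketch, using Python dicts with invented field names, URIs, and placeholder values, of the same record as a typical flat API payload and as linked data in the JSON-LD spirit. The links add identity; they don’t add a shared schema:

```python
# The same record, twice. Field names, URIs, and values are all invented
# placeholders for illustration.

# What most APIs actually ship: flat JSON, meaningful only to a human
# who has read the (often nonexistent) docs.
api_payload = {
    "city": "Boston",
    "pop": 650000,      # as of when? census count or estimate?
    "income": 75000,    # dollars? household or per capita? unstated
}

# The Linked Data framing of the same record: the entity and every key
# point at a URI (roughly the JSON-LD shape).
linked_payload = {
    "@context": {
        "city": "https://example.org/vocab/cityName",
        "pop": "https://example.org/vocab/population",
        "income": "https://example.org/vocab/medianHouseholdIncome",
    },
    "@id": "https://example.org/entity/boston",
    "city": "Boston",
    "pop": 650000,
    "income": 75000,
}

# The links make the record globally identifiable and citable. They don't
# tell a consuming system how "pop" maps onto its own population_2020
# column, which vintage the count is, or whether the income figure is
# comparable to its own. The hard part, agreeing on a shared, versioned,
# documented schema, is still left to humans.
```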

The Right Solution: Open Source

All that said, I think the time is probably ripe for an open source service to make data standardization and an API-driven data-matrix highway happen.

To that end, I’ve started tinkering with something I’m calling synq (name subject to change). It’s a (big caveat!) work-in-progress library, not ready for any serious use at all, and only in the proof-of-concept stage. I plan on writing a short white paper (or something to that effect) to walk through the functionality.

But the main point is to start open-sourcing not only the code, but also the knowledge sharing. Something like this needs to exist in the public domain. I don’t think it’s actually possible to do without a fundamentally 100% open source CLI, free to build a market around. Sort of like what Git did for the software market.
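
To give a flavor of the direction, and with the loud caveat that this is purely hypothetical and not the actual synq interface (which, again, is barely a proof of concept), here’s a sketch of the core operation such an open source library would make cheap and repeatable: a human declares the mapping to a standard schema once, and the tool applies it every time after:

```python
# Purely hypothetical sketch (not the actual synq interface, which is still
# a proof of concept) of the kind of operation an open source
# standardization library might perform: a human writes the mapping to a
# published, versioned standard once; the tool applies it mechanically.

# Human-authored mapping from an external source's columns to a standard schema.
MAPPING = {
    "standard": "city-demographics",   # hypothetical schema name
    "version": "1.0.0",
    "columns": {
        "FIPS Code": "state_fips",
        "Population": "population",
        "Income (median, $)": "median_income_usd",
    },
}

def standardize(row: dict, mapping: dict) -> dict:
    """Rename an external row's keys to the standard schema's column names."""
    out = {mapping["columns"][key]: value
           for key, value in row.items() if key in mapping["columns"]}
    out["_schema"] = f'{mapping["standard"]}@{mapping["version"]}'
    return out

external_row = {"FIPS Code": "25", "Population": 650000, "Income (median, $)": 75000}
print(standardize(external_row, MAPPING))
# {'state_fips': '25', 'population': 650000,
#  'median_income_usd': 75000, '_schema': 'city-demographics@1.0.0'}
```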

This thing I’ve been building and tinkering with in my brain for the last four years, I’ve finally had enough “Eureka!” to start evangelizing it in the engineering world. So while I’ve maybe been a little too short on detail in this article, I plan on laying out, step by step, what I think it looks like, and solicit feedback from others in the community who are smarter to get it right. But just laying the foundation for now.