
Beyond Alphabet Soup: 5 Guidelines For Data Sharing

We know (intellectually) not to rely on magical solutions to drive our work. In practice, however, we sometimes fall into the trap of unwitting, magical assumptions. The reality is that underlying any amazing feat we might accomplish, you can bet on solid infrastructure, process, and groundwork to account for it. Andy Isaacson, Forward Deployed Engineer at Palantir Technologies, takes us there – to the heart of responsibly making data usable …and useful for people.

(Note: Please see the accompanying reference document: OPEN DATA, DONE RIGHT: FIVE GUIDELINES – available for download and for you to add your own thoughts and comments.)

The Batcomputer was ingenious. In the 1960s Batman television series, the machine took any input, digested it instantly, and automagically spat out a profound insight or prescient answer – always in the nick of time (watch what happens when Batman feeds it alphabet soup). Sadly, of course, it was fictional. So why do we still cling to the notion that we can feed in just any kind of data and expect revelatory output? As the saying goes, garbage in yields garbage out; if we want quality results, we need to begin with high-quality input. Open Data initiatives promise just such a rich foundation.

Presented with a thorny problem, any single data source is a great start – it gives you one facet of the challenge ahead. However, to paint a rich analytical picture with data, to solve a truly testing problem, you need as many other facets as you can muster. You can often get these by taking openly available data sets and integrating them with your original source. This is why the Open Data movement is so exciting. It fills in the blanks that lead us to critical insights: informing disaster relief efforts with up-to-the-minute weather data, augmenting agricultural surveys with soil sample data, or predicting the best locations for Internally Displaced Persons camps using rainfall data.

High quality, freely available data means hackers everywhere, from Haiti to Hurricane Sandy, are now building the kinds of analytical tools we need to solve the world’s hardest problems. But great tools and widely released data aren’t the end of the story.

At Palantir, we believe that with great data comes great responsibility, both to make the information usable, and also to protect the privacy and civil liberties of the people involved. Too often, we are confronted with data that’s been released in a haphazard way, making it nearly impossible to work with. Thankfully, I’ve got one of the best engineering teams in the world backing me up – there’s almost nothing we can’t handle. But Palantir engineers are data integration and analysis pros – and Open Data isn’t about catering to us.

It is, or should be, about the democratization of data, allowing anybody on the web to extract, synthesize, and build from raw materials – and effect change. In a recent talk at a G-8 Summit on Open Data for Agriculture, I outlined the ways we can help make this happen:

#1 – Release structured raw data others can use
#2 – Make your data machine-readable
#3 – Make your data human-readable
#4 – Use an open-data format
#5 – Release responsibly and plan ahead

Abbreviated explanations below. Download the full version here: OPEN DATA, DONE RIGHT: FIVE GUIDELINES

#1 – Release structured raw data others can use

One of the most productive side effects of data collection is being able to re-purpose a set collected for one goal and use it towards a new end. This solution-focused effort is at the heart of Open Data. One person solves one problem; someone else takes the exact same dataset and re-aggregates, re-correlates, and remixes it into novel and more powerful work. When data is captured thoroughly and published well, it can be used and re-used in the future too; it will have staying power.

Release data in a raw, structured way – think a table of individual values rather than words – to enable its best use and re-use.
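
As a rough illustration, here is a minimal sketch of what “a table of individual values” can look like when written out as CSV. The dataset, column names, and values below are entirely hypothetical:

```python
import csv

# Hypothetical survey results: one row per observation, one column per value,
# rather than a prose summary like "Yields in District A averaged 2.1 t/ha in 2012."
rows = [
    {"district": "A", "year": 2012, "crop": "maize", "yield_t_per_ha": 2.1},
    {"district": "B", "year": 2012, "crop": "maize", "yield_t_per_ha": 1.7},
]

with open("crop_yields.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["district", "year", "crop", "yield_t_per_ha"])
    writer.writeheader()
    writer.writerows(rows)
```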

#2 – Make your data machine-readable

Once structured raw data is integrated into an analysis tool (like one of the Palantir platforms), a machine needs to know how to pick apart the individual pieces.

Even if the data is structured and machine-readable, building tools to extract the relevant bits takes time, so another aspect of this rule is that a dataset’s structure should be consistent from one release to the next. Unless there’s a really good reason to change it, next month’s data should be in the exact same format as this month’s, so that the same extraction tools can be used again and again.

Use machine-readable, structured formats like CSV, XML, or JSON to allow the computer to easily parse the structure of data, now and in the future.
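
To make that concrete, here is a minimal sketch of an extraction script that keeps working from one release to the next only because the column layout stays the same. The file name and columns are hypothetical, continuing the earlier sketch:

```python
import csv

# The schema every release is expected to follow. If a publisher silently
# reorders or renames columns, this check fails loudly instead of producing
# silently wrong numbers.
EXPECTED_COLUMNS = ["district", "year", "crop", "yield_t_per_ha"]

def load_release(path):
    """Parse one release of the (hypothetical) crop-yield CSV."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError(f"Schema changed: {reader.fieldnames}")
        return [
            {**row, "year": int(row["year"]), "yield_t_per_ha": float(row["yield_t_per_ha"])}
            for row in reader
        ]

# The same function handles this month's file and next month's, unchanged.
data = load_release("crop_yields_2013_05.csv")
```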

#3 – Make your data human-readable

Now that the data can be fed into an analysis tool, it is vital for humans, as well as machines, to understand what it actually means. This is where PDFs finally come in handy. They are an awful format for the data release itself, since they can baffle automatic extraction programs, but as accompanying documentation they can explain the data clearly to the people using it.

Assume nothing – document and explain your data as if the reader has no context.
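
One lightweight way to do this is to ship a plain-text data dictionary alongside the data itself. The snippet below is only a sketch, reusing the hypothetical columns from earlier:

```python
# Hypothetical column descriptions, published next to the CSV so a reader
# with no context can still understand every field.
DATA_DICTIONARY = {
    "district": "Administrative district where the sample was taken (letter code).",
    "year": "Calendar year of the harvest being reported.",
    "crop": "Crop name in English, lower case (for example, 'maize').",
    "yield_t_per_ha": "Measured yield, in metric tonnes per hectare.",
}

with open("README.txt", "w") as f:
    f.write("Crop yield survey - column descriptions\n\n")
    for column, description in DATA_DICTIONARY.items():
        f.write(f"{column}: {description}\n")
```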

#4 – Use an open-data format

Proprietary data formats are fine for internal use, but don’t force them on the world. Prefer CSV files to Excel, KMLs to SHPs, and XML or JSON to database dumps. It might sound overly simplistic, but you never know what programming ecosystem your data consumers will favor, so plainness and openness are key.

Choose to make data as simple and available as possible: when releasing it to the world, use an open data format.
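
For tabular data, the conversion to an open format is usually the easy part. Here is a hedged sketch using pandas; the file names are hypothetical, and reading .xlsx files also needs the openpyxl package installed:

```python
import pandas as pd

# Read the internal, proprietary-format workbook...
df = pd.read_excel("internal_survey.xlsx", sheet_name=0)

# ...and release it in open formats that any programming ecosystem can consume.
df.to_csv("survey_release.csv", index=False)
df.to_json("survey_release.json", orient="records")
```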

#5 – Release responsibly and plan ahead

Now that the data is structured, documented, and open, it needs to be released to the world. Simply posting files on a website is a good start, but we can do better – for example, by offering a REST API.

Measures that protect privacy and civil liberties are hugely important in any release of data. Beyond simply keeping things up-to-date, programmatic API access to your data allows you to go to the next level of data responsibility. By knowing who is requesting the data, you can implement audit logging and access controls, understanding what was accessed when and by whom, and limiting exposure of any possibly sensitive information to just the select few who need to see it.

Allow API access to data to responsibly provide consumers with the latest information – perpetually.
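
As an illustration only – the endpoint, keys, and data below are made up, and a real deployment would use proper credential management rather than a hard-coded dictionary – a minimal Flask sketch of API access with an access-control check and an audit log might look like this:

```python
import logging
from flask import Flask, abort, jsonify, request

app = Flask(__name__)
logging.basicConfig(filename="access_audit.log", level=logging.INFO)

# Hypothetical API keys mapped to named consumers.
API_KEYS = {"abc123": "relief-ngo", "def456": "university-lab"}

# Hypothetical dataset, reusing the earlier crop-yield sketch.
DATASET = [{"district": "A", "year": 2012, "crop": "maize", "yield_t_per_ha": 2.1}]

@app.route("/data")
def get_data():
    consumer = API_KEYS.get(request.headers.get("X-API-Key", ""))
    if consumer is None:
        abort(401)  # access control: unknown callers get nothing
    # Audit logging: record who accessed what, and when.
    logging.info("consumer=%s path=%s", consumer, request.path)
    return jsonify(DATASET)

if __name__ == "__main__":
    app.run()
```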

These guidelines seem simple, almost too simple. You might wonder why, in this high-tech world, we need to keep things so basic when we have an abundance of technological solutions to overcome data complexity.

Sure, it’s all theoretically possible. However, in practice, anybody working with these technologies knows that they can be brittle, inaccurate, and labor intensive. Batman’s engineers can pull off extracting data from pasta, but for the rest of us, relying on heroic efforts means a massive, unnecessary time commitment – time taken away from achieving the fundamental goal: rapid, actionable insight to solve the problem.

There’s no magic wand here, but there are some simple steps to make sure we can share data easily, safely and effectively. As a community of data consumers and providers, together we can make the decisions that will make Open Data work.