Rebuilding the Data Process

As data proliferates and organizations expand programs to use data to answer questions, one major obstacle remains: data is too expensive, with data projects routinely spending 80% of their budget on acquisition and preparation. For budget-constrained organizations in government, nonprofits, and journalism, the cost of acquiring data often prevents them from ever analyzing it, so reducing the cost of data is a priority for expanding its use.

In this paper, we’ll present a vision for how civic and social users of research data should be able to work with it, a vision that motivates the solution we’ll present in part 2.

Every data research project is unique, but most follow a similar series of steps. After the scope of the project has been established and the project managers understand what data they will need, a typical project proceeds through:

  • Acquisition. The first step is locating the required data and securing permission to use it.
  • Preparation. The most expensive step, preparation involves converting formats, cleaning bad records, and recoding data values (a brief sketch of this step follows the list).
  • Linking (sometimes). Datasets are most interesting after they’ve been linked to other data, such as Census demographics.
  • Analysis. In the analysis phase, data professionals create statistical analyses, predictions, charts and graphs, or other data products.
  • Reporting. The final step requires packaging the technical output of the analysis phase in a way that can be shared with, and understood by, others.

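To make the preparation step concrete, here is a minimal sketch of the kind of work it typically involves, written in Python with pandas. The file name, column names, and cleaning rules are illustrative assumptions, not part of any particular project.

```python
import pandas as pd

# Hypothetical raw export; the file name and column names are assumptions
# used only to illustrate the kind of work preparation involves.
df = pd.read_csv("trash_collection_raw.csv")

# Normalize column names left over from the source system.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Convert data values: parse dates and coerce tonnage to numbers,
# turning unparseable entries into missing values.
df["collection_date"] = pd.to_datetime(df["collection_date"], errors="coerce")
df["tons_collected"] = pd.to_numeric(df["tons_collected"], errors="coerce")

# Clean bad records: drop rows with no date or with negative tonnage.
df = df[df["collection_date"].notna() & (df["tons_collected"] >= 0)]

# Convert formats: write an analysis-ready file (Parquet output requires pyarrow).
df.to_parquet("trash_collection_clean.parquet", index=False)
```

Multiply this by dozens of source files, each with its own quirks, and the 80% figure above becomes easy to believe.
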
The last two steps, Analysis and Reporting, usually require extensive skills, and although some types of projects, like indicator websites, can make these steps accessible to less specialized users, it is the first three steps that carry the highest cost and are most susceptible to commoditization. In the current world, these first three steps can consume 80% of a data project’s budget. In a perfect world, all of the data that anyone wants to use for reporting and analysis would be easily available, already cleaned, linked to other datasets, and ready for use. That’s precisely the world that we’ll explain how to achieve.

What would that world look like for data users? Consider the typical use cases for a few common types of data user:

  • The Professional Statistician. Professionals such as biostatisticians, epidemiologists, and data scientists, who need access to large quantities of disaggregated data for statistical or predictive analysis.
  • The Casual Analyst. Technically minded workers, often without formal statistical training, who produce charts, graphs and tables for decision makers, and occasionally perform simple analyses like smoothing or trend lines.
  • Website and Dashboard Developers. High-skill technologists, usually without formal statistical training, who create data products for others, such as indicator websites and business intelligence dashboards. They acquire and clean data, and build tools that others can use to perform simple analyses.

Here are a few scenarios illustrating how these users should be interacting with data.

A casual analyst is looking for data about how much trash his city generates each year. A web search produces a few hits, one of which is a data repository with data from the local sanitation department. The repository lists the file he needs, in a variety of formats. He downloads an Excel file and produces a chart.
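That workflow could be as simple as the following sketch, assuming a hypothetical Excel export from the repository (reading .xlsx files with pandas requires openpyxl, and plotting requires matplotlib):

```python
import pandas as pd

# Hypothetical file and column names for the sanitation data in this scenario.
waste = pd.read_excel("citywide_trash_tonnage.xlsx")

# One chart, a few minutes of work, because acquisition and preparation
# were already done upstream.
ax = waste.plot(x="year", y="tons_collected", kind="bar", legend=False)
ax.set_ylabel("Tons of trash collected")
ax.figure.savefig("trash_by_year.png")
```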

A professional analyst is analyzing trends in school test scores. She performs a web search for school test score data for her state, and the first three hits are relevant. The first hit takes her to a web page with an “Install” button; clicking the button installs the data to a database on her desktop computer. The web page also lists other datasets that are connected to the school test scores, including one that is a crosswalk between schools, school enrollment areas, and census tracts. With two more clicks and installs, she now has a database with test scores linked to census areas, and she can open those datasets in her favorite statistical tools, such as Stata or R, and begin her analysis.
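The linking step in this scenario might look like the following sketch, using hypothetical table and column names; she could do the equivalent join in Stata or R just as easily.

```python
import pandas as pd

# Hypothetical, already-installed tables; names and columns are assumptions.
scores = pd.read_csv("school_test_scores.csv")          # one row per school per year
crosswalk = pd.read_csv("school_tract_crosswalk.csv")   # school_id -> tract_geoid
tracts = pd.read_csv("census_tract_demographics.csv")   # demographics by census tract

# Link test scores to Census demographics through the crosswalk.
linked = (
    scores
    .merge(crosswalk, on="school_id", how="left")
    .merge(tracts, on="tract_geoid", how="left")
)

linked.to_csv("scores_with_demographics.csv", index=False)
```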

A website developer is hired to create a health data dashboard for the County and is given a list of datasets as part of the project requirements. All of the datasets are stored in a state-wide data library, so he uses a web application to look up each dataset’s ID number and creates a manifest file listing those formal identifiers (sketched below). He then uploads the manifest through the same application to create an online database that holds all of the data. Within an hour, the data is installed in the database, and he can focus on web application programming rather than data cleaning.
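The manifest in this scenario need not be complicated; it could be a short, machine-readable list of dataset identifiers. The sketch below is a hypothetical illustration of the idea, not a specification: the library URL and dataset identifiers are invented.

```python
import json

# Hypothetical manifest: the library URL and dataset identifiers are
# invented for illustration.
manifest = {
    "title": "County Health Dashboard",
    "library": "https://data.example.gov",
    "datasets": [
        "example.gov-hospitalizations-county",
        "example.gov-births-county",
        "example.gov-tract-crosswalk",
    ],
}

# The developer uploads this file; the library's web application resolves
# each identifier and loads the corresponding dataset into the new database.
with open("dashboard_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```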

In these scenarios, staff can apply their high-value, specialized skills immediately, rather than being mired in work that is not only low-value but has also been done many times before by other organizations. They are freed from repeating that work because it can be done once, upstream, and shared with them; and because the work is packaged in a standardized way, the skills and experience they build with each dataset carry over to the next.

These three attributes – Specialization, Standardization and Mass Distribution – are the cornerstones of our solution to the problem of the cost of data, and we’ll address how to build a system that delivers these features in the next section.