Industrial Production For Data

In the first part of A Vision For A Shared Data Infrastructure, we presented scenarios showing how three types of data users could work with data once they were freed from much of the grunt work they endure today. Those scenarios depended on developing three critical attributes of the data systems they use: Specialization, Standardization, and Mass Distribution. In Part 2, we’ll explain what those attributes really mean and show how realizing them can solve the problem of cost.

The problems we discussed in Part 1 stem from the way we pursue data projects: data projects use a craft mode of production. We run data projects in essentially the same way a Medieval cobbler made shoes, with each project tailored to the customer and run with no repeatable process. In craft production, the craftsperson performs many different tasks, works with many different materials, and makes a small number of items using a process that is specific to that one craftsperson. To reduce costs, we need to move data work to an industrial mode of production. Although there are many differences between craft and industrial production, we’ll look at the three that matter most for data projects:

  • Specialization. Acquiring, cleaning, analyzing and reporting are each distinct tasks, and each should be done by a different type of person.
  • Standardization. Specialization requires that the specialists work with a constrained number of inputs and outputs, so standardizing data formats within the process is an important part of specialization.
  • Mass Distribution. An industrial process costs more to set up than a craft process, so you also have to increase production to amortize those costs. With data, this means our process needs to serve more than one data-using organization to be cost effective.

Since Specialization depends on Standardization, we will start with Standardization. For data, Standardization means that the format and structure of data take a limited number of forms as they move through the process. There is a wide variety of workable standards, such as storing all data in CSV files on a shared drive, or using a central database. But since Mass Distribution is a requirement, we’ll need a standard data container that is easy to get from a website, store on a shared drive, or even send in email.

Fortunately, there is an excellent option: Sqlite. Sqlite is a simple relational database engine that stores an entire database in a single file. It is the world’s most widely deployed database, included in every Apple phone and computer, all Android phones, nearly every web browser, Windows 8, Skype and Adobe Photoshop. Using Sqlite, we get all of the power of a relational database, with extensions for geographic searches, in a file that you can email to other people. The specific choice of Sqlite isn’t really important at this point in the discussion, but knowing what it is may help you understand how the process flows.
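
To make the single-file idea concrete, here is a minimal sketch in Python, whose standard library includes Sqlite; the file and table names are made up for illustration.

    import sqlite3

    # The whole database lives in this one file; copying or emailing
    # "example.db" moves the schema and the data together.
    conn = sqlite3.connect("example.db")

    conn.execute("CREATE TABLE IF NOT EXISTS incidents (date TEXT, neighborhood TEXT, total INTEGER)")
    conn.execute("INSERT INTO incidents VALUES ('2015-01-01', 'Downtown', 12)")
    conn.commit()

    # Anyone who receives the file can run ordinary SQL against it.
    for row in conn.execute("SELECT neighborhood, SUM(total) FROM incidents GROUP BY neighborhood"):
        print(row)

    conn.close()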

In our solution, we’ll load data from any external source, including tabular data from spreadsheets and geographic data, into Sqlite files. From that point on, there is a single database format for most of our process. Creating a data standard involves a lot more work than this – we’ll also have to address metadata, documentation, dependencies, security, privacy, versions and many other issues – but all of these additional efforts will build on the base of a standard container for data.
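
As a rough sketch of what loading tabular data into one of these containers could look like, here is a short Python example using only the standard library; the CSV file, column names and metadata fields are all invented for illustration.

    import csv
    import sqlite3

    SOURCE_CSV = "permits.csv"   # hypothetical spreadsheet export: permit_id, issue_date, value
    CONTAINER = "permits.db"     # the standard container: one Sqlite file

    conn = sqlite3.connect(CONTAINER)

    # The data itself goes into an ordinary table.
    conn.execute("CREATE TABLE IF NOT EXISTS permits (permit_id TEXT, issue_date TEXT, value REAL)")
    with open(SOURCE_CSV, newline="") as f:
        rows = [(r["permit_id"], r["issue_date"], r["value"]) for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO permits VALUES (?, ?, ?)", rows)

    # A small metadata table rides along in the same file, hinting at how
    # documentation, versions and provenance can build on the container.
    conn.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT, value TEXT)")
    conn.executemany("INSERT INTO meta VALUES (?, ?)", [("source", SOURCE_CSV), ("version", "1")])

    conn.commit()
    conn.close()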

With a solid start on Standardization, we can address Specialization. In Part 1, we presented a data project process that involves these steps:

  • Acquisition. Locating the required data and securing permission to use it.
  • Preparation. Converting formats, cleaning bad records, and converting data values.
  • Linking. Linking to other data, such as Census demographics.
  • Analysis. Creating statistical analyses, predictions, charts and graphs, or other data projects.
  • Reporting. Packaging the technical output of the analysis phase in a way that can be shared with, and understood by, others.

Each of these steps can be a speciality, and each of those specialities is best exemplified by existing jobs.

  • Acquisition. Acquiring data is a skill that is most developed by Investigative Journalists, experts in the footwork, phone calls and relationships required to get access to information.
  • Preparation. Data preparation is the domain of Excel Jockeys and Generalist Programmers, who know how to manipulate data structures programmatically.
  • Linking. Linking datasets requires understanding multiple datasets, their structure, limitations on use and how to manipulate them. These skills are currently the domain of Data Scientists, but we’ll call these specialists Data Wranglers.
  • Analysis. As it has been for decades, data analysis is best done by Statisticians and Data Scientists.
  • Reporting. Creating reports from data analysis happens under a wide variety of job titles, but we’ll refer to it as a skill of Policy Analysts.

That’s a lot of different categories of skills, each with a different level of technical and domain expertise, but we can link them together into a process that takes us from the first steps of acquiring data to the final use. (Note that we are using these names for illustrative purposes rather than implying that each step requires hiring people with those backgrounds.)

  • The Investigative Journalist finds a dataset and downloads it if it is public, or negotiates for its release if it isn’t. This process involves building relationships, filing Freedom of Information Act requests, making phone calls and other interpersonal activities. The Investigative Journalist then either records the URL where the data can be downloaded, or, for non-public data, saves the file to a data repository so that it, too, has a URL from which it can be downloaded in the next step.
  • With a link to the data, the Generalist Programmer creates a program to download the data and package it into the standard data container, a Sqlite file (a minimal sketch of such a packaging program follows this list). The Programmer stores this file in a specialized library that is designed for storing the containers. When packaging the data, the Programmer tries not to make any substantive changes to the dataset – cleaning is for the next phase.
  • A Data Wrangler uses the Library to find the packages that the Programmer created, and then creates a derivative package of cleaned data. The cleaning can change the format of the data, combine and break up files, remove non-numeric characters and perform other manipulations. The container with the cleaned data is stored in the Library.
  • After a package has been cleaned, Data Wranglers can use the Library to find datasets that should be linked together, for instance by re-aggregating a dataset so it conforms to Census tract boundaries. These combined datasets are also stored in the Library.
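
As a rough illustration of the Programmer’s packaging step, here is a minimal Python sketch; the URL, file names and “library” directory are placeholders, and a real packaging program would also handle errors, column types and metadata.

    import csv
    import io
    import sqlite3
    import urllib.request
    from pathlib import Path

    DATA_URL = "https://example.org/data/crime_incidents.csv"   # URL recorded in the previous step
    LIBRARY = Path("library")                                   # stand-in for the container library
    LIBRARY.mkdir(exist_ok=True)

    # Download the raw file exactly as published; cleaning happens in a later step.
    raw = urllib.request.urlopen(DATA_URL).read().decode("utf-8")
    reader = csv.reader(io.StringIO(raw))
    header = next(reader)

    # Package the rows, unchanged, into a Sqlite container stored in the library.
    conn = sqlite3.connect(str(LIBRARY / "crime_incidents.db"))
    columns = ", ".join(f'"{c}" TEXT' for c in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE IF NOT EXISTS incidents ({columns})")
    conn.executemany(f"INSERT INTO incidents VALUES ({placeholders})", reader)
    conn.commit()
    conn.close()

Because the whole packaging step is a short program like this, it can be re-run whenever the source data changes, a property that becomes important for Mass Distribution later on.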

At this point in the process, we have a data Library: a web application and file store that holds many data containers. Most of the steps of the process involve putting standardized data containers into the Library, taking them out, or both.

The next step is to make that data available to users. Some of these users are:

  • Statisticians and Data Scientists. These highly technical users work in programming languages that can access the Library directly, downloading containers and loading them into their data analysis applications.
  • Website Developers. For website developers who prefer to work with standard relational databases, containers can be “installed” into a database (a sketch of what this could look like follows this list). Using dedicated software, the developer lists the containers that should be loaded into a database, and the data management software keeps the database up to date.
  • Casual Analysts. Casual analysts prefer to work with Excel, so their data needs to be available in Excel or CSV format. These users will prefer a data repository, such as Data.gov, where they can download files. Fortunately, just as containers can be installed to a database, they can also be installed to a data repository.
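
As a sketch of what the first two of these could look like, the snippet below downloads a container, loads a table into an analysis tool (pandas), and “installs” the container’s tables into a conventional relational database; the URL, table name and connection string are placeholders, and the actual data management software for this step is the subject of Part 3.

    import sqlite3
    import urllib.request

    import pandas as pd
    from sqlalchemy import create_engine

    # Fetch a container from the Library (placeholder URL and file name).
    urllib.request.urlretrieve("https://library.example.org/containers/permits.db", "permits.db")
    conn = sqlite3.connect("permits.db")

    # Statistician / Data Scientist: load a table straight into an analysis tool.
    df = pd.read_sql_query("SELECT * FROM permits", conn)
    print(df.describe())

    # Website Developer: "install" every table in the container into a regular
    # database (here Postgres via SQLAlchemy; the connection string is an assumption).
    engine = create_engine("postgresql://user:password@localhost/warehouse")
    tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        pd.read_sql_query(f"SELECT * FROM {table}", conn).to_sql(table, engine, if_exists="replace", index=False)

    conn.close()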

After the analysis is done, there are many ways to get the results to the Reporting phase. At the San Diego Data Library, we use software for analysis notebooks, which lets analysts write documents that combine text with data analysis, such as this notebook on San Diego crime patterns. The data analysts can work with the Policy Analysts to create these notebooks, presenting the analysis and visualizations that the Policy Analyst will write about.

So, to summarize the Process:

  • Get the URL for the data.
  • Take the data from the URL and put it into a container. Put the container in the Library.
  • Get a container from the Library, transform its data into a new container, and put that container back into the Library.
  • Take containers from the Library and install the data into files, websites and databases.
  • End users work with data from the Library, or from those files, websites and databases.

Now we’ve illustrated the entire process, but within a single organization. What about Mass Distribution? Mass Distribution is nearly invisible to most data users, since the process already has several places where data containers can be shared:

  • Containers are created with short programs, and these programs can be stored on code-sharing sites like GitHub, where other Data Wranglers can download a program to re-create its containers, or modify the program to create variants.
  • After the containers are installed in the Library, access to the library can be shared across organizations, or with the public.
  • Containers can be installed into databases and data repositories, both of which can be shared.

Mass Distribution is a natural result of having standard containers, created by programs, that can be downloaded and installed. Where the craft process cleaned data or combined it with other datasets for a single organization, the industrial process can do the same work for hundreds of organizations. That amortizes the cost of data work across many more users, significantly reducing costs while making more data available and easier to get.

So far, we’ve motivated the solution and laid out a process, but we’ve only talked about one part of the technology: the Sqlite files. In Part 3, we’ll introduce Ambry, the software that creates these files and manages the process of using them.