Summer Water Quality Data Project

For its Summer 2018 data program, the San Diego Regional Data Library proposes to build a repository for water quality data, making it easier to find water quality data, link datasets together, and create maps, reports and analyses. The project will be run with a small paid staff and a group of volunteers; if it is successful, we will pursue funding for continued maintenance.

This project is over, but see the project website for archived results.

The data we expect to collect, process and store includes:

  • Beach water quality reports
  • Creek water quality reports
  • Land use
  • Watershed locations
  • Sewer spill reports
  • Rainfall
  • Air quality
  • Traffic patterns

All of these datasets will be stored in their original form, as acquired from the upstream source, and also converted to a common format to make combining datasets easier. Point data will be aggregated to US Census tracts and to the 1 km US National Grid. Every spatial record will additionally be tagged with the watershed that contains its region and with the water monitoring points that the region drains to. Other additions and improvements to the datasets will be made in response to requests from analysts.
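As a rough illustration of the grid aggregation step, the sketch below bins point measurements into 1 km square cells and summarizes each cell. It is a minimal sketch under stated assumptions, not the project's actual pipeline: it assumes each record already carries projected coordinates in meters (for example UTM easting and northing, on which the US National Grid is based), and the field names `easting_m`, `northing_m` and `value` and both function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

CELL_SIZE_M = 1000  # 1 km grid cells


def grid_cell(easting_m, northing_m, cell_size=CELL_SIZE_M):
    """Return the south-west corner (in meters) of the cell containing a point."""
    return (
        int(easting_m // cell_size) * cell_size,
        int(northing_m // cell_size) * cell_size,
    )


def aggregate_to_grid(records, value_key="value"):
    """Group point records by grid cell and report a count and mean per cell.

    `records` is an iterable of dicts with hypothetical keys
    'easting_m', 'northing_m' and `value_key`.
    """
    cells = defaultdict(list)
    for rec in records:
        cell = grid_cell(rec["easting_m"], rec["northing_m"])
        cells[cell].append(rec[value_key])
    return {
        cell: {"count": len(values), "mean": mean(values)}
        for cell, values in cells.items()
    }
```

Aggregating to Census tracts or tagging records with a watershed would instead require a point-in-polygon test against boundary files, which this sketch does not attempt.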

As the datasets are collected, they will be published to a data repository website, similar to the Data Library’s CKAN Data Repository. Each dataset will include metadata and documentation, similar to this dataset. The system will use the Metatab and Metapack data packaging system to create and maintain data packages. This set of data management practices and software will make it much easier for analysts to find water quality data, link different datasets together, analyze the combined data, and produce reports.

We will be recruiting volunteers for two technical roles:

  • Data Wranglers, to find water quality datasets, package them with Metapack, and add features to make analysis easier
  • Data Analysts, to analyze datasets and create reports that demonstrate how to use the data collection, primarily using IPython, Jupyter and R

The project will be run with one paid management intern, a dedicated project manager, and a group of volunteers. We are currently pitching the project to potential partners and recruiting the management intern and volunteer data wranglers. If you are interested in participating, please contact Eric Busboom at 619 363 2607.