Summer Water Quality Data Project – San Diego Regional Data Library

For its Summer 2018 data program, the San Diego Regional Data Library proposes to build a repository for water quality data, making it easier to find that data, link datasets together, and create maps, reports and analyses. The project will be run with a small paid staff and a group of volunteers; if it is successful, we will pursue funding for continued maintenance.

This project is over, but see the project website for archived results.

Data we expect to collect, process and store will include:

Beach water quality reports
Creek water quality reports
Land use
Watershed locations
Sewer spill reports
Rainfall
Air quality
Traffic patterns

All of these datasets will be stored in their original forms, as acquired from the upstream source, and also converted to a common format to make combining datasets easier. Point data will be aggregated to US Census tracts and to the 1 km US National Grid. Every spatial record will additionally be tagged with the watershed that contains its region and with the water monitoring points that region drains to. Other additions and improvements to the datasets will be made in response to requests from analysts.
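As a rough illustration of the point aggregation described above, the following Python sketch bins sample points into 1 km square cells and averages the values in each cell. It is a minimal sketch under stated assumptions, not the project's actual pipeline: it assumes the points are already projected to metric coordinates (for example UTM eastings and northings), it identifies cells by a simple (column, row) index rather than a real US National Grid designator, and the names grid_cell and aggregate_to_grid and the sample readings are hypothetical.

```python
from collections import defaultdict
from statistics import mean

CELL_SIZE_M = 1000  # 1 km grid cells, in meters


def grid_cell(easting_m, northing_m, cell_size=CELL_SIZE_M):
    """Return the (column, row) index of the cell containing a projected point.

    Assumption: coordinates are in meters in a projected system such as UTM.
    The index is illustrative and is not a US National Grid designator.
    """
    return (int(easting_m // cell_size), int(northing_m // cell_size))


def aggregate_to_grid(samples, cell_size=CELL_SIZE_M):
    """Group (easting_m, northing_m, value) samples by cell and average them."""
    by_cell = defaultdict(list)
    for easting_m, northing_m, value in samples:
        by_cell[grid_cell(easting_m, northing_m, cell_size)].append(value)
    return {
        cell: {"count": len(values), "mean": mean(values)}
        for cell, values in by_cell.items()
    }


# Hypothetical readings: (easting_m, northing_m, measured value)
samples = [
    (476250.0, 3620900.0, 120.0),
    (476800.0, 3620100.0, 80.0),
    (477100.0, 3620500.0, 35.0),
]

result = aggregate_to_grid(samples)
for cell, stats in sorted(result.items()):
    print(cell, stats)
```

The first two readings fall in the same cell and are averaged together, while the third lands in the neighboring cell. Aggregating to Census tracts or tagging records with a watershed would instead require a point-in-polygon test against boundary geometries, which this sketch does not attempt.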

As the datasets are collected, they will be published to a data repository website, similar to the Data Library’s CKAN Data Repository. Each dataset will include metadata and documentation, similar to this dataset. The repository will use the Metatab and Metapack data packaging system to create and maintain data packages. This set of data management practices and software will make it much easier for analysts to find water quality data, link different sets together, analyze the combined data, and produce reports.

We will be recruiting volunteers for two technical roles:

Data Wranglers, to find water quality datasets, package them with Metapack, and add features to make analysis easier.
Data Analysts, to analyze datasets and create reports to demonstrate how to use the data collection, primarily using IPython, Jupyter and R.

The project will be run with one paid management intern, a dedicated project manager, and a group of volunteers. We are currently pitching the project to potential partners and recruiting the management intern and volunteer data wranglers. If you are interested in participating, please contact Eric Busboom at eric@sandiegodata.org or 619 363 2607.