At tomorrow night’s meeting, we’ll be kicking off two new data projects. The first is the Health Food Access project, previously announced, and the second is the Age Friendly Communities project, for which we’ve just posted the project page. In this project we will be collecting data to analyze the capacity and affordability of San Diego’s assisted living industry, considering the anticipated need for these services over the next 30 years. Hope you can join us.
Data For Civic And Social Development
Better data for better government and better communities.
Looking for Data?
Make a request in the forum, or search the repository.
We can connect you with data analysts.
Want to help?
Volunteer your data, programming, or management skills to an analysis project.
Next week we’ll be kicking off two new data projects, and a big part of these projects will be finding data, documenting it, and preparing it in a consistent way for analysis, a process known as data wrangling. I’ve been developing software for wrangling social data for a few years, and have collected many of the best ideas into a new metadata system called Metatab. Metatab is a system for storing structured metadata in a CSV file, often alongside data, making it easier to create and publish metadata.
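The row-oriented idea behind Metatab can be sketched with nothing but the standard library. This is a simplified illustration, not the full Metatab specification: each CSV row starts with a term name, followed by its value(s), and the term names and values below are hypothetical.

```python
import csv
import io

# A hedged sketch of Metatab's row-oriented layout: each row is a term
# followed by its value(s). Term names and values here are illustrative
# placeholders, not taken from a real Metatab document.
metatab_csv = """\
Declare,metatab-latest
Title,Assisted Living Facilities
Description,Capacity and affordability data
Section,Resources
Datafile,http://example.com/facilities.csv
"""

# Parse the rows back into (term, values) pairs with the stdlib csv module.
records = [(row[0], row[1:]) for row in csv.reader(io.StringIO(metatab_csv))]
for term, values in records:
    print(term, values)
```

Because the container is just CSV, the same metadata can live in a spreadsheet, sit next to the data files it describes, and be read by any tool that can parse comma-separated rows.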
In the next two data projects, we will be using the Metatab Google Spreadsheet Add-On to document the data we locate. Once a Metatab specification is created for a dataset, it can be uploaded to CKAN, our data repository software, directly from the Google spreadsheet. I'm also working on other tools for finding and manipulating data.
When the main data wrangling is done, our data repository will hold collections of datasets related to food access and assisted living, and then we can start on data analysis, most likely using Pandas and Tableau, though we may also try a few AWS tools like AWS Athena and AWS QuickSight.
Register for the Meeting
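As a rough sketch of the kind of analysis the wrangled datasets should enable, here is a minimal Pandas summary. The column names and values are hypothetical placeholders, not from any real project dataset:

```python
import pandas as pd

# Hedged sketch: a stand-in for a wrangled assisted-living dataset.
# Column names ("zip", "beds", "monthly_cost") and values are hypothetical.
df = pd.DataFrame({
    "zip": ["92101", "92101", "92103"],
    "beds": [40, 25, 60],
    "monthly_cost": [3500, 4200, 5100],
})

# Capacity and median cost by ZIP code -- the kind of summary table we
# might push into Tableau, or compute with a tool like AWS Athena.
summary = df.groupby("zip").agg(
    total_beds=("beds", "sum"),
    median_cost=("monthly_cost", "median"),
)
print(summary)
```

Once a dataset is documented and loaded into the repository, a few lines like these are usually enough to get a first look at capacity and affordability questions.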
The Library has been quiet lately, mostly because the Director doesn’t have time to properly manage projects, and because he’s not very good at it. So, we’d like to find a volunteer to manage projects. This is a volunteer role that would involve:
- Talking to nonprofits and journalists about data needs
- Recruiting other volunteers for data projects
- Setting up meetings and finding rooms
- Participating in data projects
If you are interested in data and have good organizational skills, please apply by sending email to Eric Busboom, email@example.com.
Collect and analyze data about the food system in San Diego county.
The San Diego Food System Alliance’s Healthy Food Access Working Group is developing an indicator library to analyze food access issues, and we need your help to locate datasets, wrangle them into usable shape, and create visualizations.
The work is similar to the topics of our March 2015 Data Contest, with the additional work of building a reusable data library for further analysis.
This project needs volunteers with a range of skills, including:
- Administration and logistics: Call potential data providers, locate datasets, and arrange meetings and events.
- Data wranglers: People skilled with Excel or Python who can manipulate datasets.
- Data analysts: Analysts who know R or Python/Pandas.
We will be starting with a list of potential datasets, from which we will construct Ambry Data Bundles. We can load the bundles into a data library. Then we can do visualizations and analysis, such as this map from a project at Palomar College.
How To Participate
To mark the arrival of a new set of crime data, our next meetup will be a mini data contest. We will present the new Crime Incident dataset and talk about how to link it to other social datasets. After the presentation, we’ll challenge you to do your own analysis, with a $100 prize for the best analysis by a student, undergraduate or lower.
Then, for the next meeting, we’ll invite the best analysts to present their findings and techniques.
When we last requested crime data from SANDAG, three years ago, it took four months of negotiation for them to admit they could produce it, and two more months to get the price down to a reasonable amount.
Last week, when I requested an update, I got one clarification email, then a phone call, and the files were in my inbox a few minutes later. Thanks, SANDAG! As a bonus, the data is now geocoded to census block and tract, making the files immediately analyzable.
You can find both the old data file and the new one in an Ambry package on our new data repository.
The Data Library works primarily with journalists and nonprofits, but until recently I hadn’t fully appreciated how different the processes are in these two environments. We’d been following two different processes without naming them, so it is worth giving the two contexts names, so we can be sure we are working in a process that is comfortable for our clients and partners.
In the Journalist / Exploratory context, the journalist is looking for a story in data, or wants to use data to support a story idea. In either case, the dataset is novel; there is no pre-existing format to follow, and most times, neither the journalist nor the analyst has worked with the data before.
In this context, we follow a light, fast process that produces a lot of graphs and tables to look at different angles. The outputs are very rough, so the plots aren’t properly labeled and table columns can be cryptic. The goal is to sift through the data quickly to find a few gems, and polish them later. We don’t want to spend a lot of time making plots look good if they will never be used.
In this context, we work directly from IPython and plotly, documenting the process in Google Docs, and we share the IPython Notebooks with the journalist. These notebooks, like this one, are complicated, ugly, and not at all suitable for publication, but they are very useful for exploring ideas quickly.
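The exploratory pass is deliberately quick and unpolished. A hedged sketch of what those first rough looks tend to be, using hypothetical crime-incident columns (real field names will differ):

```python
import pandas as pd

# Hedged sketch of the fast exploratory pass. The "category" and "tract"
# columns are hypothetical stand-ins for whatever the real dataset has.
df = pd.DataFrame({
    "category": ["theft", "assault", "theft", "vandalism", "theft"],
    "tract": ["001", "002", "001", "003", "002"],
})

# Quick, rough angles on the data -- no labels, no polish, just counts.
print(df["category"].value_counts())
print(df.groupby("tract").size())
```

Output like this is never publication-ready; the point is to surface a few promising angles fast, and only polish the ones that turn into a story.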
The other context is the Nonprofit / Reporting context, where an organization has a fairly specific goal, most often a well-defined report. In this context, projects run most smoothly when we start with a copy of a previous report, discuss changes to it, and use it as a template. If there isn’t a previous report, we’ll mock one up in Excel first.
In the reporting context, the client doesn’t want to see the rough work, and the IPython notebooks are very confusing and distracting, so we usually share Excel files for data, and load the data into Tableau for presentation and discussion. The Tableau workbooks are much easier to understand, and a whole lot more attractive. Here is an example of a Tableau workbook organized for reporting.
There are other contexts as well, one for programming projects and another for producing datasets for later analysis, but these are the two where proper communication about which context will be used for a project has the most impact on project success.