At tomorrow night’s meeting, we’ll be kicking off two new data projects. The first is the Health Food Access project, previously announced, and the second is the Age Friendly Communities project, for which we’ve just posted the project page. In this project we will be collecting data to analyze the capacity and affordability of San Diego’s assisted living industry, considering the anticipated need for these services over the next 30 years. Hope you can join us.
Next week we’ll be kicking off two new data projects, and a big part of these projects will be finding data, documenting it, and preparing it in a consistent way for analysis, a process known as data wrangling. I’ve been developing software for wrangling social data for a few years, and have collected many of the best ideas into a new metadata system called Metatab. Metatab is a system for storing structured metadata in a CSV file, often alongside data, making it easier to create and publish metadata.
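The core idea is that a Metatab document is just rows of term/value pairs in an ordinary CSV file. As a rough sketch (the term names below are illustrative examples, not an authoritative Metatab vocabulary), building one with Python's standard csv module might look like:

```python
import csv
import io

# Illustrative Metatab-style rows: each row is a term followed by its
# value. The specific terms here are examples only; consult the Metatab
# documentation for the real vocabulary.
rows = [
    ("Declare", "metatab-latest"),
    ("Title", "Assisted Living Facilities"),
    ("Name", "sandiego-assisted-living"),
    ("Section", "Resources"),
    ("Datafile", "facilities.csv"),   # hypothetical data file name
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```

Because the metadata is plain CSV, it can live in a spreadsheet right alongside the data it describes, which is what makes the Google Spreadsheet workflow possible.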
In the next two data projects, we will be using the Metatab Google Spreadsheet Add-On to document the data we locate. Once a Metatab specification is created for a dataset, it can be uploaded to CKAN, our data repository software, directly from the Google spreadsheet system. I’m also currently working on other tools for finding and manipulating data.
When we are done with the main data wrangling, our main data repository will hold collections of datasets related to food access and assisted living. Then we can start on data analysis, most likely using Pandas and Tableau, though we may also try a few AWS tools like Athena and QuickSight.

Register for the Meeting
Collect and analyze data about the food system in San Diego county.
The San Diego Food System Alliance’s Healthy Food Access Working Group is developing an indicator library to analyze food access issues, and we need your help to locate datasets, wrangle them into usable shape, and create visualizations.
The work is similar to the topics of our March 2015 Data Contest, with the additional work of building a reusable data library to support further analysis.
This project needs volunteers with a range of skills, including:
- Administration and logistics: Call potential data providers, locate datasets, and arrange meetings and events.
- Data wranglers: People skilled with either Excel or Python to manipulate datasets.
- Data analysts: People who know R or Python/Pandas.
We will be starting with a list of potential datasets, from which we will construct Ambry Data Bundles. We can load the bundles into a data library. Then we can do visualizations and analysis, such as this map from a project at Palomar College.
How To Participate
We’re looking for some programmers to visualize crime data and present it at our booth at the San Diego Magazine Big Ideas Party on Jan 21. The Data Library was one of the 25 Big Ideas covered in their January issue, so they’d like us to have a presentation at the party.
I’d like to have an interactive display, probably using D3, that shows a crime hot spot map for the region, as well as a collection of time-based Rhythm maps for selected areas. A visitor to the booth could select a neighborhood or city, see the hot spots in that area, and see how the crime incidents change in that area over time.
You’ll get a ticket to the party on the 21st, to share in the glory, get free food, and do some high-quality hobnobbing.
If you are interested, send me an email, with a link to your Github/Bitbucket/etc account or portfolio, to email@example.com. We can use any number of volunteers, but I only have three free tickets.
For the last few months, a team of geography students at SDSU has been working with the crime data provided by the Library, producing analyses and visualizations of the data.
Elias Issa has been looking at Drugs and Alcohol violations in Downtown San Diego and East Village. He writes:
The Hot Spot tool calculates the Gi* statistic for each feature in a dataset. The resulting z-scores and p-values tell you where features with either high or low values cluster spatially. To be a statistically significant hot spot, a feature must have a high value and be surrounded by other features with high values as well. My animated maps illustrate two hot spots from 2007 to 2012. The most significant hot spot is located close to the St. Vincent de Paul homeless shelter (Imperial Ave) in the southeastern part of East Village. The second hot spot is located north of the Gaslamp Quarter between Broadway and Market, where most of the famous and popular bars are found. In addition to those maps, and based on those hot spots, I did some statistical analysis to show the average of drugs/alcohol violations monthly (above/below average) and yearly. My study reveals a slight increase between 2011 and 2012 in the average of drugs/alcohol violations.
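For readers curious about the mechanics, here is a minimal sketch of the Gi*-style calculation Elias describes, using binary neighbor weights over a toy one-dimensional strip of cells. It illustrates the shape of the statistic — a z-score comparing each cell's neighborhood sum to its expected value — and is not the full implementation in ArcMap's Hot Spot tool:

```python
import math

def gi_star(values, neighbors):
    """Simplified Getis-Ord Gi* z-scores with binary weights.

    For each cell i, compare the sum of values in its neighborhood
    (neighbors[i], which includes i itself) to the sum expected by
    chance. High positive z-scores indicate hot spots.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        hood = neighbors[i]
        w = len(hood)  # binary weights: each neighbor contributes 1
        local = sum(values[j] for j in hood)
        denom = s * math.sqrt((n * w - w * w) / (n - 1))
        scores.append((local - mean * w) / denom)
    return scores

# Toy example: a run of high counts in the middle of a strip of cells,
# with each cell's neighborhood being itself plus its adjacent cells.
values = [0, 0, 0, 10, 10, 10, 0, 0, 0]
neighbors = [[j for j in (i - 1, i, i + 1) if 0 <= j < len(values)]
             for i in range(len(values))]
z = gi_star(values, neighbors)
```

The center of the high-count run gets the largest z-score, since it has a high value and is surrounded by other high values — exactly the hot-spot condition described above.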
Over the course of the project, he has been experimenting with various ways to visualize the time component of geographic data. This is quite difficult, since you can’t easily scan the time dimension like you can in space. Visual processing is tuned for noticing changes and differences — like a deer that won’t notice you if you don’t move — so Elias’ visualization is best for quickly identifying areas that deserve more analysis, rather than showing the quantitative differences.
Due to the difficulty of creating animations like this in ArcMap, the video has only one frame per year, but that is enough to illustrate how the changes from frame to frame draw your eye to problem areas. Without a visualization like this, it is easy to miss some of the most important features of an issue, including short-term spikes and long-term trends.
After identifying an area to focus on through the visualization, Elias’ underlying statistical method serves to quantify the differences between times or locations, so this project is a great example of a way to use animation to partition a large problem space into components that can be analyzed in detail.
Here is some eye candy, a population density map of Pacific Beach and surrounding neighborhoods.
This map was created with a lot of Python code, using the 2010 census shapefiles for census blocks, setting a value for each block as the population of the block divided by the area of the block, and rasterizing all of the blocks to an image. Red indicates areas of higher population density. You can clearly pick out the areas in Pacific Beach that are zoned for apartments vs single family homes, the UTC high-rise apartment area, and many other variations in land use.
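The per-block density step is simple enough to sketch. The example below stands in made-up rectangular blocks for the real census-block polygons (the actual code reads shapefiles and rasterizes arbitrary polygons), but it shows the core calculation — population divided by area, painted into a raster grid:

```python
# Sketch of the density-raster idea using made-up rectangular "blocks".
# Real census blocks are polygons read from 2010 census shapefiles
# (e.g. with pyshp or geopandas), which this example skips.
blocks = [
    # (min_x, min_y, max_x, max_y, population)
    (0, 0, 4, 4, 800),  # dense apartment block
    (4, 0, 8, 4, 100),  # single-family block
]

W, H = 8, 4
raster = [[0.0] * W for _ in range(H)]

for x0, y0, x1, y1, pop in blocks:
    density = pop / ((x1 - x0) * (y1 - y0))  # people per unit area
    for y in range(y0, y1):                  # paint the block's cells
        for x in range(x0, x1):
            raster[y][x] = density
```

Mapping each raster value to a color ramp (red for high density) then produces an image like the one above.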
This map is a test of code I’m creating to allow any census variable to be mapped, but I’m not entirely happy with the result. The problem is that human brains like to see smooth variations in density, and the jarring discontinuities in this map are confusing. Some of the time the abrupt changes in density are connected to changes in land use, since census boundaries tend to follow streets, but most of the time what map users are really interested in is how people respond to density, and in those cases human movements and behaviors don’t follow sharp boundaries.
To address this issue, I will be converting these maps into the same grid structure that we use for crime maps and smoothing across the grid cells to remove the discontinuities. These modified maps won’t show the population density with the same accuracy, but they will be easier for people to interpret in ways that are relevant to their real interests in population density.
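The smoothing step can be as simple as a moving average across neighboring grid cells. Here is a minimal sketch, assuming a 3×3 averaging window (the actual kernel used for the crime-map grids may differ):

```python
def smooth(grid):
    """3x3 moving-average smoothing over a grid of density values.

    Each cell is replaced by the average of itself and its in-bounds
    neighbors, which blurs away sharp block-to-block discontinuities.
    An illustrative sketch, not the production smoothing code.
    """
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [grid[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) / len(vals)
    return out

# A single sharp spike spreads into its neighbors after smoothing.
spike = [[0.0, 0.0, 0.0],
         [0.0, 9.0, 0.0],
         [0.0, 0.0, 0.0]]
smoothed = smooth(spike)
```

The trade-off is exactly the one described above: the smoothed grid no longer reports each block's density exactly, but the gradients it shows are closer to how people actually experience density.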