Data.gov is the top-level data search system for the US, with references to over 130,000 datasets from federal and state agencies. And yet, I’ve never successfully used it for finding data. Here is an example search for “Diabetes Rates”:
So, we look for diabetes, and get births, 22-year-old mortality data from the US Geological Survey, and quality of service data as the first three hits. The first link at least points to the right agency, but you still have to click three times to get there.
Here is the same search on Google:
Not only do I get links to real primary organizations, the fourth hit is the original source of the data, and Google helpfully gives us quick stats before the hits. Even better, if you search for “diabetes rates data” you are one click away from the primary data source for US diabetes rates.
The really poor quality of search results on Data.gov has been a problem for its entire existence; I’ve never done a data search on Data.gov that returned what I wanted. I’ve always had better results with Google or browsing the website of the agency that produces the data.
Data.gov has been around for about 5 years, and despite human curation it still isn’t as useful as Google’s automatic index. At some point, I’d like to stop being excited by its possibilities, and start being excited by its utility.
Here is the Tableau workbook that we’ll be using for the SPJ Hacks/Hackers Data Show event tonight.
You can also get links to all of the documentation from our data warehouse.
Voice of San Diego has published a story we supported, a fact-check about the density of veterans in San Diego County. You can get all of the data behind the story (including schemas and SQL queries!) in our data warehouse.
Once you’ve got the basic skills in programming and statistics, the best way to learn data analysis is to do it. So, we’re developing a practical experience program for aspiring data analysts. The program is in the pilot phase with a small set of students, but you can read about how it works on the Internship Program’s wiki page. The goal of the program is to develop more experience with answering questions with data in San Diego and to make that experience available to nonprofits, government agencies, journalists and other...read more
We’ve often mentioned that people who work on data projects tell us that data preparation and cleaning frequently consume 80% of their project time, so it is interesting to get this data point from Kaggle: (2) How long is a typical project? When working with a top 0.5% data scientist, projects take just eight to 40 hours ($3k to $12k). Projects are finished in closer to eight hours for clean data and closer to 40 hours when the data requires cleaning. So, in this anecdote, with some squinty-eyed interpretation, data cleaning...read more
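As a rough check on the Kaggle figures quoted above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes that the entire gap between the eight-hour and 40-hour projects is cleaning time, which the quote does not actually state, and the function name `cleaning_share` is made up for this example.

```python
def cleaning_share(clean_hours, dirty_hours):
    """Fraction of the longer project that would be cleaning time.

    Assumes the whole difference between the two durations is cleaning,
    which is an interpretation of the quote, not something it states.
    """
    return (dirty_hours - clean_hours) / dirty_hours


share = cleaning_share(8, 40)
print(f"{share:.0%}")  # prints "80%"
```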
Here are the files for my presentation to the San Diego Crime Analysts Association: PDF File, PPT File
A Rhythm Map is a heat map that displays time in both the X and Y dimensions. Rhythm maps are an excellent way to visualize repeating patterns in time, such as how crimes occur by hour and day of week. Here we look at some interesting patterns in burglaries in the City of San Diego. First, here is the map for a range of crime types in San Diego, compiled from the type, time and date of about 400K crime incidents in the City of San Diego from 2006 to 2012. Each square is a crime type. The vertical axis is the hour of the day, and the...read more
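For readers who want to try this on their own data, here is a minimal sketch of the counting step behind a rhythm map: binning incident timestamps into an hour-by-weekday grid. It is not the code used to produce the maps described here. The function name `rhythm_grid`, the sample timestamps, and the choice to put weekdays in the columns are all assumptions made for illustration.

```python
from datetime import datetime


def rhythm_grid(timestamps):
    """Count events into a 24 x 7 grid.

    Rows are hours of the day (0-23); columns are weekdays (0 = Monday),
    which is an assumed layout for this sketch.
    """
    grid = [[0] * 7 for _ in range(24)]
    for ts in timestamps:
        grid[ts.hour][ts.weekday()] += 1
    return grid


# Hypothetical incident times, invented for this example only.
incidents = [
    datetime(2012, 1, 6, 23, 15),   # Friday, 11 pm
    datetime(2012, 1, 7, 1, 40),    # Saturday, 1 am
    datetime(2012, 1, 7, 23, 5),    # Saturday, 11 pm
    datetime(2012, 1, 13, 23, 50),  # Friday, 11 pm
]
grid = rhythm_grid(incidents)
print(grid[23][4])  # number of incidents in the Friday 11 pm cell
```

The resulting grid can be handed to any plotting tool that draws a matrix as a heat map; one grid per crime type gives the small-multiples layout described above.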
For the last 5 months, SANDAG has been publishing their crime incident data to the web. The file they publish only stores the last 180 days, and it is a bit hard to find, so we’re archiving the files to our data repository.
Here is an interactive data application that explores how crime incidents vary over the day of week and time of day. In the checkbox below, select one or more crime types, and the heatmap will show the relative intensity of those crime types over day of week and time of day. There are many interesting patterns here, some you would expect, some you might not: Things you might expect: DUI and Drugs violations are primarily committed in the evening and early mornings on weekends. Assaults are most frequent in the late evenings and early mornings on...read more
For the last few months, a team of geography students at SDSU have been working with the crime data provided by the Library, producing analyses and visualizations of the data. Elias Issa has been looking at Drugs and Alcohol violations in Downtown San Diego and East Village. He writes: The Hot Spot tool calculates the Gi* statistic for each feature in a dataset. The resultant z-scores and p-values tell you where features with either high or low values cluster spatially. To have a statistically significant hot spot, a feature will have...read more
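To make the Gi* statistic mentioned above more concrete, here is a small pure-Python sketch of the standard Getis-Ord Gi* formula. It is not the Hot Spot tool's implementation and not the students' workflow. It assumes binary weights within a fixed distance band (each feature counts itself and every neighbor inside the band), and the names `gi_star`, `two_sided_p`, and `band`, along with the sample points and values, are hypothetical.

```python
import math


def gi_star(points, values, band):
    """Return a Gi* z-score per feature, using binary distance-band weights.

    points: list of (x, y); values: one number per point;
    band: distance threshold (an assumption of this sketch).
    """
    n = len(values)
    mean = sum(values) / n
    # Standard deviation over all features; max() guards against tiny negative round-off.
    s = math.sqrt(max(0.0, sum(v * v for v in values) / n - mean * mean))
    scores = []
    for xi, yi in points:
        # Weight 1 for features within the band, including the feature itself.
        w = [1.0 if math.hypot(xi - xj, yi - yj) <= band else 0.0
             for xj, yj in points]
        sw = sum(w)
        sw2 = sum(wj * wj for wj in w)
        num = sum(wj * v for wj, v in zip(w, values)) - mean * sw
        den = s * math.sqrt(max(0.0, (n * sw2 - sw * sw) / (n - 1)))
        scores.append(num / den if den > 0 else 0.0)
    return scores


def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical data: ten features on a line, high values clustered at one end.
points = [(i, 0) for i in range(10)]
values = [10, 10, 10, 0, 0, 0, 0, 0, 0, 0]
scores = gi_star(points, values, band=1)
print(scores[1], two_sided_p(scores[1]))
```

In this made-up example the feature in the middle of the high-value cluster gets a large positive z-score and a small p-value, while features surrounded by zeros get negative scores.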