Data Analysis For Your Mission

Posted on Dec 8, 2014 in News | 0 comments

Sponsor a Project in Our Student Data Analysis Contest

Give your nonprofit, agency or news organization valuable data-driven insights by sponsoring a project at our student data analysis contest. 

The San Diego Regional Data Library, the SDSU Society for Statisticians and Actuaries and Teradata are organizing a data analysis contest to aid nonprofits, journalists and government agencies in making better use of data, to develop a broader regional capacity for data analysis, and to introduce students interested in data analysis to future employers.

Contestant teams will have one week in early March 2015 to answer a set of data-driven questions and visualize the results for one of four projects. Each project will be provided by a nonprofit, government agency or news organization.

The contest will be announced in early January, primarily to college students. For a month before the contest begins, the San Diego Regional Data Library will run a special session of its Practical Data Program to train contestants in using Python, R, and IPython to analyze data.

If you are interested in being involved in the contest, as a project sponsor, mentor or contestant, visit the overview page or contact Eric Busboom at (619) 363-2607.

Read More

When Will Data.gov be Useful?

Posted on Oct 27, 2014 in News | 2 comments

Data.gov is the top-level data search system for the US, with references to over 130,000 datasets from federal and state agencies. And yet, I’ve never successfully used it to find data. Here is an example search for “Diabetes Rates”:

Search for “Diabetes Rates” on Data.gov.


So, we search for diabetes, and the first three hits are births, 22-year-old mortality data from the US Geological Survey, and quality-of-service data. The first link at least points to the right agency, but you still have to click three times to get there.

Here is the same search on Google:


Search for Diabetes Rates on Google.



Not only do I get links to real primary organizations, but the fourth hit is the original source of the data, and Google helpfully gives us quick stats before the hits. Even better, if you search for “diabetes rates data” you are one click away from the primary data source for US diabetes rates.

The really poor quality of search results on Data.gov has been a problem for its entire existence; I’ve never done a search there that returned what I wanted. I’ve always had better results with Google, or by browsing the website of the agency that produces the data. Data.gov has been around for about 5 years, and despite human curation it still isn’t as useful as Google’s automatic index. At some point, I’d like to stop being excited by its possibilities and start being excited by its utility.



Read More

Learn Data Analysis Techniques

Posted on Jun 5, 2014 in News | 0 comments

Once you’ve got the basic skills in programming and statistics, the best way to learn data analysis is to do it. So, we’re developing a practical experience program for aspiring data analysts. The program is in the pilot phase with a small set of students, but you can read about how it works on the Internship Program’s wiki page.

The goal of the program is to build more experience in San Diego with answering questions with data, and to make that experience available to nonprofits, government agencies, journalists and other organizations that have questions that could be answered with data but don’t have the time or skills to do it themselves.

We’re currently recruiting participants, mentors and projects. So, if you’d like to develop data skills, teach data skills, or have a data project, let us know.



Read More

The Cost of Cleaning

Posted on May 7, 2014 in Commentary | 0 comments

We’ve frequently mentioned that people who work on data projects tell us that 80% of their project time is consumed by data preparation and cleaning, so it is interesting to get this data point from Kaggle:

(2) How long is a typical project?
When working with a top 0.5% data scientist, projects take just eight to 40 hours ($3k to $12k). Projects are finished in closer to eight hours for clean data and closer to 40 hours when the data requires cleaning.

So, in this anecdote, with some squinty-eyed interpretation, data cleaning requires 32 of the 40 hours. Dead on. And, by the way, that’s 32 hours at $300 per hour.
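The back-of-the-envelope arithmetic behind that interpretation can be spelled out; this is just a sketch of how the quoted Kaggle figures ($3k/8 hours for clean data, $12k/40 hours for dirty data) line up with the 80% claim, not any new data:

```python
# Figures quoted from the Kaggle anecdote above.
clean_hours, dirty_hours = 8, 40   # project length: clean vs. dirty data
dirty_cost = 12_000                # dollars, for the 40-hour case

# The extra time on a dirty-data project is attributed to cleaning.
cleaning_hours = dirty_hours - clean_hours       # hours spent cleaning
cleaning_share = cleaning_hours / dirty_hours    # fraction of the project
hourly_rate = dirty_cost / dirty_hours           # implied billing rate
cleaning_cost = cleaning_hours * hourly_rate     # dollars spent on cleaning

print(cleaning_hours, cleaning_share, hourly_rate, cleaning_cost)
```

That works out to 32 hours of cleaning, an 80% share of the project, $300 per hour, and $9,600 spent on cleaning alone.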

Fortunately, the library has a plan to reduce the cost of data cleaning and preparation.


Read More