The Data Library works primarily with journalists and nonprofits, but until recently, I hadn’t fully realized how different the processes are in these two environments. We’d been following two different processes without having names for them. Naming the two contexts is worthwhile, because it lets us be sure we are working in a process that is comfortable for our clients and partners.
In the Journalist / Exploratory context, the journalist is looking for a story in data, or wants to use data to support a story idea. In either case, the dataset is novel; there is no pre-existing format to follow, and most times, neither the journalist nor the analyst has worked with the data before.
In this context, we follow a light, fast process that produces a lot of graphs and tables to look at different angles. The outputs are very rough, so the plots aren’t properly labeled and table columns can be cryptic. The goal is to sift through the data quickly to find a few gems, and polish them later. We don’t want to spend a lot of time making plots look good if they will never be used.
In this context, we work directly from IPython and plotly and document the process in Google Docs. We share the IPython Notebooks with the journalist. Like this one, they are complicated, ugly, and not at all suitable for publication, but they are very useful for exploring ideas quickly.
The other context is the Nonprofit / Reporting context, where an organization has a fairly specific goal, most often a well-defined report. In this context, projects run most smoothly when we start with a copy of a previous report, discuss changes to it, and use it as a template. If there isn’t a previous report, we’ll mock one up in Excel first.
In the reporting context, the client doesn’t want to see the rough work, and the IPython notebooks are very confusing and distracting, so we usually share Excel files for data, and load the data into Tableau for presentation and discussion. The Tableau workbooks are much easier to understand, and a whole lot more attractive. Here is an example of a Tableau workbook organized for reporting.
There are other contexts as well, one for programming projects and another for producing datasets for later analysis, but these are the two where proper communication about which context will be used for a project has the most impact on project success.
Last week the Library worked with Wendy Fry at NBC7 to analyze GPS records of street sweeping and parking violations. We haven’t got the data online yet — it’s about 3GB, 11M GPS records — but let us know if you are interested in accessing it. A condensed version of the data is available as a heatmap of blocks where tickets were issued on days the streets were not swept.
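For readers curious what "tickets issued on days the streets were not swept" could look like in code, here is a minimal sketch. It is not the analysis we ran for NBC7: the block names, dates, and the `unswept_ticket_counts` helper are all made up for illustration, and it assumes the GPS records have already been reduced to one (block, date) pair per sweep.

```python
from collections import Counter
from datetime import date

# Hypothetical sample records; the real dataset is not reproduced here.
# Each sweep is (block, date swept); each ticket is (block, date issued).
sweeps = {
    ("100 Main St", date(2015, 9, 1)),
    ("200 Oak Ave", date(2015, 9, 2)),
}
tickets = [
    ("100 Main St", date(2015, 9, 1)),   # block was swept that day
    ("100 Main St", date(2015, 9, 8)),   # no sweep record that day
    ("200 Oak Ave", date(2015, 9, 9)),   # no sweep record that day
    ("200 Oak Ave", date(2015, 9, 16)),  # no sweep record that day
]


def unswept_ticket_counts(tickets, sweeps):
    """Count tickets per block issued on days with no sweep record for that block."""
    return Counter(block for block, day in tickets if (block, day) not in sweeps)


counts = unswept_ticket_counts(tickets, sweeps)
for block, n in counts.most_common():
    print(block, n)
```

Per-block counts like these are the kind of input a heatmap needs; turning 11M raw GPS points into per-block sweep days is the harder step and is not shown.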
Big Data Hackathon on Oct 3 at 9AM at SDSU!
On October 3, the Center for Human Dynamics in the Mobile Age will host a Big Data Hackathon at San Diego State University. Contestants will use data analysis and programming to solve civic problems related to water conservation, disaster recovery and crime prevention. Visit the Devpost website for details and to register, or hit GitHub for the tech details and data.
Finally, our legislators are getting below the surface of the Open Data issue and addressing one of the deeper planes: Open Data is nearly useless when it is delivered in PDF. To address this problem, our very own Brian Maienschein (well, the inland “us”) has introduced AB 169, which mandates that when agencies publish open data, it is published in a machine-readable format. Yea! One prayer answered! If you are in district 77, send Maienschein some love; if not, tell your assembly person to get with the...read more
We completed our 2015 Data Contest with final presentations and winners at the awards ceremony on Tuesday. Here are the winners and their presentations:
UCSD MAS Data Science, Time and Space Analysis of Food Distribution: Presentation, Narrative
irHacker, California Suspensions: Presentation
Flash and Shadow, A Visual Geographical Study on Location, Availability, Public Transportations and Crime Exposure: Presentation, Narrative
A Mathematical Modeling Team, Are Some Teachers Just “Meaner” than...read more
We completed the 2015 SDSU Data Contest on Saturday, with a fantastic collection of excellent submissions. The judges are reviewing them now, but until you learn the winners at the Awards Ceremony on Tuesday, you can see all of the submissions here. Everyone is welcome at the Awards Ceremony, so follow the link to register.
UCSD MAS Data Science: Presentation, Narrative
LanguageExploration: Presentation, Narrative
Flash and Shadow: Presentation, Narrative, Other: KML File for Drawing on Google Maps, Other: Interactive Map for Different Types of...read more
Here is the schedule for the data contest hangouts this week. The next one is Monday at noon, and will present tips on accessing data from the Ambry databases. We’ll be broadcasting these events via Hangouts on Air. Visit our Google+ Page to attend the...read more
The SDSU Data Contest Kickoff is tomorrow! Better register if you haven’t already. Here are all of the last-minute details.
Time and Location: Petersen Gym, SDSU, Room 153, 1:00 PM
Parking: Parking Structure 5 (PS5)
Full Schedule.
Before: Register for the contest web app at http://sandiegodata.org/contest/register.
Bring: Laptop
Staff Contacts: Eric Busboom, firstname.lastname@example.org, 619 363 2607; Gonzalo Urruita, email@example.com
Also, our Twitter hashtag is...read more
To get ready for the Data Contest, make sure your laptop already has all of the tools you’ll need installed. There is a set of tools that we use in most of our programs, and it will serve as a good base for your contest toolkit. These tools are:
Tableau Public, for quick yet attractive visualizations.
QGIS, for open source geographic analysis and maps.
The Anaconda Python distribution, to get IPython Notebook and many other important Python libraries.
RStudio Desktop, an excellent R environment.
Open Refine...read more
The SDSU / Data Library Data Contest has teamed up with UCSD, the Python User Group and several Data Science user groups to offer a full-day event with two morning tutorials (R and Python), a mid-day exhibition with many Data Science projects and software demos, an afternoon Machine Learning challenge and the kickoff of the SDSU / Data Library Data Contest. Visit the signup page to join the contest, learn more about data science, and have a chance to win part of the $2,100 in prizes. Visit the Conference Eventbrite page to register...read more