Learn Data Analysis Techniques

Posted on Jun 5, 2014 in News

Once you’ve got the basic skills in programming and statistics, the best way to learn data analysis is to do it. So, we’re developing a practical experience program for aspiring data analysts. The program is in the pilot phase with a small set of students, but you can read about how it works on the Internship Program’s wiki page.

The goal of the program is to develop more experience with answering questions with data in San Diego, and to make that experience available to nonprofits, government agencies, journalists and other organizations that have questions that could be answered with data but don’t have the time or skills to answer them.

We’re currently recruiting participants, mentors and projects. So, if you’d like to develop data skills, teach data skills, or have a data project, let us know.



The Cost of Cleaning

Posted on May 7, 2014 in Commentary

We’ve frequently mentioned that people who work on data projects tell us that 80% of their project time often goes to data preparation and cleaning, so it is interesting to get this data point from Kaggle:

(2) How long is a typical project?
When working with a top 0.5% data scientist, projects take just eight to 40 hours ($3k to $12k). Projects are finished in closer to eight hours for clean data and closer to 40 hours when the data requires cleaning.

So, in this anecdote, with some squinty-eyed interpretation, data cleaning requires 32 out of 40 hours, or 80%. Dead on. And, by the way, that’s 32 hours at $300 per hour.
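The arithmetic behind that reading can be sketched in a few lines of Python. The hours and dollar figures are the ones quoted from Kaggle; the per-hour rate is derived from the 40-hour, $12k end of the range, and attributing the whole 32-hour difference to cleaning is the squinty-eyed assumption.

```python
# Figures quoted from Kaggle above; everything else is derived from them.
clean_hours = 8        # typical project when the data is already clean
dirty_hours = 40       # typical project when the data needs cleaning
dirty_cost = 12_000    # dollars, upper end of the quoted price range

# Assumption: the entire difference in hours is spent on cleaning.
cleaning_hours = dirty_hours - clean_hours       # 32
cleaning_share = cleaning_hours / dirty_hours    # 0.8
hourly_rate = dirty_cost / dirty_hours           # 300.0
cleaning_cost = cleaning_hours * hourly_rate     # 9600.0

print(f"{cleaning_hours} h, {cleaning_share:.0%} of the project, ${cleaning_cost:,.0f}")
```

At that rate, the cleaning portion alone comes to $9,600 of the $12,000 project.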

Fortunately, the library has a plan to reduce the cost of data cleaning and preparation.



Burglary Rhythm Maps

Posted on Apr 18, 2014 in Analysis

A Rhythm Map is a heat map that displays time in both the X and Y dimensions. Rhythm Maps are an excellent way to visualize repeating patterns in time, such as how crimes occur by hour and day of week. Here we look at some interesting patterns in burglaries in the City of San Diego.

First, here is the map for a range of crime types in San Diego, compiled from the type, time and date of about 400K crime incidents in the City of San Diego from 2006 to 2012.

Figure: Rhythm Map, all crimes in San Diego

Each square is a crime type. The vertical axis is the hour of the day, and the horizontal axis is the day of the week, with Sunday being the cell between 0 and 1. Darker red means there are more crimes than lighter red and yellow. The colors are not comparable across squares, only within a square. So, the dark red cell at 5:00PM on Friday in the Burglary square may represent a very different number of crime incidents than the dark red cell at 12:00AM on Thursday in Sex Crimes. Also note that these views combine citations, arrests and reported crimes, and there may be different patterns when the maps are broken out on that factor.
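For readers who want to build a similar grid, here is a minimal Python sketch of the counting step. It is not the code used for the maps in this post: the function name, the record layout and the sample incidents are all hypothetical. It scales each crime type by its own busiest cell, which is one way to get colors that are comparable only within a square.

```python
from collections import defaultdict
from datetime import datetime

def rhythm_grid(incidents):
    """Return {crime_type: 24 x 7 grid of intensities in [0, 1]}.

    `incidents` is an iterable of (crime_type, datetime) pairs.
    Rows are hours 0-23; columns are days of the week with Sunday = 0.
    """
    counts = defaultdict(lambda: [[0] * 7 for _ in range(24)])
    for crime_type, when in incidents:
        # datetime.weekday() has Monday = 0; shift so that Sunday = 0
        day = (when.weekday() + 1) % 7
        counts[crime_type][when.hour][day] += 1

    grids = {}
    for crime_type, grid in counts.items():
        peak = max(max(row) for row in grid)
        # Scale by this type's own busiest cell, so intensities are
        # comparable within a square but not across squares.
        grids[crime_type] = [[c / peak for c in row] for row in grid]
    return grids

# Hypothetical sample records, for illustration only
sample = [
    ("BURGLARY", datetime(2012, 3, 2, 17, 5)),    # a Friday
    ("BURGLARY", datetime(2012, 3, 9, 17, 40)),   # a Friday
    ("BURGLARY", datetime(2012, 3, 5, 10, 15)),   # a Monday
    ("DUI", datetime(2012, 3, 4, 1, 30)),         # a Sunday
]
grids = rhythm_grid(sample)
```

In this sample, the Friday 5:00PM burglary cell and the single DUI cell both get the darkest value, even though they stand for two incidents and one incident respectively. That is exactly why dark cells in different squares should not be compared.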

There are a lot of interesting patterns here, but we’ll focus on Burglary. The first thing to notice is there are two time ranges, groups of darker red cells, when burglaries occur: during the work hours on weekdays and on Friday evenings. (The strong line at noon is most likely an artifact of crimes for which the time is not known being given that value arbitrarily.)

What accounts for the two separate time ranges? First, let’s break it out by community. This chart uses Clarinova Place Codes for the community names.


Here we see that some communities exhibit one pattern or the other, and sometimes both. Downtown (SanDOW), La Jolla (SanLAJ) and Mira Mesa (SanMIR) show the Friday pattern, while Southeastern (SanSOT), Greater North Park (SanGRE) and Midtown (SanMID) show the weekday pattern.

Community distinctions may explain some of the differences in the patterns, but there is a factor that is probably more important: residential vs commercial crime. So, let’s split out the maps on that factor.

Here is where the distinctions become the strongest. In Otay Mesa (SanOAT), Mira Mesa (SanMIR), University (SanUNV) and others, the Friday evening pattern completely splits from the weekday pattern. However, we also see a new weekday pattern in the commercial burglaries in Clairemont (SanCLA), Uptown and Midtown, with commercial burglaries occurring across the weekday evenings.

Those features are consistent with exactly what you’d expect from burglary: the burglaries occur when businesses and homes are unoccupied. But that doesn’t explain why, in many communities, the commercial crimes would occur more frequently on Friday evenings. Another unusual pattern is that in Pacific Beach (SanPCF) there is a residential burglary cluster on Friday and Saturday evenings, with a similar but weaker pattern occurring in Uptown and College.

Rhythm Maps are a powerful way to look for patterns in time-structured data, because they take advantage of the ways that human brains most quickly process visual information. However, they aren’t a complete solution; they are just a start. Before making any recommendations based on the data, we’d want to do a few statistical tests and, at the least, look at the absolute number of incidents per cell in the areas exhibiting patterns.
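As one illustration of such a check, the sketch below computes a Pearson chi-square statistic for the absolute counts in a row of cells against an even spread across the week. The choice of test is ours for illustration, and the counts are hypothetical, not taken from the San Diego data.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic for `counts` against an even spread.

    Returns (statistic, degrees_of_freedom).
    """
    total = sum(counts)
    expected = total / len(counts)
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, len(counts) - 1

# Hypothetical burglary counts for one hour of the day, Sunday through Saturday
by_day = [12, 30, 28, 31, 29, 45, 14]
statistic, dof = chi_square_uniform(by_day)
print(f"chi-square = {statistic:.1f} with {dof} degrees of freedom")
```

With these made-up counts the statistic is about 27.7, well above the 5% critical value of roughly 12.59 for 6 degrees of freedom, so an even spread would be rejected. With only a handful of incidents per cell, the same colors could easily be noise, which is why the absolute numbers matter.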



Day/Time Crime Heatmaps

Posted on Dec 21, 2013 in Analysis

Here is an interactive data application that explores how crime incidents vary over the day of week and time of day. In the checkbox below, select one or more crime types, and the heatmap will show the relative intensity of those crime types over day of week and time of day.

There are many interesting patterns here, some you would expect, some you might not:

Things you might expect:

  • DUI and Drugs violations are primarily committed in the evening and early mornings on weekends.
  • Assaults are most frequent in the late evenings and early mornings on weekends.
  • Vehicle theft and break-ins are committed in the dark.
  • Burglary is committed during the day, while people are at work.

There are also a few interesting surprises:

  • Sex crimes, which are mostly prostitution, peak on Thursday evenings.
  • Homicides are spread throughout the week rather than being tied to nightlife.
  • Fraud occurs almost entirely at noon or midnight. This is almost certainly a data-collection issue, not the actual time the crimes occurred.
  • Weapons violations are primarily in the middle of the week.
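The noon-or-midnight effect noted for Fraud is easy to screen for before trusting a heatmap. Here is a minimal Python sketch, with a hypothetical function name and made-up timestamps, that measures what share of incident times fall exactly on 12:00 or 00:00.

```python
from datetime import datetime

def placeholder_share(times):
    """Fraction of timestamps that fall exactly on 12:00 or 00:00."""
    times = list(times)
    if not times:
        return 0.0
    flagged = sum(
        1 for t in times
        if t.hour in (0, 12) and t.minute == 0 and t.second == 0
    )
    return flagged / len(times)

# Hypothetical fraud incident times, for illustration only
fraud_times = [
    datetime(2012, 5, 1, 12, 0),
    datetime(2012, 5, 2, 0, 0),
    datetime(2012, 5, 3, 12, 0),
    datetime(2012, 5, 4, 15, 42),
]
share = placeholder_share(fraud_times)
print(f"{share:.0%} of times are exactly noon or midnight")
```

If times were spread evenly over the 1,440 minutes of a day, only about 0.14% would land on those two marks, so a large share suggests the value was filled in when the real time was unknown.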

This sort of heatmap is a really powerful way to visualize complex relationships quickly, although it also hides a lot of other interesting features. For instance, crime varies considerably by location, so a valuable extension of this analysis would be to include checkboxes for selecting neighborhoods.

This application was built using R and Shiny. If you’d like to learn to develop this sort of application for your own site, the Library is considering running a training class. If you are interested, let us know.


Drugs and Alcohol in Downtown and East Village

Posted on Nov 12, 2013 in New Data, Projects

For the last few months, a team of geography students at SDSU has been working with the crime data provided by the Library, producing analyses and visualizations of the data.

Elias Issa has been looking at Drugs and Alcohol violations in Downtown San Diego and East Village. He writes:

The Hot Spot tool calculates the Gi* statistic for each feature in a dataset. The resultant z-scores and p-values tell you where features with either high or low values cluster spatially. To be a statistically significant hot spot, a feature will have a high value and be surrounded by other features with high values as well. My animated maps illustrate 2 hot spots from 2007 to 2012. The most significant hot spot is located close to the St. Vincent de Paul homeless shelter (Imperial Ave) in the southeastern part of East Village. The second hot spot is located north of the Gaslamp Quarter, between Broadway and Market, where most of the famous and popular bars are found. In addition to those maps, and based on those hot spots, I did some statistical analysis to show the average of Drugs/Alcohol violations monthly (above/below average) and yearly. My study reveals that there is a slight increase in 2011 and 2012 in the average of Drugs/Alcohol violations.

Over the course of the project, he has been experimenting with various ways to visualize the time component of geographic data. This is quite difficult, since you can’t easily scan the time dimension like you can in space. Visual processing is tuned for noticing changes and differences — like a deer that won’t notice you if you don’t move — so Elias’ visualization is best for quickly identifying areas that deserve more analysis, rather than showing the quantitative differences.

Due to the difficulty of creating animations like this in ArcMap, the video has only one frame per year, but that is enough to illustrate how the changes from frame to frame draw your eye to problem areas. Without a visualization like this, it is easy to miss some of the most important features of an issue, including short-term spikes and long-term trends.

After identifying an area to focus on through the visualization, Elias’ underlying statistical method serves to quantify the differences between times or locations, so this project is a great example of a way to use animation to partition a large problem space into components that can be analyzed in detail.
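To make the Gi* statistic from Elias’ description more concrete, here is a minimal Python sketch of the standard Getis-Ord Gi* z-score. Elias used the Hot Spot tool in ArcMap, so this is not his code or the tool’s implementation; the function name, the binary neighbor weights and the five feature values are hypothetical.

```python
from math import sqrt

def getis_ord_gi_star(values, weights):
    """Getis-Ord Gi* z-score for each feature.

    values  -- list of n feature values (for example, violation counts)
    weights -- n x n spatial weights; weights[i][i] should be non-zero,
               because Gi* counts a feature as part of its own neighborhood
    """
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation of all feature values
    s = sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        w = weights[i]
        w_sum = sum(w)
        w_sq_sum = sum(x * x for x in w)
        # Observed weighted local sum minus its expected value
        numerator = sum(wj * xj for wj, xj in zip(w, values)) - mean * w_sum
        denominator = s * sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores

# Hypothetical example: five features in a row, each a neighbor of itself
# and of the features immediately beside it (binary weights).
values = [1, 2, 10, 12, 11]
weights = [[1 if abs(i - j) <= 1 else 0 for j in range(5)] for i in range(5)]
scores = getis_ord_gi_star(values, weights)
print([round(z, 2) for z in scores])
```

High positive scores mark features with high values surrounded by other high values (hot spots), and strongly negative scores mark cold spots. The sketch does not guard against degenerate inputs, such as all values being equal or every feature neighboring every other, where the denominator is zero.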

