Limitations

We faced lots of challenges as we wrangled this data into a usable shape. Here are some of our adventures...

Cleaning the Data

Our two main datasets were Housing Violations and Student Addresses.

Cleaning the Housing Violations dataset took a little bit of fighting, as we found some latitude and longitude locations that didn't make sense for the Boston latitude and longitude range. Additionally, we edited the year ranges to match typical Boston College school years.
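As a rough illustration of those two steps, here is a minimal sketch in plain Python. The bounding box limits, the "latitude"/"longitude" field names, and the September start of the school year are all assumptions made for the example, not the values used in the project.

```python
from datetime import date

# Rough bounding box around Boston. These limits are illustrative
# assumptions, not the project's actual cutoffs.
BOSTON_LAT_RANGE = (42.20, 42.45)
BOSTON_LON_RANGE = (-71.20, -70.90)


def in_boston(lat, lon):
    """Return True when a coordinate pair falls inside the assumed box."""
    if lat is None or lon is None:
        return False
    lat_ok = BOSTON_LAT_RANGE[0] <= lat <= BOSTON_LAT_RANGE[1]
    lon_ok = BOSTON_LON_RANGE[0] <= lon <= BOSTON_LON_RANGE[1]
    return lat_ok and lon_ok


def drop_out_of_range(rows):
    """Keep only rows whose coordinates fall inside the assumed box.

    Each row is a dict with hypothetical 'latitude' and 'longitude' keys.
    """
    return [r for r in rows if in_boston(r.get("latitude"), r.get("longitude"))]


def school_year(d, start_month=9):
    """Label a date with its school year, assuming the year starts in September."""
    start = d.year if d.month >= start_month else d.year - 1
    return f"{start}-{start + 1}"
```

With these assumptions, a violation dated March 2023 would be labeled "2022-2023", and a row at latitude 0, longitude 0 would be dropped.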

But for Student Addresses, we had to work a lot harder. The Student Addresses dataset is generated every year from compliance forms submitted by colleges, in which students living off campus list their addresses along with other statuses (e.g. part-time or full-time).

However, we found enormous variability in how addresses were entered, and unlike our Housing Violations dataset, Student Addresses didn't have precise latitude and longitude coordinates. Rianna cleaned the columns and separated them as well as possible for the API, but our cleaning method still missed some cases. That method matched likely spellings against a dictionary (for example, replacing "str", "str.", or "st." with "street") and split out long entries placed in the wrong column (for example, a "Commonwealth Avenue Apt 5" in the Street Name column that should be separated into Street Name, Prefix, and Unit). As a result, some of our data did not translate well to the API.

For example, in the 2022-2023 and 2023-2024 school years, around 1440 of the 11000+ student addresses had to be discarded because the API did not return a latitude and longitude for them. More on this can be found in the "Colab API Running File" in the GitHub repository.
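To make the dictionary matching and column splitting more concrete, here is a minimal sketch of that kind of cleaning. The lookup table, the regular expression, and the function names are hypothetical, and treating "Apt" as the Prefix is our assumption for the example. The actual cleaning code lives in the project notebooks.

```python
import re

# Hypothetical lookup of street-type spellings; the project's real
# dictionary is not reproduced here.
STREET_TYPES = {
    "st": "Street", "str": "Street", "street": "Street",
    "ave": "Avenue", "av": "Avenue", "avenue": "Avenue",
    "rd": "Road", "road": "Road",
}

# A unit marker ("Apt", "Unit", "#", ...) followed by a unit id at the
# very end of the text, e.g. "Apt 5" or "#3".
UNIT_PATTERN = re.compile(
    r"\s*(\b(?:apt|apartment|unit|ste|suite)\b\.?|#)\s*(\w+)\s*$",
    re.IGNORECASE,
)


def split_unit(raw):
    """Split a trailing unit off a street line: (street, prefix, unit)."""
    text = raw.strip()
    match = UNIT_PATTERN.search(text)
    if match:
        return text[:match.start()].strip(), match.group(1), match.group(2)
    return text, None, None


def normalize_street(street):
    """Expand an abbreviated street type, looking only at the last word."""
    words = street.split()
    if not words:
        return street
    key = words[-1].rstrip(".").lower()
    if key in STREET_TYPES:
        words[-1] = STREET_TYPES[key]
    return " ".join(words)


def clean_address(raw):
    """Return a dict with the normalized street, unit prefix, and unit."""
    street, prefix, unit = split_unit(raw)
    return {"street": normalize_street(street), "prefix": prefix, "unit": unit}
```

Only the last word is checked against the dictionary so that a name like "St. James Ave" becomes "St. James Avenue" rather than "Street James Avenue". Entries like "Apt5" (no space) are not matched by this pattern, which is the kind of small case a dictionary-style approach tends to miss.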



Further Steps

Interactive Graphs

Given our limited time and experience, we used static graphs to plot the locations of students and housing violations. However, it would be much more useful in the future to use something like Folium to interact with a real map and zoom in on specific addresses, especially since that could help further contextualize neighborhoods or other key features like school locations.
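A minimal sketch of what that could look like is below. It was not part of our project. The point format (latitude, longitude, label), the colors, and the approximate Boston center coordinates are assumptions for illustration, and Folium is a third-party package that would need to be installed separately.

```python
# Approximate center of Boston, used only to position the initial view.
BOSTON_CENTER = (42.3601, -71.0589)


def marker_specs(students, violations):
    """Flatten both point lists into (lat, lon, label, color) tuples.

    Each input item is a hypothetical (lat, lon, label) tuple.
    """
    specs = [(lat, lon, label, "blue") for lat, lon, label in students]
    specs += [(lat, lon, label, "red") for lat, lon, label in violations]
    return specs


def build_map(specs, center=BOSTON_CENTER, zoom_start=12):
    """Draw one circle marker per spec on an interactive Folium map."""
    # Imported here so marker_specs can be used without Folium installed.
    import folium

    fmap = folium.Map(location=list(center), zoom_start=zoom_start)
    for lat, lon, label, color in specs:
        folium.CircleMarker(
            location=[lat, lon],
            radius=3,
            color=color,
            fill=True,
            popup=label,
        ).add_to(fmap)
    return fmap
```

Calling `build_map(marker_specs(students, violations)).save("map.html")` would write an HTML file that can be opened in a browser, panned, and zoomed.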

Dealing with an API

We used the Geocoding API (https://geocode.maps.co/docs/), which offers a free plan for geocoding addresses from the human-readable format to latitude and longitude. However, the free plan restricts the number of requests per second, and to accommodate this, our API code often ran for hours upon hours.

Finding a more accurate API, or one that doesn't put higher requests-per-second limits behind a paywall, could greatly speed up converting the addresses to plottable points. Form creators should also take our example as motivation to demand more precision from form submitters, as that would greatly decrease the time that data scientists spend on cleaning, especially for qualitative data.
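For readers who want to see why a per-second limit turns into hours of runtime, here is a minimal sketch of a throttled geocoding loop. It is not our notebook code. The endpoint path, the `q` and `api_key` parameter names, the `lat`/`lon` response fields, and the one-second interval are assumptions that should be checked against the API docs linked above.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint; confirm against the provider's documentation.
GEOCODE_URL = "https://geocode.maps.co/search"


def fetch_geocode(address, api_key):
    """Call the geocoder once and return the parsed JSON list of matches."""
    query = urllib.parse.urlencode({"q": address, "api_key": api_key})
    with urllib.request.urlopen(f"{GEOCODE_URL}?{query}", timeout=30) as resp:
        return json.load(resp)


def geocode_all(addresses, fetch, min_interval=1.0,
                sleep=time.sleep, clock=time.monotonic):
    """Geocode addresses one at a time, spacing calls min_interval apart.

    `fetch` takes an address and returns a list of match dicts.
    Returns (found, failed): a dict of address -> (lat, lon), and a list
    of addresses that produced no match and would have to be discarded.
    """
    found, failed = {}, []
    last_call = None
    for address in addresses:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)  # stay under the assumed rate limit
        last_call = clock()
        try:
            matches = fetch(address)
        except OSError:  # network and HTTP errors from urllib
            matches = []
        if matches:
            best = matches[0]
            found[address] = (float(best["lat"]), float(best["lon"]))
        else:
            failed.append(address)
    return found, failed
```

A real run would pass something like `lambda a: fetch_geocode(a, api_key)` as `fetch`. If the limit were one request per second, 11000 addresses would need at least 11000 seconds, or roughly three hours, for a single school year before any retries.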

Additional Violation Types

We dissected the main Housing Violations reports in Boston, but it would be worth adding other violation types, like 311 calls in Boston. Additionally, we found that certain violations seem to be concentrated in the same areas consistently across different years, so learning more about how a violation is evaluated and why certain violations cluster in certain areas would be key to explore further.