Analysis of MTA Data for WomenTechWomenYes Annual Gala

Aisulu Omar
Jul 10, 2019

This article is a brief account of my first data science project at Metis, which I completed with three teammates.

Problem: WomenTechWomenYes (WTWY) holds a gala at the beginning of each summer to increase women’s engagement in tech. For marketing purposes, they plan to place street teams at subway station entrances to collect email addresses. Those who sign up are sent free tickets to the gala.

Goal: As data scientists, our goal is to recommend where to place the street teams so that they collect sign-ups from people who care about the cause and are able to donate.

Data Gathering.

In this project we used two data sources, both publicly available:

NYC MTA Turnstile Data. This data provides information on turnstile usage. We scraped it from the MTA website and focused on June 2019. This is the head of the MTA dataset.
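As a rough illustration of the scraping step (not the project's actual code), the sketch below lists one file URL per week of June 2019. It assumes the MTA published one text file per week, named `turnstile_YYMMDD.txt` after a Saturday date, under the base URL shown. Both the naming scheme and the URL are assumptions that should be checked against the MTA site, and `BASE_URL` and `weekly_file_urls` are illustrative names.

```python
from datetime import date, timedelta

# Assumed location and naming of the weekly files; verify before use.
BASE_URL = "http://web.mta.info/developers/data/nyct/turnstile/"


def weekly_file_urls(start: date, end: date) -> list[str]:
    """Return one URL per Saturday between start and end, inclusive."""
    # Move forward to the first Saturday on or after `start`
    # (date.weekday() returns 0 for Monday and 5 for Saturday).
    current = start + timedelta(days=(5 - start.weekday()) % 7)
    urls = []
    while current <= end:
        urls.append(f"{BASE_URL}turnstile_{current:%y%m%d}.txt")
        current += timedelta(days=7)
    return urls


june_urls = weekly_file_urls(date(2019, 6, 1), date(2019, 6, 30))
```

Each URL could then be passed to `pandas.read_csv` and the weekly frames combined with `pandas.concat`.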

Mean and Median Household Income from Michigan Population Studies Center. The only variable we needed from this dataset was the median household income for each zip code. I merged the two datasets by connecting zip codes to the top 20 stations. Here is the head of the merged data.

Data Cleaning.

NYC MTA Turnstile Data. One of the most time-consuming parts of the project was cleaning the MTA data. Below are a few important points about that cleaning.

  • The entries and exits columns hold cumulative counts, so I had to calculate the difference between consecutive rows to get the number of daily entries for every station.
  • I merged the Date and Time columns, and created a new column with the day of the week.
  • Each day of the week appeared on four dates in the dataset, except Sunday, which appeared on only three. I had to take this into account when performing calculations.
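The first two cleaning steps can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the project's code. The column names (`C/A`, `UNIT`, `SCP`, `STATION`, `DATE`, `TIME`, `ENTRIES`) are an assumption about the layout of the MTA files, and dropping negative differences as counter resets is also an assumption rather than something described above.

```python
import pandas as pd

# Toy rows in the assumed layout of the MTA files; the counts are made up.
raw = pd.DataFrame({
    "C/A": ["A002"] * 4,
    "UNIT": ["R051"] * 4,
    "SCP": ["02-00-00"] * 4,
    "STATION": ["59 ST"] * 4,
    "DATE": ["06/04/2019", "06/04/2019", "06/05/2019", "06/05/2019"],
    "TIME": ["08:00:00", "20:00:00", "08:00:00", "20:00:00"],
    "ENTRIES": [1000, 1400, 1500, 2100],
})

# Merge the date and time columns into a single timestamp.
raw["DATETIME"] = pd.to_datetime(
    raw["DATE"] + " " + raw["TIME"], format="%m/%d/%Y %H:%M:%S"
)

# Counters are cumulative per turnstile, so take the difference between
# consecutive readings within each turnstile.
turnstile = ["C/A", "UNIT", "SCP", "STATION"]
raw = raw.sort_values(turnstile + ["DATETIME"])
raw["ENTRY_DIFF"] = raw.groupby(turnstile)["ENTRIES"].diff()

# Drop each turnstile's first reading (no previous value) and, as an
# assumption, negative jumps that would indicate a counter reset.
clean = raw[raw["ENTRY_DIFF"] >= 0].copy()

# Sum per station and calendar day, then label the day of the week.
clean["DAY"] = clean["DATETIME"].dt.normalize()
daily = clean.groupby(["STATION", "DAY"], as_index=False)["ENTRY_DIFF"].sum()
daily["WEEKDAY"] = daily["DAY"].dt.day_name()
```

Note that each difference is attributed to the date of the later reading, so overnight traffic counts toward the following day in this sketch.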

Mean and Median Household Income from Michigan Population Studies Center. We selected the 20 busiest stations and queried the Yelp API to get the zip code for every station. Once the zip codes were added, we were able to merge the income data with the MTA data and find the median household income for every station.
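A minimal sketch of this merge, assuming a station-to-zip lookup table has already been built. The table contents, column names, and income values are placeholders for illustration, not figures from the real datasets.

```python
import pandas as pd

# Hypothetical station-to-zip lookup (in the project, zip codes came from Yelp).
stations = pd.DataFrame({
    "STATION": ["STATION A", "STATION B", "STATION C"],
    "ZIP": ["10001", "10017", "10003"],
})

# Placeholder income values, not figures from the real dataset.
income = pd.DataFrame({
    "ZIP": ["10001", "10017"],
    "MEDIAN_INCOME": [80000, 100000],
})

# A left join keeps every station; indicator=True adds a _merge column
# that flags zip codes with no matching income row.
merged = stations.merge(income, on="ZIP", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "STATION"].tolist()
```

Storing zip codes as strings avoids losing leading zeros, and checking `unmatched` makes it visible when a station silently ends up without an income value.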

Exploratory Data Analysis.

Question 1. What is the busiest day of the week?

  • The busiest day of the week is Wednesday.
  • The least busy day of the week is Sunday.
  • The average number of entries varies little across the weekdays.

Here is another plot that supports the above observations. The highest spikes are on Wednesdays (06/05, 06/12, 06/19, 06/26), and the drops are on Sundays.
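The day-of-week comparison can be sketched as below. Because some weekdays cover fewer dates than others, the example divides each weekday's total by the number of distinct dates it covers rather than comparing raw totals. The daily totals and column names are made up for illustration.

```python
import pandas as pd

# Made-up daily totals: two Wednesdays but only one Sunday.
daily = pd.DataFrame({
    "DAY": pd.to_datetime(["2019-06-05", "2019-06-12", "2019-06-09"]),
    "ENTRIES": [600, 800, 300],
})
daily["WEEKDAY"] = daily["DAY"].dt.day_name()

# Divide each weekday's total by the number of distinct dates it covers,
# so a weekday with fewer dates is not penalized.
per_weekday = daily.groupby("WEEKDAY").agg(
    total=("ENTRIES", "sum"),
    n_dates=("DAY", "nunique"),
)
per_weekday["avg_entries"] = per_weekday["total"] / per_weekday["n_dates"]

busiest = per_weekday["avg_entries"].idxmax()
```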

Question 2. What are the busiest stations in NYC?

Here are the five busiest stations:

  • 34th St - Penn Station
  • 42nd St - Grand Central
  • 34th St - Herald Square
  • 14th St - Union Sq
  • 42nd St - Times Sq

Observations:

  • The stations are located near the center of Manhattan (Midtown Area).
  • There are many major restaurants, landmarks, colleges, and companies in this area.
  • These locations are the ideal targets for street teams to advertise the event.
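A ranking like the one above could be computed by summing entries per station over the whole period and keeping the largest totals. This is a minimal sketch on made-up numbers with placeholder station names, not the project's code.

```python
import pandas as pd

# Made-up daily entries per station; the station names are placeholders.
daily = pd.DataFrame({
    "STATION": ["A", "A", "B", "B", "C", "C"],
    "ENTRIES": [500, 700, 900, 800, 100, 200],
})

# Sum entries over the whole period, then keep the n busiest stations.
totals = daily.groupby("STATION")["ENTRIES"].sum()
top_stations = totals.nlargest(2)
```

If station names are not unique in the raw data (for example, the same street name on different lines), grouping by station name together with a line identifier would be a safer choice.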

Question 3. What is the median household income around the busiest stations?

The graph includes the stations where the mean and median household income both exceed $70,000. Among the stations we analyzed, Grand Central has the highest average income.
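The income filter can be sketched as follows, reading the criterion as "both the mean and the median exceed the threshold", which is an interpretation. The values and column names are placeholders, not the real figures.

```python
import pandas as pd

# Placeholder values, not the real figures from the income dataset.
station_income = pd.DataFrame({
    "STATION": ["STATION A", "STATION B", "STATION C"],
    "MEAN_INCOME": [120000, 65000, 90000],
    "MEDIAN_INCOME": [95000, 60000, 72000],
})

THRESHOLD = 70_000

# Keep stations whose mean and median income both exceed the threshold,
# ordered from highest to lowest median income.
above = station_income[
    (station_income["MEAN_INCOME"] > THRESHOLD)
    & (station_income["MEDIAN_INCOME"] > THRESHOLD)
].sort_values("MEDIAN_INCOME", ascending=False)
```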

Summary. Based on our analysis, we recommend that WomenTechWomenYes deploy street teams on Wednesdays at 42nd St - Grand Central to best reach their target audience. As secondary targets, we recommend deploying street teams on weekdays other than Monday at 34th St - Penn Station, 14th St - Union Sq, and 34th St - Herald Square.

Here is the link to the GitHub page to see the full code for the project.
