Final Project - Tools for Data Science

In this project you will be working with R, SQL, and Python in the same document. We will use the data sets airlines and flights from the package nycflights13

Install the packages nycflights13, sqldf, tydiverse and load all data sets and packages. Take your time to understand the data sets.

# Install code here - comment the code when you installed

#install.packages("nycflights13")
#install.packages("sqldf")
#install.packages("tidyverse")

# Load all packages here

library(nycflights13)
library(sqldf)
library(tidyverse)

# Load the data here

data(airlines)
data(flights)

Question 1: List the name of airlines where the destination is ATL airport with their average arrival delays and sort them from the smallest to largest average arrival delays. Use data frames flights and airlines. We solve this question using R, SQL, and Python.

R solution

You can use the R command inner_join() to join the two tables.

library(dplyr)

flightsDF <- data.frame(flights) #loading in the data
airlinesDF <- data.frame(airlines)

joinedDF = inner_join(flightsDF, airlinesDF, by='carrier') %>% filter(dest == 'ATL') %>%  select(name,arr_delay) #joining data, filtering by destination and selecting name and arrival delay

joinedDF %>% group_by(name) %>%  summarise_each(funs(mean(., na.rm = TRUE))) %>% arrange(desc(arr_delay)) #grouping data by name and calculating the mean
## # A tibble: 7 x 2
##   name                        arr_delay
##   <chr>                           <dbl>
## 1 AirTran Airways Corporation    20.7  
## 2 ExpressJet Airlines Inc.       19.6  
## 3 Envoy Air                      14.0  
## 4 United Air Lines Inc.          10.5  
## 5 Delta Air Lines Inc.            7.42 
## 6 Southwest Airlines Co.          6.90 
## 7 Endeavor Air Inc.               0.857

SQL solution

Write your SQL query in the function sqldf(). For exmaple,sqldf("select * from relig_income") list the data frame relig_income.

# Example
# sqldf("select * from relig_income limit 3")
# code here


#selecting airline name and average delay
sqldf("
      Select airlns.name, AVG(flgt.arr_delay)
      FROM flights as flgt
      LEFT JOIN airlines as airlns --joining data on carrier
      on flgt.carrier = airlns.carrier
      WHERE flgt.dest == 'ATL' --filtering by destination
      GROUP BY airlns.name --grouping by airline
      ORDER BY AVG(flgt.arr_delay) DESC --ordering the data by arrival delays
      ")
##                          name AVG(flgt.arr_delay)
## 1 AirTran Airways Corporation          20.7445127
## 2    ExpressJet Airlines Inc.          19.6382850
## 3                   Envoy Air          14.0340045
## 4       United Air Lines Inc.          10.5000000
## 5        Delta Air Lines Inc.           7.4242250
## 6      Southwest Airlines Co.           6.8965517
## 7           Endeavor Air Inc.           0.8571429

Python solution

Use df.join() from pandas.

# load python libraries 

import pandas as pd
import statsmodels.api as stat

# load data
## C:\Users\clayt\Documents\R\win-library\4.1\reticulate\python\rpytools\loader.py:39: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
##   module = _import(
## C:\Users\clayt\CONDA\lib\site-packages\statsmodels\compat\pandas.py:61: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
##   from pandas import Int64Index as NumericIndex
flights_df = stat.datasets.get_rdataset("flights", "nycflights13").data
airlines_df = stat.datasets.get_rdataset("airlines", "nycflights13").data

# code here
joined_df = flights_df.set_index('carrier').join(airlines_df.set_index('carrier'), how='inner') #joining airlines and flight data

joined_df = joined_df[joined_df['dest'] == 'ATL'] #filtering the data for ATL as destination

joined_df.groupby('name')['arr_delay'].mean().reset_index().sort_values('arr_delay', ascending=False) #grouping by name and calculating the mean
##                           name  arr_delay
## 0  AirTran Airways Corporation  20.744513
## 4     ExpressJet Airlines Inc.  19.638285
## 3                    Envoy Air  14.034004
## 6        United Air Lines Inc.  10.500000
## 1         Delta Air Lines Inc.   7.424225
## 5       Southwest Airlines Co.   6.896552
## 2            Endeavor Air Inc.   0.857143

Question 2: Plot the boxplot of the departure delays vs the name of airlines where the destination is ATL airport (use the previous question). Solve this question using R and Python.

R solution

Use ggplot

# code here
library(ggplot2) #loading ggplot2

plot <- ggplot(data = joinedDF, aes(x = name, y = arr_delay, ymin = -500)) + #plotting data
  geom_boxplot() + 
  labs(title="Arrival Delays by Airline", #setting labels
       x="Airline",
       y = "Arrival Delays") +
  theme(axis.text.x = element_text(angle = 90)) #rotating x axis labels

plot

Python solution

You can use pandas.boxplot() or seaborn.boxplot().

# import python libraries
import matplotlib.pyplot as plt #importing library used to create box plot
# data

#the data utilizes the joined_df created on line 75

# Boxplot

plt.figure() #initializing figure
joined_df.boxplot(column=['arr_delay'], by="name",rot=90) #using the boxplot method to create a boxplot

#setting lables for the plot
plt.xlabel("Airline")
plt.ylabel("Arrival Delays")
plt.suptitle('')
plt.title("Arrival Delays by Airline")
plt.tight_layout()

plt.show() #displaying box plot