Final Project - Tools for Data Science
Clayton Carpenter
03/03/2022
In this project you will be working with R, SQL, and Python in the same document. We will use the data sets airlines and flights from the package nycflights13
Install the packages nycflights13
, sqldf
, tydiverse
and load all data sets and packages. Take your time to understand the data sets.
# Install code here - comment the code when you installed
#install.packages("nycflights13")
#install.packages("sqldf")
#install.packages("tidyverse")
# Load all packages here
library(nycflights13)
library(sqldf)
library(tidyverse)
# Load the data here
data(airlines)
data(flights)
Question 1: List the name of airlines where the destination is ATL airport with their average arrival delays and sort them from the smallest to largest average arrival delays. Use data frames flights
and airlines
. We solve this question using R, SQL, and Python.
R solution
You can use the R command inner_join()
to join the two tables.
library(dplyr)
flightsDF <- data.frame(flights) #loading in the data
airlinesDF <- data.frame(airlines)
joinedDF = inner_join(flightsDF, airlinesDF, by='carrier') %>% filter(dest == 'ATL') %>% select(name,arr_delay) #joining data, filtering by destination and selecting name and arrival delay
joinedDF %>% group_by(name) %>% summarise_each(funs(mean(., na.rm = TRUE))) %>% arrange(desc(arr_delay)) #grouping data by name and calculating the mean
## # A tibble: 7 x 2
## name arr_delay
## <chr> <dbl>
## 1 AirTran Airways Corporation 20.7
## 2 ExpressJet Airlines Inc. 19.6
## 3 Envoy Air 14.0
## 4 United Air Lines Inc. 10.5
## 5 Delta Air Lines Inc. 7.42
## 6 Southwest Airlines Co. 6.90
## 7 Endeavor Air Inc. 0.857
SQL solution
Write your SQL query in the function sqldf()
. For exmaple,sqldf("select * from relig_income")
list the data frame relig_income
.
# Example
# sqldf("select * from relig_income limit 3")
# code here
#selecting airline name and average delay
sqldf("
Select airlns.name, AVG(flgt.arr_delay)
FROM flights as flgt
LEFT JOIN airlines as airlns --joining data on carrier
on flgt.carrier = airlns.carrier
WHERE flgt.dest == 'ATL' --filtering by destination
GROUP BY airlns.name --grouping by airline
ORDER BY AVG(flgt.arr_delay) DESC --ordering the data by arrival delays
")
## name AVG(flgt.arr_delay)
## 1 AirTran Airways Corporation 20.7445127
## 2 ExpressJet Airlines Inc. 19.6382850
## 3 Envoy Air 14.0340045
## 4 United Air Lines Inc. 10.5000000
## 5 Delta Air Lines Inc. 7.4242250
## 6 Southwest Airlines Co. 6.8965517
## 7 Endeavor Air Inc. 0.8571429
Python solution
Use df.join() from pandas.
# load python libraries
import pandas as pd
import statsmodels.api as stat
# load data
## C:\Users\clayt\Documents\R\win-library\4.1\reticulate\python\rpytools\loader.py:39: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
## module = _import(
## C:\Users\clayt\CONDA\lib\site-packages\statsmodels\compat\pandas.py:61: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
## from pandas import Int64Index as NumericIndex
flights_df = stat.datasets.get_rdataset("flights", "nycflights13").data
airlines_df = stat.datasets.get_rdataset("airlines", "nycflights13").data
# code here
joined_df = flights_df.set_index('carrier').join(airlines_df.set_index('carrier'), how='inner') #joining airlines and flight data
joined_df = joined_df[joined_df['dest'] == 'ATL'] #filtering the data for ATL as destination
joined_df.groupby('name')['arr_delay'].mean().reset_index().sort_values('arr_delay', ascending=False) #grouping by name and calculating the mean
## name arr_delay
## 0 AirTran Airways Corporation 20.744513
## 4 ExpressJet Airlines Inc. 19.638285
## 3 Envoy Air 14.034004
## 6 United Air Lines Inc. 10.500000
## 1 Delta Air Lines Inc. 7.424225
## 5 Southwest Airlines Co. 6.896552
## 2 Endeavor Air Inc. 0.857143
Question 2: Plot the boxplot of the departure delays vs the name of airlines where the destination is ATL airport (use the previous question). Solve this question using R and Python.
R solution
Use ggplot
# code here
library(ggplot2) #loading ggplot2
plot <- ggplot(data = joinedDF, aes(x = name, y = arr_delay, ymin = -500)) + #plotting data
geom_boxplot() +
labs(title="Arrival Delays by Airline", #setting labels
x="Airline",
y = "Arrival Delays") +
theme(axis.text.x = element_text(angle = 90)) #rotating x axis labels
plot
Python solution
You can use pandas.boxplot() or seaborn.boxplot().
# import python libraries
import matplotlib.pyplot as plt #importing library used to create box plot
# data
#the data utilizes the joined_df created on line 75
# Boxplot
plt.figure() #initializing figure
joined_df.boxplot(column=['arr_delay'], by="name",rot=90) #using the boxplot method to create a boxplot
#setting lables for the plot
plt.xlabel("Airline")
plt.ylabel("Arrival Delays")
plt.suptitle('')
plt.title("Arrival Delays by Airline")
plt.tight_layout()
plt.show() #displaying box plot