Introduction

In this analysis we use the Data Expo 2009 dataset in search of a key performance indicator (KPI). The goal is to train a model using least squares to predict the behavior of that KPI over time.

Preliminaries

We load the required libraries.

library(RSQLite)
library(ff)
library(ffbase)
library(ETLUtils)
library(ggplot2)

Setting Up the Database Connection

I downloaded the post-2001 data and set up the database using SQLite as instructed here.
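
For reference, here is a minimal sketch of that setup step, assuming one CSV file per year (the file names and the loop are assumptions, not the original script; only the ontime table name comes from the queries below).

# Hypothetical setup sketch: import one CSV per year into the ontime table
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), dbname = "ontimesmall.sqlite3")
for (year in 2002:2008) {
  # assumed file naming; adjust to the actual downloads
  dbWriteTable(con, "ontime", read.csv(paste0(year, ".csv")), append = TRUE)
}
dbDisconnect(con)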

Once the database is ready, we can set up the connection to it.

ontime <- dbConnect(RSQLite::SQLite(), dbname = "ontimesmall.sqlite3")
from_db <- function(sql) {
  dbGetQuery(ontime, sql)
}
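
As a quick sanity check that the connection works, the helper can be used like this (the ontime table name is the one used by the queries below):

# Example usage: count the rows in the ontime table
from_db("SELECT COUNT(*) FROM ontime")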

Scheduled Departure Time vs Cancellations and Departure Delay

A SQL query was run to get the year, the scheduled departure time, the average departure delay, the total cancellations, and the flight count, grouped by year and scheduled departure time.

We then save the query results as ffdf files to access them faster in future analyses.

ffdf.delaycancel <- read.dbi.ffdf(query = "SELECT Year, CRSDepTime, avg(DepDelay), sum(Cancelled), count(DepDelay) as flights FROM ontime GROUP BY Year, CRSDepTime", 
                          dbConnect.args = list(
                            RSQLite::SQLite(),
                             dbname = "ontimesmall.sqlite3"))
save.ffdf(ffdf.delaycancel, dir = "../data", overwrite = TRUE)
load.ffdf("../data")

Every scheduled departure time was rounded to group flights by scheduled hour of departure.

# CRSDepTime is in HHMM format; dividing by 100 and rounding yields the hour
redondear <- function(x) round(x / 100)
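
For example, note that round (rather than floor) is used, so departures late in the hour roll into the next hour:

redondear(1435)  # 14: a 14:35 departure is grouped into hour 14
redondear(755)   # 8: round(7.55) rounds 7:55 up to hour 8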

We convert the data to a data frame and apply some preprocessing.

load("data.Rda")
df.all = as.data.frame(ffdf.delaycancel)
df.all$Year = as.factor(df.all$Year)
df.all$CRSDepTime = sapply(df.all$CRSDepTime, redondear)

We aggregate the average departure delay, the sum of cancelled flights, and the number of flights per scheduled hour of departure and year.

v1 <- aggregate( avg.DepDelay. ~ Year + CRSDepTime, data = df.all, mean)
v2 <- aggregate( sum.Cancelled. ~ Year + CRSDepTime, data = df.all, sum)
v3 <- aggregate( flights ~ Year + CRSDepTime, data = df.all, sum)
df.all = merge(v1,v2)
df.all = merge(df.all, v3)

Scheduled Departure Time vs Percentage of Cancellations

Data Exploration

The percentage of cancelled flights was plotted against scheduled departure hour to see the shape of each year's distribution.

gg <- ggplot(df.all, aes(x = CRSDepTime, y = sum.Cancelled./flights, color = Year))
gg <- gg + geom_point(aes(size = flights)) 
gg <- gg + geom_smooth(method = "lm", formula = y ~ poly(x, 3), se=FALSE,
                       aes(weight = flights)) 
gg <- gg + scale_x_continuous(breaks = pretty(df.all$CRSDepTime,n=10),
                     name = "Sheduled Departure Time") 
gg <- gg + scale_y_continuous(name = "Percentage of Cancelled Flights")
gg <- gg + scale_color_discrete(name = "Year")
gg <- gg + scale_size(name = "Flights")
gg

A degree-3 polynomial was used to approximate each year's distribution, and as can be seen it fits the majority of the points well. The points it misses the most are in the early hours between 0:00 and 4:00, which is not a problem since only a minority of flights occur around that time.

# Fraction of flights scheduled after 4:00
temp = aggregate(flights ~ CRSDepTime, data = df.all, sum)
range1 = subset(temp, CRSDepTime > 4)
sum(range1$flights)/sum(temp$flights)
## [1] 0.9976916

Prediction

The last three years were chosen as training data for the model. We also tested with two and with four years of history, but neither performed as well when cross-validated over the years.

yearcut <- function(data, year)
  subset(data, Year %in% c(toString(year-1), toString(year-2), toString(year-3)))

Here we cut the early hours to remove the noise they introduce, and split the data into training and test sets for an initial analysis.

df.noutliers = subset(df.all, CRSDepTime > 4)
train = yearcut(df.noutliers, 2008)
train$Year = as.numeric(as.character(train$Year))
test = subset(df.noutliers, Year %in% c("2008"))
test$Year = as.numeric(as.character(test$Year))

Two polynomials of different degree were compared. Here is the plot:

LS = lm(sum.Cancelled./flights ~ poly(CRSDepTime, 3),data = train)
LS7 = lm(sum.Cancelled./flights ~ poly(CRSDepTime, 7),data = train) 
gg <- ggplot(train, aes(x = CRSDepTime, y = sum.Cancelled./flights)) 
gg <- gg + geom_point() 
gg <- gg + geom_smooth(method = "lm", formula = y ~ poly(x,3), se = FALSE,
                       aes(color = "Poly grade 3")) 
gg <- gg + geom_smooth(method = "lm", formula = y ~ poly(x,7), se = FALSE,
                       aes(color = "Poly grade 7")) 
gg <- gg + scale_x_continuous(breaks = pretty(train$CRSDepTime,n=20),
                     name = "Sheduled Departure Time") 
gg <- gg + scale_y_continuous(name = "Cancellations per Flight")
gg

# MSE, degree 3 (on the test year)
mse3 = mean((predict(LS, test) - test$sum.Cancelled./test$flights)^2)
mse3
## [1] 2.790698e-06
# Coefficients, degree 3
LS$coefficients
##          (Intercept) poly(CRSDepTime, 3)1 poly(CRSDepTime, 3)2 
##           0.01854156          -0.00200545          -0.01319986 
## poly(CRSDepTime, 3)3 
##          -0.02203896
# MSE, degree 7 (on the test year)
mse7 = mean((predict(LS7, test) - test$sum.Cancelled./test$flights)^2)
mse7
## [1] 2.115239e-06
# Coefficients, degree 7
LS7$coefficients
##          (Intercept) poly(CRSDepTime, 7)1 poly(CRSDepTime, 7)2 
##          0.018541556         -0.002005450         -0.013199855 
## poly(CRSDepTime, 7)3 poly(CRSDepTime, 7)4 poly(CRSDepTime, 7)5 
##         -0.022038956          0.003149267          0.003557037 
## poly(CRSDepTime, 7)6 poly(CRSDepTime, 7)7 
##          0.004984905         -0.001700850

As we can see, the advantage of raising the degree of the polynomial is not great.

To gain a better understanding of the model and its behavior across several years, we cross-validate over a range of years and take the mean of its performance (here, the mean squared error).

predictions = c()
predictions7 = c()

for(year in 2005:2008){
  train = yearcut(df.noutliers, year)
  train$Year = as.numeric(as.character(train$Year))
  test = subset(df.noutliers, Year %in% c(toString(year)))
  test$Year = as.numeric(as.character(test$Year))
  LS = lm(sum.Cancelled./flights ~ poly(CRSDepTime, 3),data = train)
  LS7 = lm(sum.Cancelled./flights ~ poly(CRSDepTime, 7),data = train)
  p = predict(LS, test)
  p7 = predict(LS7, test)
  predictions = c(predictions, mean((p - test$sum.Cancelled./test$flights)^2))
  predictions7 = c(predictions7, mean((p7 - test$sum.Cancelled./test$flights)^2))
}

# Mean MSE, degree 3
cv3 = mean(predictions)
cv3
## [1] 9.519727e-06
# Mean MSE, degree 7
cv7 = mean(predictions7)
cv7
## [1] 9.329138e-06

The MSE of the degree-7 model is still lower, but I don't find that such a small performance gain warrants doubling the complexity of the model. To put the error on a more interpretable scale, we also look at the root mean square deviation:

sqrt(cv3)
## [1] 0.003085405
sqrt(cv7)
## [1] 0.003054364

Scheduled Departure Time vs Departure Delay

Data Exploration

We repeat the analysis from the exploration stage of the previous variable, this time using a degree-4 polynomial.

gg <- ggplot(df.all, aes(x = CRSDepTime, y = avg.DepDelay., color = Year)) 
gg <- gg + geom_point(aes(size = flights)) 
gg <- gg + geom_smooth(method = "lm", formula = y ~ poly(x,4),se=FALSE, aes(weight = flights))
gg <- gg + scale_y_continuous(limits = c(-10, 20), breaks = pretty(df.all$avg.DepDelay.,n=25),
                              name = "Departure Delay" )
gg <- gg + scale_x_continuous(breaks = pretty(df.all$CRSDepTime,n=10), 
                     name = "Sheduled Departure Time")
gg <- gg + scale_color_discrete(name = "Year")
gg <- gg + scale_size(name = "Flights")
gg
## Warning: Removed 4 rows containing non-finite values (stat_smooth).
## Warning: Removed 4 rows containing missing values (geom_point).

The code outputs some warnings because we dropped a few departure delay values, disregarding them as outliers. These values fall in the same early-hours range where, as we found, only a minority of flights take place. It can be seen that the chosen function approximates the data points well, especially when the number of flights is used as weights for the linear regression.

Prediction

Once the data is subset, there is no need to weight by flights, but adding the number of flights as a linear term provided some improvement to the model:

LS = lm(avg.DepDelay. ~ poly(CRSDepTime,4) + flights,data = train)
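
To sanity-check that improvement on a given training set, one option (a quick sketch, not part of the original analysis) is to fit the same polynomial without the flights term and compare the nested models:

# Sketch: does the flights term earn its keep on this training set?
LS.base = lm(avg.DepDelay. ~ poly(CRSDepTime, 4), data = train)
anova(LS.base, LS)        # F-test comparing the nested models
c(AIC(LS.base), AIC(LS))  # lower AIC indicates the better trade-off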

gg <- ggplot(train, aes(x = CRSDepTime, y = avg.DepDelay.)) 
gg <- gg + geom_point() 
gg <- gg + geom_smooth(method = "lm", formula = y ~ poly(x,4), se = FALSE) 
gg <- gg + scale_x_continuous(breaks = pretty(train$CRSDepTime,n=10),
                     name = "Sheduled Departure Time") 
gg <- gg + scale_y_continuous(breaks = pretty(train$avg.DepDelay.,n=10),
                     name = "Average Departure Delay")
gg

# MSE, degree 4 (on the test year)
mse = mean((predict(LS, test) - test$avg.DepDelay.)^2)
mse
## [1] 2.076512
# Coefficients, degree 4
LS$coefficients
##          (Intercept) poly(CRSDepTime, 4)1 poly(CRSDepTime, 4)2 
##         8.316157e+00         3.628595e+01        -1.208052e+01 
## poly(CRSDepTime, 4)3 poly(CRSDepTime, 4)4              flights 
##        -1.006661e+01        -5.450710e-01         5.254533e-06

As we can see, 2008 can be predicted fairly accurately. Let's see if the model holds up for the other years.

predictions = c()
for(year in 2005:2008){
  train = yearcut(df.noutliers, year)
  train$Year = as.numeric(as.character(train$Year))
  test = subset(df.noutliers, Year %in% c(toString(year)))
  test$Year = as.numeric(as.character(test$Year))
  LS = lm(avg.DepDelay. ~ poly(CRSDepTime,4) + flights,data = train)
  p = predict(LS, test)
  predictions = c(predictions, mean((p - test$avg.DepDelay.)^2))
}
# Mean MSE
mse = mean(predictions)
mse
## [1] 6.852874

The MSE worsens dramatically after cross-validation. Adding the year as a linear term to the model improves the performance of the cross-validated model.

predictions = c()
for(year in 2005:2008){
  train = yearcut(df.noutliers, year)
  train$Year = as.numeric(as.character(train$Year))
  test = subset(df.noutliers, Year %in% c(toString(year)))
  test$Year = as.numeric(as.character(test$Year))
  LS = lm(avg.DepDelay. ~ poly(CRSDepTime,4) + flights + Year,data = train)
  p = predict(LS, test)
  predictions = c(predictions, mean((p - test$avg.DepDelay.)^2))
}
# Mean MSE
mse = mean(predictions)
mse
## [1] 3.878254

The MSE is still worse than the performance on 2008 alone, but we consider the RMSD to be sufficiently low.

sqrt(mse)
## [1] 1.969328

Conclusion

We found a simple model that accurately predicts the yearly average percentage of cancelled flights at a given hour, using only least squares. Other techniques drawing on additional information, such as weather forecasts and the day of the year, might be more appropriate if the intention were to predict the percentage of cancelled flights on a particular day.

Afterwards, we provided a second model, which performs moderately well, for predicting the average departure delay in a given year. It's worth noting that the number of years over which we cross-validated wasn't particularly large for either model. Further analysis could provide more information about how these KPIs change over time.