Part 1: Summary to CEO

This study predicts that “The Fatal Empire” will sell approximately 0.779 million copies in North America. The prediction was obtained by evaluating the best-performing model on testing data and then using it to predict the game’s sales in North America. The most important variable in the final model was the number of sales in Europe, followed by the sales in other regions of the world.

Part 2: Manager’s report

Introduction

The aim of this report is to predict the number of sales in North America for the recently released PlayStation 4 game “The Fatal Empire”. The game has sold 2.58 million copies in Japan, 0.53 million copies in Europe, and 0.1 million copies in other world regions.

In this report, the prediction of sales in North America was performed using two different predictive models (i.e. Lasso regression and Random Forest). The final model was selected by comparing the performance of both models and choosing the better of the two.

Data manipulation

In this section, the data types and missing values of the different variables were identified. The variables in the data are described in Table 1 with their respective data types. The variable Year was transformed from a categorical to a numerical variable, and all observations with missing values for Year were deleted. Furthermore, the Publisher of the game “The Fatal Empire” is not available for the final prediction, so this variable was removed from the data.

Exploratory Data Analysis

A Parallel Coordinate Plot and a Principal Component Analysis were performed to explore the data. The former did not show groupings of games (i.e. clusters) for the sales in different regions (Figure 6). The latter identified the important variables in the different components (i.e. Sales, Year, Genre, Platform) (Figure 9). However, the PCA did not achieve a meaningful reduction in dimensionality.

The total sales in North America were highest for the year 2008, the Action genre, and the Xbox 360 platform. The relationships between the outcome variable and the sales in different regions varied, but only the sales in Europe showed a roughly linear relationship (Figure 12). It is important to highlight that all relationships showed outliers for sales in North America, which could affect the final predictive model.

Preprocessing

The data was split into training and testing sets. The training set included 12245 observations, and the testing set 4082. The preprocessing steps applied were:

  • Remove Name because, as a unique identifier, it would cause overfitting on the outcome variable.
  • Create an “Other” category for Genre and Platform to avoid a large number of dimensions when fitting the model.
  • Convert Genre and Platform to dummy variables so that all predictors are numerical.
  • Normalise all the predictors, rescaling the values so that each has a mean of zero and a standard deviation of one.
  • Remove all predictors with a variance equal to zero.
  • Remove highly correlated predictors to avoid multicollinearity.

Model outcome

The Random Forest showed better accuracy metrics in the cross-validation analysis than the Lasso regression. The Random Forest had an r-squared of 0.806 and a root mean squared error of 0.376 (Table 23). In contrast, the Lasso regression had an r-squared of 0.667 and a root mean squared error of 0.446 (Table 22). Thus, the Random Forest model was chosen to predict the sales in North America for “The Fatal Empire”.

Model performance

The most important variable for predicting the sales in North America with the Random Forest model was the number of sales in Europe, which is consistent with the roughly linear relationship observed earlier (Figure 12B).

The Random Forest showed lower accuracy on the testing set than on the training set, with an r-squared of 0.643 and a root mean squared error of 0.468 (Table 25). This could be explained by the high variability in the number of sales for each game in North America, consistent with the number of outliers identified. The predictors of the model explained 64.3% of the variance in sales in North America, which is a moderate prediction accuracy.

Predictions

The Random Forest model predicted 0.779 million sales in North America for The Fatal Empire. It is important to consider that the model had moderate accuracy, which could be improved by adding other variables, recovering the missing values for Year, and/or further researching the reason behind the high number of outliers.
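For reference, a minimal sketch of how such a prediction could be reproduced from the objects built in Part 3 (the fit rf_fit, the Genre value, and Year = 2016 are illustrative assumptions rather than values from this study):

# Illustrative sketch only: fit the final Random Forest on the preprocessed
# training data, then predict for the new game (Genre and Year are assumed).
rf_fit <- final_rf %>% fit(NA_Sales ~ ., data = vgsales_train_preproc)

new_game <- tibble(
  Name = "The Fatal Empire", Platform = "PS4", Year = 2016, Genre = "Action",
  NA_Sales = NA_real_, # unknown outcome
  EU_Sales = 0.53, JP_Sales = 2.58, Other_Sales = 0.10
)

# Apply the trained recipe to the new observation, then predict
predict(rf_fit, new_data = bake(vgsales_recipe, new_data = new_game))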

Part 3: Statistician’s report

1. Data cleaning

Data dictionary

The following table shows the variables present in the data with their respective data types.

Table 1. Description and type of the variables in the vgsales dataset.

Variable Description Type
Name The game’s name Categorical
Platform Platform of the game’s release (i.e. PC, PS4, XBOX, etc) Categorical
Year Year of the game’s release Discrete
Genre Genre of the game Categorical
Publisher Publisher of the game Categorical
NA_Sales Sales in North America (in millions) Continuous
EU_Sales Sales in Europe (in millions) Continuous
JP_Sales Sales in Japan (in millions) Continuous
Other_Sales Sales in the rest of the world (in millions) Continuous

Read and skim the data

# Read the data and transform it to a tibble
vgsales <- read.csv("vgsales.csv", header = TRUE)
vgsales <- tibble(vgsales)
vgsales %>%
  head() %>%
  kable(caption = "Table 2. First 6 rows of the vgsales dataset") %>%
  kable_classic("hover", font_size = 18)
Table 2. First 6 rows of the vgsales dataset
Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales
Wii Sports Wii 2006 Sports Nintendo 41.49 29.02 3.77 8.46
Super Mario Bros.  NES 1985 Platform Nintendo 29.08 3.58 6.81 0.77
Mario Kart Wii Wii 2008 Racing Nintendo 15.85 12.88 3.79 3.31
Wii Sports Resort Wii 2009 Sports Nintendo 15.75 11.01 3.28 2.96
Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo 11.27 8.89 10.22 1.00
Tetris GB 1989 Puzzle Nintendo 23.20 2.26 4.22 0.58

When comparing Table 2 with the description of the data on Kaggle, it is possible to see that the games’ rank and global sales columns are missing.
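Neither column is needed for this analysis: global sales can be reconstructed (approximately, up to rounding in the source data) as the sum of the regional sales columns. A quick sketch:

# Reconstruct global sales as the sum of the regional columns
# (matches Kaggle's Global_Sales up to rounding)
vgsales %>%
  mutate(Global_Sales = NA_Sales + EU_Sales + JP_Sales + Other_Sales) %>%
  select(Name, Global_Sales) %>%
  head()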

dim(vgsales) #dimension
## [1] 16598     9
skim_without_charts(vgsales)%>%
  kable(caption = "Table 3. Skimmed vgsales data") %>%
  kable_classic("hover", font_size = 18)
Table 3. Skimmed vgsales data
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
character Name 0 1 1 132 0 11493 0 NA NA NA NA NA NA NA
character Platform 0 1 2 4 0 31 0 NA NA NA NA NA NA NA
character Year 0 1 3 4 0 40 0 NA NA NA NA NA NA NA
character Genre 0 1 4 12 0 12 0 NA NA NA NA NA NA NA
character Publisher 0 1 3 38 0 579 0 NA NA NA NA NA NA NA
numeric NA_Sales 0 1 NA NA NA NA NA 0.2646674 0.8166830 0 0 0.08 0.24 41.49
numeric EU_Sales 0 1 NA NA NA NA NA 0.1466520 0.5053512 0 0 0.02 0.11 29.02
numeric JP_Sales 0 1 NA NA NA NA NA 0.0777817 0.3092906 0 0 0.00 0.04 10.22
numeric Other_Sales 0 1 NA NA NA NA NA 0.0480630 0.1885884 0 0 0.01 0.04 10.57

Observations

  • There are 16598 observations and 9 variables. The outcome variable is NA_Sales whilst all other variables are predictors.

  • There are 5 categorical variables (i.e. Name, Platform, Year, Genre, Publisher). Of these, Year should be a numerical variable.

  • There are 4 numerical variables (i.e. NA_Sales, EU_Sales, JP_Sales, Other_Sales). All of these variables have the correct data type.

  • None of the variables has missing data.

  • There are differences in the means and standard deviations of the numerical variables. For this reason, the numerical variables should be normalised for the final model.

Plot key variables

Check the distribution of the Year variable.

ggplot(data = vgsales, aes(as.factor(Year))) +
  geom_bar(fill = "#2F5755") + # bar chart of counts (equivalent to geom_histogram(stat = "count"))
  xlab("Year") +
  ylab("Count") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
Figure 1. Histogram of the count of observations by year

Observations

  • There are no observations for the years 2018 and 2019. These releases could be among the “N/A” values of the Year variable.

  • There are missing values that were not captured by the skim function, since the string “N/A” is not the same as R’s NA value (a recoding sketch follows this list).

  • The distribution is unimodal and left skewed with more observations for recent years.

  • Most of the observations were for year 2009.
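Because “N/A” here is a character string rather than a true missing value, one option is to recode it so that standard missing-value tools can detect it. A minimal sketch using dplyr’s na_if():

# Recode the "N/A" strings in Year to real NA values, then count them
vgsales %>%
  mutate(Year = na_if(Year, "N/A")) %>%
  summarise(missing_year = sum(is.na(Year)))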

Since there are missing values in Year, the affected observations were filtered out to investigate them further.

#Obtain the dimension of the observations with missing Year
dim(vgsales %>% 
  filter(Year == "N/A"))
## [1] 271   9
#Obtain a table of the observations with missing Year
vgsales %>% 
  filter(Year == "N/A") %>%
  head() %>%
  kable(caption = "Table 4. First 6 rows of observations with missing year") %>%
  kable_classic("hover", font_size = 18)
Table 4. First 6 rows of observations with missing year
Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales
Madden NFL 2004 PS2 N/A Sports Electronic Arts 4.26 0.26 0.01 0.71
FIFA Soccer 2004 PS2 N/A Sports Electronic Arts 0.59 2.36 0.04 0.51
LEGO Batman: The Videogame Wii N/A Action Warner Bros. Interactive Entertainment 1.86 1.02 0.00 0.29
wwe Smackdown vs. Raw 2006 PS2 N/A Fighting N/A 1.57 1.02 0.00 0.41
Space Invaders 2600 N/A Shooter Atari 2.36 0.14 0.00 0.03
Rock Band X360 N/A Misc Electronic Arts 1.93 0.34 0.00 0.21

Observations

  • There are 271 observations with missing values in the Year variable.

  • Table 4 shows that some games contain the year in the Name variable (e.g. Madden NFL 2004); a sketch of how this could be recovered follows this list.
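A hypothetical refinement (not applied in this report) would be to recover the release year from titles that embed one; note that the year in a sports title often refers to the season rather than the release date, so any extracted values would need validation.

# Sketch: extract a four-digit year from Name where present
# (e.g. "Madden NFL 2004" -> "2004"); str_extract is loaded with the tidyverse
vgsales %>%
  filter(Year == "N/A") %>%
  mutate(Year_from_name = str_extract(Name, "(19|20)\\d{2}")) %>%
  select(Name, Year_from_name) %>%
  head()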

Check the distribution of sales in different regions.

p1 <- ggplot(data = vgsales, aes(x = NA_Sales)) +
  geom_histogram(fill = "#2F5755") +
  xlab("Sales in North America") +
  ylab("Count") +
  theme_bw() 

p2 <- ggplot(data = vgsales, aes(JP_Sales)) +
  geom_histogram(fill = "#2F5755") +
  xlab("Sales in Japan") +
  ylab("Count") +
  theme_bw()

p3 <- ggplot(data = vgsales, aes(EU_Sales)) +
  geom_histogram(fill = "#2F5755") +
  xlab("Sales in Europe") +
  ylab("Count") +
  theme_bw()

p4 <- ggplot(data = vgsales, aes(Other_Sales)) +
  geom_histogram(fill = "#2F5755") +
  xlab("Sales in the rest of the world") +
  ylab("Count") +
  theme_bw()


# Use patchwork
(p1/p2/p3/p4) + 
  plot_annotation(tag_levels = "A",
                  title = "Histogram of sales in different regions",
                  theme = theme(plot.title = element_text(size = 18)))
Figure 2. Histogram of the count of observations by sales for each region

Observations

  • Sales are concentrated at or near zero for all regions (i.e. North America, Japan, Europe and the rest of the world).

  • The distribution of sales is unimodal and right skewed for all regions. This means that most of the games in the dataset had low sales.

Analyse the number of games released by Platform.

count_released <- vgsales %>%
  group_by(Platform) %>%
  summarise(released_games = n()) %>%
  arrange(desc(released_games)) %>%
  mutate(pal = case_when(released_games >= 1000 ~ "#2F5755",
                         released_games >= 500 ~ "#F7B15B",
                         released_games >= 250 ~ "#F65C51",
                         TRUE ~ "grey90"))


ggplot(data = count_released, aes(y = reorder(Platform, released_games), x = released_games, fill = pal)) +
  geom_col(alpha = .8) +
  xlab("Count")+
  ylab("Platform") +
  ggtitle("Platform by number of released games") +
  scale_fill_identity("Released games", labels = c(">1000", ">500",">250" ,"<250"), guide = "legend", limits = c("#2F5755", "#F7B15B","#F65C51","grey90")) +
  theme_bw() +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 3. Bar chart of the number of released games per platform

Create a table with the top 5 platforms:

count_released %>%
  select(Platform, released_games)%>%
  head(5)%>%
  kable(caption = "Table 5. Top 5 platforms with the highest number of released games") %>%
  kable_classic("hover", font_size = 18)
Table 5. Top 5 platforms with the highest number of released games
Platform released_games
DS 2163
PS2 2161
PS3 1329
Wii 1325
X360 1265

Create a table with the bottom 5 platforms:

count_released %>%
  select(Platform, released_games)%>%
  tail(5)%>%
  kable(caption = "Table 6. Bottom 5 platforms with the lowest number of released games") %>%
  kable_classic("hover", font_size = 18)
Table 6. Bottom 5 platforms with the lowest number of released games
Platform released_games
WS 6
3DO 3
TG16 2
GG 1
PCFX 1

Observations

  • Figure 3 and Table 5 show that the number of games released is the highest for Nintendo DS (2163 games released), followed by PlayStation 2 (2161 games released).

  • Figure 3 and Table 6 show that the number of games released is the lowest for Game Gear (GG) and PC-FX (PCFX), both having 1 game released (Sonic the Hedgehog 2 and Blue Breaker: Ken Yorimo Hohoemi, respectively).

Identify the number of released games per Publisher

count_publisher <- vgsales %>%
  group_by(Publisher) %>%
  summarise(released_games = n()) %>%
  arrange(desc(released_games)) %>%
  mutate(pal = case_when(released_games >= 1000 ~ "#2F5755",
                         released_games >= 750 ~ "#F7B15B",
                         released_games >= 500 ~ "#F65C51",
                         TRUE ~ "grey90"))


ggplot(data = count_publisher%>%head(10), aes(y = reorder(Publisher, released_games), x = released_games, fill = pal)) +
  geom_col(alpha = .8) +
  xlab("Count")+
  ylab("Publisher") +
  ggtitle("Top 10 Publishers by number of released games") +
  scale_fill_identity("Released games", labels = c(">1000", ">750",">500", "<500" ), guide = "legend", limits = c("#2F5755", "#F7B15B","#F65C51", "grey90")) +
  theme_bw() +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 4. Bar chart of the number of released games per publisher

count_publisher%>%
  select(Publisher, released_games) %>%
  head(10) %>%
  kable(caption = "Table 7. Summary table of the number of released games for the top 10 publishers") %>%
  kable_classic("hover", font_size = 18)
Table 7. Summary table of the number of released games for the top 10 publishers
Publisher released_games
Electronic Arts 1351
Activision 975
Namco Bandai Games 932
Ubisoft 921
Konami Digital Entertainment 832
THQ 715
Nintendo 703
Sony Computer Entertainment 683
Sega 639
Take-Two Interactive 413

Observations

  • Figure 4 and Table 7 show that the Publisher that had the highest number of released games was Electronic Arts (1351 games released), followed by Activision (975 games released). Electronic Arts has released iconic video games such as FIFA, The Sims, Battlefield, and others. Activision is known for famous video games such as Call of Duty, Doom, Crash Bandicoot, and others.

  • Some publishers that also manufacture consoles, such as Nintendo, Sony Computer Entertainment and Sega, appear lower in the list.

Check the number of released games per Genre

count_genre <- vgsales %>%
  group_by(Genre) %>%
  summarise(released_games = n()) %>%
  arrange(desc(released_games)) %>%
  mutate(pal = case_when(released_games >= 1500 ~ "#2F5755",
                         released_games >= 1000 ~ "#F7B15B",
                         released_games >= 750 ~ "#F65C51",
                         TRUE ~ "grey90"))


ggplot(data = count_genre, aes(y = reorder(Genre, released_games), x = released_games, fill = pal)) +
  geom_col(alpha = .8) +
  xlab("Count")+
  ylab("Genre") +
  ggtitle("Number of released games by genre") +
  scale_fill_identity("Released games", labels = c(">1500", ">1000",">750", "<750"), guide = "legend", limits = c("#2F5755", "#F7B15B","#F65C51", "grey90")) +
  theme_bw() +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 5. Bar chart of the number of released games per genre

count_genre %>%
  select(Genre, released_games) %>%
  kable(caption = "Table 8. Summary table of the number of released games per genre") %>%
  kable_classic("hover", font_size = 18)
Table 8. Summary table of the number of released games per genre
Genre released_games
Action 3316
Sports 2346
Misc 1739
Role-Playing 1488
Shooter 1310
Adventure 1286
Racing 1249
Platform 886
Simulation 867
Fighting 848
Strategy 681
Puzzle 582
genre_unique <- vgsales %>%
  distinct(Name, Genre) 

genre_unique %>%
  group_by(Genre) %>%
  summarise(released_games = n()) %>%
  arrange(desc(released_games)) %>%
  kable(caption = "Table 9. Summary table of the number of unique released games per genre") %>%
  kable_classic("hover", font_size = 18)
Table 9. Summary table of the number of unique released games per genre
Genre released_games
Action 1925
Sports 1379
Misc 1327
Role-Playing 1219
Adventure 1048
Shooter 817
Racing 772
Simulation 726
Fighting 612
Platform 589
Strategy 584
Puzzle 499

Observations

  • Figure 5 and Table 8 show that the genre Action has the highest number of released games. It is important to highlight that the same game can be released on several Platforms; for this reason, the count of released games per genre double-counts some observations.

  • Table 9 shows the number of unique released games per genre. In this table, the genre Adventure has more released games than Shooter. Additionally, the genre Platform dropped two places on the list, having fewer released games than Simulation and Fighting.

Data manipulation

In the previous section, it was identified that the Year column has the wrong data type and has missing values. The following code aims to change the data type of Year to numeric and delete the missing values. Additionally, since information about the Publisher of The Fatal Empire is not available, it would not be possible to use this variable for predicting the sales in North America. Thus, the Publisher variable was removed.

vgsales <- vgsales %>%
  mutate(Year = as.numeric(Year)) %>% # Transform Year to numeric ("N/A" strings become NA)
  select(-Publisher) %>% # Publisher is not available for the new game
  drop_na() # Remove the observations with missing Year

vgsales %>%
  head() %>%
  kable(caption = "Table 10. First 6 rows of the transformed vgsales dataset") %>%
  kable_classic("hover", font_size = 18)
Table 10. First 6 rows of the transformed vgsales dataset
Name Platform Year Genre NA_Sales EU_Sales JP_Sales Other_Sales
Wii Sports Wii 2006 Sports 41.49 29.02 3.77 8.46
Super Mario Bros.  NES 1985 Platform 29.08 3.58 6.81 0.77
Mario Kart Wii Wii 2008 Racing 15.85 12.88 3.79 3.31
Wii Sports Resort Wii 2009 Sports 15.75 11.01 3.28 2.96
Pokemon Red/Pokemon Blue GB 1996 Role-Playing 11.27 8.89 10.22 1.00
Tetris GB 1989 Puzzle 23.20 2.26 4.22 0.58
dim(vgsales) #dimension
## [1] 16327     8
skim_without_charts(vgsales)%>%
  kable(caption = "Table 11. Skimmed vgsales transformed data") %>%
  kable_classic("hover", font_size = 18)
Table 11. Skimmed vgsales transformed data
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
character Name 0 1 1 132 0 11360 0 NA NA NA NA NA NA NA
character Platform 0 1 2 4 0 31 0 NA NA NA NA NA NA NA
character Genre 0 1 4 12 0 12 0 NA NA NA NA NA NA NA
numeric Year 0 1 NA NA NA NA NA 2006.4064433 5.8289811 1980 2003 2007.00 2010.00 2020.00
numeric NA_Sales 0 1 NA NA NA NA NA 0.2654150 0.8215909 0 0 0.08 0.24 41.49
numeric EU_Sales 0 1 NA NA NA NA NA 0.1475544 0.5087657 0 0 0.02 0.11 29.02
numeric JP_Sales 0 1 NA NA NA NA NA 0.0786611 0.3115570 0 0 0.00 0.04 10.22
numeric Other_Sales 0 1 NA NA NA NA NA 0.0483255 0.1898854 0 0 0.01 0.04 10.57

Observations

  • There are 8 variables instead of 9 since Publisher was removed.

  • There are 16327 observations instead of 16598 since the NAs from Year were removed.

  • Table 11 shows that the variable Year is a numerical variable instead of a categorical variable.

2. Exploratory Data Analysis

Exploratory Data Analysis helps improve the performance and accuracy of predictive models. In this case, it allows us to determine the relationships between the predictor variables and the outcome variable, to identify outliers, and to detect anomalies in the data. The aim of this section is to build a better understanding of the data and identify which transformations are required.

Parallel Coordinate Plot

#Create top 10 platforms
top_10 <- vgsales %>%
  mutate(Platform = fct_lump(Platform, n = 10)) %>%
  count(Platform, sort = TRUE)



#Use pivot_longer to transform the data to long format
vgsales_pcp <- vgsales %>%
  select(-Year) %>% # Exclude year due to difference in scale with the sales
  mutate(ID = row_number(), Platform = case_when(Platform %in% unique(top_10$Platform) ~ Platform, TRUE ~ "Other Platform")) %>%
  pivot_longer(cols = NA_Sales:Other_Sales)
vgsales_pcp %>%
  head()%>%
  kable(caption = "Table 12. Data transformed to a long format") %>%
  kable_classic("hover", font_size = 18)
Table 12. Data transformed to a long format
Name Platform Genre ID name value
Wii Sports Wii Sports 1 NA_Sales 41.49
Wii Sports Wii Sports 1 EU_Sales 29.02
Wii Sports Wii Sports 1 JP_Sales 3.77
Wii Sports Wii Sports 1 Other_Sales 8.46
Super Mario Bros.  Other Platform Platform 2 NA_Sales 29.08
Super Mario Bros.  Other Platform Platform 2 EU_Sales 3.58
vgsales_pcp %>% 
  ggplot( aes( x = name, y = value) ) +
  geom_line(aes( group = ID), color = "#2F5755" ) +
  ggtitle("Parallel Coordinate Plot")+
  theme_bw()
Figure 6. Parallel Coordinate Plot of Number of Sales

The Parallel Coordinate Plot does not show clear groupings for the sales in different regions. For all regions, most of the sales lie between 0 and 5 million. However, some observations for Europe and North America show considerably higher sales.

Check whether the high-sales observations are related to the Genre of the game.

vgsales_pcp %>% 
  ggplot(aes( x = name, y = value, colour = Genre )) +
  geom_line( aes( group = ID ) ) +
  ggtitle("Parallel Coordinate Plot by Genre")+
  theme_bw()
Figure 7. Parallel Coordinate Plot of Number of Sales by Genre

The highest sales for Europe and North America correspond to the Sports genre. However, several genres reach high sales in North America, and they do not form a clear grouping.

Check whether the high-sales observations are related to the Platform of the game.

vgsales_pcp %>% 
  ggplot(aes( x = name, y = value, colour = Platform)) +
  geom_line(aes(group = ID)) +
  ggtitle("Parallel Coordinate Plot by Platform")+
  theme_bw()
Figure 8. Parallel Coordinate Plot of Number of Sales by Platform

The highest sales for Europe and North America correspond to the Wii Platform. Since some platforms were aggregated into “Other Platform” for visualisation purposes, a group of “Other Platform” observations is visible in the Japan sales.

Figures 6, 7 and 8 show that there are some outliers in the data belonging to the Wii Platform and the Sports Genre. However, there is no clear grouping of the games in relation to the number of sales per region.

Principal Component Analysis

vgsales_pca <- recipe(NA_Sales ~ . , data = vgsales) %>%
  step_rm(Name) %>% # Remove name
  step_other(Genre, Platform, threshold = 50) %>% # Create "other" category for Genre and Platform
  step_dummy(Genre, Platform,  one_hot = TRUE) %>% # Convert to dummy variable
  step_normalize(all_predictors()) %>% # Normalise the predictors
  step_zv(all_predictors()) %>% # Remove predictors with variance = 0
  step_corr(all_predictors()) %>% # Remove correlated predictors
  step_pca(all_predictors()) %>% # Perform PCA
  prep()

Steps Explanation:

step_rm: removes the Name variable. Name is not relevant for the PCA since it is a unique identifier for each observation.

step_other: creates an “Other” category for the variables Genre and Platform, pooling levels with fewer than 50 observations.

step_dummy: converts Genre and Platform to dummy variables, keeping all categories (one_hot = TRUE).

step_normalize: normalises all the predictors (i.e. Year, EU_Sales, JP_Sales, Other_Sales, and the Genre and Platform dummies), rescaling each so that its mean is zero and its standard deviation is one.

step_zv: removes all predictors that have a variance equal to zero, since they carry no information for prediction.

step_corr: removes highly correlated predictors.

step_pca: performs the Principal Component Analysis.

tidy(vgsales_pca, 7) %>% 
  filter(component %in% c("PC1", "PC2", "PC3", "PC4")) %>% 
  group_by(component) %>% 
  top_n(5, abs(value)) %>% 
  ungroup() %>% 
  ggplot(aes(x = value, y = terms, fill = abs(value))) +
  geom_col(show.legend = FALSE) +
  facet_wrap( ~ component, scales = "free_y") +
  xlab("Loadings") + 
  ylab(NULL)+
  ggtitle("Principal Components Loadings")+
  scale_fill_distiller(palette = "Reds", direction = 1)+
  theme_bw()
Figure 9. Principal Component Analysis Loadings Plot
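For reference, the proportion of variance explained by the \(j\)-th component, computed in the next chunk, is

\[ \mathrm{PVE}_j = \frac{\sigma_j^2}{\sum_k \sigma_k^2}, \]

where \(\sigma_j\) is the standard deviation of component \(j\); Table 13 reports the cumulative sum of these proportions.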

#Calculate standard deviation
sdev <- vgsales_pca$steps[[7]]$res$sdev
#Calculate variance explained
ve <- sdev^2 / sum(sdev^2)
# Calculate the proportion of variance explained
PC.pve <- tibble(
  pc = fct_inorder(str_c("PC", 1:40)),
  pve = cumsum(ve) #Cumulative sum of the variance explained
)
PC.pve %>%
  filter(pve >= 0.9) %>% # Components with cumulative variance explained above 90%
  kable(caption = "Table 13. Components that explain more than 90% of the variability") %>%
  kable_classic("hover", font_size = 18)
Table 13. Components that explain more than 90% of the variability
pc pve
PC32 0.9205947
PC33 0.9413979
PC34 0.9611211
PC35 0.9792675
PC36 0.9920159
PC37 0.9980949
PC38 1.0000000
PC39 1.0000000
PC40 1.0000000

Observations

  • The Year variable and the sales in different regions are the most important variables for the first component.

  • The Year variable, sales in Europe and sales in other regions are the most important variables for the second component.

  • Platform DS is the most important variable for the third component.

  • Role-Playing and Miscellaneous genre, and Wii and DS platforms are the most important variables for the fourth component.

  • Table 13 shows that 32 of the 40 components are needed to explain 90% of the variance (a reduction of only 8 dimensions). Since the dimensionality was not meaningfully reduced, and using the components would reduce the interpretability of the models, the components were not used for the predictive modelling.

Relationship between predictors and outcome

Sales in North America by year

sales_year <- vgsales %>%
  select(NA_Sales, Year) %>%
  group_by(Year)%>%
  summarise(sales = sum(NA_Sales))

ggplot(data = sales_year, aes(x = as.factor(Year), y = sales, fill = sales)) +
  geom_col() +
  xlab("Year") + 
  ylab("Sales in North America") +
  ggtitle("Sales in North America by Year") +
  scale_fill_distiller("Sales (in million)",palette = "Purples", direction = 1)+
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
  )
Figure 10. Bar chart of the sales in North America by year

#Create a median variable to colour the boxes
vgsales_median_year <- vgsales %>%
  mutate(median = as.factor(round(ave(NA_Sales, as.factor(Year), FUN = median), 1)))

ggplot(data = vgsales_median_year, aes(x = as.factor(Year), y = NA_Sales, fill = median)) +
  geom_boxplot() +
  xlab("Year") + 
  ylab("Sales in North America") +
  ylim(0,2)+ # Limit y scale to see the boxes for each year
  scale_fill_brewer("Median sales (in million)", palette = "Purples")+
  ggtitle("Sales in North America by Year") +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
  )
Figure 11. Boxplot of the sales in North America by year

Observations

  • Figure 10 shows that total sales peaked in 2008 and declined in the following years. However, Figure 11 shows that the median sales per game were highest around 1990 and roughly constant after 2000. The post-2008 decline could be explained by the rise of gaming subscriptions such as Xbox Game Pass, Steam, and PlayStation Plus, which give access to multiple video games without buying each one. The early-1990s peak could be explained by the release of the SNES (1991) and Genesis (1988) consoles, which fuelled the rivalry between Nintendo and Sega.

  • Figure 11 shows a large number of outliers for each year, meaning that every year some video games achieve far higher sales than the other games released in the same year.

Sales in North America by sales in other regions

p1 <- ggplot(data = vgsales, aes(x = JP_Sales, y = NA_Sales)) +
  geom_point(size = 3, col = "#002F2F", alpha = .7) +
  geom_smooth(col = "#F7B15B", se = FALSE)+
  xlab("Sales in Japan") +
  ylab("Sales in North America") +
  ggtitle("Japan")+
  theme_bw() 

p2 <- ggplot(data = vgsales, aes(x = EU_Sales, y = NA_Sales)) +
  geom_point(size = 3, col = "#3F606F", alpha = .7) +
  geom_smooth(col = "#F7B15B", se = FALSE)+
  xlab("Sales in Europe") +
  ylab("Sales in North America") +
  ggtitle("Europe")+
  theme_bw()

p3 <- ggplot(data = vgsales, aes(x = Other_Sales, y = NA_Sales)) +
  geom_point(size = 3, col = "#92AA9D", alpha = .7) +
  geom_smooth(col = "#F7B15B", se = FALSE)+
  xlab("Sales in the rest of the world") +
  ylab("Sales in North America") +
  ggtitle("Other regions")+
  theme_bw() 

(p1 / p2 / p3) + 
  plot_annotation(tag_levels = "A",
                  title = "Sales in North America by sales in different regions",
                  theme = theme(plot.title = element_text(size = 18)))
Figure 12. Scatter plot of sales in North America by sales in different regions of the world

Observations

  • Figure 12A shows that sales in North America have a roughly quadratic relationship with sales in Japan: North American sales stay flat as Japanese sales increase, up to about 2.5 million, after which they start increasing as well. Additionally, around 4 million sales in Japan there is an outlier with over 40 million sales in North America.

  • Figure 12B shows that sales in North America have a roughly linear relationship with sales in Europe. Most of the sales for both regions are concentrated below 5 million. As in the previous plot, there are some outliers for sales in North America.

  • Figure 12C shows that sales in North America have a roughly quadratic relationship with sales in other regions of the world. The pattern is similar to the relationship with Japan: North American sales stay flat until sales in other regions reach about 1.5 million, after which they start increasing. Again, there are some outliers for sales in North America, the highest reaching over 40 million.

Sales in North America by genre

sales_gen <- vgsales %>%
  group_by(Genre) %>%
  summarise(sales = sum(NA_Sales)) %>%
  arrange(desc(sales)) %>%
  mutate(pal = case_when(sales >= 500 ~ "#2F5755",
                         sales >= 375 ~ "#F7B15B",
                         sales >= 250 ~ "#F65C51",
                         TRUE ~ "grey90"))

ggplot(data = sales_gen, aes(y = reorder(Genre, sales), x = sales, fill = pal)) +
  scale_fill_identity("Sales (in million)", labels = c(">500", ">375", ">250", "<250"), guide = "legend", limits = c("#2F5755", "#F7B15B","#F65C51" ,"grey90")) +
  geom_col(alpha = .8) +
  xlab("Sales in North America") +
  ylab("Genre") +
  ggtitle("Sales in North America by Genre")+
  theme_bw() +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 13. Bar chart of sales in North America by Genre

#Create a median variable to colour the boxes
vgsales_median <- vgsales %>%
  mutate(median = round(ave(NA_Sales, Genre, FUN = median), 2))%>%
  mutate(pal = case_when(median >= 0.1 ~ "#2F5755",
                         median >= 0.08~ "#F7B15B",
                         median >= 0.06 ~ "#F65C51",
                         TRUE ~ "grey90"))


ggplot(data = vgsales_median, aes(y = reorder(Genre, median), x = NA_Sales,fill = pal)) +
  scale_fill_identity("Median sales (in million)", labels = c(">0.1", ">0.08", ">0.06", "<0.06"), guide = "legend", limits = c("#2F5755", "#F7B15B","#F65C51" ,"grey90")) +
  geom_boxplot(alpha = .8) +
  xlab("Sales in North America") +
  ylab("Genre") +
  xlim(0,1)+ #Limit x axis to be able to see boxes
  ggtitle("Sales in North America by Genre")+
  theme_bw() +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 14. Boxplot of sales in North America by Genre

Observations

  • Figure 13 shows that Action, Sports, and Shooter were the genres with the highest total sales in North America. However, when plotting the median sales per genre (Figure 14), the medians for Platform and Racing are as high as those of the genres mentioned before, with the Platform genre having the highest median sales.

  • Figure 14 shows that all genres have outliers, meaning that every genre contains games with far higher sales in North America than is typical for that genre.

Sales in North America by platform

sales_plat <- vgsales %>%
  group_by(Platform) %>%
  summarise(sales = sum(NA_Sales)) %>%
  arrange(desc(sales)) %>%
  mutate(pal = case_when(sales >= 400 ~ "#2F5755",
                         sales >= 200 ~ "#F7B15B",
                         sales >= 100 ~ "#F65C51",
                         TRUE ~ "grey90"))

ggplot(data = sales_plat%>%filter(sales>0), aes(y = reorder(Platform, sales), x = sales, fill = pal)) +
  scale_fill_identity("Sales (in million)", labels = c(">400", ">200", ">100", "<100"), guide = "legend", limits = c("#2F5755", "#F7B15B","#F65C51" ,"grey90")) +
  geom_col(alpha = .8) +
  xlab("Sales in North America") +
  ylab("Platform") +
  ggtitle("Sales in North America by Platform")+
  theme_bw() +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 15. Bar chart of sales in North America by Platform

#Create a median variable to colour the boxes
vgsales_median <- vgsales %>%
  mutate(median = round(ave(NA_Sales, Platform, FUN = median), 2))%>%
  mutate(pal = case_when(median > 0.15 ~ "#2F5755",
                         median > 0.1~ "#F7B15B",
                         median > 0.05 ~ "#F65C51",
                         TRUE ~ "grey90"))


ggplot(data = vgsales_median, aes(y = reorder(Platform,median), x = NA_Sales,fill = pal)) +
  scale_fill_identity("Median sales (in million)", labels = c(">0.15", ">0.10", ">0.05", "<0.05"), guide = "legend", limits = c("#2F5755", "#F7B15B","#F65C51" ,"grey90")) +
  geom_boxplot(alpha = .8) +
  xlab("Sales in North America") +
  ylab("Platform") +
  xlim(0,0.5)+ #Limit x axis to be able to see boxes
  ggtitle("Sales in North America by Platform")+
  theme_bw() +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 16. Boxplot of sales in North America by Platform

Observations

  • Figure 15 shows that X360, PS2, and Wii were the platforms with the highest total sales in North America. However, when plotting the median sales per platform (Figure 16), the 2600, NES, N64, and X360 platforms have the highest values.

  • Figure 16 shows that all platforms have outliers, meaning that every platform has games with far higher sales in North America than is typical for games released on that platform.

3. Preprocessing

Create initial split of the data:

set.seed(4321)
vgsales_split <- initial_split(vgsales)
vgsales_split
## <Analysis/Assess/Total>
## <12245/4082/16327>

The training set has 12245 observations and the testing set 4082.

Create training and testing set with the initial split data:

vgsales_train <- training(vgsales_split)
vgsales_test <- testing(vgsales_split)
dim(vgsales_train)
## [1] 12245     8
skim_without_charts(vgsales_train) %>%
  kable(caption = "Table 14. Skimmed training set") %>%
  kable_classic("hover", font_size = 18)
Table 14. Skimmed training set
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
character Name 0 1 2 132 0 9094 0 NA NA NA NA NA NA NA
character Platform 0 1 2 4 0 30 0 NA NA NA NA NA NA NA
character Genre 0 1 4 12 0 12 0 NA NA NA NA NA NA NA
numeric Year 0 1 NA NA NA NA NA 2006.4318497 5.8086130 1980 2003 2007.00 2010.00 2020.00
numeric NA_Sales 0 1 NA NA NA NA NA 0.2645659 0.8341938 0 0 0.08 0.24 41.49
numeric EU_Sales 0 1 NA NA NA NA NA 0.1490118 0.5353449 0 0 0.02 0.11 29.02
numeric JP_Sales 0 1 NA NA NA NA NA 0.0802058 0.3201934 0 0 0.00 0.04 10.22
numeric Other_Sales 0 1 NA NA NA NA NA 0.0473148 0.1661819 0 0 0.01 0.03 8.46

The dimensions of the training set and Table 14 show the right number of observations (i.e. 12245), and all variables have the correct data type.

dim(vgsales_test)
## [1] 4082    8
skim_without_charts(vgsales_test)%>%
  kable(caption = "Table 15. Skimmed testing set") %>%
  kable_classic("hover", font_size = 18)
Table 15. Skimmed testing set
skim_type skim_variable n_missing complete_rate character.min character.max character.empty character.n_unique character.whitespace numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
character Name 0 1 1 122 0 3614 0 NA NA NA NA NA NA NA
character Platform 0 1 2 4 0 28 0 NA NA NA NA NA NA NA
character Genre 0 1 4 12 0 12 0 NA NA NA NA NA NA NA
numeric Year 0 1 NA NA NA NA NA 2006.3302303 5.8897169 1980 2003 2007.00 2010.00 2016.00
numeric NA_Sales 0 1 NA NA NA NA NA 0.2679618 0.7826621 0 0 0.08 0.25 26.93
numeric EU_Sales 0 1 NA NA NA NA NA 0.1431823 0.4190689 0 0 0.02 0.11 9.26
numeric JP_Sales 0 1 NA NA NA NA NA 0.0740274 0.2840659 0 0 0.00 0.03 4.87
numeric Other_Sales 0 1 NA NA NA NA NA 0.0513572 0.2477403 0 0 0.01 0.04 10.57

The dimensions of the testing set and Table 15 show the right number of observations (i.e. 4082), and all variables have the correct data type.

Recipe

vgsales_recipe <- recipe(NA_Sales ~ ., data = vgsales_train) %>%
  step_rm(Name) %>%
  step_other(Genre, Platform,  threshold = 50) %>%
  step_dummy(Genre, Platform, one_hot = TRUE) %>%
  step_normalize(all_predictors()) %>%
  step_zv(all_predictors()) %>% #Remove predictors with variance = 0
  step_corr(all_predictors()) %>%
  prep()
vgsales_recipe
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          7
## 
## Training data contained 12245 data points and no missing data.
## 
## Operations:
## 
## Variables removed Name [trained]
## Collapsing factor levels for Platform [trained]
## Dummy variables from Genre, Platform [trained]
## Centering and scaling for Year, EU_Sales, JP_Sales, Other_Sales, Genre_Ac... [trained]
## Zero variance filter removed <none> [trained]
## Correlation filter on <none> [trained]

Steps Explanation:

  • step_rm: removes the Name variable. Name is the unique identifier for the observations, so predicting the sales in North America with this variable would be likely to cause overfitting.

  • step_other: creates an “Other” category for the variables Genre and Platform, pooling levels with fewer than 50 observations. This is necessary to reduce the number of dimensions in the final model.

  • step_dummy: converts Genre and Platform to dummy variables including all levels (i.e. one_hot = TRUE). This step was performed so that all predictors are numerical.

  • step_normalize: normalises all the predictors (i.e. Year, EU_Sales, JP_Sales, Other_Sales, and the Genre and Platform dummies), rescaling each so that its mean is zero and its standard deviation is one (see the formula after this list). This is required since the variables have different ranges, which would otherwise influence the final model.

  • step_zv: removes all the predictors that have a variance equal to zero.

  • step_corr: removes highly correlated predictors. Multicollinearity among predictors violates an assumption of linear regression, so this step helps to avoid it.
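For reference, step_normalize applies the standard z-score transformation

\[ z = \frac{x - \bar{x}}{s}, \]

where \(\bar{x}\) and \(s\) are the mean and standard deviation of the predictor computed on the training data.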

Preprocess the training data using the bake function, which applies all of the previous recipe steps. When new_data = NULL, bake returns the preprocessed training data.

vgsales_train_preproc <- bake(vgsales_recipe, new_data = NULL)
vgsales_train_preproc %>%
  head()%>%
  kable(caption = "Table 16. First 6 rows of the preprocessed training data") %>%
  kable_classic("hover", font_size = 18)
Table 16. First 6 rows of the preprocessed training data
Year EU_Sales JP_Sales Other_Sales NA_Sales Genre_Action Genre_Adventure Genre_Fighting Genre_Misc Genre_Platform Genre_Puzzle Genre_Racing Genre_Role.Playing Genre_Shooter Genre_Simulation Genre_Sports Genre_Strategy Platform_X2600 Platform_X3DS Platform_DS Platform_GB Platform_GBA Platform_GC Platform_N64 Platform_NES Platform_PC Platform_PS Platform_PS2 Platform_PS3 Platform_PS4 Platform_PSP Platform_PSV Platform_SAT Platform_SNES Platform_Wii Platform_WiiU Platform_X360 Platform_XB Platform_XOne Platform_other
0.7864442 0.2446799 -0.2504918 -0.0440170 0.00 -0.4967876 -0.289008 -0.2311268 2.9267229 -0.2373338 -0.1896129 -0.2833298 -0.3197672 -0.2939548 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 2.5590598 -0.0779713 -0.2295542 -0.1818491 -0.1437712 -0.0747251 -0.2499356 -0.2773997 -0.3853039 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 -0.1135924 -0.0836036
0.4421280 -0.2783474 -0.1567984 -0.2847171 0.06 -0.4967876 -0.289008 -0.2311268 -0.3416512 -0.2373338 -0.1896129 -0.2833298 3.1270195 -0.2939548 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 2.5590598 -0.0779713 -0.2295542 -0.1818491 -0.1437712 -0.0747251 -0.2499356 -0.2773997 -0.3853039 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 -0.1135924 -0.0836036
1.4750768 -0.2409883 -0.2504918 -0.2245420 0.07 2.0127684 -0.289008 -0.2311268 -0.3416512 -0.2373338 -0.1896129 -0.2833298 -0.3197672 -0.2939548 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 -0.3907366 -0.0779713 -0.2295542 -0.1818491 -0.1437712 -0.0747251 -0.2499356 -0.2773997 -0.3853039 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 8.8026849 -0.0836036
0.7864442 -0.0915519 -0.2504918 0.0763331 0.54 -0.4967876 -0.289008 -0.2311268 -0.3416512 -0.2373338 -0.1896129 -0.2833298 -0.3197672 -0.2939548 4.2900119 -0.4077842 -0.2065347 -0.0831069 -0.1779791 2.5590598 -0.0779713 -0.2295542 -0.1818491 -0.1437712 -0.0747251 -0.2499356 -0.2773997 -0.3853039 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 -0.1135924 -0.0836036
-0.9351371 0.7116686 1.0299845 0.0763331 0.92 2.0127684 -0.289008 -0.2311268 -0.3416512 -0.2373338 -0.1896129 -0.2833298 -0.3197672 -0.2939548 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 -0.3907366 12.8241780 -0.2295542 -0.1818491 -0.1437712 -0.0747251 -0.2499356 -0.2773997 -0.3853039 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 -0.1135924 -0.0836036
-0.9351371 9.2295414 5.5897292 6.6955857 6.85 -0.4967876 -0.289008 -0.2311268 -0.3416512 -0.2373338 -0.1896129 3.5291680 -0.3197672 -0.2939548 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 -0.3907366 -0.0779713 -0.2295542 -0.1818491 -0.1437712 -0.0747251 -0.2499356 -0.2773997 2.5951421 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 -0.1135924 -0.0836036

Preprocess the testing data. When new_data = vgsales_test, the bake function applies the trained recipe steps to the testing data.

vgsales_test_preproc <- bake(vgsales_recipe, new_data = vgsales_test)
vgsales_test_preproc %>%
  head()%>%
  kable(caption = "Table 17. First 6 rows of the preprocessed testing data") %>%
  kable_classic("hover", font_size = 18)
Table 17. First 6 rows of the preprocessed testing data
Year EU_Sales JP_Sales Other_Sales NA_Sales Genre_Action Genre_Adventure Genre_Fighting Genre_Misc Genre_Platform Genre_Puzzle Genre_Racing Genre_Role.Playing Genre_Shooter Genre_Simulation Genre_Sports Genre_Strategy Platform_X2600 Platform_X3DS Platform_DS Platform_GB Platform_GBA Platform_GC Platform_N64 Platform_NES Platform_PC Platform_PS Platform_PS2 Platform_PS3 Platform_PS4 Platform_PSP Platform_PSV Platform_SAT Platform_SNES Platform_Wii Platform_WiiU Platform_X360 Platform_XB Platform_XOne Platform_other
-3.8618255 0.8984640 0.6239798 2.543509 26.93 -0.4967876 -0.289008 -0.2311268 -0.3416512 -0.2373338 -0.1896129 -0.2833298 -0.3197672 3.4016060 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 -0.3907366 -0.0779713 -0.2295542 -0.1818491 -0.1437712 13.3812894 -0.2499356 -0.2773997 -0.3853039 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 -0.1135924 -0.0836036
-0.2465046 13.8620688 12.6479642 11.268887 9.81 -0.4967876 -0.289008 -0.2311268 -0.3416512 -0.2373338 -0.1896129 3.5291680 -0.3197672 -0.2939548 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 2.5590598 -0.0779713 -0.2295542 -0.1818491 -0.1437712 -0.0747251 -0.2499356 -0.2773997 -0.3853039 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 -0.1135924 -0.0836036
0.6142861 8.9493482 0.4990553 9.764512 14.97 -0.4967876 -0.289008 -0.2311268 2.9267229 -0.2373338 -0.1896129 -0.2833298 -0.3197672 -0.2939548 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 -0.3907366 -0.0779713 -0.2295542 -0.1818491 -0.1437712 -0.0747251 -0.2499356 -0.2773997 -0.3853039 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 3.4618167 -0.2281712 -0.1135924 -0.0836036
-0.4186627 0.4688345 1.0299845 63.320283 9.43 2.0127684 -0.289008 -0.2311268 -0.3416512 -0.2373338 -0.1896129 -0.2833298 -0.3197672 -0.2939548 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 -0.3907366 -0.0779713 -0.2295542 -0.1818491 -0.1437712 -0.0747251 -0.2499356 -0.2773997 2.5951421 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 -0.1135924 -0.0836036
-0.2465046 17.0189120 12.7416576 12.051163 4.75 -0.4967876 -0.289008 -0.2311268 2.9267229 -0.2373338 -0.1896129 -0.2833298 -0.3197672 -0.2939548 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 2.5590598 -0.0779713 -0.2295542 -0.1818491 -0.1437712 -0.0747251 -0.2499356 -0.2773997 -0.3853039 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 -0.1135924 -0.0836036
-3.1731929 6.1474164 11.7422615 2.483334 9.54 -0.4967876 -0.289008 -0.2311268 -0.3416512 4.2131300 -0.1896129 -0.2833298 -0.3197672 -0.2939548 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 -0.3907366 -0.0779713 -0.2295542 -0.1818491 -0.1437712 13.3812894 -0.2499356 -0.2773997 -0.3853039 -0.2955911 -0.1440667 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 -0.1135924 -0.0836036
skim_without_charts(vgsales_train_preproc)%>%
  kable(caption = "Table 17. Skimmed preprocessed training data") %>%
  kable_classic("hover", font_size = 18)
Table 17. Skimmed preprocessed training data
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p0 numeric.p25 numeric.p50 numeric.p75 numeric.p100
numeric Year 0 1 0.0000000 1.0000000 -4.5504580 -0.5908209 0.0978117 0.6142861 2.335867
numeric EU_Sales 0 1 0.0000000 1.0000000 -0.2783474 -0.2783474 -0.2409883 -0.0728723 53.929694
numeric JP_Sales 0 1 0.0000000 1.0000000 -0.2504918 -0.2504918 -0.2504918 -0.1255672 31.667721
numeric Other_Sales 0 1 0.0000000 1.0000000 -0.2847171 -0.2847171 -0.2245420 -0.1041920 50.623353
numeric NA_Sales 0 1 0.2645659 0.8341938 0.0000000 0.0000000 0.0800000 0.2400000 41.490000
numeric Genre_Action 0 1 0.0000000 1.0000000 -0.4967876 -0.4967876 -0.4967876 -0.4967876 2.012768
numeric Genre_Adventure 0 1 0.0000000 1.0000000 -0.2890080 -0.2890080 -0.2890080 -0.2890080 3.459830
numeric Genre_Fighting 0 1 0.0000000 1.0000000 -0.2311268 -0.2311268 -0.2311268 -0.2311268 4.326276
numeric Genre_Misc 0 1 0.0000000 1.0000000 -0.3416512 -0.3416512 -0.3416512 -0.3416512 2.926723
numeric Genre_Platform 0 1 0.0000000 1.0000000 -0.2373338 -0.2373338 -0.2373338 -0.2373338 4.213130
numeric Genre_Puzzle 0 1 0.0000000 1.0000000 -0.1896129 -0.1896129 -0.1896129 -0.1896129 5.273471
numeric Genre_Racing 0 1 0.0000000 1.0000000 -0.2833298 -0.2833298 -0.2833298 -0.2833298 3.529168
numeric Genre_Role.Playing 0 1 0.0000000 1.0000000 -0.3197672 -0.3197672 -0.3197672 -0.3197672 3.127019
numeric Genre_Shooter 0 1 0.0000000 1.0000000 -0.2939548 -0.2939548 -0.2939548 -0.2939548 3.401606
numeric Genre_Simulation 0 1 0.0000000 1.0000000 -0.2330806 -0.2330806 -0.2330806 -0.2330806 4.290012
numeric Genre_Sports 0 1 0.0000000 1.0000000 -0.4077842 -0.4077842 -0.4077842 -0.4077842 2.452077
numeric Genre_Strategy 0 1 0.0000000 1.0000000 -0.2065347 -0.2065347 -0.2065347 -0.2065347 4.841405
numeric Platform_X2600 0 1 0.0000000 1.0000000 -0.0831069 -0.0831069 -0.0831069 -0.0831069 12.031707
numeric Platform_X3DS 0 1 0.0000000 1.0000000 -0.1779791 -0.1779791 -0.1779791 -0.1779791 5.618177
numeric Platform_DS 0 1 0.0000000 1.0000000 -0.3907366 -0.3907366 -0.3907366 -0.3907366 2.559060
numeric Platform_GB 0 1 0.0000000 1.0000000 -0.0779713 -0.0779713 -0.0779713 -0.0779713 12.824178
numeric Platform_GBA 0 1 0.0000000 1.0000000 -0.2295542 -0.2295542 -0.2295542 -0.2295542 4.355913
numeric Platform_GC 0 1 0.0000000 1.0000000 -0.1818491 -0.1818491 -0.1818491 -0.1818491 5.498616
numeric Platform_N64 0 1 0.0000000 1.0000000 -0.1437712 -0.1437712 -0.1437712 -0.1437712 6.954930
numeric Platform_NES 0 1 0.0000000 1.0000000 -0.0747251 -0.0747251 -0.0747251 -0.0747251 13.381289
numeric Platform_PC 0 1 0.0000000 1.0000000 -0.2499356 -0.2499356 -0.2499356 -0.2499356 4.000705
numeric Platform_PS 0 1 0.0000000 1.0000000 -0.2773997 -0.2773997 -0.2773997 -0.2773997 3.604612
numeric Platform_PS2 0 1 0.0000000 1.0000000 -0.3853039 -0.3853039 -0.3853039 -0.3853039 2.595142
numeric Platform_PS3 0 1 0.0000000 1.0000000 -0.2955911 -0.2955911 -0.2955911 -0.2955911 3.382776
numeric Platform_PS4 0 1 0.0000000 1.0000000 -0.1440667 -0.1440667 -0.1440667 -0.1440667 6.940661
numeric Platform_PSP 0 1 0.0000000 1.0000000 -0.2797824 -0.2797824 -0.2797824 -0.2797824 3.573914
numeric Platform_PSV 0 1 0.0000000 1.0000000 -0.1614245 -0.1614245 -0.1614245 -0.1614245 6.194341
numeric Platform_SAT 0 1 0.0000000 1.0000000 -0.1039858 -0.1039858 -0.1039858 -0.1039858 9.615910
numeric Platform_SNES 0 1 0.0000000 1.0000000 -0.1231681 -0.1231681 -0.1231681 -0.1231681 8.118324
numeric Platform_Wii 0 1 0.0000000 1.0000000 -0.2898365 -0.2898365 -0.2898365 -0.2898365 3.449940
numeric Platform_WiiU 0 1 0.0000000 1.0000000 -0.0925490 -0.0925490 -0.0925490 -0.0925490 10.804205
numeric Platform_X360 0 1 0.0000000 1.0000000 -0.2888421 -0.2888421 -0.2888421 -0.2888421 3.461817
numeric Platform_XB 0 1 0.0000000 1.0000000 -0.2281712 -0.2281712 -0.2281712 -0.2281712 4.382317
numeric Platform_XOne 0 1 0.0000000 1.0000000 -0.1135924 -0.1135924 -0.1135924 -0.1135924 8.802685
numeric Platform_other 0 1 0.0000000 1.0000000 -0.0836036 -0.0836036 -0.0836036 -0.0836036 11.960232

Observations

  • Name was removed from the data.

  • Genre and Platform were converted to numerical dummy variables, and an “Other” category was added for Platform. No “Other” category was created for Genre, which means that all genre levels had at least 50 observations.

  • All predictors were normalised, having a mean of 0 and a standard deviation of 1.

  • No variable had a variance equal to zero, and the correlation filter did not remove any predictors.

4. Model fitting

Lasso regression

Lasso regression specification. This creates a linear regression with mixture = 1 (i.e. a Lasso regression). The penalty hyperparameter was tuned to find the value that optimises the evaluation metrics. The penalty is important for a Lasso regression because larger values shrink some of the coefficients exactly to zero, effectively removing those predictors.
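For reference, the Lasso estimates the coefficients by minimising a penalised least-squares objective of the form

\[ \sum_{i=1}^{n} \big( y_i - x_i^{\top}\beta \big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert, \]

where \(\lambda\) is the penalty being tuned (glmnet optimises an equivalently scaled version of this objective); larger \(\lambda\) values shrink more coefficients exactly to zero.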

lasso_spec <- linear_reg(
  mode = "regression",
  penalty = tune(),
  mixture = 1
) %>%
  set_engine("glmnet")

Create bootstrap resamples of the preprocessed training data. Tuning on resampled data gives more stable estimates of model performance than evaluating on a single training fit.

#Bootstrapped data
set.seed(4321)

vgsales_boots <- bootstraps(vgsales_train_preproc, times = 10)

Create a grid of penalty values. In this case, 50 levels were used.

penalty_grid <- grid_regular(penalty( range = c(-3, 0), trans = log10_trans() ), 
                             levels = 50)
penalty_grid%>%
  kable(caption = "Table 18. 50 penalty values used to tune the Lasso regression") %>%
  kable_classic("hover", font_size = 18)
Table 18. 50 penalty values used to tune the Lasso regression
penalty
0.0010000
0.0011514
0.0013257
0.0015264
0.0017575
0.0020236
0.0023300
0.0026827
0.0030888
0.0035565
0.0040949
0.0047149
0.0054287
0.0062506
0.0071969
0.0082864
0.0095410
0.0109854
0.0126486
0.0145635
0.0167683
0.0193070
0.0222300
0.0255955
0.0294705
0.0339322
0.0390694
0.0449843
0.0517947
0.0596362
0.0686649
0.0790604
0.0910298
0.1048113
0.1206793
0.1389495
0.1599859
0.1842070
0.2120951
0.2442053
0.2811769
0.3237458
0.3727594
0.4291934
0.4941713
0.5689866
0.6551286
0.7543120
0.8685114
1.0000000

Tune the penalty hyperparameter of the Lasso regression using the previous grid:

set.seed(4321)
lasso_grid <- tune_grid(lasso_spec, NA_Sales ~., vgsales_boots, grid = penalty_grid)

Visualise the values of RMSE and \(R^2\) for different penalty values:

lasso_grid %>% 
  collect_metrics() %>% 
  ggplot( aes( penalty, mean, color = .metric ) ) +
  geom_line() +
  facet_wrap( ~.metric, scales = "free", nrow = 2) +
  scale_x_log10()+ 
  theme_bw()+
  ggtitle("Metrics for different values of penalty")+
  labs(color = "Metric")
Figure 17. Line plot of the metrics for different penalty values

Identify the penalty value that minimises the RMSE:

best_lasso_rmse <- select_best( lasso_grid, "rmse" )
best_lasso_rmse%>%
  kable(caption = "Table 19. Optimal penalty value for Lasso regression") %>%
  kable_classic("hover", font_size = 18)
Table 19. Optimal penalty value for Lasso regression
penalty .config
0.0020236 Preprocessor1_Model06

Table 19 shows that the optimal penalty value, i.e. the one that minimises the RMSE, is approximately 0.002.

Define the final Lasso model using the penalty value from Table 19:

final_lasso <- finalize_model( lasso_spec, best_lasso_rmse )
final_lasso
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.00202358964772516
##   mixture = 1
## 
## Computational engine: glmnet

The Lasso regression was correctly defined with the penalty value that minimises the RMSE.

Random Forest

Random Forest specification. In this case, the mtry and min_n hyperparameters were tuned to find the values that optimise the evaluation metrics. mtry defines the number of predictors randomly sampled at each split, and min_n defines the minimum node size. To limit the computation time, the number of trees was fixed at 100 (trees = 100) rather than tuned.

rf_spec <- rand_forest( 
  mode = "regression",
  mtry = tune(),
  trees = 100,
  min_n = tune() 
) %>% 
  set_engine( "ranger", importance = "permutation")

Create a grid with different values for mtry and min_n. With 5 levels per hyperparameter, the grid has 25 combinations.

rand_spec_grid <- grid_regular( 
  finalize(mtry(), 
            vgsales_train_preproc %>% 
              select(-NA_Sales)),
  min_n(),
  levels = 5 )
rand_spec_grid%>%
  kable(caption = "Table 20. Grid of differents mtry and min_n values") %>%
  kable_classic("hover", font_size = 18)
Table 20. Grid of different mtry and min_n values
mtry min_n
1 2
10 2
20 2
29 2
39 2
1 11
10 11
20 11
29 11
39 11
1 21
10 21
20 21
29 21
39 21
1 30
10 30
20 30
29 30
39 30
1 40
10 40
20 40
29 40
39 40

Tune the Random Forest over the 25 combinations using the bootstrap resamples created earlier (i.e. vgsales_boots):

doParallel::registerDoParallel() 
set.seed(4321)
rf_grid <- tune_grid( rf_spec, NA_Sales ~ ., 
                      vgsales_boots, 
                      grid = rand_spec_grid )

Visualise the RMSE and \(R^2\) for the different values of mtry and min_n:

rf_grid %>% 
 collect_metrics() %>% 
  mutate( min_n = as.factor( min_n ) ) %>% 
  ggplot( aes( x = mtry, y = mean, colour = min_n ) ) +
  geom_point( size = 2 ) +
  geom_line( alpha = 0.75 ) +
  facet_wrap( ~ .metric, scales = "free", nrow = 3 )+
  theme_bw()+
  ggtitle("Mean RMSE and R-Squared for mtry and min_n")+
  labs(colour = "min_n")
Figure 18. Line plot of the metrics for different mtry and min_n values

Identify the values of mtry and min_n that minimise the RMSE:

best_rf_rmse <- select_best( rf_grid, "rmse" )
best_rf_rmse%>%
  kable(caption = "Table 21. Optimal mtry and min_n values for Random Forest") %>%
  kable_classic("hover", font_size = 18)
Table 21. Optimal mtry and min_n values for Random Forest
mtry min_n .config
20 2 Preprocessor1_Model03

Figure 18 and Table 21 show that the mtry and min_n values that minimise the RMSE are 20 and 2, respectively.
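
Before committing to this combination, it can be worth checking how close the runner-up candidates were; a short sketch using tune's show_best():

# Top five mtry/min_n combinations ranked by mean RMSE across resamples
show_best(rf_grid, metric = "rmse", n = 5)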

Define the final Random Forest model using the mtry and min_n values from Table 21:

final_rf <- finalize_model( rf_spec, best_rf_rmse )
final_rf
## Random Forest Model Specification (regression)
## 
## Main Arguments:
##   mtry = 20
##   trees = 100
##   min_n = 2
## 
## Engine-Specific Arguments:
##   importance = permutation
## 
## Computational engine: ranger

The Random Forest is now finalised with the mtry and min_n values that minimise the RMSE.

Cross Validation

Perform 10-fold cross-validation so that each model's performance is evaluated on several held-out folds of the training data rather than on a single split. This gives a more reliable comparison between the models and guards against overfitting.

set.seed(4321)
vgsales_cv <- vfold_cv(vgsales_train_preproc, v = 10)

Evaluate metrics of the Lasso regression using the cross-validated resamples:

lasso_cv <- fit_resamples(final_lasso, NA_Sales~., vgsales_cv)
lasso_cv %>%
  collect_metrics()%>%
  kable(caption = "Table 22. RMSE and R-squared for Lasso regression") %>%
  kable_classic("hover", font_size = 18)
Table 22. RMSE and R-squared for Lasso regression
.metric .estimator mean n std_err .config
rmse standard 0.4460320 10 0.0571312 Preprocessor1_Model1
rsq standard 0.6672145 10 0.0452453 Preprocessor1_Model1

Evaluate metrics of the Random Forest using the cross-validated resamples:

set.seed(4321)
rf_cv <- fit_resamples(final_rf, NA_Sales ~ . , vgsales_cv)
rf_cv %>%
  collect_metrics()%>%
  kable(caption = "Table 23. RMSE and R-squared for Random Forest") %>%
  kable_classic("hover", font_size = 18)
Table 23. RMSE and R-squared for Random Forest
.metric .estimator mean n std_err .config
rmse standard 0.3760635 10 0.0741628 Preprocessor1_Model1
rsq standard 0.8059027 10 0.0294340 Preprocessor1_Model1

Observations

  • Comparing Table 22 with Table 23, the model with the lower RMSE and the higher R-squared is the Random Forest. The RMSE measures the typical distance between the predicted and the observed values, so the model with the lower RMSE produces predictions closer to the observations. The \(R^2\), in turn, measures the proportion of variance in the response variable explained by the predictors, so a higher \(R^2\) indicates a better model (formal definitions below). Hence, the Random Forest performs best on both metrics.
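
For reference, the root mean squared error is defined as

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \]

where \(y_i\) and \(\hat{y}_i\) are the observed and predicted sales for game \(i\). The \(R^2\) reported by yardstick is the squared correlation between the observed and predicted values, which estimates the proportion of variance in the outcome captured by the model.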

Since the Random Forest was the better-performing model, it is used as the final model to predict the sales in North America. Fit the Random Forest using the following code:

set.seed(4321)
vgsales_rf <- final_rf %>%
  fit(NA_Sales ~ ., data = vgsales_train_preproc)
vgsales_rf
## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~20L,      x), num.trees = ~100, min.node.size = min_rows(~2L, x), importance = ~"permutation",      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  100 
## Sample size:                      12245 
## Number of independent variables:  39 
## Mtry:                             20 
## Target node size:                 2 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.1761053 
## R squared (OOB):                  0.7469313
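
Note that the out-of-bag \(R^2\) printed above (0.747) is somewhat lower than the cross-validated estimate (0.806); both out-of-bag values can be extracted programmatically if needed. A minimal sketch, relying on the fields of the underlying ranger object:

# Out-of-bag error estimates from the fitted ranger object
vgsales_rf$fit$prediction.error   # OOB mean squared error
vgsales_rf$fit$r.squared          # OOB R-squared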

5. Model Evaluation

Variable Importance Plot

The following plot identifies the most important predictors in the model:

vgsales_rf %>%
  vip(all_permutations = TRUE, mapping = aes_string(fill = "Importance")) +
  scale_fill_distiller(palette = "Reds", direction = 1)+
  ggtitle("Variable Importance Scores for predictors") + 
  theme_bw()
Figure 19. Bar chart of the most important predictors from the Random Forest model

Figure 19 shows that the most important predictors for the Random Forest model were sales in Europe and sales in other regions of the world. Figure 12B shows that there is a slightly linear relationship between sales in North America and sales in Europe which could explain the high importance of this variable in the model.

Predictions of testing set

Create model predictions with the testing set (i.e. vgsales_test_preproc):

vgsales_preds <- predict(vgsales_rf,
                         new_data = vgsales_test_preproc) %>%
  bind_cols(vgsales_test_preproc %>% select(NA_Sales))
vgsales_preds%>%
  head()%>%
  kable(caption = "Table 24. First 6 rows of the predicted and observed sales in North America for the testing set") %>%
  kable_classic("hover", font_size = 18)
Table 24. First 6 rows of the predicted and observed sales in North America for the testing set
.pred NA_Sales
2.964267 26.93
8.900740 9.81
7.875325 14.97
6.081783 9.43
9.901490 4.75
13.629000 9.54

Visualise observed versus predicted values:

vgsales_preds %>% 
  ggplot(aes( x = .pred, y = NA_Sales )) +
  geom_point(col = "#2F5755", alpha = 0.6) +
  xlab("Prediction") +
  ylab("Sales in North America") +
  ggtitle("Observed versus predicted values of sales in North America")+
  geom_abline(intercept = 0, slope = 1, colour = "#F7B15B") +
  theme_bw()
Figure 20. Scatter plot of the observed versus predicted values of sales in North America

Determine the metrics of the model with the testing set:

vgsales_preds %>% 
  metrics(truth = NA_Sales, estimate = .pred)%>%
  kable(caption = "Table 25. Metrics of the Random Forest on the testing set") %>%
  kable_classic("hover", font_size = 18)
Table 25. Metrics of the Random Forest on the testing set
.metric .estimator .estimate
rmse standard 0.4679026
rsq standard 0.6433416
mae standard 0.0802173

Observations

  • Figure 20 shows that the Random Forest predicts the observed values of sales in North America on the testing set with only moderate accuracy.

  • Table 25 is consistent with Figure 20: the metrics show a moderate fit to the testing observations (RMSE = 0.468, R-squared = 0.643). These metrics are worse than those obtained through cross-validation on the training set (RMSE = 0.376, R-squared = 0.806), which could be explained by the high variability in the data, reflected in the large number of outliers; a quick way to probe this is sketched below. It also suggests that the training set may not be a fully representative sample of the whole data.
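
One hedged way to gauge how much the extreme games drive the test error is to recompute the metrics after dropping the largest residuals (the 1% cutoff below is an arbitrary illustration, not part of the reported results):

# Recompute test metrics without the 1% largest absolute residuals
vgsales_preds %>%
  mutate(abs_resid = abs(NA_Sales - .pred)) %>%
  filter(abs_resid < quantile(abs_resid, 0.99)) %>%
  metrics(truth = NA_Sales, estimate = .pred)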

Predictions of The Fatal Empire

Create a tibble to predict the sales in North America for The Fatal Empire game:

# Create tibble with data from The Fatal Empire game
new_game <- tibble(
  Name = "The Fatal Empire", #This will be removed when preprocessing
  Year = 2022,
  Genre = "Role-Playing",
  Platform = "PS4",
  JP_Sales = 2.58,
  EU_Sales = 0.53,
  Other_Sales = 0.1
)

new_game%>%
  kable(caption = "Table 26. The Fatal Empire data for predicting sales in North America") %>%
  kable_classic("hover", font_size = 18)
Table 26. The Fatal Empire data for predicting sales in North America
Name Year Genre Platform JP_Sales EU_Sales Other_Sales
The Fatal Empire 2022 Role-Playing PS4 2.58 0.53 0.1

Preprocess the data using the recipe vgsales_recipe and the bake function, which applies all of the steps defined in the Preprocessing section to the data from The Fatal Empire.

new_game_preproc <- bake(vgsales_recipe, new_data = new_game)
new_game_preproc%>%
  kable(caption = "Table 27. The Fatal Empire preprocessed data for predicting sales in North America") %>%
  kable_classic("hover", font_size = 18)
Table 27. The Fatal Empire preprocessed data for predicting sales in North America
Year EU_Sales JP_Sales Other_Sales Genre_Action Genre_Adventure Genre_Fighting Genre_Misc Genre_Platform Genre_Puzzle Genre_Racing Genre_Role.Playing Genre_Shooter Genre_Simulation Genre_Sports Genre_Strategy Platform_X2600 Platform_X3DS Platform_DS Platform_GB Platform_GBA Platform_GC Platform_N64 Platform_NES Platform_PC Platform_PS Platform_PS2 Platform_PS3 Platform_PS4 Platform_PSP Platform_PSV Platform_SAT Platform_SNES Platform_Wii Platform_WiiU Platform_X360 Platform_XB Platform_XOne Platform_other
2.680184 0.7116686 7.807139 0.3170332 -0.4967876 -0.289008 -0.2311268 -0.3416512 -0.2373338 -0.1896129 -0.2833298 3.127019 -0.2939548 -0.2330806 -0.4077842 -0.2065347 -0.0831069 -0.1779791 -0.3907366 -0.0779713 -0.2295542 -0.1818491 -0.1437712 -0.0747251 -0.2499356 -0.2773997 -0.3853039 -0.2955911 6.940661 -0.2797824 -0.1614245 -0.1039858 -0.1231681 -0.2898365 -0.092549 -0.2888421 -0.2281712 -0.1135924 -0.0836036

Create predictions of sales in North America using the final model (i.e. Random forest model) and the preprocessed data from The Fatal Empire.

new_game_preds <- predict(
  vgsales_rf,
  new_data = new_game_preproc
)

new_game_preds%>%
  kable(caption = "Table 28. The Fatal Empire predicted sales in North America") %>%
  kable_classic("hover", font_size = 18)
Table 28. The Fatal Empire predicted sales in North America
.pred
0.7790733

Observations

  • The final model predicted that the sales in North America for The Fatal Empire will be 0.779 million (Table 28).

  • It is important to highlight that the model has only moderate prediction accuracy (Figure 20 and Table 25), which means the predicted value should be treated as a reference point, since the actual value may differ from it.

  • To improve the accuracy of the model, more variables could be considered, such as publisher, targeted age group, duration of the game, and console exclusivity. Additionally, since Year was the fifth most important variable (Figure 19), it would be valuable to impute the missing years for the 271 deleted observations instead of removing them; the release year of each game could be acquired from other data sets and merged into the current data (a hypothetical recipes-based alternative is sketched below). Furthermore, the data had a high number of outliers that could be studied further in order to increase the accuracy of the model.
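
As a concrete illustration of the imputation idea above, a recipes step could fill the missing years instead of deleting those rows. This is a hypothetical sketch, not the approach used in this report; it assumes a raw training split named vgsales_train with Year already converted to numeric, and median imputation stands in for the richer merge with external data sets:

# Hypothetical alternative to deleting games with a missing Year:
# impute the training-set median year inside the preprocessing recipe
library(recipes)

impute_rec <- recipe(NA_Sales ~ ., data = vgsales_train) %>%
  step_impute_median(Year)   # replaces NA years with the median Year
# ...the remaining steps from the Preprocessing section would follow...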