This study predicts that “The Fatal Empire” will sell 0.779 million copies in North America. The prediction was obtained by evaluating the candidate models on testing data and then using the best-performing model to estimate the number of sales in North America for The Fatal Empire. The variable with the highest importance in the final model was the number of sales in Europe, followed by the sales in other regions of the world.
The aim of this report is to predict the number of sales in North America for the recently released PlayStation 4 game “The Fatal Empire”. The game has sold 2.58 million copies in Japan, 0.53 million copies in Europe, and 0.1 million copies in other world regions.
In this report, the prediction of sales in North America was performed using two different predictive models (i.e. Lasso regression and Random Forest). The final model was selected as the one with the best performance.
In this section, the data types and missing values of the different variables were identified. The variables in the data are described in Table 1 with the respective data types. The variable Year
was transformed from categorical to a numerical variable, and all the observations that had missing values for Year
were deleted. Furthermore, the Publisher
of the game “The Fatal Empire” is not available for the final prediction. Thus, this variable was deleted from the data.
A Parallel Coordinate Plot and a Principal Component Analysis were performed to explore the data. The former did not show groupings of games (i.e. clusters) for the sales in different regions (Figure 6). The latter identified important variables in the different components (i.e. Sales, Year, Genre, Platform) (Figure 9). However, the reduction of dimensions was not significant.
The total number of sales in North America was highest for the year 2008, the Action genre, and the Xbox 360 platform. The relationship between the outcome variable and the sales in different regions varied, but only the sales in Europe showed a slightly linear relationship (Figure 12). It is important to highlight that all relationships showed outliers for sales in North America, which could have an effect on the final predictive model.
The data was split into training and testing sets. The training set included 12245 observations, and the testing set included 4082. The preprocessing steps applied were:
Remove Name
due to its overfitting effect on the outcome variable.
Lump infrequent categories of Genre
and Platform
into an “Other” category to avoid a large number of dimensions when fitting the model.
Random Forest showed better accuracy metrics in the cross-validation analysis than the Lasso regression. Random Forest had an r-squared of 0.806 and a root mean squared error of 0.376 (Table 23). In contrast, the Lasso regression had an r-squared of 0.667 and a root mean squared error of 0.446 (Table 22). Thus, the Random Forest model was chosen to perform the predictions of sales in North America for “The Fatal Empire”.
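The split-and-compare workflow described above can be sketched with tidymodels as follows. This is a minimal sketch run on a small simulated stand-in for the dataset; the column names follow the report, but the seed, engines, penalty value, and tuning settings are assumptions, not the report's actual code.

```r
library(tidymodels)

# Minimal sketch of the split-and-compare workflow; seed, engines and the
# fixed lasso penalty are assumptions, not the report's actual settings.
set.seed(2023)
toy <- tibble(
  EU_Sales    = runif(200, 0, 1),
  JP_Sales    = runif(200, 0, 1),
  Other_Sales = runif(200, 0, 0.5),
  Year        = sample(1990:2016, 200, replace = TRUE)
) %>%
  mutate(NA_Sales = 0.8 * EU_Sales + 0.3 * JP_Sales + rnorm(200, sd = 0.1))

split <- initial_split(toy, prop = 0.75)   # the report's split: 12245 / 4082
train <- training(split)
test  <- testing(split)

rec <- recipe(NA_Sales ~ ., data = train) %>%
  step_normalize(all_numeric_predictors())

lasso_spec <- linear_reg(penalty = 0.01, mixture = 1) %>% set_engine("glmnet")
rf_spec    <- rand_forest(mode = "regression", trees = 500) %>% set_engine("ranger")

folds <- vfold_cv(train, v = 10)
lasso_res <- fit_resamples(workflow(rec, lasso_spec), folds)
rf_res    <- fit_resamples(workflow(rec, rf_spec), folds)

collect_metrics(lasso_res)  # rmse and rsq, as in Table 22
collect_metrics(rf_res)     # rmse and rsq, as in Table 23
```

With the report's data the same structure would use `vgsales` in place of the simulated `toy` tibble.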
The most important variable for predicting the sales in North America with the Random forest model was the number of sales in Europe, which is consistent with the slightly linear relationship (Figure 12B).
Random Forest showed a lower accuracy on the testing set than on the training set, with an r-squared of 0.643 and a root mean squared error of 0.468 (Table 25). This could be explained by the high variability in the number of sales for each game in North America, consistent with the number of outliers identified. The predictors of the model explained 64.3% of the variance in sales in North America, which is a moderate prediction accuracy.
The Random Forest model predicted 0.779 million sales in North America for The Fatal Empire. It is important to consider that the model had moderate accuracy, which could be improved by adding other variables, recovering the missing values for year in the data, and/or further researching the reason behind the high number of outliers.
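The final prediction step could look like the sketch below. The regional sales figures come from the brief; the `Year` and `Genre` values are placeholder assumptions, since the report does not state them, and `final_fit` stands for the fitted Random Forest workflow.

```r
library(tibble)

# Hypothetical new observation for "The Fatal Empire". The regional sales
# come from the brief; Year and Genre are placeholder assumptions.
fatal_empire <- tibble(
  Name        = "The Fatal Empire",
  Platform    = "PS4",
  Year        = 2016,            # assumption, not given in the report
  Genre       = "Role-Playing",  # assumption, not given in the report
  JP_Sales    = 2.58,
  EU_Sales    = 0.53,
  Other_Sales = 0.10
)

# With the fitted Random Forest workflow (here called `final_fit`):
# predict(final_fit, new_data = fatal_empire)  # reported result: 0.779 million
```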
The following table shows the variables presented in the data with the respective data types.
Table 1. Description and type of the variables in the vgsales
dataset.
Variable | Description | Type |
---|---|---|
Name | The games name | Categorical |
Platform | Platform of the games release (i.e. PC, PS4, XBOX, etc) | Categorical |
Year | Year of the game’s release | Discrete |
Genre | Genre of the game | Categorical |
Publisher | Publisher of the game | Categorical |
NA_Sales | Sales in North America (in millions) | Continuous |
EU_Sales | Sales in Europe (in millions) | Continuous |
JP_Sales | Sales in Japan (in millions) | Continuous |
Other_Sales | Sales in the rest of the world (in millions) | Continuous |
# Read the data and transform it to a tibble
vgsales <- read.csv("vgsales.csv", header = TRUE)
vgsales <- tibble(vgsales)
vgsales %>%
  head() %>%
  kable(caption = "Table 2. First 6 rows of the vgsales dataset") %>%
  kable_classic("hover", font_size = 18)
Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales |
---|---|---|---|---|---|---|---|---|
Wii Sports | Wii | 2006 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 |
Super Mario Bros. | NES | 1985 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 |
Mario Kart Wii | Wii | 2008 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 |
Wii Sports Resort | Wii | 2009 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 |
Pokemon Red/Pokemon Blue | GB | 1996 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 |
Tetris | GB | 1989 | Puzzle | Nintendo | 23.20 | 2.26 | 4.22 | 0.58 |
When comparing Table 2 with the description of the data in Kaggle, it is possible to see that the games’ ranking and world sales are missing.
dim(vgsales) #dimension
## [1] 16598 9
skim_without_charts(vgsales)%>%
kable(caption = "Table 3. Skimmed vgsales data") %>%
kable_classic("hover", font_size = 18)
skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
character | Name | 0 | 1 | 1 | 132 | 0 | 11493 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Platform | 0 | 1 | 2 | 4 | 0 | 31 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Year | 0 | 1 | 3 | 4 | 0 | 40 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Genre | 0 | 1 | 4 | 12 | 0 | 12 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Publisher | 0 | 1 | 3 | 38 | 0 | 579 | 0 | NA | NA | NA | NA | NA | NA | NA |
numeric | NA_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.2646674 | 0.8166830 | 0 | 0 | 0.08 | 0.24 | 41.49 |
numeric | EU_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.1466520 | 0.5053512 | 0 | 0 | 0.02 | 0.11 | 29.02 |
numeric | JP_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.0777817 | 0.3092906 | 0 | 0 | 0.00 | 0.04 | 10.22 |
numeric | Other_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.0480630 | 0.1885884 | 0 | 0 | 0.01 | 0.04 | 10.57 |
Observations
There are 16598 observations and 9 variables. The outcome variable is NA_Sales
whilst all other variables are predictors.
There are 5 categorical variables (i.e. Name, Platform, Year, Genre, Publisher). From these variables, Year
should be a numerical variable.
There are 4 numerical variables (i.e. NA_Sales, EU_Sales, JP_Sales, Other_Sales). All these variables have the correct data type.
None of the variables has missing data.
There are differences in the mean and standard deviation of the numerical variables. For this reason, the numerical variables should be normalised for the final model.
Check the distribution of the Year
variable.
ggplot(data = vgsales, aes(as.factor(Year))) +
geom_histogram(stat = "count", fill = "#2F5755") +
scale_fill_distiller() +
xlab("Year") +
ylab("Count") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
Figure 1. Histogram of the count of observations by year
Observations
There are no observations for the years 2018 and 2019. These observations could be among the “N/A” values of the Year
variable.
There are missing values that weren’t captured by the skim
function, since the string “N/A” is not the same as an NA value in R.
The distribution is unimodal and left skewed with more observations for recent years.
Most of the observations were for year 2009.
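The point about “N/A” strings can be demonstrated with a minimal example; `na_if()` from dplyr (already loaded for this analysis) converts the string into a genuine NA:

```r
library(dplyr)

# The string "N/A" is ordinary text, so is.na() does not flag it
x <- c("2006", "N/A", "2010")
sum(is.na(x))              # 0: "N/A" is not detected as missing

# na_if() replaces the string with a real NA, which skim() would count
x_clean <- na_if(x, "N/A")
sum(is.na(x_clean))        # 1
as.numeric(x_clean)        # 2006 NA 2010
```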
Since there are missing values in Year, the observations were filtered to those missing values to further investigate the reason behind them.
#Obtain the dimension of the observations with missing Year
dim(vgsales %>%
filter(Year == "N/A"))
## [1] 271 9
#Obtain a table of the observations with missing Year
vgsales %>%
  filter(Year == "N/A") %>%
  head() %>%
  kable(caption = "Table 4. First 6 rows of observations with missing year") %>%
  kable_classic("hover", font_size = 18)
Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales |
---|---|---|---|---|---|---|---|---|
Madden NFL 2004 | PS2 | N/A | Sports | Electronic Arts | 4.26 | 0.26 | 0.01 | 0.71 |
FIFA Soccer 2004 | PS2 | N/A | Sports | Electronic Arts | 0.59 | 2.36 | 0.04 | 0.51 |
LEGO Batman: The Videogame | Wii | N/A | Action | Warner Bros. Interactive Entertainment | 1.86 | 1.02 | 0.00 | 0.29 |
wwe Smackdown vs. Raw 2006 | PS2 | N/A | Fighting | N/A | 1.57 | 1.02 | 0.00 | 0.41 |
Space Invaders | 2600 | N/A | Shooter | Atari | 2.36 | 0.14 | 0.00 | 0.03 |
Rock Band | X360 | N/A | Misc | Electronic Arts | 1.93 | 0.34 | 0.00 | 0.21 |
Observations
There are 271 observations with missing values in the Year
variable.
Table 4 shows that some games contain the year in the Name
variable (e.g. Madden NFL 2004).
Check the distribution of sales in different regions.
p1 <- ggplot(data = vgsales, aes(x = NA_Sales)) +
  geom_histogram(fill = "#2F5755") +
  xlab("Sales in North America") +
  ylab("Count") +
  theme_bw()
p2 <- ggplot(data = vgsales, aes(JP_Sales)) +
  geom_histogram(fill = "#2F5755") +
  xlab("Sales in Japan") +
  ylab("Count") +
  theme_bw()
p3 <- ggplot(data = vgsales, aes(EU_Sales)) +
  geom_histogram(fill = "#2F5755") +
  xlab("Sales in Europe") +
  ylab("Count") +
  theme_bw()
p4 <- ggplot(data = vgsales, aes(Other_Sales)) +
  geom_histogram(fill = "#2F5755") +
  xlab("Sales in the rest of the world") +
  ylab("Count") +
  theme_bw()
# Use patchwork
(p1/p2/p3/p4) +
  plot_annotation(tag_levels = "A",
                  title = "Histogram of sales in different regions",
                  theme = theme(plot.title = element_text(size = 18)))
Figure 2. Histogram of the count of observations by sales for each region
Observations
Most of the sales are zero for all regions (i.e. North America, Japan, Europe and the rest of the world).
The distribution of sales is unimodal and right skewed for all regions. This means that most of the games in the dataset had a low number of sales.
Analyse the number of games released by Platform
.
count_released <- vgsales %>%
  group_by(Platform) %>%
  summarise(released_games = n()) %>%
  arrange(desc(released_games)) %>%
  mutate(pal = case_when(released_games >= 1000 ~ "#2F5755",
                         released_games >= 500 ~ "#F7B15B",
                         released_games >= 250 ~ "#F65C51",
                         TRUE ~ "grey90"))
ggplot(data = count_released, aes(y = reorder(Platform, released_games), x = released_games, fill = pal)) +
  geom_col(alpha = .8) +
  xlab("Count") +
  ylab("Platform") +
  ggtitle("Platform by number of released games") +
  scale_fill_identity("Released games", labels = c(">1000", ">500", ">250", "<250"), guide = "legend", limits = c("#2F5755", "#F7B15B", "#F65C51", "grey90")) +
  theme_bw() +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 3. Bar chart of the number of released games per platform
Create a table with the top 5 Platform
:
count_released %>%
  select(Platform, released_games) %>%
  head(5) %>%
  kable(caption = "Table 5. Top 5 platforms with the highest number of released games") %>%
  kable_classic("hover", font_size = 18)
Platform | released_games |
---|---|
DS | 2163 |
PS2 | 2161 |
PS3 | 1329 |
Wii | 1325 |
X360 | 1265 |
Create a table with the last 5 Platform
:
count_released %>%
  select(Platform, released_games) %>%
  tail(5) %>%
  kable(caption = "Table 6. Last 5 platforms with the lowest number of released games") %>%
  kable_classic("hover", font_size = 18)
Platform | released_games |
---|---|
WS | 6 |
3DO | 3 |
TG16 | 2 |
GG | 1 |
PCFX | 1 |
Observations
Figure 3 and Table 5 show that the number of games released is the highest for Nintendo DS (2163 games released), followed by PlayStation 2 (2161 games released).
Figure 3 and Table 6 show that the number of games released is the lowest for Game Gear (GG) and PC-FX (PCFX), both having 1 game released (Sonic the Hedgehog 2 and Blue Breaker: Ken Yorimo Hohoemi, respectively).
Identify number of released games per Publisher
count_publisher <- vgsales %>%
  group_by(Publisher) %>%
  summarise(released_games = n()) %>%
  arrange(desc(released_games)) %>%
  mutate(pal = case_when(released_games >= 1000 ~ "#2F5755",
                         released_games >= 750 ~ "#F7B15B",
                         released_games >= 500 ~ "#F65C51",
                         TRUE ~ "grey90"))
ggplot(data = count_publisher %>% head(10), aes(y = reorder(Publisher, released_games), x = released_games, fill = pal)) +
  geom_col(alpha = .8) +
  xlab("Count") +
  ylab("Publisher") +
  ggtitle("Top 10 Publishers by number of released games") +
  scale_fill_identity("Released games", labels = c(">1000", ">750", ">500", "<500"), guide = "legend", limits = c("#2F5755", "#F7B15B", "#F65C51", "grey90")) +
  theme_bw() +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 4. Bar chart of the number of released games per publisher
count_publisher %>%
  select(Publisher, released_games) %>%
  head(10) %>%
  kable(caption = "Table 7. Summary table of the number of released games for the top 10 publishers") %>%
  kable_classic("hover", font_size = 18)
Publisher | released_games |
---|---|
Electronic Arts | 1351 |
Activision | 975 |
Namco Bandai Games | 932 |
Ubisoft | 921 |
Konami Digital Entertainment | 832 |
THQ | 715 |
Nintendo | 703 |
Sony Computer Entertainment | 683 |
Sega | 639 |
Take-Two Interactive | 413 |
Observations
Figure 4 and Table 7 show that the Publisher that had the highest number of released games was Electronic Arts (1351 games released), followed by Activision (975 games released). Electronic Arts has released iconic video games such as FIFA, The Sims, Battlefield, and others. Activision is known for famous video games such as Call of Duty, Doom, Crash Bandicoot, and others.
Some publishers that are also console makers, such as Nintendo, Sony Computer Entertainment and Sega, appear lower on the list.
Check the number of released games per Genre
count_genre <- vgsales %>%
  group_by(Genre) %>%
  summarise(released_games = n()) %>%
  arrange(desc(released_games)) %>%
  mutate(pal = case_when(released_games >= 1500 ~ "#2F5755",
                         released_games >= 1000 ~ "#F7B15B",
                         released_games >= 750 ~ "#F65C51",
                         TRUE ~ "grey90"))
ggplot(data = count_genre, aes(y = reorder(Genre, released_games), x = released_games, fill = pal)) +
  geom_col(alpha = .8) +
  xlab("Count") +
  ylab("Genre") +
  ggtitle("Number of released games by genre") +
  scale_fill_identity("Released games", labels = c(">1500", ">1000", ">750", "<750"), guide = "legend", limits = c("#2F5755", "#F7B15B", "#F65C51", "grey90")) +
  theme_bw() +
  theme(
    panel.grid.major.y = element_blank()
  )
Figure 5. Bar chart of the number of released games per genre
count_genre %>%
  select(Genre, released_games) %>%
  kable(caption = "Table 8. Summary table of the number of released games per genre") %>%
  kable_classic("hover", font_size = 18)
Genre | released_games |
---|---|
Action | 3316 |
Sports | 2346 |
Misc | 1739 |
Role-Playing | 1488 |
Shooter | 1310 |
Adventure | 1286 |
Racing | 1249 |
Platform | 886 |
Simulation | 867 |
Fighting | 848 |
Strategy | 681 |
Puzzle | 582 |
genre_unique <- vgsales %>%
  distinct(Name, Genre)
genre_unique %>%
  group_by(Genre) %>%
  summarise(released_games = n()) %>%
  arrange(desc(released_games)) %>%
  kable(caption = "Table 9. Summary table of the number of unique released games per genre") %>%
  kable_classic("hover", font_size = 18)
Genre | released_games |
---|---|
Action | 1925 |
Sports | 1379 |
Misc | 1327 |
Role-Playing | 1219 |
Adventure | 1048 |
Shooter | 817 |
Racing | 772 |
Simulation | 726 |
Fighting | 612 |
Platform | 589 |
Strategy | 584 |
Puzzle | 499 |
Observations
Figure 5 and Table 8 show that the Action genre has the highest number of released games. It is important to highlight that the same game can be released on several Platforms. For this reason, the count of released games per genre double-counts observations.
Table 9 shows the actual number of unique released games per genre. In this table, the Adventure genre has more released games than Shooter. Additionally, the Platform genre dropped two places on the list, having fewer released games than Simulation and Fighting.
In the previous section, it was identified that the Year
column has the wrong data type and has missing values. The following code aims to change the data type of Year
to numeric and delete the missing values. Additionally, since information about the Publisher
of The Fatal Empire is not available, it would not be possible to use this variable for predicting the sales in North America. Thus, the Publisher
variable was removed.
vgsales <- vgsales %>%
  mutate(Year = as.numeric(Year)) %>% # Transform year to numeric
  select(-Publisher) %>%              # Publisher is not considered
  drop_na()                           # Remove NAs from the data
vgsales %>%
  head() %>%
  kable(caption = "Table 10. First 6 rows of the transformed vgsales dataset") %>%
  kable_classic("hover", font_size = 18)
Name | Platform | Year | Genre | NA_Sales | EU_Sales | JP_Sales | Other_Sales |
---|---|---|---|---|---|---|---|
Wii Sports | Wii | 2006 | Sports | 41.49 | 29.02 | 3.77 | 8.46 |
Super Mario Bros. | NES | 1985 | Platform | 29.08 | 3.58 | 6.81 | 0.77 |
Mario Kart Wii | Wii | 2008 | Racing | 15.85 | 12.88 | 3.79 | 3.31 |
Wii Sports Resort | Wii | 2009 | Sports | 15.75 | 11.01 | 3.28 | 2.96 |
Pokemon Red/Pokemon Blue | GB | 1996 | Role-Playing | 11.27 | 8.89 | 10.22 | 1.00 |
Tetris | GB | 1989 | Puzzle | 23.20 | 2.26 | 4.22 | 0.58 |
dim(vgsales) #dimension
## [1] 16327 8
skim_without_charts(vgsales)%>%
kable(caption = "Table 11. Skimmed vgsales transformed data") %>%
kable_classic("hover", font_size = 18)
skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
character | Name | 0 | 1 | 1 | 132 | 0 | 11360 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Platform | 0 | 1 | 2 | 4 | 0 | 31 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Genre | 0 | 1 | 4 | 12 | 0 | 12 | 0 | NA | NA | NA | NA | NA | NA | NA |
numeric | Year | 0 | 1 | NA | NA | NA | NA | NA | 2006.4064433 | 5.8289811 | 1980 | 2003 | 2007.00 | 2010.00 | 2020.00 |
numeric | NA_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.2654150 | 0.8215909 | 0 | 0 | 0.08 | 0.24 | 41.49 |
numeric | EU_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.1475544 | 0.5087657 | 0 | 0 | 0.02 | 0.11 | 29.02 |
numeric | JP_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.0786611 | 0.3115570 | 0 | 0 | 0.00 | 0.04 | 10.22 |
numeric | Other_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.0483255 | 0.1898854 | 0 | 0 | 0.01 | 0.04 | 10.57 |
Observations
There are 8 variables instead of 9 since Publisher
was removed.
There are 16327 observations instead of 16598 since the NAs from Year
were removed.
Table 11 shows that the variable Year
is a numerical variable instead of a categorical variable.
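One caveat worth noting about the `as.numeric(Year)` step: it behaves as expected here because Year is stored as character. On a factor, base R's `as.numeric()` would silently return the internal level codes instead of the years:

```r
# as.numeric() on a character year gives the year itself ...
as.numeric("2006")                        # 2006

# ... but on a factor it returns the internal level code
as.numeric(factor("2006"))                # 1

# the safe idiom for factors converts to character first
as.numeric(as.character(factor("2006")))  # 2006
```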
Exploratory Data Analysis is required to increase the performance and accuracy of predictive models. In this case, this process allows us to determine the relationships between the predictor variables and the outcome variable, identify outliers in the data, and detect anomalies. The aim of this section is to gain a better understanding of the data and identify which transformations are required.
#Create top 10 platforms
top_10 <- vgsales %>%
  mutate(Platform = fct_lump(Platform, n = 10)) %>%
  count(Platform, sort = TRUE)
#Use pivot_longer to transform the data to long format
vgsales_pcp <- vgsales %>%
  select(-Year) %>% # Exclude year due to difference in scale with the sales
  mutate(ID = row_number(), Platform = case_when(Platform %in% unique(top_10$Platform) ~ Platform, TRUE ~ "Other Platform")) %>%
  pivot_longer(cols = NA_Sales:Other_Sales)
vgsales_pcp %>%
  head() %>%
  kable(caption = "Table 12. Data transformed to a long format") %>%
  kable_classic("hover", font_size = 18)
Name | Platform | Genre | ID | name | value |
---|---|---|---|---|---|
Wii Sports | Wii | Sports | 1 | NA_Sales | 41.49 |
Wii Sports | Wii | Sports | 1 | EU_Sales | 29.02 |
Wii Sports | Wii | Sports | 1 | JP_Sales | 3.77 |
Wii Sports | Wii | Sports | 1 | Other_Sales | 8.46 |
Super Mario Bros. | Other Platform | Platform | 2 | NA_Sales | 29.08 |
Super Mario Bros. | Other Platform | Platform | 2 | EU_Sales | 3.58 |
vgsales_pcp %>%
  ggplot(aes(x = name, y = value)) +
  geom_line(aes(group = ID), color = "#2F5755") +
  ggtitle("Parallel Coordinate Plot") +
  theme_bw()
Figure 6. Parallel Coordinate Plot of Number of Sales
The Parallel Coordinate Plot does not show a clear grouping for the sales in different regions. For all regions, most of the sales were between 0 and 5 million. However, there are some observations for Europe and North America with a higher number of sales.
Check if the higher observations are related to the Genre
of the game.
vgsales_pcp %>%
  ggplot(aes(x = name, y = value, colour = Genre)) +
  geom_line(aes(group = ID)) +
  ggtitle("Parallel Coordinate Plot by Genre") +
  theme_bw()
Figure 7. Parallel Coordinate Plot of Number of Sales by Genre
The highest number of sales for Europe and North America corresponds to the Sports genre. However, several genres show high sales values for North America, and they do not form a clear grouping.
Check if the higher observations are related to the Platform
of the game.
vgsales_pcp %>%
  ggplot(aes(x = name, y = value, colour = Platform)) +
  geom_line(aes(group = ID)) +
  ggtitle("Parallel Coordinate Plot by Platform") +
  theme_bw()
Figure 8. Parallel Coordinate Plot of Number of Sales by Platform
The highest number of sales in Europe and North America corresponds to the Wii platform. Since some platforms were aggregated into “Other Platform” for visualisation purposes, a grouping of “Other Platform” games can be identified among the Japan sales.
Figures 6, 7 and 8 show that there are some outliers in the data belonging to the Wii platform and the Sports genre. However, there is no clear grouping of the games in relation to the number of sales per region.
vgsales_pca <- recipe(NA_Sales ~ ., data = vgsales) %>%
  step_rm(Name) %>%                               # Remove name
  step_other(Genre, Platform, threshold = 50) %>% # Create "other" category for Genre and Platform
  step_dummy(Genre, Platform, one_hot = TRUE) %>% # Convert to dummy variables
  step_normalize(all_predictors()) %>%            # Normalise the predictors
  step_zv(all_predictors()) %>%                   # Remove predictors with variance = 0
  step_corr(all_predictors()) %>%                 # Remove correlated predictors
  step_pca(all_predictors()) %>%                  # Perform PCA
  prep()
Steps Explanation:
step_rm
: removes the Name
variable, which is not relevant for the PCA since it is a unique identifier of each observation.
step_other
: creates an “Other” category for the Genre
and Platform
variables, lumping together levels with fewer than 50 observations.
step_dummy
: converts Genre
and Platform
into dummy variables, including all categories (one-hot encoding).
step_normalize
: normalises all the predictors (i.e. Year
, EU_Sales
, JP_Sales
, Other_Sales
, Genre
and Platform
), so that each has a mean of zero and a standard deviation of 1.
step_zv
: removes all predictors with a variance equal to zero, since they carry no information for the components.
step_corr
: removes highly correlated predictors.
step_pca
: performs the Principal Component Analysis.
tidy(vgsales_pca, 7) %>%
filter(component %in% c("PC1", "PC2", "PC3", "PC4")) %>%
group_by(component) %>%
top_n(5, abs(value)) %>%
ungroup() %>%
ggplot(aes(x = value, y = terms, fill = abs(value))) +
geom_col(show.legend = FALSE) +
facet_wrap( ~ component, scales = "free_y") +
xlab("Loadings") +
ylab(NULL)+
ggtitle("Principal Components Loadings")+
scale_fill_distiller(palette = "Reds", direction = 1)+
theme_bw()
Figure 9. Principal Component Analysis Loadings Plot
#Calculate standard deviation
sdev <- vgsales_pca$steps[[7]]$res$sdev
#Calculate variance explained
ve <- sdev^2 / sum(sdev^2)
# Calculate the proportion of variance explained
PC.pve <- tibble(
  pc = fct_inorder(str_c("PC", 1:40)),
  pve = cumsum(ve) #Cumulative sum of the variance explained
)
PC.pve %>%
  filter(pve >= 0.9) %>% #Components that explain more than 90% of the variance
  kable(caption = "Table 13. Components that explain more than 90% of the variability") %>%
  kable_classic("hover", font_size = 18)
pc | pve |
---|---|
PC32 | 0.9205947 |
PC33 | 0.9413979 |
PC34 | 0.9611211 |
PC35 | 0.9792675 |
PC36 | 0.9920159 |
PC37 | 0.9980949 |
PC38 | 1.0000000 |
PC39 | 1.0000000 |
PC40 | 1.0000000 |
Observations
The Year
variable and the sales in different regions are the most important variables for the first component.
The Year
variable, sales in Europe and sales in other regions are the most important variables for the second component.
Platform DS is the most important variable for the third component.
Role-Playing and Miscellaneous genre, and Wii and DS platforms are the most important variables for the fourth component.
Table 13 shows that 32 out of 40 components are needed to explain 90% of the variance (a reduction of only 8 dimensions). Since the dimensionality was not significantly reduced, and using the components would reduce the interpretability of the models, the components were not considered for the predictive modelling.
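The variance-explained calculation used above is the standard one and can be cross-checked with base R's `prcomp` on any dataset (here the built-in mtcars, used purely as an illustration): each component's share of variance is sdev^2 / sum(sdev^2), and the cumulative sum gives the proportion explained by the first k components.

```r
# Same variance-explained formula as above, on a built-in dataset
pca <- prcomp(mtcars, scale. = TRUE)
ve  <- pca$sdev^2 / sum(pca$sdev^2)

cumsum(ve)                    # cumulative proportion of variance explained
which(cumsum(ve) >= 0.9)[1]   # first component that reaches 90%
```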
Sales in North America by year
sales_year <- vgsales %>%
  select(NA_Sales, Year) %>%
  group_by(Year) %>%
  summarise(sales = sum(NA_Sales))
ggplot(data = sales_year, aes(x = as.factor(Year), y = sales, fill = sales)) +
geom_col() +
xlab("Year") +
ylab("Sales in North America") +
ggtitle("Sales in North America by Year") +
scale_fill_distiller("Sales (in million)",palette = "Purples", direction = 1)+
theme_bw() +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
)
Figure 10. Bar chart of the sales in North America by year
#Create a median variable to colour the boxes
vgsales_mean <- vgsales %>%
  mutate(median = as.factor(round(ave(NA_Sales, as.factor(Year), FUN = median), 1)))
ggplot(data = vgsales_mean, aes(x = as.factor(Year), y = NA_Sales, fill = median)) +
geom_boxplot() +
xlab("Year") +
ylab("Sales in North America") +
ylim(0,2)+ # Limit y scale to see the boxes for each year
scale_fill_brewer("Median sales (in million)", palette = "Purples")+
ggtitle("Sales in North America by Year") +
theme_bw() +
theme(
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
)
Figure 11. Boxplot of the sales in North America by year
Observations
Figure 10 shows that total sales in North America were highest in 2008 and dropped in the following years. However, Figure 11 shows that the median number of sales was highest around 1990 and remained fairly constant after the year 2000. The former could be explained by the rise of gaming subscription services such as Xbox Game Pass, Steam, and PlayStation Plus, which give access to multiple video games without the need to buy each one. The latter could be explained by the release of the SNES (1991) and Genesis (1988) consoles, which created a continuous battle between Nintendo and Sega.
Figure 11 also shows a significant number of outliers for each year, meaning that every year some video games achieve a much higher number of sales than the other games released during the same year.
Sales in North America by sales in other regions
p1 <- ggplot(data = vgsales, aes(x = JP_Sales, y = NA_Sales)) +
  geom_point(size = 3, col = "#002F2F", alpha = .7) +
  geom_smooth(col = "#F7B15B", se = FALSE) +
  xlab("Sales in Japan") +
  ylab("Sales in North America") +
  ggtitle("Japan") +
  theme_bw()
p2 <- ggplot(data = vgsales, aes(x = EU_Sales, y = NA_Sales)) +
  geom_point(size = 3, col = "#3F606F", alpha = .7) +
  geom_smooth(col = "#F7B15B", se = FALSE) +
  xlab("Sales in Europe") +
  ylab("Sales in North America") +
  ggtitle("Europe") +
  theme_bw()
p3 <- ggplot(data = vgsales, aes(x = Other_Sales, y = NA_Sales)) +
  geom_point(size = 3, col = "#92AA9D", alpha = .7) +
  geom_smooth(col = "#F7B15B", se = FALSE) +
  xlab("Sales in the rest of the world") +
  ylab("Sales in North America") +
  ggtitle("Other regions") +
  theme_bw()
(p1 / p2 / p3) +
  plot_annotation(tag_levels = "A",
                  title = "Sales in North America by sales in different regions",
                  theme = theme(plot.title = element_text(size = 18)))
Figure 12. Scatter plot of sales in North America by sales in different regions of the world
Observations
Figure 12A shows that sales in North America have a quadratic relationship with sales in Japan. Sales in North America stay roughly constant until about 2.5 million sales in Japan; beyond this point, they start increasing as well. Additionally, around 4 million sales in Japan there is an outlier with over 40 million sales in North America.
Figure 12B shows that sales in North America have a slightly linear relationship with sales in Europe. Most of the sales for both regions are concentrated below 5 million. Similar to the previous plot, there are some outliers for sales in North America.
Figure 12C shows that sales in North America have a quadratic relationship with sales in other regions of the world. The pattern is similar to the relationship with sales in Japan: sales in North America stay constant and, after 1.5 million sales in other regions, start increasing. Additionally, there are some outliers for sales in North America, the highest reaching over 40 million sales.
Sales in North America by genre
sales_gen <- vgsales %>%
  group_by(Genre) %>%
  summarise(sales = sum(NA_Sales)) %>%
  arrange(desc(sales)) %>%
  mutate(pal = case_when(sales >= 500 ~ "#2F5755",
                         sales >= 375 ~ "#F7B15B",
                         sales >= 250 ~ "#F65C51",
                         TRUE ~ "grey90"))
ggplot(data = sales_gen, aes(y = reorder(Genre, sales), x = sales, fill = pal)) +
scale_fill_identity("Sales (in million)", labels = c(">500", ">375", ">250", "<250"), guide = "legend", limits = c("#2F5755", "#F7B15B","#F65C51" ,"grey90")) +
geom_col(alpha = .8) +
xlab("Sales in North America") +
ylab("Genre") +
ggtitle("Sales in North America by Genre")+
theme_bw() +
theme(
panel.grid.major.y = element_blank()
)
Figure 13. Bar chart of sales in North America by Genre
#Create a median variable to colour the boxes
vgsales_median <- vgsales %>%
  mutate(median = round(ave(NA_Sales, Genre, FUN = median), 2)) %>%
  mutate(pal = case_when(median >= 0.1 ~ "#2F5755",
                         median >= 0.08 ~ "#F7B15B",
                         median >= 0.06 ~ "#F65C51",
                         TRUE ~ "grey90"))
ggplot(data = vgsales_median, aes(y = reorder(Genre, median), x = NA_Sales,fill = pal)) +
scale_fill_identity("Median sales (in million)", labels = c(">0.1", ">0.08", ">0.06", "<0.06"), guide = "legend", limits = c("#2F5755", "#F7B15B","#F65C51" ,"grey90")) +
geom_boxplot(alpha = .8) +
xlab("Sales in North America") +
ylab("Genre") +
xlim(0,1)+ #Limit x axis to be able to see boxes
ggtitle("Sales in North America by Genre")+
theme_bw() +
theme(
panel.grid.major.y = element_blank()
)
Figure 14. Boxplot of sales in North America by Genre
Observations
Figure 13 shows that the genres Action, Sports, and Shooter were the genres with higher amount of total sales in North America. However, when plotting the median value of sales per genre (Figure 14), the median sales for Platform and Racing are as high as the genres mentioned before. Platform genre is the one with the highest median sales.
Figure 14 shows that all genres have outliers, meaning that every genre contains games that sold considerably more in North America than the typical game of that genre.
Sales in North America by platform
sales_plat <- vgsales %>%
  group_by(Platform) %>%
  summarise(sales = sum(NA_Sales)) %>%
  arrange(desc(sales)) %>%
  mutate(pal = case_when(sales >= 400 ~ "#2F5755",
                         sales >= 200 ~ "#F7B15B",
                         sales >= 100 ~ "#F65C51",
                         TRUE ~ "grey90"))
ggplot(data = sales_plat%>%filter(sales>0), aes(y = reorder(Platform, sales), x = sales, fill = pal)) +
scale_fill_identity("Sales (in million)", labels = c(">400", ">200", ">100", "<100"), guide = "legend", limits = c("#2F5755", "#F7B15B","#F65C51" ,"grey90")) +
geom_col(alpha = .8) +
xlab("Sales in North America") +
ylab("Platform") +
ggtitle("Sales in North America by Platform")+
theme_bw() +
theme(
panel.grid.major.y = element_blank()
)
Figure 15. Bar chart of sales in North America by Platform
# Create a median variable to colour the boxes
vgsales_median <- vgsales %>%
  mutate(median = round(ave(NA_Sales, Platform, FUN = median), 2)) %>%
  mutate(pal = case_when(median > 0.15 ~ "#2F5755",
                         median > 0.1 ~ "#F7B15B",
                         median > 0.05 ~ "#F65C51",
                         TRUE ~ "grey90"))
ggplot(data = vgsales_median, aes(y = reorder(Platform,median), x = NA_Sales,fill = pal)) +
scale_fill_identity("Median sales (in million)", labels = c(">0.15", ">0.10", ">0.05", "<0.05"), guide = "legend", limits = c("#2F5755", "#F7B15B","#F65C51" ,"grey90")) +
geom_boxplot(alpha = .8) +
xlab("Sales in North America") +
ylab("Platform") +
xlim(0,0.5)+ #Limit x axis to be able to see boxes
ggtitle("Sales in North America by Platform")+
theme_bw() +
theme(
panel.grid.major.y = element_blank()
)
Figure 16. Boxplot of sales in North America by Platform
Observations
Figure 15 shows that X360, PS2, and Wii were the platforms with the highest total sales in North America. However, when plotting the median sales per platform (Figure 16), 2600, NES, N64, and X360 have the highest values.
Figure 16 shows that all platforms have outliers, meaning that every platform includes games that sold considerably more in North America than the typical game released on that platform.
Create initial split of the data:
set.seed(4321)
vgsales_split <- initial_split(vgsales)
vgsales_split
## <Analysis/Assess/Total>
## <12245/4082/16327>
The training set has 12245 observations and the testing set 4082.
Create training and testing set with the initial split data:
vgsales_train <- training(vgsales_split)
vgsales_test <- testing(vgsales_split)
dim(vgsales_train)
## [1] 12245 8
skim_without_charts(vgsales_train) %>%
kable(caption = "Table 14. Skimmed training set") %>%
kable_classic("hover", font_size = 18)
skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
character | Name | 0 | 1 | 2 | 132 | 0 | 9094 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Platform | 0 | 1 | 2 | 4 | 0 | 30 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Genre | 0 | 1 | 4 | 12 | 0 | 12 | 0 | NA | NA | NA | NA | NA | NA | NA |
numeric | Year | 0 | 1 | NA | NA | NA | NA | NA | 2006.4318497 | 5.8086130 | 1980 | 2003 | 2007.00 | 2010.00 | 2020.00 |
numeric | NA_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.2645659 | 0.8341938 | 0 | 0 | 0.08 | 0.24 | 41.49 |
numeric | EU_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.1490118 | 0.5353449 | 0 | 0 | 0.02 | 0.11 | 29.02 |
numeric | JP_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.0802058 | 0.3201934 | 0 | 0 | 0.00 | 0.04 | 10.22 |
numeric | Other_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.0473148 | 0.1661819 | 0 | 0 | 0.01 | 0.03 | 8.46 |
The dimensions of the training set and Table 14 show the right number of observations (i.e. 12245), and all variables have the correct data type.
dim(vgsales_test)
## [1] 4082 8
skim_without_charts(vgsales_test)%>%
kable(caption = "Table 15. Skimmed testing set") %>%
kable_classic("hover", font_size = 18)
skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
character | Name | 0 | 1 | 1 | 122 | 0 | 3614 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Platform | 0 | 1 | 2 | 4 | 0 | 28 | 0 | NA | NA | NA | NA | NA | NA | NA |
character | Genre | 0 | 1 | 4 | 12 | 0 | 12 | 0 | NA | NA | NA | NA | NA | NA | NA |
numeric | Year | 0 | 1 | NA | NA | NA | NA | NA | 2006.3302303 | 5.8897169 | 1980 | 2003 | 2007.00 | 2010.00 | 2016.00 |
numeric | NA_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.2679618 | 0.7826621 | 0 | 0 | 0.08 | 0.25 | 26.93 |
numeric | EU_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.1431823 | 0.4190689 | 0 | 0 | 0.02 | 0.11 | 9.26 |
numeric | JP_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.0740274 | 0.2840659 | 0 | 0 | 0.00 | 0.03 | 4.87 |
numeric | Other_Sales | 0 | 1 | NA | NA | NA | NA | NA | 0.0513572 | 0.2477403 | 0 | 0 | 0.01 | 0.04 | 10.57 |
The dimensions of the testing set and Table 15 show the right number of observations (i.e. 4082), and all variables have the correct data type.
vgsales_recipe <- recipe(NA_Sales ~ ., data = vgsales_train) %>%
  step_rm(Name) %>%
  step_other(Genre, Platform, threshold = 50) %>%
  step_dummy(Genre, Platform, one_hot = TRUE) %>%
  step_normalize(all_predictors()) %>%
  step_zv(all_predictors()) %>% # Remove predictors with variance = 0
  step_corr(all_predictors()) %>%
  prep()
vgsales_recipe
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 7
##
## Training data contained 12245 data points and no missing data.
##
## Operations:
##
## Variables removed Name [trained]
## Collapsing factor levels for Platform [trained]
## Dummy variables from Genre, Platform [trained]
## Centering and scaling for Year, EU_Sales, JP_Sales, Other_Sales, Genre_Ac... [trained]
## Zero variance filter removed <none> [trained]
## Correlation filter on <none> [trained]
Steps Explanation:
step_rm: removes the Name variable. Name is the unique identifier of the observations, so using it to predict sales in North America would likely cause overfitting.
step_other: creates an "Other" category for the variables Genre and Platform when a level has fewer than 50 observations. This is necessary to reduce the number of dimensions for the final model.
step_dummy: converts Genre and Platform to dummy variables including all levels (i.e. one_hot = TRUE). This step was performed so that all predictors are numerical.
step_normalize: normalises all the predictors (i.e. Year, EU_Sales, JP_Sales, Other_Sales, and the Genre and Platform dummies), converting the values so that the mean is 0 and the standard deviation is 1. This is required since the variables have different ranges, which would otherwise influence the final model.
step_zv: removes all predictors that have zero variance.
step_corr: removes predictors that are highly correlated with other predictors. Strong correlation between predictors (multicollinearity) violates an assumption of linear regression, so this step helps avoid it.
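To make the normalisation step concrete, the following is a minimal base-R sketch of the transformation step_normalize applies to each predictor column; the sales values used here are made up for illustration.

```r
# step_normalize() centres and scales each predictor:
# subtract the column mean, then divide by the standard deviation.
x <- c(0.53, 1.20, 0.10, 2.58, 0.40)   # hypothetical EU_Sales values
x_norm <- (x - mean(x)) / sd(x)

round(mean(x_norm), 10)  # 0
round(sd(x_norm), 10)    # 1
```

After this transformation every predictor is on the same scale, which is what allows the Lasso penalty to treat all coefficients comparably.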
Preprocess the training data using the bake function, which applies all the previous recipe steps. When new_data = NULL is used, bake applies the steps to the training data.
vgsales_train_preproc <- bake(vgsales_recipe, new_data = NULL)
vgsales_train_preproc %>%
  head() %>%
  kable(caption = "Table 16. First 6 rows of the preprocessed training data") %>%
  kable_classic("hover", font_size = 18)
Year | EU_Sales | JP_Sales | Other_Sales | NA_Sales | Genre_Action | Genre_Adventure | Genre_Fighting | Genre_Misc | Genre_Platform | Genre_Puzzle | Genre_Racing | Genre_Role.Playing | Genre_Shooter | Genre_Simulation | Genre_Sports | Genre_Strategy | Platform_X2600 | Platform_X3DS | Platform_DS | Platform_GB | Platform_GBA | Platform_GC | Platform_N64 | Platform_NES | Platform_PC | Platform_PS | Platform_PS2 | Platform_PS3 | Platform_PS4 | Platform_PSP | Platform_PSV | Platform_SAT | Platform_SNES | Platform_Wii | Platform_WiiU | Platform_X360 | Platform_XB | Platform_XOne | Platform_other |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.7864442 | 0.2446799 | -0.2504918 | -0.0440170 | 0.00 | -0.4967876 | -0.289008 | -0.2311268 | 2.9267229 | -0.2373338 | -0.1896129 | -0.2833298 | -0.3197672 | -0.2939548 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | 2.5590598 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | -0.0747251 | -0.2499356 | -0.2773997 | -0.3853039 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | -0.1135924 | -0.0836036 |
0.4421280 | -0.2783474 | -0.1567984 | -0.2847171 | 0.06 | -0.4967876 | -0.289008 | -0.2311268 | -0.3416512 | -0.2373338 | -0.1896129 | -0.2833298 | 3.1270195 | -0.2939548 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | 2.5590598 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | -0.0747251 | -0.2499356 | -0.2773997 | -0.3853039 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | -0.1135924 | -0.0836036 |
1.4750768 | -0.2409883 | -0.2504918 | -0.2245420 | 0.07 | 2.0127684 | -0.289008 | -0.2311268 | -0.3416512 | -0.2373338 | -0.1896129 | -0.2833298 | -0.3197672 | -0.2939548 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | -0.3907366 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | -0.0747251 | -0.2499356 | -0.2773997 | -0.3853039 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | 8.8026849 | -0.0836036 |
0.7864442 | -0.0915519 | -0.2504918 | 0.0763331 | 0.54 | -0.4967876 | -0.289008 | -0.2311268 | -0.3416512 | -0.2373338 | -0.1896129 | -0.2833298 | -0.3197672 | -0.2939548 | 4.2900119 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | 2.5590598 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | -0.0747251 | -0.2499356 | -0.2773997 | -0.3853039 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | -0.1135924 | -0.0836036 |
-0.9351371 | 0.7116686 | 1.0299845 | 0.0763331 | 0.92 | 2.0127684 | -0.289008 | -0.2311268 | -0.3416512 | -0.2373338 | -0.1896129 | -0.2833298 | -0.3197672 | -0.2939548 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | -0.3907366 | 12.8241780 | -0.2295542 | -0.1818491 | -0.1437712 | -0.0747251 | -0.2499356 | -0.2773997 | -0.3853039 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | -0.1135924 | -0.0836036 |
-0.9351371 | 9.2295414 | 5.5897292 | 6.6955857 | 6.85 | -0.4967876 | -0.289008 | -0.2311268 | -0.3416512 | -0.2373338 | -0.1896129 | 3.5291680 | -0.3197672 | -0.2939548 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | -0.3907366 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | -0.0747251 | -0.2499356 | -0.2773997 | 2.5951421 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | -0.1135924 | -0.0836036 |
Preprocess the testing data. When new_data = vgsales_test is used, the bake function applies all the previous recipe steps to the testing set.
vgsales_test_preproc <- bake(vgsales_recipe, new_data = vgsales_test)
vgsales_test_preproc %>%
  head() %>%
  kable(caption = "Table 17. First 6 rows of the preprocessed testing data") %>%
  kable_classic("hover", font_size = 18)
Year | EU_Sales | JP_Sales | Other_Sales | NA_Sales | Genre_Action | Genre_Adventure | Genre_Fighting | Genre_Misc | Genre_Platform | Genre_Puzzle | Genre_Racing | Genre_Role.Playing | Genre_Shooter | Genre_Simulation | Genre_Sports | Genre_Strategy | Platform_X2600 | Platform_X3DS | Platform_DS | Platform_GB | Platform_GBA | Platform_GC | Platform_N64 | Platform_NES | Platform_PC | Platform_PS | Platform_PS2 | Platform_PS3 | Platform_PS4 | Platform_PSP | Platform_PSV | Platform_SAT | Platform_SNES | Platform_Wii | Platform_WiiU | Platform_X360 | Platform_XB | Platform_XOne | Platform_other |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-3.8618255 | 0.8984640 | 0.6239798 | 2.543509 | 26.93 | -0.4967876 | -0.289008 | -0.2311268 | -0.3416512 | -0.2373338 | -0.1896129 | -0.2833298 | -0.3197672 | 3.4016060 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | -0.3907366 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | 13.3812894 | -0.2499356 | -0.2773997 | -0.3853039 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | -0.1135924 | -0.0836036 |
-0.2465046 | 13.8620688 | 12.6479642 | 11.268887 | 9.81 | -0.4967876 | -0.289008 | -0.2311268 | -0.3416512 | -0.2373338 | -0.1896129 | 3.5291680 | -0.3197672 | -0.2939548 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | 2.5590598 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | -0.0747251 | -0.2499356 | -0.2773997 | -0.3853039 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | -0.1135924 | -0.0836036 |
0.6142861 | 8.9493482 | 0.4990553 | 9.764512 | 14.97 | -0.4967876 | -0.289008 | -0.2311268 | 2.9267229 | -0.2373338 | -0.1896129 | -0.2833298 | -0.3197672 | -0.2939548 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | -0.3907366 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | -0.0747251 | -0.2499356 | -0.2773997 | -0.3853039 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | 3.4618167 | -0.2281712 | -0.1135924 | -0.0836036 |
-0.4186627 | 0.4688345 | 1.0299845 | 63.320283 | 9.43 | 2.0127684 | -0.289008 | -0.2311268 | -0.3416512 | -0.2373338 | -0.1896129 | -0.2833298 | -0.3197672 | -0.2939548 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | -0.3907366 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | -0.0747251 | -0.2499356 | -0.2773997 | 2.5951421 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | -0.1135924 | -0.0836036 |
-0.2465046 | 17.0189120 | 12.7416576 | 12.051163 | 4.75 | -0.4967876 | -0.289008 | -0.2311268 | 2.9267229 | -0.2373338 | -0.1896129 | -0.2833298 | -0.3197672 | -0.2939548 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | 2.5590598 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | -0.0747251 | -0.2499356 | -0.2773997 | -0.3853039 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | -0.1135924 | -0.0836036 |
-3.1731929 | 6.1474164 | 11.7422615 | 2.483334 | 9.54 | -0.4967876 | -0.289008 | -0.2311268 | -0.3416512 | 4.2131300 | -0.1896129 | -0.2833298 | -0.3197672 | -0.2939548 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | -0.3907366 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | 13.3812894 | -0.2499356 | -0.2773997 | -0.3853039 | -0.2955911 | -0.1440667 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | -0.1135924 | -0.0836036 |
skim_without_charts(vgsales_train_preproc)%>%
kable(caption = "Table 17. Skimmed preprocessed training data") %>%
kable_classic("hover", font_size = 18)
skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 |
---|---|---|---|---|---|---|---|---|---|---|
numeric | Year | 0 | 1 | 0.0000000 | 1.0000000 | -4.5504580 | -0.5908209 | 0.0978117 | 0.6142861 | 2.335867 |
numeric | EU_Sales | 0 | 1 | 0.0000000 | 1.0000000 | -0.2783474 | -0.2783474 | -0.2409883 | -0.0728723 | 53.929694 |
numeric | JP_Sales | 0 | 1 | 0.0000000 | 1.0000000 | -0.2504918 | -0.2504918 | -0.2504918 | -0.1255672 | 31.667721 |
numeric | Other_Sales | 0 | 1 | 0.0000000 | 1.0000000 | -0.2847171 | -0.2847171 | -0.2245420 | -0.1041920 | 50.623353 |
numeric | NA_Sales | 0 | 1 | 0.2645659 | 0.8341938 | 0.0000000 | 0.0000000 | 0.0800000 | 0.2400000 | 41.490000 |
numeric | Genre_Action | 0 | 1 | 0.0000000 | 1.0000000 | -0.4967876 | -0.4967876 | -0.4967876 | -0.4967876 | 2.012768 |
numeric | Genre_Adventure | 0 | 1 | 0.0000000 | 1.0000000 | -0.2890080 | -0.2890080 | -0.2890080 | -0.2890080 | 3.459830 |
numeric | Genre_Fighting | 0 | 1 | 0.0000000 | 1.0000000 | -0.2311268 | -0.2311268 | -0.2311268 | -0.2311268 | 4.326276 |
numeric | Genre_Misc | 0 | 1 | 0.0000000 | 1.0000000 | -0.3416512 | -0.3416512 | -0.3416512 | -0.3416512 | 2.926723 |
numeric | Genre_Platform | 0 | 1 | 0.0000000 | 1.0000000 | -0.2373338 | -0.2373338 | -0.2373338 | -0.2373338 | 4.213130 |
numeric | Genre_Puzzle | 0 | 1 | 0.0000000 | 1.0000000 | -0.1896129 | -0.1896129 | -0.1896129 | -0.1896129 | 5.273471 |
numeric | Genre_Racing | 0 | 1 | 0.0000000 | 1.0000000 | -0.2833298 | -0.2833298 | -0.2833298 | -0.2833298 | 3.529168 |
numeric | Genre_Role.Playing | 0 | 1 | 0.0000000 | 1.0000000 | -0.3197672 | -0.3197672 | -0.3197672 | -0.3197672 | 3.127019 |
numeric | Genre_Shooter | 0 | 1 | 0.0000000 | 1.0000000 | -0.2939548 | -0.2939548 | -0.2939548 | -0.2939548 | 3.401606 |
numeric | Genre_Simulation | 0 | 1 | 0.0000000 | 1.0000000 | -0.2330806 | -0.2330806 | -0.2330806 | -0.2330806 | 4.290012 |
numeric | Genre_Sports | 0 | 1 | 0.0000000 | 1.0000000 | -0.4077842 | -0.4077842 | -0.4077842 | -0.4077842 | 2.452077 |
numeric | Genre_Strategy | 0 | 1 | 0.0000000 | 1.0000000 | -0.2065347 | -0.2065347 | -0.2065347 | -0.2065347 | 4.841405 |
numeric | Platform_X2600 | 0 | 1 | 0.0000000 | 1.0000000 | -0.0831069 | -0.0831069 | -0.0831069 | -0.0831069 | 12.031707 |
numeric | Platform_X3DS | 0 | 1 | 0.0000000 | 1.0000000 | -0.1779791 | -0.1779791 | -0.1779791 | -0.1779791 | 5.618177 |
numeric | Platform_DS | 0 | 1 | 0.0000000 | 1.0000000 | -0.3907366 | -0.3907366 | -0.3907366 | -0.3907366 | 2.559060 |
numeric | Platform_GB | 0 | 1 | 0.0000000 | 1.0000000 | -0.0779713 | -0.0779713 | -0.0779713 | -0.0779713 | 12.824178 |
numeric | Platform_GBA | 0 | 1 | 0.0000000 | 1.0000000 | -0.2295542 | -0.2295542 | -0.2295542 | -0.2295542 | 4.355913 |
numeric | Platform_GC | 0 | 1 | 0.0000000 | 1.0000000 | -0.1818491 | -0.1818491 | -0.1818491 | -0.1818491 | 5.498616 |
numeric | Platform_N64 | 0 | 1 | 0.0000000 | 1.0000000 | -0.1437712 | -0.1437712 | -0.1437712 | -0.1437712 | 6.954930 |
numeric | Platform_NES | 0 | 1 | 0.0000000 | 1.0000000 | -0.0747251 | -0.0747251 | -0.0747251 | -0.0747251 | 13.381289 |
numeric | Platform_PC | 0 | 1 | 0.0000000 | 1.0000000 | -0.2499356 | -0.2499356 | -0.2499356 | -0.2499356 | 4.000705 |
numeric | Platform_PS | 0 | 1 | 0.0000000 | 1.0000000 | -0.2773997 | -0.2773997 | -0.2773997 | -0.2773997 | 3.604612 |
numeric | Platform_PS2 | 0 | 1 | 0.0000000 | 1.0000000 | -0.3853039 | -0.3853039 | -0.3853039 | -0.3853039 | 2.595142 |
numeric | Platform_PS3 | 0 | 1 | 0.0000000 | 1.0000000 | -0.2955911 | -0.2955911 | -0.2955911 | -0.2955911 | 3.382776 |
numeric | Platform_PS4 | 0 | 1 | 0.0000000 | 1.0000000 | -0.1440667 | -0.1440667 | -0.1440667 | -0.1440667 | 6.940661 |
numeric | Platform_PSP | 0 | 1 | 0.0000000 | 1.0000000 | -0.2797824 | -0.2797824 | -0.2797824 | -0.2797824 | 3.573914 |
numeric | Platform_PSV | 0 | 1 | 0.0000000 | 1.0000000 | -0.1614245 | -0.1614245 | -0.1614245 | -0.1614245 | 6.194341 |
numeric | Platform_SAT | 0 | 1 | 0.0000000 | 1.0000000 | -0.1039858 | -0.1039858 | -0.1039858 | -0.1039858 | 9.615910 |
numeric | Platform_SNES | 0 | 1 | 0.0000000 | 1.0000000 | -0.1231681 | -0.1231681 | -0.1231681 | -0.1231681 | 8.118324 |
numeric | Platform_Wii | 0 | 1 | 0.0000000 | 1.0000000 | -0.2898365 | -0.2898365 | -0.2898365 | -0.2898365 | 3.449940 |
numeric | Platform_WiiU | 0 | 1 | 0.0000000 | 1.0000000 | -0.0925490 | -0.0925490 | -0.0925490 | -0.0925490 | 10.804205 |
numeric | Platform_X360 | 0 | 1 | 0.0000000 | 1.0000000 | -0.2888421 | -0.2888421 | -0.2888421 | -0.2888421 | 3.461817 |
numeric | Platform_XB | 0 | 1 | 0.0000000 | 1.0000000 | -0.2281712 | -0.2281712 | -0.2281712 | -0.2281712 | 4.382317 |
numeric | Platform_XOne | 0 | 1 | 0.0000000 | 1.0000000 | -0.1135924 | -0.1135924 | -0.1135924 | -0.1135924 | 8.802685 |
numeric | Platform_other | 0 | 1 | 0.0000000 | 1.0000000 | -0.0836036 | -0.0836036 | -0.0836036 | -0.0836036 | 11.960232 |
Observations
- Name was removed from the data.
- Genre and Platform were converted to numerical variables, and an "Other" category was added for Platform. No "Other" category was created for Genre, which means that all genre categories had more than 50 observations.
- All predictors were normalised, having a mean of 0 and a standard deviation of 1.
- No variable had zero variance, and the correlation filter did not remove any predictors.
Lasso regression specification. This function creates a linear regression with mixture = 1 (i.e. Lasso regression). In this case, the penalty hyperparameter was tuned in order to obtain the value that optimises the metrics. The penalty value for a Lasso regression is important since it can shrink some of the coefficients to zero.
lasso_spec <- linear_reg(
  mode = "regression",
  penalty = tune(),
  mixture = 1
) %>%
  set_engine("glmnet")
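As a rough illustration of how the L1 penalty can zero out coefficients (a sketch, not part of the glmnet fit above): for orthonormal predictors, the Lasso solution is the soft-thresholded least-squares estimate. The coefficient values below are hypothetical.

```r
# Soft-thresholding: sign(b) * max(|b| - lambda, 0).
# Coefficients smaller in absolute value than lambda are set exactly to 0.
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

b_ols <- c(0.80, -0.15, 0.001)         # hypothetical unpenalised coefficients
soft_threshold(b_ols, lambda = 0.002)  # the smallest coefficient becomes 0
```

This is why tuning the penalty matters: a larger lambda removes more predictors, trading variance for bias.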
Create resamples using the bootstrap, which provides multiple samples of the training data for tuning the hyperparameters of the models.
# Bootstrapped data
set.seed(4321)
vgsales_boots <- bootstraps(vgsales_train_preproc, times = 10)
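For intuition about what each bootstrap resample contains (independent of the bootstraps() call above): a resample draws n rows with replacement, so on average only about 63.2% of the unique training rows appear in any single resample, leaving the rest available as out-of-bag assessment data.

```r
# Draw one bootstrap resample of row indices and count the unique rows kept.
set.seed(4321)
n <- 1000
idx <- sample.int(n, size = n, replace = TRUE)
length(unique(idx)) / n   # close to 1 - exp(-1), i.e. roughly 0.632
```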
Create a grid of penalty values. In this case, 50 levels were used.
penalty_grid <- grid_regular(penalty(range = c(-3, 0), trans = log10_trans()),
                             levels = 50)
penalty_grid %>%
  kable(caption = "Table 18. 50 penalty values used to tune the Lasso regression") %>%
  kable_classic("hover", font_size = 18)
penalty |
---|
0.0010000 |
0.0011514 |
0.0013257 |
0.0015264 |
0.0017575 |
0.0020236 |
0.0023300 |
0.0026827 |
0.0030888 |
0.0035565 |
0.0040949 |
0.0047149 |
0.0054287 |
0.0062506 |
0.0071969 |
0.0082864 |
0.0095410 |
0.0109854 |
0.0126486 |
0.0145635 |
0.0167683 |
0.0193070 |
0.0222300 |
0.0255955 |
0.0294705 |
0.0339322 |
0.0390694 |
0.0449843 |
0.0517947 |
0.0596362 |
0.0686649 |
0.0790604 |
0.0910298 |
0.1048113 |
0.1206793 |
0.1389495 |
0.1599859 |
0.1842070 |
0.2120951 |
0.2442053 |
0.2811769 |
0.3237458 |
0.3727594 |
0.4291934 |
0.4941713 |
0.5689866 |
0.6551286 |
0.7543120 |
0.8685114 |
1.0000000 |
Tune the penalty hyperparameter of the Lasso regression using the previous grid:
set.seed(4321)
lasso_grid <- tune_grid(lasso_spec, NA_Sales ~ ., vgsales_boots, grid = penalty_grid)
Visualise the values of RMSE and \(R^2\) for different penalty values:
lasso_grid %>%
  collect_metrics() %>%
ggplot( aes( penalty, mean, color = .metric ) ) +
geom_line() +
facet_wrap( ~.metric, scales = "free", nrow = 2) +
scale_x_log10()+
theme_bw()+
ggtitle("Metrics for different values of penalty")+
labs(color = "Metric")
Figure 17. Line plot of the metrics for different penalty values
Identify the penalty value that reduces the RMSE:
best_lasso_rmse <- select_best(lasso_grid, "rmse")
best_lasso_rmse %>%
  kable(caption = "Table 19. Optimal penalty value for Lasso regression") %>%
  kable_classic("hover", font_size = 18)
penalty | .config |
---|---|
0.0020236 | Preprocessor1_Model06 |
Table 19 shows that the optimal penalty value that reduces the RMSE is 0.002.
Define the final Lasso model using the penalty value from Table 19:
final_lasso <- finalize_model(lasso_spec, best_lasso_rmse)
final_lasso
## Linear Regression Model Specification (regression)
##
## Main Arguments:
## penalty = 0.00202358964772516
## mixture = 1
##
## Computational engine: glmnet
The Lasso regression was correctly defined with the penalty value that reduces the RMSE.
Random Forest specification. In this case, the mtry and min_n hyperparameters were tuned in order to obtain the values that optimise the metrics. The mtry value defines the number of predictors randomly sampled as split candidates at each split, and min_n defines the minimum node size. For computing purposes, the number of trees was set to 100 (trees = 100) and was not tuned.
rf_spec <- rand_forest(
  mode = "regression",
  mtry = tune(),
  trees = 100,
  min_n = tune()
) %>%
  set_engine("ranger", importance = "permutation")
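To illustrate what mtry controls (a sketch, not part of the ranger fit above): at every split the forest considers only a random subset of mtry predictors. With the 39 preprocessed predictor columns, one such draw for mtry = 20 might look like this; the column names are placeholders.

```r
# One hypothetical draw of the 20 predictors considered at a single split
# (the preprocessed training data has 39 predictor columns).
set.seed(4321)
predictors <- paste0("X", 1:39)   # placeholder names for the 39 predictors
split_candidates <- sample(predictors, size = 20)
length(split_candidates)          # 20
```

Small mtry values decorrelate the trees but may miss important predictors; tuning finds the balance.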
Create a grid with different values for mtry and min_n. In this case, 5 levels were used for each hyperparameter, which means that the grid has 25 combinations.
rand_spec_grid <- grid_regular(
  finalize(mtry(),
           vgsales_train_preproc %>%
             select(-NA_Sales)),
  min_n(),
  levels = 5)
rand_spec_grid %>%
  kable(caption = "Table 20. Grid of different mtry and min_n values") %>%
  kable_classic("hover", font_size = 18)
mtry | min_n |
---|---|
1 | 2 |
10 | 2 |
20 | 2 |
29 | 2 |
39 | 2 |
1 | 11 |
10 | 11 |
20 | 11 |
29 | 11 |
39 | 11 |
1 | 21 |
10 | 21 |
20 | 21 |
29 | 21 |
39 | 21 |
1 | 30 |
10 | 30 |
20 | 30 |
29 | 30 |
39 | 30 |
1 | 40 |
10 | 40 |
20 | 40 |
29 | 40 |
39 | 40 |
Tune the Random Forest over the 25 hyperparameter combinations using the bootstrap resamples created earlier (i.e. vgsales_boots):
doParallel::registerDoParallel()
set.seed(4321)
rf_grid <- tune_grid(rf_spec, NA_Sales ~ .,
                     vgsales_boots, grid = rand_spec_grid)
Visualise the RMSE and \(R^2\) for the different values of mtry and min_n:
rf_grid %>%
  collect_metrics() %>%
mutate( min_n = as.factor( min_n ) ) %>%
ggplot( aes( x = mtry, y = mean, colour = min_n ) ) +
geom_point( size = 2 ) +
geom_line( alpha = 0.75 ) +
facet_wrap( ~ .metric, scales = "free", nrow = 3 )+
theme_bw()+
ggtitle("Mean RMSE and R-Squared for mtry and min_n")+
labs(colour = "min_n")
Figure 18. Line plot of the metrics for different mtry and min_n values
Identify the values of mtry and min_n that minimise the RMSE:
best_rf_rmse <- select_best(rf_grid, "rmse")
best_rf_rmse %>%
  kable(caption = "Table 21. Optimal mtry and min_n values for Random Forest") %>%
  kable_classic("hover", font_size = 18)
mtry | min_n | .config |
---|---|---|
20 | 2 | Preprocessor1_Model03 |
Figure 18 and Table 21 show that the optimal mtry and min_n values that minimise the RMSE are 20 and 2, respectively.
Define the final Random Forest model using the mtry and min_n values from Table 21:
final_rf <- finalize_model(rf_spec, best_rf_rmse)
final_rf
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 20
## trees = 100
## min_n = 2
##
## Engine-Specific Arguments:
## importance = permutation
##
## Computational engine: ranger
The Random Forest was correctly defined with the mtry and min_n values that minimise the RMSE.
Perform cross-validation to guard against overfitting. Cross-validation evaluates each model's performance on several resampled subsets of the training data, giving a more reliable estimate than a single fit.
set.seed(4321)
vgsales_cv <- vfold_cv(vgsales_train_preproc, v = 10)
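A minimal sketch of the fold assignment that 10-fold cross-validation performs (the row count matches the training set; the assignment logic here is illustrative, not tidymodels' internal code):

```r
# Assign each of the 12245 training rows to one of 10 folds of near-equal size;
# each fold serves exactly once as the assessment set.
set.seed(4321)
n <- 12245
folds <- sample(rep(1:10, length.out = n))
table(folds)   # five folds of 1225 rows and five of 1224
```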
Evaluate metrics of the Lasso regression using the cross-validated resamples:
lasso_cv <- fit_resamples(final_lasso, NA_Sales ~ ., vgsales_cv)
lasso_cv %>%
  collect_metrics() %>%
  kable(caption = "Table 22. RMSE and R-squared for Lasso regression") %>%
  kable_classic("hover", font_size = 18)
.metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|
rmse | standard | 0.4460320 | 10 | 0.0571312 | Preprocessor1_Model1 |
rsq | standard | 0.6672145 | 10 | 0.0452453 | Preprocessor1_Model1 |
Evaluate metrics of the Random Forest using the cross-validated resamples:
set.seed(4321)
rf_cv <- fit_resamples(final_rf, NA_Sales ~ ., vgsales_cv)
rf_cv %>%
  collect_metrics() %>%
  kable(caption = "Table 23. RMSE and R-squared for Random Forest") %>%
  kable_classic("hover", font_size = 18)
.metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|
rmse | standard | 0.3760635 | 10 | 0.0741628 | Preprocessor1_Model1 |
rsq | standard | 0.8059027 | 10 | 0.0294340 | Preprocessor1_Model1 |
Observations
The Random Forest outperformed the Lasso regression on the cross-validated resamples (RMSE = 0.376 vs 0.446; R-squared = 0.806 vs 0.667). Since the Random Forest was the best model, it will be the final model used to predict the sales in North America. Fit the Random Forest using the following code:
set.seed(4321)
vgsales_rf <- final_rf %>%
  fit(NA_Sales ~ ., data = vgsales_train_preproc)
vgsales_rf
## parsnip model object
##
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~20L, x), num.trees = ~100, min.node.size = min_rows(~2L, x), importance = ~"permutation", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 100
## Sample size: 12245
## Number of independent variables: 39
## Mtry: 20
## Target node size: 2
## Variable importance mode: permutation
## Splitrule: variance
## OOB prediction error (MSE): 0.1761053
## R squared (OOB): 0.7469313
The following plot identifies the most important predictors in the model:
vgsales_rf %>%
  vip(all_permutations = TRUE, mapping = aes_string(fill = "Importance")) +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  ggtitle("Variable Importance Scores for predictors") +
  theme_bw()
Figure 19. Bar chart of the most important predictors from the Random Forest model
Figure 19 shows that the most important predictors for the Random Forest model were sales in Europe and sales in other regions of the world. Figure 12B shows a roughly linear relationship between sales in North America and sales in Europe, which could explain the high importance of this variable in the model.
Create model predictions with the testing set (i.e. vgsales_test_preproc):
vgsales_preds <- predict(vgsales_rf,
                         new_data = vgsales_test_preproc) %>%
  bind_cols(vgsales_test_preproc %>% select(NA_Sales))
vgsales_preds %>%
  head() %>%
  kable(caption = "Table 24. First 6 rows of the predicted and observed sales in North America for the testing set") %>%
  kable_classic("hover", font_size = 18)
.pred | NA_Sales |
---|---|
2.964267 | 26.93 |
8.900740 | 9.81 |
7.875325 | 14.97 |
6.081783 | 9.43 |
9.901490 | 4.75 |
13.629000 | 9.54 |
Visualise observed versus predicted values:
vgsales_preds %>%
  ggplot(aes(x = .pred, y = NA_Sales)) +
geom_point(col = "#2F5755", alpha = 0.6) +
xlab("Prediction") +
ylab("Sales in North America") +
ggtitle("Observed versus predicted values of sales in North America")+
geom_abline(intercept = 0, slope = 1, colour = "#F7B15B") +
theme_bw()
Figure 20. Scatter plot of the observed versus predicted values of sales in North America
Determine the metrics of the model with the testing set:
vgsales_preds %>%
  metrics(truth = NA_Sales, estimate = .pred) %>%
kable(caption = "Table 25. Metrics of the Random Forest on the testing set") %>%
kable_classic("hover", font_size = 18)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 0.4679026 |
rsq | standard | 0.6433416 |
mae | standard | 0.0802173 |
Observations
Figure 20 shows that the Random Forest predicts the observed sales in North America on the testing set only moderately well.
Table 25 is consistent with Figure 20: the metrics show a moderate fit to the testing observations (RMSE = 0.468, R-squared = 0.643). These metrics are worse than those obtained on the training set (RMSE = 0.376, R-squared = 0.806), which could be explained by the high variability in the data, identified by the large number of outliers. This means that the training set might not be a fully representative sample of the whole data.
Create a tibble to predict the sales in North America for The Fatal Empire game:

```r
# Create tibble with data from The Fatal Empire game
new_game <- tibble(
  Name = "The Fatal Empire", # This will be removed when preprocessing
  Year = 2022,
  Genre = "Role-Playing",
  Platform = "PS4",
  JP_Sales = 2.58,
  EU_Sales = 0.53,
  Other_Sales = 0.1
)

new_game %>%
  kable(caption = "Table 26. The Fatal Empire data for predicting sales in North America") %>%
  kable_classic("hover", font_size = 18)
```
| Name | Year | Genre | Platform | JP_Sales | EU_Sales | Other_Sales |
|---|---|---|---|---|---|---|
| The Fatal Empire | 2022 | Role-Playing | PS4 | 2.58 | 0.53 | 0.1 |
Preprocess the data utilising the recipe `vgsales_recipe` and the `bake` function. This function applies all steps defined in the Preprocessing section to the data from The Fatal Empire.

```r
new_game_preproc <- bake(vgsales_recipe, new_data = new_game)

new_game_preproc %>%
  kable(caption = "Table 27. The Fatal Empire preprocessed data for predicting sales in North America") %>%
  kable_classic("hover", font_size = 18)
```
| Year | EU_Sales | JP_Sales | Other_Sales | Genre_Action | Genre_Adventure | Genre_Fighting | Genre_Misc | Genre_Platform | Genre_Puzzle | Genre_Racing | Genre_Role.Playing | Genre_Shooter | Genre_Simulation | Genre_Sports | Genre_Strategy | Platform_X2600 | Platform_X3DS | Platform_DS | Platform_GB | Platform_GBA | Platform_GC | Platform_N64 | Platform_NES | Platform_PC | Platform_PS | Platform_PS2 | Platform_PS3 | Platform_PS4 | Platform_PSP | Platform_PSV | Platform_SAT | Platform_SNES | Platform_Wii | Platform_WiiU | Platform_X360 | Platform_XB | Platform_XOne | Platform_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.680184 | 0.7116686 | 7.807139 | 0.3170332 | -0.4967876 | -0.289008 | -0.2311268 | -0.3416512 | -0.2373338 | -0.1896129 | -0.2833298 | 3.127019 | -0.2939548 | -0.2330806 | -0.4077842 | -0.2065347 | -0.0831069 | -0.1779791 | -0.3907366 | -0.0779713 | -0.2295542 | -0.1818491 | -0.1437712 | -0.0747251 | -0.2499356 | -0.2773997 | -0.3853039 | -0.2955911 | 6.940661 | -0.2797824 | -0.1614245 | -0.1039858 | -0.1231681 | -0.2898365 | -0.092549 | -0.2888421 | -0.2281712 | -0.1135924 | -0.0836036 |
Create predictions of sales in North America using the final model (i.e. the Random Forest model) and the preprocessed data from The Fatal Empire:

```r
new_game_preds <- predict(
  vgsales_rf,
  new_data = new_game_preproc
)

new_game_preds %>%
  kable(caption = "Table 28. The Fatal Empire predicted sales in North America") %>%
  kable_classic("hover", font_size = 18)
```
| .pred |
|---|
| 0.7790733 |
Observations
The final model predicted that the sales in North America for The Fatal Empire will be 0.779 million copies (Table 28).
It is important to highlight that the model has only moderate prediction accuracy (Figure 20 and Table 25), which means that the predicted value should be used only as a reference, since it may differ slightly from the actual value.
To improve the accuracy of the model, more variables could be considered, such as publisher, targeted age group, duration of the game, and console exclusivity. Additionally, since the year variable was the fifth most important variable (Figure 19), it would be valuable to impute the missing years for the 271 observations instead of deleting them from the data. For this, the data for each game could be acquired from other data sets and merged into the current data. Furthermore, the data had a high number of outliers that could be studied further in order to increase the accuracy of the model.
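The suggested merging of release years from an external source could be sketched with a dplyr join. This is a minimal sketch, assuming a hypothetical lookup table `year_lookup` (columns `Name` and `Year`) assembled from another data set, and the original data frame named `vgsales`:

```r
library(dplyr)

# year_lookup is a hypothetical external table of release years (Name, Year).
# Fill in Year only where it is missing in the original data.
vgsales_imputed <- vgsales %>%
  left_join(year_lookup, by = "Name", suffix = c("", "_lookup")) %>%
  mutate(Year = coalesce(Year, Year_lookup)) %>%
  select(-Year_lookup)
```

With this approach, the 271 observations currently dropped for missing `Year` could be retained wherever the external source covers the game.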