Nintendo Power is going to be outsourced to Imagine Media, meaning the long-standing and proud magazine will no longer be run by Nintendo themselves. This inspired Jay and Christian to resurrect this old project. Most people claim that Nintendo Power review scores cannot be trusted, or are at least suspect, considering they come “in house” rather than from an independent source. Now that this will no longer be the case, let us look back at the old Nintendo Power and see how they stacked up to the rest of the gaming world when reviewing Nintendo DS games.
A few nuggets of info before we start. The numbers used are the weighted averages from Metacritic, Nintendo Power’s scores, and the minimum and maximum non-Nintendo Power scores. All numbers are for the top 30 and bottom 30 DS games.
There are so many problems with the numbers used here that we cannot draw any serious statistical conclusions from the results. For one, we start off with weighted averages before we even calculate any means, and we have no idea how Metacritic calculates their averages. The samples are not random, and are not necessarily independent. Also, knowing game reviewers, it is hard to say that these review scores follow a normal distribution (insert joke here about how everything is rated a 7).
The point is, we are doing this for fun. If you want to point out all the flaws, even in the calculation of the stats, you are wasting your breath. I know my stats prowess is rusty as hell, and you know you are free to stop reading whenever you please. Also, for reference, we provide the numbers we used in .xls and .csv format. Now let's get on with it.
First we will look at the Metacritic averages for all the DS games in my spreadsheet compared to NinPow's scores by doing a classic two-sample t-test. The null hypothesis is, of course, that there is no difference between them; the alternate is that there is a difference. Here are the numbers from Minitab:
T-Value = -1.86 P-Value = 0.065 DF = 117
So from this we fail to reject the null hypothesis, and do not have enough evidence to say Nintendo Power’s scores are significantly different from the rest of the gaming world’s.
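If you want to try this kind of test yourself without Minitab, SciPy's two-sample t-test (with unpooled variances, which the non-round degrees of freedom above suggest Minitab used) works the same way. A minimal sketch, using made-up placeholder scores rather than our actual spreadsheet data:

```python
# Two-sample (Welch's) t-test: unpooled variances, two-sided
# alternative. These scores are made-up placeholders, NOT the
# actual spreadsheet data from the article.
from scipy import stats

metacritic = [92, 88, 85, 31, 28, 45, 70, 66, 39, 81]       # hypothetical
nintendo_power = [85, 80, 75, 60, 55, 65, 70, 72, 50, 78]   # hypothetical

t_stat, p_value = stats.ttest_ind(metacritic, nintendo_power, equal_var=False)
print(f"T-Value = {t_stat:.2f}  P-Value = {p_value:.3f}")

# With a two-sided test, reject the null hypothesis only when
# p_value falls below the chosen significance level (e.g. 0.05).
```

With our real numbers, the p-value of 0.065 sits just above the usual 0.05 cutoff, which is why we fail to reject.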
Let’s look at some boxplots:
Nasty! There is a lot to see here, but for our simple study I'll just say this: Nintendo Power's scores look to be on the whole higher, but their boxplot looks fairly normal, with the mean and median close to each other, and the data does not look terribly skewed. The Metacritic scores are highly skewed, and the data looks to be affected much more by outliers. To visualize this a bit better, here's a dotplot:
Nintendo Power's scores are fairly spread out, while Metacritic's are sharply divided. This is fairly interesting. Metacritic does exactly what we might expect, considering these are the 30 best and worst rated games on the DS, while Nintendo Power spreads its scores across the range. This may mean that they are sometimes extra harsh or extra lenient on certain games.
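The shape differences you can eyeball in the plots can also be checked numerically: a mean close to the median suggests little skew, and a sample skewness statistic makes the asymmetry explicit. A quick sketch with made-up scores (again, not the real data):

```python
# Quick distribution-shape check: compare mean vs. median and
# compute sample skewness. Scores are made-up placeholders,
# not the real review data.
from statistics import mean, median
from scipy.stats import skew

# Hypothetical "sharply divided" scores (games lumped at the
# extremes) vs. a more evenly spread set.
divided = [90, 92, 95, 88, 91, 25, 22, 30, 28, 26]
spread = [40, 50, 55, 60, 65, 70, 75, 80, 85, 90]

for name, scores in [("divided", divided), ("spread", spread)]:
    print(f"{name}: mean={mean(scores):.1f} "
          f"median={median(scores):.1f} skew={skew(scores):.2f}")
```

A skewness near zero with mean and median close together matches what the Nintendo Power boxplot shows; a large skewness (or a big mean–median gap) matches the outlier-heavy Metacritic picture.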
Now let us do a t-test on the lowest 30 games. The null hypothesis is that there is no difference, the alternate is that Metacritic scores are lower than NinPow scores.
Difference = mu Weighted Average – mu Nintendo Power
Estimate for difference: -12.07
95% CI for difference: (-17.39, -6.76)
T-Test of difference = 0 (vs <): T-Value = -4.57 P-Value = 0.000 DF = 46

This t-test rejects the null hypothesis, which would support the conclusion that Nintendo Power is more favorable than other reviewers when it comes to crappy DS games. Now let us do the same for the top 30 games. The hypotheses are the same.
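The one-sided "vs <" test above differs from the first t-test only in the alternative hypothesis. In SciPy that is the `alternative="less"` argument (available since SciPy 1.6); here is a sketch with placeholder numbers, not the real bottom-30 scores:

```python
# One-sided Welch's t-test: the alternative hypothesis is that the
# first sample's mean is LESS than the second's ("vs <" in Minitab).
# Scores are made-up placeholders, not the real bottom-30 data.
from scipy import stats

metacritic_bottom = [20, 25, 30, 22, 28, 35, 18, 27, 33, 24]  # hypothetical
ninpow_bottom = [55, 60, 50, 65, 58, 52, 62, 57, 53, 59]      # hypothetical

t_stat, p_value = stats.ttest_ind(
    metacritic_bottom, ninpow_bottom,
    equal_var=False, alternative="less",
)
print(f"T-Value = {t_stat:.2f}  P-Value = {p_value:.3f}")

# A tiny p-value rejects the null in favor of "Metacritic scores
# the bottom games lower than Nintendo Power does."
```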
Difference = mu Weighted Average – mu Nintendo Power
Estimate for difference: -0.23
95% upper bound for difference: 3.78
T-Test of difference = 0 (vs <): T-Value = -0.10 P-Value = 0.461 DF = 33

We fail to reject the null hypothesis this time, so there is not enough evidence to say there is a difference in how the review outlets judge the best of the DS games. Here are the boxplots and dotplots for this test, which show a pretty even distribution of scores from Metacritic, and some severe outliers from Nintendo Power.
I took a look to see which games were causing the outliers, and the results were surprising. Several games got abysmal scores from NinPow, including Puzzle Quest and Yoshi's Island DS. Interesting, considering how popular these two games are.
The final test I wanted to do was to compare the minimum review scores from Metacritic with Nintendo Power’s scores. So I did it. The results:
Two-sample T for Min Review vs Nintendo Power
Difference = mu Min Review – mu Nintendo Power
Estimate for difference: -28.87
95% upper bound for difference: -22.29
T-Test of difference = 0 (vs <): T-Value = -7.28 P-Value = 0.000 DF = 113
So it would seem that NinPow does give more favorable scores than other review sites, though this test is useless considering it ignores all other scores for the games.
Because this is an informal analysis, we can't really glean any factual information. But we sure can speculate. The Metacritic data implies that most reviewers are doing what we would expect them to: praising the best and slamming the worst the DS has to offer. Nintendo Power, however, is a little goofy. They very well may be kinder to crappy games, and the fact that the first test showed a much less extreme distribution of scores compared to Metacritic implies they hand out some very unexpected scores. This is supported by the scores for Puzzle Quest and Yoshi's Island DS, which are far lower than you might expect. Perhaps Nintendo allows for more honest scores for games they know will sell regardless.