christian and jay

Is Nintendo Power biased? A semi-serious statistical survey

Nintendo Power is going to be outsourced to Imagine Media, meaning the long-standing and proud magazine will no longer be run by Nintendo themselves. This inspired Jay and Christian to resurrect this old project. Most people claim that Nintendo Power review scores cannot be trusted, or are at least suspect, considering they come “in house” rather than from an independent source. Now that this will no longer be the case, let us look back at the old Nintendo Power and see how they stacked up to the rest of the gaming world when reviewing Nintendo DS games.

A few nuggets of info before we start. The numbers used are the weighted averages from Metacritic, Nintendo Power’s scores, and the minimum and maximum non-Nintendo Power scores. All numbers are for the top 30 and bottom 30 DS games.

There are so many problems with the numbers used here that we cannot draw any serious statistical conclusions from them. For one, we start off with weighted averages before we even calculate any means, and we have no idea how Metacritic calculates their averages. The samples are not random, and are not necessarily independent. Also, knowing game reviewers, it is hard to say that these review scores follow a normal distribution (insert joke here about how everything is rated a 7).

The point is, we are doing this for fun. If you want to point out all the flaws, even in the calculation of the stats, you are wasting your breath. I know my stats prowess is rusty as hell, and you know you are free to stop reading whenever you please. Also, for reference, we provide the numbers we used in .xls and .csv format. Now let’s get on with it.

First we will look at the Metacritic averages for all the DS games in my spreadsheet compared to NinPow’s scores by doing a classic two-sample t-test. The null hypothesis is, of course, that there is no difference between them; the alternative is that there is a difference. Here are the numbers from Minitab:

T-Value = -1.86 P-Value = 0.065 DF = 117

So from this we fail to reject the null hypothesis, and do not have enough evidence to say Nintendo Power’s scores are significantly different from the rest of the gaming world’s.
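For anyone who wants to poke at the numbers without Minitab, here is a minimal sketch of the calculation behind a two-sample t-test. It assumes the unequal-variance (Welch) form, which is a guess on our part based on the DF values in the output not being the pooled n1 + n2 - 2. The function name welch_t and the scores in the example are made up for illustration; they are not the values from the spreadsheet.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test.

    Hypothetical helper for illustration; not the exact routine Minitab runs.
    """
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean (sample variance / n).
    va = variance(a) / na
    vb = variance(b) / nb
    # t statistic: difference in means over the combined standard error.
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Made-up scores purely for illustration; NOT taken from the spreadsheet.
metacritic = [38, 42, 45, 51, 55, 60]
nintendo_power = [50, 55, 60, 62, 65, 70]

t, df = welch_t(metacritic, nintendo_power)
# Rounding DF down to a whole number is an assumption about how to present it.
print(f"T-Value = {t:.2f}  DF = {math.floor(df)}")
```

The p-value comes from comparing the t statistic against a t distribution with that many degrees of freedom, which needs a stats library or a lookup table, so it is left out here. A negative t simply means the first sample's mean is lower than the second's.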

Let’s look at some boxplots:

Nasty! There is a lot to see here, but for our simple study I’ll just say this: Nintendo Power’s scores look to be higher on the whole, but its boxplot looks fairly normal, with the mean and median close to each other, and the data does not look terribly skewed. The Metacritic scores are highly skewed, and the data looks to be affected much more by outliers. To visualize this a bit better, here’s a dotplot:

Nintendo Power’s scores are fairly spread out, while Metacritic’s are sharply divided. This is fairly interesting: Metacritic does exactly what we might expect, considering these are the 30 best and worst rated games on the DS. Nintendo Power, however, has its scores fairly spread out. This may mean that they are sometimes extra harsh or extra lenient on certain games.

Now let us do a t-test on the lowest 30 games. The null hypothesis is that there is no difference; the alternative is that Metacritic scores are lower than NinPow scores.

Difference = mu Weighted Average – mu Nintendo Power

Estimate for difference: -12.07

95% CI for difference: (-17.39, -6.76)

T-Test of difference = 0 (vs <): T-Value = -4.57 P-Value = 0.000 DF = 46

This t-test rejects the null hypothesis, which would support the conclusion that Nintendo Power is more favorable than other reviewers when it comes to crappy DS games.

Now let us do the same for the top 30 games. The hypotheses are the same.

Difference = mu Weighted Average – mu Nintendo Power

Estimate for difference: -0.23

95% upper bound for difference: 3.78

T-Test of difference = 0 (vs <): T-Value = -0.10 P-Value = 0.461 DF = 33

We fail to reject the null hypothesis this time, so there is not enough evidence to say there is a difference in how the review outlets judge the best of the DS games. Here are the boxplots and dotplots for this test, which show a pretty even distribution of scores from Metacritic, and some severe outliers from Nintendo Power.

I took a look to see what games were causing the outliers, and the results were surprising. Several games got abysmal scores from NinPow, including Puzzle Quest and Yoshi’s Island DS. Interesting considering how popular these two games are.

The final test I wanted to do was to compare the minimum review scores from Metacritic with Nintendo Power’s scores. So I did it. The results:

Two-sample T for Min Review vs Nintendo Power

Difference = mu Min Review – mu Nintendo Power

Estimate for difference: -28.87

95% upper bound for difference: -22.29

T-Test of difference = 0 (vs <): T-Value = -7.28 P-Value = 0.000 DF = 113

So it would seem that NinPow does give more favorable scores than other review sites, though this test is useless considering it ignores all other scores for the games.

Conclusions:
Because this is an informal analysis, we can’t really glean any factual information. But we sure can speculate. Metacritic implies that most reviewers are doing what we would expect them to, praising the best and slamming the worst the DS has to offer. Nintendo Power, however, is a little goofy. They very well may be kinder to crappy games, and the fact that the first test showed a much less extreme distribution of scores compared to Metacritic implies they have some very unexpected scores. This is supported by the scores for Puzzle Quest and Yoshi’s Island, which are surprisingly lower than you might expect. Perhaps Nintendo allows for more honest scores for games they know will sell regardless.

Files –
Nin pow .csv
Nin pow .xls

16 years ago

Hmmmm… yeerrrrrsss… The wonderful world of numbers and how they influence our consumering opinions. My mind wanders off to an episode of Penn & Teller’s Bullshit (the numbers one)… Something about numbers never being wrong, just the fuckers that manipulate them!

The only purpose of putting a number on a game review is purely to influence and bedazzle the illiterate (or just plain fucking lazy, skim reading, gimmi-lotsa-sugar) market. Personally, I’d like to do away with all the little 9.5s in their pretentious, self-righteous little blue boxes and see more critical reviews that actually SAY something about what the reviewer has experienced.

Well, it’s going to be very interesting to see how this mighty magazine shall fare under new management.

Well, that’s my useless little rant.

Keep up the FUNtastic work guys. You keep writing it, I’ll keep reading it!

16 years ago

Hav u sean mi bag ov shuger?

16 years ago

At the risk of falling into the just plain lazy market, I really like numbers, primarily because they enable aggregator sites like metacritic. (Whether the initial article assigns one, or metacritic comes along and does it later doesn’t matter as much, you still end up with a number) Before I buy a game, I’m definitely going to read a full review or two to get nuanced impressions, but I don’t want to waste that time on games that just plain suck.

Filtering out games which got consistently bad reviews requires a fixed standard for automated comparison, and those tend to come back down to numbers, or at least ordinal values.

16 years ago

At this point, I can live with numbering systems in reviews, opting to ignore them as much as possible. The reason they still continue to bug the hell out of me is far too often the review text does not seem to match up with the score given. You’ll see a 9 or a 10 for a game that has more flaws discussed than high points. This leads to reviewers defending themselves by saying things like “If you actually take a look at what I said, you’ll find that …” and so on.

It’s usually a bullshit and contradictory defense. If your score is a 9 out of 10, that means it’s more favorable than not, and thus you would think the author would have more favorable than unfavorable things to say. I guess that the publishers only pay for the number, which is what I’m really starting to believe is the case.

16 years ago

I’m of the firm belief that numbers should not be used. I think the system we have now is flawed, and should not be used anymore. If I had a magazine or something, I would use this system: Buy it, rent it, buy it if you’re a fan, and stay away. I’m sure it needs some refinement, but the general idea is there. I’m way more forgiving when reviewing a game, and I believe most consumers should be like that as well.

16 years ago

I agree with Matt and think a nice merging of quick and accessible and not too stupid is a simple thumbs up/thumbs down rating. This way you can glance to see if the reviewer liked the game overall and then are forced to actually read if you want to know how good or bad something is and why.