Below is what SEO thought leader Mike King (who happens to be my boss at iAcquire) tweeted a couple of months ago. It is a funny tweet because he made up a statistic, “67.3%,” to show how easily statistics can be invented.
I don’t know what percentage of the world’s statistics are made up, and I am sure Mike doesn’t know either. But the tweet gets the idea across: at least some statistics are made up, inaccurate or not credible. As an analyst, I do a lot of research and use statistics to back up my findings. However, statistics come from data, and there are many ways for them to go wrong along the way. Getting from data to statistics involves stages like:
- data collection
- data entry
- data analysis
- data reporting
- data visualization
Malpractice can creep in at every stage. For example, data collection may be biased; errors may occur during data entry; the analysis may be flawed; the results may be misinterpreted during reporting; and the visualization may be misleading.
As a result, evaluating the accuracy, credibility and quality of statistics from different sources has become part of my daily practice. In this post, I will share five pieces of advice on how to spot statistics that are inaccurate, misleading or not credible, and on what makes statistics trustworthy.
1. Do a Little Bit of Math and Apply Common Sense
First and foremost, change your mindset: instead of taking statistics at face value, allow yourself enough skepticism to question the statistics you see. As I discussed above, there are simply too many ways for statistics to go wrong.
Now look at the statistics, charts or graphs. Are there any obvious mistakes? Do the percentages in a table add up to 100% along the rows or columns where they should?
Let’s look at the example below. It is a screenshot from Fox News in 2009.
What?! The shares in the pie chart add up to 167%? Aren’t they supposed to add up to 100%? If you see a chart like this, don’t try to guess what it means; just discard it!
Also, apply common sense when judging the credibility and accuracy of statistics. Think about what purpose the statistics serve. It is an open secret that advertisements like to manipulate consumers’ minds with statistics. Look at the advertisement by AT&T (of course) below. Really, don’t believe it unless you are given a detailed report behind it. You don’t know how, where or from whom they collected the data, and these factors can produce very different results. So statistics like these are not convincing either.
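The sanity check described above is easy to automate. Below is a minimal Python sketch; the function name, the rounding tolerance and both sets of slice values are assumptions made up for illustration (the first set is simply chosen to total 167%, like the chart discussed above).

```python
def shares_sum_to_100(shares, tolerance=0.5):
    """Return True if the percentage shares total 100 within a rounding tolerance."""
    return abs(sum(shares.values()) - 100) <= tolerance


# Hypothetical pie-chart slices, chosen so that they total 167%.
suspicious_chart = {"Option A": 62.0, "Option B": 55.0, "Option C": 50.0}

# Hypothetical slices that form a valid whole.
valid_chart = {"Option A": 45.0, "Option B": 35.0, "Option C": 20.0}

for name, chart in [("suspicious", suspicious_chart), ("valid", valid_chart)]:
    total = sum(chart.values())
    print(f"{name}: total = {total}%, adds up = {shares_sum_to_100(chart)}")
```

One caveat: when respondents are allowed to pick more than one answer, totals above 100% are legitimate, but then a pie chart is the wrong way to display the results.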
2. Always Look for the Source and Check the Authority of the Source
When you are given a statistic, the best practice is to always look into the source. A statistic without a source is useless. If a source is provided, check its authority, because some sources have no credibility. Usually the source will be a website. Google PageRank and Moz’s Domain Authority are good metrics for measuring the authority of a website; you can look up a site’s PageRank, and you can download the MozBar to see its Domain Authority. You can even check the authority of the author by looking at how many Twitter followers the author has and at the author’s other posts. Beyond that, remember to look at the date as well. If the source is outdated, the statistics are less accurate and less credible.
Credible statistics look like this:
In this example, there is a source and a date. The post also links to a discussion of how the data was collected. In addition, the author, Rand Fishkin, has authority judging by his title (Founder of Moz) and the number of Twitter followers he has. Therefore, the statistics are trustworthy.
3. Question Whether the Statistics Are Biased or Statistically Insignificant
After applying simple math and common sense and checking the source, we now go deeper and question the methodology of the surveys, polls, studies, etc. In this section, we question the statistics at the “data collection” and “data analysis” stages that I discussed at the beginning.
Most statistics come from samples. If a sample is not representative, the statistics will be biased. It is also good practice to check the sample size: if the sample is too small, the results can swing widely by chance. During data collection, sampling bias is possible: unrepresentative demographics, unrepresentative geographic locations, etc. With sampling bias, the results are of little or no value, since they can be quite different from what the actual world is like.
Remember the 1936 presidential election between Roosevelt and Landon? The Literary Digest, one of the most respected magazines at that time, predicted that Landon would win the election by a large margin, while the real election results turned out to be the opposite. The cause was sampling bias. The Literary Digest polled over 10 million people and received 2.4 million responses. Those who responded to the poll were mostly upper-class people, who were more likely to vote for the Republican candidate. A huge sample does not fix an unrepresentative one.
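To see why a big sample cannot rescue a biased one, here is a small simulation sketch in Python. Everything in it is a made-up assumption for illustration: the size of the electorate, the share of upper-income voters, the support rates for a hypothetical candidate A, and the response rates. It is not a reconstruction of the 1936 poll.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable


def make_voter():
    """Create one hypothetical voter as (is_upper_income, supports_candidate_a)."""
    upper_income = random.random() < 0.20
    # Assumed support rates: 30% among upper-income voters, 65% among the rest.
    supports_a = random.random() < (0.30 if upper_income else 0.65)
    return upper_income, supports_a


def support_rate(voters):
    """Share of the given voters who support candidate A."""
    return sum(supports for _, supports in voters) / len(voters)


population = [make_voter() for _ in range(200_000)]

# Biased poll: upper-income voters are ten times more likely to respond.
biased_sample = [
    voter for voter in population
    if random.random() < (0.50 if voter[0] else 0.05)
]

# Far smaller sample, but drawn uniformly at random from everyone.
random_sample = random.sample(population, 1_000)

true_rate = support_rate(population)
biased_rate = support_rate(biased_sample)
random_rate = support_rate(random_sample)

print(f"true support for A:        {true_rate:.1%}")
print(f"biased poll (n={len(biased_sample)}): {biased_rate:.1%}")
print(f"random poll (n={len(random_sample)}):  {random_rate:.1%}")
```

Under these assumptions the biased poll has many times more respondents than the random one, yet it lands much further from the true support rate, because it over-represents one group.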
During the “data analysis” stage, some results are used even though they are statistically insignificant. A common place where this comes up is regression analysis. Regressions are used to estimate the relationships between variables. However, not every regression fits the data well, and not every relationship it reports is reliable. There are two important metrics to check in a regression: R Square (or Adjusted R Square) and the p-value. R Square measures “goodness of fit.” It ranges from 0 to 100%, and an R Square of 0.55 means that 55% of the variation in the outcome is explained by the model. Usually, the higher the better. As a rule of thumb, more than 60% is usually a good fit. If the R Square is too low, 20% for example, the regression probably does not describe the data well. After looking at R Square, you should look at the p-values of the individual variables. You can find more explanation of R Square and p-values in my post on quantitative data analysis. If you are not familiar with statistics, you can memorize the rule as “a p-value less than 0.1 is statistically significant,” keeping in mind that many fields use a stricter cutoff of 0.05.
Assume that an article uses a regression output to conclude that product revenue is strongly correlated with quality and packaging (shown in the example below). Given a regression output, we need to look at two things. First, as discussed earlier, we should look at the R Square or Adjusted R Square: 82% is a good fit. Second, we should look at the p-values, which the article fails to discuss. The p-value of Packaging is more than 0.1, so Packaging is not statistically significant. Therefore, the finding is not fully supported by this regression output, and the article’s conclusion about packaging is not valid.
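The two-step review of a regression output can be written down as a short checklist. The Python sketch below applies the rules of thumb from this post (R Square of at least 60%, p-value below 0.1). The function name is hypothetical, and so are the two p-values, since the actual regression output is not reproduced in the text; only the 0.82 R Square and the fact that Packaging is above 0.1 come from the example.

```python
def review_regression(r_squared, p_values, min_r_squared=0.60, alpha=0.10):
    """Apply the post's rules of thumb to a reported regression output.

    r_squared: reported R Square (or Adjusted R Square) as a fraction.
    p_values:  mapping of variable name -> reported p-value.
    """
    return {
        "good_fit": r_squared >= min_r_squared,
        "significant": sorted(
            name for name, p in p_values.items() if p < alpha
        ),
        "not_significant": sorted(
            name for name, p in p_values.items() if p >= alpha
        ),
    }


# R Square of 0.82 is from the example; both p-values are hypothetical.
report = review_regression(0.82, {"Quality": 0.003, "Packaging": 0.27})
print(report)
```

A claim about a variable is only supported when the fit is good and that variable also appears in the "significant" list; here Packaging does not.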
4. Question Whether the Statistics Are Purposely Skewed or Misinterpreted
Even when the data results are correct, statistics can be misinterpreted. In that case, you will see wrong conclusions drawn from accurate analysis results. On the other hand, some statistics are skewed or exaggerated visually to serve the author’s purposes. In this part, we address the issues that arise at the “data reporting” and “data visualization” stages.
Statistics That Are Purposely Skewed
Below is an example from Fox News (Yes, Fox AGAIN!):
It looks like the percentage changes a lot from “now” to Jan 1, 2013. But look closely and you can see that the vertical axis starts at 34% instead of 0. That is what makes the chart misleading. Fox News exaggerated the difference to serve the purpose of pushing the renewal of Bush’s tax cuts. In other words, the way Fox shows the statistics here is misleading rather than objective. Below is what the real percentage looks like:
You can find more examples of Fox’s bad usage of charts. Fox probably needs to hire a better media analyst!
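The distortion caused by a truncated axis can be put into numbers by comparing how tall the bars are drawn with how large the values really are. In the Python sketch below, the 34% axis minimum is from the chart described above, while the two rates (35% and 39.6%) are assumed values used for illustration, since the chart itself is not reproduced in the text. The function name is hypothetical.

```python
def visual_ratio(small, large, axis_min=0.0):
    """Ratio of the drawn bar heights when the vertical axis starts at axis_min."""
    return (large - axis_min) / (small - axis_min)


# Assumed values for illustration; the axis minimum of 34 is from the chart.
now_rate, later_rate = 35.0, 39.6

honest = visual_ratio(now_rate, later_rate)            # axis starts at 0
truncated = visual_ratio(now_rate, later_rate, 34.0)   # axis starts at 34

print(f"axis from 0:  second bar is {honest:.2f}x the first")
print(f"axis from 34: second bar is {truncated:.2f}x the first")
```

With these assumed values, the second bar is drawn several times taller than the first on the truncated axis, even though the underlying number is only about 13% larger.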
Misinterpretation and Logical Fallacies
The conversation below is one I overheard between a couple:
Boyfriend: You’re cool when you’re drunk.
Girlfriend: So I am not cool when I am not drunk?!
This is a typical logical fallacy (logicians call it denying the antecedent): treating one proposition as the negation of another when the two are not collectively exhaustive. Collectively exhaustive means that one of the propositions must be true and there are no other possibilities. However, “cool when drunk” and “not cool when not drunk” are not collectively exhaustive. “Cool when not drunk” is also a possibility. The girlfriend simply ruled out the “cool when not drunk” proposition.
People make logical fallacies like this when interpreting data results, too. Given the finding that 37% of New York City residents have gone to Central Park once, a conclusion like “this indicates 63% of NYC residents have never been to Central Park” is incorrect. Zero visits and one visit are not collectively exhaustive; people may also have been to Central Park two times, three times, etc. So the 63% includes not only those who have never been to Central Park but also those who have been there multiple times. Whenever you see an interpretation like this, watch out for this kind of logical fallacy.
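A quick tally makes the Central Park point concrete. The Python sketch below uses a hypothetical survey of 100 respondents; the split between "never" and "more than once" is invented for illustration, and only the 37% who visited exactly once matches the example above.

```python
from collections import Counter

# Hypothetical answers to "How many times have you been to Central Park?"
visits = [0] * 25 + [1] * 37 + [2] * 20 + [3] * 18

counts = Counter(visits)
total = len(visits)

once = counts[1] / total
never = counts[0] / total
more_than_once = sum(n for v, n in counts.items() if v >= 2) / total

print(f"exactly once:   {once:.0%}")
print(f"never:          {never:.0%}")
print(f"more than once: {more_than_once:.0%}")
print(f"not 'once':     {never + more_than_once:.0%}")
```

The "not once" group is the sum of two different groups, so reading it as "never" overstates the number of people who have not visited.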
5. Fully Utilize Your Resources to Conduct More Research
The Huffington Post published America’s Top 10 Favorite Actors without properly citing the source. It led to some discussion in the comments on the post, as shown below:
Since there is no detailed report attached to the post, it is no wonder that readers raise questions about sampling bias. At this point, in order to evaluate the list, we need to do extra research. Search engines are always a good friend for research. After poking around, I found this poll of America’s favorite actors. The report actually looks very professional and the sample looks representative. Aha, it looks like Johnny Depp really is America’s favorite man!
Besides search engines, there are plenty of other resources where you can find more information. Pew Internet is a very good resource, as are Simmons, Nielsen and other reputable market research companies.
Always keep in mind that statistics can be wrong. Allow yourself to be skeptical of statistics. Whenever possible, ask for the root source of the statistics or even the raw data. Research the domain and the authors behind the statistics. Question what purpose the statistics serve for their author. Do more research, and look for similar figures from other sources.