
What is p-hacking, and why do most researchers do it?

What sets good researchers apart is their ability to find a compelling story in a data set. It is what we do – we review various data points, combine that with our knowledge of a client’s business, and craft a story that leads to market insight.

Unfortunately, researchers can be too good at this. We have a running joke in our firm that we could probably hand a random data set to an analyst, and they could come up with a story that was every bit as convincing as the story they would develop from actual data.

Market researchers need to be wary of something well-known among academic researchers: a phenomenon known as “p-hacking.” It is a tendency to run and re-run analyses until we discover a statistically significant result.

A “p-value” is one of the most important statistics in research. It can be tricky to define precisely: it is the probability of seeing a difference at least as large as the one you observed if, in reality, there were no difference between your test and control. Put another way, it is the chance of falsely rejecting the null hypothesis of no effect. We say a result is statistically significant when the p-value is less than 5%, meaning that if nothing real were going on, a result this large would come up by chance less than 5% of the time.
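To make that concrete, here is a minimal sketch (in Python, with invented numbers rather than data from any real study) of how a p-value gets computed when comparing a test cell against a control cell:

```python
# Minimal sketch: comparing mean ratings for a hypothetical test vs. control cell.
# The numbers below are invented for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=7.0, scale=2.0, size=400)  # hypothetical control-cell ratings
test = rng.normal(loc=7.4, scale=2.0, size=400)     # hypothetical test-cell ratings

t_stat, p_value = stats.ttest_ind(test, control)
print(f"p-value = {p_value:.3f}")
# Reads as: if the two cells truly did not differ, a gap at least this big would
# show up by chance about p_value of the time. Below 0.05, we call it significant.
```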

Researchers widely use p-values to determine if a result is worth mentioning. In academia, most papers will not be published in a peer-reviewed journal if their p-value is not below 5%. Most quant analysts will not highlight a finding in market research if the p-value isn’t under 5%.

P-hacking is what happens when the initial analysis doesn’t hit this threshold. Researchers will do things such as:

  • Change the variable. Our result doesn’t hit the threshold, so we search for a new measure where it does.
  • Redefine our variables. Using the full range of the response didn’t work, so we look at the top box, the top 2 boxes, the mean, etc., until the result we want pans out.
  • Change the population. It didn’t work with all respondents, but is there something among a subgroup, such as males, young respondents, or customers?
  • Run a table that does statistical testing of all subgroups compared to each other. (Even when no real differences exist, about one in 20 of these comparisons will come up significant purely by chance.)
  • Relax the threshold. The findings didn’t work at 5%, so we go ahead and report them anyway and say they are “directional.”

These tactics are all inappropriate and common. If you are a market researcher and reading this, I’d be surprised if you haven’t done all of these at some point in your career. I have done them all.

P-hacking happens for understandable reasons. Other information outside the study points towards a result we should be getting. Our clients pressure us to do it. And, with today’s sample sizes being so large, p-hacking is easy to do. Give me a random data set with 2,000 respondents, and I will guarantee that I can find statistically significant results and create a story around them that will wow your marketing team.
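If that claim sounds far-fetched, the sketch below (Python, entirely simulated data, no real respondents involved) shows what happens when you test pure noise across a pile of subgroups: a few “findings” will typically clear the 5% bar.

```python
# Sketch: p-hacking on pure noise. 2,000 hypothetical respondents, random answers,
# tested across many subgroup/measure combinations. No real effects exist, yet
# roughly 5% of comparisons will come out "statistically significant" by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 2000
measures = rng.normal(size=(n, 10))          # 10 purely random "survey measures"
subgroups = rng.integers(0, 2, size=(n, 8))  # 8 purely random binary subgroup splits

significant = []
for m in range(measures.shape[1]):
    for s in range(subgroups.shape[1]):
        group_a = measures[subgroups[:, s] == 0, m]
        group_b = measures[subgroups[:, s] == 1, m]
        _, p = stats.ttest_ind(group_a, group_b)
        if p < 0.05:
            significant.append((m, s, round(p, 3)))

print(f"{len(significant)} of {10 * 8} noise-only comparisons are 'significant' at 5%")
# Expect roughly 4 spurious findings out of 80 -- each one ready to be spun into a story.
```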

I learned about p-hacking the hard way. Early in my career, I gathered an extensive data set for a college professor who was well-known and well-published within his field. He asked me to run some statistical analyses for him. When the ones he specified didn’t pan out, I started running the data on subgroups, changing how some variables were defined, etc., until I could present him with significant statistical output.

Fortunately, rather than chastise me, he went into teaching mode. He told me that just fishing around in the data set until you find something that works statistically is not how data analysis should be done. With a big data set and enough hooks in the water, you will always find some insight ready to bite.

Instead, he taught me that you always start with a hypothesis. If that hypothesis doesn’t pan out, first recognize that there is some learning in that. And it is okay to use that learning to adjust your hypothesis and test again, but your analysis has to be driven by the theory instead of the theory being driven by the data.

Good analysis is not about tinkering with data through trial and error. Too many researchers do this until something works. They fail to report on the many unproductive rabbit holes they dug. But at a 5% threshold, even pure noise will hand you a statistically significant result about one time in 20.

This sounds obscure, but I would say that it is the most common mistake I see marketing analysts make. Clients will press us to redefine variables to make a regression work better. We’ll use “top box” measures rather than the full variable range, with no real reason except that it makes our models fit. We relax the level of statistical significance. We p-hack.

In general, market researchers “fish in the data” a lot. I sometimes wonder how many lousy marketing decisions have been made over time due to p-hacking.

I used to sit next to an incredible statistician. As good a data analyst as he was, he was one of the worst questionnaire writers I have ever met. He didn’t seem to care too much, as he felt he could wrangle almost any data into submission with his talent. He was a world-class p-hacker.

I was the opposite. I’ve never been a great statistician. So, I’ve learned to compensate by developing design talent, as I quickly noticed that a well-written questionnaire makes data analysis easy and often obviates the need for complex statistics. I learned over time that a good questionnaire is an antidote to p-hacking. 

Start with hypotheses and think about alternative hypotheses when you design the project. And develop these before you even compose a questionnaire. Never believe that the story will magically appear in your data – instead, start with a range of potential stories and then, in your design, allow for data to support or refute each of them. Be balanced in how you go about it, but be directed as well.

It is vital to push for the time upfront to accomplish this, as the collapsed time frames for today’s projects are a key cause of p-hacking.

Of course, nobody wants to conduct a project and be unable to conclude anything. If that happens, you likely went wrong at the project’s design stage – you didn’t lay out objectives and potential hypotheses well. Resist the tendency to p-hack, be mindful of this issue, and design your studies well so you won’t be tempted to do it.

Should we get rid of statistical significance?

There has been recent debate among academics and statisticians surrounding the concept of statistical significance. Some high-profile medical studies have just narrowly missed meeting the traditional statistical significance cutoff of 0.05. This has resulted in potentially life-changing drugs not being approved by regulators or pursued for further development by pharma companies. These cases have led to a much-needed review and re-education as to what statistical significance means and how it should be applied.

In a 2014 blog post (Is This Study Significant?) we discussed common misunderstandings market researchers have regarding statistical significance. The recent debate suggests this misunderstanding isn’t limited to market researchers – it appears that academics and regulators have the same difficulty.

Statistical significance is a simple concept. However, the human brain just isn’t wired to grasp probability well, and that lies at the root of the problem.

A measure is typically classified as statistically significant if its p-value is 0.05 or less. This means that, if there were truly no effect, a result at least this large would arise from chance or random fluctuation less than 5% of the time. Likewise, two measures are deemed statistically different if the gap between them would show up by random fluctuation less than 1 time in 20.

There are real problems with this approach. Foremost, the 5% cutoff is arbitrary. It was popularized by early twentieth-century statisticians and somewhere along the line became a standard among academics. This standard could have just as easily been 4% or 6% or some other number. The cutoff was chosen subjectively.

What are the chances that this 5% cutoff is optimal for all studies, regardless of the situation?

Regulators should look beyond statistical significance when they are reviewing a new medication. Let’s say a study was only significant at 6%, not quite meeting the 5% standard. That shouldn’t automatically disqualify a promising medication from consideration. Instead, regulators should look at the situation more holistically. What will the drug do? What are its side effects? How much pain does it alleviate? What is the risk of making mistakes in approval: in approving a drug that doesn’t work or in failing to approve a drug that does work? We could argue that the level of significance required in the study should depend on the answers to these questions and shouldn’t be the same in all cases.

The same is true in market research. Suppose you are researching a new product and the study is only significant at 10% and not the 5% that is standard. Whether you should greenlight the product for development depends on considerations beyond statistical significance. What is the market potential of the product? What is the cost of its development? What is the risk of failing to greenlight a winning idea or greenlighting a bad idea? Currently, too many product managers rely too much on a research project to give them answers when the study is just one of many inputs into these decisions.

There is another reason to rethink the concept of statistical significance in market research projects. Statistical significance assumes a random or a probability sample. We can’t stress this enough – there hasn’t been a market research study conducted in at least 20 years that can credibly claim to have used a true probability sample of respondents. Some (most notably address-based, or ABS, samples) make a valiant attempt to do so, but they still violate the very basis for statistical significance.

Given that, why do research suppliers (Crux Research included) continue to do statistical testing on projects? Well, one reason is that clients have come to expect it. A more important reason is that statistical significance holds some meaning. On almost every study we need to draw a line and say that two data points are “different enough” to point out to clients and to draw conclusions from. Statistical significance is a useful tool for this. It just should no longer be viewed as a tool where we can say precise things like “these two data points have a 95% chance of actually being different”.

We’d rather use a probability approach and report to clients the chance that two data points would be different if we had been lucky enough to use a random sample. That is a much more useful way to look at data, but it probably won’t be used much until colleges start teaching it and a new generation of researchers emerges.
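One possible version of this, sketched below in Python with hypothetical numbers (the helper function is an illustration, not an established or documented method), is to report the probability that the observed direction of a difference between two percentages would hold if a true random sample had been drawn, rather than a pass/fail significance verdict.

```python
# Illustrative sketch only: one way a "probability approach" could look. The helper
# and its numbers are hypothetical, not a documented Crux Research method.
import math

def prob_direction_holds(p1, n1, p2, n2):
    """Probability that group 1's true proportion exceeds group 2's, using a normal
    approximation to the difference in sample proportions (illustrative only)."""
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z

# Hypothetical example: 54% vs. 50% agreement, 500 respondents in each group.
print(f"{prob_direction_holds(0.54, 500, 0.50, 500):.0%} chance the true gap runs in this direction")
```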

The current debate over the usefulness of statistical significance is a healthy one to have. Hopefully, it will cause researchers of all types to think more deeply about how precise a study needs to be, and we’ll move away from the one-size-fits-all thinking that has been pervasive for decades.

