Archive for the 'Methodology' Category

Less useful research questions

Questionnaire “real estate” is limited and valuable. Most surveys fielded today are too long, which causes problems with respondent fatigue and trust. Researchers tend to start the questionnaire design process with good intent, aiming to keep survey experiences short and compelling for respondents. However, it is rare to see a questionnaire get shorter as it undergoes revision and review, and the result is often an impossibly long survey.

One way to guard against this is to be mindful. All questions included should have a clear purpose and tie back to study objectives. Many times, researchers include some questions and options simply out of habit, and not because these questions will add value to the project.

Below are examples of question types that, more often than not, add little to most questionnaires. These questions are common and used out of habit. There are certainly exceptions when it makes sense to include these questions, but for the most part we advise against using them unless there is a specific reason to include them.

Marital status

Somewhere along the way, asking a respondent’s marital status became standard on most consumer questionnaires. Across thousands of studies, I can only recall a few times when I have actually used it for anything. It is appropriate to ask if it is relevant. Perhaps your client is a jewelry company or in the bridal industry. Or, maybe you are studying relationships. However, I would nominate marital status as being the least used question in survey research history.

Other (specify)

Many multiple response questions ask a respondent to select all that apply from a list, and then as a final option will have “other.” Clients constantly pressure researchers to leave a space for respondents to type out what this “other” option is. We rarely look at what they type in. I tell clients that if we expect a lot of respondents to select the other option, it probably means that we have not done a good job at developing the list. It may also mean that we should be asking the question in an open-ended fashion. Even when it is included, most of the respondents who select other will not type anything into the little box anyway.

Don’t Know Options

We recently composed an entire post about when to include a Don’t Know option on a question. To sum it up, the incoming assumption should be that you will not use a Don’t Know option unless you have an explicit reason to do so. Including Don’t Know as an option can make a data set hard to analyze. However, there are exceptions to this rule, as Don’t Know can be an appropriate choice. That said, it is overused on surveys currently.

Open-Ends

The transition from telephone to online research has completely changed how researchers can ask open-ended questions. In the telephone days, we could pose questions that were very open-ended because we had trained interviewers who could probe for meaningful answers. With online surveys, open-ended questions that are too loose rarely produce useful information. Open-ends need to be specific and targeted. We favor the inclusion of just a handful of open-ends in each survey, and that they are a bit less “open-ended” than what has been traditionally asked.

Grid questions with long lists

We have all seen these. These are long lists of items that require a scaled response, perhaps a 5-point agree/disagree scale. The most common abandon point on a survey is the first time a respondent encounters a grid question with a long list. Ideally, these lists are about 4 to 6 items and there are no more than two or three of them on a questionnaire.

We are currently fielding a study with a list like this containing 28 items. There is no way we are getting good information from this question, and we are fatiguing the respondent for the remainder of the survey.

Specifying time frames

Survey research often seeks to find out about a behavior across a specified time frame. For instance, we might want to know if a consumer has used a product in the past day, past week, past month, etc. The issue here is not so much the time frame as treating the responses literally. I have seen clients take past-day usage and multiply it by 365 and assume that will equate to past-year usage. Mathematically that might hold, but it isn’t how respondents react to questions.

In reality, it is likely accurate to ask if a respondent has done something in the past day. But, once the time frames get longer, we are really asking about “ever” usage. It depends a bit on the purchase cycle of the product and its cost, but for most products, asking if they have used in the past month, 6 months, year, etc. will yield similar responses.

Some researchers work around this by just asking “ever used” and “recently used.” There are times when that works, but we tend to set a reasonable time frame for recent use and go with that, typically within the past week.

Household income

Researchers have asked about household income for as long as the survey research field has existed. There are at least three serious problems with it. First, many respondents simply do not know what their household income is. Most households have a “family CFO” who takes the lead on financial issues, and even this person often will not know the family income.

Second, the categories chosen affect the response to the income question, indicating just how unstable it is. Asking household income in say, ten categories versus five categories will not result in comparable data. Respondents tend to assume the middle of the range given is normal, and respond using that as a reference point.

Third, and most importantly, household income is a lousy measure of socio-economic status (SES). Many young people have low annual incomes but a wealthy lifestyle as they are still being supported by their parents. Many older people are retired and may have almost non-existent incomes, yet live a wealthy lifestyle off of their savings. Household income tends to be a reasonable measure of SES only for respondents aged about 30 to 60.

There are better measures of SES. Education level can work, and a particularly good question is to ask the respondent about their mother’s level of education, which has been shown to correlate strongly with SES. We also ask about their attitudes towards their income – whether they have all the money they need, just enough, or if they struggle to meet basic expenses.

Attention spans are getting shorter and as more and more surveys are being completed on mobile devices there are plenty of distractions as respondents answer questionnaires. Engage them, get their attention, and keep the questionnaire short. There may be no such thing as a dumb question, but there are certainly questions that when asked on a survey do not yield useful information.

Should you include a “Don’t Know” option on your survey question?

Questionnaire writers construct a bridge between client objectives and a line of questioning that a respondent can understand. This is an underappreciated skill.

The best questionnaire writers empathize with respondents and think deeply about tasks respondents are asked to perform. We want to strike a balance between the level of cognitive effort required and a need to efficiently gather large amounts of data. If the cognitive effort required is too low, the data captured is not of high quality. If it is too high, respondents get fatigued and stop attending to our questions.

One of the most common decisions researchers have to make is whether or not to allow for a Don’t Know (DK) option on a question. This is often a difficult choice, and the correct answer on whether to include a DK option might be the worst possible answer: “It depends.”

Researchers have genuine disagreements about the value of a DK option. I lean strongly towards not using DK’s unless there is a clear and considered reason for doing so.

Clients pay us to get answers from respondents and to find out what they know, not what they don’t know. Pragmatically, whenever you are considering adding a DK option your first inclination should be that you perhaps have not designed the question well. If a large proportion of your respondent base will potentially choose “don’t know,” odds are high that you are not asking a good question to begin with, but there are exceptions.

If you get in a situation where you are not sure if you should include a DK option, the right thing to do is to think broadly and reconsider your goal: why are you asking the question in the first place? Here is an example which shows how the DK decision can actually be more complicated than it first appears.

We recently had a client that wanted us to ask a question similar to this: “Think about the last soft drink you consumed. Did this soft drink have any artificial ingredients?”

Our quandary was whether we should just ask this as a Yes/No question or to also give the respondent a DK option. There was some discussion back and forth, as we initially favored not including DK, but our client wanted it.

Then it dawned on us that whether or not to include DK depended on what the client wanted to get out of the question. On one hand, the client might want to truly understand if the last soft drink consumed had any artificial ingredients in it, which is ostensibly what the question asks. If this was the goal, we felt it was necessary to better educate the respondent on what an “artificial ingredient” was so they could provide an informed answer and so all respondents would be working from a common definition. Or, alternatively, we could ask for the exact brand and type of soft drink they consumed and then on the back-end code which ones have artificial ingredients and which do not, and thus get a good estimate for the client.

The other option was to realize that respondents might have their own definitions of “artificial ingredients” that may or may not match our client’s definition. Or, they may have no clue what is artificial and what is not.

In the end, we decided to use the DK option in this case because understanding how many people are ignorant to artificial ingredients fit well with our objectives. When we pressed the client, we learned that they wanted to document this ambiguity. If a third of consumers don’t know whether or not their soft drinks have artificial ingredients in them, this would be useful information for our client to know.

This is a good example on how a seemingly simple question can have a lot of thinking behind it and how it is important to contextualize this reasoning when reporting results. In this case, we are not really measuring whether people are drinking soft drinks with artificial ingredients. We are measuring what they think they are doing, which is not the same thing and likely more relevant from a marketing point-of-view.

There are other times when a DK option makes sense to include. For instance, some researchers conflate the lack of an opinion (a DK response) with a neutral opinion, and these are not the same thing. For example, we could ask “how would you rate the job Joe Biden is doing as President?” Someone who answers in the middle of the response scale likely has a considered, neutral opinion of Joe Biden. Someone answering DK has not considered the issue and should not be assumed to have a neutral opinion of the president. In cases like this, a DK option preserves an important distinction.

However, there are probably more times when including a DK option is a result of lazy questionnaire design than any deep thought regarding objectives. In practice, I have found that it tends to be clients who are inexperienced in market research that press hardest to include DK options.

There are at least a couple of serious problems with including DK options on questionnaires. The first is “satisficing” – which is a tendency respondents have to not place a lot of effort on responding and instead choose the option that requires the least cognitive effort. The DK option encourages satisficing. A DK option also allows respondents to disengage with the survey and can lead to inattention on subsequent items.

DK responses create difficulties when analyzing data. We like to look at questions on a common base of respondents, and that becomes difficult when respondents choose DK on some questions but not others. Including DK makes it harder to compare results across questions. DK options also limit the ability to use multivariate statistics, as a DK response does not fit neatly on a scale.

Critics would say that researchers should not force respondents to express an opinion they do not have and therefore should provide DK options. I would counter that if you expect a substantial number of people to not have an opinion, odds are high you should reframe the question and ask about something they do know. It is usually (but not always) the case that we want to find out more about what people know than what they don’t know.

“Don’t know” can be a plausible response. But, more often than not, even when it is plausible, if we feel a lot of people will choose it we should reconsider why we are asking the question. Yes, we don’t want to force people to express an opinion they don’t have. But rather than include DK, it is better to rewrite the question to be more inclusive of everybody.

As an extreme example, here is a scenario that shows how a DK can be designed out of a question:

We might start with a question the client provides us: “How many minutes does your child spend doing homework on a typical night?” For this question, it wouldn’t take much pretesting to realize that many parents don’t really know the answer to this, so our initial reaction might be to include a DK option. If we don’t, parents may give an uninformed answer.

However, upon further thought, we should realize that we may not really care about how many minutes the child spends on homework and we don’t really need to know whether the parent knows this precisely or not. Thinking even deeper, some kids are much more efficient in their homework time than others, so measuring quantity isn’t really what we want at all. What we really want to know is, is the child’s homework level appropriate and effective from the parent’s perspective?

This probing may lead us down a road to consider better questions, such as “in your opinion, does your child have too much, too little, or about the right amount of homework?” or “does the time your child spends on homework help enhance his/her understanding of the material?” This is another case when thinking more about why we are asking the question tends to result in better questions being posed.

This sort of scenario happens a lot when we start out thinking we want to ask about a behavior, when what we really want to do is ask about an attitude.

The academic research on this topic is fairly inconclusive and sometimes contradictory. I think that is because academic researchers don’t consider the most basic question, which is whether or not including DK will better serve the client’s needs. There are times that understanding that respondents don’t know is useful. But, in my experience, more often than not if a lot of respondents choose DK it means that the question wasn’t designed well. 

Which quality control checks questions should you use in your surveys?

While it is no secret that the quality of market research data has declined, how to address poor data quality is rarely discussed among clients and suppliers. When I started in market research more than 30 years ago, telephone response rates were about 60%. Six in 10 people contacted for a market research study would choose to cooperate and take our polls. Currently, telephone response rates are under 5%. If we are lucky, 1 in 20 people will take part. Online research is no better, as even from verified customer lists response rates are commonly under 10% and even the best research panels can have response rates under 5%.

Even worse, once someone does respond, a researcher has to guard against “bogus” interviews that come from scripts and bots, as well as individuals who are cheating on the survey to claim the incentives offered. Poor-quality data is clearly on the rise and is an existential threat to the market research industry that is not being taken seriously enough.

Maximizing data quality requires a broad approach with tactics deployed throughout the process. One important step is to cleanse each project of poor-quality respondents. A hidden secret in market research is that researchers routinely have to remove anywhere from 10% to 50% of respondents from their database due to poor quality.

Unfortunately, there is no industry standard way of doing this – of identifying poor-quality respondents. Every supplier sets their own policies. This is likely because there is considerable variability in how respondents are sourced for studies, a one-size-fits-all approach might not be possible, and some quality checks depend on the specific topic of the study. As a result, researchers are largely left to fend for themselves when trying to come up with a process for removing poor-quality respondents from their data.

One of the most important ways to guard against poor quality respondents is to design a compelling questionnaire to begin with. Respondents will attend to a short, relevant survey. Unfortunately, we rarely provide them with this experience.

We have been researching this issue recently in an effort to come up with a workable process for our projects. Below, we share our thoughts. The market research industry needs to work together on this issue, as when one of us removes a bad respondent from a database it helps the next firm with their future studies.

There is a practical concern for most studies – we rarely have room for more than a handful of questions that relate to quality control. In addition to speeder and straight-line checks, studies tend to have room for about 4-5 quality control questions. With the exception of “severe speeders” as described below, respondents are removed automatically only if they fail three or more of the checks – a “three strikes and you’re out” rule. If anything, this is probably too conservative, but we would rather err on the side of retaining some bad-quality respondents than inadvertently removing good ones.
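For illustration, the three-strikes rule could be sketched in a few lines of Python. The function and argument names here are ours, invented for the example, not taken from any real fielding system:

```python
def keep_or_remove(check_flags, severe_speeder=False):
    """Apply a 'three strikes and you're out' removal rule.

    check_flags: one boolean per quality check, True if the
    respondent failed that check. Severe speeders are removed
    automatically regardless of how many other checks they failed.
    """
    if severe_speeder:
        return "remove"
    return "remove" if sum(check_flags) >= 3 else "keep"

# Two failed checks: retained; three failed checks: removed.
assert keep_or_remove([True, True, False, False, False]) == "keep"
assert keep_or_remove([True, True, True, False, False]) == "remove"
assert keep_or_remove([False] * 5, severe_speeder=True) == "remove"
```

The point of keeping this logic in one function is that the removal rule can be applied programmatically at the end of fielding, without a human adjudicating each respondent.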

When possible, we favor checks that can be done programmatically, without human intervention, as that keeps fielding and quota management more efficient. To the degree possible, all quality check questions should have a base of “all respondents” and not be asked of subgroups.

Speeder Checks

We aim to set up two criteria: “severe” speeders are those that complete the survey in less than one-third of the median time. These respondents are automatically tossed. “Speeders” are those that take between one-third and one-half of the median time, and these respondents are flagged.

We also consider setting up timers within the survey – for example, we may place timers on a particularly long grid question or a question that requires substantial reading on the part of the respondent. Note that when establishing speeder checks it is important to use the median length as a benchmark and not the mean. In online surveys, some respondents will start a survey and then get distracted for a few hours and come back to it, and this really skews the average survey length. Using the median gets around that.
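A minimal sketch of the speeder classification, using the median-based thresholds described above (the function name and the sample durations are illustrative assumptions):

```python
import statistics

def speeder_flags(durations_sec):
    """Classify completion times against the median survey length.

    "severe"  = under one-third of the median (removed automatically);
    "speeder" = between one-third and one-half of the median (flagged);
    "ok"      = everything else.
    """
    median = statistics.median(durations_sec)
    labels = []
    for t in durations_sec:
        if t < median / 3:
            labels.append("severe")
        elif t < median / 2:
            labels.append("speeder")
        else:
            labels.append("ok")
    return labels

# Completion times in seconds; the 7200s respondent walked away for
# two hours. The median (310s) is barely affected by that outlier,
# whereas the mean would be badly skewed by it.
times = [300, 320, 340, 90, 140, 7200]
flags = speeder_flags(times)  # 90s is "severe", 140s is a "speeder"
```

This is why the post recommends the median over the mean as the benchmark: one distracted respondent can double the mean completion time but leaves the median essentially untouched.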

Straight Line Checks

Hopefully, we have designed our study well and do not have long grid-type questions. However, more often than not these types of questions find their way into questionnaires. For grids with more than about six items, we place a straight-lining check – if a respondent chooses the same response for every item in the grid, they are flagged.
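The straight-lining check is simple to automate. A sketch (the six-item threshold follows the rule of thumb above; the function name is ours):

```python
def straightlined(grid_responses, min_items=7):
    """Flag a respondent who gave the identical answer to every item
    in a long grid. Grids of about six items or fewer are skipped,
    since a uniform answer there may be a legitimate opinion.
    """
    if len(grid_responses) < min_items:
        return False
    return len(set(grid_responses)) == 1

assert straightlined([3, 3, 3, 3, 3, 3, 3]) is True   # flagged
assert straightlined([3, 3, 2, 3, 3, 3, 3]) is False  # varied answers
assert straightlined([5, 5, 5]) is False              # grid too short
```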

Inconsistent Answers

We consider adding two questions that check for inconsistent answers. First, we re-ask a demographic question from the screener near the end of the survey. We typically use “age” for this. If the respondent doesn’t give the same age in both questions, they are flagged.

In addition, we try to find an attitudinal question that is asked that we can re-ask in the exact opposite way. For instance, if earlier we asked “I like to go to the mall” on a 5-point agreement scale, we will also ask the opposite: “I do not like to go to the mall” on the same scale. Those that answer the same for both are flagged. We try to place these two questions a few minutes apart in the questionnaire.
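Both consistency checks reduce to simple comparisons. A sketch, with invented function names and the mall item from the example above:

```python
def age_mismatch(screener_age, reasked_age):
    """Flag when the age re-asked near the end of the survey
    differs from the age given in the screener."""
    return screener_age != reasked_age

def reversal_mismatch(item_score, reversed_item_score):
    """Flag when an item and its reversed twin got the same answer
    on a 5-point agreement scale.

    e.g. 'I like to go to the mall' = 5 (strongly agree) paired with
    'I do not like to go to the mall' = 5 is contradictory; an
    attentive respondent's answers should roughly mirror each other.
    """
    return item_score == reversed_item_score

assert age_mismatch(34, 34) is False
assert age_mismatch(34, 43) is True      # flagged
assert reversal_mismatch(5, 5) is True   # flagged: agreed with both
assert reversal_mismatch(5, 1) is False  # consistent pair
```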

Low Incidence items

This is a low attentiveness flag. It is meant to catch people who say they do really unlikely things and also catch people who say they don’t do likely things because they are not really paying attention to the questions we pose. We design this question specific to each survey and tend to ask what respondents have done over the past weekend. We like to have two high incidence items (such as “watched TV,” or “rode in a car”), 4 to 5 low incidence items (such as “flew in an airplane,” “read an entire book,” “played poker”) and one incredibly low incidence item (such as “visited Argentina”).  Respondents are flagged if they didn’t do at least one of our high incidence items, if they said they did more than two of our low incidence items, or if they say they did our incredibly low incidence item.

Open-ended check

We try to include this one in all studies, but sometimes have to skip it if the study is fielding on a tight timeframe because it involves a manual process. Here, we are seeing if a respondent provides a meaningful response to an open-ended question. Hopefully, we can use a question that is already in the study for this, but when we cannot we tend to use one like this: “Now I’d like to hear your opinions about some other things. Tell me about a social issue or cause that you really care about.  What is this cause and why do you care about it?” We are manually looking to see if they provide an articulate answer and they are flagged if they do not.

Admission of inattentiveness

We don’t use this one as a standard, but are starting to experiment with it. As the last question of the survey, we ask respondents directly how attentive they were while taking it, and flag those who say they did not pay attention at all. This question suffers from a large social desirability bias, so we interpret the results cautiously.

Traps and misdirects

I don’t really like the idea of “trick questions” – there is research that indicates that these types of questions tend to trap too many “good” respondents. Some researchers feel that these questions lower respondent trust and thus answer quality. That seems to be enough to recommend against this style of question. The most common types I have seen ask a respondent to select the “third choice” below no matter what, or to “pick the color from the list below,” or “select none of the above.” We counsel against using these.

Comprehension

This was recommended by a research colleague and was also mentioned by an expert in a questionnaire design seminar we attended. We don’t use it as a quality check per se, but like to include it during a soft-launch period. The question looks like this: “Thanks again for taking this survey. Were there any questions on this survey you had difficulty with or trouble answering? If so, it will be helpful to us if you let us know what those problems were in the space below.”

Preamble

I have mixed feelings on this type of quality check, but we use it when we can phrase it positively. A typical wording is like this: “By clicking yes, you agree to continue to our survey and give your best effort to answer 10-15 minutes of questions. If you speed through the survey or otherwise don’t give a good effort, you will not receive credit for taking the survey.”

This is usually one of the first questions in the survey. The argument I see against this is it sets the respondent up to think we’ll be watching them and that could potentially affect their answers. Then again, it might affect them in a good way if it makes them attend more.

I prefer a question that takes a gentler, more positive approach – telling respondents we are conducting this for an important organization, that their opinions will really matter, promise them confidentiality, and then ask them to agree to give their best effort, as opposed to lightly threatening them as this one does.

Guarding against bad respondents has become an important part of questionnaire design, and it is unfortunate that there is no industry standard on how to go about it. We try to build in some quality checks that will at least spot the most egregious cases of poor quality. This is an evolving issue, and it is likely that what we are doing today will change over time, as the nature of market research changes.

Oops, the polls did it again

Many people had trouble sleeping last night wondering if their candidate was going to be President. I couldn’t sleep because as the night wore on it was becoming clear that this wasn’t going to be a good night for the polls.

Four years ago on the day after the election I wrote about the “epic fail” of the 2016 polls. I couldn’t sleep last night because I realized I was going to have to write another post about another polling failure. While the final vote totals may not be in for some time, it is clear that the 2020 polls are going to be off on the national vote even more than the 2016 polls were.

Yesterday, on election day I received an email from a fellow market researcher and business owner. We are involved in a project together and he was lamenting how poor the data quality has been in his studies recently and was wondering if we were having the same problems.

In 2014 we wrote a blog post cautioning our clients that about 10% of the interviews we collected were poor quality and needed to be discarded. We were having to throw away about 1 in 10 of the interviews we collected.

Six years later that percentage has moved to between 33% and 45%, and we tend to be conservative in the interviews we toss. It is fair to say that for most market research studies today, between a third and a half of the interviews being collected are, for lack of a better term, junk.

It has gotten so bad that new firms have sprung up that serve as a go-between from sample providers and online questionnaires in order to protect against junk interviews. They protect against bots, survey farms, duplicate interviews, etc. Just the fact that these firms and terms like “survey farms” exist should give researchers pause regarding data quality.

When I started in market research in the late ’80s/early ’90s we had a spreadsheet program that was used to help us cost out projects. One parameter in this spreadsheet was “refusal rate” – the percent of respondents who would outright refuse to take part in a study. While the refusal rate varied by study, the beginning assumption in this program was 40%, meaning that on average we expected respondents to cooperate 60% of the time.

According to Pew and AAPOR, by 2018 the cooperation rate for telephone surveys was 6% and falling rapidly.

Cooperation rates in online surveys are much harder to calculate in a standardized way, but most estimates I have seen and my own experience suggest that typical cooperation rates are about 5%. That means for a 1,000-respondent study, at least 20,000 emails are sent, which is about four times the population of the town I live in.

This is all background to try to explain why the 2020 polls appear to be headed to a historic failure. Election polls are the public face of the market research industry. Relative to most research projects, they are very simple. The problems pollsters have faced in the last few cycles are emblematic of something those working in research know but rarely like to discuss: the quality of data collected for research and polls has been declining, and this should be alarming to researchers.

I could go on about the causes of this. We’ve tortured our respondents for a long time. Despite claims to the contrary, we haven’t been able to generate anything close to a probability sample in years. Our methodologists have gotten cocky and feel like they can weight any sampling anomalies away. Clients are forcing us to conduct projects on timelines that make it impossible to guard against poor quality data. We focus on sampling error and ignore more consequential errors. The panels we use have become inbred and gather the same respondents across sources. Suppliers are happy to cash the check and move on to the next project.

This is the research conundrum of our times: in a world where we collect more data on people’s behavior and attitudes than ever before, the quality of the insights we glean from these data is in decline.

Post-2016, the polling industry brain trust rationalized and claimed that the polls actually did a good job, convened some conferences to discuss the polls, and made modest methodological changes. Almost all of these changes related to sampling and weighting. But, as it appears that the 2020 polling miss is going to be way beyond what can be explained by sampling (last night I remarked to my wife that “I bet the p-value of this being due to sampling is about 1 in 1,000”), I feel that pollsters have addressed the wrong problem.

None of the changes pollsters made addressed the long-term problems researchers face with data quality. When you have a response rate of 5% and up to half of those are interviews you need to throw away, errors that can arise are orders of magnitude greater than the errors that are generated by sampling and weighting mistakes.

I don’t want to sound like I have the answers. Just a few days ago I posted that I thought that, on balance, there were more reasons to conclude the polls would do a good job this time than to conclude they would fail. When I look through my list of potential reasons the polls might fail, nothing leaps out at me as an obvious cause, so perhaps the problem is multi-faceted.

What I do know is the market research industry has not done enough to address data quality issues. And every four years the polls seem to bring that into full view.

Will the polls be right this time?

The 2016 election was damaging to the market research industry. The popular perception has been that in 2016 the pollsters missed the mark and miscalled the winner. In reality, the 2016 polls were largely predictive of the national popular vote. But 2016 was widely seen by non-researchers as disastrous. Pollsters and market researchers have a lot riding on the perceived accuracy of the 2020 polls.

The 2016 polls did a good job of predicting the national vote total, but in a large majority of cases final national polls were off in the direction of overpredicting the vote for Clinton and underpredicting the vote for Trump. That is pretty much a textbook definition of bias. Before the books are closed on the 2016 pollsters’ performance, it is important to note that the 2012 polls were off even further, mostly in the direction of overpredicting the vote for Romney and underpredicting the vote for Obama. The “bias,” although small, has swung back and forth between parties.

Election Day 2020 is in a few days and we may not know the final results for a while. It won’t be possible to truly know how the polls did for some weeks or months.

That said, there are reasons to believe that the 2020 polls will do an excellent job of predicting voter behavior and there are reasons to believe they may miss the mark.  

There are specific reasons why it is reasonable to expect that the 2020 polls will be accurate. So, what is different in 2020? 

  • There have been fewer undecided voters at all stages of the process. Most voters have had their minds made up well in advance of election Tuesday. This makes things simpler from a pollster’s perspective. A polarized and engaged electorate is one whose behavior is predictable. Figuring out how to partition undecided voters moves polling more in a direction of “art” than “science.”
  • Perhaps because of this, polls have been remarkably stable for months. In 2016, there was movement in the polls throughout and particularly over the last two weeks of the campaign. This time, the polls look about like they did weeks and even months ago.
  • Turnout will be very high. The art in polling is in predicting who will turn out and a high turnout election is much easier to forecast than a low turnout election.
  • There has been considerable early voting. There is always less error in asking about what someone has recently done than what they intend to do in the future. Later polls could ask many respondents how they voted instead of how they intended to vote.
  • There have been more polls this time. As our sample size of polls increases so does the accuracy. Of course, there are also more bad polls out there this cycle as well.
  • There have been more and better polls in the swing states this time. The true problem pollsters had in 2016 was with state-level polls. There was less attention paid to them, and because the national pollsters and media didn’t invest much in them, the state-level polling is where it all went wrong. This time, there has been more investment in swing-state polling.
  • The media invested more in polls this time. A hidden secret in polling is that election polls rarely make money for the pollster. This keeps many excellent research organizations from getting involved in them or dedicating resources to them. The ones that do tend to do so solely for reputational reasons. An increased investment this time has helped to get more researchers involved in election polling.
  • Response rates are up slightly. 2020 is the first year in which we have seen the long-term decline in survey response rates stabilize and even tick up a little. This is likely a minor factor in the success of the 2020 polls, but it is in the right direction.
  • The race isn’t as close as it was in 2016. This one might only be appreciated by statisticians. Since variability is maximized in a 50/50 distribution, the further a race is from even, the more accurate a poll of it will be. This is another small factor in the direction of the polls being accurate in 2020.
  • There has not been late breaking news that could influence voter behavior. In 2016, the FBI director’s decision to announce a probe into Clinton’s emails came late in the campaign. There haven’t been any similar bombshells this time.
  • Pollsters started setting quotas and weighting on education. In the past, pollsters would balance samples on characteristics known to correlate highly with voting behavior – characteristics like age, gender, political party affiliation, race/ethnicity, and past voting behavior. In 2016, pollsters learned the hard way that educational attainment had become an additional characteristic to consider when crafting samples because voter preferences vary by education level. The good polls fixed that this go round.
  • In a similar vein, there has been tighter scrutiny of polling methodology. While the media can still be cavalier about digging into methodology, this time they were more likely to insist that pollsters outline their methods. This is the first time I can remember seeing news stories where pollsters were asked questions about methodology.
  • The notion that there are Trump supporters who intentionally lie to pollsters has largely been disproven by studies from very credible sources, such as Yale and Pew. Much more relevant is the pollster’s ability to predict turnout from both sides.
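The education fix in the list above amounts to simple post-stratification: each respondent is weighted by their group’s share of the target electorate divided by its share of the sample. A minimal sketch with made-up shares (none of these numbers come from an actual poll):

```python
# Hypothetical shares for illustration only -- not from any actual poll
population_share = {"college": 0.40, "no_college": 0.60}  # target electorate
sample_share     = {"college": 0.55, "no_college": 0.45}  # who actually responded

# Post-stratification weight: population share / sample share.
# College respondents are overrepresented, so they get downweighted (~0.73);
# non-college respondents get upweighted (~1.33).
weights = {g: population_share[g] / sample_share[g] for g in population_share}

# Effect on a topline estimate (made-up candidate support by group)
support = {"college": 0.55, "no_college": 0.45}
unweighted = sum(sample_share[g] * support[g] for g in support)
weighted = sum(sample_share[g] * weights[g] * support[g] for g in support)
print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
```

With these fabricated shares, the unweighted estimate overstates the college-preferred candidate by about a point and a half, the same direction of error the education imbalance produced in 2016.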

There are a few things going on that give the polls some potential to lay an egg.

  • The election will be decided by a small number of swing states. Swing state polls are not as accurate and are often funded by local media and universities that don’t have the funding or the expertise to do them correctly. The polls are close and less stable in these states. There is some indication that swing state polls have been tightening, and Biden’s lead in many of them isn’t much different than Clinton’s lead was in 2016.
  • Biden may be making the same mistake Clinton made. This is a political and not a research-related reason, but in 2016 Clinton failed to aggressively campaign in the key states late in the campaign while Trump went all in. History could be repeating itself. Field work for final polls is largely over now, so the polls will not reflect things that happen the last few days.
  • If there is a wild-card that will affect polling accuracy in 2020, it is likely to center around how people are voting. Pollsters have been predicting election day voting for decades. In this cycle votes have been coming in for weeks and the methods and rules around early voting vary widely by state. Pollsters just don’t have past experience with early voting.
  • There is really no way for pollsters to account for potential disqualifications for mail-in votes (improper signatures, late receipts, legal challenges, etc.) that may skew to one candidate or another.
  • Similarly, any systematic voter suppression would likely cause the polls to underpredict Trump. These voters are available to poll, but may not be able to cast a valid vote.
  • There has been little mention of third-party candidates in polling results. The Libertarian candidate is on the ballot in all 50 states. The Green Party candidate is on the ballot in 31 states. Other parties have candidates on the ballot in some states but not others. These candidates aren’t expected to garner a lot of votes, but in a close election even a few percentage points could matter to the results. I have seen national polls from reputable organizations where they weren’t included.
  • While there is little credible data supporting that there are “shy” Trump voters that are intentionally lying to pollsters, there still might be a social desirability bias that would undercount Trump’s support. That social desirability bias could be larger than it was in 2016, and it is still likely in the direction of under predicting Trump’s vote count.
  • Polls (and research surveys) tend to underrepresent rural areas. Folks in rural areas are less likely to be in online panels and less likely to cooperate on surveys. Few pollsters take this into account. (I have never seen a corporate research client correct for this, and it has been a pet peeve of mine for years.) This is a sample coverage issue that will likely undercount the Trump vote.
  • Sampling has continued to get harder. Cell phone penetration has continued to grow, online panel quality has fallen, and our best option (ABS sampling) is still far from random and so expensive it is beyond the reach of most polls.
  • “Herding” is a rarely discussed, but very real polling problem. Herding happens when a pollster conducts a poll that doesn’t conform to what other polls are finding. These polls tend to get scrutinized and reweighted until they fit expectations, or even worse, buried and never released. Think about it – if you are a respected polling organization and a recent poll of yours showed Trump winning the popular vote, you would review it intensely before releasing it, and you might choose not to release it at all, because a poll that looks different from the others puts your firm’s reputation at risk. The only polls I have seen that appear to be out of range are from smaller organizations that are either willing to run the risk of predicting against the tide or that have a clear political bias.

Once the dust settles, we will compose a post that analyzes how the 2020 polls did. For now, we feel there are more credible reasons to believe the polls will be seen as predictive than to feel that we are on the edge of a polling mistake. From a researcher’s standpoint, the biggest worry is that the polls will indeed be accurate, but won’t match the vote totals because of technicalities in vote counting and legal challenges. That would reflect unfairly on the polling and research industries.

Researchers should be mindful of “regression toward the mean”

There is a concept in statistics known as regression toward the mean that is important for researchers to consider as we look at how the COVID-19 pandemic might change future consumer behavior. This concept is as challenging to understand as it is interesting.

Regression toward the mean implies that an extreme example in a data set tends to be followed by an example that is less extreme and closer to the “average” value of the population. A common example: if two parents who are above average in height have a child, that child is demonstrably more likely to be closer to average height than to the “extreme” height of the parents.

This is an important concept to keep in mind in the design of experiments and when analyzing market research data. I did a study once where we interviewed the “best” customers of a quick service restaurant, defined as those that had visited the restaurant 10 or more times in the past month. We gave each of them a coupon and interviewed them a month later to determine the effect of the coupon. We found that they actually went to the restaurant less often the month after receiving the coupon than the month before.

It would have been easy to conclude that the coupon caused customers to visit less frequently and that there was something wrong with it (which is what we initially thought). What really happened was a regression toward the mean. Surveying customers who had visited a large number of times in one month made it likely that these same customers would visit a more “average” amount in a following month whether they had a coupon or not. This was a poor research design because we couldn’t really assess the impact of the coupon which was our goal.

Personally, I’ve always had a hard time understanding and explaining regression toward the mean because the concept seems to be counter to another concept known as “independent trials”. You have a 50% chance of flipping a fair coin and having it come up heads regardless of what has happened in previous flips. You can’t guess whether the roulette wheel will come up red or black based on what has happened in previous spins. So, why would we expect a restaurant’s best customers to visit less in the future?

This happens when we begin with a skewed population. The most frequent customers are not “average” and have room to regress toward the mean in the future. Had we surveyed customers across the full range of patronage, the sample would not have been skewed toward an extreme, and we could have done a better job of isolating the effect of the coupon.
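The coupon story can be reproduced with a small simulation, using entirely made-up numbers, in which each customer has a stable underlying visit rate plus independent month-to-month noise:

```python
import random

random.seed(42)

N = 10_000
# Each customer's long-run monthly visit rate, plus independent monthly noise
true_rate = [random.gauss(5, 2) for _ in range(N)]
month1 = [r + random.gauss(0, 3) for r in true_rate]
month2 = [r + random.gauss(0, 3) for r in true_rate]

avg = lambda xs: sum(xs) / len(xs)

# "Best customers": those observed visiting 10+ times in month 1
best = [i for i in range(N) if month1[i] >= 10]
m1 = avg([month1[i] for i in best])  # extreme by construction
m2 = avg([month2[i] for i in best])  # regresses toward the population mean

print(f"best customers, month 1: {m1:.1f}; month 2: {m2:.1f}; everyone: {avg(month2):.1f}")
```

There is no coupon anywhere in this model, yet the “best” customers visit less the following month, exactly the pattern the flawed study design misread as a coupon effect.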

Here is another example of regression toward the mean. Suppose the Buffalo Bills quarterback, Josh Allen, has a monster game when they play the New England Patriots. Allen, who has been averaging about 220 yards passing per game in his career goes off and burns the Patriots for 450 yards. After we are done celebrating and breaking tables in western NY, what would be our best prediction for the yards Allen will throw for the second time the Bills play the Patriots?

Well, you could say the best prediction is 450 yards as that is what he did the first time. But, regression toward the mean would imply that he’s more likely to throw close to his historic average of 220 yards the second time around. So, when he throws for 220 yards the second game it is important to not give undue credit to Bill Belichick for figuring out how to stop Allen.

Here is another sports example. I have played (poorly) in a fantasy baseball league for almost 30 years. In 2004, Derek Jeter entered the season as a career .317 hitter. After the first 100 games or so he was hitting under .200. The person in my league that owned him was frustrated so I traded for him. Jeter went on to hit well over .300 the rest of the season. This was predictable because there wasn’t any underlying reason (like injury) for his slump. His underlying average was much better than his current performance and because of the concept of regression toward the mean it was likely he would have a great second half of the season, which he did.

There are interesting HR examples of regression toward the mean. Say you have an employee that does a stellar job on an assignment – over and above what she normally does. You praise her and give her a bonus. Then, you notice that on the next assignment she doesn’t perform on the same level. It would be easy to conclude that the praise and bonus caused the poor performance when in reality her performance was just regressing back toward the mean. I know sales managers who have had this exact problem – they reward their highest performers with elaborate bonuses and trips and then notice that the following year they don’t perform as well. They then conclude that their incentives aren’t working.

The concept is hard at work in other settings. Mutual funds that outperform the market tend to fall back in line the next year. You tend to feel better the day after you go to the doctor. Companies profiled in “Good to Great” tend to have hard times later on.

Regression toward the mean is important to consider when designing sampling plans. If you are sampling an extreme portion of a population it can be a relevant consideration. Sample size is also important. When you have just a few cases of something, mathematically an extreme response can skew your mean.

The issue to be wary of is that when we fail to consider regression toward the mean, we tend to overstate the importance of correlation between two things. We think our mutual fund manager is a genius when he just got lucky, that our coupon isn’t working, or that Josh Allen is becoming the next Drew Brees. All of these could be true, but be careful in how you interpret data that result from extreme or small sample sizes.

How does this relate to COVID? Well, at the moment, I’d say we are still in an “inflated expectations” portion of a hype curve when we think of what permanent changes may take place resulting from the pandemic. There are a lot of examples. We hear that commercial real estate is dead because businesses will keep employees working from home. Higher education will move entirely online. In-person qualitative market research will never happen again. Business travel is gone forever. We will never again work in an office setting. Shaking hands is a thing of the past.

I’m not saying there won’t be a new normal that results from COVID, but if we believe in regression toward the mean and the hype curve, we’d predict that the future will look more like the past than like the extreme version of the present currently being portrayed. The “mean” being regressed to has likely changed, but not as much as the current, extreme situation implies.

“Margin of error” sort of explained (+/-5%)

It is now September of an election year. Get ready for a two-month deluge of polls and commentary on them. One thing you can count on is reporters and pundits misinterpreting the meaning behind “margin of error.” This post is meant to simplify the concept.

Margin of error refers to sampling error and is present on every poll or market research survey. It can be mathematically calculated. All polls seek to figure out what everybody thinks by asking a small sample of people. There is always some degree of error in this.

The formula for margin of error is fairly simple and depends mostly on two things: how many people are surveyed and their variability of response. The more people you interview, the lower (better) the margin of error. The more the people you interview give the same response (lower variability), the better the margin of error. If a poll interviews a lot of people and they all seem to be saying the same thing, the margin of error of the poll is low. If the poll interviews a small number of people and they disagree a lot, the margin of error is high.
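Those two drivers are easy to see in the standard formula for a proportion, MoE = z · sqrt(p(1−p)/n). A quick sketch, assuming a simple random sample and 95% confidence:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Sampling margin of error for a proportion p from n respondents,
    at ~95% confidence (z = 1.96). Assumes a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

# More respondents -> lower (better) margin of error
print(margin_of_error(0.5, 400))   # ~0.049, i.e., about +/- 5 points
print(margin_of_error(0.5, 1600))  # ~0.0245

# Less variability (a lopsided 90/10 split) -> lower margin of error
print(margin_of_error(0.9, 400))   # ~0.029
```

Note that quadrupling the sample only halves the margin of error, which is one reason large gains in polling precision are expensive.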

Most reporters understand that a poll with a lot of respondents is better than one with fewer respondents. But most don’t understand the variability component.

There is another assumption used in the calculation for sampling error as well: the confidence level desired. Almost every pollster will use a 95% confidence level, so for this explanation we don’t have to worry too much about that.

What does it mean for a result to be outside the margin of error on a poll? It simply means that the two percentages being compared can be deemed different from one another with 95% confidence. Put another way, if the poll were repeated a zillion times, we’d expect that at least 19 out of 20 times the two numbers would be different.

If Biden is leading Trump in a poll by 8 points and the margin of error is 5 points, we can be confident he is really ahead because this lead is outside the margin of error. Not perfectly confident, but more than 95% confident.

Here is where reporters and pundits mess it up.  Say they are reporting on a poll with a 5-point margin of error and Biden is leading Trump by 4 points. Because this lead is within the margin of error, they will often call it a “statistical dead heat” or say something that implies that the race is tied.

Neither is true. The only way for a poll to have a statistical dead heat is for the exact same number of people to choose each candidate. In this example the race isn’t tied at all, we just have a less than 95% confidence that Biden is leading. In this example, we might be 90% sure that Biden is leading Trump. So, why would anyone call that a statistical dead heat? It would be way better to be reporting the level of confidence that we have that Biden is winning, or the p-value of the result. I have never seen a reporter do that, but some of the election prediction websites do.
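Computing that confidence is straightforward if, as a simplification, we treat the reported 95% margin of error as applying directly to the lead itself (the error on a gap between two candidates is actually somewhat larger, so this sketch overstates confidence a bit):

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def confidence_of_lead(lead_pts: float, moe_pts: float) -> float:
    """Rough confidence that a reported lead is real, treating the poll's
    95% margin of error as if it applied directly to the lead
    (a simplification that overstates confidence somewhat)."""
    se = moe_pts / 1.96  # back out the standard error from the 95% MoE
    return normal_cdf(lead_pts / se)

# A 4-point lead with a 5-point margin of error
print(confidence_of_lead(4, 5))  # ~0.94 under this simplification -- hardly a "dead heat"
```

Under the same simplification, an 8-point lead with a 5-point margin of error clears 99% confidence, which is why a lead outside the margin of error is treated as solid.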

Pollsters themselves will misinterpret the concept. They will deem their poll “accurate” as long as the election result is within the margin of error. In close elections this isn’t helpful, as what really matters is making a correct prediction of what will happen.

Most of the 2016 final polls were accurate if you define being accurate as coming within the margin of error. But, since almost all of them predicted the wrong winner, I don’t think we will see future textbooks holding 2016 out there as a zenith of polling accuracy.

Another mistake reporters (and researchers) make is not recognizing that the margin of error refers only to sampling error, which is just one of many errors that can occur in a poll. The poor performance of the 2016 presidential polls really had nothing to do with sampling error at all.

I’ve always questioned why there is so much emphasis on sampling error for a couple of reasons. First, the calculation of sampling error assumes you are working with a random sample which in today’s polling world is almost never the case. Second, there are many other types of errors in survey research that are likely more relevant to a poll’s accuracy than sampling error. The focus on sampling error is driven largely because it is the easiest error to mathematically calculate. Margin of error is useful to consider, but needs to be put in context of all the other types of errors that can happen in a poll.

The myth of the random sample

Sampling is at the heart of market research. We ask a few people questions and then assume everyone else would have answered the same way.

Sampling works in all types of contexts. Your doctor doesn’t need to test all of your blood to determine your cholesterol level – a few ounces will do. Chefs taste a spoonful of their creations and then assume the rest of the pot will taste the same. And, we can predict an election by interviewing a fairly small number of people.

The mathematical procedures that are applied to samples that enable us to project to a broader population all assume that we have a random sample. Or, as I tell research analysts: everything they taught you in statistics assumes you have a random sample. T-tests, hypotheses tests, regressions, etc. all have a random sample as a requirement.

Here is the problem: We almost never have a random sample in market research studies. I say “almost” because I suppose it is possible to do, but over 30 years and 3,500 projects I don’t think I have been involved in even one project that can honestly claim a random sample. A random sample is sort of a Holy Grail of market research.

A random sample might be possible if you have a captive audience. You can randomly sample the passengers on a flight or the students in a classroom or the prisoners in a detention facility. As long as you are not trying to project beyond that flight or that classroom or that jail, the math behind random sampling will apply.

Here is the bigger problem: Most researchers don’t recognize this, disclose this, or think through how to deal with it. Even worse, many purport that their samples are indeed random, when they are not.

For a bit of research history: once the market research industry really got going, the telephone random digit dial (RDD) sample became standard. Telephone researchers could randomly call landline phones. When landline telephone penetration and response rates were both high, this provided excellent data. However, RDD still wasn’t providing a true random, or probability, sample. Some households had more than one phone line (and few researchers corrected for this), many people lived in group situations (colleges, medical facilities) where they couldn’t be reached, some did not have a landline, and even at its peak, telephone response rates were only about 70%. Not bad. But, also, not random.

Once the Internet came of age, researchers were presented with new sampling opportunities and challenges. Telephone response rates plummeted (to 5-10%) making telephone research prohibitively expensive and of poor quality. Online, there was no national directory of email addresses or cell phone numbers and there were legal prohibitions against spamming, so researchers had to find new ways to contact people for surveys.

Initially, and this is still a dominant method today, research firms created opt-in panels of respondents. Potential research participants were asked to join a panel, filled out an extensive demographic survey, and were paid small incentives to take part in projects. These panels suffer from three response issues: 1) not everyone is online or online at the same frequency, 2) not everyone who is online wants to be in a panel, and 3) not everyone in the panel will take part in a study. The result is a convenience sample. Good researchers figured out sophisticated ways to handle the sampling challenges that result from panel-based samples, and they work well for most studies. But, in no way are they a random sample.

River sampling is a term often used to describe respondents who are “intercepted” on the Internet and asked to fill out a survey. Potential respondents are invited via online ads and offers placed on a range of websites. If interested, they are typically pre-screened and sent along to the online questionnaire.

Because so much is known about what people are doing online these days, sampling firms have some excellent science behind how they obtain respondents efficiently with river sampling. It can work well, but response rates are low and the nature of the online world is changing fast, so it is hard to get a consistent river sample over time. Nobody being honest would ever use the term “random sampling” when describing river samples.

Panel-based samples and river samples represent how the lion’s share of primary market research is being conducted today. They are fast and inexpensive and when conducted intelligently can approximate the findings of a random sample. They are far from perfect, but I like that the companies providing them don’t promote them as being random samples. They involve some biases and we deal with these biases as best we can methodologically. But, too often we forget that they violate a key assumption that the statistical tests we run require: that the sample is random. For most studies, they are truly “close enough,” but the problem is we usually fail to state the obvious – that we are using statistical tests that are technically not appropriate for the data sets we have gathered.

Which brings us to a newer, shiny object in the research sampling world: ABS samples. ABS (address-based samples) are purer from a methodological standpoint. While ABS samples have been around for quite some time, they are just now being used extensively in market research.

ABS samples are based on US Postal Service lists. Because USPS has a list of all US households, this list is an excellent sampling frame. (The Census Bureau also has an excellent list, but it is not available for researchers to use.) The USPS list is the starting point for ABS samples.

Research firms will take the USPS list and recruit respondents from it, either to be in a panel or to take part in an individual study. This recruitment can be done by mail, phone, or even online. They often append publicly-known information onto the list.

As you might expect, an ABS approach suffers from some of the same issues as other approaches. Cooperation rates are low and incentives (sometimes large) are necessary. Most surveys are conducted online, and not everyone in the USPS list is online or has the same level of online access. There are some groups (undocumented immigrants, homeless) that may not be in the USPS list at all. Some (RVers, college students, frequent travelers) are hard to reach. There is evidence that ABS approaches do not cover rural areas as well as urban areas. Some households use post office boxes and not residential addresses for their mail. Some use more than one address. So, although ABS lists cover about 97% of US households, the 3% that they do not cover are not randomly distributed.

The good news is, if done correctly, the biases that result from an ABS sample are more “correctable” than those from other types of samples because they are measurable.

A recent Pew study indicates that survey bias and the number of bogus respondents is a bit smaller for ABS samples than opt-in panel samples.

But ABS samples are not random samples either. I have seen articles that suggest that of all those approached to take part in a study based on an ABS sample, less than 10% end up in the survey data set.

The problem is not necessarily with ABS samples, as most researchers would concur that they are the best option we have and come the closest to a random sample. The problem is that many firms providing ABS samples are selling them as “random samples,” and that is disingenuous at best. Just because the sampling frame used to recruit a survey panel can claim to be “random” does not imply that the respondents you end up with in a research database constitute a random sample.

Does this matter? In many ways, it likely does not. There are biases and errors in all market research surveys. These biases and errors vary not just by how the study was sampled, but also by the topic of the question, its tone, the length of the survey, etc. Many times, survey errors are not the same throughout an individual survey. Biases in surveys tend to be “known unknowns” – we know they are there, but aren’t sure what they are.

There are many potential sources of error in survey research. I am always reminded of a quote from Humphrey Taylor, the past Chairman of the Harris Poll, who said: “On almost every occasion when we release a new survey, someone in the media will ask, ‘What is the margin of error for this survey?’ There is only one honest and accurate answer to this question — which I sometimes use to the great confusion of my audience — and that is, ‘The possible margin of error is infinite.’” A few years ago, I wrote a post on biases and errors in research, and I was able to quickly name 15 of them before I even had to do an Internet search to learn more about them.

The reality is, the improvement in bias that is achieved by an ABS sample over a panel-based sample is small and likely inconsequential when considered next to the other sources of error that can creep into a research project. Because of this, and the fact that ABS sampling is really expensive, we tend to recommend ABS panels in only two cases: 1) if the study will result in academic publication, as academics are more accepting of data that come from an ABS approach, and 2) if we are working in a small geography, where panel-based samples are not feasible.

Again, ABS samples are likely the best samples we have at this moment. But firms that provide them often portray them, inappropriately, as yielding random samples. For most projects, the small improvement in bias they provide is not worth the considerably larger budget and longer study time frame, which is why ABS samples are currently used in a small proportion of research studies. I consider ABS to be “state of the art,” with the emphasis on “art,” as sampling is often less of a science than people think.

