Advanced Statistical Methods Part II: Introduction to Missing Data
BrandAsset Valuator (BAV) has measured brands for over two decades and provides an empirical developmental framework for understanding changes in brand equity. What is more, BAV monitors how data points behave and shows how brands create momentum by analyzing changes in consumer perceptions over time. As a result, we are able to explain precisely how brands grow and decay by understanding the subtle nuances in data behavior. Beyond recognizing how observed data points behave, we must also keep in mind the data that are not there (that is, missing data) and account for the ambiguity they introduce.
It is very common to have missing data or incomplete responses from survey respondents. There are many reasons why missing data may occur. Respondents, for example, may refuse to answer “sensitive” questions (e.g., income), may lose motivation while responding to survey items, or may drop out of a longitudinal study (i.e., attrition). If missing data is not addressed properly, it will yield biased parameter estimates as well as reduced statistical and predictive power. Parameter estimates (sample statistics) are values computed from a sample that are meant to reflect the entire population. Failure to capture accurate parameter estimates due to missing data can compromise the validity of an actionable insight derived from data.
Although BAV has made substantial progress in addressing it, missing data remains one of the most pervasive problems in data analysis for many leading brand consulting firms and even academics. This holds especially true in quantitative brand tracker studies. The seriousness of the problem depends on the pattern of missing data, how much is missing, and why it is missing. Of these, the pattern of missing data is more important than the amount missing.
Missing Data Mechanisms
Missing data are characterized as Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR). MCAR describes the condition where data are missing for purely random (unpredictable) reasons and is the best scenario if data must be missing. A second mechanism for missing data is MAR. In MAR, data are not missing completely at random, but the pattern of missing data is predictable from other observed variables. The third missing data mechanism is MNAR. In MNAR, the missingness is related to the unobserved values of the variable itself and therefore cannot be ignored (the worst-case scenario). Although it is tempting to assume that data are MCAR, the safest thing to do is to test that assumption via IBM SPSS MVA (Missing Values Analysis). The decision about how to handle missing data is vital, and below I review a few common methods.
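To make the three mechanisms concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than BAV data: the variables `usage` and `income`, the sample size, and the missingness probabilities are all hypothetical. It compares the mean of the observed income values with the mean of the full sample under each mechanism.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical survey: 'usage' is always observed, 'income' may be missing.
n = 5000
usage = [random.gauss(0, 1) for _ in range(n)]
income = [50 + 10 * u + random.gauss(0, 5) for u in usage]

def observed_mean(values, is_missing):
    """Mean of the values that remain after dropping the missing ones."""
    return statistics.mean(v for v, m in zip(values, is_missing) if not m)

# MCAR: every respondent has the same 30% chance of skipping the question.
mcar = [random.random() < 0.3 for _ in range(n)]
# MAR: missingness depends only on an observed variable (usage).
mar = [random.random() < (0.5 if u > 0 else 0.1) for u in usage]
# MNAR: missingness depends on the unobserved income value itself.
mnar = [random.random() < (0.5 if y > 50 else 0.1) for y in income]

true_mean = statistics.mean(income)
bias = {}
for label, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    bias[label] = observed_mean(income, mask) - true_mean
    print(label, "bias of observed mean:", round(bias[label], 2))
```

In this setup the observed mean should stay close to the full-sample mean only under MCAR. Under MAR the bias can in principle be corrected because it is driven by `usage`, which is observed; under MNAR no observed variable fully explains it.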
Strategies for Missing Data
Imputing data involves replacing missing data with reasonable substitute values. Imputation methods are an attractive option because once the data are imputed, subsequent statistical analyses can utilize a “complete” set of data. A relatively straightforward but unsatisfactory method of imputing data is to replace missing values with the mean of the available scores for a given variable. This method is referred to as Mean Substitution, and it can produce biased parameter estimates. The main issue with this procedure is that it assumes that all cases with missing data for a given variable (e.g., Net Promoter Score) score exactly at the mean of that variable, which artificially shrinks the variable's variability. As a result, I strongly recommend that researchers do not use Mean Substitution as a missing data treatment.
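The loss of variability under Mean Substitution is easy to see in a small sketch. The scores below are simulated and purely hypothetical; the point is only to compare the standard deviation of the observed scores with that of the mean-substituted variable.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical scores; None marks a missing response (about 30% of cases).
scores = [random.gauss(7, 2) for _ in range(1000)]
with_gaps = [s if random.random() > 0.3 else None for s in scores]

# Mean Substitution: fill every gap with the mean of the observed scores.
observed = [s for s in with_gaps if s is not None]
fill = statistics.mean(observed)
mean_substituted = [fill if s is None else s for s in with_gaps]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(mean_substituted)
print("SD of observed scores:      ", round(sd_observed, 2))
print("SD after mean substitution: ", round(sd_imputed, 2))
```

The mean is unchanged by construction, but every imputed case sits exactly at the center of the distribution, so the standard deviation (and any correlation with other variables) is understated.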
Another imputation method involves using multiple regression to replace missing values, a method known as Regression Imputation. In this method, a given variable with missing data (e.g., Net Promoter Score) serves as the dependent variable and is regressed on the other variables in the data set (e.g., relevance, consideration, usage, etc.). Although Regression Imputation is an improvement over Mean Substitution, I do not recommend this method because it can yield inaccurate estimates for the variable of interest (e.g., Net Promoter Score): the predicted scores used as replacement values all fall exactly on the regression line and therefore lack the variability of real responses.
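The sketch below illustrates that lack of variability with a single hypothetical predictor (`relevance`) and simulated data; it is not BAV's procedure. It fits a least-squares line on the complete cases, imputes the predicted scores, and, for comparison, also shows a stochastic variant that adds a random residual to each prediction.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: 'relevance' is fully observed, 'nps' is missing for ~40%.
n = 2000
relevance = [random.gauss(0, 1) for _ in range(n)]
nps = [5 + 2 * x + random.gauss(0, 2) for x in relevance]
missing = [random.random() < 0.4 for _ in range(n)]

# Fit a simple least-squares line on the complete cases only.
xs = [x for x, m in zip(relevance, missing) if not m]
ys = [y for y, m in zip(nps, missing) if not m]
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
intercept = y_bar - slope * x_bar
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
residual_sd = statistics.stdev(residuals)  # approximate residual spread

# Deterministic Regression Imputation: imputed values sit exactly on the line.
deterministic = [intercept + slope * x if m else y
                 for x, y, m in zip(relevance, nps, missing)]
# Stochastic variant: add a random residual to each predicted value.
stochastic = [intercept + slope * x + random.gauss(0, residual_sd) if m else y
              for x, y, m in zip(relevance, nps, missing)]

sd_true = statistics.stdev(nps)
sd_det = statistics.stdev(deterministic)
sd_sto = statistics.stdev(stochastic)
print("SD of the full data:         ", round(sd_true, 2))
print("SD, deterministic imputation:", round(sd_det, 2))
print("SD, stochastic imputation:   ", round(sd_sto, 2))
```

The deterministic version understates the spread of the variable, while adding a residual draw restores it on average.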
In contrast, Multiple Imputation (MI) is the most widely recommended procedure for dealing with missing data. In MI, missing data are imputed using logistic or stochastic regression imputation, and this procedure is replicated several times, which creates multiple “complete” data sets (in SPSS, I recommend that you impute no fewer than twenty-five data sets). The analysis is then run on each data set and the results are pooled. The main advantage of MI is that it produces unbiased parameter estimates while the variation across the imputed data sets reflects the uncertainty about the missing values. Furthermore, MI can be applied to longitudinal tracker data while retaining appropriate sampling variability. At BAV, we have utilized MI in our global 4Cs values segmentation and Shoppers Segmentation with tremendous results.
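The impute, analyze, and pool cycle can be sketched as follows. This is a simplified illustration under stated assumptions, not the SPSS algorithm: the data are simulated, there is a single hypothetical predictor `x`, a bootstrap of the complete cases stands in for drawing the regression parameters, and the estimates are pooled with Rubin's rules.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

n, m = 1000, 25  # m = number of imputed data sets
x = [random.gauss(0, 1) for _ in range(n)]
y = [5 + 2 * xi + random.gauss(0, 2) for xi in x]
# MAR: the chance that y is missing depends only on the observed x.
missing = [random.random() < (0.6 if xi > 0 else 0.2) for xi in x]
complete = [(xi, yi) for xi, yi, mi in zip(x, y, missing) if not mi]

def fit_line(pairs):
    """Least-squares intercept, slope, and residual SD for (x, y) pairs."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in pairs) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in pairs]
    return intercept, slope, statistics.stdev(resid)

estimates, variances = [], []
for _ in range(m):
    # Resample the complete cases so each imputation uses slightly different
    # regression parameters (a stand-in for drawing them from a posterior).
    a, b, s = fit_line(random.choices(complete, k=len(complete)))
    # Stochastic regression imputation: prediction plus a random residual.
    imputed = [a + b * xi + random.gauss(0, s) if mi else yi
               for xi, yi, mi in zip(x, y, missing)]
    estimates.append(statistics.mean(imputed))
    variances.append(statistics.variance(imputed) / n)

# Rubin's rules: pool the m estimates and combine the two variance sources.
pooled_mean = statistics.mean(estimates)
within = statistics.mean(variances)
between = statistics.variance(estimates)
total_variance = within + (1 + 1 / m) * between

complete_case_mean = statistics.mean(yi for _, yi in complete)
print("Full-data mean:    ", round(statistics.mean(y), 2))
print("Complete-case mean:", round(complete_case_mean, 2))
print("MI pooled mean:    ", round(pooled_mean, 2),
      "SE:", round(total_variance ** 0.5, 3))
```

Because missingness here depends on `x`, simply dropping incomplete cases biases the mean, whereas the pooled MI estimate should land much closer to the full-data value. The between-imputation term is what keeps the pooled standard error from being overconfident.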
Missing data is a common issue that must be addressed and cannot be ignored in brand equity research. This post highlights selected issues that may benefit individuals who are looking for a viable solution for addressing missing data. These issues include the need to consider whether the missing data is missing completely at random, missing at random, or missing not at random. Consideration of these issues should improve the actionable insights as individuals build precise and scalable brand equity models.