People Analytics Deconstructed

Data Cleaning, Part 2

Season 1 Episode 22

In this episode, co-hosts Jennifer Miller and Ron Landis continue their discussion of the importance of data cleaning and management. They review three of the five aspects of data cleaning that are critical to check prior to the analytic phase: linearity and normality, outliers, and multicollinearity.

 In this episode, we had conversations around these questions:  

  • How do you check for linearity and normality in a data set?  
  • Why is normality important to check for in a data set?  
  • What are outliers, including both univariate and multivariate outliers?  
  • How do you identify outliers in your data?  
  • What are some ways to handle outliers?  
  • What is multicollinearity?  
  • Why is multicollinearity important to check and consider during the data analytic process?  

Key Takeaways:  

  • We should always consider the distribution of a variable with respect to our expectations. If the distribution is inconsistent with what we expect, we should devote time and energy toward understanding why. In cases where our ultimate analyses require assumptions of normality, we need to ensure that our data are consistent with that assumption. We may elect to transform our data on the basis of these checks, but we should always be able to explain why we have done so. 
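As a sketch of what checking normality and transforming a variable might look like in practice, here is a short Python example using the Shapiro-Wilk test and skewness from SciPy. The salary data here is simulated and purely illustrative; it is not from the episode, and the choice of a log transform is one common option for right-skewed data, not a universal fix.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical right-skewed variable (e.g., salaries are often log-normal)
salary = rng.lognormal(mean=11, sigma=0.5, size=500)

# Shapiro-Wilk test: a small p-value suggests the data depart from normality
stat, p = stats.shapiro(salary)
print(f"raw:     W={stat:.3f}, p={p:.4f}, skewness={stats.skew(salary):.2f}")

# A log transform often normalizes right-skewed data; always report that
# the transformation was applied and why
log_salary = np.log(salary)
stat2, p2 = stats.shapiro(log_salary)
print(f"logged:  W={stat2:.3f}, p={p2:.4f}, skewness={stats.skew(log_salary):.2f}")
```

Visual checks (histograms, Q-Q plots) should accompany any formal test, since with large samples even trivial departures from normality yield small p-values.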
  • Outliers are cases that are inconsistent with other cases. In the univariate case, these are scores that are either extremely high or low. In the multivariate situation, we inspect the "profile" of scores across measured variables to assess the degree to which a case is consistent with the others. Once cases are identified as outliers, the next step is deciding how to handle them. Our discussion focused on some common ways of dealing with outliers. 
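The univariate-versus-multivariate distinction can be illustrated with a small Python sketch: z-scores flag extreme scores on a single variable, while Mahalanobis distance flags unusual profiles across variables. The data, the planted outlier, and the |z| > 3 and chi-square cutoffs are all illustrative conventions, not prescriptions from the episode.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical data: two positively correlated variables, plus one planted
# case whose individual scores are unremarkable but whose *profile* is odd
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, [2.5, -2.5]])  # high on var 1, low on var 2

# Univariate check: |z| > 3 flags extreme scores on any single variable
z = np.abs(stats.zscore(X, axis=0))
univariate_flags = np.where((z > 3).any(axis=1))[0]

# Multivariate check: Mahalanobis distance measures how far each profile
# sits from the centroid, accounting for the correlation structure
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)  # squared distances
cutoff = stats.chi2.ppf(0.999, df=X.shape[1])       # df = number of variables
multivariate_flags = np.where(d2 > cutoff)[0]

print("univariate:", univariate_flags)
print("multivariate:", multivariate_flags)
```

The planted case (index 200) escapes the univariate check because neither score alone is extreme, yet its Mahalanobis distance is very large, which is exactly why both kinds of screening matter.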
  • Multicollinearity exists when two or more predictors are moderately or highly correlated. This is typically of concern when conducting analyses within the multiple regression framework. Specifically, we need to assess the degree to which predictor variables are overly redundant (highly correlated) prior to including them in our models. The variance inflation factor (VIF) or tolerance are commonly used to assess multicollinearity.
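To make the VIF concrete, here is a minimal Python sketch that computes it from first principles: each predictor is regressed on the others, and VIF = 1 / (1 - R²), with tolerance being its reciprocal. The simulated predictors and the common rule-of-thumb cutoff (VIF > 10) are illustrative assumptions, not the hosts' recommendations.

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the other columns,
    then compute 1 / (1 - R^2). Tolerance is simply 1 / VIF."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)                 # independent of x1
x3 = x1 + 0.1 * rng.normal(size=300)      # nearly redundant with x1
print(vif(np.column_stack([x1, x2, x3]))) # x1 and x3 inflate; x2 stays near 1
```

In practice one would typically use an established implementation (e.g., statsmodels' `variance_inflation_factor`), but the hand-rolled version shows exactly what the statistic measures: how much a predictor's variance is explained by its fellow predictors.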