Feature selection
What is feature selection?
Feature selection is the process of selecting a subset of the input features in the data for a model, in order to reduce noise. By eliminating some of the available features, we aim to get the best results from the model with the minimum amount of data while keeping the model explainable and simple.
What is the purpose of feature selection in data analytics?
The goal of feature selection is to find, from the available data, the set of features that best models the given problem, yielding a machine learning model with good performance and robustness. Feature selection also reduces model complexity, which helps avoid common challenges in machine learning such as the curse of dimensionality, high computational cost, and poor model explainability.
Skipping the feature selection process can lead to a suboptimal model with low performance and robustness, limited explainability, and high computational requirements, which in turn lead to higher model latency in production settings.
Feature selection algorithms
Feature selection methods in machine learning can be broadly classified into supervised and unsupervised methods. Supervised methods use the target variable to perform feature selection, while unsupervised methods do not rely on the target variable. Supervised methods are further classified as:
- Wrapper methods: The feature selection process is run for a specific machine learning algorithm and evaluation criterion.
  - Recursive feature elimination (RFE): An easy-to-use and common feature selection algorithm that iteratively fits the specified machine learning model on subsets of the input variables, eliminating the weakest features at each step. The selected features are the subset that yields the best results (see the RFE sketch after this list).
- Filter methods: Features are scored with statistical measures, independently of any particular model.
  - Chi-squared test: The most common feature selection method for structured data containing categorical features. The chi-squared score is calculated for each input variable against the target variable, and the input variables with the best chi-squared scores are selected as the input feature set.
  - Pearson correlation: One of the best ways to deal with multicollinearity, which occurs when several independent input variables in the data are correlated with each other. The Pearson correlation coefficient is calculated for each pair of input variables, and one variable from each highly correlated pair (coefficient greater than 0.8) is removed during feature selection. Sketches of both filter methods follow this list.
- Embedded methods: These methods perform feature selection as part of model training, combining the qualities of the wrapper and filter methods to account for feature interactions while staying computationally efficient.
  - Decision tree algorithms: In decision trees, each node chooses the feature and split condition that best separates the data so that similar values of the target variable end up in the same split. Feature importance is then computed from how much each feature contributes to reducing the weighted impurity, which is Gini impurity or entropy for classification and variance for regression (see the tree-based sketch after this list).
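As a concrete illustration of the wrapper approach, here is a minimal RFE sketch using scikit-learn. The dataset, the logistic regression estimator, and the choice of 10 features to keep are placeholder assumptions, not a prescription.

```python
# Wrapper method: recursive feature elimination (RFE) with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)  # example dataset

# RFE repeatedly fits the model and drops the weakest feature(s) each round
# until only the requested number of features remains.
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
rfe.fit(X, y)

selected = X.columns[rfe.support_]
print("Selected features:", list(selected))
```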
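The filter methods can be sketched similarly: below, scikit-learn's SelectKBest ranks features with the chi-squared score, and a pairwise Pearson correlation matrix is used to drop one feature from each highly correlated pair. The dataset, k=10, and the 0.8 threshold are illustrative choices.

```python
# Filter methods: chi-squared scoring and Pearson-correlation pruning.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_breast_cancer(return_X_y=True, as_frame=True)  # example dataset

# Chi-squared test: score each (non-negative) feature against the target
# and keep the k best-scoring ones.
selector = SelectKBest(score_func=chi2, k=10)
selector.fit(X, y)
print("Chi-squared selection:", list(X.columns[selector.get_support()]))

# Pearson correlation: drop one feature from every pair whose absolute
# correlation exceeds 0.8, to reduce multicollinearity.
corr = X.corr().abs()
to_drop = set()
for i, col_i in enumerate(corr.columns):
    for col_j in corr.columns[i + 1:]:
        if col_i not in to_drop and col_j not in to_drop and corr.loc[col_i, col_j] > 0.8:
            to_drop.add(col_j)
print("Dropped for multicollinearity:", sorted(to_drop))
```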
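For the embedded, tree-based approach, a fitted tree ensemble exposes impurity-based feature importances that can be used directly for selection. The random forest and the median-importance threshold below are example choices only.

```python
# Embedded method: impurity-based feature importances from a tree ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True, as_frame=True)  # example dataset

# feature_importances_ measures how much each feature reduces weighted
# impurity (Gini here) across all trees in the forest.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
selector = SelectFromModel(forest, threshold="median")  # keep features above the median importance
selector.fit(X, y)

print("Tree-based selection:", list(X.columns[selector.get_support()]))
```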
There are many ways to perform unsupervised feature selection as well. Dimensionality reduction is one popular approach: it seeks a lower-dimensional representation of the data that retains as much of the original information as possible, typically by identifying and combining highly correlated features. The dimension of the input vectors is reduced using any of the popular dimensionality reduction algorithms, such as PCA or t-SNE. Principal component analysis (PCA) is the most common algorithm used for dimensionality reduction; it finds the principal components of the data, the directions along which the variance is maximized.
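A minimal PCA sketch with scikit-learn follows; standardizing the features first and keeping enough components to explain 95% of the variance are assumptions made for illustration.

```python
# Unsupervised dimensionality reduction with PCA.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)  # example dataset; target is not used

# Standardize so every feature contributes on the same scale, then project
# onto the principal components that explain most of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print("Original dimensions:", X.shape[1])
print("Reduced dimensions:", X_reduced.shape[1])
print("Explained variance ratio:", pca.explained_variance_ratio_.sum())
```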