Feature selection is the process of selecting a subset of the available input features for a model in order to reduce noise. By eliminating some features, we aim to get the best results from the model using the least data, while keeping the model simple and explainable.
The goal of feature selection is to find the set of features that best models the given problem, yielding a machine learning model with good performance and robustness. Feature selection also reduces model complexity, which helps avoid common challenges in machine learning such as the curse of dimensionality, high computational cost, and poor model explainability.
Skipping the feature selection process can lead to a suboptimal model with low performance and robustness, limited explainability, and high computational requirements, which in turn cause higher model latency in production settings.
Feature selection methods in machine learning can be broadly classified into supervised and unsupervised. Supervised methods use the target variable to perform feature selection, while unsupervised methods do not rely on the target variable. Supervised methods are further classified as filter, wrapper, and embedded methods.
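As a minimal sketch of a supervised (filter-style) approach, assuming scikit-learn is available: `SelectKBest` scores each feature against the target with a univariate statistical test and keeps only the top-scoring ones. The Iris dataset here is just an illustrative choice.

```python
# Supervised filter-based feature selection: score each feature
# against the target with an ANOVA F-test and keep the top k.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)          # 4 input features, 150 samples
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)  # keep the 2 most predictive features

print(X.shape, "->", X_selected.shape)     # (150, 4) -> (150, 2)
```

Note that `fit_transform` takes both `X` and `y`: the target variable drives the selection, which is exactly what makes this method supervised.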
There are many ways to perform unsupervised feature selection as well. Dimensionality reduction is one popular method: it seeks a lower-dimensional representation of the data that retains as much of the original information as possible, typically by identifying and combining highly correlated features. The dimension of the input vectors is reduced using any of the popular dimensionality reduction algorithms, such as PCA or t-SNE. Principal component analysis (PCA) is the most common algorithm used for dimensionality reduction. It finds the principal components of the given data, the directions along which variance is maximized.
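The PCA step above can be sketched as follows, again assuming scikit-learn. Unlike the supervised example, no target variable is passed to `fit`, which is what makes this method unsupervised.

```python
# Unsupervised dimensionality reduction with PCA: project the data
# onto the directions (principal components) of maximum variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)     # target is ignored: unsupervised
pca = PCA(n_components=2)             # keep the top 2 principal components
X_reduced = pca.fit_transform(X)      # 4-dimensional data -> 2 dimensions

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # fraction of variance per component
```

`explained_variance_ratio_` is useful for choosing `n_components`: it shows how much of the original variance each retained component preserves.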