Dimension reduction might sound trivial, but it is an essential part of building a machine learning model. Too many dimensions act as noise to the model, so in many cases it is vital to reduce them.
Why is dimension reduction helpful?
- As the number of data dimensions increases, it becomes harder for machine learning models to take account of all the representations. Sometimes this negatively impacts the accuracy of the outcome. This is called the “curse of dimensionality” in the data world.
- More dimensions can act as noise rather than as features. Some features might not be related to the output value at all, so they only add random weight to the decision being made.
- Fewer dimensions mean fewer parameters to learn, which generally results in a faster and more efficient model. Sometimes a huge number of dimensions causes so much noise that the model learns nothing.
- In tabular data, when the number of columns decreases, the number of rows needed to learn also decreases in most cases.
Techniques for Dimension Reduction
1. Dimension Reduction by feature elimination
We can reduce the dimension by merely eliminating unnecessary features. To do so, we consider the following methods:
a. Removing highly correlated features
Real data sets often have many highly correlated features, such as age and experience, or salary and education. We can remove one feature from each highly correlated pair, as one is enough.
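As a rough illustration of the idea above, the sketch below drops one column from every highly correlated pair. The helper name drop_correlated, the 0.9 threshold, and the synthetic age/experience/salary data are all assumptions made for this example, not part of any library.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Return the names of the columns kept after dropping one of each highly correlated pair."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature correlation matrix
    keep = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not strongly correlated with a column already kept.
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]

# Synthetic data (hypothetical): experience is almost a copy of age, salary is unrelated.
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=200)
experience = age - 20 + rng.normal(0, 1, size=200)
salary = rng.normal(50, 10, size=200)

X = np.column_stack([age, experience, salary])
kept = drop_correlated(X, ["age", "experience", "salary"])
print(kept)
```

Which feature of a pair survives depends only on column order here; in practice you may prefer to keep the one that is easier to collect or interpret.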
b. Feature importance
Next, we can remove unimportant features by looking at the feature importance from a model like Logistic Regression or Random Forest. Also, if a feature has zero variance, we can remove it. But remember: the variance has to be exactly zero, as a small variance doesn’t mean the feature is not essential.
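The two steps above could look roughly like this with scikit-learn. The data is synthetic and made up for the example: one informative column, one pure-noise column, and one constant column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold

# Synthetic data (assumption for this sketch).
rng = np.random.default_rng(0)
n = 300
informative = rng.normal(size=n)
noise = rng.normal(size=n)
constant = np.ones(n)                      # zero variance
X = np.column_stack([informative, noise, constant])
y = (informative > 0).astype(int)

# Step 1: drop features whose variance is exactly zero.
selector = VarianceThreshold(threshold=0.0)
X_var = selector.fit_transform(X)
kept_mask = selector.get_support()         # boolean mask over the original columns

# Step 2: rank the remaining features by Random Forest importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_var, y)
importances = forest.feature_importances_
print(kept_mask, importances)
```

The importances sum to 1, so they are best read relative to each other rather than as absolute values.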
c. Recursive Feature Elimination (For models with feature importance)
We fit our model with all the available features, then remove the least important feature and fit the model again. Then, we measure the drop in the evaluation metrics. Repeating this tells us which features are most vital.
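A minimal sketch of this loop using scikit-learn's RFE class on synthetic data. Note one difference from the description above: RFE ranks features by the model's coefficients (or feature importances) at each step rather than by the drop in an evaluation metric, so treat it as a close relative of the procedure, not an exact match.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): only the first two of six columns influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (2.0 * X[:, 0] - 1.5 * X[:, 1] > 0).astype(int)

# Remove one feature per round (step=1) until two are left.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2, step=1)
rfe.fit(X, y)

selected = np.flatnonzero(rfe.support_)   # indices of the surviving features
print(selected, rfe.ranking_)             # ranking_ is 1 for every selected feature
```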
d. Additive Feature Extraction (For models with feature importance)
We start with the most important feature and then keep adding features in order of their importance, looking at the results after each addition. From this, we can choose a set of features to use for our model. In this example, choosing more than the 5 most important features gives us no benefit.
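The procedure above can be sketched with plain NumPy. Everything here is an assumption for illustration: the data is synthetic with only three useful features, importance is taken as the absolute coefficient of a linear model fitted on all features, and the result is measured as R^2 on a held-out split.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients; the intercept is the last entry."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def r2(X, y, coef):
    """Coefficient of determination of the fitted model on (X, y)."""
    pred = np.column_stack([X, np.ones(len(X))]) @ coef
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Synthetic data (hypothetical): 8 features on the same scale, only 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(0, 0.1, size=500)
X_train, X_test, y_train, y_test = X[:350], X[350:], y[:350], y[350:]

# Order features from most to least important.
order = np.argsort(-np.abs(fit_ols(X_train, y_train)[:-1]))

# Add features one at a time in that order and record the held-out score.
scores = []
for k in range(1, len(order) + 1):
    cols = order[:k]
    coef = fit_ols(X_train[:, cols], y_train)
    scores.append(r2(X_test[:, cols], y_test, coef))
print(order, np.round(scores, 3))
```

The point where the score stops improving suggests how many features are worth keeping.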
e. Taking all subsets (hotchpotch for all models)
This is a brute-force technique: we give the model every combination of features, of every size, and look at which subset performs the best. With n features there are 2^n - 1 non-empty subsets, so this is only practical for a small number of features.
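A small sketch of the brute-force search, assuming a linear least-squares model and held-out mean squared error as the score. The data and the helper name holdout_mse are made up for the example.

```python
from itertools import combinations
import numpy as np

# Synthetic data (assumption): 5 features, only columns 1 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + rng.normal(0, 0.5, size=300)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

def holdout_mse(cols):
    """Fit least squares on the chosen columns and return the error on the held-out rows."""
    cols = list(cols)
    A_train = np.column_stack([X_train[:, cols], np.ones(len(X_train))])
    A_test = np.column_stack([X_test[:, cols], np.ones(len(X_test))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((y_test - A_test @ coef) ** 2))

# Every non-empty subset of the feature indices, of every size.
n_features = X.shape[1]
subsets = [c for size in range(1, n_features + 1)
           for c in combinations(range(n_features), size)]

best = min(subsets, key=holdout_mse)
print(len(subsets), best)
```

The best subset may include a harmless extra column by chance, which is one reason to prefer the smallest subset among those with near-identical scores.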
Note: Playing around with the regularization type (L1 or L2) in logistic and linear regression is a good idea. L1 tends to squeeze the weights of unimportant features to exactly zero, while L2 only shrinks the weights and often gives the better model in evaluation. We can then look at the weights to reduce the number of features needed.
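To see the difference between the two penalties, here is a sketch for the linear regression case using scikit-learn's Lasso (L1) and Ridge (L2). The data is synthetic and the alpha values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 10 features, only the first two affect the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=500)

lasso = Lasso(alpha=0.2).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
selected = np.flatnonzero(lasso.coef_)   # features that L1 kept

print("zero weights with L1:", n_zero_lasso)
print("zero weights with L2:", n_zero_ridge)
print("features kept by L1:", selected)
```

How many weights L1 zeroes out depends on the penalty strength, so alpha is worth tuning rather than fixing.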
2. Dimension Reduction by Matrix Factorization
Instead of eliminating features, we can aggregate the complex representation of features in many dimensions into a few dimensions, i.e. a manifold embedding of the higher-dimensional data points. A manifold embedding is a representation of the data in the fewest dimensions that still covers nearly all the information in the original data points. Many different variants of matrix factorization can do this, a few of which are described below:
a. Principal Component Analysis (PCA)
One of the most widely used techniques for dimension reduction is PCA. It is fast and efficient. We take the covariance matrix of the data points we have. Then, we compute the eigenvalues and eigenvectors of the covariance matrix. Because the covariance matrix captures the dependencies between variables, its eigenvectors give us the axes along which most of the points lie linearly. This set of axes is called the eigenbasis.
PCA makes the strong assumption that the data points are linear in some manner.
Now that we know the directions of the eigenvectors, we can also read the strength of the variance along each of them from its eigenvalue. The eigenvectors give us the principal components, and the eigenvalues give us the variance explained. To get the points in the new vector space, we multiply the centered data by the eigenvectors with the largest eigenvalues.
For more math and a detailed explanation, watch StatQuest: PCA main ideas in only 5 minutes!!!
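The steps described above (covariance matrix, eigen decomposition, projection) can be sketched in a few lines of NumPy. The data is synthetic: 2-D points stretched along the direction (1, 1), chosen only so the result is easy to check by eye.

```python
import numpy as np

# Synthetic data (assumption): points spread mostly along the (1, 1) direction.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))
X = latent @ np.array([[1.0, 1.0]]) + rng.normal(0, 0.1, size=(500, 2))

# 1. Center the data and take its covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigen decomposition (eigh is meant for symmetric matrices, returns ascending order).
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]            # sort from largest to smallest
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 3. Variance explained by each principal component.
explained_ratio = eigenvalues / eigenvalues.sum()

# 4. Project onto the top-k eigenvectors to get the reduced points.
k = 1
X_reduced = X_centered @ eigenvectors[:, :k]
print(explained_ratio, X_reduced.shape)
```

In practice a library implementation such as scikit-learn's PCA class does the same job and is usually the more convenient choice.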
b. Singular Value Decomposition (SVD)
Singular value decomposition simply decomposes a matrix A into three matrices: A = U Σ V^T.
Here, assuming each column of A is a data point with the mean removed, the columns of U give the new axes (the new principal components), Σ holds the singular values, and V^T, scaled by Σ, holds the new points that we get by transforming into the new vector space. So, theoretically, we should get the points back when we multiply these matrices. Note that Σ is playing a weighing role: the variance covered along each axis is proportional to the square of the corresponding singular value. This can also be related to PCA, as the U matrix gives us the eigenvectors of the covariance matrix and Σ gives us the eigenvalues (as squared singular values, up to a constant factor).
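The relationship between the three matrices and PCA can be checked numerically. This sketch assumes the data points are stored as the columns of the matrix (3 features, 200 points), with each feature centered; the data itself is random and only for illustration.

```python
import numpy as np

# Synthetic data (assumption): 3 features, 200 points stored as columns.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 200))
A = A - A.mean(axis=1, keepdims=True)      # center each feature

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three matrices gives the original points back.
A_rebuilt = U @ np.diag(s) @ Vt

# Squared singular values match the covariance eigenvalues up to the factor 1 / (n - 1).
cov_eigenvalues = np.linalg.eigvalsh(np.cov(A))[::-1]   # sorted from largest to smallest
variance_from_svd = s ** 2 / (A.shape[1] - 1)

# Coordinates of the points on the top-k axes: Sigma_k @ Vt_k, which equals U_k^T @ A.
k = 2
A_reduced = np.diag(s[:k]) @ Vt[:k]
print(np.allclose(A, A_rebuilt), A_reduced.shape)
```

If your data has points as rows instead, the roles of U and V swap: the axes come from V and the coordinates from U scaled by Σ.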
c. Independent Component Analysis (ICA)
When we mix (sum up) independent components of the data, the distribution of the mixture tends to be closer to Gaussian than the distributions of the individual components.
So, ICA looks for a transformation that pushes the kurtosis away from that of a Gaussian and makes the distribution of each new component as non-Gaussian as possible.
Hence the new components that come out of the transformation are independent of each other. It means that the new axes don’t necessarily have to be orthogonal to each other. It also means that instead of looking at covariance and separating the new axes in terms of the variance explained, ICA tries to separate the independent components in the data points. Here is an example of how PCA works vs. how ICA works.
PCA and SVD assume linearity in the data and give us orthogonal axes. ICA is still a linear transformation, but it does not force the axes to be orthogonal, so it should perform better when the data is a mixture of independent, non-Gaussian components. For data with non-linear relations, the projection-based methods in the next section are a better fit.
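To make the ICA idea concrete, the sketch below mixes two independent, non-Gaussian signals with a non-orthogonal mixing matrix and asks scikit-learn's FastICA to separate them again. The signals and the mixing matrix are invented for the example; ICA recovers components only up to order, sign, and scale, so the check uses absolute correlations.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent, non-Gaussian sources (assumption): a square wave and uniform noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.column_stack([np.sign(np.sin(3 * t)),
                           rng.uniform(-1, 1, size=t.size)])

# Mix them with a non-orthogonal matrix to get the observed features.
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.0]])
X = sources @ mixing.T

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)

# Absolute correlation between each recovered component and each true source.
corr = np.abs(np.corrcoef(recovered.T, sources.T)[:2, 2:])
print(np.round(corr, 2))
```

Each row of the printed matrix should have one entry close to 1, meaning every recovered component lines up with one of the original sources.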
3. Dimension Reduction by projection-based techniques (Self-organizing map)
There are many projection-based dimension reduction methods, famous for data visualization on the MNIST data-set in TensorFlow. Some of them are t-SNE and UMAP. But we generally don’t use these for dimension reduction, as they are stochastic methods: each time we reduce the dimensions, we get different data points. Great for visualization, but not so much for dimension reduction.
Regardless, let’s discuss one for fun.
T-distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE assumes non-linearity in the data; that is why it works so well on a high-dimensional data-set like MNIST. Here, we can see how well it works.
The inner workings of the algorithm are complex and not in the scope of this article. For further study, here is an excellent explanation of how t-SNE works: https://www.youtube.com/watch?v=NEaUSP4YerM
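For a quick hands-on look, here is a sketch that runs scikit-learn's TSNE on a small slice of the bundled digits data-set (a small MNIST-like set, used here as a stand-in so nothing needs downloading). Running it twice with different seeds and a random initialization shows the stochastic behavior mentioned earlier. The sample size and perplexity are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 200 handwritten digits, 64 pixel features each (small slice to keep the run short).
digits = load_digits()
X, y = digits.data[:200], digits.target[:200]

# Two runs that differ only in the random seed.
emb_a = TSNE(n_components=2, perplexity=30, init="random", random_state=0).fit_transform(X)
emb_b = TSNE(n_components=2, perplexity=30, init="random", random_state=1).fit_transform(X)

print(emb_a.shape)                 # one 2-D point per input sample
print(np.allclose(emb_a, emb_b))   # the two embeddings differ
```

Plotting emb_a colored by y gives the usual cluster picture. Note that scikit-learn's TSNE only offers fit and fit_transform, with no way to transform new points afterwards, which is another reason it suits visualization better than general dimension reduction.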