Topic outline

  • Principal Component Analysis vs. Discriminant Analysis

    Real-world data is often structured in a complex manner. This is especially true for pattern-classification and machine-learning applications. The challenge is to reduce the dimensions of the data sets with minimal loss of information. 

    There are two commonly used techniques to achieve this: Principal Component Analysis (PCA) and Discriminant Analysis (DA). To illustrate both techniques we use the Iris dataset first established by Fisher in 1936. It consists of a sample of the size n = 150, containing three classes (types of Iris flower), each with four flower features (sepal / petal lengths / widths). Each class has a subsample of the size n = 50.

    Both Principal Component Analysis and Discriminant Analysis are linear transformation methods and are closely related to each other. With PCA we are interested in finding the components (directions) that maximize the variance in our dataset. With DA we are additionally interested in finding the components (directions) that maximize the separation (discrimination) between different classes. In DA, classes are expressed with class labels. In contrast, PCA ignores class labels. In pattern recognition problems a PCA is often followed by a DA.

    The difference between the two techniques is summarized in Table 1:

    Table 1: PCA vs DA
    PCA: Projection of the whole data set (without class labels) onto a different subspace.
    DA:  Identification of a suitable subspace to distinguish between patterns that belong to different classes (with class labels).
    PCA: Whole data set is treated as one class.
    DA:  Classes in the data set are retained.
    PCA: Identification of the axes with maximum variance, where the data is most spread.
    DA:  Identification of components that maximize the spread between classes.

    To demonstrate these techniques we use the Iris data set. The flower colours are varied, which is why the name Iris is taken from Ancient Greek, where it means rainbow.

    It contains only four variables, measured in centimeters: sepal length, sepal width, petal length and petal width. There are also only three classes: Iris Setosa (Beachhead Iris), Iris Versicolour (Larger Blue Flag or Harlequin Blue Flag) and Iris Virginica (Virginia Iris). The data set is also known as Anderson's Iris data, since it was Edgar Anderson who collected the data to quantify the morphologic variation of three related Iris species.

    (Sir) Ronald Fisher prepared the multivariate data set and developed a linear discriminant model to distinguish the species from each other.

    Iris species

    Iris petal sepal length width

    Even though it is a very simple data set, it becomes difficult to visualize the three classes along all dimensions (variables). The following four histograms show the distribution of the four dimensions against the classes (species):

    We notice that the distributions of sepal width and sepal length overlap; we therefore cannot use these two features to separate one species from another.

    Let us look at Iris Setosa only. As can be seen from the bottom histograms, the distributions of petal width and petal length are very different from those of the other two species. We further see that petal length can be used as a differentiating factor in terms of the distribution of the three species.

    • Principal Component Analysis

      In Principal Component Analysis we are interested in finding the components (directions) that maximize the variance in the data. The basic principle is to transform the data into a subspace that summarizes properties of the whole data set with a reduced number of dimensions. We can then use these newly formed dimensions to visualize the data.

      The new dimensions are called principal components. The first principal component captures most of the variation in the data: the data vary most along the first principal component, then along the second principal component, and so forth. Principal components are orthogonal to each other and therefore uncorrelated.
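      As a sketch of these ideas, the principal components can be computed with NumPy by eigendecomposing the covariance matrix of the centred data. The data below is randomly generated toy data, not the Iris measurements, and this is not the R princomp() routine referred to later in the text:

```python
import numpy as np

# Sketch of PCA via eigendecomposition of the covariance matrix.
# Toy data: 150 samples, 4 features with very different variances
# (NOT the Iris measurements).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4)) @ np.diag([3.0, 1.5, 0.5, 0.2])

Xc = X - X.mean(axis=0)                    # centre each feature
cov = np.cov(Xc, rowvar=False)             # 4x4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric matrix -> eigh
order = np.argsort(eigvals)[::-1]          # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()        # variance explained per PC
scores = Xc @ eigvecs[:, :2]               # project onto PC1 and PC2
print(explained)
```

      The entries of explained give the proportion of variance captured by PC1, PC2, and so on, in decreasing order; scores holds the data projected onto the first two principal components.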

      Example: Two-dimensional component axes (orthogonal to each other) that maximize the variance

      PCA two-dimensional component axes

      The following scatter plots of the Iris data set show the features petal length and petal width:

      Iris petal length width


      The scatter plots of the features petal length and petal width against the three classes show that these two dimensions (features) clearly separate the three classes. In a real-life data set this would normally not be the case, and we would have to use Principal Component Analysis to identify the principal components. For the Iris data set, the output of the importance of the components looks as follows:

      Iris PCA importance of components

      (output of the R command princomp(); source:

      The first principal component (PC1) explains 92% of the variance in the data. We can therefore assume that PC1 expresses the size of the flower, which is reflected in all four dimensions.
      Since PCA does not use class labels, the next task is to identify the dimension (feature) that is represented by the first principal component. In the Iris data set, this is petal length. No other feature separates the variability across the three classes as well, as the scatter diagrams above show. In contrast, sepal length and sepal width do not separate the variability well, as the following scatter plots show:

      Iris sepal length width

      From the above histograms and scatter plots we can therefore visually determine that the feature petal length is the first principal component.
      We used the Iris data set for demonstration purposes. Real-life data sets can have hundreds of dimensions which means that more advanced techniques than visualization need to be employed.

      • Linear Discriminant Analysis

        Performing a Discriminant Analysis (summary)

        Credit: Sebastian Raschka,

        Linear Discriminant Analysis (LDA) is most commonly used as a dimensionality reduction technique in the pre-processing step for pattern-classification and machine-learning applications. The goal is to project a dataset onto a lower-dimensional space with good class-separability in order to avoid overfitting ("curse of dimensionality") and also to reduce computational cost.

        Ronald A. Fisher formulated the Linear Discriminant in 1936 (The Use of Multiple Measurements in Taxonomic Problems). It also has some practical use as a classifier. The original linear discriminant was described for a 2-class problem. It was later generalized as "Multi-Class Linear Discriminant Analysis" or "Multiple Discriminant Analysis" by C. R. Rao in 1948 (The utilization of multiple measurements in problems of biological classification).

        The general LDA approach is very similar to a PCA, but in addition to finding the component axes that maximize the variance of our data (PCA), we are additionally interested in the axes that maximize the separation between multiple classes (LDA).

        The main goal of an LDA is therefore to project a feature space (an n-dimensional dataset) onto a smaller subspace k (where k <= n -1) while maintaining the class-discriminatory information.

        Example: Maximizing two-dimensional component axes for class-separation

        LDA Maximizing component axes

        When discussing PCA we have already seen that for low-dimensional data sets like Iris, visual inspection of histograms and scatter plots is very informative. For a data set with k classes and d dimensions we can perform a discriminant analysis in five steps as follows (for a detailed presentation including Python code see the referenced paper by Raschka):

        1. Computing the d-dimensional mean vectors. The result is k vectors, each containing d means (see the Python code snippet below).
        2. Computing the Within-Class scatter matrix SW and Between-Class scatter matrix SB.
        3. Solving the generalized eigenvalue problem for the matrix SW⁻¹SB. The result is the linear discriminants.
        4. Selecting linear discriminants for the new feature subspace. This is done by sorting the eigenvectors by decreasing eigenvalues and choosing the eigenvectors with the largest eigenvalues.
        5. Transforming the samples onto the new subspace.
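        As a minimal sketch, the five steps can be written in NumPy as follows. The labelled data here is synthetic (made-up class means), not the Iris measurements used in Raschka's article:

```python
import numpy as np

# Sketch of the five LDA steps on synthetic labelled data:
# 3 classes (k = 3), 4 features (d = 4), 50 samples per class.
# The class means below are made up, NOT the Iris statistics.
rng = np.random.default_rng(1)
class_means = np.array([[0.0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]])
X = np.vstack([rng.normal(m, 1.0, size=(50, 4)) for m in class_means])
y = np.repeat([0, 1, 2], 50)

# Step 1: d-dimensional mean vector for each class
mean_vecs = np.array([X[y == c].mean(axis=0) for c in range(3)])

# Step 2: within-class scatter SW and between-class scatter SB
overall_mean = X.mean(axis=0)
SW = sum((X[y == c] - mean_vecs[c]).T @ (X[y == c] - mean_vecs[c])
         for c in range(3))
SB = sum(50 * np.outer(mean_vecs[c] - overall_mean,
                       mean_vecs[c] - overall_mean) for c in range(3))

# Step 3: eigenpairs of SW^-1 SB give the linear discriminants
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
order = np.argsort(eigvals.real)[::-1]

# Step 4: keep the leading two eigenvectors (SB has rank k - 1,
# so at most k - 1 = 2 discriminants carry information)
W = eigvecs[:, order[:2]].real

# Step 5: project the samples onto the new subspace
X_lda = X @ W
print(X_lda.shape)  # (150, 2)
```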

        Python code and vector means (step 1):

        LDA vector means and python code

        Performing an LDA on the Iris data set, we arrive at two linear discriminants, as shown in the scatter plot below:
        LDA scatter plot
        The scatter plot represents our new feature space, showing that the first linear discriminant LD1 separates the classes very well. However, the second linear discriminant LD2 does not add much valuable information.
        • Beyond the Iris data set


          The Iris data set was useful to present the principles of Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).

          To demonstrate both dimensionality reduction techniques using a higher-dimensional data set, we use the Digit Dataset from the UCI Machine Learning Repository at

          250 samples were collected from 44 writers using a pressure sensitive tablet. The writers were asked to write 250 digits from 0 to 9 in random order. Each image is of dimension 8x8 = 64, and is reduced to a two-dimensional data point.

          PCA applied to this data identifies the combination of principal components (attributes, or directions in feature space) that account for most of the variance in the data. Below is a scatter plot of different samples on the first two principal components:

          Digits data set PCA

          Below is a scatter plot following an LDA, which tries to identify attributes that account for the most variance between classes. As can be seen, the LDA achieved a higher test accuracy (0.66 vs. 0.52 for PCA):

          digits data set LDA

          • "Curse of Dimensionality"

            To illustrate the so-called "Curse of Dimensionality" we use a set of images, depicting either a cat or a dog.

            In this example we create a classifier that is able to distinguish dogs from cats automatically. To do so, we first need to think about a descriptor for each object class (cat, dog) that can be expressed by numbers to recognize the object. We could, for instance, argue that cats and dogs generally differ in colour. A possible descriptor that discriminates these two classes could then consist of three numbers: the average red colour, the average green colour and the average blue colour of the image. A simple classifier could combine these features to decide on the class label:

            If 0.5*red + 0.3*green + 0.2*blue > threshold then return cat; else return dog
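            In Python, such a rule could look like the following sketch; the weights (0.5, 0.3, 0.2) and the threshold value are purely illustrative, not parameters learned from real images:

```python
# Hypothetical implementation of the colour rule above; the weights
# and the threshold are illustrative values, not learned parameters.
def classify(avg_red, avg_green, avg_blue, threshold=0.5):
    """Linear decision rule on the three average-colour features."""
    score = 0.5 * avg_red + 0.3 * avg_green + 0.2 * avg_blue
    return "cat" if score > threshold else "dog"

print(classify(0.9, 0.6, 0.4))  # score 0.71 -> "cat"
print(classify(0.1, 0.1, 0.1))  # score 0.10 -> "dog"
```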

            However, these three colour-describing numbers (features) will obviously not suffice to obtain a good classification. Therefore, we add some features that describe the texture of the image, for instance by calculating the average gradient intensity in both the X and Y direction. We now have five features that, in combination, could be used by a classification algorithm to distinguish cats from dogs.

            We are not yet satisfied and continue to add more features. Perhaps we can obtain a perfect classification by carefully designing a few hundred features? The answer to this question may sound counter-intuitive: No, we cannot!  In fact, after a certain point, increasing the dimensionality by adding new features would actually degrade the performance of our classifier. This is illustrated by the figure below:

            Illustration curse of dimensionality


            As the dimensionality increases, the classifier's performance increases until the optimal number of features is reached; increasing the dimensionality further degrades the performance. We will now turn to the example of cats and dogs to see why this is true.

            We assume there are an infinite number of cats and dogs living on our planet. However, due to our limited time and processing power, we were only able to obtain 10 pictures of cats and dogs. The end goal in classification is then to train a classifier based on these 10 training instances that is able to correctly classify the infinite number of dog and cat instances about which we have no information.

            We use a simple linear classifier and try to obtain a good classification. We can start with a single feature, e.g. the average "red colour" in the image:

            dimensionality 1 feature

            We note that a single feature does not result in a good separation of our training data. We therefore add a second feature, the average "green colour" in the image:

            dimensionality 2 features

            Adding a second feature still does not result in a linearly separable classification problem: No single line can separate all cats from all dogs.

            Finally, we decide to add a third feature, the average "blue colour" in the image, yielding a three-dimensional feature space:


            Adding a third feature results in a linearly separable classification problem in our example. A plane exists that separates dogs from cats very well. This means that a linear combination of the three features can be used to obtain good classification results on our training data of 10 images:

            dimensionality 3 features plane

            These illustrations might seem to suggest that increasing the number of features until perfect classification results are obtained is the best way to train a classifier. However, in the introduction we argued that this is not the case.

            Note how the density of the training samples decreased exponentially when we increased the dimensionality of the problem. In the case of one feature, 10 training instances covered the complete one-dimensional feature space, the width of which was five unit intervals. Therefore, the sample density was 10 / 5 = 2 samples / interval.

            In the two-dimensional case we still had ten training instances, which now cover a feature space with an area of 5 x 5 = 25 unit squares. Therefore, the sample density was 10 / 25 = 0.4 samples / unit square.

            Finally, in the three-dimensional case, the ten samples had to cover a feature space volume of 5 x 5 x 5 = 125 unit cubes. Therefore, the sample density was 10 / 125 = 0.08 samples / unit cube.
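            These densities follow directly from dividing the ten samples by the volume of the feature space, as this small sketch shows:

```python
# Sample density for 10 training instances when each of d feature
# dimensions spans 5 unit intervals, as in the example above.
def sample_density(n_samples, d, width=5):
    return n_samples / width ** d

for d in (1, 2, 3):
    print(d, sample_density(10, d))  # 2.0, 0.4, 0.08
```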

            Adding features means that the dimensionality of the feature space grows, so that the training data becomes sparser and sparser. Due to this sparsity, it becomes much easier to find a separating hyperplane: the likelihood that a training sample lies on the wrong side of the best hyperplane becomes infinitely small as the number of features becomes infinitely large.

            However, if we project the highly dimensional classification back to a lower dimensional space a serious problem becomes evident:

            Dimensionality high back projection

            Using too many features results in overfitting: the classifier starts learning exceptions that are specific to the training data. While the data was linearly separable in the three-dimensional space, this is not the case in the two-dimensional space. In fact, adding the third dimension to obtain better classification results simply corresponds to using a complicated non-linear classifier in the lower-dimensional feature space. As a result, the classifier learns the appearance of specific instances and exceptions of our training data set. Because of this, the resulting classifier would fail on real-world data, consisting of an infinite amount of unseen cats and dogs that often do not adhere to these exceptions. This is a direct result of the curse of dimensionality.

            The following figure shows the result of a linear classifier that has been trained on only two features instead of three:

            dimensionality result two features

            Although the simple linear two-dimensional classifier seems to perform worse than the non-linear classifier above, this simple classifier generalizes much better to unseen data because it did not learn specific exceptions that were only in our training data by coincidence. In other words, by using fewer features, the curse of dimensionality was avoided and the classifier did not overfit the training data.

            We now illustrate this concept in a different manner. We assume we want to train a classifier using only a single feature whose value ranges from 0 to 1. We also assume that this feature is unique for each cat and dog. If we want our training data to cover 20% of this range, then the amount of training data needed is 20% of the complete population of cats and dogs. If we add another feature, resulting in a two-dimensional feature space, things change: To cover 20% of the two-dimensional range, we now need to obtain 45% of the complete population of cats and dogs in each dimension, since 0.45^2 = 0.2 (rounded). In the three-dimensional space it gets even worse: To cover 20% of the three-dimensional feature range, we need to obtain 58% of the population in each dimension, since 0.58^3 = 0.2 (rounded). 
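            The 45% and 58% figures follow from taking the d-th root of the 20% target coverage, as this sketch shows:

```python
# Fraction of each feature's range that must be covered so that the
# training data spans 20% of the d-dimensional feature space.
def per_dimension_coverage(target, d):
    return target ** (1.0 / d)

for d in (1, 2, 3):
    print(d, round(per_dimension_coverage(0.2, d), 2))  # 0.2, 0.45, 0.58
```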

            dimensionality training data needed

            This illustrates the fact that the amount of training data needed to cover 20% of the feature range grows exponentially with the number of dimensions.

            We showed that increasing dimensionality introduces sparseness of the training data. The more features we use, the more sparse the data becomes such that accurate estimation of the classifier's parameters (i.e., its decision boundaries) becomes more difficult. Another effect is that this sparseness is not uniformly distributed over the search space. In fact, data around the origin (at the centre of the hypercube) is much more sparse than data in the corners of the search space. This can be understood as follows:

            Imagine a unit square that represents the two-dimensional feature space. The average of the feature space is the centre of this unit square, and all points within a distance of 0.5 from this centre lie inside a circle inscribed in the unit square. The training samples that do not fall within this circle are closer to the corners of the search space than to its centre. These samples are difficult to classify because their feature values differ greatly (e.g., samples in opposite corners of the unit square). Therefore, classification is easier if most samples fall inside the inscribed circle, as illustrated by the figure below:

            dimensionality unit circle

            Training samples that fall outside the inscribed circle are in the corners of the feature space and are more difficult to classify than samples near the centre of the feature space. The volume of the hypersphere tends to zero as the dimensionality tends to infinity, whereas the volume of the surrounding hypercube remains constant. This surprising and rather counter-intuitive observation partially explains the problems associated with the curse of dimensionality in classification: in high-dimensional spaces, most of the training data resides in the corners of the hypercube defining the feature space. As mentioned before, instances in the corners of the feature space are much more difficult to classify than instances around the centroid of the hypersphere. This is illustrated by the figure below, which shows (from left to right) a two-dimensional unit square, a three-dimensional unit cube, and a creative visualization of an eight-dimensional hypercube, which has 2^8 = 256 corners:

            dimensionality hypercube

            For an eight-dimensional hypercube, about 98% of the data is concentrated in its 256 corners.
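            The 98% figure can be checked with the standard formula for the volume of a d-dimensional ball of radius 0.5 (the ball inscribed in the unit hypercube), which uses only the gamma function from the Python standard library:

```python
import math

# Volume of the ball inscribed in the d-dimensional unit hypercube
# (radius 0.5), as a fraction of the hypercube's volume of 1.
def inscribed_ball_fraction(d, radius=0.5):
    return math.pi ** (d / 2) * radius ** d / math.gamma(d / 2 + 1)

print(round(inscribed_ball_fraction(2), 4))  # pi/4, about 0.7854
print(round(inscribed_ball_fraction(8), 4))  # about 0.016, so ~98% lies outside
```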

            We have shown that the performance of a classifier decreases when the dimensionality of the problem becomes too large. The question then is what "too large" means, and how overfitting can be avoided. Regrettably, there is no fixed rule that defines how many features should be used in a classification problem. In fact, this depends on the amount of training data available, the complexity of the decision boundaries, and the type of classifier used.

            • References

              Reference 1
              Reference 2
              Reference 3