Principal Component Analysis and Discriminant Analysis

1. Principal Component Analysis vs. Discriminant Analysis

Real-world data is often structured in a complex manner. This is especially true for pattern-classification and machine-learning applications. The challenge is to reduce the dimensionality of a data set with minimal loss of information.

There are two commonly used techniques to achieve this: Principal Component Analysis (PCA) and Discriminant Analysis (DA). To illustrate both techniques we use the Iris data set, introduced by Fisher in 1936. It consists of a sample of size n = 150, containing three classes (species of Iris flower), each described by four flower features (sepal and petal lengths and widths). Each class contains a subsample of size n = 50.
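The sample sizes above can be verified directly; a minimal sketch, assuming scikit-learn is installed (it ships the Iris data with the library):

```python
# Load the Iris data set and confirm the sample sizes described above.
from collections import Counter
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)            # (150, 4): n = 150 samples, four features each
print(Counter(y))         # 50 samples per class (subsamples of size n = 50)
print(iris.target_names)  # the three species names
```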

Both Principal Component Analysis and Discriminant Analysis are linear transformation methods and are closely related to each other. With PCA we are interested in finding the components (directions) that maximize the variance in our data set. With DA we are additionally interested in finding the components (directions) that maximize the separation (discrimination) between different classes. In DA, classes are expressed with class labels. In contrast, PCA ignores class labels. In pattern recognition problems a PCA is often followed by a DA.
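The contrast can be made concrete in code. A minimal sketch, assuming scikit-learn is available: PCA never sees the class labels, while linear discriminant analysis (LDA, scikit-learn's DA implementation) requires them.

```python
# PCA vs. LDA on the Iris data: unsupervised vs. supervised projection.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: directions of maximum variance; the labels y are never used.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: directions that maximize class separation; the labels y are required.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # both project 150 samples onto 2 components
```

Note that LDA can produce at most (number of classes - 1) components, so with three species two components is the maximum here.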

The difference between the two techniques is summarized in Table 1:

Table 1: PCA vs DA

PCA:
- Projection of the whole data set (without class labels) onto a different subspace.
- Whole data set is treated as one class.
- Identification of the axes with maximum variance, i.e. where the data is most spread out.

DA:
- Identification of a suitable subspace to distinguish between patterns that belong to different classes (with class labels).
- Classes in the data set are retained.
- Identification of components that maximize the spread between classes.

To demonstrate these techniques we use the Iris data set. The flowers come in many colours, which is why the name Iris derives from the Ancient Greek word for rainbow.

It contains only four variables, measured in centimeters: sepal length, sepal width, petal length and petal width. There are also only three classes: Iris Setosa (Beachhead Iris), Iris Versicolour (Larger Blue Flag or Harlequin Blue Flag) and Iris Virginica (Virginia Iris). The data set is also known as Anderson's Iris data, since it was Edgar Anderson who collected the data to quantify the morphologic variation of three related Iris species.

(Sir) Ronald Fisher prepared the multivariate data set and developed a linear discriminant model to distinguish the species from each other.

[Figure: the three Iris species, with petal and sepal length/width labeled]

Even though it is a very simple data set, it becomes difficult to visualize the three classes along all dimensions (variables). The following four histograms show the distribution of the four dimensions against the classes (species):

We notice that the distributions of sepal width and sepal length overlap, so these features alone cannot separate one species from another.

Let us look at Iris Setosa only. As can be seen from the bottom histograms, the distribution of petal width and petal length is very different from that of the other two species. We further see that petal length can be used as a differentiating factor in terms of the distribution of the three species.
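The overlap visible in the histograms can also be quantified numerically. A minimal sketch, assuming scikit-learn and NumPy are available: compare the per-class value ranges of sepal width (overlapping) with those of petal length (Setosa fully separated).

```python
# Per-class (min, max) ranges for one feature, across the three species.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

def class_ranges(feature_idx):
    """Return the (min, max) of one feature for each of the three species."""
    return [(X[y == c, feature_idx].min(), X[y == c, feature_idx].max())
            for c in np.unique(y)]

print("sepal width:", class_ranges(1))   # the three ranges overlap heavily
print("petal length:", class_ranges(2))  # Setosa's range is fully separate
```

Running this shows that Setosa's petal lengths never reach the minimum petal length of the other two species, which is exactly why petal length works as a differentiating factor.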