implementation of centeralize_data() and pca_components()
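A minimal sketch of how these two helpers could look, assuming NumPy arrays of shape (n_samples, n_features) (the exact signatures in the assignment template may differ):

```python
import numpy as np

def centeralize_data(X):
    """Subtract the per-feature mean so each column of X has zero mean."""
    mean = X.mean(axis=0)
    return X - mean, mean

def pca_components(X_centered, n_components):
    """Return the top n_components principal directions of the centered data.

    Uses the SVD of the centered data matrix; the rows of Vt are the
    principal axes, sorted by decreasing explained variance.
    """
    # economy-size SVD: X_centered = U @ np.diag(S) @ Vt
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return Vt[:n_components]  # shape: (n_components, n_features)
```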
Running plot_class_representatives yields the following result:
task 2: PCA transformation and reconstruction
part A
Implement pca_transform
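A minimal sketch of the forward transform, assuming the components matrix and mean come from the helpers sketched above (argument names are assumptions):

```python
def pca_transform(X, components, mean):
    """Project the data onto the principal axes (dimensionality reduction)."""
    # components has shape (n_components, n_features); mean has shape (n_features,)
    return (X - mean) @ components.T  # shape: (n_samples, n_components)
```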
part B
Implement pca_inverse_transform
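And a matching sketch of the inverse transform, mapping the reduced coordinates back to the original feature space:

```python
def pca_inverse_transform(X_reduced, components, mean):
    """Reconstruct points in the original feature space from their PCA coordinates."""
    return X_reduced @ components + mean  # shape: (n_samples, n_features)
```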
This yields the following TNC visualisation:
and LFW visualisation:
We also expect some loss of information during reconstruction:
task 3: average reconstruction error for LFW
$$\text{error} = \frac{1}{n}\sum_{i=1}^{n}\left\lVert x_i - \mathrm{reconstruct}(\mathrm{pca}(x_i))\right\rVert_2^2$$
part A
Plot the average reconstruction error on the training and testing data points.
Training code:
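A sketch of what such a loop could look like, reusing the helpers sketched above (the component grid and the variable names X_train and X_test are assumptions):

```python
import numpy as np

# Hypothetical grid of component counts; X_train and X_test are assumed to be loaded.
component_range = [1, 5, 10, 20, 40, 60, 80, 100]
train_errors, test_errors = [], []

for k in component_range:
    # Fit PCA on the training data only
    X_train_centered, mean = centeralize_data(X_train)
    components = pca_components(X_train_centered, k)

    # Average squared reconstruction error on the train and test sets
    for X, errors in ((X_train, train_errors), (X_test, test_errors)):
        Z = pca_transform(X, components, mean)
        X_hat = pca_inverse_transform(Z, components, mean)
        errors.append(np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```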
This yields the following graph on the training data:
The graph on the evaluation (test) data:
part B
Explain the difference between the two graphs.
What would the error be if we compute it for the TNC dataset while using two components and 2000 samples?
The following observations can be made:
Both curves decrease as the number of components increases (lower error means better reconstruction quality). However, the test error curve (red) lies above the train error curve (blue). This indicates some overfitting, given the smaller training set size (400) relative to the full LFW dataset (1288 entries).
Both curves show diminishing returns, and this effect is more pronounced for the test error.
As n_components increases, bias decreases (reconstruction improves for both train and test data). However, the test error decreases more slowly because the later components are less effective at reconstructing features of unseen data.
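For the second question, a minimal sketch of how the TNC error could be computed with two components over 2000 samples, reusing the helpers sketched earlier (X_TNC is assumed to be already loaded):

```python
import numpy as np

# X_TNC: the TNC data matrix with 2000 samples (assumed to be already loaded).
X_centered, mean = centeralize_data(X_TNC)
components = pca_components(X_centered, 2)

Z = pca_transform(X_TNC, components, mean)
X_hat = pca_inverse_transform(Z, components, mean)
tnc_error = np.mean(np.sum((X_TNC - X_hat) ** 2, axis=1))
print(tnc_error)
```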
The average reconstruction error for TNC is shown below:
task 4: Kernel PCA
part A
Apply Kernel PCA and plot the transformed data
We applied a StandardScaler to X_TNC and plotted a 3x4 grid, with slot (1,1) showing the original data, followed by 11 slots for gamma values ranging over [0.0001 ⋯ 1].
Kernel PCA was run with n_components=2.
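A sketch of this setup using scikit-learn's KernelPCA with the RBF kernel (the exact list of 11 gamma values and the variable names X_TNC, y_TNC are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

# 11 gamma values spanning [0.0001, 1]; the intermediate values are assumptions.
gammas = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1]

# X_TNC (features) and y_TNC (red/blue labels) are assumed to be already loaded.
X_scaled = StandardScaler().fit_transform(X_TNC)

fig, axes = plt.subplots(3, 4, figsize=(16, 12))
axes = axes.ravel()

# slot (1, 1): the original (scaled) data
axes[0].scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_TNC, cmap="bwr", s=10)
axes[0].set_title("Original data")

# remaining 11 slots: RBF Kernel PCA for each gamma
for ax, gamma in zip(axes[1:], gammas):
    X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma).fit_transform(X_scaled)
    ax.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y_TNC, cmap="bwr", s=10)
    ax.set_title(f"gamma = {gamma}")

plt.tight_layout()
plt.show()
```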
This yields the following graph:
part B
Based on your observations, how does Kernel PCA compare to Linear PCA on this dataset with red and blue labels? In what ways does Kernel PCA affect the distribution of the data points, particularly in terms of how well the red and blue points are organized? Choose the best value(s) for gamma and report it (them). What criteria did you use to determine the optimal gamma value?
Comparison:
Kernel PCA is more effective at capturing the non-linear relationships in the data: the blue and red circles are spread apart, which changes the data distribution. Linear PCA, in contrast, maintains the circular structure, meaning it does not alter the data distribution much.
Effects:
For small values of gamma [0.0001, 0.0005, 0.001] the points are highly concentrated, meaning the kernel is too wide (this makes sense given that gamma is inversely related to the squared kernel width, i.e. the standard deviation).
For gamma in [0.005 ⋯ 0.05], we notice a separation between the blue and red circles.
For gamma in [0.1, 0.2], we start to see features similar to the original data, albeit scaled down due to the RBF kernel.
At gamma in [0.5, 1], the data spreads out, forming elongated features.
Gamma in [0.1, 0.2] seems to provide the best representation of the original data.
Criteria:
class separation: how well the blue and red circles are separated from each other
compactness: how tightly clustered the points within each class are
structure preservation: how well the circular structure of the original dataset is preserved
dimensionality reduction: how well the data is projected into the lower-dimensional space
part C
Find the best gamma value with respect to the reconstruction error of Kernel PCA
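A sketch of such a sweep using scikit-learn's KernelPCA with fit_inverse_transform=True (the gamma grid, the component count, and X_train are assumptions):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Candidate gamma values, the component count, and X_train are assumptions.
gammas = [0.0001, 0.001, 0.01, 0.1, 1]
errors = {}

for gamma in gammas:
    kpca = KernelPCA(n_components=60, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True)
    Z = kpca.fit_transform(X_train)
    X_hat = kpca.inverse_transform(Z)
    errors[gamma] = np.mean(np.sum((X_train - X_hat) ** 2, axis=1))

best_gamma = min(errors, key=errors.get)
print(best_gamma, errors[best_gamma])
```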
The training loop yields the following:
part D
Visualisation of Reconstruction Error
How does kernel PCA compare to Linear PCA on this dataset? If Kernel PCA shows improved performance, please justify your answer. If Linear PCA performs better, explain the reasons for its effectiveness.
Reconstruction error for Kernel PCA and for Linear PCA:
Performance:
Linear PCA has a significantly lower reconstruction error than Kernel PCA (6.68 for Linear PCA versus 47.48 for Kernel PCA at gamma=0.01).
Regardless of gamma, Kernel PCA shows a much higher error.
Reasoning for Linear PCA:
Data characteristics: LFW most likely contains largely linear relationships between features (face images have strong linear correlations in pixel intensities and structure).
Dimensionality: this aligns with Task 3 Part B, where we observed the same error value at n_components=60 for Linear PCA.
Overfitting: Linear PCA is less prone to overfitting, whereas Kernel PCA may latch onto patterns specific to the training data (in this case, face features). Additionally, the RBF kernel is more sensitive to outliers.
Why Kernel PCA does not work as well:
Kernel: the RBF kernel assumes local, non-linear relationships. This may not suit facial data, given the strong linear correlations among facial features.
Gamma: gamma=0.01 achieves the lowest error among the kernel settings, yet it still underperforms Linear PCA.
Noise: non-linear kernel mappings are more prone to capturing noise or irrelevant patterns in the facial images.
question 2.
problem statement
“Driving high” is prohibited in the city, and the police have started using a tester that shows whether a driver is high on cannabis.
The tester is a binary classifier (1 for a positive result, 0 for a negative result) that is not always accurate:
if the driver is truly high, then the test will be positive with probability $1-\beta_1$ and negative with probability $\beta_1$ (so the probability of a wrong result is $\beta_1$ in this case)
if the driver is not high, then the test will be positive with probability $\beta_2$ and negative with probability $1-\beta_2$ (so the probability of a wrong result is $\beta_2$ in this case)
Assume the probability of (a randomly selected driver from the population) being “truly high” is $\alpha$
part 1
What is the probability that the tester shows a positive result for a (randomly selected) driver? (write your answer in terms of $\alpha, \beta_1, \beta_2$)
Probability of a driver being truly high: $P(\text{High}) = \alpha$
Probability of a driver not being high: $P(\text{Not High}) = 1-\alpha$
Probability of a positive test given the driver is high: $P(\text{Positive} \mid \text{High}) = 1-\beta_1$
Probability of a positive test given the driver is not high: $P(\text{Positive} \mid \text{Not High}) = \beta_2$
Using the law of total probability, the overall probability of a positive test result is:
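$$
P(\text{Positive}) = P(\text{Positive}\mid\text{High})\,P(\text{High}) + P(\text{Positive}\mid\text{Not High})\,P(\text{Not High}) = (1-\beta_1)\,\alpha + \beta_2\,(1-\alpha)
$$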
part 2
The police have collected test results for $n$ randomly selected drivers (i.i.d. samples). What is the likelihood that there are exactly $n_+$ positive samples among the $n$ samples? Write your solution in terms of $\alpha, \beta_1, \beta_2, n_+, n$.
Let the probability of a positive test result for a randomly selected driver be
$$p = P(\text{Positive}) = (1-\beta_1)\cdot\alpha + \beta_2\cdot(1-\alpha)$$
Now, apply the binomial distribution to find the likelihood of $n_+$ positive samples among the $n$ samples:
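$$
\mathcal{L}(\alpha) = P(N_+ = n_+) = \binom{n}{n_+}\, p^{\,n_+}\, (1-p)^{\,n-n_+}, \qquad p = (1-\beta_1)\,\alpha + \beta_2\,(1-\alpha)
$$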
part 3
What is the maximum likelihood estimate of $\alpha$ given a set of $n$ random samples, of which $n_+$ are positive results? In this part, you can assume that $\beta_1$ and $\beta_2$ are fixed and given. Simplify your final result in terms of $n, n_+, \beta_1, \beta_2$.