- Table of contents
TBD
e.g. for an SVM it is better to normalize the data to the range [-1, 1]
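As a quick sketch of that normalization (assuming NumPy is available; the function name is my own), a per-feature linear rescale into [-1, 1]:

```python
import numpy as np

def scale_to_range(X, lo=-1.0, hi=1.0):
    """Linearly rescale each feature (column) of X into [lo, hi]."""
    X = np.asarray(X, dtype=float)
    mins = X.min(axis=0)
    maxs = X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # avoid division by zero
    return lo + (hi - lo) * (X - mins) / span

X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
print(scale_to_range(X))
```

Note the min/max come from the training data; at prediction time the same mins/spans should be reused.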
- Use the feature’s mean value from all the available data.
- Fill in the unknown with a special value like -1.
- Ignore the instance.
- Use a mean value from similar items.
- Use another machine learning algorithm to predict the value.
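The first option above (fill with the feature's mean over the available data) can be sketched as follows, assuming NumPy and using NaN to mark unknowns:

```python
import numpy as np

def impute_mean(X):
    """Replace NaNs in each column with that column's mean over observed values."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)       # per-column mean, ignoring NaNs
    idx = np.where(np.isnan(X))             # positions of missing entries
    X[idx] = np.take(col_means, idx[1])     # fill with that column's mean
    return X

X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])
print(impute_mean(X))  # → [[1. 6.] [3. 4.] [2. 8.]]
```

Filling with a special value like -1 instead is just `np.nan_to_num(X, nan=-1.0)`.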
Background
- The relevant features may not be explicitly presented in the data.
- We have to identify the relevant features before we can apply other machine learning algorithms.
The reasons we want to simplify our data
- Making the dataset easier to use
- Reducing computational cost of many algorithms
- Removing noise
- Making the results easier to understand
- We assume that some unobservable latent variables are generating the data we observe.
- The data we observe is assumed to be a linear combination of the latent variables and some noise
- The number of latent variables may be lower than the number of observed features, which gives us the dimensionality reduction.
ICA stands for Independent Component Analysis
- ICA assumes that the data is generated by N sources, which is similar to factor analysis.
- The data is assumed to be a mixture of observations of the sources.
- The sources are assumed to be statistically independent, unlike PCA, which only assumes the data is uncorrelated.
- If there are fewer sources than the amount of our observed data, we'll get a dimensionality reduction.
- PCA
    - Find the eigenvalues of a matrix.
    - These eigenvalues tell us which features are most important in our data set.
- SVD
    - Find the singular values in $\Sigma$.
    - Singular values and eigenvalues are related: the singular values are the square roots of the eigenvalues of $AA^T$.
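The singular-value/eigenvalue relationship above can be checked numerically (NumPy sketch; `eigvalsh` is used because $AA^T$ is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))

# Singular values of A: the diagonal of Sigma in A = U @ Sigma @ V^T
sigma = np.linalg.svd(A, compute_uv=False)

# Eigenvalues of A @ A.T; only the top len(sigma) are nonzero
eigvals = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1][:len(sigma)]

# Singular values are the square roots of those eigenvalues
print(np.allclose(sigma, np.sqrt(np.clip(eigvals, 0, None))))  # → True
```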
One way to deal with imbalanced classification tasks is to alter the data used to train the classifier.
- Oversample: duplicate examples of the minority class
- Undersample: delete examples of the majority class
Scenario
- You want to preserve as much information as possible about the rare case (e.g. Credit card fraud)
- Keep all of the examples from the positive class
- Undersample or discard examples from the negative class
Drawback
- Deciding which negative examples to toss out. (You may throw out examples which contain valuable information)
Solution
- Pick the samples to discard from those that aren't near the decision boundary
- Use a hybrid approach of undersampling the negative class and oversampling the positive class
Approaches to oversampling the positive class:
- Replicate the existing examples
- Add new points similar to the existing points
- Add a data point interpolated between existing data points (can lead to overfitting)
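The last approach (interpolating between existing points) is the core idea behind SMOTE-style oversampling. A simplified sketch, assuming NumPy (real SMOTE interpolates only toward nearest neighbors; here pairs are chosen at random for brevity):

```python
import numpy as np

def oversample_interpolate(X_pos, n_new, rng=None):
    """Create n_new synthetic positive examples by interpolating between
    random pairs of existing positive examples (simplified SMOTE-style idea)."""
    rng = np.random.default_rng(rng)
    X_pos = np.asarray(X_pos, dtype=float)
    i = rng.integers(0, len(X_pos), size=n_new)
    j = rng.integers(0, len(X_pos), size=n_new)
    t = rng.random((n_new, 1))            # interpolation fraction in [0, 1)
    return X_pos[i] + t * (X_pos[j] - X_pos[i])

X_pos = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
synthetic = oversample_interpolate(X_pos, n_new=5, rng=0)
print(synthetic.shape)  # → (5, 2)
```

Every synthetic point lies on a segment between two real positives, which is also why this can lead to overfitting: the new points carry no genuinely new information.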
While some classification algorithms naturally permit the use of more than two classes, others are by nature binary algorithms; these can, however, be turned into multinomial classifiers by a variety of strategies.
Main Ideas
- Decompose the multiclass classification problem into multiple binary classification problems
- Use the majority voting principle (a combined decision from the committee) to predict the label
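A minimal sketch of that decomposition, assuming NumPy: a one-vs-one scheme that trains a toy nearest-centroid binary classifier per pair of classes and predicts by majority vote (all function names here are my own):

```python
import numpy as np
from itertools import combinations

def train_centroid_binary(X, y):
    """Toy binary classifier: remember the two class centroids."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)

def one_vs_one_fit(X, y):
    """Train one binary classifier per pair of classes (one-vs-one)."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = train_centroid_binary(X[mask], (y[mask] == b).astype(int))
    return models

def one_vs_one_predict(models, x):
    """Each pairwise classifier casts a vote; the majority wins."""
    votes = {}
    for (a, b), (c0, c1) in models.items():
        winner = b if np.linalg.norm(x - c1) < np.linalg.norm(x - c0) else a
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)

# Three well-separated clusters labeled 0, 1, 2
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
models = one_vs_one_fit(X, y)
print(one_vs_one_predict(models, np.array([5.0, 5.5])))  # → 1
```

The other common decomposition, one-vs-rest, instead trains one classifier per class against everything else and picks the class with the highest score.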
Tutorial:
- The error rate = the number of misclassified instances / the total number of instances tested.
- Measuring errors this way hides how instances were misclassified.
- With a confusion matrix you get a better understanding of the classification errors.
- If the off-diagonal elements are all zero, then you have a perfect classifier.
- Construct a confusion matrix: a table of actual labels vs. predicted labels.
These metrics are more useful than the error rate when detecting one class is more important than detecting another.
Consider a two-class problem. (Confusion matrix with different outcome labeled)
| Actual \ Predicted | +1 | -1 |
|---|---|---|
| +1 | True Positive (TP) | False Negative (FN) |
| -1 | False Positive (FP) | True Negative (TN) |
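Counting the four cells of this matrix is straightforward (NumPy sketch; the function name is my own):

```python
import numpy as np

def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP, TN for a two-class problem (labels +1 / -1)."""
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)
    tp = int(np.sum((actual == positive) & (predicted == positive)))
    fn = int(np.sum((actual == positive) & (predicted != positive)))
    fp = int(np.sum((actual != positive) & (predicted == positive)))
    tn = int(np.sum((actual != positive) & (predicted != positive)))
    return tp, fn, fp, tn

actual    = [ 1,  1,  1, -1, -1, -1]
predicted = [ 1,  1, -1,  1, -1, -1]
print(confusion_counts(actual, predicted))  # → (2, 1, 1, 2)
```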
- Precision = TP / (TP + FP)
    - Tells us the fraction of records that were actually positive among those the classifier predicted to be positive.
- Recall = TP / (TP + FN)
    - Measures the fraction of positive examples the classifier got right.
    - Classifiers with a large recall don't have many positive examples classified incorrectly.
- F₁ Score = 2 × (Precision × Recall) / (Precision + Recall)
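The three formulas above as a small Python sketch (the guard clauses for empty denominators are my own addition):

```python
def precision_recall_f1(tp, fn, fp):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 8 true positives, 2 false negatives, 4 false positives
p, r, f = precision_recall_f1(tp=8, fn=2, fp=4)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.667 0.8 0.727
```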
Summary:
- You can easily construct a classifier that achieves a high measure of recall or precision but not both.
- If you predicted everything to be in the positive class, you'd have perfect recall but poor precision.
Wiki - Receiver operating characteristic
ROC stands for Receiver Operating Characteristic
- The ROC curve shows how the true positive rate and false positive rate change as the threshold changes
- The ROC curve has two lines, a solid one and a dashed one.
- The solid line:
- the leftmost point corresponds to classifying everything as the negative class.
- the rightmost point corresponds to classifying everything in the positive class.
- The dashed line:
- the curve you'd get by randomly guessing.
- The ROC curve can be used to compare classifiers and make cost-versus-benefit decisions.
- Different classifiers may perform better for different threshold values
- The best classifier would sit as far toward the upper left as possible.
    - This would mean a high true positive rate for a low false positive rate.
AUC (Area Under the Curve): A metric to compare different ROC
- The AUC gives an average value of the classifier's performance and doesn't substitute for looking at the curve.
- A perfect classifier has an AUC of 1.0, and random guessing gives you 0.5.
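Building the ROC curve by sweeping the threshold can be sketched as below (NumPy, labels 1/0; tied scores are not handled specially in this simplified version, and the AUC is the trapezoidal area under the points):

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the decision threshold from high to low and record the
    (false positive rate, true positive rate) after each step."""
    order = np.argsort(scores)[::-1]          # highest score first
    labels = np.asarray(labels)[order]
    tpr = np.cumsum(labels == 1) / np.sum(labels == 1)
    fpr = np.cumsum(labels == 0) / np.sum(labels == 0)
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    widths = fpr[1:] - fpr[:-1]
    heights = (tpr[1:] + tpr[:-1]) / 2
    return float(np.sum(widths * heights))

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4])
labels = np.array([1,   1,   0,   1,   0,    0])
fpr, tpr = roc_points(scores, labels)
print(round(auc(fpr, tpr), 3))  # → 0.889
```

0.889 here equals the fraction of (positive, negative) pairs where the positive example out-scores the negative one (8 of 9), which is another way to read the AUC.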
- Elbow method
    - Plot the score against the number of groups (clusters); the "elbow" where improvement flattens indicates the best number of clusters.
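A sketch of the elbow method using a naive k-means and within-cluster SSE as the score (NumPy; function names and the toy data are my own, and this simple init can hit local optima):

```python
import numpy as np

def kmeans_sse(X, k, iters=20, seed=0):
    """Naive k-means; returns the within-cluster sum of squared errors (SSE)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

# Three well-separated 2-D clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in ([0, 0], [5, 5], [10, 0])])

sse = {k: kmeans_sse(X, k) for k in range(1, 6)}
# SSE keeps dropping as k grows; the elbow is where the drop flattens
# (for this data, typically at k = 3)
```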
- Different incorrect classifications will have different costs.
- This gives more weight to the smaller class, which when training the classifier will allow fewer errors in the smaller class
- There are many ways to include the cost information in classification algorithms
- AdaBoost
- Adjust the error weight vector D based on the cost function
- Naive Bayes
- Predict the class with the lowest expected cost instead of the class with the highest probability
- SVM
- Use different C parameters in the cost function for the different classes
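The Naive Bayes variant (predict the class with the lowest expected cost rather than the highest probability) is easy to sketch; the cost numbers below are illustrative, not from the text:

```python
import numpy as np

# Cost matrix: COST[actual, predicted]; missing the rare positive class
# is made 5x more costly than a false alarm (illustrative numbers)
COST = np.array([[0.0, 1.0],    # actual negative: predict neg / pos
                 [5.0, 0.0]])   # actual positive: predict neg / pos

def min_expected_cost(p_neg, p_pos):
    """Predict the class (0 = neg, 1 = pos) with the lowest expected cost
    instead of the class with the highest probability."""
    probs = np.array([p_neg, p_pos])
    expected = probs @ COST     # expected cost of predicting [neg, pos]
    return int(np.argmin(expected))

# 70% negative would normally mean "predict negative", but the high
# cost of a miss flips the decision toward the positive class
print(min_expected_cost(0.7, 0.3))  # → 1
```

The same weighting idea underlies the other two bullets: AdaBoost can fold costs into its weight vector D, and an SVM can use a larger C for the class whose errors are more expensive.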