Machine Learning Concepts

Data Preprocessing

Normalization

TBD

e.g. an SVM generally works better when the data is normalized to a common range such as -1 to 1
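As a rough illustration of scaling a feature to the range -1 to 1, here is a minimal min-max sketch in plain Python. The function name `scale_to_range` and the sample values are made up for this example; real projects would normally use a library scaler instead.

```python
def scale_to_range(values, low=-1.0, high=1.0):
    """Linearly rescale numbers so the minimum maps to `low` and the maximum to `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant feature has no spread; map every value to the midpoint.
        return [(low + high) / 2.0 for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


# Hypothetical feature column used only for illustration.
feature = [2.0, 4.0, 6.0, 10.0]
scaled = scale_to_range(feature)
print(scaled)  # [-1.0, -0.5, 0.0, 1.0]
```

The minimum and maximum should be computed on the training data only and then reused for the test data, so that no information leaks from the test set.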

Splitting Data

Missing Value

Options

  • Use the feature’s mean value from all the available data.
  • Fill in the unknown with a special value like -1.
  • Ignore the instance.
  • Use a mean value from similar items.
  • Use another machine learning algorithm to predict the value.
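The first three options above can be sketched in a few lines of plain Python. This assumes missing values are represented as `None`; the helper names and the sample data are hypothetical.

```python
def impute_mean(column):
    """Replace None with the mean of the observed values in the column."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def impute_constant(column, fill=-1):
    """Replace None with a special marker value such as -1."""
    return [fill if v is None else v for v in column]


def drop_incomplete(rows):
    """Ignore (drop) every instance that has at least one missing value."""
    return [row for row in rows if None not in row]


ages = [20, None, 40, 30]
print(impute_mean(ages))      # [20, 30.0, 40, 30]
print(impute_constant(ages))  # [20, -1, 40, 30]
print(drop_incomplete([(20, 1), (None, 0), (40, 1)]))  # [(20, 1), (40, 1)]
```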

Dimensionality Reduction

Background

  • The relevant features may not be explicitly presented in the data.
  • We have to identify the relevant features before we can begin to apply other machine learning algorithms

The reasons we want to simplify our data

  • Making the dataset easier to use
  • Reducing computational cost of many algorithms
  • Removing noise
  • Making the results easier to understand

Factor Analysis

  • We assume that some unobservable latent variables are generating the data we observe.
  • The data we observe is assumed to be a linear combination of the latent variables and some noise
  • The number of latent variables is possibly lower than the number of observed variables, which gives us the dimensionality reduction
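One common way to write the assumptions above as an equation is shown below. The symbols are notation chosen for this sketch, not taken from the notes: $x$ is an observed data point, $z$ the latent variables, $W$ the loading matrix, $\mu$ the mean, and $\epsilon$ the noise.

```latex
x = W z + \mu + \epsilon,
\qquad
x \in \mathbb{R}^{d},\quad
z \in \mathbb{R}^{k},\quad
W \in \mathbb{R}^{d \times k},\quad
k < d
```

Describing each $d$-dimensional observation by its $k$ latent values is what gives the dimensionality reduction.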

ICA

ICA stands for Independent Component Analysis

  • ICA assumes that the data is generated by N sources, which is similar to factor analysis.
  • The data is assumed to be a mixture of observations of the sources.
  • The sources are assumed to be statistically independent, which is a stronger requirement than in PCA, where the components only need to be uncorrelated.
  • If there are fewer sources than observed variables, we'll get a dimensionality reduction.

PCA vs SVD

  • PCA
    • find the eigenvalues of the covariance matrix of the data
      • these eigenvalues tell us which principal components (combinations of features) are most important in our data set
  • SVD
    • find the singular values in $\Sigma$
      • singular values and eigenvalues are related
      • the nonzero singular values of $A$ are the square roots of the nonzero eigenvalues of $AA^T$
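The relationship between singular values and eigenvalues follows directly from the decomposition. Assuming the usual notation $A = U \Sigma V^T$ with orthogonal $U$ and $V$ (so $V^T V = I$), a short derivation is:

```latex
A A^T
  = (U \Sigma V^T)(U \Sigma V^T)^T
  = U \Sigma V^T V \Sigma^T U^T
  = U (\Sigma \Sigma^T) U^T
\quad\Longrightarrow\quad
\lambda_i(A A^T) = \sigma_i^2,
\qquad
\sigma_i = \sqrt{\lambda_i(A A^T)}
```

Here $\sigma_i$ are the diagonal entries of $\Sigma$ and $\lambda_i(AA^T)$ the corresponding eigenvalues of $AA^T$.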

Label Encoding

Classification Imbalance

One way to deal with imbalanced classification tasks is to alter the data used to train the classifier.

  • Oversample: duplicate examples
  • Undersample: delete examples

Scenario

  • You want to preserve as much information as possible about the rare case (e.g. Credit card fraud)
    • Keep all of the examples from the positive class
    • Undersample or discard examples from the negative class

Drawback

  • You have to decide which negative examples to toss out. (You may throw out examples which contain valuable information)

Solution

  1. Pick samples to discard that aren't near the decision boundary
  2. Use a hybrid approach of undersampling the negative class and oversampling the positive class

(There are several approaches to oversampling the positive class)

  • Replicate the existing examples
  • Add new points similar to the existing points
  • Add a data point interpolated between existing data points (can lead to overfitting)
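A minimal sketch of the hybrid approach, assuming random undersampling of the negative class and oversampling of the positive class by replication. The function `rebalance`, its parameters, and the example labels are hypothetical; choosing negatives by distance to the decision boundary is not shown.

```python
import random


def rebalance(positives, negatives, n_positives, n_negatives, seed=0):
    """Return (oversampled positives, undersampled negatives).

    Assumes n_negatives <= len(negatives); positives are never discarded.
    """
    rng = random.Random(seed)
    # Undersample: keep a random subset of the negatives, without replacement.
    kept_negatives = rng.sample(negatives, n_negatives)
    # Oversample: keep every positive and replicate random ones until the target size.
    n_extra = max(0, n_positives - len(positives))
    extra = [rng.choice(positives) for _ in range(n_extra)]
    return positives + extra, kept_negatives


positives = ["fraud_1", "fraud_2"]
negatives = [f"ok_{i}" for i in range(10)]
new_pos, new_neg = rebalance(positives, negatives, n_positives=4, n_negatives=4)
print(len(new_pos), len(new_neg))  # 4 4
```

Replicated points add no new information, which is why interpolated or perturbed points are sometimes used instead, at the risk of overfitting noted above.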

Model Expansion

Binary to Multi-class

While some classification algorithms naturally permit the use of more than two classes, others are by nature binary algorithms; these can, however, be turned into multinomial classifiers by a variety of strategies.

Main Ideas

  • Decompose the multiclass classification problem into multiple binary classification problems
  • Use the majority voting principle (a combined decision from the committee) to predict the label

One-vs-rest (one-vs-all) Approaches

Tutorial:

Pairwise (one-vs-one, all-vs-all) Approaches
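A minimal sketch of the pairwise approach combined with the majority vote described under Main Ideas. It assumes one already-trained binary classifier per pair of classes; here these are stand-in threshold functions on a single number, and all names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise_classifiers):
    """Ask every pairwise classifier for its winner and return the most voted class."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = pairwise_classifiers[(a, b)](x)  # returns either a or b
        votes[winner] += 1
    # Ties are broken by the order in which classes first received a vote.
    return votes.most_common(1)[0][0]


classes = ["low", "mid", "high"]
# Stand-ins for trained binary classifiers, one per pair of classes.
pairwise_classifiers = {
    ("low", "mid"): lambda x: "low" if x < 3 else "mid",
    ("low", "high"): lambda x: "low" if x < 5 else "high",
    ("mid", "high"): lambda x: "mid" if x < 7 else "high",
}

print(one_vs_one_predict(1, classes, pairwise_classifiers))  # low
print(one_vs_one_predict(4, classes, pairwise_classifiers))  # mid
print(one_vs_one_predict(9, classes, pairwise_classifiers))  # high
```

With K classes this needs K(K-1)/2 binary classifiers, whereas one-vs-rest needs only K.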

Model Evaluation

Classification

Accuracy (Error Rate)

  • The error rate = the number of misclassified instances / the total number of instances tested.
  • Measuring errors this way hides how instances were misclassified.

Confusion Matrix

Wiki - Confusion Matrix

  • With a confusion matrix you get a better understanding of the classification errors.

  • If the off-diagonal elements are all zero, then you have a perfect classifier

  • Construct a confusion matrix: a table of actual labels vs. predicted labels
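To make the construction concrete, here is a small sketch in plain Python that builds a confusion matrix (rows are actual labels, columns are predicted labels) and the error rate from the same data. The labels and predictions are invented for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return matrix[i][j] = number of instances with actual labels[i] predicted as labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def error_rate(actual, predicted):
    """Misclassified instances divided by the total number of instances tested."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


actual = ["cat", "cat", "dog", "dog", "dog", "bird"]
predicted = ["cat", "dog", "dog", "dog", "bird", "bird"]
labels = ["bird", "cat", "dog"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(error_rate(actual, predicted))
```

The error rate only says that two of six instances were wrong; the off-diagonal entries show that one cat was taken for a dog and one dog for a bird.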

Precision, Recall Ratio

These metrics are more useful than the error rate when detecting one class is more important than detecting another.

Consider a two-class problem. (Confusion matrix with the different outcomes labeled)

Actual \ Predicted | +1                  | -1
+1                 | True Positive (TP)  | False Negative (FN)
-1                 | False Positive (FP) | True Negative (TN)

  • Precision = TP / (TP + FP)

    • Tells us the fraction of the records the classifier predicted to be positive that actually were positive.
  • Recall = TP / (TP + FN)

    • Measures the fraction of positive examples the classifier got right.
    • Classifiers with a large recall don't have many positive examples classified incorrectly.
  • F₁ Score = 2 × (Precision × Recall) / (Precision + Recall)

Summary:

  • You can easily construct a classifier that achieves a high measure of recall or precision but not both.
  • If you predicted everything to be in the positive class, you'd have perfect recall but poor precision.
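The trade-off in the summary can be checked with a few lines of Python. This sketch assumes a hypothetical test set of 100 instances with 10 positives and a classifier that predicts everything as positive; the function name is made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Everything is predicted positive: all 10 positives are found (no false negatives),
# but the 90 negatives all become false positives.
precision, recall, f1 = precision_recall_f1(tp=10, fp=90, fn=0)
print(precision, recall, round(f1, 4))
```

Recall is perfect at 1.0 while precision is only 0.1, and the F₁ score (about 0.18) stays low because it is dominated by the weaker of the two.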

ROC curve

Wiki - Receiver operating characteristic

ROC stands for Receiver Operating Characteristic

  • The ROC curve shows how the true positive rate and the false positive rate change as the threshold changes
  • A typical ROC plot has two lines, a solid one and a dashed one.
    • The solid line:
      • the leftmost point corresponds to classifying everything as the negative class.
      • the rightmost point corresponds to classifying everything in the positive class.
    • The dashed line:
      • the curve you'd get by randomly guessing.
  • The ROC curve can be used to compare classifiers and make cost-versus-benefit decisions.
    • Different classifiers may perform better for different threshold values
  • The best classifier's curve would be as close to the upper left corner as possible.
    • This would mean that you had a high true positive rate for a low false positive rate.

AUC (Area Under the Curve): A metric to compare different ROC curves

  • The AUC gives an average value of the classifier's performance and doesn't substitute for looking at the curve.
  • A perfect classifier would have an AUC of 1.0, and random guessing will give you a 0.5.
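As an illustration of how the curve and the AUC come from a threshold sweep, here is a minimal sketch in plain Python. It assumes labels are 1 (positive) or 0 (negative), that both classes are present, and that a higher score means "more likely positive"; the scores below are invented.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, lowering the threshold step by step."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: everything negative
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all instances sharing this score are counted.
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores)
print(points[0], points[-1])  # (0.0, 0.0) (1.0, 1.0)
print(round(auc(points), 4))  # 0.8889
```

The first point corresponds to classifying everything as negative and the last to classifying everything as positive, matching the leftmost and rightmost points described above.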

Regression

Mean Absolute Error (MAE)

Mean Squared Error (MSE)

Root Mean Squared Error (RMSE)
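The three regression metrics listed above can be sketched with their standard definitions in plain Python. The function names and the sample values are chosen only for this example.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error: average of (actual - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE, in the units of the target."""
    return math.sqrt(mse(y_true, y_pred))


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
print(mae(y_true, y_pred))   # 1.0
print(mse(y_true, y_pred))   # 1.5
print(round(rmse(y_true, y_pred), 4))
```

Because the errors are squared, MSE and RMSE penalize the single error of 2 more heavily than MAE does.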

Clustering

Within Groups Sum of Squares

  • Elbow method
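A minimal sketch of the within-groups sum of squares, assuming points are tuples of numbers and cluster assignments are already given (the clustering step itself is not shown). The function name and data are hypothetical.

```python
def within_group_sum_of_squares(points, assignments):
    """Sum of squared distances from each point to the centroid of its own cluster."""
    clusters = {}
    for point, cluster in zip(points, assignments):
        clusters.setdefault(cluster, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(within_group_sum_of_squares(points, [0, 0, 0, 0]))  # 104.0 with one cluster
print(within_group_sum_of_squares(points, [0, 0, 1, 1]))  # 4.0 with two clusters
```

For the elbow method, this value is computed for an increasing number of clusters, and the number where the decrease flattens out (here, after two) is taken as a reasonable choice.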

Mean Silhouette Coefficient of all samples

Calinski-Harabasz score

  • Max score => the best number of groups (clusters)

Fitting and Model Complexity

Overfitting

Underfitting

Generalization

Regularization

Reducing Loss

Learning Rate

Gradient Descent

Other Learning Methods

Cost-sensitive Learning

  • Different kinds of incorrect classification have different costs.
  • Assigning a higher cost to errors on the smaller class gives that class more weight, so the trained classifier makes fewer errors on it
  • There are many ways to include the cost information in classification algorithms
    • AdaBoost
      • Adjust the error weight vector D based on the cost function
    • Naive Bayes
      • Predict the class with the lowest expected cost instead of the class with the highest probability
    • SVM
      • Use different C parameters in the cost function for the different classes
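The Naive Bayes variant above can be sketched as a decision rule on top of any probabilistic classifier. This assumes the posterior probabilities are already available; the cost values, class names, and function name are invented for illustration.

```python
def min_expected_cost_class(posteriors, cost):
    """Pick the prediction with the lowest expected cost.

    posteriors: {class: P(class | x)}
    cost: cost[predicted][actual] = cost of predicting `predicted` when the truth is `actual`
    """
    expected = {
        predicted: sum(posteriors[actual] * cost[predicted][actual] for actual in posteriors)
        for predicted in posteriors
    }
    return min(expected, key=expected.get), expected


# Hypothetical posteriors for one transaction: "legit" is the more probable class.
posteriors = {"fraud": 0.2, "legit": 0.8}
# Hypothetical costs: missing a fraud is ten times worse than a false alarm.
cost = {
    "fraud": {"fraud": 0.0, "legit": 1.0},
    "legit": {"fraud": 10.0, "legit": 0.0},
}

choice, expected = min_expected_cost_class(posteriors, cost)
print(choice)  # fraud
```

Predicting "fraud" has an expected cost of 0.8 against 2.0 for "legit", so the rule picks the less probable class because mistakes on it are more expensive.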

Lazy Learning

Incremental Learning (Online Learning)

Multi-label Classification

Wiki - Multi-label Classification