Make a classifier

Using data from the Spotify API, train a classifier to tell Taylor Swift from Beyoncé.

1. Getting some data

1.1 Load a dataset

First, read the .csv files located in the python_scratch folder (on Google Drive) into pandas dataframes. If possible, load the dataframes into the following variables:
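The variable names this step refers to are not shown here, so everything named below is hypothetical: the helper `load_tracks`, the variables `taylor_df` and `beyonce_df`, and the file names. A minimal sketch, assuming one .csv file per artist inside a mounted Google Drive folder:

```python
from pathlib import Path

import pandas as pd


def load_tracks(csv_path):
    """Read one artist's track table from a .csv file into a DataFrame."""
    return pd.read_csv(Path(csv_path))


# Hypothetical usage; adjust the folder and file names to match your Drive:
# data_dir = Path("/content/drive/MyDrive/python_scratch")
# taylor_df = load_tracks(data_dir / "taylor_swift.csv")
# beyonce_df = load_tracks(data_dir / "beyonce.csv")
```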

1.2 Features and Labels

Now, the important part: you need to choose which features to use to train your model.

1.2.0 Look at the data:

1.2.1 Features

Choose features for both Taylor and Beyoncé and concatenate them into a features variable. (Make sure you pick the same features for both artists.)
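As a rough illustration, the sketch below builds two tiny made-up tables in place of the loaded dataframes, selects the same columns from each, and stacks the rows. The column names (`danceability`, `energy`, `valence`) are an assumption about what your .csv files contain; swap in whichever numeric columns you actually have.

```python
import pandas as pd

# Tiny made-up tables standing in for the two loaded dataframes.
taylor_df = pd.DataFrame({
    "name": ["song a", "song b", "song c"],
    "danceability": [0.60, 0.55, 0.70],
    "energy": [0.70, 0.45, 0.80],
    "valence": [0.50, 0.30, 0.65],
})
beyonce_df = pd.DataFrame({
    "name": ["song d", "song e"],
    "danceability": [0.75, 0.65],
    "energy": [0.85, 0.60],
    "valence": [0.55, 0.40],
})

# Assumed column names; use the numeric columns your CSVs really contain.
feature_columns = ["danceability", "energy", "valence"]

# Select the same columns from both artists and stack the rows.
features = pd.concat(
    [taylor_df[feature_columns], beyonce_df[feature_columns]],
    ignore_index=True,
)
print(features.shape)  # (5, 3)
```

The order of concatenation matters: the labels have to follow the same order (all Taylor rows first, then all Beyoncé rows).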

1.2.2 Labels

You also need to make your labels variable. The labels must have the same length as the number of rows in the features variable.
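One possible way to build the labels, assuming an arbitrary encoding of 0 for Taylor Swift and 1 for Beyoncé and hypothetical row counts:

```python
import numpy as np

# Hypothetical row counts; in practice use the lengths of your two dataframes.
n_taylor, n_beyonce = 3, 2

# Arbitrary encoding: 0 = Taylor Swift, 1 = Beyonce, listed in the same
# order in which the feature rows were concatenated.
labels = np.concatenate([
    np.zeros(n_taylor, dtype=int),
    np.ones(n_beyonce, dtype=int),
])
print(labels)  # [0 0 0 1 1]
```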

2. Split data into Train and Test sets

You can choose how much of the dataset goes to the test set with the test_size parameter; the rest goes to the train set.

Use:
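A sketch of the split with scikit-learn's `train_test_split`, run here on random stand-in data rather than the real Spotify features. The names `X_test` and `y_test`, and the `stratify` and `random_state` arguments, are choices made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 samples with 3 features, 50 samples per class.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
labels = np.repeat([0, 1], 50)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=0.2,    # 20% of the rows go to the test set
    stratify=labels,  # keep the artist ratio similar in both sets
    random_state=0,   # make the split reproducible
)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```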

3. Train your model

3.1. Load your model

We will use the SVC class (aka Support Vector Classifier) available within the svm (aka Support Vector Machine) module of the sklearn library.

You can fiddle with the C and gamma parameters, say, by changing them to some value (C=50, gamma=0.01), or just leave the defaults (i.e., just run SVC() without arguments).
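For reference, both options look like this. C sets the regularization strength (a larger C means less regularization) and gamma sets the reach of the default RBF kernel.

```python
from sklearn.svm import SVC

# Defaults: C=1.0, kernel="rbf", gamma="scale".
default_model = SVC()

# The example values mentioned above.
model = SVC(C=50, gamma=0.01)
print(model.get_params()["C"], model.get_params()["gamma"])  # 50 0.01
```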

Optionally, you can perform a grid search for hyperparameter optimization. (See below)

3.2. Train the model

Run the fit() method of the SVC class passing as arguments your X_train and y_train variables, i.e., your feature training set and its labels.

3.3 Get your Accuracy score

Run the score() method of the SVC class with the test set (features and labels) to estimate the model's accuracy.
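Training and scoring can be sketched together as below. The data comes from `make_classification`, a synthetic stand-in for the Spotify features, so the printed accuracy says nothing about the real task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the Spotify features and artist labels.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = SVC()
model.fit(X_train, y_train)             # learn from the training set only
accuracy = model.score(X_test, y_test)  # fraction of test samples classified correctly
print(f"test accuracy: {accuracy:.2f}")
```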

4. Make a prediction

To make a prediction, you need to pick a random sample from the feature test set, and run the predict method of the SVC class (i.e., of your model).

NOTE: You can also use the same random number to index into the test set labels so that you can check whether your prediction is correct.
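A minimal sketch of that check, again on synthetic stand-in data. It assumes the test set is a NumPy array; with a pandas dataframe you would select the row with `X_test.iloc[[i]]` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data and a fitted model.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = SVC().fit(X_train, y_train)

# Pick one random row of the test set.
rng = np.random.default_rng()
i = rng.integers(len(X_test))

# predict() expects a 2D array, so slice with i:i + 1 to keep the row dimension.
predicted = model.predict(X_test[i:i + 1])[0]
print("predicted:", predicted, "actual:", y_test[i])
```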

5. Grid Search for best parameters (Optional)

Here you first run the grid search, and then make sure you use the best model that the search returns.

To perform a grid search, you need to load the GridSearchCV class from the model_selection module of the sklearn library.

  1. Make a dictionary holding two keys, C and gamma, whose values are arrays holding something like [10, 50, 25, 75] for the former and [0.05, 0.01, 0.001, 0.0055] for the latter.
  2. Instantiate the GridSearchCV class with two arguments: the first is your model, the second is the dictionary we just made. You can also pass a verbose=1 parameter. See the User Guide for more information. Make sure you store the instantiation in a variable such as, e.g., grid = GridSearchCV(...)
  3. Then, you can call the fit method, as if it were our model, and pass in the training features and labels (keep the test set aside for the final score).
  4. After a while, you get some printouts, and you can now ask your grid object for the attribute grid.best_estimator_, which is the best model the grid could find given those parameters.
  5. Finally, you can ask for the score method of such best model, and compare to your previous result without grid searching
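The steps above can be sketched as follows, on synthetic stand-in data. The search is fit on the training split so the test set stays unseen until the final score; by default `GridSearchCV` uses 5-fold cross-validation on whatever data it is given.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the Spotify features and artist labels.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate values for the two hyperparameters.
param_grid = {
    "C": [10, 25, 50, 75],
    "gamma": [0.05, 0.01, 0.0055, 0.001],
}

grid = GridSearchCV(SVC(), param_grid, verbose=1)
grid.fit(X_train, y_train)  # cross-validates every C/gamma combination

best_model = grid.best_estimator_  # refit on the whole training set
print(grid.best_params_)
print(f"test accuracy: {best_model.score(X_test, y_test):.2f}")
```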

5.2. Run a prediction again

Now that you have your best parameters and a good score, you can run a prediction again with the best model and test it for yourself. Go back to step 4.

6. Interpreting Results

You can now make a classification report and a confusion matrix.

6.1 Classification Report

For the report, you need to call the classification_report function of the metrics module within the sklearn library.

Then, you need to:

  1. run a prediction using the test set
  2. run the classification_report function with:

    • the true labels of the test set (arg 1)
    • the predicted labels of the entire test set (arg 2)
    • the two names of the labels using the target_names= parameter
  3. finally, you print the report to screen
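Put together, the report steps might look like this on synthetic stand-in data. The order of `target_names` has to follow the sorted label values, so this sketch assumes 0 = Taylor Swift and 1 = Beyoncé.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data and a fitted model.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = SVC().fit(X_train, y_train)

# Predict the whole test set, then compare against the true labels.
y_pred = model.predict(X_test)
report = classification_report(
    y_test, y_pred, target_names=["Taylor Swift", "Beyonce"]
)
print(report)
```

The report lists precision, recall, F1-score and support (number of true samples) per label.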

6.2 Confusion Matrix

To make a confusion matrix, you need to import the confusion_matrix function from the metrics module of the sklearn library.

You will also need to import the plotting libraries matplotlib.pyplot and seaborn (for its heatmap() function).

  1. Pass the true labels and the predicted labels to the confusion_matrix function. This function returns a matrix; call it mat.
  2. Run the heatmap function with the resulting mat matrix. (NOTE: confusion_matrix puts the true labels on the rows and the predicted labels on the columns, so pass mat.T if you want the true labels along the x-axis.)
  3. Useful parameters for the heatmap function are:
    square=True # so it's a nice square
    annot=True  # so we can read the values on the cells
    cbar=True   # so we have some color bar for reference
    fmt='d'     # nice and round integers
    xticklabels=[] # arrays with the names of our labels
    yticklabels=[] # arrays with the names of our labels
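A sketch of the whole plot, using short hypothetical label lists instead of real predictions. The axis titles and the output file name are choices made for this example.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = Taylor Swift, 1 = Beyonce).
y_test = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]
names = ["Taylor Swift", "Beyonce"]

# Rows are true labels, columns are predicted labels.
mat = confusion_matrix(y_test, y_pred)

# After the transpose, rows (y-axis) are predicted labels and
# columns (x-axis) are true labels.
ax = sns.heatmap(
    mat.T, square=True, annot=True, cbar=True, fmt="d",
    xticklabels=names, yticklabels=names,
)
ax.set_xlabel("true label")
ax.set_ylabel("predicted label")
plt.savefig("confusion_matrix.png")  # in a notebook, plt.show() also works
```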
    

8. Extra

Now you can think of ways of extending this binary classifier to more than 2 groups: could we add Nirvana, the Rolling Stones and the Beatles to the mix and see if we can tell Nirvana from Taylor Swift?
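As a starting point, SVC accepts more than two label values without any change to the code. The sketch below uses three synthetic clusters in place of three artists; the label-to-artist mapping is made up for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic stand-in: 3 clusters playing the role of 3 artists.
X, y = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)
artist_names = ["Taylor Swift", "Beyonce", "Nirvana"]  # labels 0, 1, 2

model = SVC().fit(X, y)
print(model.classes_)  # [0 1 2]

# Map the predicted integer label back to an artist name.
predicted = model.predict(X[:1])[0]
print(artist_names[predicted])
```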