Description
Theory
- Briefly explain the difference between batch gradient descent and stochastic gradient descent. When would you prefer one over the other?
- In gradient descent, you need to stop training when the model converges. How will you know that your model has converged? State your stopping criterion and apply it throughout your implementation.
- What will be the effect of the bias term in BGD/SGD learning?
- True/False: Stochastic gradient descent performs less computation per update than batch gradient descent. Give brief reasoning.
- Why do we randomly shuffle the training examples before using SGD optimization?
- Unlike log loss, the hinge loss introduced in class is not differentiable everywhere. Show how the hinge loss function can nevertheless be used in the gradient descent optimization algorithm. What will be the gradient of the hinge loss function without a regularization term?
- What will be the gradient of the log and hinge loss functions if you add the L2 regularization term (1/2)λ||w||²?
- Why is regularization used? What is a potential issue with L2 regularization if the λ hyperparameter can take negative values?
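As a concrete reference point for these questions, the short sketch below evaluates both loss functions on a single toy example, assuming the common convention of labels y ∈ {-1, +1} and the margin form of log loss; the weights and inputs are illustrative only.

```python
import numpy as np

def log_loss(w, b, x, y):
    # Log loss for one example with label y in {-1, +1},
    # written in margin form: log(1 + exp(-y * (w.x + b))).
    margin = y * (np.dot(w, x) + b)
    return np.log1p(np.exp(-margin))

def hinge_loss(w, b, x, y):
    # Hinge loss for one example: max(0, 1 - y * (w.x + b)).
    margin = y * (np.dot(w, x) + b)
    return max(0.0, 1.0 - margin)

w = np.array([0.5, -0.25])
b = 0.1
x = np.array([1.0, 2.0])

# Here w.x + b = 0.1, so the margin for y = +1 is 0.1.
print(log_loss(w, b, x, 1))    # log(1 + e^{-0.1}) ≈ 0.644
print(hinge_loss(w, b, x, 1))  # 1 - 0.1 = 0.9
```

Note that both losses penalize small positive margins, but only the hinge loss is exactly zero once the margin reaches 1.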
3 Batch Gradient Descent
3.1 Algorithm
Write down the batch gradient descent algorithm for log loss in appropriate algorithmic format.
3.2 Implementation
- Implement batch gradient descent algorithm for ‘log’ and ‘hinge’ loss functions.
- Apply L2 regularization for both loss functions. Note that a hyperparameter λ is used which controls the amount of regularization. Also, tune the value of λ when you are using regularization.
- Both implementations will have the ‘bias term’.
This part can be completed by editing only bgd.py, unless you plan to extract new features for the task. For each of the loss functions, you need to initialize the parameters (the __init__() method), learn them from the training data (the fit() method), and use the learned parameters to classify new instances (the predict() method). Note that __init__.py, main.py and classifier.py may be replaced while grading; do not make any modifications to these files. Include the results obtained on the given test set for both loss functions in your report. You should submit your code with the hyperparameter settings (config.py) that produce the best performance.
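To make the expected fit/predict structure concrete, here is a minimal sketch of batch gradient descent for log loss with an optional L2 term. The class name and hyperparameters (lr, epochs, lam) are illustrative assumptions; your bgd.py must follow the interface defined in classifier.py.

```python
import numpy as np

class BGDLogLoss:
    """Illustrative batch gradient descent for log loss; labels in {-1, +1}."""

    def __init__(self, lr=0.1, epochs=500, lam=0.0):
        self.lr, self.epochs, self.lam = lr, epochs, lam
        self.w, self.b = None, 0.0

    def fit(self, X, y):
        n, d = X.shape
        self.w = np.zeros(d)
        for _ in range(self.epochs):
            margins = y * (X @ self.w + self.b)
            # d/dm log(1 + exp(-m)) = -sigmoid(-m); averaged over the full batch.
            coef = -1.0 / (1.0 + np.exp(margins))            # shape (n,)
            grad_w = (coef * y) @ X / n + self.lam * self.w  # L2 term on w only
            grad_b = np.mean(coef * y)                       # bias not regularized
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b

    def predict(self, X):
        return np.where(X @ self.w + self.b >= 0, 1, -1)
```

Leaving the bias out of the regularization term, as above, is one common convention; state whichever choice you make in your report.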
Output format is given below. Your final numbers might differ. The given numbers are just a guideline, meant for sanity checking.
$ python3 main_bgd.py
BGD Log Loss (No Regularization) Results:
Accuracy: 74.09, Precision: 80.17, Recall: 62.83, F1: 70.45
BGD Log Loss (With Regularization) Results:
Accuracy: 76.41, Precision: 80.31, Recall: 68.91, F1: 74.18
BGD Hinge Loss (No Regularization) Results:
Accuracy: 77.41, Precision: 79.84, Recall: 72.29, F1: 75.88
BGD Hinge Loss (With Regularization) Results:
Accuracy: 78.07, Precision: 79.70, Recall: 74.32, F1: 76.92
3.3 BGD Analysis
You need to perform additional experiments to answer the following questions. You do not need to submit your code for this part; include only your plots and discussion in your report. Make sure the code you do submit does not include any changes you do not want included, as stray changes might affect your chances of getting extra credit.
You need to record the training and test accuracy after every weight update. Let’s define this as one epoch.
- Plot training and test accuracy vs. epoch without regularization, with one graph for each type of loss function. A sample graph is attached at the end of this handout. Two graphs are expected for this question.
- Run the same experiments with regularization and again plot the training and test accuracy for each epoch, for each type of loss function. Two graphs are expected for this question.
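One way to organize these experiments is to record both accuracies into lists during training and plot the lists afterward. The sketch below uses batch GD on log loss with illustrative toy hyperparameters; the matplotlib calls are shown commented because they are an assumption about your plotting setup.

```python
import numpy as np

def accuracy(w, b, X, y):
    # Fraction of examples whose predicted sign matches the {-1, +1} label.
    return np.mean(np.where(X @ w + b >= 0, 1, -1) == y)

def train_and_record(X_tr, y_tr, X_te, y_te, lr=0.5, epochs=50):
    """Batch GD on log loss, recording train/test accuracy after every update."""
    w, b = np.zeros(X_tr.shape[1]), 0.0
    train_acc, test_acc = [], []
    for _ in range(epochs):
        coef = -1.0 / (1.0 + np.exp(y_tr * (X_tr @ w + b)))
        w -= lr * ((coef * y_tr) @ X_tr) / len(y_tr)
        b -= lr * np.mean(coef * y_tr)
        train_acc.append(accuracy(w, b, X_tr, y_tr))  # one point per epoch
        test_acc.append(accuracy(w, b, X_te, y_te))
    return train_acc, test_acc

# Plotting (assuming matplotlib is available):
# import matplotlib.pyplot as plt
# plt.plot(train_acc, label="train"); plt.plot(test_acc, label="test")
# plt.xlabel("epoch"); plt.ylabel("accuracy"); plt.legend()
# plt.savefig("bgd_log.png")
```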
4 Stochastic Gradient Descent Implementation
4.1 Algorithm
Write down the stochastic gradient descent algorithm for hinge loss in appropriate algorithmic format.
4.2 Implementation
- Implement stochastic gradient descent algorithm for ’log’ and ’hinge’ loss functions.
- Apply L2 regularization for both loss functions. Note that a hyperparameter λ is used which controls the amount of regularization. Also, tune the value of λ when you are using regularization.
- Both implementations will have the ’bias term’.
This part can be completed by editing only sgd.py, unless you plan to extract new features for the task. For each of the loss functions, you need to initialize the parameters (the __init__() method), learn them from the training data (the fit() method), and use the learned parameters to classify new instances (the predict() method). Note that __init__.py, main.py and classifier.py may be replaced while grading; do not make any modifications to these files. Include the results obtained on the given test set for both loss functions in your report. You should submit your code with the hyperparameter settings (config.py) that produce the best performance.
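For contrast with the batch version, here is a minimal sketch of the per-example SGD update with the hinge-loss subgradient, including the shuffle before every pass. Function and hyperparameter names are illustrative assumptions; your sgd.py must match the provided classifier interface.

```python
import numpy as np

def sgd_hinge(X, y, lr=0.01, epochs=20, lam=0.0, seed=0):
    """Illustrative SGD with the hinge-loss subgradient; labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):          # shuffle before every pass
            if y[i] * (X[i] @ w + b) < 1:     # inside the margin: loss is active
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            elif lam > 0.0:                   # outside the margin: only the L2 term
                w -= lr * lam * w
    return w, b
```

The branch on the margin reflects that the hinge loss has zero (sub)gradient whenever y(w·x + b) ≥ 1, so only the regularizer contributes there.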
Output format is given below. Your final numbers might differ. The given numbers are just a guideline, meant for sanity checking.
$ python3 main_sgd.py
SGD Log Loss (No Regularization) Results:
Accuracy: 75.42, Precision: 80.83, Recall: 65.54, F1: 72.38
SGD Log Loss (With Regularization) Results:
Accuracy: 76.41, Precision: 80.31, Recall: 68.91, F1: 74.18
SGD Hinge Loss (No Regularization) Results:
Accuracy: 77.41, Precision: 79.84, Recall: 72.29, F1: 75.88
SGD Hinge Loss (With Regularization) Results:
Accuracy: 78.07, Precision: 79.70, Recall: 74.32, F1: 76.92
4.3 SGD Analysis
You need to perform additional experiments to answer the following questions. You do not need to submit your code for this part; include only your plots and discussion in your report. Make sure the code you do submit does not include any changes you do not want included, as stray changes might affect your chances of getting extra credit.
You need to record the training and test accuracy after every weight update. Let’s define this as one epoch.
- Plot training and test accuracy vs. epoch without regularization, with one graph for each type of loss function. A sample graph is attached at the end of this handout. Two graphs are expected for this question.
- Run the same experiments with regularization and again plot the training and test accuracy for each epoch, for each type of loss function. Two graphs are expected for this question.
7 Extra Credit
7.1 Top 10% on Leaderboard
Your submission will be evaluated on a hidden dataset of similar nature, size, and splits. The top 10% of students in the class (17 students) will be awarded 5 extra credit points. You are not allowed to use any optimization libraries, but you may use feature extraction software. If in doubt, ask a question on Piazza and the instructors will respond. Remember that the extra credit depends only on the results on the hidden dataset; overfitting the given dataset might prove counterproductive.
Figure 1: Sample graph. Curves are randomly drawn.
7.2 K-Fold Cross Validation
Perform K-fold cross-validation (with K = 5) using the train set to evaluate any one of the models (BGD or SGD). Any loss function can be used. Include a file called cv.py containing the implementation, and report the results in your homework report.
- Randomly split your entire dataset into K "folds".
- For each fold, build your model on the other K-1 folds of the dataset; then test the model on the held-out fold to check its effectiveness.
- Record the accuracy you observe on each of the predictions.
- Repeat this until each of the K folds has served as the test set.
- The average of your K recorded accuracies is called the cross-validation accuracy and serves as the performance metric for the model.
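The steps above can be sketched as follows. The train_fn/predict_fn hooks and function name are illustrative assumptions; your cv.py should call into your own model instead.

```python
import numpy as np

def k_fold_accuracy(X, y, train_fn, predict_fn, k=5, seed=0):
    """Average accuracy over k folds; train_fn fits a model, predict_fn applies it."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))               # random split into k folds
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])   # fit on k-1 folds
        preds = predict_fn(model, X[test_idx])         # evaluate on held-out fold
        accs.append(np.mean(preds == y[test_idx]))
    return float(np.mean(accs))                 # cross-validation accuracy
```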




