Description
In this project, you will implement two families of model-free algorithms. The first is Monte Carlo (MC): on-policy first-visit MC prediction and on-policy first-visit MC control, both for blackjack. The second is Temporal-Difference (TD) learning: Sarsa (on-policy) and Q-learning (off-policy), both for cliff walking.
The TA will run your code twice. You will receive full credit if at least one of the two runs passes the tests.
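As a rough illustration of what first-visit MC prediction computes, the sketch below averages the return that follows the first occurrence of each state in every episode. The function name `first_visit_mc_prediction` and the episode format (a list of `(state, reward)` pairs) are assumptions made for this example only; they are not the interface of the project's starter code.

```python
from collections import defaultdict


def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) as the average return following the first visit to s.

    Assumed (hypothetical) format: each episode is a list of (state, reward)
    pairs, where reward is the reward received after acting in that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:
        # Record the time step at which each state first appears.
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)

        # Walk backwards so the discounted return G accumulates in one pass.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            if first_visit[state] == t:  # count only the first visit
                returns_sum[state] += G
                returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Toy episode: state "a" is visited twice, so only its first visit counts.
V = first_visit_mc_prediction([[("a", 0.0), ("a", 0.0), ("b", 1.0)]], gamma=0.5)
```

In this toy episode only the return from the first occurrence of `"a"` contributes to its estimate; the second occurrence is skipped.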
Hints
- On-policy first-visit Monte Carlo prediction
- On-policy first-visit Monte Carlo control
- Sarsa (on-policy TD control)
- Q-learning (off-policy TD control)
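The two TD control methods differ only in how the bootstrap target is formed, which a minimal sketch can make concrete. Sarsa uses the value of the action actually chosen in the next state, while Q-learning uses the greedy maximum over next actions. The helper names (`epsilon_greedy`, `sarsa_update`, `q_learning_update`), the Q-table layout, and the hyperparameter values below are assumptions for illustration, not the signatures expected by the grader.

```python
import random
from collections import defaultdict

N_ACTIONS = 4  # assumed: four moves, as in a grid world such as cliff walking


def epsilon_greedy(Q, state, epsilon, rng=random):
    """With probability epsilon pick a random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha, gamma):
    """On-policy: bootstrap from the action a_next that will actually be taken."""
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, done, alpha, gamma):
    """Off-policy: bootstrap from the best action value in s_next."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


# Q maps a state to a list of action values, defaulting to zeros.
Q = defaultdict(lambda: [0.0] * N_ACTIONS)
Q[1] = [0.0, 4.0, 2.0, 0.0]

# Same transition (reward -1, next state 1), two different targets.
sarsa_update(Q, s=0, a=0, r=-1.0, s_next=1, a_next=2, done=False, alpha=0.5, gamma=1.0)
q_learning_update(Q, s=0, a=1, r=-1.0, s_next=1, done=False, alpha=0.5, gamma=1.0)
```

In a full training loop, Sarsa would pick `a_next` with `epsilon_greedy` before updating and then execute that same action, whereas Q-learning can pick its next action after the update.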