Description
Introduction to sequence to sequence
Sequence to sequence
Generate a sequence from another sequence
- Translation: text to text
- ASR (speech recognition): speech to text
- TTS (speech synthesis): text to speech
- and more…
Sequence to sequence
Often composed of an encoder and a decoder
- Encoder: encodes input sequence into a vector or sequence of vectors
- Decoder: decodes a sequence one token at a time, based on 1) encoder output and 2) previous decoded tokens
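As a concrete illustration, here is a minimal PyTorch sketch of this encoder-decoder pattern (GRU-based; all names and sizes are illustrative, not taken from the HW5 sample code):

```python
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, src):
        # encode the input sequence into a sequence of vectors
        outputs, hidden = self.rnn(self.embed(src))
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden):
        # predict the next token from 1) the encoder state carried in
        # `hidden` and 2) the previously decoded token
        out, hidden = self.rnn(self.embed(prev_token), hidden)
        return self.proj(out), hidden
```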
HW5: Machine Translation
Neural Machine Translation
We will translate from English to Traditional Chinese
- Cats are so cute. -> 貓咪真可愛。
A sentence's translation into another language usually has a different length.
Naturally, the seq2seq framework is applied to this task.
Training datasets
- Paired data
- TED2020: TED talks with transcripts translated by a global community of volunteers into more than 100 languages
○ We will use (en, zh-tw) aligned pairs
- Monolingual data
- More TED talks in traditional Chinese
Evaluation
- source: Cats are so cute.
- target: 貓咪真可愛。
- output: 貓好可愛。
BLEU
- Modified[1] n-gram precision (n=1~4)
- Brevity penalty: penalizes short hypotheses
○ c is the hypothesis length, r is the reference length
- The BLEU score is the geometric mean of the n-gram precisions, multiplied by the brevity penalty
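Written out, following the standard definition in [1]:

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{4} \frac{1}{4}\log p_n\right)
```

where p_n is the modified n-gram precision. In practice the score is usually computed with a tool such as sacrebleu rather than by hand.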
Workflow
- Preprocessing
- download raw data
- clean and normalize
- remove bad data (too long/short); a filter sketch follows this list
- tokenization
- Training
- initialize a model
- train it with training data
- Testing
- generate translation of test data
- evaluate the performance
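For the "remove bad data" step, a minimal sketch of a length/ratio filter (thresholds, file names, and whitespace-based length counting are illustrative assumptions, not the sample code's exact logic):

```python
def keep_pair(src, tgt, min_len=1, max_len=1000, max_ratio=9.0):
    """Drop pairs that are empty, too long, or badly mismatched in length."""
    s, t = len(src.split()), len(tgt.split())
    if not (min_len <= s <= max_len and min_len <= t <= max_len):
        return False
    # a large length ratio usually indicates a misaligned pair
    return s / t <= max_ratio and t / s <= max_ratio

with open("train.en") as f_en, open("train.zh") as f_zh:
    pairs = [(en, zh) for en, zh in zip(f_en, f_zh) if keep_pair(en, zh)]
```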
Training tips
- Tokenize data with sub-word units
- Label smoothing regularization
- Learning rate scheduling
- Back-translation
- Tokenize data with sub-word units
- For one, it reduces the vocabulary size (subwords share common prefixes/suffixes)
○ For another, it alleviates the open vocabulary (unknown word) problem
○ example
■ ▁new ▁ways ▁of ▁making ▁electric ▁trans port ation ▁.
■ new ways of making electric transportation.
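The example above is typical subword output; a minimal sketch of learning and applying such a model with the sentencepiece library (file names and vocabulary size are illustrative):

```python
import sentencepiece as spm

# learn a subword vocabulary from the raw training text
spm.SentencePieceTrainer.train(
    input="train.en", model_prefix="spm_en",
    vocab_size=8000, model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_en.model")
pieces = sp.encode("new ways of making electric transportation.", out_type=str)
# pieces resemble: ['▁new', '▁ways', '▁of', '▁making', '▁electric',
#                   '▁trans', 'port', 'ation', '▁.']  (depends on training data)
```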
- Label smoothing regularization
- When calculating loss, reserve some probability for incorrect labels
○ Avoids overfitting
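A minimal sketch; recent PyTorch versions support this directly through the label_smoothing argument of CrossEntropyLoss:

```python
import torch
import torch.nn as nn

# reserve epsilon = 0.1 of the probability mass for incorrect labels
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 1000)            # (batch, vocab_size)
targets = torch.randint(0, 1000, (8,))   # gold token ids
loss = criterion(logits, targets)
```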
- Learning rate scheduling
- Linearly increase lr and then decay by inverse square root of steps
○ Stabilizes transformer training in the early stages
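A sketch of this schedule, following the formula from the original transformer paper (d_model and the warmup length are illustrative):

```python
def inverse_sqrt_lr(step, d_model=512, warmup=4000):
    """Linear warmup for `warmup` steps, then decay by 1/sqrt(step)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```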
Back-translation (BT)
Leverage monolingual data by creating synthetic translation data
- Train a translation system in the opposite direction
- Collect monolingual data on the target side and apply machine translation
- Use translated and original monolingual data as additional parallel data to train stronger translation systems
[Figure: back-translation converts target-side monolingual data into translated (synthetic) data, which is paired with the original monolingual data and added to the original parallel data]
Back-translation
Some points to note about back-translation
- Monolingual data should be in the same domain as the parallel corpus
- The performance of the backward model is critical
- You should increase model capacity (both forward and backward), since the amount of training data increases.
Requirements
You are encouraged to follow these tips to improve your performance and pass the 3 baselines.
- Train a simple RNN seq2seq to achieve translation
- Switch to transformer to boost performance
- Apply back-translation to further boost performance
Train a simple RNN seq2seq to achieve translation
- Running the sample code should pass the baseline!
Switch to transformer to boost performance
- Change the encoder/decoder architecture to transformer-based, according to the hints in the sample code
- RNNEncoder -> TransformerEncoder
○ RNNDecoder -> TransformerDecoder
- Change architecture configurations
- encoder_ffn_embed_dim -> 1024
○ encoder_layers/decoder_layers -> 4
○ uncomment the line: #add_transformer_args(arch_args) -> add_transformer_args(arch_args)
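Putting the hints together, a sketch of the configuration change (arch_args and add_transformer_args are names from the sample code; the real arch_args carries many more fields, so this is only a fragment):

```python
from argparse import Namespace

# transformer settings per the hints above; the actual arch_args in the
# sample code contains additional fields
arch_args = Namespace(
    encoder_ffn_embed_dim=1024,
    encoder_layers=4,
    decoder_layers=4,
)

# in the sample code, also uncomment this call, which fills in the
# remaining transformer defaults:
# add_transformer_args(arch_args)
```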
Apply back-translation to further boost performance
- Train a backward model by switching languages
- source_lang = "zh"
○ target_lang = "en"
- Remember to change the architecture to transformer-base
- Translate monolingual data with the backward model to obtain synthetic data
- complete the TODOs in the sample code.
○ all the TODOs can be completed using commands from earlier cells.
- Train a stronger forward model with the new data
- if done correctly, ~30 epochs on the new data should pass the baseline.
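Schematically, the monolingual-to-synthetic-pair step looks like this (a sketch only; backward_translate is a hypothetical stand-in for the trained backward model, and the actual TODOs reuse commands from earlier notebook cells):

```python
def make_synthetic_pairs(backward_translate, mono_zh, parallel_pairs):
    """Turn target-side monolingual text into extra (en, zh) training pairs.

    backward_translate: a zh -> en function (the trained backward model);
    hypothetical here -- the sample code does this step with notebook commands.
    """
    # the synthetic source side is machine-generated; the target side stays clean
    synthetic = [(backward_translate(zh), zh) for zh in mono_zh]
    return parallel_pairs + synthetic
```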
Expected Run Time
- on Colab with a Tesla T4
| Baseline | Details | Total time |
| --- | --- | --- |
| Simple | 2m15s x 30 epochs | 1hr 8m |
| Medium | 4m x 30 epochs | 2hr |
| Strong | 8m x 30 epochs (backward) + 1hr (back-translation) + 15m x 30 epochs (forward) | 12hr 30m |
- TA’s training curve https://wandb.ai/george0828zhang/hw5.seq2seq.ne
Regulation
- You should NOT plagiarize; if you use any other resource, you should cite it in the references. (*)
- You should NOT modify your prediction files manually.
- Do NOT share codes or prediction files with any living creatures.
- Do NOT use any approach to submit your results more than 5 times a day.
- Do NOT search for or use additional data or pre-trained models.
- Your final grade will be multiplied by 0.9 if you violate any of the above rules.
- Lee & the TAs reserve the right to change the rules & grades.
(*) Academic Ethics Guidelines for Researchers by the
Ministry of Science and Technology