Machine Translation using Sequence-to-Sequence Learning

An Annotated Introduction to LSTM-based Encoder-Decoder Models

In this article we train a Recurrent Neural Network (RNN) model based on two Long Short-Term Memory (LSTM) layers to translate English sentences into German. The approach is inspired by a tutorial on the official Keras blog. For an overview of RNNs and a more detailed look at LSTMs, please refer to this great blogpost by Chris Olah.

In sequence-to-sequence learning we want to convert input sequences, in the general case of arbitrary length, into sequences in another domain. An obvious application is machine translation.

'Go on.' -> [Sequence-to-Sequence Model] -> 'Mach weiter.'

The model we're building will process the English sentences as sequences of characters and produce the translated sentences character by character.

A schematic overview of the encoder-decoder model for machine translation.

Note that more advanced machine translation models usually process sentences word by word. However, this would require us to first embed each word as a vector. For simplicity we'll stick to the character-by-character approach here.
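To make the difference concrete, here is a minimal sketch comparing the two granularities on the example sentence from above. The whitespace split is a simplifying assumption that stands in for a real word tokenizer.

```python
# Compare character-level and (naive) word-level tokenization.
sentence = 'Go on.'

# Character level: every character, including spaces and punctuation, is a token.
char_tokens = list(sentence)

# Word level: a naive whitespace split (real tokenizers treat punctuation separately).
word_tokens = sentence.split()

print(char_tokens)  # ['G', 'o', ' ', 'o', 'n', '.']
print(word_tokens)  # ['Go', 'on.']
```

A character vocabulary stays small (a few dozen symbols), so one-hot vectors remain manageable, whereas a word vocabulary grows with the size of the corpus.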

Data Preparation

Luckily there are quite a few datasets for language translation tasks available here. They all consist of sentence pairs delimited by tabs. The German-English dataset contains sentence pairs prepared by the Tatoeba Project. Before we can feed English sentences to the model, they must first be converted into a form that Keras' LSTM layers accept as input.

The dataset is already uploaded as deu-eng.txt. The sentence pairs can be loaded by creating a reference (via @...) in the Python runtime.


Let's create a list of lines by splitting the text file at every occurrence of '\n'.

with open('deu-eng.txt', 'r', encoding='utf-8') as f:
  lines = f.read().split('\n')

Let's look at an example.


Sweet! So we have both the input (English) and the target (German) sentences in every line, separated by '\t'.


Let's go ahead and split each line into input text and target text. Since we'll do the translation character by character we also want to compute a set of every character we encounter in the dataset, both for inputs as well as targets.

input_texts = []
target_texts = []
input_characters = set()
target_characters = set()

Now we can loop over every sample we choose for training and fill the lists and sets. We'll also add '\t' (start-of-sequence) and '\n' (end-of-sequence) characters to every target text. This will later help our model determine when to start and, more importantly, when to end a sequence. We need this because we don't know a priori how long the output sequences should be, so we teach our model to decide that by itself.

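num_samples = 10000
for line in lines[: min(num_samples, len(lines) - 1)]:
  input_text, target_text = line.split('\t')
  # wrap the target in start-of-sequence and end-of-sequence markers
  target_text = '\t' + target_text + '\n'
  input_texts.append(input_text)
  target_texts.append(target_text)
  for char in input_text:
    if char not in input_characters:
      input_characters.add(char)
  for char in target_text:
    if char not in target_characters:
      target_characters.add(char)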

Ok, let's look at some characteristics of the input and target sequences.

input_characters = sorted(list(input_characters))
target_characters = sorted(list(target_characters))
num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])

print('Number of samples:', len(input_texts))
print('Number of unique input tokens:', num_encoder_tokens)
print('Number of unique output tokens:', num_decoder_tokens)
print('Max sequence length for inputs:', max_encoder_seq_length)
print('Max sequence length for outputs:', max_decoder_seq_length)

By now we have 10000 samples consisting of input/target texts. Among the input texts we have 70 unique characters, while the target texts contain 87 unique characters.


This is, in part, due to these weird German umlauts.

Another characteristic of the German language is that sentences tend to be a bit longer than their English counterparts (maybe you can find a way to test that hypothesis for yourself). In any case, the longest target sequence in the 10000 sample sentences we're using contains 53 characters, while the longest input sequence contains only 16.
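One simple way to test the length hypothesis is to compare mean character lengths. The sketch below uses a tiny hand-written sample; apart from the first pair, the sentence pairs are made up for illustration and are not taken from the dataset, and mean_lengths is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical mini-sample of (English, German) pairs, for illustration only.
pairs = [
    ('Go on.', 'Mach weiter.'),
    ('Hi.', 'Hallo!'),
    ('I see.', 'Ich verstehe.'),
]

def mean_lengths(pairs):
    """Return the mean character length of the inputs and of the targets."""
    input_mean = mean(len(src) for src, _ in pairs)
    target_mean = mean(len(tgt) for _, tgt in pairs)
    return input_mean, target_mean

input_mean, target_mean = mean_lengths(pairs)
print('Mean input length:', input_mean)
print('Mean target length:', target_mean)
```

On the real data one could pass list(zip(input_texts, target_texts)) instead, keeping in mind that every target text carries two extra marker characters ('\t' and '\n').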

But these input texts still don't work as input to our model. We'll need to convert the characters into numeric values. In our case, one-hot encodings are fine, but when turning to more involved models, using more advanced embedding methods such as Word2Vec would make more sense.

Here we first tokenize our characters by assigning each unique character to an integer value.

input_token_index = dict(
  [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
  [(char, i) for i, char in enumerate(target_characters)])
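As a small illustration of what such an index does, the following sketch builds a mapping from a tiny made-up corpus and round-trips a sentence through it. The names texts, encode and decode are assumptions introduced for this example only.

```python
# Build a character-to-index mapping from a tiny, made-up corpus.
texts = ['Go on.', 'Hi.']
characters = sorted(set(''.join(texts)))
token_index = {char: i for i, char in enumerate(characters)}
reverse_index = {i: char for char, i in token_index.items()}

def encode(text):
    """Map each character to its integer index."""
    return [token_index[char] for char in text]

def decode(indices):
    """Map integer indices back to characters."""
    return ''.join(reverse_index[i] for i in indices)

encoded = encode('Go on.')
print(characters)       # [' ', '.', 'G', 'H', 'i', 'n', 'o']
print(encoded)          # [2, 6, 0, 6, 5, 1]
print(decode(encoded))  # Go on.
```

Because the characters are sorted before enumeration, the mapping is deterministic for a given corpus.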

With that we can start creating numeric data. We'll need input data for both the encoder and the decoder of the model, as well as the target data (used only in the decoder part).

import numpy as np

encoder_input_data = np.zeros(
  (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
  dtype='float32')
decoder_input_data = np.zeros(
  (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
  dtype='float32')
decoder_target_data = np.zeros(
  (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
  dtype='float32')

The encoder_input_data will consist of 10000 samples, each padded to the maximum sequence length (16) and filled with the respective one-hot-encoded tokens (in this case vectors of length 70).

The decoder_input_data and the decoder_target_data are both constructed in the same way as the input data for the encoder. We need to construct those two sequences because we're training our model through a process called teacher forcing, where the decoder learns to generate decoder_target_data[t+1...] given decoder_input_data[...t] while taking into account the input sequence via the encoder's internal state. Therefore we have to offset decoder_target_data by one timestep.
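The one-step offset can be illustrated on plain strings before any one-hot encoding is involved. This is a simplified sketch: it trims both sequences to equal length, whereas the arrays in this article keep the full padded length.

```python
# Illustrate the one-step offset between decoder input and decoder target.
target_text = '\t' + 'Mach weiter.' + '\n'  # start and end markers added

decoder_input = target_text[:-1]   # everything except the final '\n'
decoder_target = target_text[1:]   # everything except the leading '\t'

# At each timestep the decoder sees one character and should predict the next.
for seen, expected in zip(decoder_input, decoder_target):
    print(repr(seen), '->', repr(expected))
```

The first line printed pairs the start character '\t' with 'M', and the last pairs '.' with the end character '\n'.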

Time to fill in the data with the actual tokens. For that we iterate over all input and target texts and insert the respective one-hot encoding for each character in the sequence.

for i, (input_text, target_text) in enumerate(zip(input_texts, target_texts)):
  for t, char in enumerate(input_text):
    encoder_input_data[i, t, input_token_index[char]] = 1.
  for t, char in enumerate(target_text):
    # decoder_target_data is ahead of decoder_input_data by one timestep
    decoder_input_data[i, t, target_token_index[char]] = 1.
    if t > 0:
      # decoder_target_data will be ahead by one timestep
      # and will not include the start character.
      decoder_target_data[i, t - 1, target_token_index[char]] = 1.
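To see what a single encoded sample looks like, here is a small self-contained sketch on a toy vocabulary. The padded length of 8 is an arbitrary choice for this example and not a value from the dataset.

```python
import numpy as np

# Toy vocabulary and sentence; the padded length of 8 is an arbitrary choice.
characters = sorted(set('Go on.'))
token_index = {char: i for i, char in enumerate(characters)}
max_len = 8

text = 'Go on.'
one_hot = np.zeros((max_len, len(characters)), dtype='float32')
for t, char in enumerate(text):
    one_hot[t, token_index[char]] = 1.

print(one_hot.shape)        # (8, 5)
print(one_hot.sum(axis=1))  # 1 for real characters, 0 for padded timesteps
```

Rows beyond the end of the sentence stay all-zero, which is how shorter sentences are padded to a common length.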

With that, our example sentence is turned into a sequence of length 16 with a one-hot encoding of the token for every character.


Building the Model

Now it's time to take a closer look at our encoder-decoder model. Our model will consist of two LSTMs. One will serve as an encoder, encoding the input sequence and producing internal state vectors which serve as conditioning for the decoder. The decoder, another LSTM, is responsible for predicting the individual characters of the target sequence. Its initial state is set to the state vectors from the encoder. This passes information about what to generate from the encoder to the decoder.

Let's build this model using Keras. For that we'll need the LSTM layer, as well as a Dense layer.

import keras, tensorflow
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import numpy as np

Before building the encoder and decoder parts we first have to define some hyperparameters. Since the dimensionality of the encoder and decoder LSTM layers has to match, a single parameter, latent_dim, suffices here.

batch_size = 64  # batch size for training
epochs = 100  # number of epochs to train for
latent_dim = 256  # latent dimensionality of the encoding space

With that we can start building the model. First we have to construct the encoder. When creating the LSTM layer we have to set the return_state argument to True, since we want to use the encoder's internal state vectors for the decoder.

encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
encoder_states = [state_h, state_c]

Note that we discard encoder_outputs as depicted in the schematic overview above since we're only interested in the state vectors.

Using encoder_states we can now build the decoder. Again we'll use Keras' Input layer to be flexible concerning input sequence lengths. When creating the LSTM, we now want it to return full output sequences as well as the internal state vectors. We're not using the decoder's internal states during training, but we will need them later for inference.

To arrive at the individual characters from the decoder's output we attach a Dense layer to the decoder's LSTM outputs, where the number of units matches the number of decoder tokens. This way we can use a softmax activation for the dense layer's outputs and train the whole model using a categorical cross-entropy loss, a standard choice for classification problems.
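For readers who want to see the arithmetic behind softmax and categorical cross-entropy, here is a plain-Python sketch for a single timestep. The scores and the three-character vocabulary are made up, and the functions are simplified illustrations rather than the Keras implementations.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

# Made-up scores for a vocabulary of three characters.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
target = [1, 0, 0]  # the true next character is the first one

loss = categorical_crossentropy(target, probs)
print(probs)
print(loss)
```

The loss shrinks towards zero as the probability assigned to the correct character approaches one.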

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

Now let's glue both parts together using Keras' Model functional API. Specify the inputs needed to produce the outputs and the resulting model will automatically include all layers necessary to compute the outputs.

model = Model(inputs=[encoder_inputs, decoder_inputs],
              outputs=decoder_outputs)

Training the Model

Let's compile the model defining the optimizer and our cross-entropy loss. As a cross-check we can also print out a summary of the individual layers included in our model.

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.summary()

No surprises here. lstm_1, our encoder, takes the first input layer as input, while the decoder, lstm_2, uses the encoder's internal states as well as the second input layer. Our model has about 700,000 parameters in total!
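The parameter count can be cross-checked by hand. The sketch below assumes the standard LSTM formulation with four gates and one bias vector per gate, and uses the token counts reported earlier; lstm_params and dense_params are hypothetical helper names.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

latent_dim = 256
num_encoder_tokens = 70  # unique input characters reported earlier
num_decoder_tokens = 87  # unique target characters reported earlier

encoder = lstm_params(num_encoder_tokens, latent_dim)
decoder = lstm_params(num_decoder_tokens, latent_dim)
dense = dense_params(latent_dim, num_decoder_tokens)

print(encoder, decoder, dense, encoder + decoder + dense)
# 334848 352256 22359 709463
```

Almost all of the parameters sit in the two LSTM layers; the dense output layer contributes only a small share.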

Time to run the training! Since we're running a CPU-only instance, the training will take about 1 hour (compared to only about 20 minutes when using a GPU). The patient reader can un-comment the following cell to run the training. We'll skip it and simply load the weights of a pre-trained model.

# model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
#           batch_size=batch_size, epochs=epochs)

Testing the Machine Translation

Since we're impatient by nature, we'll simply use the pre-trained weights. Let's load them.

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

Ok, with the weights in place it's time to test our machine translation model by translating a few test sentences.

The inference mode works a bit differently than the training procedure. The procedure can be broken down into 4 steps:

1. Encode the input sequence, return its internal states.

2. Run the decoder using just the start-of-sequence character as input and the encoder internal states as the decoder's initial states.

3. Append the character predicted (after lookup of the token) by the decoder to the decoded sequence.

4. Repeat the process with the previously predicted character token as input and the updated internal states.
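Before turning to Keras, the four steps can be sketched without any neural network by replacing the decoder with a toy lookup. Everything here (toy_decoder_step, greedy_decode, the fixed reply) is a hypothetical stand-in meant only to show the control flow.

```python
def toy_decoder_step(prev_char, state):
    """Hypothetical stand-in for the trained decoder.

    It ignores prev_char and simply walks through a fixed reply,
    using the integer state as the position in that reply.
    """
    reply = 'Hallo!\n'
    return reply[state], state + 1

def greedy_decode(initial_state, max_length=20):
    state = initial_state     # step 1: state handed over by the encoder
    prev_char = '\t'          # step 2: start-of-sequence character
    decoded = ''
    while True:
        next_char, state = toy_decoder_step(prev_char, state)
        decoded += next_char  # step 3: append the predicted character
        if next_char == '\n' or len(decoded) >= max_length:
            break
        prev_char = next_char  # step 4: feed the prediction back in
    return decoded

print(repr(greedy_decode(initial_state=0)))  # 'Hallo!\n'
```

The loop ends either when the end-of-sequence character appears or when the length limit is reached, which guards against a decoder that never emits '\n'.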

Let's go ahead and implement this. Since we only need the encoder for encoding the input sequence we'll split encoder and decoder into two separate models.

encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(
  decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

decoder_model = Model(
  [decoder_inputs] + decoder_states_inputs,
  [decoder_outputs] + decoder_states)

In order to conveniently perform the lookup from step 3 above we'll create reverse-lookup dictionaries for both the input and target tokens.

# reverse-lookup token index to turn sequences back to characters
reverse_input_char_index = dict(
  (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
  (i, char) for char, i in target_token_index.items())

With that we can create a function to perform the whole process of decoding a given input sequence (inputs already tokenized).

def decode_sequence(input_seq):
  # encode the input sequence to get the internal state vectors.
  states_value = encoder_model.predict(input_seq)
  # generate empty target sequence of length 1 with only the start character
  target_seq = np.zeros((1, 1, num_decoder_tokens))
  target_seq[0, 0, target_token_index['\t']] = 1.
  # output sequence loop
  stop_condition = False
  decoded_sentence = ''
  while not stop_condition:
    output_tokens, h, c = decoder_model.predict(
      [target_seq] + states_value)
    # sample a token and add the corresponding character to the 
    # decoded sequence
    sampled_token_index = np.argmax(output_tokens[0, -1, :])
    sampled_char = reverse_target_char_index[sampled_token_index]
    decoded_sentence += sampled_char
    # check for the exit condition: either hitting max length
    # or predicting the 'stop' character
    if (sampled_char == '\n' or 
        len(decoded_sentence) > max_decoder_seq_length):
      stop_condition = True
    # update the target sequence (length 1).
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, sampled_token_index] = 1.
    # update states
    states_value = [h, c]
  return decoded_sentence

With that let's sample a few test cases!

for seq_index in range(10):
  input_seq = encoder_input_data[seq_index: seq_index + 1]
  decoded_sentence = decode_sequence(input_seq)
  print('Input sentence:', input_texts[seq_index])
  print('Decoded sentence:', decoded_sentence)

Not bad for a model consisting only of two LSTM layers and a dense one!

But those were all examples from the training set. Let's validate the model using our own example: Let's have the model translate something simple, like:

"How are you?"

We'll put all the tokenization and decoding into one cell and print out the decoded sequence.

input_sentence = "How are you?"
test_sentence_tokenized = np.zeros(
  (1, max_encoder_seq_length, num_encoder_tokens), dtype='float32')
for t, char in enumerate(input_sentence):
  test_sentence_tokenized[0, t, input_token_index[char]] = 1.
print(decode_sequence(test_sentence_tokenized))
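Note that the tokenization loop above fails for sentences that are longer than max_encoder_seq_length or that contain characters missing from input_token_index. A small pre-check, sketched below with a toy vocabulary, can report such problems up front. The function name find_encoding_problems is an assumption introduced for this example.

```python
def find_encoding_problems(sentence, token_index, max_length):
    """Return a list of reasons why `sentence` cannot be one-hot encoded."""
    problems = []
    if len(sentence) > max_length:
        problems.append(
            'sentence has %d characters, limit is %d' % (len(sentence), max_length))
    unknown = sorted(set(char for char in sentence if char not in token_index))
    if unknown:
        problems.append('unknown characters: %s' % ''.join(unknown))
    return problems

# Toy vocabulary standing in for input_token_index.
toy_index = {char: i for i, char in enumerate(sorted(set('How are you?')))}

ok = find_encoding_problems('How are you?', toy_index, max_length=16)
bad = find_encoding_problems('How are you today, Zoe?', toy_index, max_length=16)
print(ok)   # []
print(bad)
```

An empty list means the sentence can be encoded safely with the given vocabulary and length limit.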

Great! But surely this simple model trained on 10k examples will have some failure cases! Let's try out some examples the model didn't see during training.

val_input_texts = []
val_target_texts = []
line_ix = 12000
for line in lines[line_ix:line_ix+10]:
  input_text, target_text = line.split('\t')
  val_input_texts.append(input_text)
  val_target_texts.append(target_text)

val_encoder_input_data = np.zeros(
  (len(val_input_texts), max([len(txt) for txt in val_input_texts]),
   num_encoder_tokens), dtype='float32')

for i, input_text in enumerate(val_input_texts):
  for t, char in enumerate(input_text):
    val_encoder_input_data[i, t, input_token_index[char]] = 1.
for seq_index in range(10):
  input_seq = val_encoder_input_data[seq_index: seq_index + 1]
  decoded_sentence = decode_sequence(input_seq)
  print('Input sentence:', val_input_texts[seq_index])
  print('Decoded sentence:', decoded_sentence[:-1])
  print('Ground Truth sentence:', val_target_texts[seq_index])

Ok, that clearly shows some failure cases. Note that for the evaluation texts we had input sequences longer than the longest sequence in the training set.

max([len(txt) for txt in val_input_texts])

While the model is able to produce a decoded sequence for these inputs, it will produce worse outputs the longer the input sequences become. After all, the model was not trained to decode sequences that long. You can try this out by changing the line_ix parameter to something later in the dataset (maybe something around 50000).

Summary and Outlook

This article showed the capabilities of encoder-decoder models combined with LSTM layers for sequence-to-sequence learning.

The dataset consisted of English-German sentence pairs. All characters of the respective sentences were first tokenized. The sequences of tokenized characters were then used as input and target for the encoder-decoder model.

During training we used the teacher forcing method, where we offset the decoder's input and target by one timestep. For inference, we used a slightly different setup, but still used the same modules we trained earlier. The input sequence was processed by the encoder and its final hidden states, along with the start-of-sequence character, were used as input for the decoder. Each predicted character was then fed back into the decoder while the hidden states were updated. We repeated this until the decoder predicted the end-of-sequence character, telling us the predicted sequence was complete.

While this model is rather simple and far from state-of-the-art neural machine translation models, it demonstrates some principles still used in more recent approaches (e.g. the encoder-decoder model in the famous Attention Is All You Need paper). Even Google Translate seems to be using a more sophisticated encoder-decoder model based on (bidirectional) LSTMs.

Going further, turning the LSTMs we used into their bidirectional version might help improve the model. To become even better, one might want to turn to methods such as self-attention (described, for example, in a great blogpost) and embed them in the encoder-decoder structure.