GSoC 2021: PPLM.jl - Controlled Text Generation with Julia (Part I)

Being a part of Google Summer of Code 2021 with the Julia community has been one of the most amazing experiences so far, and all thanks go to my mentors Avik Sengupta and Tejas Vaidhya for their encouragement and support. It has been a month since I started working on PPLM.jl, a Julia-based package for controlled text generation built on Uber AI's Plug and Play Language Models. Here is my first blog post on what PPLM.jl has to offer so far.

What's PPLM.jl?

Plug and Play Language Model, or PPLM, is an approach that allows a user to flexibly plug one or more tiny attribute models, representing the desired steering objective, into a large, unconditional language model (LM).

While one can easily load pre-trained language models like GPT-2 from Transformers.jl to generate coherent text, steering the generated text towards a desired attribute is not directly possible with these pre-trained models as they are (at least not without fine-tuning). Moreover, even plain generation lacks a convenient API. PPLM.jl not only simplifies the text generation process but also allows some degree of steering based on attribute models. It does so with the following three steps (a rough code sketch follows the list):

  1. Given a partially generated sentence, compute log(p(x)) and log(p(a|x)) and the gradients of each with respect to the hidden representation of the underlying language model. 

  2. Use the gradients to move the hidden representation of the language model a small step in the direction of increasing log(p(a|x)) and increasing log(p(x)).

  3. Sample the next word.
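To make these steps concrete, here is a minimal, self-contained sketch of step 2 in Julia. It uses Flux for gradients and a toy linear classifier as a stand-in for the attribute model p(a|x); all names and sizes below are illustrative, not PPLM.jl's actual code.

using Flux                      # provides `gradient`, `Dense`, `logsoftmax`
using LinearAlgebra: norm

hidden_size = 768                         # hidden width of GPT-2 small
h = randn(Float32, hidden_size)           # hidden representation of the partial sentence

attribute_model = Dense(hidden_size, 2)   # toy stand-in for the attribute model
target = 1                                # index of the desired attribute a

# log p(a|x) as judged by the attribute model from the hidden state
logp_attr(h) = logsoftmax(attribute_model(h))[target]

# Step 2: nudge the hidden state a small step along the gradient of log p(a|x).
# (The full method also keeps log p(x) high so the text stays fluent.)
step_size = 0.02f0
g = gradient(logp_attr, h)[1]
h_perturbed = h .+ step_size .* g ./ (norm(g) + eps(Float32))

In the package, this perturbation is applied to the model's hidden states as generation proceeds; the sketch above only shows the core gradient step.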

Link to the repo: https://github.com/AdarshKumar712/PPLM.jl

Key Features of PPLM.jl

While PPLM.jl is based on Plug and Play Language Models, it offers more than just an adaptation of the original implementation. Some of the features implemented so far are:

  1. API for GPT2 pre-trained Tokenizer

  2. Normal Text Generation

  3. Attribute Models: BagOfWords Model and Discriminator Model.

  4. Pre-defined Bag of Words lists (from HuggingFace)

  5. Pre-trained discriminators (from HuggingFace), with weights converted to BSON

  6. For perturbed text generation:

    a. Steering of last hidden states based on BagOfWords

    b. Steering of last hidden states based on Attribute Discriminators

    c. More Coming Soon...

  7. Training of discriminators (GPT-2 based classifiers)

Let's discuss some of these points in this blog. The rest will be covered in future blogs, along with other features implemented by then. Also, please note that the package is not yet registered.

To install and load the PPLM package, you can use the following code:

using Pkg;
Pkg.add(url="https://github.com/AdarshKumar712/PPLM.jl")
using PPLM  # this may take some time in precompilation

1. GPT2 Tokenizer

PPLM.jl allows users to load pre-trained GPT2 tokenizers based on BytePairEncoding.jl and Transformers.jl, which can then be used to tokenize/encode English text with a single line of code. The tokenizer is implemented with the following structure:

abstract type GPT2 <: PretrainedTokenizer end
struct GPT2Tokenizer <: GPT2
    encoder::Vocabulary{String}
    bpe_encode::GenericBPE
    bpe_decode::UnMap
    vocab::Dict{String, Any}
    unk_token::String
    unk_id::Int   
    eos_token::String
    eos_token_id::Int
    pad_token::String
    pad_token_id::Int
end

Let's see how you can tokenize text with PPLM.

# Load Tokenizer
using PPLM
tokenizer = PPLM.load_pretrained_tokenizer(GPT2)
sentence = "This is an example of Tokenization"

Once you have loaded your tokenizer, you can use either of the following options:

tokens = tokenizer(sentence)
# or 
tokens = encode(tokenizer, sentence)
"""
returns the following output
7-element Vector{Int64}:
  1213
   319
   282
  1673
   287
 29131
  1635
"""

Now you have your list of tokens. Suppose you want to get back your sentence. This can be done in two ways:

# First method:
sentence = detokenize(tokenizer, tokens)
# Second method:
decoded_tokens_list = decode(tokenizer, tokens)
# returns the vector: ["This", "Ġis", "Ġan", "Ġexample", "Ġof", "ĠToken", "ization"]
sentence = detokenize(tokenizer, decoded_tokens_list)

If you apply either of the methods, you will get back your sentence:

This is an example of Tokenization

2. Normal Text Generation

PPLM.jl can be used to generate normal (unperturbed) text with the GPT-2 model, using either of two sampling methods, `top_k` and `nucleus`; a brief sketch of both follows the example below.

To generate text, you can use the following code:

# `model` and `tokenizer` are a pre-trained GPT-2 model and its tokenizer, loaded beforehand
sample_normal(;primer="Fruits are", tokenizer=tokenizer, model=model, method="top_k")
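For intuition on the two sampling methods, here is a minimal, standalone sketch of top-k and nucleus (top-p) sampling over a toy next-token distribution. The function names and numbers are illustrative, not the package's internal implementation.

using StatsBase: sample, Weights

# Toy next-token distribution (illustration only)
probs = [0.4, 0.25, 0.15, 0.1, 0.05, 0.05]

# top_k: keep the k most probable tokens, renormalize, and sample among them
function top_k_sample(probs; k=3)
    idx = partialsortperm(probs, 1:k; rev=true)   # indices of the k largest probabilities
    w = probs[idx] ./ sum(probs[idx])
    return idx[sample(1:k, Weights(w))]
end

# nucleus (top-p): keep the smallest set of tokens whose total probability ≥ p
function nucleus_sample(probs; p=0.6)
    order = sortperm(probs; rev=true)
    cutoff = something(findfirst(≥(p), cumsum(probs[order])), length(order))
    idx = order[1:cutoff]
    w = probs[idx] ./ sum(probs[idx])
    return idx[sample(1:length(idx), Weights(w))]
end

next_id_topk = top_k_sample(probs; k=3)        # samples among the 3 most likely tokens
next_id_nucleus = nucleus_sample(probs; p=0.6) # samples within the smallest "nucleus" of tokens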

Here is a sample of text generated with GPT-2 using the above code:

With Top_k sampling, k=50, prompt = "Fruits are"

"Fruits are the key ingredient in our diet; their vitamins, and proteins are essential to build our immune system. What makes a good fruit one of them is simply as simple as your diet is. Fruit is one simple nutrient that is used effectively as a defense against sickness and stress (which can be very life changing indeed). When the body has just consumed enough fat for at least 40-50 days, the body also releases hormones known as the hormone estrogen in order to prevent infection. A good diet makes life easier"

With Nucleus sampling, p=0.6, prompt = "Fruits are"

"Fruits are packed with the goodness of ancient Greek life, plants that protect and revive us from death. At every stone in the garden, your fruit may reflect on the people who once carried you from town to town, those who would still give you food to live, and the perfect pair of hand-gloved fingers you may wear in your golden bedroll."

PPLM-based Text Generation: A Preview

While the API for the PPLM-based controlled text generation part is yet to be finished, here is a preview of text generated with the BagOfWords model, keeping the same prompt across the examples. More detail on this will be covered in the next blog.

Prompt = "To conclude"

# Without Perturbation:
"To conclude, it is only fair to ask, on the other hand, what we think about one particular type of religious denomination that has an unusual relation to American history (other than the ones associated with Catholicism)?\n\nI could imagine it is just because American social studies scholars aren't as committed to explaining the causes of the American revival. Nor would I imagine Protestant professors who write for the Nation, not least because they might fear an attack by critics on their writings that might bring a backlash against their conclusions"
# With Bag of Words: [politics]
"To conclude this brief essay, I have briefly discussed one of my own writing's main points: How a great many poor working people who were forced by the government to sell goods to high-end supermarkets to make ends meet were put off purchasing goods at a time they wouldn't be able afford. That point of distinction arises in every social democracy I identify as libertarian.\n\nA large number of people in this group simply did not follow basic political norms, and in order not to lose faith that politics was in"
# With Bag of Words: [science]
"To conclude this post, I've gathered data from more than 40,000 users over a span 5 years. In my research I collected user logs and data analysis charts to create data for our project.\n\nWhile data is useful. But most data is missing data not needed by a scientific project or data analysis. What's more, the statistical data itself cannot really be quantified. Instead, it is a matter of data sets, data collection charts, data analyses and analysis charts.\n\nData is a"
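For a rough idea of how the BagOfWords attribute model scores text, here is a toy sketch: the attribute log-likelihood log(p(a|x)) is taken as the log of the probability mass that the language model's next-token distribution places on the words in the bag. Everything below (names, sizes, token ids) is illustrative, not PPLM.jl's internal code.

using Flux   # for softmax

vocab_size = 50257                       # GPT-2 vocabulary size
logits = randn(Float32, vocab_size)      # stand-in for the LM's output logits
probs = softmax(logits)                  # next-token probabilities p(w | x)

bag_ids = [1756, 2055, 4819]             # hypothetical token ids of the bag words

# log p(a|x): probability mass on the bag words; its gradient w.r.t. the
# hidden states is what drives the perturbation during generation
log_p_attr = log(sum(probs[bag_ids]))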

Conclusion

That's all for this blog; it is just an introduction to the project. I will continue with the BagOfWords model in the next blog. See you soon!

References

  1. Dathathri, S., Madotto, A., Lan, J., Hung, J., Frank, E., Molino, P., Yosinski, J., & Liu, R. (2020). Plug and Play Language Models: A Simple Approach to Controlled Text Generation. ICLR 2020. arXiv:1912.02164.