GSoC 2020 Blog #2: Summary of the Summer - HuggingFace Transformers with Julia
It's coming to the end of GSoC 2020. We implemented lots of stuff during this summer, and some work still remains unfinished, so let's see what was done during the GSoC coding period. Here is the summary of my work. All the code lives under `Transformers.HuggingFace`.
using Pkg
pkg"add Transformers#master CUDA Flux PyCall; build"
using Transformers, Flux, CUDA
using Transformers.HuggingFace
The code is implemented in pure Julia, but to verify that it works correctly, we'll use PyCall here just for demonstration. PyCall is actually unnecessary for using `Transformers.HuggingFace`.
using PyCall
pytorch = pyimport_conda("torch", "pytorch")
pip = pyimport("pip")
pip.main(["install", "transformers"])
pytransformers = pyimport("transformers")
Project: Leveraging Hugging Face Transformers package in Julia
As the title says, the goal of this project is to reuse the existing Python transformer ecosystem - the huggingface/transformers package. To achieve this, we start with the model loader and saver.
The Loader API
Huggingface has a really great ecosystem: they built not only the Python library but also an amazing model hub that everyone can upload models to and download models from. This helps the NLP community grow faster than ever before. We don't want to be missing from this fast-growing trend in NLP technology, so we built some functionality on top of their download mechanism and ended up with something like this:
cfg = hgf"bert-base-cased:config"
The `@hgf_str` API handles the whole process automatically. It downloads the required file (here, the BERT config file) from their model hub. The downloaded file is managed by Julia's Artifacts system, so there will be no duplicate files on our computer. Moreover, if there are already files cached by huggingface/transformers, we reuse those files as well.
pygpt_cfg = pytransformers.AutoConfig.from_pretrained("gpt2")
gpt_cfg = hgf"gpt2:config"
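As a quick sanity check, we can compare a few hyperparameters between the two config objects. This is just an illustration; the field names below (`n_layer`, `n_head`, `vocab_size`) are the standard GPT-2 hyperparameters and are assumed to be accessible on both the Julia and the Python config objects:
# Compare a few hyperparameters between the Julia and the Python config.
# Field names are the standard GPT-2 ones (assumed accessible on both objects).
gpt_cfg.n_layer == pygpt_cfg.n_layer        # expected: true
gpt_cfg.n_head == pygpt_cfg.n_head          # expected: true
gpt_cfg.vocab_size == pygpt_cfg.vocab_size  # expected: true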
Besides the model hub, we can also use the API with our own local files. Under the hood, several low-level APIs make this happen. The complete workflow looks like this:
1. Is the pretrained model from the Huggingface model hub? (yes => 2. / no => 3.)
2. Have you already used the model in Python before? (yes => 2-1. / no => 2-2.)
2-1. There is a cached pretrained file on the computer: use `get_or_download_hgf_file` directly. This will copy the file from the cache dir to our Artifacts dir and register it in `Artifacts.toml` (Julia's Artifacts system).
2-2. No cached files: also use `get_or_download_hgf_file` directly. This will download the file from Huggingface's model server to our Artifacts dir and register it in `Artifacts.toml`.
3. Using local pretrained files: use `HuggingFace.find_or_register_hgf_file_hash` to register the file with our Artifacts system. Once the file is registered, the entry will appear in `Artifacts.toml`.
4. Once the model is managed under Julia's Artifacts system, we can use either `HuggingFace.get_registered_file_dir` or `get_or_download_hgf_file` to get the pretrained file or dir.
5. Then we can enjoy the `@hgf_str` API.
?get_or_download_hgf_file
get_or_download_hgf_file(model_name, file_name)
Get the file path of the given `model_name` and `file_name`. Automatically downloads and registers the file from the Huggingface server if it is not found in `Artifacts.toml`. To use a local file, register it with `find_or_register_hgf_file_hash` first.
?HuggingFace.find_or_register_hgf_file_hash
find_or_register_hgf_file_hash(path, model_name, file_name)
Get the artifact hash of the given `<model_name>/<file_name>`. If it is not found in `Artifacts.toml`, get the file from `path` and register it in `Artifacts.toml`. `path` can be either a URL or a local path.
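To make the workflow above a bit more concrete, here is a minimal sketch combining the two low-level calls; the local path ./my_local_bert/config.json and the model name "my-local-bert" are made up for illustration:
# Register a local config file under a model name of our choice
# (the path and model name here are hypothetical).
HuggingFace.find_or_register_hgf_file_hash("./my_local_bert/config.json",
                                            "my-local-bert", "config.json")

# Afterwards the file can be resolved through the Artifacts system.
# For a model hub name, this call would download and register the file instead.
config_path = HuggingFace.get_or_download_hgf_file("my-local-bert", "config.json")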
?@hgf_str
hgf"<model-name>:<item>"
Get `item` from `model-name`. This will ensure the required data is downloaded and registered. `item` can be "config", "tokenizer", or a model-related item like "model", "formaskedlm", etc. Use `get_model_type` to see which models/tasks are supported.
Loading the model
We have seen how to use the API with the config, but what about the model? Can we also use the `@hgf_str` API? The answer is yes! But things are more complicated here. Huggingface transformers uses a unique model type for each task, so you need to make sure there exists a model type for that task. You can find out which tasks are supported with the `HuggingFace.get_model_type` function. For example:
HuggingFace.get_model_type(Val(:bert))
It returns the existing task names with the corresponding model types. Once you know which model you need, you can load it with the `@hgf_str` API:
bert_model = hgf"bert-base-cased:forquestionanswering";
During the loading process, two kinds of warnings might appear.
The first kind of warning says that there are some extra variables in the loaded state, which will therefore be ignored. In the example above, the loaded model was pretrained on the masked language modeling task, so it has some layers for decoding tokens (the `cls` field). However, those layers aren't needed for the question answering task, so we simply ignore them.
The second warning in the block above tells you which variables (`qa_outputs` here) are missing from the saved state (i.e. aren't initialized from the loaded state) and are therefore randomly initialized. This happens because the loaded model was not pretrained on the question answering task, but here we want to fine-tune the weights on question answering. This warning informs you which weights are being loaded, so if some weights that should have been loaded appear in the warning, something must have gone wrong.
You can also see the whole model architecture by printing the model. These models are implemented with Flux, so we can use `gpu` on them like on a regular Flux layer.
gpu_model = gpu(bert_model)
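Since the model behaves like any other Flux layer, the usual Flux utilities apply directly. Here is a minimal sketch of preparing for fine-tuning; the optimizer and learning rate are arbitrary choices, not part of the package:
# Collect the trainable parameters of the loaded model, as for any Flux layer.
ps = Flux.params(gpu_model)

# Any Flux optimizer can be used for fine-tuning; in a training loop one would
# compute gradients with `Flux.gradient(() -> loss(...), ps)` and apply them
# with `Flux.Optimise.update!(opt, ps, gs)` as usual.
opt = ADAM(3e-5)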
Saving the model for Python
We have shown how to get the pretrained model and load it in Julia. The next problem is: even if we train a model in Julia, how can a person from the Python world use it? If we save the model with BSON in the regular Julian way, there is currently no easy way to load it in Python, not to mention uploading it to Huggingface's model hub.
As a consequence, we decided to follow exactly the same format used by huggingface/transformers, which is a PyTorch `state_dict` for models. Fortunately, it's just a modified Python pickle format, so we implemented another package called Pickle.jl in pure Julia (also done during this summer). As a result, things work quite smoothly with `Transformers.HuggingFace`. For example:
mycfg = HuggingFace.HGFGPT2Config(vocab_size=10,
    n_embd=128, n_layer=2, n_head=2, n_positions=100, n_ctx=100,
    bos_token_id=0, eos_token_id=1)
We create our own GPT-2 model in Julia.
mygpt2 = HuggingFace.HGFGPT2LMHeadModel(mycfg)
mkpath("jlgpt2")
save_config("jlgpt2", mycfg)
save_model("jlgpt2", mygpt2)
and save them at ./jlgpt2.
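The directory now follows the huggingface/transformers layout. Listing it should show the familiar file names; the exact names below assume the package writes the default config.json and pytorch_model.bin:
readdir("jlgpt2")
# expected entries (assuming the default huggingface file names):
#  "config.json"
#  "pytorch_model.bin"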
They can then be loaded from Python directly.
pygpt2 = pytransformers.GPT2LMHeadModel.from_pretrained("./jlgpt2")
pygpt2.lm_head.weight
mygpt2.lm_head.weight
You can see that the loaded values are correct. Everything works like a charm.
The Unfinished Parts and Future Work
Besides the code above, lots of stuff remained unfinished during GSoC 2020. Here is the list:
No tokenizer:
One of the best parts of huggingface/transformers is that they wrap the tokenizers in a way that is easy to use. They even support multiple languages and several pre-processing pipelines. Unfortunately, this work was too large to fit in the schedule alongside the other workload. What's worse, without the tokenizer we cannot upload our trained models to Huggingface's model hub. Currently we can only keep using the old tokenizer from Transformers.jl to train the model (see the sketch after this list), but implementing the tokenizer part will be the top priority of the future development of this package.
Only a small number of model kinds supported:
There are several models supported by huggingface/transformers, and we only implemented 3 of them during the coding period. Since the models are manually translated from Python+PyTorch code to Julia+Flux code, each model requires several days of work to keep the Julia API consistent with Python's. We'll add more model implementations in the future once the tokenizer problem is fixed.
Lack of examples and tutorials:
Up to now, we haven't provided any complete training example with `Transformers.HuggingFace`.
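As mentioned in the tokenizer item above, the old tokenizer pipeline from Transformers.jl still works as a stop-gap. Here is a minimal sketch following the Transformers.jl pretrained API of that time; the exact bundle name and the returned tuple layout are taken from its README and should be treated as assumptions here:
using Transformers
using Transformers.Basic
using Transformers.Pretrain

# Load the old pretrained BERT bundle shipped with Transformers.jl
# (the name is one of the entries listed by `pretrains()`).
bert, wordpiece, tokenizer = pretrain"bert-uncased_L-12_H-768_A-12"

# Tokenize a sentence with the old pipeline, then map the tokens to wordpieces.
tokens = "Peter Piper picked a peck of pickled peppers" |> tokenizer |> wordpiece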
Conclusion
After GSoC 2020, we have a package that can get pretrained models from Huggingface and train them in pure Julia and Flux. Moreover, we can also save the model for Python to use. Although there is still some stuff that needs to be done, this should step-by-step improve the state of NLP with JuliaLang.
Acknowledgement
Thanks to JuliaLang and the Julia community for this opportunity. Special thanks to my mentors, Avik Sengupta, Jun Tian, and Dhairya Gandhi. This project wouldn't exist without their help.
After GSoC, I will keep developing the package. Hopefully, the project can be the bridge between the two communities and help put JuliaLang on the list of recommendations for NLP projects.