Welcome to bert-embedding’s documentation!¶
BERT, published by Google, is a new way to obtain pre-trained language-model word representations. Many NLP tasks benefit from BERT to achieve state-of-the-art (SOTA) results.
The goal of this project is to obtain sentence and token embeddings from BERT’s pre-trained model. This way, instead of building and fine-tuning an end-to-end NLP model, you can build your model on top of the sentence or token embeddings.
This project is implemented with @MXNet. Special thanks to the @gluon-nlp team.
BertEmbedding¶
BERT embedding.
class bert_embedding.bert.BertEmbedding(ctx=cpu(0), dtype='float32', model='bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased', params_path=None, max_seq_length=25, batch_size=256)[source]¶
Bases: object
Encoding from BERT model.
Parameters: - ctx (Context.) – running BertEmbedding on which gpu device id.
- dtype (str) – data type to use for the model.
- model (str, default bert_12_768_12) – pre-trained BERT model
- dataset_name (str, default book_corpus_wiki_en_uncased) – pre-trained model dataset
- params_path (str, default None) – path to a parameters file to load instead of the pretrained model.
- max_seq_length (int, default 25) – max length of each sequence
- batch_size (int, default 256) – batch size
__init__(ctx=cpu(0), dtype='float32', model='bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased', params_path=None, max_seq_length=25, batch_size=256)[source]¶
Encoding from BERT model.
Parameters: - ctx (Context) – context (CPU or GPU device) to run BertEmbedding on.
- dtype (str) – data type to use for the model.
- model (str, default bert_12_768_12) – pre-trained BERT model
- dataset_name (str, default book_corpus_wiki_en_uncased) – pre-trained model dataset
- params_path (str, default None) – path to a parameters file to load instead of the pretrained model.
- max_seq_length (int, default 25) – max length of each sequence
- batch_size (int, default 256) – batch size
embedding(sentences, oov_way='avg')[source]¶
Get tokens and token embeddings.
Parameters: - sentences (List[str]) – sentences for encoding.
- oov_way (str, default avg) – use avg, sum or last to get the token embedding for out-of-vocabulary words
Returns: list of tokens and their token embeddings
Return type: List[(List[str], List[ndarray])]
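To illustrate the `oov_way` options, here is a minimal sketch (using NumPy, not the library’s actual internals) of how the sub-word vectors of an out-of-vocabulary word could be pooled back into a single token embedding. The function name `pool_subword_vectors` and the toy 4-dimensional vectors are assumptions for illustration only:

```python
import numpy as np

def pool_subword_vectors(vectors, oov_way="avg"):
    """Combine the wordpiece vectors of one out-of-vocabulary word
    into a single token embedding, mirroring the documented options."""
    stacked = np.stack(vectors)      # shape: (num_subwords, hidden_size)
    if oov_way == "avg":
        return stacked.mean(axis=0)  # average the sub-word vectors
    if oov_way == "sum":
        return stacked.sum(axis=0)   # element-wise sum
    if oov_way == "last":
        return stacked[-1]           # keep only the last sub-word vector
    raise ValueError("oov_way must be 'avg', 'sum' or 'last'")

# e.g. a word split into three wordpieces, with toy 4-dim vectors
subs = [np.array([1.0, 0.0, 0.0, 0.0]),
        np.array([0.0, 1.0, 0.0, 0.0]),
        np.array([0.0, 0.0, 1.0, 0.0])]
print(pool_subword_vectors(subs, "avg"))  # mean over the three vectors
```

With `avg` the result is the element-wise mean of the sub-word vectors; `sum` adds them, and `last` keeps only the final wordpiece’s vector.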
Available pre-trained BERT models¶
|  | book_corpus_wiki_en_uncased | book_corpus_wiki_en_cased | wiki_multilingual | wiki_multilingual_cased | wiki_cn |
| --- | --- | --- | --- | --- | --- |
| bert_12_768_12 | ✓ | ✓ | ✓ | ✓ | ✓ |
| bert_24_1024_16 | x | ✓ | x | x | x |
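Putting the pieces together, here is a hedged usage sketch following the class signature documented above. It assumes the package is importable as `bert_embedding` and that MXNet plus a model download are available; when they are not, a stand-in result with the documented return shape, `List[(List[str], List[ndarray])]`, is built instead so the unpacking still works:

```python
import numpy as np

try:
    # Requires mxnet, the bert-embedding package, and a model download.
    from bert_embedding import BertEmbedding
    bert = BertEmbedding(max_seq_length=32)  # defaults: bert_12_768_12, uncased
    results = bert.embedding(["GluonNLP is a toolkit built on top of MXNet."])
except Exception:
    # Stand-in with the documented return shape (one 768-dim vector per token),
    # used here only so the example below runs without the library.
    results = [(["gluonnlp", "is", "a", "toolkit"],
                [np.zeros(768) for _ in range(4)])]

# Each element of `results` is a (tokens, token_embeddings) pair per sentence.
tokens, token_vecs = results[0]
print(len(tokens), token_vecs[0].shape)  # one vector per token
```

The number of sentences in determines the number of `(tokens, embeddings)` pairs out, and with `bert_12_768_12` each token vector has 768 dimensions.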