Welcome to bert-embedding’s documentation!

BERT, published by Google, is a new way to obtain pre-trained language model word representations. Many NLP tasks benefit from BERT to achieve state-of-the-art results.

The goal of this project is to obtain sentence and token embeddings from BERT's pre-trained model. In this way, instead of building and fine-tuning an end-to-end NLP model, you can build your model by simply utilizing the sentence or token embeddings.

This project is implemented with @MXNet. Special thanks to the @gluon-nlp team.

BertEmbedding

BERT embedding.

class bert_embedding.bert.BertEmbedding(ctx=cpu(0), dtype='float32', model='bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased', params_path=None, max_seq_length=25, batch_size=256)[source]

Bases: object

Encoding from BERT model.

Parameters:
  • ctx (Context) – the MXNet context (CPU or GPU device) on which to run BertEmbedding.
  • dtype (str) – data type to use for the model.
  • model (str, default bert_12_768_12) – pre-trained BERT model.
  • dataset_name (str, default book_corpus_wiki_en_uncased) – dataset of the pre-trained model.
  • params_path (str, default None) – path to a parameters file to load instead of the pre-trained model.
  • max_seq_length (int, default 25) – maximum length of each sequence.
  • batch_size (int, default 256) – batch size.
__init__(ctx=cpu(0), dtype='float32', model='bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased', params_path=None, max_seq_length=25, batch_size=256)[source]

Encoding from BERT model.

Parameters:
  • ctx (Context) – the MXNet context (CPU or GPU device) on which to run BertEmbedding.
  • dtype (str) – data type to use for the model.
  • model (str, default bert_12_768_12) – pre-trained BERT model.
  • dataset_name (str, default book_corpus_wiki_en_uncased) – dataset of the pre-trained model.
  • params_path (str, default None) – path to a parameters file to load instead of the pre-trained model.
  • max_seq_length (int, default 25) – maximum length of each sequence.
  • batch_size (int, default 256) – batch size.
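A minimal construction sketch (the GPU context and the non-default max_seq_length below are assumptions for illustration; any MXNet context works, and the defaults documented above apply otherwise):

import mxnet as mx
from bert_embedding.bert import BertEmbedding

# Default configuration: bert_12_768_12 / book_corpus_wiki_en_uncased on CPU.
bert = BertEmbedding()

# Assumed alternative: run on the first GPU and allow longer input sequences.
bert_gpu = BertEmbedding(ctx=mx.gpu(0), max_seq_length=64)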
embedding(sentences, oov_way='avg')[source]

Get tokens and their token embeddings.

Parameters:
  • sentences (List[str]) – sentences for encoding.
  • oov_way (str, default avg) – use avg, sum, or last to obtain the token embedding for out-of-vocabulary words.
Returns:

A list of (tokens, token embeddings) pairs, one per input sentence.

Return type:

List[(List[str], List[ndarray])]
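A short usage sketch of embedding() and its return value (the example sentences are made up; the per-token vector size 768 follows from the default bert_12_768_12 model):

from bert_embedding.bert import BertEmbedding

bert = BertEmbedding()
sentences = ["It is a nice day today.", "BERT gives contextual token embeddings."]

results = bert.embedding(sentences, oov_way='avg')

# One (tokens, token_embeddings) pair per input sentence.
for tokens, token_embeddings in results:
    print(tokens)                     # tokens for this sentence
    print(len(token_embeddings))      # one ndarray per token
    print(token_embeddings[0].shape)  # (768,) for bert_12_768_12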

Available pre-trained BERT models

                  book_corpus_wiki_en_uncased   book_corpus_wiki_en_cased   wiki_multilingual   wiki_multilingual_cased   wiki_cn
bert_12_768_12    ✓                             ✓                           ✓                   ✓                         ✓
bert_24_1024_16   ✓                             x                           x                   x                         x

Usage

Example of using the large pre-trained BERT model from Google

from bert_embedding.bert import BertEmbedding

bert = BertEmbedding(model='bert_24_1024_16', dataset_name='book_corpus_wiki_en_cased')
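Continuing this sketch with a made-up sentence (the 1024-dimensional vectors follow from the bert_24_1024_16 model name):

sentences = ["The quick brown fox jumps over the lazy dog."]
results = bert.embedding(sentences)

tokens, token_embeddings = results[0]
print(tokens)
print(token_embeddings[0].shape)  # (1024,) for bert_24_1024_16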

Source: gluonnlp
