BertEmbedding

BERT embedding.

class bert_embedding.bert.BertEmbedding(ctx=cpu(0), dtype='float32', model='bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased', params_path=None, max_seq_length=25, batch_size=256)

Bases: object

Encodes text into token embeddings using a pre-trained BERT model.

Parameters:
  • ctx (Context) – MXNet context (CPU or GPU device) on which to run BertEmbedding.
  • dtype (str) – data type to use for the model.
  • model (str, default bert_12_768_12) – name of the pre-trained BERT model.
  • dataset_name (str, default book_corpus_wiki_en_uncased) – dataset the pre-trained model was trained on.
  • params_path (str, default None) – path to a parameters file to load instead of the pre-trained model.
  • max_seq_length (int, default 25) – maximum token length of each input sequence.
  • batch_size (int, default 256) – batch size used when encoding.
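
A minimal construction sketch (assuming MXNet and the bert-embedding package are installed; the GPU context and max_seq_length values shown are illustrative, not defaults):

import mxnet as mx
from bert_embedding import BertEmbedding

# Default configuration: bert_12_768_12 trained on
# book_corpus_wiki_en_uncased, running on CPU.
bert = BertEmbedding()

# Illustrative alternative: run on the first GPU with a longer sequence limit.
# bert = BertEmbedding(ctx=mx.gpu(0), max_seq_length=64)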
embedding(sentences, oov_way='avg', filter_spec_tokens=True)

Get each sentence's tokens and their embeddings.

Parameters:
  • sentences (List[str]) – sentences to encode.
  • oov_way (str, default avg) – how to combine the wordpiece embeddings of an out-of-vocabulary word into a single token embedding: avg, sum, or last.
  • filter_spec_tokens (bool) – whether to filter out the special [CLS] and [SEP] tokens.
Returns:

One (tokens, token embeddings) pair per input sentence.

Return type:

List[Tuple[List[str], List[ndarray]]]
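
A usage sketch for embedding() (the sentence and printed values are illustrative; the 768-dimensional vectors correspond to the default bert_12_768_12 model):

from bert_embedding import BertEmbedding

bert = BertEmbedding()
results = bert.embedding(
    ["BERT produces contextual embeddings."],
    oov_way="avg",            # average wordpiece vectors for out-of-vocabulary words
    filter_spec_tokens=True,  # drop [CLS] and [SEP] from the output
)

tokens, vectors = results[0]  # one (tokens, embeddings) pair per input sentence
print(tokens)                 # list of token strings for the sentence
print(vectors[0].shape)       # (768,) with the default model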