BertEmbedding

BERT embedding.

class bert_embedding.bert.BertEmbedding(ctx=cpu(0), dtype='float32', model='bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased', params_path=None, max_seq_length=25, batch_size=256)

Bases: object

Encodes text into token embeddings using a pre-trained BERT model.

Parameters:
  • ctx (Context) – MXNet context (CPU or GPU device) on which to run BertEmbedding.
  • dtype (str) – data type to use for the model.
  • model (str, default bert_12_768_12) – name of the pre-trained BERT model.
  • dataset_name (str, default book_corpus_wiki_en_uncased) – dataset the pre-trained model was trained on.
  • params_path (str, default None) – path to a parameters file to load instead of the pre-trained model.
  • max_seq_length (int, default 25) – maximum token length of each input sequence.
  • batch_size (int, default 256) – batch size used when encoding.
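
A minimal construction sketch (assuming MXNet and the bert-embedding package are installed; the GPU context and max_seq_length values shown are illustrative, not defaults):

import mxnet as mx
from bert_embedding import BertEmbedding

# Default configuration: bert_12_768_12 trained on
# book_corpus_wiki_en_uncased, running on CPU.
bert = BertEmbedding()

# Illustrative alternative: run on the first GPU with a longer sequence limit.
# bert = BertEmbedding(ctx=mx.gpu(0), max_seq_length=64)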
embedding(sentences, oov_way='avg', filter_spec_tokens=True)

Get each sentence's tokens and their embeddings.

Parameters:
  • sentences (List[str]) – sentences to encode.
  • oov_way (str, default avg) – how to combine the wordpiece embeddings of an out-of-vocabulary word into a single token embedding: avg, sum, or last.
  • filter_spec_tokens (bool) – whether to filter out the special [CLS] and [SEP] tokens.
Returns:

One (tokens, token embeddings) pair per input sentence.

Return type:

List[Tuple[List[str], List[ndarray]]]
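
A usage sketch for embedding() (the sentence and printed values are illustrative; the 768-dimensional vectors correspond to the default bert_12_768_12 model):

from bert_embedding import BertEmbedding

bert = BertEmbedding()
results = bert.embedding(
    ["BERT produces contextual embeddings."],
    oov_way="avg",            # average wordpiece vectors for out-of-vocabulary words
    filter_spec_tokens=True,  # drop [CLS] and [SEP] from the output
)

tokens, vectors = results[0]  # one (tokens, embeddings) pair per input sentence
print(tokens)                 # list of token strings for the sentence
print(vectors[0].shape)       # (768,) with the default model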