BertEmbedding¶
BERT embedding.
class bert_embedding.bert.BertEmbedding(ctx=cpu(0), dtype='float32', model='bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased', params_path=None, max_seq_length=25, batch_size=256)[source]¶
Bases: object
Encoding from BERT model.
Parameters: - ctx (Context) – the device context to run BertEmbedding on, e.g. mx.cpu() or mx.gpu(0).
- dtype (str) – data type to use for the model.
- model (str, default 'bert_12_768_12') – name of the pre-trained BERT model.
- dataset_name (str, default 'book_corpus_wiki_en_uncased') – dataset the pre-trained model was trained on.
- params_path (str, default None) – path to a parameters file to load instead of the pre-trained model.
- max_seq_length (int, default 25) – maximum length of each input sequence.
- batch_size (int, default 256) – batch size used for inference.
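A minimal usage sketch for the constructor parameters above, assuming the bert-embedding package and MXNet are installed (the pre-trained weights are fetched on first use, so the construction is guarded here):

```python
# Hypothetical usage sketch: all values shown are the documented defaults.
try:
    import mxnet as mx
    from bert_embedding import BertEmbedding

    bert = BertEmbedding(
        ctx=mx.cpu(),                               # or mx.gpu(0) for GPU 0
        dtype='float32',
        model='bert_12_768_12',
        dataset_name='book_corpus_wiki_en_uncased',
        max_seq_length=25,
        batch_size=256,
    )
except ImportError:
    # bert-embedding / MXNet not available in this environment
    bert = None
```

Passing `params_path` instead would load weights from a local file rather than the named pre-trained model.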
__init__(ctx=cpu(0), dtype='float32', model='bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased', params_path=None, max_seq_length=25, batch_size=256)[source]¶
Encoding from BERT model.
Parameters: - ctx (Context) – the device context to run BertEmbedding on, e.g. mx.cpu() or mx.gpu(0).
- dtype (str) – data type to use for the model.
- model (str, default 'bert_12_768_12') – name of the pre-trained BERT model.
- dataset_name (str, default 'book_corpus_wiki_en_uncased') – dataset the pre-trained model was trained on.
- params_path (str, default None) – path to a parameters file to load instead of the pre-trained model.
- max_seq_length (int, default 25) – maximum length of each input sequence.
- batch_size (int, default 256) – batch size used for inference.
embedding(sentences, oov_way='avg', filter_spec_tokens=True)[source]¶
Get tokens and their embeddings.
Parameters: - sentences (List[str]) – sentences to encode.
- oov_way (str, default 'avg') – how to build the embedding of an out-of-vocabulary word from its subword pieces: 'avg', 'sum', or 'last'.
- filter_spec_tokens (bool) – whether to filter out the special [CLS] and [SEP] tokens.
Returns: a list of (tokens, token embeddings) tuples, one per input sentence
Return type: List[(List[str], List[ndarray])]
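The oov_way options can be illustrated with a small NumPy sketch. This is not the library's implementation; `pool_subword_embeddings` is a hypothetical helper showing the idea: an out-of-vocabulary word is split into WordPiece pieces (continuation pieces start with '##'), and its single embedding is recovered by combining the piece vectors.

```python
import numpy as np

def pool_subword_embeddings(pieces, piece_embs, oov_way='avg'):
    """Merge WordPiece embeddings back into whole-word embeddings.

    pieces     : list of WordPiece strings, e.g. ['em', '##bed', '##ding']
    piece_embs : one embedding vector per piece
    oov_way    : 'avg', 'sum', or 'last' -- how piece vectors are combined
    """
    tokens, groups = [], []
    for piece, emb in zip(pieces, piece_embs):
        if piece.startswith('##') and tokens:
            tokens[-1] += piece[2:]      # extend the current word
            groups[-1].append(emb)       # collect its piece vectors
        else:
            tokens.append(piece)
            groups.append([emb])
    if oov_way == 'avg':
        merged = [np.mean(g, axis=0) for g in groups]
    elif oov_way == 'sum':
        merged = [np.sum(g, axis=0) for g in groups]
    elif oov_way == 'last':
        merged = [g[-1] for g in groups]
    else:
        raise ValueError("oov_way must be 'avg', 'sum' or 'last'")
    return tokens, merged

# 'embedding' split into three pieces; recombine with averaging.
tokens, merged = pool_subword_embeddings(
    ['em', '##bed', '##ding'],
    [np.ones(4), np.zeros(4), np.full(4, 2.0)],
    oov_way='avg',
)
```

With 'avg' each OOV word gets the mean of its piece vectors; 'sum' adds them; 'last' keeps only the final piece's vector.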