Dataset
This module loads and parses input data. Author: Lucas Pavanelli
- class dataset.Data(file_name)
Loads data.
- file_name : str
Dataset’s file name
- corpus : list
List containing tokens, tags and POS tags
- vocab_in : set
Input vocabulary
- in_w2id : dict
Map from word to id for input vocabulary
- in_id2w : dict
Map from id to word for input vocabulary
- vocab_out : set
Output vocabulary
- out_w2id : dict
Map from word to id for output vocabulary
- out_id2w : dict
Map from id to word for output vocabulary
- preprocess()
Preprocesses tokens and tags
- fit()
Gets train and test data
- fit()
Separates corpus into training and test set
- (List, List)
Tokens and tags as PyTorch tensors
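The split that fit() describes can be pictured with the sketch below. It is only an illustration, not the module's implementation: the function name `split_corpus`, the 80/20 ratio, the fixed seed and the (tokens, tags) layout of each corpus entry are all assumptions.

```python
import random


def split_corpus(corpus, train_ratio=0.8, seed=0):
    """Shuffle a copy of the corpus and split it into (train, test).

    Hypothetical helper: the ratio and seed are illustrative defaults.
    """
    sentences = list(corpus)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(sentences)
    cut = int(len(sentences) * train_ratio)
    return sentences[:cut], sentences[cut:]


# Toy corpus (assumed layout): one (tokens, tags) pair per sentence.
corpus = [
    (["the", "cat", "sleeps"], ["DET", "NOUN", "VERB"]),
    (["dogs", "bark"], ["NOUN", "VERB"]),
    (["she", "reads"], ["PRON", "VERB"]),
    (["birds", "sing"], ["NOUN", "VERB"]),
    (["he", "runs"], ["PRON", "VERB"]),
]

train, test = split_corpus(corpus)
print(len(train), len(test))
```

Seeding the shuffle keeps the split reproducible across runs, which makes results on the test set comparable.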
- preprocess(tokens, tags=[])
Converts tokens and tags to PyTorch tensors
- tokens : list of string
List of tokens.
- tags : list of string
List of tags.
- (Tensor, Tensor)
Tokens and tags as PyTorch tensors
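As a rough sketch of how word-to-id maps such as in_w2id and out_w2id can turn tokens and tags into integer ids, consider the example below. The helpers `build_vocab` and `encode` are hypothetical and not part of this module; the example stops at plain Python lists, which could then be wrapped with `torch.tensor(...)` if PyTorch is installed.

```python
def build_vocab(words):
    """Return (w2id, id2w) for the distinct words, numbered in sorted order."""
    vocab = sorted(set(words))
    w2id = {word: idx for idx, word in enumerate(vocab)}
    id2w = {idx: word for word, idx in w2id.items()}
    return w2id, id2w


def encode(tokens, tags=None, *, in_w2id, out_w2id):
    """Map tokens and (optional) tags to lists of integer ids."""
    token_ids = [in_w2id[token] for token in tokens]
    tag_ids = [out_w2id[tag] for tag in (tags or [])]
    return token_ids, tag_ids


tokens = ["the", "cat", "sleeps"]
tags = ["DET", "NOUN", "VERB"]

in_w2id, in_id2w = build_vocab(tokens)
out_w2id, out_id2w = build_vocab(tags)

token_ids, tag_ids = encode(tokens, tags, in_w2id=in_w2id, out_w2id=out_w2id)
print(token_ids)  # [2, 0, 1]
print(tag_ids)    # [0, 1, 2]
```

The sketch uses `tags=None` instead of a mutable `[]` default, so that no list object is shared between calls.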
- class dataset.DataBERT(file_name)
Loads data for BERT model. Inherits from Data class.
- file_name : str
Dataset’s file name
- tokenizer : list
BERT’s tokenizer
- preprocess()
Preprocesses tokens and tags
- fit()
Gets train and test data
- fit()
Separates corpus into training and test set
- (List, List)
Tokens and tags as PyTorch tensors
- preprocess(tokens, tags=[])
Converts tokens and tags to PyTorch tensors
- tokens : list of string
List of tokens.
- tags : list of string
List of tags.
- (Tensor, Tensor, Tensor)
Tokens, subwords and tags as PyTorch tensors
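Because BERT splits words into subwords, word-level tags have to be lined up with the subword sequence. The sketch below shows one common way to keep that alignment, by recording the position of each word's first subword. It is an assumption about what the "subwords" output might encode, not a description of this class. `toy_wordpiece` is a hypothetical stand-in for BERT's tokenizer (it just cuts words into three-character pieces), and `align` is likewise illustrative.

```python
def toy_wordpiece(word, max_len=3):
    """Hypothetical tokenizer: cut a non-empty word into max_len-character
    pieces and mark every continuation piece with a '##' prefix."""
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def align(tokens):
    """Return the flat subword list and, for each token, the index of its
    first subword in that list."""
    subwords, first_index = [], []
    for token in tokens:
        first_index.append(len(subwords))
        subwords.extend(toy_wordpiece(token))
    return subwords, first_index


tokens = ["the", "sleeping", "cat"]
tags = ["DET", "ADJ", "NOUN"]

subwords, first_index = align(tokens)
print(subwords)     # ['the', 'sle', '##epi', '##ng', 'cat']
print(first_index)  # [0, 1, 4]

# One tag per original token, attached to that token's first subword.
tagged = [(subwords[i], tag) for i, tag in zip(first_index, tags)]
print(tagged)       # [('the', 'DET'), ('sle', 'ADJ'), ('cat', 'NOUN')]
```

Keeping one index per original token means predictions made on subwords can be read back at the word level, so the tag sequence keeps its original length.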