Dataset

This module loads and parses input data.

Author: Lucas Pavanelli

class dataset.Data(file_name)

Loads a dataset from a file and exposes its corpus and its input and output vocabularies.

file_name : str

Dataset’s file name

corpus : list

List containing tokens, tags and postags

vocab_in : set

Input vocabulary

in_w2id : dict

Map from word to id for input vocabulary

in_id2w : dict

Map from id to word for input vocabulary

vocab_out : set

Output vocabulary

out_w2id : dict

Map from word to id for output vocabulary

out_id2w : dict

Map from id to word for output vocabulary
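The `w2id` and `id2w` attributes form inverse pairs over each vocabulary. The following is a minimal sketch of how such a pair could be built from a vocabulary set. `build_maps` is a hypothetical helper, and assigning ids by sorted order is an assumption made here for reproducibility; the module may assign ids differently.

```python
def build_maps(vocab):
    """Return (w2id, id2w) for a vocabulary set.

    Hypothetical helper: sorting is an assumption so that ids are
    reproducible; the module itself may use another order.
    """
    w2id = {word: idx for idx, word in enumerate(sorted(vocab))}
    id2w = {idx: word for word, idx in w2id.items()}
    return w2id, id2w


vocab_in = {"the", "cat", "sat"}
in_w2id, in_id2w = build_maps(vocab_in)

# The two maps are inverses of each other.
assert all(in_id2w[in_w2id[word]] == word for word in vocab_in)
```

Keeping both directions avoids a linear search when decoding predicted ids back into words or tags.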

preprocess()

Preprocesses tokens and tags

fit()

Gets train and test data

fit()

Separates the corpus into training and test sets.

(List, List)

Training and test sets, with tokens and tags as PyTorch tensors
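As an illustration of the splitting step in `fit()`, here is a minimal sketch of a shuffled train/test split. `split_corpus`, the `test_ratio` parameter, and the fixed seed are all assumptions; this page does not state how `fit()` chooses the split.

```python
import random


def split_corpus(corpus, test_ratio=0.2, seed=0):
    """Shuffle a copy of `corpus` and split it into (train, test).

    Hypothetical helper: the ratio and the seeded shuffle are assumptions.
    """
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    items = list(corpus)                # leave the caller's list untouched
    random.Random(seed).shuffle(items)  # seeded for a reproducible split
    n_test = int(len(items) * test_ratio)
    return items[n_test:], items[:n_test]


# Toy corpus: one (tokens, tags) pair per sentence.
corpus = [([f"word{i}"], ["TAG"]) for i in range(10)]
train, test = split_corpus(corpus)
```

Shuffling before the split keeps any ordering in the source file from leaking into the test set.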

preprocess(tokens, tags=[])

Converts tokens and tags to PyTorch tensors

tokens : list of str

List of tokens.

tags : list of str

List of tags.

(Tensor, Tensor)

Tokens and tags as PyTorch tensors
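The core of this conversion is a lookup of each token and tag in the vocabulary maps. The sketch below shows that lookup with plain Python lists so it runs without PyTorch; in practice, each id list would then be wrapped with `torch.tensor(ids, dtype=torch.long)`. The `encode` function and the `unk_id` fallback for unseen tokens are assumptions, not part of the documented API.

```python
def encode(tokens, in_w2id, out_w2id, tags=(), unk_id=0):
    """Map tokens and tags to integer ids.

    Hypothetical helper. `unk_id` is an assumed fallback for tokens
    missing from the input vocabulary.
    """
    token_ids = [in_w2id.get(token, unk_id) for token in tokens]
    tag_ids = [out_w2id[tag] for tag in tags]  # empty when no tags are given
    return token_ids, tag_ids


in_w2id = {"<unk>": 0, "the": 1, "cat": 2}
out_w2id = {"DET": 0, "NOUN": 1}

# Training-style call: tokens with their gold tags.
token_ids, tag_ids = encode(["the", "cat"], in_w2id, out_w2id, tags=["DET", "NOUN"])

# Inference-style call: tags omitted, so the tag list comes back empty.
only_tokens, no_tags = encode(["the", "dog"], in_w2id, out_w2id)
```

Leaving `tags` optional mirrors the documented signature, where tags default to an empty list so the same method can serve prediction on unlabeled input.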

class dataset.DataBERT(file_name)

Loads data for BERT model. Inherits from Data class.

file_name : str

Dataset’s file name

tokenizer : list

BERT’s tokenizer

preprocess()

Preprocesses tokens and tags

fit()

Gets train and test data

fit()

Separates the corpus into training and test sets.

(List, List)

Training and test sets, with tokens and tags as PyTorch tensors

preprocess(tokens, tags=[])

Converts tokens and tags to PyTorch tensors

tokens : list of str

List of tokens.

tags : list of str

List of tags.

(Tensor, Tensor, Tensor)

Tokens, subwords and tags as PyTorch tensors
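BERT tokenizers split words into subword pieces, so word-level tags need to be aligned with the pieces. The sketch below shows one common alignment scheme: the first piece of each word keeps the tag and continuation pieces receive a placeholder. `toy_wordpiece`, `align`, and the `"X"` placeholder are hypothetical stand-ins; a real BERT tokenizer uses a learned WordPiece vocabulary, and this page does not say which scheme `DataBERT` follows.

```python
def toy_wordpiece(word, size=4):
    """Stand-in for a real WordPiece tokenizer (hypothetical).

    Cuts the word into fixed-size chunks and marks continuations with '##'.
    """
    pieces = [word[i:i + size] for i in range(0, len(word), size)] or [word]
    return [piece if i == 0 else "##" + piece for i, piece in enumerate(pieces)]


def align(tokens, tags, pad_tag="X"):
    """Expand word-level tags to subword level (assumed scheme)."""
    subwords, sub_tags = [], []
    for token, tag in zip(tokens, tags):
        pieces = toy_wordpiece(token)
        subwords.extend(pieces)
        # The first piece carries the tag; continuations get the placeholder.
        sub_tags.extend([tag] + [pad_tag] * (len(pieces) - 1))
    return subwords, sub_tags


subwords, sub_tags = align(["playing", "ball"], ["VERB", "NOUN"])
# subwords -> ['play', '##ing', 'ball']
# sub_tags -> ['VERB', 'X', 'NOUN']
```

Tagging only the first piece keeps one prediction per original word, which makes it straightforward to map model output back to the word-level tags.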