Portuguese NER for Biomedical Information Extraction

Dataset

This module loads and parses input data. Author: Lucas Pavanelli

class dataset.Data(file_name)

Loads data.

file_name (str): Dataset’s file name

corpus (list): List containing tokens, tags and POS tags
vocab_in (set): Input vocabulary
in_w2id (dict): Map from word to id for input vocabulary
in_id2w (dict): Map from id to word for input vocabulary
vocab_out (set): Output vocabulary
out_w2id (dict): Map from word to id for output vocabulary
out_id2w (dict): Map from id to word for output vocabulary
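
The vocabulary attributes above can be pictured with a small sketch. It is not the module's actual code: the toy corpus, the tag names and the helper `build_vocab` are assumptions made for illustration, and the real class may assign ids in a different order.

```python
# Toy corpus of (tokens, tags) pairs; the sentences and tag names are made up.
corpus = [
    (["paciente", "com", "febre"], ["O", "O", "B-Symptom"]),
    (["febre", "alta"], ["B-Symptom", "I-Symptom"]),
]


def build_vocab(sequences):
    """Return (vocab, w2id, id2w) for an iterable of string sequences.

    Hypothetical helper, not part of the dataset module.
    """
    vocab = {item for seq in sequences for item in seq}
    # Sort before numbering so ids are reproducible (set order is not).
    w2id = {word: idx for idx, word in enumerate(sorted(vocab))}
    id2w = {idx: word for word, idx in w2id.items()}
    return vocab, w2id, id2w


# Input side: words. Output side: tags.
vocab_in, in_w2id, in_id2w = build_vocab(tokens for tokens, _ in corpus)
vocab_out, out_w2id, out_id2w = build_vocab(tags for _, tags in corpus)
```

Each pair of maps is an exact inverse of the other, so a predicted tag id can be turned back into its label with `out_id2w`.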

preprocess(): Preprocesses tokens and tags
fit(): Gets train and test data

fit()

Separates the corpus into training and test sets

(List, List): Tokens and tags as PyTorch tensors
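
A train/test split of this kind might look like the sketch below. The function name `split_corpus`, the 80/20 ratio and the fixed seed are assumptions; the documentation does not state how `fit()` chooses its split.

```python
import random


def split_corpus(corpus, test_fraction=0.2, seed=0):
    """Shuffle a copy of the corpus and return (train, test).

    Hypothetical helper; the ratio and the seed are illustrative choices.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(corpus)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


sentences = [f"sentence-{i}" for i in range(10)]
train, test = split_corpus(sentences)
```

Seeding a local `random.Random` instance keeps the split reproducible without touching the global random state.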

preprocess(tokens, tags=[])

Converts tokens and tags to PyTorch tensors

tokens (list of string): List of tokens.
tags (list of string): List of tags.

(Tensor, Tensor): Tokens and tags as PyTorch tensors
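
The id lookup behind such a conversion can be sketched in plain Python. The function `encode`, the `<unk>` entry and the example vocabularies are assumptions, not the module's code. The sketch stops at integer lists; wrapping each one with `torch.tensor(ids, dtype=torch.long)` would give the tensors.

```python
# Example vocabularies; the "<unk>" entry for unseen words is an assumption.
in_w2id = {"<unk>": 0, "paciente": 1, "com": 2, "febre": 3}
out_w2id = {"O": 0, "B-Symptom": 1}


def encode(tokens, tags=None, unk_id=0):
    """Map tokens and tags to integer ids (hypothetical helper)."""
    # None replaces the mutable default `[]`, which Python would share
    # between calls.
    tags = [] if tags is None else tags
    token_ids = [in_w2id.get(token, unk_id) for token in tokens]
    tag_ids = [out_w2id[tag] for tag in tags]
    return token_ids, tag_ids


# Training time: tokens and gold tags. "tosse" is out of vocabulary.
train_ids = encode(["paciente", "com", "tosse"], ["O", "O", "B-Symptom"])
# Prediction time: tokens only, so the tag list stays empty.
predict_ids = encode(["febre"])
```

Leaving `tags` optional lets the same function serve both training, where gold tags exist, and prediction, where they do not.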

class dataset.DataBERT(file_name)

Loads data for BERT model. Inherits from Data class.

file_name (str): Dataset’s file name

tokenizer (list): BERT’s tokenizer

preprocess(): Preprocesses tokens and tags
fit(): Gets train and test data

fit()

Separates the corpus into training and test sets

(List, List): Tokens and tags as PyTorch tensors

preprocess(tokens, tags=[])

Converts tokens and tags to PyTorch tensors

tokens (list of string): List of tokens.
tags (list of string): List of tags.

(Tensor, Tensor, Tensor): Tokens, subwords and tags as PyTorch tensors
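
BERT tokenizers split words into subwords, so word-level tags have to be realigned with the longer subword sequence. The sketch below shows one common convention: the tag goes on a word's first subword and a placeholder on the rest. Everything in it is an assumption. `toy_wordpiece` is a stand-in that cuts words into fixed-size pieces, not a real BERT tokenizer, and the `X` placeholder and the returned mask may differ from what `DataBERT.preprocess` produces.

```python
def toy_wordpiece(word, piece_len=4):
    """Hypothetical stand-in for a BERT tokenizer: fixed-size pieces."""
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    pieces = pieces or [word]  # keep an empty word as one (empty) piece
    # Continuation pieces carry the WordPiece-style "##" prefix.
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def align(tokens, tags, tokenize=toy_wordpiece, pad_tag="X"):
    """Expand word-level tags to subword level.

    Returns (subwords, first_mask, sub_tags); first_mask is 1 where a
    subword starts a word and 0 elsewhere.
    """
    subwords, first_mask, sub_tags = [], [], []
    for token, tag in zip(tokens, tags):
        pieces = tokenize(token)
        subwords.extend(pieces)
        first_mask.extend([1] + [0] * (len(pieces) - 1))
        sub_tags.extend([tag] + [pad_tag] * (len(pieces) - 1))
    return subwords, first_mask, sub_tags


subwords, first_mask, sub_tags = align(
    ["febre", "alta"], ["B-Symptom", "I-Symptom"]
)
```

The mask makes it possible to recover one prediction per original word by keeping only the positions marked 1.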