langvae.data_conversion package
Submodules
langvae.data_conversion.sparse module
langvae.data_conversion.tokenization module
- class langvae.data_conversion.tokenization.TokenizedAnnotatedDataSet(source: Iterable[Sentence] | Tuple[List[List[str]], List[List[str]]], tokenizer: PreTrainedTokenizer, max_len: int, annotations: Dict[str, List[str]], caching: bool = False, cache_persistence: str = None, return_tensors: bool = True, one_hot: bool = True, tokenizer_options: dict = None)
Bases:
TokenizedDataSet
A dataset class that handles tokenization of text data with annotations.
This class extends TokenizedDataSet to handle annotations alongside the tokenization of text. It accepts either an iterable of SAF Sentence objects or a tuple of two parallel lists: the sentences and their corresponding annotations.
- source
The source data containing annotated sentences.
- tokenizer
The tokenizer used for tokenization.
- Type:
PreTrainedTokenizer
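The tuple form of source pairs each sentence with an annotation sequence. The sketch below illustrates one way such parallel lists could be checked, truncated, and padded to max_len. It is a standalone illustration, not the langvae implementation: the helper align_annotations, the padding symbols, and the example "srl" label set are all hypothetical, and it assumes one annotation label per token.

```python
from typing import Dict, List, Tuple

# Hypothetical helper: NOT part of langvae. Assumes one label per token.
def align_annotations(
    source: Tuple[List[List[str]], List[List[str]]],
    max_len: int,
    pad_token: str = "[PAD]",   # assumed padding symbol for tokens
    pad_label: str = "PAD",     # assumed padding symbol for labels
) -> List[Tuple[List[str], List[str]]]:
    sentences, labels = source
    if len(sentences) != len(labels):
        raise ValueError("sentences and annotations must have the same length")
    aligned = []
    for tokens, tags in zip(sentences, labels):
        if len(tokens) != len(tags):
            raise ValueError("each token needs exactly one annotation label")
        # Truncate both sequences to max_len, then pad them to the same length.
        tokens, tags = tokens[:max_len], tags[:max_len]
        pad = max_len - len(tokens)
        aligned.append((tokens + [pad_token] * pad, tags + [pad_label] * pad))
    return aligned

# Illustrative data: an annotation name mapped to its possible labels.
annotations: Dict[str, List[str]] = {"srl": ["ARG0", "V", "ARG1", "PAD"]}
label_ids = {label: i for i, label in enumerate(annotations["srl"])}

source = (
    [["dogs", "chase", "cats"], ["birds", "sing"]],
    [["ARG0", "V", "ARG1"], ["ARG0", "V"]],
)
aligned = align_annotations(source, max_len=4)
encoded_labels = [[label_ids[tag] for tag in tags] for _, tags in aligned]
```

Keeping tokens and labels padded to the same length means a label index can be looked up at any token position, including padded ones.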
- class langvae.data_conversion.tokenization.TokenizedDataSet(source: Iterable[Sentence] | List[str], tokenizer: PreTrainedTokenizer, max_len: int, caching: bool = False, cache_persistence: str = None, return_tensors: bool = True, one_hot: bool = True, tokenizer_options: dict = None)
Bases:
Dataset
A dataset class that handles the tokenization of text data.
This class converts text data into a tokenized format suitable for model training or evaluation. It accepts either plain strings or structured SAF Sentence objects. With the default one_hot=True shown in the signature, the tokenized output is converted into a one-hot encoded format for use in neural network models.
- source
The source data containing SAF Sentences or strings.
- Type:
Union[Iterable[Sentence], List[str]]
- tokenizer
The tokenizer used for converting text to tokens.
- Type:
PreTrainedTokenizer
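The tokenize, pad, and one-hot steps described for TokenizedDataSet can be pictured with a small standalone sketch. It does not use langvae or a PreTrainedTokenizer. The whitespace tokenizer, the toy vocabulary, and the helper names build_vocab, encode, and one_hot are hypothetical stand-ins chosen for illustration, and the real class may behave differently.

```python
from typing import Dict, List

# Hypothetical helpers: a toy stand-in for a real tokenizer, NOT langvae code.
def build_vocab(sentences: List[str]) -> Dict[str, int]:
    vocab = {"[PAD]": 0, "[UNK]": 1}  # assumed special tokens
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    # Map tokens to ids, truncate to max_len, then pad with the [PAD] id.
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in sentence.split()][:max_len]
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))

def one_hot(ids: List[int], vocab_size: int) -> List[List[int]]:
    # One row per position, with a single 1 at the token id's column.
    return [[1 if col == idx else 0 for col in range(vocab_size)] for idx in ids]

sentences = ["the cat sat", "the dog barked loudly"]
vocab = build_vocab(sentences)
ids = encode(sentences[0], vocab, max_len=5)
matrix = one_hot(ids, len(vocab))
```

Each encoded sentence becomes a max_len by vocabulary-size matrix, which is why one-hot output grows quickly with large vocabularies.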