langvae.data_conversion package

Submodules

langvae.data_conversion.sparse module

langvae.data_conversion.sparse.densify_w_padding(x: Tensor, pad_token_id: int) → Tensor

Converts sparse one-hot tensors to token IDs, with padding.
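The conversion from one-hot rows to padded token IDs can be illustrated with a small, library-free sketch. It is not the package's implementation: it uses nested Python lists instead of tensors, the helper name densify_w_padding_sketch is hypothetical, and it assumes that an all-zero row marks a padding position.

```python
from typing import List


def densify_w_padding_sketch(
    one_hot: List[List[List[int]]], pad_token_id: int
) -> List[List[int]]:
    """Map each one-hot row to the index of its active entry.

    Assumption (not confirmed by the documentation): a row with no
    active entry is a padding position and becomes pad_token_id.
    """
    dense = []
    for sequence in one_hot:
        ids = []
        for row in sequence:
            # row.index(1) is the token ID encoded by a one-hot row.
            ids.append(row.index(1) if 1 in row else pad_token_id)
        dense.append(ids)
    return dense


# Two sequences of length 3 over a vocabulary of size 4.
batch = [
    [[0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]],
    [[0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
]
token_ids = densify_w_padding_sketch(batch, pad_token_id=0)
```

In this sketch every sequence keeps its original length, so the output has the same batch and sequence dimensions as the input with the vocabulary dimension removed.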

langvae.data_conversion.tokenization module

class langvae.data_conversion.tokenization.TokenizedAnnotatedDataSet(source: Iterable[Sentence] | Tuple[List[List[str]], List[List[str]]], tokenizer: PreTrainedTokenizer, max_len: int, annotations: Dict[str, List[str]], caching: bool = False, cache_persistence: str = None, return_tensors: bool = True, one_hot: bool = True, tokenizer_options: dict = None)

Bases: TokenizedDataSet

A dataset class that handles tokenization of text data with annotations.

This class extends TokenizedDataSet to include handling of annotations alongside the tokenization of text. It supports both simple lists of SAF Sentence objects and tuples of sentences with corresponding annotations.

source

The source data containing annotated sentences.

Type: Union[Iterable[Sentence], Tuple[List[List[str]], List[List[str]]]]

tokenizer

The tokenizer used for tokenization.

Type: PreTrainedTokenizer

max_len

The maximum length of the tokenized output.

Type: int

annotations

List of annotation types to be processed.

Type: List[str]

caching

Whether to cache the tokenized inputs to speed up reads.

Type: bool

cache_persistence

File path for persisting cached inputs, if caching is activated.

Type: str

tokenizer_options

Options for the tokenizer.

Type: dict
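The tuple form of source, Tuple[List[List[str]], List[List[str]]], can be pictured as two parallel lists. The sketch below is an illustration only: it assumes that the first list holds tokenized sentences and the second holds one annotation label per token, which the type signature suggests but the documentation does not confirm. The helper name pair_tokens_with_labels and the example labels are hypothetical.

```python
from typing import List, Tuple


def pair_tokens_with_labels(
    source: Tuple[List[List[str]], List[List[str]]]
) -> List[List[Tuple[str, str]]]:
    """Check that sentences and annotations line up, then pair them.

    Assumption: source[0] is a list of tokenized sentences and
    source[1] holds one label per token of the matching sentence.
    """
    sentences, labels = source
    if len(sentences) != len(labels):
        raise ValueError("sentence and annotation counts differ")
    paired = []
    for i, (tokens, tags) in enumerate(zip(sentences, labels)):
        if len(tokens) != len(tags):
            raise ValueError(f"sentence {i}: token and label counts differ")
        paired.append(list(zip(tokens, tags)))
    return paired


example_source = (
    [["the", "cat", "sleeps"], ["dogs", "bark"]],
    [["DET", "NOUN", "VERB"], ["NOUN", "VERB"]],
)
pairs = pair_tokens_with_labels(example_source)
```

Validating alignment before tokenization is a design choice of this sketch; it surfaces mismatched inputs early rather than during training.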

class langvae.data_conversion.tokenization.TokenizedDataSet(source: Iterable[Sentence] | List[str], tokenizer: PreTrainedTokenizer, max_len: int, caching: bool = False, cache_persistence: str = None, return_tensors: bool = True, one_hot: bool = True, tokenizer_options: dict = None)

Bases: Dataset

A dataset class that handles the tokenization of text data.

This class converts text data into a tokenized format suitable for model training or evaluation. It accepts either plain strings or structured SAF Sentence objects. By default (one_hot=True), the tokenized output is converted into a one-hot encoded format for use in neural network models.
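The pipeline described above (tokenize, fit to max_len, one-hot encode) can be sketched without any dependencies. This is a conceptual illustration, not the class's implementation: whitespace splitting stands in for a real PreTrainedTokenizer, the toy vocabulary and the helper name encode_one_hot_sketch are hypothetical, and encoding padding positions as one-hot rows at the pad ID is an assumption.

```python
from typing import Dict, List


def encode_one_hot_sketch(
    text: str,
    vocab: Dict[str, int],
    max_len: int,
    pad_token_id: int = 0,
    unk_token_id: int = 1,
) -> List[List[int]]:
    """Tokenize, truncate or pad to max_len, and one-hot encode."""
    # Whitespace splitting is a stand-in for a real tokenizer.
    tokens = text.split()[:max_len]
    ids = [vocab.get(token, unk_token_id) for token in tokens]
    # Right-pad so every example has exactly max_len positions.
    ids += [pad_token_id] * (max_len - len(ids))
    vocab_size = len(vocab)
    return [[1 if j == i else 0 for j in range(vocab_size)] for i in ids]


toy_vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3, "sleeps": 4}
encoded = encode_one_hot_sketch("the cat sleeps soundly", toy_vocab, max_len=5)
```

Here encoded has max_len rows and one column per vocabulary entry; the out-of-vocabulary word "soundly" maps to the unknown-token ID, and the final position is padding.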

source

The source data containing SAF Sentences or strings.

Type: Union[Iterable[Sentence], List[str]]