langvae.data_conversion package

Submodules

langvae.data_conversion.sparse module

langvae.data_conversion.sparse.densify_w_padding(x: Tensor, pad_token_id: int) → Tensor

Converts a sparse one-hot tensor into a dense tensor of token ids, using pad_token_id for padded positions.
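The conversion described above can be illustrated without PyTorch. The following is a minimal pure-Python sketch of the idea, not the library's implementation: the name densify_w_padding_sketch is hypothetical, nested lists stand in for tensors, and it assumes that padded positions are stored as all-zero vectors, which this page does not confirm.

```python
from typing import List


def densify_w_padding_sketch(
    one_hot_rows: List[List[List[int]]], pad_token_id: int
) -> List[List[int]]:
    """Map each one-hot vector to its token id.

    Assumption: an all-zero vector marks a padded position and is
    replaced by pad_token_id.
    """
    dense = []
    for row in one_hot_rows:
        ids = []
        for vec in row:
            if any(vec):
                # Position of the single 1 in the one-hot vector.
                ids.append(vec.index(max(vec)))
            else:
                ids.append(pad_token_id)
        dense.append(ids)
    return dense


# Two sequences of length 3 over a vocabulary of size 4.
batch = [
    [[0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]],
    [[0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
]
dense_ids = densify_w_padding_sketch(batch, pad_token_id=0)
```

With the real function, x would be a Tensor rather than nested lists; the sketch only shows the shape change from (batch, length, vocabulary) to (batch, length).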

langvae.data_conversion.tokenization module

class langvae.data_conversion.tokenization.TokenizedAnnotatedDataSet(source: Iterable[Sentence] | Tuple[List[List[str]], List[List[str]]], tokenizer: PreTrainedTokenizer, max_len: int, annotations: Dict[str, List[str]], caching: bool = False, cache_persistence: str = None, return_tensors: bool = True, one_hot: bool = True, tokenizer_options: dict = None)

Bases: TokenizedDataSet

A dataset class that handles tokenization of text data with annotations.

This class extends TokenizedDataSet to handle annotations alongside the tokenization of text. It accepts either an iterable of SAF Sentence objects or a tuple of sentences and their corresponding annotations.

source

The source data containing annotated sentences.

Type:

Union[Iterable[Sentence], Tuple[List[List[str]], List[List[str]]]]

tokenizer

The tokenizer used for tokenization.

Type:

PreTrainedTokenizer

max_len

The maximum length of the tokenized output.

Type:

int

annotations

List of annotation types to be processed.

Type:

List[str]

caching

Whether to cache the tokenized inputs to speed up repeated reads.

Type:

bool

cache_persistence

File path for persisting cached inputs, if caching is activated.

Type:

str

tokenizer_options

Options for the tokenizer.

Type:

dict
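The tuple form of source, Tuple[List[List[str]], List[List[str]]], can be pictured as two parallel lists: one of tokenized sentences and one of per-token annotation labels. The sketch below only builds and checks such a tuple. The helper name check_annotated_source, the example labels, and the requirement that each sentence has exactly one label per token are assumptions made for illustration, not behavior documented here.

```python
from typing import List, Tuple


def check_annotated_source(
    source: Tuple[List[List[str]], List[List[str]]]
) -> int:
    """Verify that tokens and labels line up; return the sentence count.

    Assumption: one annotation label per token in every sentence.
    """
    tokens, labels = source
    if len(tokens) != len(labels):
        raise ValueError("different number of sentences and label rows")
    for i, (sent, sent_labels) in enumerate(zip(tokens, labels)):
        if len(sent) != len(sent_labels):
            raise ValueError(f"sentence {i}: token/label length mismatch")
    return len(tokens)


# Hypothetical example data: two sentences with per-token role labels.
sentences = [["animals", "require", "food"], ["plants", "need", "light"]]
roles = [["ARG0", "V", "ARG1"], ["ARG0", "V", "ARG1"]]
n_sentences = check_annotated_source((sentences, roles))
```

A tuple validated this way has the type the source parameter expects; the tokenizer, max_len, and annotations arguments would still need to be supplied to the constructor.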

class langvae.data_conversion.tokenization.TokenizedDataSet(source: Iterable[Sentence] | List[str], tokenizer: PreTrainedTokenizer, max_len: int, caching: bool = False, cache_persistence: str = None, return_tensors: bool = True, one_hot: bool = True, tokenizer_options: dict = None)

Bases: Dataset

A dataset class that handles the tokenization of text data.

This class converts text data into a tokenized format suitable for model training or evaluation. It accepts either plain strings or structured SAF Sentence objects. With the default one_hot=True, the tokenized output is one-hot encoded for use in neural network models.

source

The source data containing SAF Sentences or strings.

Type:

Union[Iterable[Sentence], List[str]]

tokenizer

The tokenizer used for converting text to tokens.

Type:

PreTrainedTokenizer

max_len

The maximum length of the tokenized sequences.

Type:

int

caching

Whether to cache the tokenized inputs to speed up repeated reads.

Type:

bool

cache_persistence

File path for persisting cached inputs, if caching is activated.

Type:

str

tokenizer_options

Options for the tokenizer.

Type:

dict

vocab_size

Size of the tokenizer vocabulary.

Type:

int
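The roles of max_len, vocab_size, and one-hot encoding can be shown with a toy example. This is a pure-Python sketch under stated assumptions, not the class's implementation: the whitespace tokenizer, the tiny vocabulary, and the helper names encode_sketch and one_hot_sketch are hypothetical, and leaving padded positions as all-zero rows is an assumption about how padding might be represented.

```python
from typing import Dict, List


def encode_sketch(
    text: str, vocab: Dict[str, int], max_len: int,
    pad_id: int = 0, unk_id: int = 1,
) -> List[int]:
    """Tokenize on whitespace, truncate to max_len, then pad to max_len."""
    ids = [vocab.get(tok, unk_id) for tok in text.lower().split()][:max_len]
    ids += [pad_id] * (max_len - len(ids))
    return ids


def one_hot_sketch(
    ids: List[int], vocab_size: int, pad_id: int = 0
) -> List[List[int]]:
    """One row of length vocab_size per token id.

    Assumption: padded positions are left as all-zero rows.
    """
    rows = []
    for token_id in ids:
        row = [0] * vocab_size
        if token_id != pad_id:
            row[token_id] = 1
        rows.append(row)
    return rows


# Hypothetical toy vocabulary; a real PreTrainedTokenizer supplies its own.
vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "cat": 3, "sleeps": 4}
ids = encode_sketch("The cat sleeps", vocab, max_len=5)
encoded = one_hot_sketch(ids, vocab_size=len(vocab))
```

Each item then has shape (max_len, vocab_size), which is why the dataset exposes vocab_size alongside max_len.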

langvae.data_conversion.tokenization.collate_sparse_fn(batch, *, collate_fn_map: dict = None)

langvae.data_conversion.tokenization.get_hash(value: str) → bytes
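
get_hash is undocumented here beyond its signature, which maps a str to bytes. One plausible use of such a function is to derive cache keys for tokenized inputs. The sketch below illustrates that pattern only: the choice of SHA-256, the names get_hash_sketch and cached_encode, and the dictionary cache are all assumptions, not the library's implementation.

```python
import hashlib
from typing import Dict, List


def get_hash_sketch(value: str) -> bytes:
    """Return a fixed-length digest of the input string.

    Assumption: SHA-256 over the UTF-8 encoding; the real function's
    algorithm is not stated in this reference.
    """
    return hashlib.sha256(value.encode("utf-8")).digest()


# Hypothetical in-memory cache keyed by the digest of the raw text.
cache: Dict[bytes, List[str]] = {}


def cached_encode(text: str) -> List[str]:
    """Tokenize once per distinct text; later calls reuse the cached result."""
    key = get_hash_sketch(text)
    if key not in cache:
        cache[key] = text.split()  # stand-in for a real tokenizer call
    return cache[key]


first = cached_encode("the cat sleeps")
second = cached_encode("the cat sleeps")
```

Hashing the text keeps keys at a fixed size regardless of sentence length, which is convenient if the cache is later written to the file named by cache_persistence.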

Module contents