dataset

Bases: Dataset

The dataset base class.

This class serves as a template for datasets, providing basic functionality for managing input features, labels, and optional feature encoders.

Attributes:

Name	Type	Description
`X`	`Any`	The input features of the dataset instances.
`y`	`Any`	The output labels of the dataset instances.
`encoder`	`Any, default = None`	An optional encoder for feature transformations (e.g., text to embeddings).

Methods:

Name	Description
`__init__`	Initializes the dataset with input features, labels, and an optional encoder.
`__len__`	Returns the number of instances in the dataset.
`__getitem__`	Retrieves the feature and label of a data instance by index.
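
As a rough illustration of how the three methods above fit together, the sketch below wraps two tensors and hands the result to a PyTorch `DataLoader`. The class body is a condensed local copy of the source listed on this page, so the snippet runs without tinybig installed; the toy tensors and the batch size are assumptions chosen for the example.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class dataset(Dataset):
    """Condensed local copy of the class documented on this page."""

    def __init__(self, X, y, encoder=None, *args, **kwargs):
        super().__init__()
        self.X = X
        self.y = y
        self.encoder = encoder

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx, *args, **kwargs):
        if self.encoder is None:
            return self.X[idx], self.y[idx]
        sample = self.encoder(torch.unsqueeze(self.X[idx], 0))
        return torch.squeeze(sample, dim=0), self.y[idx]


# Toy data (assumed for illustration): 6 instances with 2 features each.
X = torch.arange(12, dtype=torch.float32).reshape(6, 2)
y = torch.tensor([0, 1, 0, 1, 0, 1])

data = dataset(X, y)
n = len(data)              # number of instances, taken from X
sample, target = data[2]   # third instance, returned unchanged (no encoder)

# The object follows the map-style Dataset protocol, so DataLoader can batch it.
loader = DataLoader(data, batch_size=4, shuffle=False)
batch_shapes = [tuple(xb.shape) for xb, yb in loader]
```

With six instances and a batch size of four, the loader yields one full batch and one partial batch.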

Source code in tinybig/data/base_data.py

# Imports required by this excerpt.
import torch
from torch.utils.data import Dataset


class dataset(Dataset):
    """
    The dataset base class.

    This class serves as a template for datasets, providing basic functionality for
    managing input features, labels, and optional feature encoders.

    Attributes
    ----------
    X: Any
        The input features of the dataset instances.
    y: Any
        The output labels of the dataset instances.
    encoder: Any, default = None
        An optional encoder for feature transformations (e.g., text to embeddings).

    Methods
    -------
    __init__(X, y, encoder=None, *args, **kwargs)
        Initializes the dataset with input features, labels, and an optional encoder.
    __len__()
        Returns the number of instances in the dataset.
    __getitem__(idx, *args, **kwargs)
        Retrieves the feature and label of a data instance by index.
    """
    def __init__(self, X, y, encoder=None, *args, **kwargs):
        """
        Initializes the dataset class.

        Parameters
        ----------
        X: Any
            The input features of the dataset instances.
        y: Any
            The output labels of the dataset instances.
        encoder: Any, default = None
            An optional encoder for feature transformations (e.g., text to embeddings).

        Returns
        -------
        None
        """
        super().__init__()
        self.X = X
        self.y = y
        self.encoder = encoder

    def __len__(self):
        """
        Returns the number of instances in the dataset.

        Returns
        -------
        int
            The size of the dataset.
        """
        return len(self.X)

    def __getitem__(self, idx, *args, **kwargs):
        """
        Retrieves the feature and label of a data instance by index.

        If an encoder is defined, the feature is transformed using the encoder.

        Parameters
        ----------
        idx: int
            The index of the data instance to retrieve.

        Returns
        -------
        tuple
            A tuple containing the feature and label of the data instance.
        """
        if self.encoder is None:
            sample = self.X[idx]
            target = self.y[idx]
            return sample, target
        else:
            # for the text dataset, the encoder will be applied to obtain its embeddings
            sample = self.encoder(torch.unsqueeze(self.X[idx], 0))
            target = self.y[idx]
            return torch.squeeze(sample, dim=0), target

`__getitem__(idx, *args, **kwargs)`

Retrieves the feature and label of a data instance by index.

If an encoder is defined, the feature is transformed using the encoder.

Parameters:

Name	Type	Description	Default
`idx`	`int`	The index of the data instance to retrieve.	required

Returns:

Type	Description
`tuple`	A tuple containing the feature and label of the data instance.
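
To make the encoder branch concrete, here is a minimal sketch that uses `torch.nn.Embedding` as a stand-in encoder. That choice is an assumption for illustration only; this page does not say which encoders tinybig uses. The class body is again a condensed local copy of the listed source. Each instance is a 1-D tensor of token ids, which `__getitem__` expands to a batch of one, encodes, and squeezes back.

```python
import torch
from torch.utils.data import Dataset


class dataset(Dataset):
    """Condensed local copy of the class documented on this page."""

    def __init__(self, X, y, encoder=None, *args, **kwargs):
        super().__init__()
        self.X = X
        self.y = y
        self.encoder = encoder

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx, *args, **kwargs):
        if self.encoder is None:
            return self.X[idx], self.y[idx]
        # Add a leading batch dimension, encode, then drop it again.
        sample = self.encoder(torch.unsqueeze(self.X[idx], 0))
        return torch.squeeze(sample, dim=0), self.y[idx]


# Hypothetical text data: 5 sequences of 3 token ids from a 10-word vocabulary.
token_ids = torch.randint(low=0, high=10, size=(5, 3))
labels = torch.tensor([1, 0, 1, 1, 0])

# Stand-in encoder (assumption): maps each token id to a 4-dimensional vector.
encoder = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)

raw = dataset(token_ids, labels)
encoded = dataset(token_ids, labels, encoder=encoder)

raw_sample, _ = raw[0]              # shape (3,): token ids as stored
enc_sample, enc_label = encoded[0]  # shape (3, 4): one embedding per token
```

Because the encoder runs inside `__getitem__`, the encoding is recomputed on every access rather than cached.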

Source code in tinybig/data/base_data.py

def __getitem__(self, idx, *args, **kwargs):
    """
    Retrieves the feature and label of a data instance by index.

    If an encoder is defined, the feature is transformed using the encoder.

    Parameters
    ----------
    idx: int
        The index of the data instance to retrieve.

    Returns
    -------
    tuple
        A tuple containing the feature and label of the data instance.
    """
    if self.encoder is None:
        sample = self.X[idx]
        target = self.y[idx]
        return sample, target
    else:
        # for the text dataset, the encoder will be applied to obtain its embeddings
        sample = self.encoder(torch.unsqueeze(self.X[idx], 0))
        target = self.y[idx]
        return torch.squeeze(sample, dim=0), target

`__init__(X, y, encoder=None, *args, **kwargs)`

Initializes the dataset class.

Parameters:

Name	Type	Description	Default
`X`	`Any`	The input features of the dataset instances.	required
`y`	`Any`	The output labels of the dataset instances.	required
`encoder`	`Any`	An optional encoder for feature transformations (e.g., text to embeddings).	`None`

Returns:

Type	Description
`None`	The initializer returns nothing.

Source code in tinybig/data/base_data.py

def __init__(self, X, y, encoder=None, *args, **kwargs):
    """
    Initializes the dataset class.

    Parameters
    ----------
    X: Any
        The input features of the dataset instances.
    y: Any
        The output labels of the dataset instances.
    encoder: Any, default = None
        An optional encoder for feature transformations (e.g., text to embeddings).

    Returns
    -------
    None
    """
    super().__init__()
    self.X = X
    self.y = y
    self.encoder = encoder

`__len__()`

Returns the number of instances in the dataset.

Returns:

Type	Description
`int`	The size of the dataset.
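
Since the size is computed as `len(self.X)`, the labels play no part in it, and a mismatch between `X` and `y` would go unnoticed until an out-of-range label is indexed. The sketch below shows this with plain Python lists and adds a `check_lengths` helper. The helper is hypothetical and not part of tinybig; the class body is a condensed local copy of the listed source.

```python
from torch.utils.data import Dataset


class dataset(Dataset):
    """Condensed local copy of the class documented on this page."""

    def __init__(self, X, y, encoder=None, *args, **kwargs):
        super().__init__()
        self.X = X
        self.y = y
        self.encoder = encoder

    def __len__(self):
        return len(self.X)


def check_lengths(data):
    """Hypothetical helper: raise if features and labels differ in length."""
    if len(data.X) != len(data.y):
        raise ValueError(
            f"X has {len(data.X)} instances but y has {len(data.y)} labels"
        )
    return len(data)


# Three feature entries but only two labels (deliberately inconsistent).
data = dataset(X=["a", "b", "c"], y=[0, 1])
size = len(data)  # counts entries of X only

try:
    check_lengths(data)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```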

Source code in tinybig/data/base_data.py

def __len__(self):
    """
    Returns the number of instances in the dataset.

    Returns
    -------
    int
        The size of the dataset.
    """
    return len(self.X)