dataset

Bases: Dataset

The dataset base class.

This class serves as a template for datasets, providing basic functionality for managing input features, labels, and optional feature encoders.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| X | Any | The input features of the dataset instances. |
| y | Any | The output labels of the dataset instances. |
| encoder | Any, default = None | An optional encoder for feature transformations (e.g., text to embeddings). |

Methods:

| Name | Description |
| --- | --- |
| `__init__` | Initializes the dataset with input features, labels, and an optional encoder. |
| `__len__` | Returns the number of instances in the dataset. |
| `__getitem__` | Retrieves the feature and label of a data instance by index. |

Source code in tinybig/data/base_data.py
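# Imports this excerpt relies on (assumed: Dataset is torch.utils.data.Dataset)
import torch
from torch.utils.data import Dataset


class dataset(Dataset):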
    """
    The dataset base class.

    This class serves as a template for datasets, providing basic functionality for
    managing input features, labels, and optional feature encoders.

    Attributes
    ----------
    X: Any
        The input features of the dataset instances.
    y: Any
        The output labels of the dataset instances.
    encoder: Any, default = None
        An optional encoder for feature transformations (e.g., text to embeddings).

    Methods
    -------
    __init__(X, y, encoder=None, *args, **kwargs)
        Initializes the dataset with input features, labels, and an optional encoder.
    __len__()
        Returns the number of instances in the dataset.
    __getitem__(idx, *args, **kwargs)
        Retrieves the feature and label of a data instance by index.
    """
    def __init__(self, X, y, encoder=None, *args, **kwargs):
        """
        Initializes the dataset class.

        Parameters
        ----------
        X: Any
            The input features of the dataset instances.
        y: Any
            The output labels of the dataset instances.
        encoder: Any, default = None
            An optional encoder for feature transformations (e.g., text to embeddings).

        Returns
        -------
        None
        """
        super().__init__()
        self.X = X
        self.y = y
        self.encoder = encoder

    def __len__(self):
        """
        Returns the number of instances in the dataset.

        Returns
        -------
        int
            The size of the dataset.
        """
        return len(self.X)

    def __getitem__(self, idx, *args, **kwargs):
        """
        Retrieves the feature and label of a data instance by index.

        If an encoder is defined, the feature is transformed using the encoder.

        Parameters
        ----------
        idx: int
            The index of the data instance to retrieve.

        Returns
        -------
        tuple
            A tuple containing the feature and label of the data instance.
        """
        if self.encoder is None:
            sample = self.X[idx]
            target = self.y[idx]
            return sample, target
        else:
            # for the text dataset, the encoder will be applied to obtain its embeddings
            sample = self.encoder(torch.unsqueeze(self.X[idx], 0))
            target = self.y[idx]
            return torch.squeeze(sample, dim=0), target

__getitem__(idx, *args, **kwargs)

Retrieves the feature and label of a data instance by index.

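If an encoder is defined, the feature is wrapped in a batch of one with torch.unsqueeze, passed through the encoder, and squeezed back to a single instance before it is returned. This path therefore requires self.X[idx] to be a torch tensor. Without an encoder, the raw feature and label are returned unchanged.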

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| idx | int | The index of the data instance to retrieve. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple | A tuple containing the feature and label of the data instance. |
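
The two branches of `__getitem__` can be illustrated without installing torch. The sketch below is a stand-in, not the library class: `SketchDataset` and `toy_encoder` are hypothetical names, and plain list wrapping and indexing stand in for `torch.unsqueeze` and `torch.squeeze`. It assumes the encoder takes a batch and returns one output per batch entry.

```python
class SketchDataset:
    """Torch-free stand-in that mirrors the indexing logic described above."""

    def __init__(self, X, y, encoder=None):
        self.X = X
        self.y = y
        self.encoder = encoder

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        if self.encoder is None:
            # No encoder: return the raw feature and label.
            return self.X[idx], self.y[idx]
        # Stand-in for torch.unsqueeze(x, 0): wrap the instance in a batch of one.
        batch = [self.X[idx]]
        encoded = self.encoder(batch)
        # Stand-in for torch.squeeze(sample, dim=0): drop the batch dimension.
        return encoded[0], self.y[idx]


def toy_encoder(batch):
    """Hypothetical batch encoder: maps each token-id list to [length, sum]."""
    return [[len(tokens), sum(tokens)] for tokens in batch]


raw = SketchDataset(X=[[1, 2, 3], [4, 5]], y=[0, 1])
encoded = SketchDataset(X=[[1, 2, 3], [4, 5]], y=[0, 1], encoder=toy_encoder)

raw_sample, raw_label = raw[0]      # feature returned unchanged
enc_sample, enc_label = encoded[1]  # feature passed through the encoder
```

With the real class, the encoder receives a tensor with a leading batch dimension of size one, so an encoder that expects batched input can be reused for single-instance lookups.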

Source code in tinybig/data/base_data.py
def __getitem__(self, idx, *args, **kwargs):
    """
    Retrieves the feature and label of a data instance by index.

    If an encoder is defined, the feature is transformed using the encoder.

    Parameters
    ----------
    idx: int
        The index of the data instance to retrieve.

    Returns
    -------
    tuple
        A tuple containing the feature and label of the data instance.
    """
    if self.encoder is None:
        sample = self.X[idx]
        target = self.y[idx]
        return sample, target
    else:
        # for the text dataset, the encoder will be applied to obtain its embeddings
        sample = self.encoder(torch.unsqueeze(self.X[idx], 0))
        target = self.y[idx]
        return torch.squeeze(sample, dim=0), target

__init__(X, y, encoder=None, *args, **kwargs)

Initializes the dataset class.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | Any | The input features of the dataset instances. | required |
| y | Any | The output labels of the dataset instances. | required |
| encoder | Any | An optional encoder for feature transformations (e.g., text to embeddings). | None |

Returns:

| Type | Description |
| --- | --- |
| None | |

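
The signature accepts `*args` and `**kwargs`, but the constructor body shown on this page stores only `X`, `y`, and `encoder`. The sketch below reproduces that signature on a hypothetical stand-in class, `InitSketch`, to show that extra arguments are accepted and then discarded. It is an illustration of the pattern, not the library class itself.

```python
class InitSketch:
    """Stand-in with the same constructor signature as the class on this page."""

    def __init__(self, X, y, encoder=None, *args, **kwargs):
        # Extra positional and keyword arguments are accepted but not stored.
        self.X = X
        self.y = y
        self.encoder = encoder


ds = InitSketch([1, 2], [0, 1], None, "extra", name="demo")

# Only the three documented attributes exist on the instance.
stored = sorted(vars(ds))
```

A subclass that needs the extra arguments would have to store them itself.
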
Source code in tinybig/data/base_data.py
def __init__(self, X, y, encoder=None, *args, **kwargs):
    """
    Initializes the dataset class.

    Parameters
    ----------
    X: Any
        The input features of the dataset instances.
    y: Any
        The output labels of the dataset instances.
    encoder: Any, default = None
        An optional encoder for feature transformations (e.g., text to embeddings).

    Returns
    -------
    None
    """
    super().__init__()
    self.X = X
    self.y = y
    self.encoder = encoder

__len__()

Returns the number of instances in the dataset.

Returns:

| Type | Description |
| --- | --- |
| int | The size of the dataset. |
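
Because the size is taken from `X` alone, a mismatch between the number of features and the number of labels is not detected by `__len__`. A caller may want to validate the two before constructing the dataset. The helper below, `check_aligned`, is a hypothetical sketch of such a check and is not part of the library.

```python
def check_aligned(X, y):
    """Hypothetical helper: fail early if features and labels differ in length."""
    if len(X) != len(y):
        raise ValueError(f"X has {len(X)} instances but y has {len(y)} labels")
    return len(X)


X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y_good = [0, 1, 0]
y_short = [0, 1]

# Aligned inputs: the returned size matches what __len__ would report.
size = check_aligned(X, y_good)

# Misaligned inputs: the check raises instead of failing later during indexing.
try:
    check_aligned(X, y_short)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```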

Source code in tinybig/data/base_data.py
def __len__(self):
    """
    Returns the number of instances in the dataset.

    Returns
    -------
    int
        The size of the dataset.
    """
    return len(self.X)