dataset

Bases: Dataset

The dataset base class.

It defines the dataset template, which consists of the features X, the labels y, and an optional encoder for the features.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| X | Any | The inputs/features of the data instances in the dataset. |
| y | Any | The outputs/labels of the data instances in the dataset. |
| encoder | Any, default = None | The optional encoder, which can be used for text datasets to project text into embeddings. |

Methods:

| Name | Description |
| --- | --- |
| `__init__` | The dataset initialization method. |
| `__len__` | The size method, returning the number of data instances in the dataset. |
| `__getitem__` | The item retrieval method, returning the data instance at a given index. |

Source code in tinybig/data/base_data.py
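# Imports needed by this excerpt (Dataset is assumed to be torch.utils.data.Dataset).
import torch
from torch.utils.data import Dataset


class dataset(Dataset):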
    """
    The dataset base class.

    It defines the dataset template, which consists of the features X, the labels y, and an optional encoder for the features.

    Attributes
    ----------
    X: Any
        The inputs/features of the data instances in the dataset.
    y: Any
        The outputs/labels of the data instances in the dataset.
    encoder: Any, default = None
        The optional encoder, which can be used for text datasets to project text into embeddings.

    Methods
    ----------
    __init__
        The dataset initialization method.

    __len__
        The size method, returning the number of data instances in the dataset.

    __getitem__
        The item retrieval method, returning the data instance at a given index.
    """
    def __init__(self, X, y, encoder=None, *args, **kwargs):
        """
        The initialization method of the base dataset class.

        It initializes the dataset object with the input features X,
        the output labels y, and the optional encoder.

        Parameters
        ----------
        X: Any
            The inputs/features of the data instances in the dataset.
        y: Any
            The outputs/labels of the data instances in the dataset.
        encoder: Any, default = None
            The optional encoder, which can be used for text datasets to project text into embeddings.

        Returns
        ----------
        object
            The initialized object of the base dataset.
        """
        super().__init__()
        self.X = X
        self.y = y
        self.encoder = encoder

    def __len__(self):
        """
        The dataset size method.

        It implements the built-in `len()` protocol for the dataset.

        Returns
        -------
        int
            The number of data instances in the dataset.
        """
        return len(self.X)

    def __getitem__(self, idx, *args, **kwargs):
        """
        The item retrieval method.

        It returns the feature and label of the data instance at the given index.

        Parameters
        ----------
        idx: int
            The index of the data instance to be retrieved.

        Returns
        -------
        tuple
            The retrieved feature and label of the data instance.
        """
        if self.encoder is None:
            sample = self.X[idx]
            target = self.y[idx]
            return sample, target
        else:
            # for the text dataset, the encoder will be applied to obtain its embeddings
            sample = self.encoder(torch.unsqueeze(self.X[idx], 0))
            target = self.y[idx]
            return torch.squeeze(sample, dim=0), target
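
How a consumer relies on the two protocol methods can be sketched without PyTorch. The snippet below is a minimal illustration, not tinybig code: `SketchDataset` is a hypothetical stand-in that copies only the no-encoder path of the class above, and `iterate_batches` is a hypothetical helper that imitates, in simplified form, what a batching loader does with `__len__` and `__getitem__`.

```python
class SketchDataset:
    """Hypothetical stand-in mirroring the no-encoder path of `dataset`."""

    def __init__(self, X, y, encoder=None):
        self.X = X
        self.y = y
        self.encoder = encoder

    def __len__(self):
        # Same rule as the class above: the size is taken from X.
        return len(self.X)

    def __getitem__(self, idx):
        # No-encoder path: return the raw (feature, label) pair.
        return self.X[idx], self.y[idx]


def iterate_batches(data, batch_size):
    """Hypothetical helper: yield lists of (feature, label) pairs."""
    for start in range(0, len(data), batch_size):
        stop = min(start + batch_size, len(data))
        yield [data[i] for i in range(start, stop)]


data = SketchDataset(X=[[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], y=[1, 1, 0])
batches = list(iterate_batches(data, batch_size=2))
print(len(data))    # number of instances in the dataset
print(batches[0])   # first mini-batch of (feature, label) pairs
```

With the real class, `torch.utils.data.DataLoader` would normally take the place of the hand-written loop.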

__getitem__(idx, *args, **kwargs)

The item retrieval method.

It returns the feature and label of the data instance at the given index.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| idx | int | The index of the data instance to be retrieved. | required |

Returns:

| Type | Description |
| --- | --- |
| tuple | The retrieved feature and label of the data instance. |

Source code in tinybig/data/base_data.py
def __getitem__(self, idx, *args, **kwargs):
    """
    The item retrieval method.

    It returns the feature and label of the data instance at the given index.

    Parameters
    ----------
    idx: int
        The index of the data instance to be retrieved.

    Returns
    -------
    tuple
        The retrieved feature and label of the data instance.
    """
    if self.encoder is None:
        sample = self.X[idx]
        target = self.y[idx]
        return sample, target
    else:
        # for the text dataset, the encoder will be applied to obtain its embeddings
        sample = self.encoder(torch.unsqueeze(self.X[idx], 0))
        target = self.y[idx]
        return torch.squeeze(sample, dim=0), target
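
The encoder branch wraps the single instance in a batch of size one before calling the encoder and removes that extra dimension afterwards. The following pure-Python sketch imitates that pattern with nested lists instead of tensors. `toy_encoder` and `get_item` are hypothetical names introduced for illustration only; the real method uses `torch.unsqueeze` and `torch.squeeze` as shown above.

```python
def toy_encoder(batch):
    """Hypothetical encoder: takes a batch (list of token-id lists) and returns
    one 2-value 'embedding' per instance: [token count, token sum]."""
    return [[float(len(tokens)), float(sum(tokens))] for tokens in batch]


def get_item(X, y, idx, encoder=None):
    """Mirror of the two branches of __getitem__, with lists standing in for tensors."""
    if encoder is None:
        return X[idx], y[idx]
    batch_of_one = [X[idx]]          # analogue of torch.unsqueeze(x, 0)
    encoded = encoder(batch_of_one)  # the encoder sees a leading batch dimension
    return encoded[0], y[idx]        # analogue of torch.squeeze(out, dim=0)


X = [[3, 5, 7], [2, 4]]
y = [0, 1]
raw_sample, raw_target = get_item(X, y, 1)
enc_sample, enc_target = get_item(X, y, 1, encoder=toy_encoder)
print(raw_sample, raw_target)
print(enc_sample, enc_target)
```

The label is returned unchanged in both branches; only the feature passes through the encoder.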

__init__(X, y, encoder=None, *args, **kwargs)

The initialization method of the base dataset class.

It initializes the dataset object with the input features X, the output labels y, and the optional encoder.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| X | Any | The inputs/features of the data instances in the dataset. | required |
| y | Any | The outputs/labels of the data instances in the dataset. | required |
| encoder | Any | The optional encoder, which can be used for text datasets to project text into embeddings. | None |

Returns:

| Type | Description |
| --- | --- |
| object | The initialized object of the base dataset. |

Source code in tinybig/data/base_data.py
def __init__(self, X, y, encoder=None, *args, **kwargs):
    """
    The initialization method of the base dataset class.

    It initializes the dataset object with the input features X,
    the output labels y, and the optional encoder.

    Parameters
    ----------
    X: Any
        The inputs/features of the data instances in the dataset.
    y: Any
        The outputs/labels of the data instances in the dataset.
    encoder: Any, default = None
        The optional encoder, which can be used for text datasets to project text into embeddings.

    Returns
    ----------
    object
        The initialized object of the base dataset.
    """
    super().__init__()
    self.X = X
    self.y = y
    self.encoder = encoder
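
As the source above shows, the constructor only stores its arguments. A small sketch makes the consequences visible: the inputs are kept by reference rather than copied, and extra positional or keyword arguments are accepted but not used. `SketchDataset` is a hypothetical stand-in with the same signature (minus the PyTorch base class), and the `name` keyword is an arbitrary illustrative argument.

```python
class SketchDataset:
    """Hypothetical stand-in with the same constructor signature as `dataset`."""

    def __init__(self, X, y, encoder=None, *args, **kwargs):
        # Inputs are stored as given; nothing is copied, validated, or encoded here.
        self.X = X
        self.y = y
        self.encoder = encoder


features = [[0.1, 0.2], [0.3, 0.4]]
labels = [0, 1]
ds = SketchDataset(features, labels, name="toy")  # extra keyword is accepted and ignored

print(ds.X is features)     # the same object, not a copy
print(ds.encoder is None)   # encoder defaults to None
print(hasattr(ds, "name"))  # the extra keyword is not stored
```

Because nothing is copied, later in-place changes to `features` or `labels` would be visible through the dataset object.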

__len__()

The dataset size method.

It implements the built-in `len()` protocol for the dataset.

Returns:

| Type | Description |
| --- | --- |
| int | The number of data instances in the dataset. |

Source code in tinybig/data/base_data.py
def __len__(self):
    """
    The dataset size method.

    It implements the built-in `len()` protocol for the dataset.

    Returns
    -------
    int
        The number of data instances in the dataset.
    """
    return len(self.X)
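
Since the length is computed from `X` alone, a mismatch between the number of features and labels is not detected by `__len__`. The sketch below illustrates this with a hypothetical stand-in class, `SketchDataset`, and a hypothetical helper, `check_aligned`, that a caller might run before training; neither is part of tinybig.

```python
class SketchDataset:
    """Hypothetical stand-in mirroring `__len__` and the no-encoder `__getitem__`."""

    def __init__(self, X, y):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]


def check_aligned(data):
    """Hypothetical helper: raise if features and labels differ in length."""
    if len(data.X) != len(data.y):
        raise ValueError(f"X has {len(data.X)} instances but y has {len(data.y)}")


mismatched = SketchDataset(X=[10, 20, 30], y=[0, 1])
print(len(mismatched))  # taken from X; y is not consulted

try:
    check_aligned(mismatched)
    aligned = True
except ValueError:
    aligned = False
print(aligned)
```

In this sketch the mismatch would otherwise surface only when the last index is retrieved, as an `IndexError` from the shorter label list.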