Dataset Importation

Nyuntam supports a range of custom dataset formats. Some algorithms can also run without requiring any custom dataset.

Dataset Preparation Guidelines

Image Classification

For image classification tasks, Nyuntam is compatible with the standard ImageNet dataset format: images are organized into one folder per class, with the train/validation split already performed. The prescribed structure is as follows:

[Dataset folder]/
| -- train
|   | -- Class_A
|   |   | -- image1.jpg
|   |   | -- image2.jpg
|   |   | -- ...
|   | -- Class_B
|   |   | -- image1.jpg
|   |   | -- image2.jpg
|   |   | -- ...
|   | -- ...
| -- val
|   | -- Class_A
|   |   | -- image1.jpg
|   |   | -- image2.jpg
|   |   | -- ...
|   | -- Class_B
|   |   | -- image1.jpg
|   |   | -- image2.jpg
|   |   | -- ...
|   | -- ...
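
The layout above is the same one expected by torchvision's ImageFolder, so a prepared dataset can be sanity-checked with it before import (an illustrative check only; torchvision is not required by Nyuntam, and the paths are placeholders):

import torchvision.datasets as datasets

# ImageFolder infers class names from the per-class folder names.
train_set = datasets.ImageFolder("path/to/dataset/train")
val_set = datasets.ImageFolder("path/to/dataset/val")

print(train_set.classes)              # e.g. ['Class_A', 'Class_B', ...]
print(len(train_set), len(val_set))   # number of images per split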

Object Detection

Nyuntam extends support to both COCO and VOC dataset formats for object detection tasks. The formats are detailed below:

COCO Format

[Dataset Folder]/root/
| -- train2017
|   | -- (Contains training images)
| -- val2017
|   | -- (Contains validation images)
| -- annotations
|   | -- instances_train2017.json  (Train Annotations)
|   | -- instances_val2017.json    (Val Annotations)
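
A prepared COCO-format folder can be sanity-checked with pycocotools before import (a hypothetical check, not part of Nyuntam; install with pip install pycocotools):

from pycocotools.coco import COCO

# Load the training annotations and confirm images, annotations, and categories resolve.
coco = COCO("root/annotations/instances_train2017.json")
print(len(coco.getImgIds()), "images,", len(coco.getAnnIds()), "annotations")
print([c["name"] for c in coco.loadCats(coco.getCatIds())][:5])  # first few category names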

VOC Format

[Dataset Folder]/VOCdevkit/
| -- VOC2012
|   | -- Annotations
|   |   | -- (Contains annotation files)
|   | -- JPEGImages
|   |   | -- (Contains JPEG images)

Segmentation

For image segmentation tasks, Nyuntam Vision exclusively supports the VOC Format. The layout is structured as follows:

VOCdevkit/
| -- VOC2012
|   | -- Annotations
|   |   | -- (Contains annotation files)
|   | -- JPEGImages
|   |   | -- (Contains JPEG images)
|   | -- ImageSets
|   |   | -- (Contains ImageSets files)
|   | -- SegmentationClass
|   |   | -- (Contains SegmentationClass files)
|   | -- SegmentationObject
|   |   | -- (Contains SegmentationObject files)

Nyuntam Adapt requires the dataset to be in the following format:

custom_data/
| -- dataset
|   | -- images (images in jpg format)
|   | -- labels (segmentation maps in png format)
|   | -- splits 
|   |   | -- train.txt (names of the training images, one per line, without extensions)
|   |   | -- val.txt (names of the validation images, one per line, without extensions)
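
The split files list image basenames without extensions, one per line. A minimal sketch for generating them from an existing images folder (the paths and the 90/10 split ratio are assumptions):

import os
import random

# Hypothetical paths matching the layout above; adjust to your dataset.
image_dir = "custom_data/dataset/images"
split_dir = "custom_data/dataset/splits"
os.makedirs(split_dir, exist_ok=True)

# Collect basenames (without the .jpg extension) and shuffle them.
names = [os.path.splitext(f)[0] for f in os.listdir(image_dir) if f.endswith(".jpg")]
random.shuffle(names)

# Write a 90/10 train/val split, one basename per line.
cut = int(0.9 * len(names))
with open(os.path.join(split_dir, "train.txt"), "w") as f:
    f.write("\n".join(names[:cut]))
with open(os.path.join(split_dir, "val.txt"), "w") as f:
    f.write("\n".join(names[cut:]))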

Pose Detection

For pose detection tasks, Nyuntam exclusively supports the COCO-Pose Format. The layout is structured as follows:

coco-pose/
| -- images
|   | -- train2017
|   |   | -- (Contains training images in JPEG format)
|   | -- val2017
|   |   | -- (Contains validation images in JPEG format)
| -- annotations
|   | -- person_keypoints_train2017.json (Train annotations in COCO-pose format)
|   | -- person_keypoints_val2017.json   (Val annotations in COCO-pose format)

Nyuntam Text-Generation

Nyuntam can load any text dataset compatible with Hugging Face datasets: use datasets.load_dataset for Hub datasets, or datasets.load_from_disk for custom datasets. Two main dataset formats are supported:

LLM - Single Column

This format is suitable for use cases where the dataset is already formatted with a single text column, for example the "wikitext" dataset with "text" as the TEXT_COLUMN.
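
For instance, such a dataset can be loaded directly with the Hugging Face datasets library (a sketch; the subset name here is only an example):

from datasets import load_dataset

# DATASET_SUBNAME maps to the second argument; "wikitext-2-raw-v1" is an example subset.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(dataset["text"][:3])  # the single TEXT_COLUMN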

LLM - Multi Columns

This format is suitable for datasets with multiple columns. This option can also apply simple formatting to the dataset for instructional use cases (see the example usage below).

Note that both single-column and multi-column datasets can be loaded with either option. However, multi-column datasets are best loaded with the multi-column option, especially when simple formatting is desired.

Parameters for Dataset Loading:

Parameter | Default Value | Description
--- | --- | ---
DATASET_SUBNAME | null | Subname of the dataset, if applicable.
TEXT_COLUMN | text | Specifies the text column(s) to be used. If multiple columns are present, they should be separated by commas.
SPLIT | train | Specifies the split of the dataset to load, such as 'train', 'validation', or 'test'.
FORMAT_STRING | null | If provided, this string is used to format the dataset. It allows for customization of the dataset's text representation.

Dataset Formatting:

The process responsible for formatting the dataset based on the provided parameters follows these steps:

  1. If no format string is provided, the TEXT_COLUMN and the dataset are used as-is.
  2. If a format string is provided, it is applied to format the dataset. The format string should contain placeholders for the columns specified in TEXT_COLUMN.
  3. Once the dataset is formatted, a log message indicates that the format string was found and used, and displays a sample of the formatted dataset.
  4. Finally, the dataset is mapped so that the original "text" column is replaced with the newly formatted text, or a new "text" column is created.
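
A minimal sketch of this formatting step, assuming the format-string placeholders match the comma-separated TEXT_COLUMN names (the helper below is illustrative, not Nyuntam's actual implementation):

def format_dataset(dataset, text_column, format_string=None):
    # Step 1: without a format string, use the columns as-is.
    if format_string is None:
        return dataset

    columns = [c.strip() for c in text_column.split(",")]

    # Step 2: fill the placeholders from each row's columns.
    def apply_format(example):
        return {"text": format_string.format(**{c: example[c] for c in columns})}

    # Step 4: map over the dataset, creating or replacing the "text" column.
    formatted = dataset.map(apply_format)
    # Step 3: log that the format string was used, with a sample.
    print("Format string found; sample:", formatted[0]["text"][:80])
    return formatted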

Example usage for Multi-column dataset:

Params for Alpaca Dataset (yahma/alpaca-cleaned)

DATASET_SUBNAME - null
TEXT_COLUMN - input,output,instruction
SPLIT - train
FORMAT_STRING - Instruction:\n{instruction}\n\nInput:\n{input}\n\nOutput:\n{output}

Alpaca dataset before Formatting

input | instruction | output
--- | --- | ---
How to bake a cake | Step 1: Preheat the oven to... | A delicious homemade cake...
Introduction to Python | Python is a high-level... | Learn Python programming...
History of the Roman Empire | The Roman Empire was... | Explore the rise and fall...
... | ... | ...

Alpaca dataset after formatting with the input params

input | instruction | output | text
--- | --- | --- | ---
How to bake a cake | Step 1: Preheat the oven to... | A delicious homemade cake... | Instruction:\nStep 1: Preheat...\n\nInput:\nHow to bake a cake\n\nOutput:\nA delicious homemade cake...
Introduction to Python | Python is a high-level... | Learn Python programming... | Instruction:\nPython is a hi...\n\nInput:\nIntroduction to Python\n\nOutput:\nLearn Python programming...
History of the Roman Empire | The Roman Empire was... | Explore the rise and fall... | Instruction:\nThe Roman Emp...\n\nInput:\nHistory of the Roman Empire\n\nOutput:\nExplore the rise and fall...
... | ... | ... | ...

Text Classification

Nyuntam supports token classification and text classification. Users can load any text dataset compatible with Hugging Face datasets: use datasets.load_dataset for Hub datasets, or datasets.load_from_disk for custom datasets. Make sure the uploaded dataset has the following format:

Parameter | Data Type | Default Value | Description
--- | --- | --- | ---
input_column | str | "text" | Name of the input text column: "tokens" for token classification (NER, POS, chunking) and "text" for text classification.
target_column | str | "label" | Name of the target column: "ner_tags" for NER, "pos_tags" for POS tagging, "chunk_tags" for chunking, and "label" for text classification.

Example dataset formats:

TEXT CLASSIFICATION DATASET

text | label
--- | ---
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also... | 0
"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken... | 0
If only to avoid making this type of film in the future. This film is interesting as an experiment but tells no cogent story. One might... | 0
This film was probably inspired by Godard's Masculin, féminin and I urge you to see that film instead. The film has two strong elements and... | 0
Oh, brother...after hearing about this ridiculous film for umpteen years all I can think of is that old Peggy Lee song.. "Is that all there is?"... | 0

TOKEN CLASSIFICATION DATASET

id | tokens | pos_tags
--- | --- | ---
0 | ["EU", "rejects", "German", "call", "to", ...] | [22, 42, 16, 21, 35, 37, 16, 21, ...]
1 | ["Peter", "Blackburn"] | [22, 22]
2 | ["BRUSSELS", "1996-08-22"] | [22, 11]
3 | ["The", "European", "Commission", "said", "on", ...] | [12, 22, 22, 38, 15, 22, 28, 38, ...]
4 | ["Germany", "'s", "representative", "to", ...] | [22, 27, 21, 35, 12, 22, 22, 27, ...]

Summarization

Nyuntam can load any text dataset compatible with Hugging Face datasets: use datasets.load_dataset for Hub datasets, or datasets.load_from_disk for custom datasets. Make sure the uploaded dataset has the following format:

document | summary
--- | ---
The full cost of damage in Newton Stewart, one of the areas worst affected, is still... | Clean-up operations are continuing across the Scottish Borders and Dumfries and...
A fire alarm went off at the Holiday Inn in Hope Street at about 04:20 BST on Saturday... | Two tourist buses have been destroyed by fire in a suspected arson attack in Belfas...
Ferrari appeared in a position to challenge until the final laps, when the Mercedes... | Lewis Hamilton stormed to pole position at the Bahrain Grand Prix ahead of Mercedes...
John Edward Bates, formerly of Spalding, Lincolnshire, but now living in London,... | A former Lincolnshire Police officer carried out a series of sex attacks on...
Patients and staff were evacuated from Cerahpasa hospital on Wednesday after a man... | An armed man who locked himself into a room at a psychiatric hospital in Istanbul has...

The dataset params for summarization tasks in nyuntam-adapt are as follows:

Key | Value | Description
--- | --- | ---
DATASET_SUBNAME | null | Subname of the dataset, if applicable.
input_column | 'document' | Name of the input column containing the text corpus.
target_column | 'summary' | Name of the target column containing the summarized text.
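
For example, the xsum dataset on the Hugging Face Hub uses exactly this document/summary layout (illustrative sketch; availability may vary with your datasets version):

from datasets import load_dataset

# xsum exposes "document" and "summary" columns, matching the defaults above.
dataset = load_dataset("xsum", split="train")
print(dataset[0]["document"][:80])
print(dataset[0]["summary"][:80])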

Question Answering

The dataset for question answering must follow the general nyuntam-adapt dataset format. nyuntam-adapt currently supports extractive question answering and hence requires:

  • CONTEXT - Text column that contains the context
  • QUESTION - Contains the Question
  • ANSWER - A column containing dictionary entries with the answer text and the character index in the context at which the answer starts,
    for example: { "text": [ "...answer text..." ], "answer_start": [ 276 ] }

The column names can differ from these, but they must be specified via the following arguments:

  • input_column : 'context'
  • input_question_column : 'question'
  • target_column : 'answer'
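
Because extractive QA relies on answer_start pointing into the context, a quick consistency check can be run on a SQuAD-style dataset (a sketch using the public squad dataset; column names follow the defaults above):

from datasets import load_dataset

dataset = load_dataset("squad", split="train")

# answer_start must be the character offset of the answer inside the context.
row = dataset[0]
start = row["answers"]["answer_start"][0]
answer = row["answers"]["text"][0]
assert row["context"][start:start + len(answer)] == answer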

Dataset arguments in question answering:

Key | Value | Description
--- | --- | ---
DATASET_SUBNAME | null | Subname of the dataset, if applicable.
input_column | 'context' | Name of the input column.
input_question_column | 'question' | Name of the input question column.
target_column | 'answer' | Name of the target column.
squad_v2_format | False | Whether the data follows the SQuAD v2 format (True/False).

Example of a default dataset for Question Answering:

context | question | answers
--- | --- | ---
Beyoncé Giselle Knowles-Carter ... | When did Beyonce start becoming... | { "text": [ "in the late 1990s" ],...}
Beyoncé Giselle Knowles-Carter | What areas did Beyonce compete in... | { "text": [ "singing and dancing" ],... }
Beyoncé Giselle Knowles-Carter | When did Beyonce leave Destiny's Chil... | { "text": [ "2003" ], "answer_start": [ 526 ...] }
Beyoncé Giselle Knowles-Carter | In what city and state did Beyonce... | { "text": [ "Houston, Texas" ],... }

Translation

Nyuntam can load any text dataset compatible with Hugging Face datasets: use datasets.load_dataset for Hub datasets, or datasets.load_from_disk for custom datasets. Make sure the uploaded dataset has the following format:

id | translation
--- | ---
0 | { "ca": "Source: Project GutenbergTranslation: Josep Carner", "de": "Source: Project Gutenberg" }
1 | { "ca": "Les Aventures De Tom Sawyer", "de": "Die Abenteuer Tom Sawyers" }
2 | { "ca": "Mark Twain", "de": "Mark Twain" }
3 | { "ca": "PREFACI.", "de": "Vorwort des Autors." }
4 | { "ca": "La major part de les aventures consignades en aquest llibre succeïren de bo de bo; una o dues són experiments de la meva collita; la resta pertanye...", "de": "" }

Dataset Arguments in Translation:

Key | Value | Description
--- | --- | ---
source_lang | '' | The key of the source language in the given dataset.
target_lang | '' | The key of the target language in the given dataset.
DATASET_SUBNAME | null | Subset of the dataset (for multilingual datasets, the subname generally represents the pair of languages used for translation).
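
With source_lang and target_lang set to keys of the translation dictionary, source/target pairs are read as in this sketch (the opus_books subset matches the example table above; the values are examples):

from datasets import load_dataset

# DATASET_SUBNAME "ca-de" picks the Catalan-German pair (example values).
dataset = load_dataset("opus_books", "ca-de", split="train")

source_lang, target_lang = "ca", "de"
pair = dataset[0]["translation"]
print(pair[source_lang], "->", pair[target_lang])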

Importing Your Dataset

There are two different ways to import your dataset into Nyuntam:

Custom Data

For users with a custom dataset (stored locally):

  • For nyuntam-adapt, update the CUSTOM_DATASET_PATH argument in the YAML.
  • For nyuntam-text-generation and nyuntam-vision, update the DATA_PATH argument in the YAML.

Pre-existing dataset

For users importing existing datasets from Hugging Face:

  • For nyuntam-adapt, use the DATASET argument in the YAML.
  • For nyuntam-text-generation, use the DATASET_NAME argument in the YAML.

For more examples, please refer to Examples.

Note: For LLM tasks, the data folder must be loadable by datasets.load_from_disk and should return a datasets.DatasetDict object, not a datasets.Dataset object.
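
A custom dataset can be made loadable in this way by wrapping it in a DatasetDict before saving, as in this minimal sketch (the path and toy data are placeholders):

from datasets import Dataset, DatasetDict, load_from_disk

# Wrap the split(s) in a DatasetDict; a bare Dataset will not load correctly.
train = Dataset.from_dict({"text": ["example one", "example two"]})
DatasetDict({"train": train}).save_to_disk("path/to/custom_data")

# This is what Nyuntam expects to find at DATA_PATH / CUSTOM_DATASET_PATH.
reloaded = load_from_disk("path/to/custom_data")
assert isinstance(reloaded, DatasetDict)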