Concepts

Dataset

The Dataset entity in the data platform represents a structured collection of data, usually in tabular form. It is a crucial component of data management and processing within the platform, serving as the primary unit for storing, manipulating, and retrieving data.

Properties

  • ID: A unique identifier for the dataset.

  • Schema: Describes the structure of the dataset, including columns, data types, and lengths.

  • Transactions: Each execution creates a transaction in the dataset. Transactions can either overwrite or append to the dataset.

  • Write Behavior: Determines how new data is added to the dataset (e.g., append or overwrite); see the sketch after this list.

  • Source: The origin of the data, such as transforms, other datasets, databases, files, or external APIs.

  • Format: The storage format of the dataset. Currently, datasets are stored in Delta Lake format.
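
The Write Behavior and Transactions properties map naturally onto Delta Lake write modes. The following is a minimal PySpark sketch of how an append versus an overwrite transaction could look when a dataset is stored as a Delta table; the storage path, session configuration, and sample data are illustrative assumptions, not platform APIs.

  from pyspark.sql import SparkSession

  # Assumed storage location; the platform would resolve this from the dataset's ID.
  DATASET_PATH = "/data/datasets/customers"

  # Assumes the delta-spark package is available on the Spark session.
  spark = (
      SparkSession.builder
      .appName("dataset-write-behavior")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
              "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
  )

  new_rows = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

  # Write behavior "append": the execution adds a transaction containing only the new rows.
  new_rows.write.format("delta").mode("append").save(DATASET_PATH)

  # Write behavior "overwrite": the transaction replaces the dataset's current contents.
  # new_rows.write.format("delta").mode("overwrite").save(DATASET_PATH)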


Usage

  • Data Integration: Datasets are created and populated by data-extraction processes and transform executions. They can originate from sources such as relational databases, flat files, or APIs.

  • Data Transformation: Datasets serve as inputs and outputs of data transformation processes, where they undergo operations such as filtering, aggregation, or splitting (see the sketch after this list).

  • Data Analysis: Once transformed, datasets are ready for analysis and can be consumed by data analytics tools or used for further data processing.
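
As noted under Data Transformation above, a transform reads one or more input datasets, applies operations such as filtering and aggregation, and produces an output dataset. Below is a hypothetical PySpark transform body; the function signature, input dataset, and column names are assumptions for illustration only.

  from pyspark.sql import DataFrame
  from pyspark.sql import functions as F

  def transform(orders: DataFrame) -> DataFrame:
      # Hypothetical transform: drop cancelled orders (filtering) and
      # compute total revenue per customer (aggregation).
      return (
          orders
          .filter(F.col("status") != "cancelled")
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_revenue"))
      )

The resulting DataFrame would then be written back as a new transaction on the output dataset, using the write behavior configured on that dataset.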


Versioning

Each significant change to a dataset, whether to its schema or its data, should be version controlled to track the dataset's evolution and maintain a history of changes.
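
Because datasets are stored in Delta Lake format, every transaction already produces a new table version, which can serve as the basis for this history. The sketch below uses the open-source delta-spark API to inspect a dataset's transaction history and read an earlier version; the path and version number are assumptions, and the session is assumed to be configured for Delta Lake as in the earlier sketch.

  from delta.tables import DeltaTable
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("dataset-versioning").getOrCreate()

  DATASET_PATH = "/data/datasets/customers"  # assumed storage location

  # One row per transaction: version number, timestamp, and operation performed.
  DeltaTable.forPath(spark, DATASET_PATH) \
      .history() \
      .select("version", "timestamp", "operation") \
      .show(truncate=False)

  # Time travel: read the dataset as it was at an earlier version.
  snapshot_v3 = spark.read.format("delta").option("versionAsOf", 3).load(DATASET_PATH)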
