Concepts

Source

The Source entity represents the origin of data within the data platform. It is a fundamental component that defines where and how data is retrieved from external or internal data storage systems. A source can be a database, an API, a file storage system, or any other data provider.

Properties

  • ID: A unique identifier for the source.

  • Type: The type of source (e.g., relational database, REST API, file storage). Currently supported types are AWS S3, MySQL, and PostgreSQL, with more to be added.

  • Configuration: The settings required to access and interact with the source, such as connection strings, credentials, endpoints, or file paths.

  • Available Tables/Paths: For databases, this includes available tables; for file storage like S3, this refers to specific file paths or patterns.
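
As an illustration of how these four properties fit together, here is a minimal Python sketch. The class names, field names, and example values (`Source`, `SourceType`, `src-orders`, the host and bucket names) are assumptions made for this example and are not the platform's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    # The three types listed above as currently supported.
    AWS_S3 = "aws_s3"
    MYSQL = "mysql"
    POSTGRESQL = "postgresql"


@dataclass(frozen=True)
class Source:
    """Hypothetical in-memory model of a source (not the platform's API)."""
    id: str                # unique identifier
    type: SourceType       # kind of source
    configuration: dict    # connection details; keep secrets out of code
    available: tuple = ()  # tables for databases, paths or patterns for S3


# A database source exposes tables.
orders_db = Source(
    id="src-orders",
    type=SourceType.POSTGRESQL,
    configuration={"host": "db.example.internal", "port": 5432, "database": "orders"},
    available=("public.orders", "public.customers"),
)

# A file storage source exposes paths or patterns.
raw_files = Source(
    id="src-raw-files",
    type=SourceType.AWS_S3,
    configuration={"bucket": "example-raw-data", "region": "us-east-1"},
    available=("events/2024/*.json",),
)
```

The same structure describes both kinds of source; only the contents of `configuration` and `available` differ by type.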

Usage

  • Data Extraction: Sources are used as starting points for data extraction processes. Depending on the type, this could involve querying a database, accessing files in storage, or making API calls.

  • Integration with Extracts: Each source is linked to one or more 'extracts' that define how data is pulled from the source, including specifics like table names or file paths for S3 buckets.

  • Flexibility and Scalability: Because the platform handles several source types, data operations can draw on databases, APIs, and file storage alike and can grow as new source types are added.
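
The relationship between a source and its extracts can be sketched as follows. This is a hedged illustration only: the `Extract` class, the `describe_read` helper, and all identifiers and values are hypothetical, and the strings it returns merely describe what a read might target rather than performing one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extract:
    """Hypothetical extract definition linked to one source."""
    id: str
    source_id: str
    target: str  # table name for databases, path or pattern for S3


def describe_read(source_type: str, source_config: dict, extract: Extract) -> str:
    """Return a description of what would be read, depending on the source type."""
    if source_type in ("mysql", "postgresql"):
        # Databases are read by querying a table.
        return f"SELECT * FROM {extract.target}"
    if source_type == "aws_s3":
        # File storage is read by resolving a path inside a bucket.
        return f"s3://{source_config['bucket']}/{extract.target}"
    raise ValueError(f"unsupported source type: {source_type}")


# One source can be linked to several extracts.
extracts = [
    Extract(id="ext-orders", source_id="src-orders", target="public.orders"),
    Extract(id="ext-customers", source_id="src-orders", target="public.customers"),
    Extract(id="ext-events", source_id="src-raw-files", target="events/2024/*.json"),
]

# Group extract IDs by the source they belong to.
by_source = {}
for e in extracts:
    by_source.setdefault(e.source_id, []).append(e.id)
```

Here `src-orders` has two extracts and `src-raw-files` has one, matching the one-to-many link described above.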

Best Practices

  • Secure Configuration: Secure access to sources with encrypted connections, secure credential storage, and least-privilege access.

  • Efficient Data Retrieval: Optimize data retrieval methods to balance performance and resource utilization, especially for large or complex sources.

  • Monitoring and Logging: Implement monitoring and logging to track source accessibility, performance, and any issues that arise during data extraction.
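
Two of these practices, keeping credentials out of code and logging each extraction, can be sketched with the Python standard library alone. The function names, the logger name, and the environment variable naming scheme (`<PREFIX>_USER`, `<PREFIX>_PASSWORD`) are assumptions for this example, not platform conventions.

```python
import logging
import os
import time

logger = logging.getLogger("source_access")


def load_credentials(prefix: str) -> dict:
    """Read credentials from environment variables instead of hard-coding them."""
    user = os.environ.get(f"{prefix}_USER")
    password = os.environ.get(f"{prefix}_PASSWORD")
    if not user or not password:
        raise RuntimeError(f"missing credentials for {prefix}")
    return {"user": user, "password": password}


def timed_extract(source_id: str, fetch):
    """Run a fetch callable, logging its duration on success and the error on failure."""
    start = time.monotonic()
    try:
        result = fetch()
    except Exception:
        # Record the failure with its traceback, then let the caller handle it.
        logger.exception("extraction from %s failed", source_id)
        raise
    elapsed = time.monotonic() - start
    logger.info("extraction from %s succeeded in %.3fs", source_id, elapsed)
    return result
```

A caller would pass any zero-argument function as `fetch`, for example `timed_extract("src-orders", lambda: run_query())`, where `run_query` stands in for whatever actually reads from the source.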

Maintenance and Updates

Regularly review and update source configurations to reflect changes in the underlying data storage systems or access requirements.

© Copyright 2024. All rights reserved.
