Amazon Redshift and the “Low/Zero-Code - Low/Zero-ETL” narrative
Amazon Redshift is a column-oriented, fully managed, petabyte-scale data warehouse that makes it easy and cost-effective to analyze all your data. It achieves efficient query performance through a combination of massively parallel processing, columnar data storage, data compression, and ML powered system optimizations.
In this blog let's look at how Amazon Redshift helps the “Low/Zero Code – Low/Zero-ETL” narrative.
Before we get started, let's first understand what ETL means and why we do ETL.
ETL stands for Extract, Transform and Load, and it means exactly what each of these terms sounds like. We Extract data from a source, Transform the data based on the need and Load it to a target. For example - You can use a Cobol program to Extract data from a GDG, Transform the data in a series of Cobol code and Load it into a DB2 table. Another example can be where you use an ETL tool such as Informatica or Talend to extract data from a table or a file, Transform the data and then Load the data to a database such as Amazon Redshift or Teradata.
Now that we understand what ETL is, lets look at why we do ETL. Organization data can come from multiple sources - transactional databases, flat files, IOT devices etc. ETL helps Extract data from each of these multiple sources, Transform this data to make them fit for purpose and finally Load it to a target structure.
Ok... So what is this “Low/Zero Code - Low/Zero-ETL”?
When you look at the different data sources, you realize that data in those sources many times is already in the structure and format that you can readily consume. This can be the case when data is transformed and stored in a transactional database. In case of sources which provide domain data or when you are buying prepared data from external feeds, there may be no need to transform that data. However, even these systems often have an ETL in place. The reason is, most of the traditional technologies lack the capability to query and aggregate data from different sources in real time. They require an ETL to bring all this data into one place, which in turn increases cost and latency.
“Low/Zero Code - Low/Zero-ETL” is a technique which looks at reducing or even eliminating the ETL thereby reducing cost and bringing in efficiency and faster time to market.
How does Amazon Redshift help with “Low/Zero Code - Low/Zero-ETL”?
Amazon Redshift makes loading, distributing and consuming data easy. Its native integration with other services in the AWS ecosystem, helps it reduce and even eliminate the activity of writing code or building complex ETL pipelines. This in turn reduces the data operations, data processing, data storage and data maintenance cost, and greatly increases productivity and time to market.
Here are some of the key features of Amazon Redshift that helps this Low-Code/Low-ETL narrative -
Amazon Redshift Spectrum helps query and analyze data stored on Amazon S3 in real time. This eliminates the need for movement of data from Amazon S3 to Amazon Redshift.
Amazon Redshift Federated queries helps query, analyze, and integrate data stored on transactional databases, data warehouse and the data lake in the same query. Federated queries can join data from transactional databases such as Amazon RDS for MySQL, Amazon Aurora PostgreSQL, Amazon RDS for PostgreSQL and Amazon Aurora MySQL, with Amazon Redshift Data warehouse and Amazon S3 data lake. This powerful feature helps eliminate ETL and data movement between transactional databases, data warehouses and data lakes. You can query both structured and semi-structured data stored natively in Amazon Redshift, Amazon S3 data lakes and Amazon RDS, using standard SQL.
The integration of Amazon Redshift with AWS Data Exchange (ADX) helps Amazon Redshift queries instantly connect and query external 3rd party data without data copy or ETL.
This integration with AWS Data Exchange (ADX) also helps with seamless live data sharing with external parties.
The integration of Amazon Redshift with Amazon AppFlow enables direct load of data from SaaS platforms into Amazon Redshift, without a need for ETL.
Amazon Redshift is integrated with Real time data ingestion services such as Amazon Kinesis Data Firehose and Managed Streaming for Kafka which can directly write data to Amazon Redshift.
Amazon SageMaker natively integrates with Amazon Redshift to read data required for building ML models. You can use Amazon Redshift ML to create, train and deploy your ML models using SQL commands, and use the data in your Redshift with Amazon SageMaker.
Word of caution –
Federated queries can join only with Amazon RDS for PostgreSQL, Amazon RDS for MySQL, Amazon Aurora PostgreSQL and Amazon Aurora MySQL.
When joining data in Amazon S3, you can often notice a slight drop in query performance compared to querying directly from Amazon Redshift. This applies to both Amazon Redshift Federated queries and Amazon Redshift Spectrum.
The Low-Code/Low-ETL discussed here assumes that the data in your Amazon S3 or Amazon RDS are in the format which can be consumed directly or require easy transformation which can be handled in the SQL. If this is not the case, an ETL might be required.
Good Read Narendra.
ReplyDeleteETL not only does the job of aggregating datasets to one platform but also that of standardizing data so that you can correlate better. While there is merit in these virtualization solutions, you will end up writing a lot of code before combining these data sets on the virtualization platform(like ADX). So it does seem like we are just right shifting the load to the consuming side. And if there are multiple independent consumptions, you will end up repeating the same code over and over in several different programs.
Having said, I'm no proponent of ETL. As always, there is no one answer- we must use our best discretion for our use of either solutions and strike a via media. For now, ETL will probably continue to exist as a necessary evil until AI and CPU availability obviate the necessity for it.