Job Scheduling on AWS - Part 1 - Using AWS native services for job scheduling
In this blog, I will look at some of the ways to schedule and run jobs and batch jobs in AWS using AWS native services. In Part 2 of the post, I will talk about how to extend your current on-premises job schedule to AWS.
1) Using AWS Lambda –
An AWS Lambda function can be triggered in several ways. The most common is a file upload to Amazon S3: create your AWS Lambda function and set an S3 upload as the trigger. The moment a file is uploaded, it executes the AWS Lambda function that contains your processing code.
a. When processing large files, use AWS Lambda as a trigger rather than as a file processor. For example, let's say your requirement is to load fact and dimension files into Amazon Redshift, and you are using Amazon QuickSight to report on Amazon Redshift. Considering the time and payload limits of AWS Lambda, a suggested architecture in this scenario would be to use AWS Lambda to check whether all dimension and fact files have arrived before you begin the load. Once you have all the files, you can use AWS Lambda to trigger an AWS Glue job or a COPY to load Amazon Redshift (a minimal sketch of this file-arrival check follows these considerations). This ensures that your process will not fail due to an AWS Lambda timeout or other issues related to AWS Lambda limits, and that your Amazon QuickSight reports come complete with all required dimension and fact data.
b. Remember the limits of AWS Lambda. Some of these are hard limits, such as the function timeout of 900 seconds (15 minutes), while others are soft limits, such as the default of 1,000 concurrent executions.
c. You cannot directly control AWS Lambda CPU. CPU in AWS Lambda is linked to RAM: the two are almost directly proportional.
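For illustration, here is a minimal sketch (in Python with boto3) of the file-arrival check described above. The bucket name, prefixes and AWS Glue job name are hypothetical placeholders, and the error handling a real loader would need is omitted.

```python
import boto3

# Hypothetical bucket, prefixes and Glue job name; adjust to your environment.
BUCKET = "my-data-bucket"
REQUIRED_PREFIXES = ["dim_customer/", "dim_product/", "fact_sales/"]
GLUE_JOB_NAME = "load-redshift-job"

s3 = boto3.client("s3")
glue = boto3.client("glue")

def handler(event, context):
    """Triggered on every S3 PUT; starts the load only when all files are present."""
    missing = [
        p for p in REQUIRED_PREFIXES
        if s3.list_objects_v2(Bucket=BUCKET, Prefix=p).get("KeyCount", 0) == 0
    ]
    if missing:
        print(f"Still waiting for: {missing}")
        return {"status": "waiting", "missing": missing}

    # All dimension and fact files have arrived; hand the heavy lifting to AWS Glue.
    run = glue.start_job_run(JobName=GLUE_JOB_NAME)
    return {"status": "started", "jobRunId": run["JobRunId"]}
```

Keeping AWS Lambda in the "check and hand off" role means the long-running load itself never runs up against the 15-minute timeout.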
2) Using AWS Lambda and Amazon CloudWatch –
You can also trigger AWS Lambda via an Amazon CloudWatch Events (EventBridge) rule. Instead of an Amazon S3 file PUT as in (1), create an Amazon CloudWatch rule with a schedule as the event source and the AWS Lambda function as the target (a minimal sketch follows the considerations below).
a. This option does not give the user enough control to delay the AWS Lambda execution if required. If you want to cancel an execution, you have to log into Amazon CloudWatch and disable the rule that starts the AWS Lambda function.
b. The considerations I have explained under point 1 apply here too.
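As a rough sketch of this setup, assuming a hypothetical rule name and function ARN (the function itself already exists), the schedule and target can be wired up with boto3 like this:

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Hypothetical names/ARNs; substitute your own.
RULE_NAME = "nightly-job-trigger"
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:my-job"

# Create (or update) a scheduled rule -- here, 2 AM UTC every day.
rule = events.put_rule(
    Name=RULE_NAME, ScheduleExpression="cron(0 2 * * ? *)", State="ENABLED"
)

# Allow EventBridge to invoke the function, then attach it as the rule's target.
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="allow-eventbridge",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "my-job-target", "Arn": FUNCTION_ARN}])
```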
3) Using AWS Lambda and Amazon CloudWatch for running jobs on Amazon EC2 –
This works very similarly to what I have explained under point 2, except that instead of the processing logic living in AWS Lambda, you will now have the processing logic on an Amazon EC2 instance. Use this approach if you have a valid reason to process on Amazon EC2 instead of in AWS Lambda.
Considerations for using AWS Lambda and Amazon CloudWatch for running jobs on Amazon EC2 –
a. You can load the code onto the Amazon EC2 instance at startup via a bootstrap script (user data), or you can have AWS Lambda start (and stop) the Amazon EC2 instance via its instance ID (see the sketch after these considerations). If you require a more elaborate setup, you can use AWS Lambda to trigger an AWS CloudFormation stack that spins up the Amazon EC2 instance along with other services.
b. Use this in scenarios where your processing logic has multiple steps and you know it will exceed the AWS Lambda limits.
c. It also fits cases where you want to loop through the logic once the process starts, you have long waits in your process, or your input file sizes are huge and exceed the AWS Lambda limits.
d. Remember that here you are paying for usage of Amazon EC2 as well.
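A minimal sketch of the start/stop call mentioned in (a), using boto3 and a hypothetical instance ID; in practice you would add error handling and look up the instance by tag rather than hard-coding the ID:

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical instance ID of the worker that holds the processing logic.
INSTANCE_ID = "i-0123456789abcdef0"

def handler(event, context):
    """Scheduled Lambda: start the EC2 worker; invoke with action=stop to shut it down."""
    action = event.get("action", "start")
    if action == "start":
        ec2.start_instances(InstanceIds=[INSTANCE_ID])
    elif action == "stop":
        ec2.stop_instances(InstanceIds=[INSTANCE_ID])
    return {"instance": INSTANCE_ID, "action": action}
```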
4) Using Amazon CloudWatch for running jobs on Amazon ECS –
This is similar to point 3, except that instead of Amazon EC2 you will use Amazon ECS (Elastic Container Service). Amazon CloudWatch here will trigger an Amazon ECS task instead of AWS Lambda. The code is packaged as a Docker image and uploaded to Amazon Elastic Container Registry (Amazon ECR). Next you will create an Amazon ECS task with Fargate as the launch type if you don't have a running Amazon ECS cluster; if you do have one, you can use Amazon EC2 as your launch type. But stick to Fargate, as Fargate takes away all the overhead of running your own servers. A minimal scheduling sketch follows.
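As a rough sketch, the scheduled Amazon ECS task can be wired up by making the task the target of the CloudWatch Events (EventBridge) rule. All ARNs, subnet IDs and names below are hypothetical placeholders, and the IAM role must allow EventBridge to call ecs:RunTask:

```python
import boto3

events = boto3.client("events")

# Hypothetical ARNs; replace with your cluster, task definition, role and subnets.
CLUSTER_ARN = "arn:aws:ecs:us-east-1:123456789012:cluster/jobs"
TASK_DEF_ARN = "arn:aws:ecs:us-east-1:123456789012:task-definition/file-loader:1"
EVENTS_ROLE_ARN = "arn:aws:iam::123456789012:role/ecsEventsRole"

events.put_rule(Name="hourly-ecs-job", ScheduleExpression="rate(1 hour)", State="ENABLED")
events.put_targets(
    Rule="hourly-ecs-job",
    Targets=[{
        "Id": "run-file-loader",
        "Arn": CLUSTER_ARN,           # the target is the ECS cluster
        "RoleArn": EVENTS_ROLE_ARN,   # role that lets EventBridge run the task
        "EcsParameters": {
            "TaskDefinitionArn": TASK_DEF_ARN,
            "LaunchType": "FARGATE",
            "NetworkConfiguration": {
                "awsvpcConfiguration": {
                    "Subnets": ["subnet-0123456789abcdef0"],
                    "AssignPublicIp": "ENABLED",
                }
            },
        },
    }],
)
```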
Remember – AWS Lambda integrates with several other AWS services such as Amazon SNS, Amazon SQS, Amazon DynamoDB, Amazon CloudFront, AWS CodeCommit, Amazon Cognito, Amazon Kinesis, Amazon MSK, Amazon API Gateway, etc. Therefore, a combination of points (1), (2), (3) and (4) above opens up several options for you.
5) AWS Batch –
Well suited for running batch jobs. You can choose your compute environments to be Amazon EC2 On-Demand, Spot, Fargate or Fargate Spot. Once your compute environment is set up, you can use Amazon CloudWatch to schedule your jobs on a fixed-rate or cron schedule.
Considerations for AWS Batch –
a. A simple job scheduler where you can schedule using Amazon CloudWatch (a minimal job-submission sketch follows these considerations).
b. It's pay as you use. It works with Amazon EC2 On-Demand and Spot, and the Fargate options take away the overhead of Amazon EC2 management.
c. AWS Glue uses Spark underneath, so I consider AWS Batch instead of AWS Glue when I have to process the input file as a whole and cannot partition it. For example, you want to convert an input .txt file into a Parquet file.
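A minimal job-submission sketch with boto3; the job name, queue, job definition and command are hypothetical placeholders, and in this pattern the function would be invoked by the CloudWatch schedule:

```python
import boto3

batch = boto3.client("batch")

def handler(event, context):
    """Invoked on a schedule; submits one AWS Batch job to an existing queue."""
    response = batch.submit_job(
        jobName="txt-to-parquet-nightly",
        jobQueue="spot-queue",                 # hypothetical, created beforehand
        jobDefinition="txt-to-parquet:3",      # hypothetical job definition + revision
        containerOverrides={"command": ["python", "convert.py", "--input", "s3://my-data-bucket/in.txt"]},
    )
    return {"jobId": response["jobId"]}
```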
6) AWS Batch with AWS Step Functions –
Use this option if you want your AWS Batch job to run via AWS Step Functions rather than on an Amazon CloudWatch schedule. Set up your AWS Batch job as you regularly do. Once it is ready, set up your Step Functions state machine and provide the AWS Batch job as one of the resources, with the Batch parameters: job definition, job name and job queue (see the state machine sketch after the considerations below).
Considerations for AWS Batch with AWS Step Functions –
a. Combines the power of AWS Step Functions and AWS Batch.
b. AWS Glue is easier to manage compared to AWS Batch and should be the default. Consider AWS Batch when you cannot use AWS Glue.
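A minimal sketch of such a state machine, defined in Amazon States Language as a Python dict and created with boto3. The ARNs and names are hypothetical; the .sync integration makes Step Functions wait for the Batch job to finish before moving on:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "StartAt": "RunBatchJob",
    "States": {
        "RunBatchJob": {
            "Type": "Task",
            # ".sync" = run the Batch job and wait for it to complete.
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "nightly-load",
                "JobQueue": "arn:aws:batch:us-east-1:123456789012:job-queue/spot-queue",
                "JobDefinition": "arn:aws:batch:us-east-1:123456789012:job-definition/txt-to-parquet:3",
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="batch-nightly-load",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsBatchRole",  # hypothetical role
)
```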
7) AWS Glue scheduler –
AWS Glue is a managed service that provides a serverless Apache Spark environment. Create your AWS Glue job, then create a trigger by setting up the trigger properties. Your trigger can be a cron schedule, a job event such as the success or failure of another job, or on-demand. Attach the trigger to the job that you have created (a minimal trigger-creation sketch follows the considerations below).
Considerations for AWS Glue –
a. In addition to providing a Spark environment, AWS Glue also provides Crawler functionality, which crawls the input dataset and creates technical metadata for the input.
b. This technical metadata can be exported and combined with business metadata, so you can eventually build out a complete data catalog of technical, business and process metadata.
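For illustration, here is a minimal sketch of creating both a scheduled trigger and a conditional (job-event) trigger with boto3; all job and trigger names are hypothetical and the jobs are assumed to exist already:

```python
import boto3

glue = boto3.client("glue")

# Scheduled trigger: run the job at 1:30 AM UTC every day.
glue.create_trigger(
    Name="nightly-parquet-trigger",
    Type="SCHEDULED",
    Schedule="cron(30 1 * * ? *)",
    Actions=[{"JobName": "convert-to-parquet"}],
    StartOnCreation=True,
)

# Conditional trigger: fire on the success of an upstream job.
glue.create_trigger(
    Name="after-load-trigger",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "load-redshift-job",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "refresh-reports"}],
    StartOnCreation=True,
)
```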
8) AWS Step Functions –
AWS Step Functions is an orchestrator used to sequence different AWS services such as AWS Glue, AWS Batch, AWS Lambda, etc. It has a visual interface that helps you build, visualize and run the workflow. The output of one step acts as input for the next.
Considerations for AWS Step Functions –
a. AWS Step Functions can orchestrate different AWS services including Amazon EMR, AWS Glue, AWS Lambda, AWS Batch, etc.
b. AWS Step Functions provides multiple states such as Choice, Pass, Fail, Succeed, Wait, etc. (see the sketch below).
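A minimal illustration of a few of these states (Wait, Choice, Task and Succeed), written as an Amazon States Language definition expressed as a Python dict; the checker AWS Lambda function and its ARN are hypothetical:

```python
# Poll every 5 minutes until the hypothetical "check-files" Lambda reports the files are ready.
definition = {
    "StartAt": "WaitForFiles",
    "States": {
        "WaitForFiles": {"Type": "Wait", "Seconds": 300, "Next": "CheckFiles"},
        "CheckFiles": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:check-files",
            "Next": "AllFilesPresent?",
        },
        "AllFilesPresent?": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "waiting", "Next": "WaitForFiles"}
            ],
            "Default": "Done",
        },
        "Done": {"Type": "Succeed"},
    },
}
```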
9) Using Oozie on Amazon EMR –
Oozie helps schedule, manage and coordinate Hadoop jobs on Amazon EMR. If you have an on-premises Hadoop application with Oozie as the scheduler and want to migrate to the cloud as a lift-and-shift, then Oozie on Amazon EMR is a good choice.
Considerations for Oozie –
a. Job management and orchestration is limited to Amazon EMR. So if your requirement is to manage jobs running on Amazon EMR, Amazon EC2 and perhaps AWS Glue, do not consider Oozie.
b. Amazon EMR does not support the native Oozie web interface; you have to use Hue.
10) Using AWS Data Pipeline –
AWS Data Pipeline helps schedule and manage data movement and data processing jobs easily.
Considerations for using AWS Data Pipeline -
a. Provides cross-Region pipelines, easy management of task dependencies, and retries and notifications on failure.
b. It provides out-of-the-box precondition checks and data processing templates.
c. Provides a Task Runner which can help manage on-premises data sources as well.
11) Using Managed Airflow on AWS –
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is Amazon's managed Apache Airflow service. Airflow is an open-source workflow management tool.
a. If you are using Airflow on premises and your strategy is to re-platform, then you can re-platform your on-premises Airflow to managed Airflow in the cloud; your existing DAG files generally carry over (a minimal DAG sketch follows).
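For context, an Airflow workflow is just a Python DAG file, so the same definition can run on-premises or on the managed service. A minimal, hypothetical nightly DAG might look like this (task names and commands are placeholders; Airflow 2.x imports):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical nightly pipeline: extract, then load, at 1:30 AM every day.
with DAG(
    dag_id="nightly_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="30 1 * * *",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    load = BashOperator(task_id="load", bash_command="python load.py")
    extract >> load   # load runs only after extract succeeds
```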
~Narendra V Joshi