AWS GLUE — EXTRACT, TRANSFORM, LOAD (ETL)

Learn how to use AWS Glue for ETL operations in Spark, using Amazon S3 as the data source

Kalpana Sharma
5 min read · Apr 7, 2021

AWS Glue uses other AWS services to orchestrate your ETL (extract, transform, and load) jobs to build data warehouses and data lakes and generate output streams. AWS Glue calls API operations to transform your data, create runtime logs, store your job logic, and create notifications to help you monitor your job runs. The AWS Glue console connects these services into a managed application, so you can focus on creating and monitoring your ETL work. The console performs administrative and job development operations on your behalf. You supply credentials and other properties to AWS Glue to access your data sources and write to your data targets.

We will cover the following topics:

  1. Glue components
  2. Data source S3 setup
  3. IAM role for Glue
  4. AWS Glue crawler setup
  5. Glue job setup
  6. Extract the data and transform it using Spark
  7. Load the result into the Amazon S3 output

1) GLUE components:

In a nutshell, AWS Glue has the following important components:

  • Data Source and Data Target: the data store from which data is loaded for ETL is called the data source, and the data store where the transformed data is written is the data target.
  • Data Catalog: the Data Catalog is AWS Glue’s central metadata repository, shared across all the services in a region. It holds table definitions, job definitions, and other control information to manage your AWS Glue environment.
  • Crawlers and Classifiers: a crawler is a program that retrieves the schema of data from a data store such as S3. A crawler uses custom or built-in classifiers to identify data formats and populates the metadata tables in the catalog.
  • Database and Table: each successful crawler run populates a table in a database in the Data Catalog. A database in the catalog is a set of associated tables. Each table holds only metadata, such as column names, data type definitions, and partition information; the actual data remains in the data store. One or more tables in the database are used as the source and target of an ETL job run.
  • Job and Triggers: a job is the actual business logic that carries out the ETL task. It is composed of a transformation script, data sources, and data targets. We can define our jobs in either Python or PySpark. Job runs are initiated by triggers, which can be scheduled or fired by events.

For a detailed study of AWS Glue you can visit the official developer guide.
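To make the Data Catalog concept concrete, here is a minimal boto3 sketch (not part of the console walkthrough below) that lists the databases and the tables a crawler has populated. The region and the “gluedemodb” database name match the demo set up later in this article; adjust them for your own account.

import boto3

# Glue client for the region that holds your Data Catalog (region is an assumption)
glue = boto3.client("glue", region_name="us-east-1")

# list all databases in the catalog
for db in glue.get_databases()["DatabaseList"]:
    print("Database:", db["Name"])

# list the tables (metadata only) inside one database
for tbl in glue.get_tables(DatabaseName="gluedemodb")["TableList"]:
    columns = [c["Name"] for c in tbl["StorageDescriptor"]["Columns"]]
    print("Table:", tbl["Name"], "columns:", columns)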

2) Data Source and Data Target S3 setup

Now that we have an understanding of the different components of AWS Glue, the first step in developing our end-to-end ETL job is to set up our data stores.

  • We are going to create an S3 bucket and upload our data set into a folder inside it. Create an input folder and an output folder; for this practice demo the bucket was simply created with public access.
  • Upload the main data file “mydatasource.csv” into the input folder. It has the following columns: id, vendorid, trip_time, total_amount, fare_amount, toll_amount. A boto3 sketch of this setup follows the list.
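Although this article uses the S3 console, the same setup can be scripted. Below is a minimal boto3 sketch, assuming the “gluesources3” bucket name and the “glueinput”/“glueoutput” prefixes used later in this demo, and the us-east-1 region (an assumption).

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# create the bucket (bucket names must be globally unique)
s3.create_bucket(Bucket="gluesources3")

# create the input/output "folders" (S3 prefixes)
s3.put_object(Bucket="gluesources3", Key="glueinput/")
s3.put_object(Bucket="gluesources3", Key="glueoutput/")

# upload the sample data file into the input folder
s3.upload_file("mydatasource.csv", "gluesources3", "glueinput/mydatasource.csv")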

3) IAM Role For GLUE

We will create a Glue role that allows Glue to access other AWS resources, e.g. S3. Follow the steps below:

  • Go to the IAM console, add a new role, and attach the AWSGlueServiceRole and AmazonS3FullAccess managed policies to it. The AWSGlueServiceRole policy contains permissions to access Glue, CloudWatch, EC2, S3, and IAM.
  • Name the role as per the naming convention by adding the “AWSGlueServiceRole” prefix, e.g. AWSGlueServiceRoledemo. A boto3 sketch of the role creation follows this list.
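For readers who prefer scripting, here is a hedged boto3 sketch of the same setup: it creates a role named “AWSGlueServiceRoledemo” with a trust policy for the Glue service and attaches the two managed policies mentioned above. Credentials and account defaults come from your local AWS configuration.

import json
import boto3

iam = boto3.client("iam")

# trust policy that lets the Glue service assume this role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AWSGlueServiceRoledemo",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# attach the managed policies used in this demo
iam.attach_role_policy(
    RoleName="AWSGlueServiceRoledemo",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
iam.attach_role_policy(
    RoleName="AWSGlueServiceRoledemo",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)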

4) AWS Glue Crawler Setup

Our data source and IAM role are ready; next we need to set up the crawler so that we can catalog our data. Follow these steps:

  • Go to the AWS Glue service and, on the left menu, click on Databases and add a database. For this demo we created the “gluedemodb” database.
  • Now go to Crawlers on the left side and add a new crawler. Enter a name and click Next. We have used “gluedemo”.
  • Specify the crawler source type. Choose the Data stores option and click Next. Then choose S3 as the data store and give the input folder path as the include path; in our case the input folder is “glueinput”.
  • There is no need to choose another data source; click Next.
  • Select the Glue role we created earlier; it appears in the dropdown. We are choosing the role “AWSGlueServiceRoledemo”.
  • Set the frequency to Run on demand.
  • Select the database you created earlier, “gluedemodb”.
  • Finally, review and click Finish.
  • Now your crawler is created. Click “Run crawler” to catalog the dataset.
  • The crawler can take some time to catalog the data. After a successful run you should see tables created in the specified database; in our case one table was created, named “glueinput”. The same setup can also be done programmatically, as sketched below.
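A minimal boto3 sketch of the database and crawler creation, assuming the names used above and the “gluesources3” input bucket; the region is an assumption.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# catalog database that the crawler will populate
glue.create_database(DatabaseInput={"Name": "gluedemodb"})

# crawler pointed at the S3 input prefix, using the role created earlier
glue.create_crawler(
    Name="gluedemo",
    Role="AWSGlueServiceRoledemo",
    DatabaseName="gluedemodb",
    Targets={"S3Targets": [{"Path": "s3://gluesources3/glueinput"}]},
)

# run on demand; once it finishes, the "glueinput" table appears in the catalog
glue.start_crawler(Name="gluedemo")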

5) Add a GLUE Job for ETL work

Now that we have cataloged our dataset we can now move towards adding a Glue Job that will do the ETL work on our dataset.

  • On the left menu click on “Jobs” and add a new job. Glue can auto-generate a Python or PySpark script that we can use to perform ETL operations; however, in our case we will provide our own script. We will name the job “gluedemojob”.
  • Choose the IAM role we created earlier, “AWSGlueServiceRoledemo”.
  • Choose the type as “Spark” and the Glue version as “Spark 2.4 and Python 3 (Glue version 1.0)”.
  • Give the script a name; in our case we named it “gluedemojob”.
  • Leave the other settings as default and click on “Security configuration, script libraries, and job parameters (optional)”.
  • Set the worker type to “Standard”, the number of workers to 2, and the job timeout to 10 minutes. The higher the number of DPUs (maximum capacity) you set, the more cost you will incur.
  • Since we are not connecting to any RDBMS, we do not have to set up any connections. Click “Save job and edit script”.
  • Copy and paste the code below into the editor and hit Save. Then click the Run job button.

##################
## import libraries
##################

# import python modules
from datetime import datetime

# import pyspark modules
from pyspark.context import SparkContext
import pyspark.sql.functions as f

# import glue modules
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

# initialize contexts and session
spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
session = glue_context.spark_session

# parameters
glue_db = "gluedemodb"
glue_tbl = "glueinput"
s3_write_path = "s3://gluesources3/glueoutput"

##################
## EXTRACT (read data)
##################
dynamic_frame_read = glue_context.create_dynamic_frame.from_catalog(database=glue_db, table_name=glue_tbl)

# convert the dynamic frame to a data frame to use standard PySpark functions
data_frame = dynamic_frame_read.toDF()

##################
## TRANSFORM (modify data)
##################

# average total_amount and trip_time per vendor
data_frame_aggregated = data_frame.groupBy("vendorid").agg(
    f.mean(f.col("total_amount")).alias("avg_total_amount"),
    f.mean(f.col("trip_time")).alias("avg_trip_time"),
)

data_frame_aggregated = data_frame_aggregated.orderBy(f.desc("avg_total_amount"))

##################
## LOAD (write data)
##################

# use a single partition because there is very little data
data_frame_aggregated = data_frame_aggregated.repartition(1)

# convert back to a dynamic frame
dynamic_frame_write = DynamicFrame.fromDF(data_frame_aggregated, glue_context, "dynamic_frame_write")

# write back to S3
glue_context.write_dynamic_frame.from_options(
    frame=dynamic_frame_write,
    connection_type="s3",
    connection_options={
        "path": s3_write_path,
    },
    format="csv",
)

  • If the job execution is successful, you will have the aggregated results as CSV files in the output folder of the destination bucket specified in the job script (s3_write_path). A quick way to verify the output is sketched below.
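To confirm the write, you can list the objects under the output prefix. A small boto3 sketch, assuming the “gluesources3” bucket and “glueoutput” prefix from the job parameters:

import boto3

s3 = boto3.client("s3")

# list whatever the job wrote under the output prefix (CSV part files)
response = s3.list_objects_v2(Bucket="gluesources3", Prefix="glueoutput/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])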

Summary

In this article we learned how to use AWS Glue to perform ETL operations in Spark: setting up the data source and data target, creating a crawler to catalog the data on S3, and authoring a Glue Spark job to perform the extract, transform, and load (ETL) work.

Thank you.
