Accessing Data from an AWS data lake using AWS Lake Formation – Part 1

A data lake is a repository to store your data. Like a database, a data lake is expected to have the data in an organized manner, provide tools for data processing and data access, and have well defined methods for authentication and authorization. And unlike a database, data lake is expected to hold structured, semi-structured or unstructured data, and are envisioned to hold this data forever. It is these subtle differences that make data lakes fit more for analytic needs such as deriving patterns, detailed comparisons, build an exhaustive story etc. A database can also be used for analytic use cases, but they are not meant for storing large volumes of data indefinitely, or store unstructured data, and this in turn puts a limit on how detailed of a result it can provide. Whether you need a data lake is a question that depends on the volume of data in your organization. If your organization data is small; you do not have much history or you do not have a need to store large history; your data is all structured, then maybe you don’t need a data lake. A database such as Amazon Redshift may be sufficient.

AWS Lake formation helps organizations build data lakes easily and efficiently leveraging familiar AWS services such as Amazon S3 for storage, Amazon Kinesis for ingestion, Amazon EMR and AWS Glue for processing and cataloging, Amazon Athena for access etc. Once a data lake has been built, the next critical step would be to enable row and/or column level access controls.

In this 2-part blog let us look at how we can access data (--access control--) at Row and/or Column held in an AWS Data Lake using AWS Lake Formation.

AWS Lake Formation provides 2 different ways to query/filter data based on Row or Column or Cell and in this blog let us look at how we can use AWS Lake Formation Data Filters for access control.

Data Filters is a simple way to implement Row level and/or Column level. If you are applying both Row and Column level, it becomes a Cell level security. AWS Lake formation uses AWS Glue Data Catalog to identify the tables. You will then use these Glue Catalog tables to define your data filters in the Lake formation to grant access to Columns for your Roles or Users or Groups. Within this same Lake formation filter screen, you can also restrict/grant access to specific rows based on values.

Steps to enable Row and/or column level control using Data filters –

Steps 1 – Log into the AWS Lake Formation console and select Data Filters.

Step 2 – Create a new filter.

2A – Choose the Glue DB. This is the Glue catalog DB which would have been created by the Glue Crawler job.

2B – Choose the table which you want to give/restrict access.

2C – Choose if you want to give access to all columns or Include or Exclude some columns.

2D – Apply your Row filter expression, to create the Row access control. This will be applied within the “WHERE” clause in your SELECT query.

Step 3 – Go to Data Lake Permissions.

3A – Choose the IAM User/Role for which you want to apply the Data filter

3B – Choose “Named data catalog resources”

3C – Choose the DB and tables which you want to give the user access to

3D – Choose the Data Filter you want to apply. Remember, you can choose more than 1 filter here.

Points to remember about Data Filters -

Data filtering applies only to SELECT operations (or Read operations).
In case of Row filters with Strings, String constants must be enclosed in single quote. Example – movie_name = ‘d’ OR movie_name = ‘darr’.
Hint – Before you create the filter, run the query you are using in Amazon Athena to verify the syntax.
Remember – To use Amazon Athena with AWS Lake Formation, you need to use AmazonAthenaLakeFormation Workgroup. If this workgroup does not exist, go to the Workgroup menu and create this workgroup.

If a Role has more than 1filters applied, then the output behavior will be - SELECT query of each of the Data Filters are executed and the end result of each of the query will be UNIONed to give the final output.
Remember – Using example from Step 2D, if you have 2 filters, one restricting rows to include only those movies which were made after year 2000 and another filter which has no restriction, then the end result will be all rows from the table.
Using data filtering to implement Column and Row can become extremely cumbersome as data volume and users/query-volumes increase. The recommended way is to go Tag Based Access control which we will see in the next blog.

~Narendra V Joshi

Search This Blog

NAVIJO ON CLOUD

Accessing Data from an AWS data lake using AWS Lake Formation – Part 1 - Data Filtering based access control

Comments

Post a Comment