Accessing Data from an AWS data lake using AWS Lake Formation – Part 1 - Data Filtering based access control
A data lake is
a repository to store your data. Like a database, a data lake is expected to
have the data in an organized manner, provide tools for data processing and data
access, and have well defined methods for authentication and authorization. And
unlike a database, data lake is expected to hold structured, semi-structured or
unstructured data, and are envisioned to hold this data forever. It is these subtle differences that make data
lakes fit more for analytic needs such as deriving patterns, detailed
comparisons, build an exhaustive story etc. A database can also be used for
analytic use cases, but they are not meant for storing large volumes of data
indefinitely, or store unstructured data, and this in turn puts a limit on how
detailed of a result it can provide. Whether you need a data lake is a question
that depends on the volume of data in your organization. If your organization
data is small; you do not have much history or you do not have a need to store large
history; your data is all structured, then maybe you don’t need a data lake. A database
such as Amazon Redshift may be sufficient.
AWS Lake
formation helps organizations build data lakes easily and efficiently leveraging
familiar AWS services such as Amazon S3 for storage, Amazon Kinesis for
ingestion, Amazon EMR and AWS Glue for processing and cataloging, Amazon Athena
for access etc. Once a data lake has been built, the next critical step would
be to enable row and/or column level access controls.
In this 2-part blog
let us look at how we can access data (--access control--) at Row and/or Column
held in an AWS Data Lake using AWS Lake Formation.
Data Filters is a simple way to implement Row level and/or Column level. If you are applying both Row and Column level, it becomes a Cell level security. AWS Lake formation uses AWS Glue Data Catalog to identify the tables. You will then use these Glue Catalog tables to define your data filters in the Lake formation to grant access to Columns for your Roles or Users or Groups. Within this same Lake formation filter screen, you can also restrict/grant access to specific rows based on values.
Steps to enable Row and/or column level control using Data filters –
Steps 1 – Log
into the AWS Lake Formation console and select Data Filters.
Step 2 – Create
a new filter.
2A – Choose the Glue DB. This is the Glue
catalog DB which would have been created by the Glue Crawler job.
2B – Choose the table which you want
to give/restrict access.
2C – Choose if you want to give
access to all columns or Include or Exclude some columns.
2D – Apply your Row filter
expression, to create the Row access control. This will be applied within the “WHERE”
clause in your SELECT query.
Step 3 – Go to
Data Lake Permissions.
3A – Choose the IAM User/Role for which
you want to apply the Data filter
3B – Choose “Named data catalog resources”
3C – Choose the DB and tables which
you want to give the user access to
3D – Choose the Data Filter you want to
apply. Remember, you can choose more than 1 filter here.
- Data filtering applies only to SELECT operations (or Read operations).
- In case of Row filters with Strings, String constants must be enclosed in single quote. Example – movie_name = ‘d’ OR movie_name = ‘darr’.
- Hint – Before you create the filter, run the query you are using in Amazon Athena to verify the syntax.
- Remember – To use Amazon Athena with AWS Lake Formation, you need to use AmazonAthenaLakeFormation Workgroup. If this workgroup does not exist, go to the Workgroup menu and create this workgroup.
- If a Role has more than 1filters applied, then the output behavior will be - SELECT query of each of the Data Filters are executed and the end result of each of the query will be UNIONed to give the final output.
- Remember – Using example from Step 2D, if you have 2 filters, one restricting rows to include only those movies which were made after year 2000 and another filter which has no restriction, then the end result will be all rows from the table.
- Using data filtering to implement Column and Row can become extremely cumbersome as data volume and users/query-volumes increase. The recommended way is to go Tag Based Access control which we will see in the next blog.
Comments
Post a Comment