
Developer Documentation


Access Grants

The data that #Let's Data processes can live in different AWS accounts - the read queues could be in a customer AWS account, the write destination could be in the #Let's Data account, and so on. To enable these cross-account data processing scenarios, #Let's Data needs access to the different resources required to process the dataset.

Resource Locations

In terms of access, the different resources (S3 buckets, DynamoDB tables, SQS queues etc.) that are read from, written to, and managed by #Let's Data can be divided into two groups - 1./ Customer: resources that are located in external AWS accounts - 2./ LetsData: resources that are located in the #Let's Data AWS account

  • Customer: Resources that are not located in the #Let's Data AWS account but are used in dataset processing can either be public or have access limited by the owner. In the latter case, #Let's Data requires that the owner add #Let's Data to the resource's access lists.
  • LetsData: Resources that are located in the #Let's Data AWS account are managed completely by #Let's Data - we grant the customer account access to read, write and manage them.

Regardless of the resource location, #Let's Data adheres to the strictest software security principles. The code follows the principle of least privilege, runs in the context of the dataset's user, and is granted access only to the resources that it needs.

Managing Access

  • Some resources (such as the read connector S3 buckets, the artifact file in S3 and the manifest file in S3) are read-only to #Let's Data, so they need to be public or the customer needs to grant their #Let's Data IAM user access (more on how to grant access later).
  • For the resources that #Let's Data writes to (the error connector S3 bucket, the write connector Kinesis stream etc.), the customer can decide whether these live in the customer account or are managed by #Let's Data.
  • If these are managed by the customer, the customer needs to grant their #Let's Data IAM user access to them. If they are managed by #Let's Data, we grant the customer's AWS account access to them.

Access Permissions

We validate this access as part of dataset creation and will let you know if there are any access issues. Here is the access that is needed on each resource (for how to grant access, see the following sections):

    Read Connectors

    We need permissions to access the resources referenced by the read connector. Here are the access grants needed for the different read connectors.

    For the S3 read connector, we need s3:ListBucket and s3:GetObject permissions on the read connector's S3 bucket.
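
    As an illustration, a minimal IAM policy statement for the read connector bucket might look like the following sketch (the bucket name 'customer-read-bucket' is a placeholder for your bucket):

```json
{
    "Sid": "LetsDataReadConnectorS3Access",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:GetObject"
    ],
    "Resource": [
        "arn:aws:s3:::customer-read-bucket",
        "arn:aws:s3:::customer-read-bucket/*"
    ]
}
```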

    Artifact File

    When the implementation language is Java, we download the JAR file from the artifact file's S3 link; this requires s3:GetObject permission on the object.
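
    For example, a sketch of the corresponding statement, assuming the artifact JAR is at s3://customer-artifact-bucket/letsdata-task.jar (placeholder bucket and key names):

```json
{
    "Sid": "LetsDataArtifactFileAccess",
    "Effect": "Allow",
    "Action": ["s3:GetObject"],
    "Resource": ["arn:aws:s3:::customer-artifact-bucket/letsdata-task.jar"]
}
```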

    ECR Image

    When the implementation language is Python / Javascript, we use the ECR image to create Lambda functions. The ECR repo's permissions need to be configured with ecr:BatchGetImage and ecr:GetDownloadUrlForLayer permissions that allow access to the #LetsData Lambda functions arn:aws:lambda:us-east-1:956943252347:function:* (956943252347 is the #LetsData AWS account; ProdCreateDatasetLambdaFunction and ProdUpdateDatasetCodeLambdaFunction are the functions responsible for dataset creation and updates). These permissions are configured directly on the ECR repo's permissions, separate from the access grant role ARN and policy configuration that we do for the other dataset resources.

    The following commands can be used to set this policy on the ECR Repo named 'letsdata_example_functions' using AWS CLI.
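
    The exact commands are not reproduced here; the following is a minimal sketch, assuming the standard cross-account Lambda image-pull repository policy (the policy file name is a placeholder, and the exact statements may differ from what #Let's Data requires):

```bash
# Repository policy: allow the #LetsData Lambda functions to pull the image.
cat > letsdata_ecr_policy.json << 'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LetsDataLambdaECRImageRetrieval",
            "Effect": "Allow",
            "Principal": { "Service": "lambda.amazonaws.com" },
            "Action": [
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Condition": {
                "StringLike": {
                    "aws:sourceArn": "arn:aws:lambda:us-east-1:956943252347:function:*"
                }
            }
        }
    ]
}
EOF

# Apply the policy to the ECR repo.
aws ecr set-repository-policy \
    --repository-name letsdata_example_functions \
    --policy-text file://letsdata_ecr_policy.json
```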

    Manifest File

    Manifest files do not require additional permissions except in the S3ReaderS3LinkManifestFile case, where we read the manifest file from S3. This requires s3:GetObject permission on the manifest object.

    For the objects specified in the S3 reader manifest, we need s3:GetObject permission on each listed object. These should already be covered by the read connector S3 bucket permissions, but we call them out explicitly in case they require additional grants. (If the objects are public, this can be skipped.)
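
    As an illustration, a sketch of a statement covering both the manifest file and the listed objects (bucket and key names are placeholders):

```json
{
    "Sid": "LetsDataManifestFileAccess",
    "Effect": "Allow",
    "Action": ["s3:GetObject"],
    "Resource": [
        "arn:aws:s3:::customer-manifest-bucket/reader-manifest.json",
        "arn:aws:s3:::customer-read-bucket/*"
    ]
}
```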

    Write Connectors

    Different write connectors require different access. Here is the access for each:

    The Write Connector Kinesis Stream: we write to the Kinesis stream and may scale the number of shards. This requires kinesis:PutRecords, kinesis:DescribeStream and kinesis:UpdateShardCount permissions.

    When the resourceLocation == Customer, the Kinesis write connector needs the following access to the Kinesis stream. These access statements need to be included in the accessGrantedRole's IAM access policy.
    Do note that the "kinesis:DeleteStream" access is needed because if #Let's Data is used to delete the dataset, it will also attempt a best-effort delete of the customer's Kinesis stream. The "kinesis:DeleteStream" access can be removed if the customer does not want to grant deletion control to #Let's Data.

    Compute Engine

    Different compute engines might require different access. Here is the access for each:

    Depending on the use case, the following access might need to be included in the dataset's accessGrantedRole.

    • Create New Model: When the model's urlResourceLocation == customer, the Sagemaker compute engine needs access to the model code in S3. These access statements need to be included in the accessGrantedRole's IAM access policy (see the sketch after this list).
    • Bring Your Own Endpoint: When the endpoint's resourceLocation == customer, the Sagemaker compute engine needs access to the endpoint. These access statements need to be included in the accessGrantedRole's IAM access policy (see the sketch after this list).
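
    The exact statements depend on the use case; the following is a sketch assuming the standard S3 and SageMaker actions (the bucket, prefix, account id, region and endpoint names are placeholders, and the actions #Let's Data actually requires may differ):

```json
[
    {
        "Sid": "LetsDataSagemakerModelCodeAccess",
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": ["arn:aws:s3:::customer-model-bucket/model-code/*"]
    },
    {
        "Sid": "LetsDataSagemakerEndpointAccess",
        "Effect": "Allow",
        "Action": ["sagemaker:InvokeEndpoint"],
        "Resource": ["arn:aws:sagemaker:us-east-1:111122223333:endpoint/customer-endpoint"]
    }
]
```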

    S3 Error Connector

    When the resourceLocation == Customer, the S3 error connector needs the following access to the S3 bucket. These access statements need to be included in the accessGrantedRole's IAM access policy.
    Do note that the s3:DeleteObject and s3:DeleteBucket access is needed because if #Let's Data is used to delete the dataset, it will list the bucket and delete the error records it created. (Any data in the same bucket that was not created by #LetsData is at risk of deletion as well.) The s3:DeleteObject and s3:DeleteBucket access can be removed if the customer does not want to grant deletion control to #Let's Data.
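
    A sketch of what the error connector bucket statement might look like (the bucket name is a placeholder; s3:PutObject is an assumption for writing the error records, and the delete actions can be dropped if you do not want to grant deletion control):

```json
{
    "Sid": "LetsDataS3ErrorConnectorAccess",
    "Effect": "Allow",
    "Action": [
        "s3:ListBucket",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:DeleteBucket"
    ],
    "Resource": [
        "arn:aws:s3:::customer-error-bucket",
        "arn:aws:s3:::customer-error-bucket/*"
    ]
}
```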

Instructions: Create the Access Grants Role

  • Find the user details: We need the following identifiers from the logged-in user's data to enable access.

    1. #Let's Data IAM Account ARN: the logged-in user's #Let's Data IAM account ARN. This is the IAM user that was created automatically by #Let's Data when you signed up. All dataset execution is scoped to this user's security perimeter.
    2. UserId: the logged-in user's user id. We use the userId as the STS ExternalId to follow Amazon's security best practices. This is an additional identifier (similar to MFA) that limits the chance of someone inadvertently gaining access.

    The console's User Management tab lists your IAM user ARN. You can also find it via CLI.

  • Create an IAM Role and Policy to grant access
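
    A minimal sketch of what creating the role and policy might look like with the AWS CLI, assuming the trust relationship uses your #Let's Data IAM user ARN as the principal and your userId as the STS ExternalId (the role, policy and file names are placeholders; substitute your own values):

```bash
# Trust policy: allow the #Let's Data IAM user to assume this role,
# with your #Let's Data userId as the STS ExternalId.
cat > letsdata_trust_policy.json << 'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "AWS": "<your #Let's Data IAM user ARN>" },
            "Action": "sts:AssumeRole",
            "Condition": { "StringEquals": { "sts:ExternalId": "<your #Let's Data userId>" } }
        }
    ]
}
EOF

# Create the access grants role with the trust policy.
aws iam create-role \
    --role-name letsdata-access-grants-role \
    --assume-role-policy-document file://letsdata_trust_policy.json

# Attach an inline access policy containing the resource statements from the
# Access Permissions sections above (saved to letsdata_access_policy.json).
aws iam put-role-policy \
    --role-name letsdata-access-grants-role \
    --policy-name letsdata-access-grants-policy \
    --policy-document file://letsdata_access_policy.json
```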
