Question 1
A finance company receives data from third-party data providers and stores the data as objects in an Amazon S3 bucket. The company ran an AWS Glue crawler on the objects to create a data catalog. The AWS Glue crawler created multiple tables. However, the company expected that the crawler would create only one table. The company needs a solution that will ensure the AWS Glue crawler creates only one table. Which combination of solutions will meet this requirement? (Choose two.)
A. Ensure that the object format, compression type, and schema are the same for each object
B. Ensure that the object format and schema are the same for each object. Do not enforce consistency for the compression type of each object
C. Ensure that the schema is the same for each object. Do not enforce consistency for the file format and compression type of each object
D. Ensure that the structure of the prefix for each S3 object name is consistent
E. Ensure that all S3 object names follow a similar pattern
Correct Answer:
A. Ensure that the object format, compression type, and schema are the same for each object
D. Ensure that the structure of the prefix for each S3 object name is consistent
Question 2
A data engineer must use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII. Which solution will meet this requirement with the LEAST operational effort?
A. Use an Amazon Kinesis Data Firehose delivery stream to process the dataset. Create an AWS Lambda transform function to identify the PII. Use an AWS SDK to obfuscate the PII. Set the S3 data lake as the target for the delivery stream
B. Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake
C. Use the Detect PII transform in AWS Glue Studio to identify the PII. Create a rule in AWS Glue Data Quality to obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake
D. Ingest the dataset into Amazon DynamoDB. Create an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data. Use the same Lambda function to ingest the data into the S3 data lake
Correct Answer:
B. Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake
Question 3
A data engineer uses Amazon Redshift to run resource-intensive analytics processes once every month. Every month, the data engineer creates a new Redshift provisioned cluster. The data engineer deletes the Redshift provisioned cluster after the analytics processes are complete every month. Before the data engineer deletes the cluster each month, the data engineer unloads backup data from the cluster to an Amazon S3 bucket. The data engineer needs a solution to run the monthly analytics processes that does not require the data engineer to manage the infrastructure manually. Which solution will meet these requirements with the LEAST operational overhead?
A. Use AWS Step Functions to pause the Redshift cluster when the analytics processes are complete and to resume the cluster to run new processes every month
B. Use Amazon Redshift Serverless to automatically process the analytics workload
C. Use the AWS CLI to automatically process the analytics workload
D. Use AWS CloudFormation templates to automatically process the analytics workload
Correct Answer:
B. Use Amazon Redshift Serverless to automatically process the analytics workload
Question 4
A company stores server logs in an Amazon S3 bucket. The company needs to keep the logs for 1 year. The logs are not required after 1 year. A data engineer needs a solution to automatically delete logs that are older than 1 year. Which solution will meet these requirements with the LEAST operational overhead?
A. Define an S3 Lifecycle configuration to delete the logs after 1 year
B. Create an AWS Lambda function to delete the logs after 1 year
C. Schedule a cron job on an Amazon EC2 instance to delete the logs after 1 year
D. Configure an AWS Step Functions state machine to delete the logs after 1 year
Correct Answer:
A. Define an S3 Lifecycle configuration to delete the logs after 1 year
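As an illustration of answer A, an S3 Lifecycle rule that expires objects after 365 days can be sketched as a boto3 request payload. The bucket name, prefix, and rule ID below are hypothetical examples.

```python
# Lifecycle rule behind answer A: expire server-log objects after 365 days.
lifecycle_config = {
    "Rules": [
        {
            "ID": "delete-logs-after-1-year",
            "Filter": {"Prefix": "server-logs/"},  # hypothetical prefix
            "Status": "Enabled",
            "Expiration": {"Days": 365},
        }
    ]
}

# In practice the payload would be applied with boto3:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-log-bucket",          # hypothetical bucket
#     LifecycleConfiguration=lifecycle_config,
# )
```

Once the rule is in place, S3 deletes the expired objects itself, so there is no Lambda function, cron job, or state machine to maintain.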
Question 5
A company uses AWS Key Management Service (AWS KMS) to encrypt an Amazon Redshift cluster. The company wants to configure a cross-Region snapshot of the Redshift cluster as part of its disaster recovery (DR) strategy. A data engineer needs to use the AWS CLI to create the cross-Region snapshot. Which combination of steps will meet these requirements? (Choose two.)
A. Create a KMS key and configure a snapshot copy grant in the source AWS Region
B. In the source AWS Region, enable snapshot copying. Specify the name of the snapshot copy grant that is created in the destination AWS Region
C. In the source AWS Region, enable snapshot copying. Specify the name of the snapshot copy grant that is created in the source AWS Region
D. Create a KMS key and configure a snapshot copy grant in the destination AWS Region
E. Convert the cluster to a Multi-AZ deployment
Correct Answer:
B. In the source AWS Region, enable snapshot copying. Specify the name of the snapshot copy grant that is created in the destination AWS Region
D. Create a KMS key and configure a snapshot copy grant in the destination AWS Region
Question 6
A financial company wants to use Amazon Athena to run on-demand SQL queries on a petabyte-scale dataset to support a business intelligence (BI) application. An AWS Glue job that runs during non-business hours updates the dataset once every day. The BI application has a standard data refresh frequency of 1 hour to comply with company policies. A data engineer wants to cost optimize the company's use of Amazon Athena without adding any additional infrastructure costs. Which solution will meet these requirements with the LEAST operational overhead?
A. Configure an Amazon S3 Lifecycle policy to move data to the S3 Glacier Deep Archive storage class after 1 day
B. Use the query result reuse feature of Amazon Athena for the SQL queries
C. Add an Amazon ElastiCache cluster between the BI application and Athena
D. Change the format of the files that are in the dataset to Apache Parquet
Correct Answer:
B. Use the query result reuse feature of Amazon Athena for the SQL queries
Question 7
A company stores sensitive data in an Amazon Redshift table. The company needs to give specific users the ability to access the sensitive data. The company must not create duplication in the data. Customer support users must be able to see the last four characters of the sensitive data. Audit users must be able to see the full value of the sensitive data. No other users can have the ability to access the sensitive information. Which solution will meet these requirements?
A. Create a dynamic data masking policy to allow access based on each user role. Create IAM roles that have specific access permissions. Attach the masking policy to the column that contains sensitive data
B. Enable metadata security on the Redshift cluster. Create IAM users and IAM roles for the customer support users and the audit users. Grant the IAM users and IAM roles permissions to view the metadata in the Redshift cluster
C. Create a row-level security policy to allow access based on each user role. Create IAM roles that have specific access permissions. Attach the security policy to the table
D. Create an AWS Glue job to redact the sensitive data and to load the data into a new Redshift table
Correct Answer:
A. Create a dynamic data masking policy to allow access based on each user role. Create IAM roles that have specific access permissions. Attach the masking policy to the column that contains sensitive data
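In Redshift, answer A is implemented in SQL with CREATE MASKING POLICY and ATTACH MASKING POLICY statements. The transformation the customer support role would see can be illustrated with a small Python sketch (the function name is a hypothetical example, not a Redshift API):

```python
def mask_all_but_last_four(value: str, mask_char: str = "*") -> str:
    """Illustrates the view a dynamic data masking policy would give
    customer support users: only the last four characters of the
    sensitive value stay visible. Audit users would be attached to a
    pass-through policy and see the full value instead."""
    if len(value) <= 4:
        return value
    return mask_char * (len(value) - 4) + value[-4:]

print(mask_all_but_last_four("4111111111111111"))  # ************1111
```

Because masking happens at query time per role, the sensitive column is stored once and never duplicated.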
Question 8
A lab uses IoT sensors to monitor humidity, temperature, and pressure for a project. The sensors send 100 KB of data every 10 seconds. A downstream process will read the data from an Amazon S3 bucket every 30 seconds. Which solution will deliver the data to the S3 bucket with the LEAST latency?
A. Use Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose to deliver the data to the S3 bucket. Use the default buffer interval for Kinesis Data Firehose
B. Use Amazon Kinesis Data Streams to deliver the data to the S3 bucket. Configure the stream to use 5 provisioned shards
C. Use Amazon Kinesis Data Streams and call the Kinesis Client Library to deliver the data to the S3 bucket. Use a 5 second buffer interval from an application
D. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) and Amazon Kinesis Data Firehose to deliver the data to the S3 bucket. Use a 5 second buffer interval for Kinesis Data Firehose
Correct Answer:
C. Use Amazon Kinesis Data Streams and call the Kinesis Client Library to deliver the data to the S3 bucket. Use a 5 second buffer interval from an application
Question 9
A company stores data in a data lake that is in Amazon S3. Some data that the company stores in the data lake contains personally identifiable information (PII). Multiple user groups need to access the raw data. The company must ensure that user groups can access only the PII that they require. Which solution will meet these requirements with the LEAST effort?
A. Use Amazon Athena to query the data. Set up AWS Lake Formation and create data filters to establish levels of access for the company's IAM roles. Assign each user to the IAM role that matches the user's PII access requirements
B. Use Amazon QuickSight to access the data. Use column-level security features in QuickSight to limit the PII that users can retrieve from Amazon S3 by using Amazon Athena. Define QuickSight access levels based on the PII access requirements of the users
C. Build a custom query builder UI that will run Athena queries in the background to access the data. Create user groups in Amazon Cognito. Assign access levels to the user groups based on the PII access requirements of the users
D. Create IAM roles that have different levels of granular access. Assign the IAM roles to IAM user groups. Use an identity-based policy to assign access levels to user groups at the column level
Correct Answer:
A. Use Amazon Athena to query the data. Set up AWS Lake Formation and create data filters to establish levels of access for the company's IAM roles. Assign each user to the IAM role that matches the user's PII access requirements
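The Lake Formation data filter in answer A can be sketched as a boto3 `create_data_cells_filter` payload that hides PII columns from roles that do not need them. All account, database, table, and column names below are hypothetical examples.

```python
# Lake Formation data cells filter (answer A): grant a role every column
# except the PII columns, for all rows.
data_filter = {
    "TableData": {
        "TableCatalogId": "111122223333",   # hypothetical account ID
        "DatabaseName": "raw_data",
        "TableName": "customers",
        "Name": "no-pii-columns",
        # All columns except the PII ones:
        "ColumnWildcard": {"ExcludedColumnNames": ["email", "ssn"]},
        # No row-level restriction in this sketch:
        "RowFilter": {"AllRowsWildcard": {}},
    }
}

# import boto3
# boto3.client("lakeformation").create_data_cells_filter(**data_filter)
```

Each IAM role then gets a filter matching its PII needs, and Athena enforces the filters automatically at query time.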
Question 10
A company uses Amazon Redshift for its data warehouse. The company must automate refresh schedules for Amazon Redshift materialized views. Which solution will meet this requirement with the LEAST effort?
A. Use Apache Airflow to refresh the materialized views
B. Use an AWS Lambda user-defined function (UDF) within Amazon Redshift to refresh the materialized views
C. Use the query editor v2 in Amazon Redshift to refresh the materialized views
D. Use an AWS Glue workflow to refresh the materialized views
Correct Answer:
C. Use the query editor v2 in Amazon Redshift to refresh the materialized views
Question 11
A company has a gaming application that stores data in Amazon DynamoDB tables. A data engineer needs to ingest the game data into an Amazon OpenSearch Service cluster. Data updates must occur in near real time. Which solution will meet these requirements?
A. Use AWS Step Functions to periodically export data from the Amazon DynamoDB tables to an Amazon S3 bucket. Use an AWS Lambda function to load the data into Amazon OpenSearch Service
B. Configure an AWS Glue job to have a source of Amazon DynamoDB and a destination of Amazon OpenSearch Service to transfer data in near real time
C. Use Amazon DynamoDB Streams to capture table changes. Use an AWS Lambda function to process and update the data in Amazon OpenSearch Service
D. Use a custom OpenSearch plugin to sync data from the Amazon DynamoDB tables
Correct Answer:
C. Use Amazon DynamoDB Streams to capture table changes. Use an AWS Lambda function to process and update the data in Amazon OpenSearch Service
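The Lambda side of answer C can be sketched as a function that turns a DynamoDB Streams event into OpenSearch-style bulk actions. The partition-key name and index name are hypothetical examples, and the HTTP call to the OpenSearch domain is omitted.

```python
def stream_records_to_actions(event: dict, index: str = "game-data") -> list:
    """Convert a DynamoDB Streams Lambda event into OpenSearch bulk
    actions: INSERT/MODIFY become index actions, REMOVE becomes a
    delete action. 'player_id' is an assumed partition-key name."""
    actions = []
    for record in event.get("Records", []):
        doc_id = record["dynamodb"]["Keys"]["player_id"]["S"]
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # Flatten DynamoDB attribute values ({"S": "x"} -> "x"):
            doc = {k: list(v.values())[0] for k, v in image.items()}
            actions.append({"_op_type": "index", "_index": index,
                            "_id": doc_id, "_source": doc})
        elif record["eventName"] == "REMOVE":
            actions.append({"_op_type": "delete", "_index": index,
                            "_id": doc_id})
    return actions
```

Because the Lambda function is invoked for each batch of table changes, the OpenSearch index stays in near real-time sync without any scheduled export jobs.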
Question 12
During a security review, a company identified a vulnerability in an AWS Glue job. The company discovered that credentials to access an Amazon Redshift cluster were hard coded in the job script. A data engineer must remediate the security vulnerability in the AWS Glue job. The solution must securely store the credentials. Which combination of steps should the data engineer take to meet these requirements? (Choose two.)
A. Store the credentials in the AWS Glue job parameters
B. Store the credentials in a configuration file that is in an Amazon S3 bucket
C. Access the credentials from a configuration file that is in an Amazon S3 bucket by using the AWS Glue job
D. Store the credentials in AWS Secrets Manager
E. Grant the AWS Glue job IAM role access to the stored credentials
Correct Answer:
D. Store the credentials in AWS Secrets Manager
E. Grant the AWS Glue job IAM role access to the stored credentials
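Answers D and E together look roughly like the following in the Glue job script: the credentials live in Secrets Manager, the job's IAM role has `secretsmanager:GetSecretValue` on the secret, and nothing sensitive is hard coded. The secret name and key names are hypothetical examples.

```python
import json

def get_redshift_credentials(secrets_client, secret_id: str) -> tuple:
    """Fetch Redshift credentials from AWS Secrets Manager (answer D)
    instead of hard-coding them in the Glue job script. The client is
    passed in so the function can be exercised without AWS access."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return secret["username"], secret["password"]

# In the Glue job the client would come from boto3, and the job's IAM
# role must be granted access to the secret (answer E):
# import boto3
# user, pwd = get_redshift_credentials(
#     boto3.client("secretsmanager"), "redshift/analytics-cluster")
```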
Question 13
A transportation company wants to track vehicle movements by capturing geolocation records. The records are 10 bytes in size. The company receives up to 10,000 records every second. Data transmission delays of a few minutes are acceptable because of unreliable network conditions. The transportation company wants to use Amazon Kinesis Data Streams to ingest the geolocation data. The company needs a reliable mechanism to send data to Kinesis Data Streams. The company needs to maximize the throughput efficiency of the Kinesis shards. Which solution will meet these requirements in the MOST operationally efficient way?
A. Kinesis Agent
B. Kinesis Producer Library (KPL)
C. Amazon Kinesis Data Firehose
D. Kinesis SDK
Correct Answer:
B. Kinesis Producer Library (KPL)
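The shard arithmetic behind answer B is worth checking: a shard accepts 1,000 records per second and 1 MB per second, so tiny 10-byte records hit the record-count limit long before the data limit. KPL aggregation packs many small records into one larger Kinesis record, making the data-volume limit the binding one.

```python
# Why KPL aggregation (answer B) maximizes shard efficiency here.
# Per-shard ingest limits: 1,000 records/s and 1 MB/s.
record_size = 10            # bytes per geolocation record
records_per_second = 10_000

# Without aggregation, the record-count limit dominates:
shards_without_kpl = records_per_second / 1_000        # 10 shards needed
# With KPL aggregation, only the data volume matters:
bytes_per_second = record_size * records_per_second    # 100,000 B/s
shards_with_kpl = bytes_per_second / 1_000_000         # 0.1 -> one shard

print(shards_without_kpl, bytes_per_second, shards_with_kpl)
```

The KPL also retries failed puts automatically, which covers the "reliable mechanism" requirement under unreliable network conditions.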
Question 14
An airline company is collecting metrics about flight activities for analytics. The company is conducting a proof of concept (POC) test to show how analytics can provide insights that the company can use to increase on-time departures. The POC test uses objects in Amazon S3 that contain the metrics in .csv format. The POC test uses Amazon Athena to query the data. The data is partitioned in the S3 bucket by date. As the amount of data increases, the company wants to optimize the storage solution to improve query performance. Which combination of solutions will meet these requirements? (Choose two.)
A. Add a randomized string to the beginning of the keys in Amazon S3 to get more throughput across partitions
B. Use an S3 bucket that is in the same account that uses Athena to query the data
C. Use an S3 bucket that is in the same AWS Region where the company runs Athena queries
D. Preprocess the .csv data to JSON format by fetching only the document keys that the query requires
E. Preprocess the .csv data to Apache Parquet format by fetching only the data blocks that are needed for predicates
Correct Answer:
C. Use an S3 bucket that is in the same AWS Region where the company runs Athena queries
E. Preprocess the .csv data to Apache Parquet format by fetching only the data blocks that are needed for predicates
Question 15
A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company's current level of performance. Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)
A. Use Hadoop Distributed File System (HDFS) as a persistent data store
B. Use Amazon S3 as a persistent data store
C. Use x86-based instances for core nodes and task nodes
D. Use Graviton instances for core nodes and task nodes
E. Use Spot Instances for all primary nodes
Correct Answer:
B. Use Amazon S3 as a persistent data store
D. Use Graviton instances for core nodes and task nodes
Question 16
A data engineer created a table named cloudtrail_logs in Amazon Athena to query AWS CloudTrail logs and prepare data for audits. The data engineer needs to write a query to display errors with error codes that have occurred since the beginning of 2024. The query must return the 10 most recent errors. Which query will meet these requirements?
A. select count(*) as TotalEvents, eventname, errorcode, errormessage from cloudtrail_logs where errorcode is not null and eventtime >= '2024-01-01T00:00:00Z' group by eventname, errorcode, errormessage order by TotalEvents desc limit 10;
B. select count(*) as TotalEvents, eventname, errorcode, errormessage from cloudtrail_logs where eventtime >= '2024-01-01T00:00:00Z' group by eventname, errorcode, errormessage order by TotalEvents desc limit 10;
C. select count(*) as TotalEvents, eventname, errorcode, errormessage from cloudtrail_logs where eventtime >= '2024-01-01T00:00:00Z' group by eventname, errorcode, errormessage order by eventname asc limit 10;
D. select count(*) as TotalEvents, eventname, errorcode, errormessage from cloudtrail_logs where errorcode is not null and eventtime >= '2024-01-01T00:00:00Z' group by eventname, errorcode, errormessage limit 10;
Correct Answer:
A. select count(*) as TotalEvents, eventname, errorcode, errormessage from cloudtrail_logs where errorcode is not null and eventtime >= '2024-01-01T00:00:00Z' group by eventname, errorcode, errormessage order by TotalEvents desc limit 10;
Question 17
A company has several new datasets in CSV and JSON formats. A data engineer needs to make the data available to a team of data analysts who will analyze the data by using SQL queries. Which solution will meet these requirements in the MOST cost-effective way?
A. Create an Amazon RDS MySQL cluster. Use AWS Glue to transform and load the CSV and JSON files into database tables. Provide the data analysts access to the MySQL cluster
B. Create an AWS Glue DataBrew project that contains the new data. Make the DataBrew project available to the data analysts
C. Store the data in an Amazon S3 bucket. Use an AWS Glue crawler to catalog the S3 bucket as tables. Create an Amazon Athena workgroup that has a data usage threshold. Grant the data analysts access to the Athena workgroup
D. Load the data into Super-fast, Parallel, In-memory Calculation Engine (SPICE) in Amazon QuickSight. Allow the data analysts to create analyses and dashboards in QuickSight
Correct Answer:
C. Store the data in an Amazon S3 bucket. Use an AWS Glue crawler to catalog the S3 bucket as tables. Create an Amazon Athena workgroup that has a data usage threshold. Grant the data analysts access to the Athena workgroup
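The workgroup with a data usage threshold from answer C can be sketched as a boto3 `create_work_group` payload. The workgroup name, results location, and the 1 TB per-query cutoff are hypothetical examples.

```python
# Athena workgroup (answer C) with a per-query data usage threshold.
workgroup_config = {
    "Name": "analyst-workgroup",
    "Configuration": {
        "ResultConfiguration": {
            "OutputLocation": "s3://example-athena-results/"  # hypothetical
        },
        # Athena cancels any query that scans more than this many bytes,
        # capping cost since Athena bills per byte scanned:
        "BytesScannedCutoffPerQuery": 1_000_000_000_000,  # ~1 TB
        "EnforceWorkGroupConfiguration": True,
    },
}

# import boto3
# boto3.client("athena").create_work_group(**workgroup_config)
```

This keeps the data in cheap S3 storage, makes it queryable via standard SQL, and bounds per-query spend without running any database servers.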
Question 18
A company is designing a serverless data processing workflow in AWS Step Functions that involves multiple steps. The processing workflow ingests data from an external API, transforms the data by using multiple AWS Lambda functions, and loads the transformed data into Amazon DynamoDB. The company needs the workflow to perform specific steps based on the content of the incoming data. Which Step Functions state type should the company use to meet this requirement?
A. Parallel
B. Choice
C. Task
D. Map
Correct Answer:
B. Choice
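A Choice state branches on the content of its input, which is exactly the requirement here. A minimal Amazon States Language sketch, expressed as a Python dict (state names and JSONPath fields are hypothetical examples):

```python
# Minimal ASL Choice state (answer B): route the workflow based on a
# field in the incoming data.
choice_state = {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.record.type",
            "StringEquals": "orders",
            "Next": "TransformOrders",
        },
        {
            "Variable": "$.record.size",
            "NumericGreaterThan": 1000,
            "Next": "TransformLargeBatch",
        },
    ],
    # Fallback branch when no rule matches:
    "Default": "TransformGeneric",
}
```

By contrast, Parallel runs branches concurrently, Map iterates over a collection, and Task invokes a single unit of work; none of them makes a content-based routing decision.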
Question 19
A company is planning to upgrade its Amazon Elastic Block Store (Amazon EBS) General Purpose SSD storage from gp2 to gp3. The company wants to prevent any interruptions in its Amazon EC2 instances that will cause data loss during the migration to the upgraded storage. Which solution will meet these requirements with the LEAST operational overhead?
A. Create snapshots of the gp2 volumes. Create new gp3 volumes from the snapshots. Attach the new gp3 volumes to the EC2 instances
B. Create new gp3 volumes. Gradually transfer the data to the new gp3 volumes. When the transfer is complete, mount the new gp3 volumes to the EC2 instances to replace the gp2 volumes
C. Change the volume type of the existing gp2 volumes to gp3. Enter new values for volume size, IOPS, and throughput
D. Use AWS DataSync to create new gp3 volumes. Transfer the data from the original gp2 volumes to the new gp3 volumes
Correct Answer:
C. Change the volume type of the existing gp2 volumes to gp3. Enter new values for volume size, IOPS, and throughput
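Answer C maps to a single Elastic Volumes operation: EBS converts the volume in place while it stays attached, so there is no detach, snapshot, or data copy. A sketch of the boto3 `modify_volume` arguments (the volume ID is a hypothetical example; 3,000 IOPS and 125 MiB/s are the gp3 baselines):

```python
# In-place gp2 -> gp3 conversion (answer C) via Elastic Volumes.
modify_volume_kwargs = {
    "VolumeId": "vol-0123456789abcdef0",  # hypothetical volume ID
    "VolumeType": "gp3",
    "Iops": 3000,        # gp3 baseline IOPS
    "Throughput": 125,   # gp3 baseline throughput, MiB/s
}

# import boto3
# boto3.client("ec2").modify_volume(**modify_volume_kwargs)
```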
Question 20
A data engineer needs to securely transfer 5 TB of data from an on-premises data center to an Amazon S3 bucket. Approximately 5% of the data changes every day. Updates to the data need to be regularly propagated to the S3 bucket. The data includes files that are in multiple formats. The data engineer needs to automate the transfer process and must schedule the process to run periodically. Which AWS service should the data engineer use to transfer the data in the MOST operationally efficient way?
A. AWS DataSync
B. AWS Glue
C. AWS Direct Connect
D. Amazon S3 Transfer Acceleration
Correct Answer:
A. AWS DataSync