Free Certified Data Engineer Associate Sample Questions

Free sample questions for the Databricks Certified Data Engineer Associate exam. No account required; study at your own pace.

Question 1

A data engineer has a Python variable table_name that they would like to use in a SQL query. They want to construct a Python code block that will run the query using table_name. They have the following incomplete code block:

    ____(f"SELECT customer_id, spend FROM {table_name}")

What can be used to fill in the blank to successfully complete the task?

A. spark.delta.sql
B. spark.sql
C. spark.table
D. dbutils.sql

Correct Answer:

B. spark.sql
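
A minimal sketch of why this works: Python evaluates the f-string first, so spark.sql only ever receives an ordinary SQL string. The table name "customers" is a hypothetical value chosen for illustration, and the spark.sql call is left as a comment because it assumes a Databricks notebook where spark is already defined.

```python
# Hypothetical table name, used only for illustration.
table_name = "customers"

# The f-string is resolved by Python before any Spark call is made.
query = f"SELECT customer_id, spend FROM {table_name}"
print(query)

# In a Databricks notebook, where `spark` is a predefined SparkSession:
# df = spark.sql(query)
```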

Question 2

A data engineer has created a new database using the following command:

    CREATE DATABASE IF NOT EXISTS customer360;

In which location will the customer360 database be located?

A. dbfs:/user/hive/database/customer360
B. dbfs:/user/hive/warehouse
C. dbfs:/user/hive/customer360
D. dbfs:/user/hive/database

Correct Answer:

B. dbfs:/user/hive/warehouse

Question 3

A data engineer wants to schedule their Databricks SQL dashboard to refresh once per day, but they only want the associated SQL endpoint to be running when it is necessary. Which approach can the data engineer use to minimize the total running time of the SQL endpoint used in the refresh schedule of their dashboard?

A. They can ensure the dashboard’s SQL endpoint matches each of the queries’ SQL endpoints
B. They can set up the dashboard’s SQL endpoint to be serverless
C. They can turn on the Auto Stop feature for the SQL endpoint
D. They can ensure the dashboard’s SQL endpoint is not one of the included queries’ SQL endpoints

Correct Answer:

C. They can turn on the Auto Stop feature for the SQL endpoint

Question 4

Which of the following statements regarding the relationship between Silver tables and Bronze tables is always true?

A. Silver tables contain a less refined, less clean view of data than Bronze data
B. Silver tables contain aggregates while Bronze data is unaggregated
C. Silver tables contain more data than Bronze tables
D. Silver tables contain a more refined and cleaner view of data than Bronze tables
E. Silver tables contain less data than Bronze tables

Correct Answer:

D. Silver tables contain a more refined and cleaner view of data than Bronze tables

Question 5

A data engineer has a Python notebook in Databricks, but they need to use SQL to accomplish a specific task within a cell. They still want all of the other cells to use Python without making any changes to those cells. Which of the following describes how the data engineer can use SQL within a cell of their Python notebook?

A. It is not possible to use SQL in a Python notebook
B. They can attach the cell to a SQL endpoint rather than a Databricks cluster
C. They can simply write SQL syntax in the cell
D. They can add %sql to the first line of the cell
E. They can change the default language of the notebook to SQL

Correct Answer:

D. They can add %sql to the first line of the cell

Question 6

A data engineer has developed a data pipeline to ingest data from a JSON source using Auto Loader, but the engineer has not provided any type inference or schema hints in their pipeline. Upon reviewing the data, the data engineer has noticed that all of the columns in the target table are of the string type despite some of the fields only including float or boolean values. Why has Auto Loader inferred all of the columns to be of the string type?

A. Auto Loader cannot infer the schema of ingested data
B. JSON data is a text-based format
C. Auto Loader only works with string data
D. All of the fields had at least one null value

Correct Answer:

B. JSON data is a text-based format

Question 7

Which statement regarding the relationship between Silver tables and Bronze tables is always true?

A. Silver tables contain a less refined, less clean view of data than Bronze data
B. Silver tables contain aggregates while Bronze data is unaggregated
C. Silver tables contain more data than Bronze tables
D. Silver tables contain less data than Bronze tables

Correct Answer:

D. Silver tables contain less data than Bronze tables

Question 8

A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level. Which of the following tools can the data engineer use to solve this problem?

A. Unity Catalog
B. Data Explorer
C. Delta Lake
D. Delta Live Tables
E. Auto Loader

Correct Answer:

D. Delta Live Tables

Question 9

What describes the relationship between Gold tables and Silver tables?

A. Gold tables are more likely to contain aggregations than Silver tables
B. Gold tables are more likely to contain valuable data than Silver tables
C. Gold tables are more likely to contain a less refined view of data than Silver tables
D. Gold tables are more likely to contain truthful data than Silver tables

Correct Answer:

A. Gold tables are more likely to contain aggregations than Silver tables

Question 10

A Delta Live Tables pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE. The pipeline is configured to run in Production mode using the Continuous Pipeline Mode. What is the expected outcome after clicking Start to update the pipeline, assuming previously unprocessed data exists and all definitions are valid?

A. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will persist to allow for additional testing
B. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing
C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped
D. All datasets will be updated once and the pipeline will shut down. The compute resources will be terminated

Correct Answer:

C. All datasets will be updated at set intervals until the pipeline is shut down. The compute resources will be deployed for the update and terminated when the pipeline is stopped

Question 11

A dataset has been defined using Delta Live Tables and includes an expectations clause:

    CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW

What is the expected behavior when a batch of data containing data that violates these constraints is processed?

A. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table
B. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset
C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log
D. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log
E. Records that violate the expectation cause the job to fail

Correct Answer:

C. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log
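
The drop-row behavior can be illustrated with a plain-Python simulation. This is not the Delta Live Tables API: apply_drop_row, the sample batch, and the violation counter are all hypothetical, and the counter only stands in for the metric that the event log would record.

```python
def apply_drop_row(records, predicate):
    """Keep records that satisfy the predicate; count the ones that do not."""
    kept = [r for r in records if predicate(r)]
    dropped_count = len(records) - len(kept)
    return kept, dropped_count


# Hypothetical batch; ISO-8601 date strings compare correctly as text.
batch = [
    {"id": 1, "timestamp": "2021-06-01"},
    {"id": 2, "timestamp": "2019-12-31"},  # violates the expectation
    {"id": 3, "timestamp": "2020-01-02"},
]

target, violations = apply_drop_row(
    batch, lambda r: r["timestamp"] > "2020-01-01"
)
print([r["id"] for r in target], violations)
```

The violating record never reaches the target, and nothing is written to a separate quarantine table; only the count of violations is retained.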

Question 12

A data engineer has realized that the data files associated with a Delta table are incredibly small. They want to compact the small files to form larger files to improve performance. Which of the following keywords can be used to compact the small files?

A. REDUCE
B. OPTIMIZE
C. COMPACTION
D. REPARTITION
E. VACUUM

Correct Answer:

B. OPTIMIZE

Question 13

Which Git operation must be performed outside of Databricks Repos?

A. Commit
B. Pull
C. Merge
D. Clone

Correct Answer:

C. Merge

Question 14

What can be used to simplify and unify siloed data architectures that are specialized for specific use cases?

A. Delta Lake
B. Data lake
C. Data warehouse
D. Data lakehouse

Correct Answer:

D. Data lakehouse

Question 15

A data engineer wants to create a relational object by pulling data from two tables. The relational object does not need to be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data. Which of the following relational objects should the data engineer create?

A. Spark SQL Table
B. View
C. Delta Table
D. Temporary view

Correct Answer:

D. Temporary view
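
As a rough analogy, the sketch below uses Python's built-in sqlite3 module rather than Spark, so it only mirrors the idea: a temporary view stores no data of its own and is visible only to the session that created it. The table names, column names, and rows are hypothetical.

```python
import os
import sqlite3
import tempfile

# A file-backed database so that two separate connections (sessions) share tables.
path = os.path.join(tempfile.mkdtemp(), "demo.db")

session_a = sqlite3.connect(path)
session_a.executescript("""
    CREATE TABLE orders (customer_id INTEGER, spend REAL);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 25.0), (1, 5.0), (2, 10.0);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Lin');
    -- The view stores only its definition, not a copy of the joined rows.
    CREATE TEMP VIEW customer_spend AS
        SELECT c.name, o.spend
        FROM customers c JOIN orders o ON c.customer_id = o.customer_id;
""")
session_a.commit()

rows_a = session_a.execute("SELECT count(*) FROM customer_spend").fetchone()[0]

# A second session sees the tables but not the temporary view.
session_b = sqlite3.connect(path)
try:
    session_b.execute("SELECT * FROM customer_spend")
    visible_in_b = True
except sqlite3.OperationalError:
    visible_in_b = False

print(rows_a, visible_in_b)
```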

Question 16

Which of the following benefits of using the Databricks Lakehouse Platform is provided by Delta Lake?

A. The ability to manipulate the same data using a variety of languages
B. The ability to collaborate in real time on a single notebook
C. The ability to set up alerts for query failures
D. The ability to support batch and streaming workloads
E. The ability to distribute complex data operations

Correct Answer:

D. The ability to support batch and streaming workloads

Question 17

In which scenario will a data team want to utilize cluster pools?

A. An automated report needs to be version-controlled across multiple collaborators
B. An automated report needs to be runnable by all stakeholders
C. An automated report needs to be refreshed as quickly as possible
D. An automated report needs to be made reproducible

Correct Answer:

C. An automated report needs to be refreshed as quickly as possible

Question 18

A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables. Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?

A. CREATE TABLE all_transactions AS SELECT * FROM march_transactions INNER JOIN SELECT * FROM april_transactions;
B. CREATE TABLE all_transactions AS SELECT * FROM march_transactions UNION SELECT * FROM april_transactions;
C. CREATE TABLE all_transactions AS SELECT * FROM march_transactions OUTER JOIN SELECT * FROM april_transactions;
D. CREATE TABLE all_transactions AS SELECT * FROM march_transactions INTERSECT SELECT * FROM april_transactions;

Correct Answer:

B. CREATE TABLE all_transactions AS SELECT * FROM march_transactions UNION SELECT * FROM april_transactions;
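
The difference between UNION and INTERSECT can be checked with a small experiment. The sketch below runs the statement in SQLite through Python's sqlite3 module as a stand-in for Spark SQL; the two-column schema and the rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE march_transactions (id INTEGER, amount REAL);
    CREATE TABLE april_transactions (id INTEGER, amount REAL);
    INSERT INTO march_transactions VALUES (1, 10.0), (2, 20.0);
    INSERT INTO april_transactions VALUES (3, 30.0), (4, 40.0);

    -- UNION stacks the rows of both tables and removes duplicates.
    CREATE TABLE all_transactions AS
        SELECT * FROM march_transactions
        UNION
        SELECT * FROM april_transactions;
""")

all_count = conn.execute("SELECT count(*) FROM all_transactions").fetchone()[0]

# INTERSECT keeps only rows present in both tables; here there are none.
intersect_count = conn.execute("""
    SELECT count(*) FROM (
        SELECT * FROM march_transactions
        INTERSECT
        SELECT * FROM april_transactions
    )
""").fetchone()[0]

print(all_count, intersect_count)
```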

Question 19

Which of the following commands will return the number of null values in the member_id column?

A. SELECT count(member_id) FROM my_table;
B. SELECT count(member_id) - count_null(member_id) FROM my_table;
C. SELECT count_if(member_id IS NULL) FROM my_table;
D. SELECT null(member_id) FROM my_table;

Correct Answer:

C. SELECT count_if(member_id IS NULL) FROM my_table;
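
A small sketch of why count(member_id) is the wrong choice: count over a column skips NULLs, so it returns the number of non-null values. The example uses SQLite through Python's sqlite3 module as a stand-in. SQLite has no count_if function, so sum(member_id IS NULL) plays the role that count_if(member_id IS NULL) plays in Spark SQL. The rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (member_id INTEGER);
    INSERT INTO my_table VALUES (1), (NULL), (3), (NULL), (5);
""")

# count(column) ignores NULLs, so this is the number of non-null values.
non_null = conn.execute("SELECT count(member_id) FROM my_table").fetchone()[0]

# SQLite stand-in for Spark SQL's count_if(member_id IS NULL).
null_count = conn.execute(
    "SELECT sum(member_id IS NULL) FROM my_table"
).fetchone()[0]

total = conn.execute("SELECT count(*) FROM my_table").fetchone()[0]
print(non_null, null_count, total)
```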

Question 20

Which tool is used by Auto Loader to process data incrementally?

A. Checkpointing
B. Spark Structured Streaming
C. Databricks SQL
D. Unity Catalog

Correct Answer:

B. Spark Structured Streaming
