Free Certified Data Engineer Professional Sample Questions
Free Certified Data Engineer Professional sample questions for the Certified Data Engineer Professional exam. No account required: study at your own pace.
Looking for more? Get the full PDF with 137+ practice questions for $10, for offline study and deeper preparation.
Question 1
Which statement is true regarding the retention of job run history?
A. It is retained until you export or delete job run logs
B. It is retained for 30 days, during which time you can deliver job run logs to DBFS or S3
C. It is retained for 60 days, during which you can export notebook run results to HTML
D. It is retained for 60 days, after which logs are archived
E. It is retained for 90 days or until the run-id is re-used through custom run configuration
Correct Answer:
C. It is retained for 60 days, during which you can export notebook run results to HTML
Question 2
When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?
A. The Five-Minute Load Average remains consistent/flat
B. Bytes Received never exceeds 80 million bytes per second
C. Network I/O never spikes
D. Total Disk Space remains constant
E. CPU Utilization is around 75%
Correct Answer:
E. CPU Utilization is around 75%
Question 3
Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?
A. configure
B. fs
C. jobs
D. libraries
E. workspace
Correct Answer:
B. fs
Question 4
A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively. Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?
A. Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored
B. The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution
C. Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production
D. Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results
E. The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs
Correct Answer:
D. Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results
Question 5
A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources. The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours. Which approach would simplify the identification of these changed records?
A. Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed
B. Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model
C. Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions
D. Modify the overwrite logic to include a field populated by calling spark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date
E. Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed
Correct Answer:
E. Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed
Question 6
The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables. Which approach will ensure that this requirement is met?
A. When a database is being created, make sure that the LOCATION keyword is used
B. When the workspace is being configured, make sure that external cloud object storage has been mounted
C. When data is saved to a table, make sure that a full file path is specified alongside the USING DELTA clause
D. When tables are created, make sure that the UNMANAGED keyword is used in the CREATE TABLE statement
Correct Answer:
C. When data is saved to a table, make sure that a full file path is specified alongside the USING DELTA clause
Question 7
A data architect has heard about Delta Lake’s built-in versioning and time travel capabilities. For auditing purposes, they have a requirement to maintain a full record of all valid street addresses as they appear in the customers table. The architect is interested in implementing a Type 1 table, overwriting existing records with new values and relying on Delta Lake time travel to support long-term auditing. A data engineer on the project feels that a Type 2 table will provide better performance and scalability. Which piece of information is critical to this decision?
A. Data corruption can occur if a query fails in a partially completed state because Type 2 tables require setting multiple fields in a single update
B. Shallow clones can be combined with Type 1 tables to accelerate historic queries for long-term versioning
C. Delta Lake time travel cannot be used to query previous versions of these tables because Type 1 changes modify data files in place
D. Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution
Correct Answer:
D. Delta Lake time travel does not scale well in cost or latency to provide a long-term versioning solution
Question 8
Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators. Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?
A. Query’s detail screen and Job’s detail screen
B. Stage’s detail screen and Executor’s log files
C. Driver’s and Executor’s log files
D. Executor’s detail screen and Executor’s log files
E. Stage’s detail screen and Query’s detail screen
Correct Answer:
B. Stage’s detail screen and Executor’s log files
Question 9
Which statement describes Delta Lake Auto Compaction?
A. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB
B. Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job
C. Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written
D. Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete
E. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 128 MB
Correct Answer:
E. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 128 MB
Question 10
Which Python variable contains a list of directories to be searched when trying to locate required modules?
A. importlib.resource_path
B. sys.path
C. os.path
D. pypi.path
Correct Answer:
B. sys.path
Question 11
A Delta table of weather records is partitioned by date and has the below schema:
date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT
To find all the records from within the Arctic Circle, you execute a query with the below filter:
latitude > 66.3
Which statement describes how the Delta engine identifies which files to load?
A. All records are cached to an operational database and then the filter is applied
B. The Parquet file footers are scanned for min and max statistics for the latitude column
C. All records are cached to attached storage and then the filter is applied
D. The Delta log is scanned for min and max statistics for the latitude column
E. The Hive metastore is scanned for min and max statistics for the latitude column
Correct Answer:
D. The Delta log is scanned for min and max statistics for the latitude column
Question 12
A Delta Lake table representing metadata about content posts from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
This table is partitioned by the date column. A query is run with the following filter:
longitude < 20 & longitude > -20
Which statement describes how data will be filtered?
A. Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range
B. No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude
C. The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria
D. Statistics in the Delta Log will be used to identify data files that might include records in the filtered range
E. The Delta Engine will scan the Parquet file footers to identify each row that meets the filter criteria
Correct Answer:
D. Statistics in the Delta Log will be used to identify data files that might include records in the filtered range
Question 13
A Databricks job has been configured with three tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A. What will be the resulting state if tasks A and B complete successfully but task C fails during a scheduled run?
A. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully
B. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically
C. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed
D. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure
Correct Answer:
A. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully
Question 14
A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used. Which strategy will yield the best performance without shuffling data?
A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet
B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet
C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet
D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet
E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet
Correct Answer:
A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet
Question 15
A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries. In which location can one review the timeline for cluster resizing events?
A. Workspace audit logs
B. Driver's log file
C. Ganglia
D. Cluster Event Log
E. Executor's log file
Correct Answer:
D. Cluster Event Log
Question 16
When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?
A. Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: Unlimited
B. Cluster: New Job Cluster; Retries: None; Maximum Concurrent Runs: 1
C. Cluster: Existing All-Purpose Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
D. Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
E. Cluster: Existing All-Purpose Cluster; Retries: None; Maximum Concurrent Runs: 1
Correct Answer:
D. Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1
Question 17
All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG
There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to retain records containing PII in this table for only 14 days after initial ingestion, while retaining non-PII records indefinitely. Which of the following solutions meets the requirements?
A. All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information
B. Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory
C. Because the value field is stored as binary data, this information is not considered PII and no special precautions should be taken
D. Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level
E. Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries
Correct Answer:
E. Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries
Question 18
A team of data engineers are adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks. One member of the team suggests reusing these data quality rules across all tables defined for this pipeline. What approach would allow them to do this?
A. Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files
B. Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline
C. Maintain data quality rules in a separate Databricks notebook that each DLT notebook or file can import as a library
D. Maintain data quality rules in a Delta table outside of this pipeline's target schema, providing the schema name as a pipeline parameter
Correct Answer:
D. Maintain data quality rules in a Delta table outside of this pipeline's target schema, providing the schema name as a pipeline parameter
Question 19
The business reporting team requires that data for their dashboards be updated every hour. The total processing time for the pipeline that extracts, transforms, and loads the data is 10 minutes. Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?
A. Manually trigger a job anytime the business reporting team refreshes their dashboards
B. Schedule a job to execute the pipeline once an hour on a new job cluster
C. Schedule a Structured Streaming job with a trigger interval of 60 minutes
D. Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster
E. Configure a job that executes every time new data lands in a given directory
Correct Answer:
B. Schedule a job to execute the pipeline once an hour on a new job cluster
Question 20
You are testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.
assert(myIntegrate(lambda x: x*x, 0, 3)[0] == 9)
Which kind of test would the above line exemplify?
A. Unit
B. Manual
C. Functional
D. Integration
E. End-to-end
Correct Answer:
A. Unit
Aced these? Get the Full Exam
Download the complete Certified Data Engineer Professional study bundle with 137+ questions in a single printable PDF.