Free DP-203 Sample Questions — Data Engineering on Microsoft Azure

Free DP-203 sample questions for the Data Engineering on Microsoft Azure exam. No account required: study at your own pace.

Want an interactive quiz? Take the full DP-203 practice test

Looking for more? Get the full PDF with 173+ practice questions ($10) for offline study and deeper preparation.

Question 1

You have an Azure Data Lake Storage Gen2 account that contains two folders named Folder1 and Folder2. You use Azure Data Factory to copy multiple files from Folder1 to Folder2. You receive the following error: "Operation on target Copy_sks failed: Failure happened on 'Sink' side. ErrorCode=DelimitedTextMoreColumnsThanDefined, 'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException, Message=Error found when processing 'Csv/Tsv Format Text' source '0_2020_11_09_11_43_32.avro' with row number 53: found more columns than expected column count 27., Source=Microsoft.DataTransfer.Common,'" What should you do to resolve the error?

  • A. Change the Copy activity setting to Binary Copy
  • B. Lower the degree of copy parallelism
  • C. Add an explicit mapping
  • D. Enable fault tolerance to skip incompatible rows
Correct Answer:
A. Change the Copy activity setting to Binary Copy
Question 2

What should you recommend using to secure sensitive customer contact information?

  • A. Transparent Data Encryption (TDE)
  • B. row-level security
  • C. column-level security
  • D. data sensitivity labels
Correct Answer:
C. column-level security
Question 3

A company purchases IoT devices to monitor manufacturing machinery. The company uses an Azure IoT Hub to communicate with the IoT devices. The company must be able to monitor the devices in real-time. You need to design the solution. What should you recommend?

  • A. Azure Analysis Services using Azure Portal
  • B. Azure Analysis Services using Azure PowerShell
  • C. Azure Stream Analytics cloud job using Azure Portal
  • D. Azure Data Factory instance using Microsoft Visual Studio
Correct Answer:
C. Azure Stream Analytics cloud job using Azure Portal
Question 4

A company has a real-time data analysis solution that is hosted on Microsoft Azure. The solution uses Azure Event Hub to ingest data and an Azure Stream Analytics cloud job to analyze the data. The cloud job is configured to use 120 Streaming Units (SU). You need to optimize performance for the Azure Stream Analytics job. Which two actions should you perform? Each correct answer presents part of the solution. NOTE: Each correct selection is worth one point.

  • A. Implement event ordering
  • B. Implement Azure Stream Analytics user-defined functions (UDF)
  • C. Implement query parallelization by partitioning the data output
  • D. Scale the SU count for the job up
  • E. Scale the SU count for the job down
  • F. Implement query parallelization by partitioning the data input
Correct Answer:
  • C. Implement query parallelization by partitioning the data output
  • F. Implement query parallelization by partitioning the data input
Question 5

You have an Azure Databricks workspace and an Azure Data Lake Storage Gen2 account named storage1. New files are uploaded daily to storage1. You need to recommend a solution that configures storage1 as a structured streaming source. The solution must meet the following requirements:

  • Incrementally process new files as they are uploaded to storage1.
  • Minimize implementation and maintenance effort.
  • Minimize the cost of processing millions of files.
  • Support schema inference and schema drift.

What should you include in the recommendation?

  • A. COPY INTO
  • B. Azure Data Factory
  • C. Auto Loader
  • D. Apache Spark FileStreamSource
Correct Answer:
C. Auto Loader
Question 6

You are designing a highly available Azure Data Lake Storage solution that will include geo-zone-redundant storage (GZRS). You need to monitor for replication delays that can affect the recovery point objective (RPO). What should you include in the monitoring solution?

  • A. 5xx: Server Error errors
  • B. Average Success E2E Latency
  • C. availability
  • D. Last Sync Time
Correct Answer:
D. Last Sync Time
Question 7

You have an Azure Synapse Analytics dedicated SQL pool. You need to create a fact table named Table1 that will store sales data from the last three years. The solution must be optimized for the following query operations:

  • Show order counts by week.
  • Calculate sales totals by region.
  • Calculate sales totals by product.
  • Find all the orders from a given month.

Which data should you use to partition Table1?

  • A. product
  • B. month
  • C. week
  • D. region
Correct Answer:
B. month
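Month partitioning keeps the partition count low, which matters for a clustered columnstore fact table: each partition of each distribution should still be able to fill a compressed rowgroup. A back-of-the-envelope check in Python (the 60-distribution and ~1M-rows-per-rowgroup figures are general dedicated SQL pool rules of thumb, not stated in the question):

```python
# Rule of thumb: a dedicated SQL pool table is spread across 60
# distributions, and a columnstore rowgroup compresses best at ~1M rows.
DISTRIBUTIONS = 60
ROWS_PER_ROWGROUP = 1_000_000

# Rows needed so every partition in every distribution can fill
# at least one rowgroup:
partitions_by_month = 3 * 12   # three years, partitioned monthly
min_rows = partitions_by_month * DISTRIBUTIONS * ROWS_PER_ROWGROUP
print(partitions_by_month, min_rows)   # 36 partitions, 2.16B rows

# Partitioning by week instead would more than quadruple the count:
partitions_by_week = 3 * 52
print(partitions_by_week * DISTRIBUTIONS * ROWS_PER_ROWGROUP)
```

With monthly partitions the table's five billion rows comfortably exceed the 2.16B-row threshold, while weekly partitioning would push the requirement past nine billion rows and fragment the columnstore.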
Question 8

You have an Azure Synapse Analytics workspace named WS1 that contains an Apache Spark pool named Pool1. You plan to create a database named DB1 in Pool1. You need to ensure that when tables are created in DB1, the tables are available automatically as external tables to the built-in serverless SQL pool. Which format should you use for the tables in DB1?

  • A. Parquet
  • B. ORC
  • C. JSON
  • D. HIVE
Correct Answer:
A. Parquet
Question 9

You have an activity in an Azure Data Factory pipeline. The activity calls a stored procedure in a data warehouse in Azure Synapse Analytics and runs daily. You need to verify the duration of the activity when it ran last. What should you use?

  • A. activity runs in Azure Monitor
  • B. Activity log in Azure Synapse Analytics
  • C. the sys.dm_pdw_wait_stats data management view in Azure Synapse Analytics
  • D. an Azure Resource Manager template
Correct Answer:
A. activity runs in Azure Monitor
Question 10

You have an Azure subscription that contains an Azure data factory named ADF1 and a Log Analytics workspace named Workspace1. You need to configure ADF1 to send execution information for pipelines to Workspace1. What should you configure?

  • A. diagnostic settings
  • B. metrics
  • C. logs
  • D. alerts
Correct Answer:
A. diagnostic settings
Question 11

You plan to create an Azure Data Factory pipeline that will include a mapping data flow. You have JSON data containing objects that have nested arrays. You need to transform the JSON-formatted data into a tabular dataset. The dataset must have one row for each item in the arrays. Which transformation method should you use in the mapping data flow?

  • A. new branch
  • B. unpivot
  • C. alter row
  • D. flatten
Correct Answer:
D. flatten
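The flatten transformation unrolls a nested array so that each array element becomes its own row, with the parent object's scalar fields repeated on every row. A plain-Python sketch of the same operation (the order record below is hypothetical):

```python
import json

# Hypothetical source record: one object whose "items" field is a nested array.
record = json.loads("""
{
  "orderId": 1,
  "customer": "Contoso",
  "items": [
    {"sku": "A100", "qty": 2},
    {"sku": "B200", "qty": 1}
  ]
}
""")

# One tabular row per array item, carrying the parent fields along:
rows = [
    {"orderId": record["orderId"], "customer": record["customer"], **item}
    for item in record["items"]
]
for row in rows:
    print(row)
# {'orderId': 1, 'customer': 'Contoso', 'sku': 'A100', 'qty': 2}
# {'orderId': 1, 'customer': 'Contoso', 'sku': 'B200', 'qty': 1}
```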
Question 12

You have an Azure Synapse Analytics dedicated SQL pool that contains a large fact table. The table contains 50 columns and 5 billion rows and is a heap. Most queries against the table aggregate values from approximately 100 million rows and return only two columns. You discover that the queries against the fact table are very slow. Which type of index should you add to provide the fastest query times?

  • A. nonclustered columnstore
  • B. clustered columnstore
  • C. nonclustered
  • D. clustered
Correct Answer:
B. clustered columnstore
Question 13

You need to design a solution that will process streaming data from an Azure Event Hub and output the data to Azure Data Lake Storage. The solution must ensure that analysts can interactively query the streaming data. What should you use?

  • A. Azure Stream Analytics and Azure Synapse notebooks
  • B. Structured Streaming in Azure Databricks
  • C. event triggers in Azure Data Factory
  • D. Azure Queue storage and read-access geo-redundant storage (RA-GRS)
Correct Answer:
B. Structured Streaming in Azure Databricks
Question 14

You have an Azure Synapse Analytics job that uses Scala. You need to view the status of the job. What should you do?

  • A. From Synapse Studio, select the workspace. From Monitor, select SQL requests
  • B. From Azure Monitor, run a Kusto query against the AzureDiagnostics table
  • C. From Synapse Studio, select the workspace. From Monitor, select Apache Spark applications
  • D. From Azure Monitor, run a Kusto query against the SparkLoggingEvent_CL table
Correct Answer:
C. From Synapse Studio, select the workspace. From Monitor, select Apache Spark applications
Question 15

You have an Azure data factory. You need to examine the pipeline failures from the last 60 days. What should you use?

  • A. the Activity log blade for the Data Factory resource
  • B. the Monitor & Manage app in Data Factory
  • C. the Resource health blade for the Data Factory resource
  • D. Azure Monitor
Correct Answer:
D. Azure Monitor
Question 16

You have an Azure Synapse Analytics Apache Spark pool named Pool1. You plan to load JSON files from an Azure Data Lake Storage Gen2 container into the tables in Pool1. The structure and data types vary by file. You need to load the files into the tables. The solution must maintain the source data types. What should you do?

  • A. Use a Conditional Split transformation in an Azure Synapse data flow
  • B. Use a Get Metadata activity in Azure Data Factory
  • C. Load the data by using the OPENROWSET Transact-SQL command in an Azure Synapse Analytics serverless SQL pool
  • D. Load the data by using PySpark
Correct Answer:
D. Load the data by using PySpark
Question 17

You have an Azure Stream Analytics job that receives clickstream data from an Azure event hub. You need to define a query in the Stream Analytics job. The query must meet the following requirements:

  • Count the number of clicks within each 10-second window based on the country of a visitor.
  • Ensure that each click is NOT counted more than once.

How should you define the query?

  • A. SELECT Country, Avg(*) AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SlidingWindow(second, 10)
  • B. SELECT Country, Count(*) AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, TumblingWindow(second, 10)
  • C. SELECT Country, Avg(*) AS Average FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, HoppingWindow(second, 10, 2)
  • D. SELECT Country, Count(*) AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, SessionWindow(second, 5, 10)
Correct Answer:
B. SELECT Country, Count(*) AS Count FROM ClickStream TIMESTAMP BY CreatedAt GROUP BY Country, TumblingWindow(second, 10)
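A tumbling window assigns each event to exactly one fixed, non-overlapping window, which is why no click is counted twice (sliding and hopping windows overlap, so events can appear in several windows). A small Python simulation of the 10-second tumbling count (the click timestamps and countries are made up):

```python
# Hypothetical click events as (timestamp_seconds, country):
clicks = [(1, "US"), (4, "US"), (9, "DE"), (12, "US"), (19, "DE"), (21, "US")]

WINDOW = 10  # seconds, matching TumblingWindow(second, 10)
counts = {}
for ts, country in clicks:
    # Each event falls into exactly one window: 0-9 -> 0, 10-19 -> 10, ...
    window_start = (ts // WINDOW) * WINDOW
    key = (window_start, country)
    counts[key] = counts.get(key, 0) + 1

print(counts)
# {(0, 'US'): 2, (0, 'DE'): 1, (10, 'US'): 1, (10, 'DE'): 1, (20, 'US'): 1}
```

Because the windows partition the timeline, the per-window counts sum to the total number of clicks, satisfying the count-once requirement.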
Question 18

You are implementing a batch dataset in the Parquet format. Data files will be produced by using Azure Data Factory and stored in Azure Data Lake Storage Gen2. The files will be consumed by an Azure Synapse Analytics serverless SQL pool. You need to minimize storage costs for the solution. What should you do?

  • A. Use Snappy compression for the files
  • B. Use OPENROWSET to query the Parquet files
  • C. Create an external table that contains a subset of columns from the Parquet files
  • D. Store all data as string in the Parquet files
Correct Answer:
A. Use Snappy compression for the files
Question 19

You are designing a folder structure for the files in an Azure Data Lake Storage Gen2 account. The account has one container that contains three years of data. You need to recommend a folder structure that meets the following requirements:

  • Supports partition elimination for queries by Azure Synapse Analytics serverless SQL pools
  • Supports fast data retrieval for data from the current month
  • Simplifies data security management by department

Which folder structure should you recommend?

  • A. \Department\DataSource\YYYY\MM\DataFile_YYYYMMDD.parquet
  • B. \DataSource\Department\YYYYMM\DataFile_YYYYMMDD.parquet
  • C. \DD\MM\YYYY\Department\DataSource\DataFile_DDMMYY.parquet
  • D. \YYYY\MM\DD\Department\DataSource\DataFile_YYYYMMDD.parquet
Correct Answer:
A. \Department\DataSource\YYYY\MM\DataFile_YYYYMMDD.parquet
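Putting the department at the top lets security be managed with a single ACL on the department folder, while the YYYY\MM levels underneath still support month-based partition elimination. A short, hypothetical Python sketch of the layout (the function, department, and source names are illustrative):

```python
from datetime import date

def data_file_path(department: str, source: str, d: date) -> str:
    """Build a path following \\Department\\DataSource\\YYYY\\MM\\DataFile_YYYYMMDD.parquet."""
    return (f"\\{department}\\{source}\\{d:%Y}\\{d:%m}"
            f"\\DataFile_{d:%Y%m%d}.parquet")

print(data_file_path("Sales", "CRM", date(2024, 3, 7)))
# \Sales\CRM\2024\03\DataFile_20240307.parquet
```

A query filtered to the current month only needs to touch one YYYY\MM subfolder per department, which is what makes current-month retrieval fast.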
Question 20

You have an Azure Data Factory pipeline named P1. You need to schedule P1 to run at 10:15 AM, 12:15 PM, 2:15 PM, and 4:15 PM every day. Which frequency and interval should you configure for the scheduled trigger?

  • A. Frequency: Month - Interval: 1
  • B. Frequency: Day - Interval: 1
  • C. Frequency: Minute - Interval: 60
  • D. Frequency: Hour - Interval: 2
Correct Answer:
B. Frequency: Day - Interval: 1
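A schedule trigger with Frequency: Day and Interval: 1 can still list several fixed times of day in its schedule, which is how a single daily trigger covers all four runs. A small Python sketch of the resulting run times (the hour/minute values mirror the question; the date is arbitrary):

```python
from datetime import datetime

# Schedule values a daily trigger could carry (hours list + minute):
schedule_hours = [10, 12, 14, 16]
schedule_minute = 15

day = datetime(2024, 1, 1)  # any single day of the recurrence
runs = [day.replace(hour=h, minute=schedule_minute) for h in schedule_hours]
print([r.strftime("%I:%M %p") for r in runs])
# ['10:15 AM', '12:15 PM', '02:15 PM', '04:15 PM']
```

Hour/Minute frequencies with a fixed interval cannot hit exactly these four times: a 2-hour interval, for example, would also fire at 6:15 PM, 8:15 PM, and so on.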

Aced these? Get the Full Exam

Download the complete DP-203 study bundle with 173+ questions in a single printable PDF.