Free GCP-PDE Sample Questions — Google Cloud Platform - Professional Data Engineer

Free GCP-PDE sample questions for the Google Cloud Platform - Professional Data Engineer exam. No account required: study at your own pace.

Want an interactive quiz? Take the full GCP-PDE practice test

Looking for more? Get the full PDF with 281+ practice questions for $10, ideal for offline study and deeper preparation.

Question 1

You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to packages in transit. You need a way to automate the detection of damaged packages and flag them for human review in real time. Which solution should you choose?

  • A. Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches
  • B. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications
  • C. Use the Cloud Vision API to detect damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function
  • D. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages
Correct Answer:
B. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications
Question 2

You are testing a Dataflow pipeline to ingest and transform text files. The files are gzip-compressed, errors are written to a dead-letter queue, and you are using SideInputs to join data. You notice that the pipeline is taking longer than expected to complete. What should you do to expedite the Dataflow job?

  • A. Switch to compressed Avro files
  • B. Reduce the batch size
  • C. Retry records that throw an error
  • D. Use CoGroupByKey instead of the SideInput
Correct Answer:
D. Use CoGroupByKey instead of the SideInput
Question 3

Your company's on-premises Apache Hadoop servers are approaching end-of-life, and IT has decided to migrate the cluster to Google Cloud Dataproc. A like-for-like migration of the cluster would require 50 TB of Google Persistent Disk per node. The CIO is concerned about the cost of using that much block storage. You want to minimize the storage cost of the migration. What should you do?

  • A. Put the data into Google Cloud Storage
  • B. Use preemptible virtual machines (VMs) for the Cloud Dataproc cluster
  • C. Tune the Cloud Dataproc cluster so that there is just enough disk for all data
  • D. Migrate some of the cold data into Google Cloud Storage, and keep only the hot data in Persistent Disk
Correct Answer:
A. Put the data into Google Cloud Storage
Question 4

You created a new version of a Dataflow streaming data ingestion pipeline that reads from Pub/Sub and writes to BigQuery. The previous version of the pipeline that runs in production uses a 5-minute window for processing. You need to deploy the new version of the pipeline without losing any data, creating inconsistencies, or increasing the processing latency by more than 10 minutes. What should you do?

  • A. Update the old pipeline with the new pipeline code
  • B. Snapshot the old pipeline, stop the old pipeline, and then start the new pipeline from the snapshot
  • C. Drain the old pipeline, then start the new pipeline
  • D. Cancel the old pipeline, then start the new pipeline
Correct Answer:
C. Drain the old pipeline, then start the new pipeline
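As a sketch of the drain-then-replace flow (job, template, and bucket names are placeholders), the gcloud commands might look like this. Draining stops the pipeline from pulling new Pub/Sub messages but finishes processing everything already in flight, which is what protects against data loss:

```shell
# Drain the running pipeline: no new input is consumed,
# but in-flight elements and open windows are flushed.
gcloud dataflow jobs drain OLD_JOB_ID --region=us-central1

# Once the drain completes, launch the new pipeline version,
# e.g. from a staged template (placeholder paths).
gcloud dataflow jobs run ingest-pipeline-v2 \
  --gcs-location=gs://my-bucket/templates/ingest-v2 \
  --region=us-central1
```

With a 5-minute window, the drain flush plus the new job's startup typically stays within the stated 10-minute latency budget.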
Question 5

You need to choose a database for a new project that has the following requirements:

  • Fully managed
  • Able to automatically scale up
  • Transactionally consistent
  • Able to scale up to 6 TB
  • Able to be queried using SQL

Which database do you choose?

  • A. Cloud SQL
  • B. Cloud Bigtable
  • C. Cloud Spanner
  • D. Cloud Datastore
Correct Answer:
C. Cloud Spanner
Question 6

You have a data processing application that runs on Google Kubernetes Engine (GKE). Containers need to be launched with their latest available configurations from a container registry. Your GKE nodes need to have GPUs, local SSDs, and 8 Gbps bandwidth. You want to efficiently provision the data processing infrastructure and manage the deployment process. What should you do?

  • A. Use Compute Engine startup scripts to pull container images, and use gcloud commands to provision the infrastructure
  • B. Use Cloud Build to schedule a job using Terraform build to provision the infrastructure and launch with the most current container images
  • C. Use GKE to autoscale containers, and use gcloud commands to provision the infrastructure
  • D. Use Dataflow to provision the data pipeline, and use Cloud Scheduler to run the job
Correct Answer:
B. Use Cloud Build to schedule a job using Terraform build to provision the infrastructure and launch with the most current container images
Question 7

An online retailer has built their current application on Google App Engine. A new initiative at the company mandates that they extend their application to allow their customers to transact directly via the application. They need to manage their shopping transactions and analyze combined data from multiple datasets using a business intelligence (BI) tool. They want to use only a single database for this purpose. Which Google Cloud database should they choose?

  • A. BigQuery
  • B. Cloud SQL
  • C. Cloud Bigtable
  • D. Cloud Datastore
Correct Answer:
B. Cloud SQL
Question 8

You are on the data governance team and are implementing security requirements. You need to encrypt all your data in BigQuery by using an encryption key managed by your team. You must implement a mechanism to generate and store encryption material only on your on-premises hardware security module (HSM). You want to rely on Google managed solutions. What should you do?

  • A. Create the encryption key in the on-premises HSM, and import it into a Cloud Key Management Service (Cloud KMS) key. Associate the created Cloud KMS key while creating the BigQuery resources
  • B. Create the encryption key in the on-premises HSM and link it to a Cloud External Key Manager (Cloud EKM) key. Associate the created Cloud KMS key while creating the BigQuery resources
  • C. Create the encryption key in the on-premises HSM, and import it into a Cloud HSM key in Cloud Key Management Service. Associate the created Cloud HSM key while creating the BigQuery resources
  • D. Create the encryption key in the on-premises HSM. Create BigQuery resources and encrypt data while ingesting them into BigQuery
Correct Answer:
B. Create the encryption key in the on-premises HSM and link it to a Cloud External Key Manager (Cloud EKM) key. Associate the created Cloud KMS key while creating the BigQuery resources
Question 9

Government regulations in your industry mandate that you have to maintain an auditable record of access to certain types of data. Assuming that all expiring logs will be archived correctly, where should you store data that is subject to that mandate?

  • A. Encrypted on Cloud Storage with user-supplied encryption keys. A separate decryption key will be given to each authorized user
  • B. In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability
  • C. In Cloud SQL, with separate database user names to each user. The Cloud SQL Admin activity logs will be used to provide the auditability
  • D. In a bucket on Cloud Storage that is accessible only by an AppEngine service that collects user information and logs the access before providing a link to the bucket
Correct Answer:
B. In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability
Question 10

You are operating a streaming Cloud Dataflow pipeline. Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy. You want to update the running pipeline with the new version. You want to ensure that no data is lost during the update. What should you do?

  • A. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to the existing job name
  • B. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to a new unique job name
  • C. Stop the Cloud Dataflow pipeline with the Cancel option. Create a new Cloud Dataflow job with the updated code
  • D. Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code
Correct Answer:
D. Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code
Question 11

You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?

  • A. Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage
  • B. Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage
  • C. Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery
  • D. Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption
Correct Answer:
B. Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage
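A sketch of the export step for one monthly table (dataset, table, and bucket names are placeholders). Compressed exports in Cloud Storage are much cheaper than duplicating data inside BigQuery, and the per-month tables let you restore only the affected month:

```shell
# Export one monthly table as gzip-compressed CSV shards to Cloud Storage.
bq extract \
  --destination_format=CSV \
  --compression=GZIP \
  'mydataset.sales_202401' \
  'gs://my-backup-bucket/sales_202401_*.csv.gz'
```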
Question 12

You need ads data to serve AI models and historical data for analytics. Longtail and outlier data points need to be identified. You want to cleanse the data in near-real time before running it through AI models. What should you do?

  • A. Use Cloud Storage as a data warehouse, shell scripts for processing, and BigQuery to create views for desired datasets
  • B. Use Dataflow to identify longtail and outlier data points programmatically, with BigQuery as a sink
  • C. Use BigQuery to ingest, prepare, and then analyze the data, and then run queries to create views
  • D. Use Cloud Composer to identify longtail and outlier data points, and then output a usable dataset to BigQuery
Correct Answer:
B. Use Dataflow to identify longtail and outlier data points programmatically, with BigQuery as a sink
Question 13

You are deploying an Apache Airflow directed acyclic graph (DAG) in a Cloud Composer 2 instance. You have incoming files in a Cloud Storage bucket that the DAG processes, one file at a time. The Cloud Composer instance is deployed in a subnetwork with no Internet access. Instead of running the DAG based on a schedule, you want to run the DAG in a reactive way every time a new file is received. What should you do?

  • A. 1. Enable Private Google Access in the subnetwork, and set up Cloud Storage notifications to a Pub/Sub topic. 2. Create a push subscription that points to the web server URL
  • B. 1. Enable the Cloud Composer API, and set up Cloud Storage notifications to trigger a Cloud Function. 2. Write a Cloud Function instance to call the DAG by using the Cloud Composer API and the web server URL. 3. Use Serverless VPC Access to reach the web server URL
  • C. 1. Enable the Airflow REST API, and set up Cloud Storage notifications to trigger a Cloud Function instance. 2. Create a Private Service Connect (PSC) endpoint. 3. Write a Cloud Function that connects to the Cloud Composer cluster through the PSC endpoint
  • D. 1. Enable the Airflow REST API, and set up Cloud Storage notifications to trigger a Cloud Function instance. 2. Write a Cloud Function instance to call the DAG by using the Airflow REST API and the web server URL. 3. Use Serverless VPC Access to reach the web server URL
Correct Answer:
C. 1. Enable the Airflow REST API, and set up Cloud Storage notifications to trigger a Cloud Function instance. 2. Create a Private Service Connect (PSC) endpoint. 3. Write a Cloud Function that connects to the Cloud Composer cluster through the PSC endpoint
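Inside the Cloud Function, the call to the Airflow 2 stable REST API could be sketched like this (the web server URL, DAG ID, and payload fields are placeholders; in this design the request is routed through the PSC endpoint):

```shell
# Trigger a DAG run for the newly arrived file via the Airflow REST API.
curl -X POST "${AIRFLOW_WEB_SERVER_URL}/api/v1/dags/process_incoming_file/dagRuns" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -d '{"conf": {"bucket": "incoming-files", "name": "new-file.csv"}}'
```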
Question 14

You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Dataproc and Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day. Which tool should you use?

  • A. cron
  • B. Cloud Composer
  • C. Cloud Scheduler
  • D. Workflow Templates on Dataproc
Correct Answer:
B. Cloud Composer
Question 15

You work for a bank. You have a labelled dataset that contains information on already-granted loan applications and whether those applications defaulted. You have been asked to train a model to predict default rates for credit applicants. What should you do?

  • A. Increase the size of the dataset by collecting additional data
  • B. Train a linear regression to predict a credit default risk score
  • C. Remove the bias from the data and collect applications that have been declined loans
  • D. Match loan applicants with their social profiles to enable feature engineering
Correct Answer:
B. Train a linear regression to predict a credit default risk score
Question 16

You are working on a niche product in the image recognition domain. Your team has developed a model that is dominated by custom C++ TensorFlow ops your team has implemented. These ops are used inside your main training loop and are performing bulky matrix multiplications. It currently takes up to several days to train a model. You want to decrease this time significantly and keep the cost low by using an accelerator on Google Cloud. What should you do?

  • A. Use Cloud TPUs without any additional adjustment to your code
  • B. Use Cloud TPUs after implementing GPU kernel support for your custom ops
  • C. Use Cloud GPUs after implementing GPU kernel support for your custom ops
  • D. Stay on CPUs, and increase the size of the cluster you're training your model on
Correct Answer:
C. Use Cloud GPUs after implementing GPU kernel support for your custom ops
Question 17

Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than before. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data volume means the batch jobs are falling behind. You have been asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

  • A. Rewrite the job in Pig
  • B. Rewrite the job in Apache Spark
  • C. Increase the size of the Hadoop cluster
  • D. Decrease the size of the Hadoop cluster but also rewrite the job in Hive
Correct Answer:
B. Rewrite the job in Apache Spark
Question 18

You want to migrate an Apache Spark 3 batch job from on-premises to Google Cloud. You need to minimally change the job so that the job reads from Cloud Storage and writes the result to BigQuery. Your job is optimized for Spark, where each executor has 8 vCPU and 16 GB memory, and you want to be able to choose similar settings. You want to minimize installation and management effort to run your job. What should you do?

  • A. Execute the job as part of a deployment in a new Google Kubernetes Engine cluster
  • B. Execute the job from a new Compute Engine VM
  • C. Execute the job in a new Dataproc cluster
  • D. Execute as a Dataproc Serverless job
Correct Answer:
D. Execute as a Dataproc Serverless job
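A sketch of submitting the job as a Dataproc Serverless batch while pinning the executor shape to match the existing tuning (jar path, class name, and region are placeholders):

```shell
# Submit the Spark job as a serverless batch: no cluster to create or manage.
gcloud dataproc batches submit spark \
  --region=us-central1 \
  --jars=gs://my-bucket/jobs/spark-job.jar \
  --class=com.example.IngestJob \
  --properties=spark.executor.cores=8,spark.executor.memory=16g
```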
Question 19

You are migrating a table to BigQuery and are deciding on the data model. Your table stores information related to purchases made across several store locations and includes information like the time of the transaction, items purchased, the store ID, and the city and state in which the store is located. You frequently query this table to see how many of each item were sold over the past 30 days and to look at purchasing trends by state, city, and individual store. How would you model this table for the best query performance?

  • A. Partition by transaction time; cluster by state first, then city, then store ID
  • B. Partition by transaction time; cluster by store ID first, then city, then state
  • C. Top-level cluster by state first, then city, then store ID
  • D. Top-level cluster by store ID first, then city, then state
Correct Answer:
A. Partition by transaction time; cluster by state first, then city, then store ID
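The chosen model could be expressed as DDL roughly like this (table and column names are placeholders). Partitioning by time prunes the 30-day scans; clustering from the broadest filter (state) down to the narrowest (store ID) matches the trend queries:

```shell
bq query --use_legacy_sql=false '
CREATE TABLE mydataset.purchases (
  transaction_time TIMESTAMP,
  item STRING,
  store_id STRING,
  city STRING,
  state STRING
)
PARTITION BY DATE(transaction_time)  -- prunes scans to the last 30 days
CLUSTER BY state, city, store_id     -- broadest to narrowest filter
'
```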
Question 20

You are administering a BigQuery on-demand environment. Your business intelligence tool is submitting hundreds of queries each day that aggregate a large (50 TB) sales history fact table at the day and month levels. These queries have a slow response time and are exceeding cost expectations. You need to decrease response time, lower query costs, and minimize maintenance. What should you do?

  • A. Build authorized views on top of the sales table to aggregate data at the day and month level
  • B. Enable BI Engine and add your sales table as a preferred table
  • C. Build materialized views on top of the sales table to aggregate data at the day and month level
  • D. Create a scheduled query to build sales day and sales month aggregate tables on an hourly basis
Correct Answer:
C. Build materialized views on top of the sales table to aggregate data at the day and month level
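A sketch of the materialized-view approach (view, table, and column names are placeholders). BigQuery maintains the view incrementally and can automatically rewrite the BI tool's aggregate queries against it, which is what cuts both response time and scanned bytes with no maintenance jobs:

```shell
bq query --use_legacy_sql=false '
CREATE MATERIALIZED VIEW mydataset.sales_by_day AS
SELECT
  DATE(transaction_time) AS sale_day,
  SUM(sale_amount) AS total_sales
FROM mydataset.sales_history
GROUP BY sale_day
'
```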

Aced these? Get the Full Exam

Download the complete GCP-PDE study bundle with 281+ questions in a single printable PDF.