Free Certified Associate Developer for Apache Spark Sample Questions

Free Certified Associate Developer for Apache Spark sample questions for the Certified Associate Developer for Apache Spark exam. No account required: study at your own pace.

Want an interactive quiz? Take the full Certified Associate Developer for Apache Spark practice test

Looking for more? Click here to get the full PDF with 164+ practice questions for $10 for offline study and deeper preparation.

Question 1

Which of the following code blocks applies the function assessPerformance() to each row of DataFrame storesDF?

  • A. storesDF.collect.foreach(assessPerformance(row))
  • B. storesDF.collect().apply(assessPerformance)
  • C. storesDF.collect.apply(row => assessPerformance(row))
  • D. storesDF.collect.map(assessPerformance(row))
  • E. storesDF.collect.foreach(row => assessPerformance(row))
Correct Answer:
E. storesDF.collect.foreach(row => assessPerformance(row))
Question 2

Which of the following operations will fail to trigger evaluation?

  • A. DataFrame.collect()
  • B. DataFrame.count()
  • C. DataFrame.first()
  • D. DataFrame.join()
  • E. DataFrame.take()
Correct Answer:
D. DataFrame.join()
Question 3

Which of the following is the most complete description of lazy evaluation?

  • A. None of these options describe lazy evaluation
  • B. A process is lazily evaluated if its execution does not start until it is put into action by some type of trigger
  • C. A process is lazily evaluated if its execution does not start until it is forced to display a result to the user
  • D. A process is lazily evaluated if its execution does not start until it reaches a specified date and time
  • E. A process is lazily evaluated if its execution does not start until it is finished compiling
Correct Answer:
B. A process is lazily evaluated if its execution does not start until it is put into action by some type of trigger
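The trigger-based definition above can be illustrated in plain Python (an analogy only, not Spark itself): a generator pipeline describes work but performs none of it until an action-like call such as `list()` forces evaluation, much like a Spark transformation waits for an action.

```python
# Lazy evaluation analogy in plain Python (not Spark):
# building the pipeline does no work; list() is the "trigger".

log = []

def track(x):
    log.append(x)   # records when each element is actually processed
    return x * 2

pipeline = (track(x) for x in range(5))   # "transformation": nothing runs yet

assert log == []                # no element has been processed
result = list(pipeline)         # "action": triggers evaluation

assert result == [0, 2, 4, 6, 8]
assert log == [0, 1, 2, 3, 4]   # work happened only at the trigger
```

The names `track`, `pipeline`, and `log` are illustrative only; the point is that evaluation is deferred until something consumes the result.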
Question 4

Which of the following operations fails to return a DataFrame with no duplicate rows?

  • A. DataFrame.dropDuplicates()
  • B. DataFrame.distinct()
  • C. DataFrame.drop_duplicates()
  • D. DataFrame.drop_duplicates(subset = None)
  • E. DataFrame.drop_duplicates(subset = "all")
Correct Answer:
E. DataFrame.drop_duplicates(subset = "all")
Question 5

Which of the following code blocks attempts to cache the partitions of DataFrame storesDF only in Spark’s memory?

  • A. storesDF.cache(StorageLevel.MEMORY_ONLY).count()
  • B. storesDF.persist().count()
  • C. storesDF.cache().count()
  • D. storesDF.persist(StorageLevel.MEMORY_ONLY).count()
  • E. storesDF.persist("MEMORY_ONLY").count()
Correct Answer:
D. storesDF.persist(StorageLevel.MEMORY_ONLY).count()
Question 6

The code block shown below contains an error. The code block is intended to print the schema of DataFrame storesDF. Identify the error. Code block: storesDF.printSchema.getAs[String]

  • A. There is no printSchema member of DataFrame – the getSchema() operation should be used instead
  • B. There is no printSchema member of DataFrame – the schema() operation should be used instead
  • C. The entire line needs to be a string – it should be wrapped by str()
  • D. The printSchema member of DataFrame is an operation prints the DataFrame – there is no need to call getAs
  • E. There is no printSchema member of DataFrame – schema and the print() function should be used instead
Correct Answer:
D. The printSchema member of DataFrame is an operation that prints the schema of the DataFrame – there is no need to call getAs
Question 7

Which of the following describes the Spark driver?

  • A. The Spark driver is responsible for performing all execution in all execution modes – it is the entire Spark application
  • B. The Spark driver is fault tolerant – if it fails, it will recover the entire Spark application
  • C. The Spark driver is the coarsest level of the Spark execution hierarchy – it is synonymous with the Spark application
  • D. The Spark driver is the program space in which the Spark application’s main method runs, coordinating the entire Spark application
  • E. The Spark driver is horizontally scaled to increase overall processing throughput of a Spark application
Correct Answer:
D. The Spark driver is the program space in which the Spark application’s main method runs, coordinating the entire Spark application
Question 8

The code block shown below should return a new DataFrame that is the result of an inner join between DataFrame storesDF and DataFrame employeesDF on column storeId. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: storesDF.__1__(__2__, __3__, __4__)

  • A. 1. join 2. employeesDF 3. "inner" 4. storesDF.storeId === employeesDF.storeId
  • B. 1. join 2. employeesDF 3. "storeId" 4. "inner"
  • C. 1. merge 2. employeesDF 3. "storeId" 4. "inner"
  • D. 1. join 2. employeesDF 3. "inner" 4. "storeId"
  • E. 1. join 2. employeesDF 3. "inner" 4. "storeId"
Correct Answer:
B. 1. join 2. employeesDF 3. "storeId" 4. "inner"
Question 9

Which of the following storage levels should be used to store as much data as possible in memory on two cluster nodes while storing any data that does not fit in memory on disk to be read in when needed?

  • A. MEMORY_ONLY_2
  • B. MEMORY_AND_DISK_SER
  • C. MEMORY_AND_DISK
  • D. MEMORY_AND_DISK_2
  • E. MEMORY_ONLY
Correct Answer:
D. MEMORY_AND_DISK_2
Question 10

Which of the following code blocks returns a DataFrame containing only the rows from DataFrame storesDF where the value in column sqft is less than or equal to 25,000 OR the value in column customerSatisfaction is greater than or equal to 30?

  • A. storesDF.filter(col("sqft") <= 25000 and col("customerSatisfaction") >= 30)
  • B. storesDF.filter(col("sqft") <= 25000 | col("customerSatisfaction") >= 30)
  • C. storesDF.filter(col(sqft) <= 25000 or col(customerSatisfaction) >= 30)
  • D. storesDF.filter(sqft <= 25000 | customerSatisfaction >= 30)
  • E. storesDF.filter(col("sqft") <= 25000 or col("customerSatisfaction") >= 30)
Correct Answer:
B. storesDF.filter(col("sqft") <= 25000 | col("customerSatisfaction") >= 30)
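Note that in actual PySpark code, each comparison needs its own parentheses – `storesDF.filter((col("sqft") <= 25000) | (col("customerSatisfaction") >= 30))` – because Python's `|` operator binds more tightly than comparisons and will otherwise be applied to `25000` and the Column first. The same precedence rule can be seen with plain integers:

```python
# In Python, | binds more tightly than <= or >=, so without parentheses
# the bitwise OR runs first. With plain integers:
parsed_without_parens = (1 <= 0 | 3)        # parsed as 1 <= (0 | 3), i.e. 1 <= 3
parsed_with_parens = (1 <= 0) | (3 >= 10)   # explicit grouping: False | False

assert parsed_without_parens is True
assert parsed_with_parens is False
```

On PySpark Columns the ungrouped form typically raises an error rather than silently misbehaving, which is why the parenthesized form is the idiomatic one.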
Question 11

Which of the following describes the difference between DataFrame.repartition(n) and DataFrame.coalesce(n)?

  • A. DataFrame.repartition(n) will split a DataFrame into n number of new partitions with data distributed evenly. DataFrame.coalesce(n) will more quickly combine the existing partitions of a DataFrame but might result in an uneven distribution of data across the new partitions
  • B. While the results are similar, DataFrame.repartition(n) will be more efficient than DataFrame.coalesce(n) because it can partition a DataFrame by a column
  • C. DataFrame.repartition(n) will split a DataFrame into any number of new partitions while minimizing shuffling. DataFrame.coalesce(n) will split a DataFrame into any number of new partitions utilizing a full shuffle
  • D. While the results are similar, DataFrame.repartition(n) will be less efficient than DataFrame.coalesce(n) because it can partition a DataFrame by a column
  • E. DataFrame.repartition(n) will combine the existing partitions of a DataFrame but may result in an uneven distribution of data across the new partitions. DataFrame.coalesce(n) will more slowly split a DataFrame into n number of new partitions with data distributed evenly
Correct Answer:
A. DataFrame.repartition(n) will split a DataFrame into n number of new partitions with data distributed evenly. DataFrame.coalesce(n) will more quickly combine the existing partitions of a DataFrame but might result in an uneven distribution of data across the new partitions
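The even-versus-skewed distinction can be sketched with a toy model in plain Python (an analogy, not Spark's implementation): the `repartition` and `coalesce` functions below are hypothetical stand-ins that treat partitions as lists.

```python
# Toy analogy: partitions modeled as lists of elements.
import itertools

def repartition(partitions, n):
    """Full-shuffle analogy: redistribute every element round-robin
    into n evenly balanced partitions."""
    flat = list(itertools.chain.from_iterable(partitions))
    return [flat[i::n] for i in range(n)]

def coalesce(partitions, n):
    """No-shuffle analogy: merge existing partitions into n groups,
    preserving any existing skew."""
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i % n].extend(part)
    return groups

parts = [[1, 2, 3, 4, 5, 6], [7], [8]]   # skewed input partitions

even = repartition(parts, 2)
merged = coalesce(parts, 2)

assert sorted(len(p) for p in even) == [4, 4]     # balanced output
assert sorted(len(p) for p in merged) == [1, 7]   # skew preserved
```

Real Spark coalesce merges partitions that are co-located to avoid a shuffle, which is why it is faster but cannot rebalance data.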
Question 12

The code block shown below should read a JSON at the file path filePath into a DataFrame with the specified schema schema. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__.__3__(__4__).format("json").__5__(__6__)

  • A. 1. spark 2. read() 3. schema 4. schema 5. json 6. filePath
  • B. 1. spark 2. read() 3. json 4. filePath 5. format 6. schema
  • C. 1. spark 2. read() 3. schema 4. schema 5. load 6. filePath
  • D. 1. spark 2. read 3. schema 4. schema 5. load 6. filePath
  • E. 1. spark 2. read 3. format 4. "json" 5. load 6. filePath
Correct Answer:
D. 1. spark 2. read 3. schema 4. schema 5. load 6. filePath
Question 13

The code block shown below should return a DataFrame containing all columns from DataFrame storesDF except for column sqft and column customerSatisfaction. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: __1__.__2__(__3__)

  • A. 1. drop 2. storesDF 3. col("sqft"), col("customerSatisfaction")
  • B. 1. storesDF 2. drop 3. sqft, customerSatisfaction
  • C. 1. storesDF 2. drop 3. "sqft", "customerSatisfaction"
  • D. 1. storesDF 2. drop 3. col(sqft), col(customerSatisfaction)
  • E. 1. drop 2. storesDF 3. col(sqft), col(customerSatisfaction)
Correct Answer:
C. 1. storesDF 2. drop 3. "sqft", "customerSatisfaction"
Question 14

Which of the following code blocks returns all the rows from DataFrame storesDF?

  • A. storesDF.head()
  • B. storesDF.collect()
  • C. storesDF.count()
  • D. storesDF.take()
  • E. storesDF.show()
Correct Answer:
B. storesDF.collect()
Question 15

The default value of spark.sql.shuffle.partitions is 200. Which of the following describes what that means?

  • A. By default, all DataFrames in Spark will be split to perfectly fill the memory of 200 executors
  • B. By default, new DataFrames created by Spark will be split to perfectly fill the memory of 200 executors
  • C. By default, Spark will only read the first 200 partitions of DataFrames to improve speed
  • D. By default, all DataFrames in Spark, including existing DataFrames, will be split into 200 unique segments for parallelization
  • E. By default, DataFrames will be split into 200 unique partitions when data is being shuffled
Correct Answer:
E. By default, DataFrames will be split into 200 unique partitions when data is being shuffled
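As a hedged illustration, this setting can be inspected and tuned at runtime through the SparkSession configuration. The sketch below assumes an already-active session bound to the name `spark`; it is a usage fragment, not a complete program.

```python
# Sketch only: assumes an active SparkSession bound to `spark`.
# Any wide transformation (join, groupBy, etc.) will shuffle its
# output into this many partitions.
print(spark.conf.get("spark.sql.shuffle.partitions"))   # "200" by default
spark.conf.set("spark.sql.shuffle.partitions", "64")    # tune for data volume
```

Lowering the value can help small datasets avoid many tiny partitions; raising it can relieve memory pressure on very large shuffles.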
Question 16

Which of the following code blocks returns a new DataFrame where column division from DataFrame storesDF has been replaced and renamed to column state and column managerName from DataFrame storesDF has been replaced and renamed to column managerFullName?

  • A. storesDF.withColumnRenamed("division", "state") .withColumnRenamed("managerName", "managerFullName")
  • B. storesDF.withColumn("state", "division") .withColumn("managerFullName", "managerName")
  • C. storesDF.withColumn("state", col("division")) .withColumn("managerFullName", col("managerName"))
  • D. storesDF.withColumnRenamed(Seq("division", "state"), Seq("managerName", "managerFullName"))
  • E. storesDF.withColumnRenamed("state", "division") .withColumnRenamed("managerFullName", "managerName")
Correct Answer:
A. storesDF.withColumnRenamed("division", "state") .withColumnRenamed("managerName", "managerFullName")
Question 17

Which of the following statements describing a difference between transformations and actions is incorrect?

  • A. There are wide and narrow transformations but there are not wide and narrow actions
  • B. Transformations do not trigger execution while actions do trigger execution
  • C. Transformations work on DataFrames/Datasets while actions are reserved for native language objects
  • D. Some actions can be used to return data objects in a format native to the programming language being used to access the Spark API while transformations do not provide this ability
  • E. Transformations are typically logic operations while actions are typically focused on returning results
Correct Answer:
B. Transformations do not trigger execution while actions do trigger execution
Question 18

Which of the following describes the difference between cluster and client execution modes?

  • A. The cluster execution mode runs the driver on a worker node within a cluster, while the client execution mode runs the driver on the client machine (also known as a gateway machine or edge node)
  • B. The cluster execution mode is run on a local cluster, while the client execution mode is run in the cloud
  • C. The cluster execution mode distributes executors across worker nodes in a cluster, while the client execution mode runs a Spark job entirely on one client machine
  • D. The cluster execution mode runs the driver on the cluster machine (also known as a gateway machine or edge node), while the client execution mode runs the driver on a worker node within a cluster
  • E. The cluster execution mode distributes executors across worker nodes in a cluster, while the client execution mode submits a Spark job from a remote machine to be run on a remote, unconfigurable cluster
Correct Answer:
A. The cluster execution mode runs the driver on a worker node within a cluster, while the client execution mode runs the driver on the client machine (also known as a gateway machine or edge node)
Question 19

Which of the following code blocks reads a CSV at the file path filePath into a DataFrame with the specified schema schema?

  • A. spark.read().csv(filePath)
  • B. spark.read().schema("schema").csv(filePath)
  • C. spark.read.schema(schema).csv(filePath)
  • D. spark.read.schema("schema").csv(filePath)
  • E. spark.read().schema(schema).csv(filePath)
Correct Answer:
C. spark.read.schema(schema).csv(filePath)
Question 20

The code block shown below should return a new DataFrame where column division from DataFrame storesDF has been renamed to column state and column managerName from DataFrame storesDF has been renamed to column managerFullName. Choose the response that correctly fills in the numbered blanks within the code block to complete this task. Code block: storesDF.__1__(__2__, __3__).__4__(__5__, __6__)

  • A. 1. withColumnRenamed 2. "state" 3. "division" 4. withColumnRenamed 5. "managerFullName" 6. "managerName"
  • B. 1. withColumnRenamed 2. division 3. col("state") 4. withColumnRenamed 5. "managerName" 6. col("managerFullName")
  • C. 1. withColumnRenamed 2. "division" 3. "state" 4. withColumnRenamed 5. "managerName" 6. "managerFullName"
  • D. 1. withColumn 2. "division" 3. "state" 4. withcolumn 5. "managerName" 6. "managerFullName"
  • E. 1. withColumn 2. "division" 3. "state" 4. withColumn 5. "managerName" 6. "managerFullName"
Correct Answer:
C. 1. withColumnRenamed 2. "division" 3. "state" 4. withColumnRenamed 5. "managerName" 6. "managerFullName"

Aced these? Get the Full Exam

Download the complete Certified Associate Developer for Apache Spark study bundle with 164+ questions in a single printable PDF.