Apache Spark Scala Interview Questions - Shyam Mallesh

Apache Spark

Apache Spark is a free, open-source analytics engine for large-scale data processing. Spark provides programming interfaces for users and includes features such as fault tolerance. It was originally developed at the University of California, Berkeley's AMPLab, and the codebase was later donated to the Apache Software Foundation, which continues to maintain it. Spark exposes most of its functionality through the DataFrames API. The whole idea behind employing a SQL interface for Spark is that a lot of data can be represented in a loose relational model.

 Aggregations are at the centre of processing large-scale data, because most of it ultimately feeds BI dashboards and machine learning, both of which require aggregation of one sort or another. Using the Spark SQL library, you can achieve almost everything you would get from a traditional relational database or a data warehouse query engine.
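As a quick illustration, here is a minimal sketch (assuming a local SparkSession and a made-up sales dataset; the column names are purely illustrative) showing the same aggregation written once with the DataFrame API and once through the SQL interface:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggregationSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; a real cluster would configure a master URL
    val spark = SparkSession.builder()
      .appName("AggregationSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data standing in for a real sales table
    val sales = Seq(
      ("electronics", 1200.0),
      ("electronics", 800.0),
      ("books", 35.0)
    ).toDF("category", "amount")

    // Aggregation with the DataFrame API
    sales.groupBy("category")
      .agg(sum("amount").alias("total"))
      .show()

    // The same aggregation through the SQL interface
    sales.createOrReplaceTempView("sales")
    spark.sql("SELECT category, SUM(amount) AS total FROM sales GROUP BY category").show()

    spark.stop()
  }
}
```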

 

 Components of Spark

 

  • Executors: Executors run multiple tasks; essentially, an executor is a JVM process sitting on every worker node. Executors receive tasks and run them, and they use caching so that tasks can run faster.
  • Tasks: The JARs, along with the code shipped to executors, are referred to as tasks.
  • Nodes: A node contains multiple executors.
  • RDDs: An RDD is a large data structure used to represent data that cannot be stored on a single machine. The data is therefore distributed, partitioned, and split across multiple machines (see the sketch after this list).
  • Inputs: Every RDD is created from some input, such as a text file, a Hadoop file, etc.
  • Output: The output of a function in Spark can produce an RDD; the model is functional because each function, one after another, receives an input RDD and produces an output RDD.
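To make the executor and partition picture concrete, here is a minimal sketch (using a local master that stands in for a real cluster; the numbers are arbitrary) of an RDD split into partitions and transformed into a new RDD:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddPartitionSketch {
  def main(args: Array[String]): Unit = {
    // local[4] simulates four executor slots on one machine
    val conf = new SparkConf().setAppName("RddPartitionSketch").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // An RDD built from an in-memory collection, split into 4 partitions;
    // on a cluster these partitions would be spread across executors on different nodes
    val numbers = sc.parallelize(1 to 1000, 4)
    println(s"Partitions: ${numbers.getNumPartitions}")

    // Each transformation takes an input RDD and produces an output RDD
    val squares = numbers.map(n => n * n)
    println(s"Sum of squares: ${squares.sum()}")

    sc.stop()
  }
}
```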




Important questions and answers

 

Q1. What is the significance of Resilient Distributed Datasets (RDDs) in Spark?

Ans- Resilient Distributed Datasets (RDDs) are the fundamental data structure of Apache Spark. RDDs are immutable, fully fault-tolerant, distributed collections of objects that can be operated on in parallel. An RDD is separated into partitions that can be executed on different nodes of a cluster. RDDs can be created from an existing RDD or from external datasets such as HDFS or HBase.
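A minimal sketch of both creation paths is shown below; the HDFS path is hypothetical and only illustrates reading an external dataset:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddCreationSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1) From an external dataset (hypothetical HDFS path)
    val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

    // 2) From an existing RDD: each transformation returns a new, immutable RDD
    val words = lines.flatMap(_.split("\\s+"))
    val wordLengths = words.map(_.length)

    println(s"Partitions: ${wordLengths.getNumPartitions}")
    sc.stop()
  }
}
```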

Q2. What is lazy evaluation in Spark?

Ans- Whenever Spark operates on a dataset, it only remembers the instructions given by the user. When a transformation such as map is called on an RDD, the operation is not performed at that moment. Transformations in Spark are not evaluated until the user performs an action; this optimises the overall data-processing workflow and is known as lazy evaluation.
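The sketch below illustrates the idea with a toy RDD: the map transformation is only recorded, and nothing runs until the count action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("LazyEvaluationSketch").setMaster("local[*]"))

    val data = sc.parallelize(1 to 1000000)

    // map is a transformation: Spark only records it in the lineage, nothing is computed yet
    val doubled = data.map(_ * 2)

    // count is an action: only now does Spark actually execute the recorded transformations
    val total = doubled.count()
    println(s"Count: $total")

    sc.stop()
  }
}
```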

Q3. What makes Spark good at low-latency workloads like graph processing and machine learning?

Ans- Apache Spark stores data in memory, which allows faster processing and lets machine learning models be built quickly. Machine learning algorithms require multiple passes over the same data to optimise the model, and graph algorithms repeatedly traverse the nodes and edges that make up a graph. Keeping that data in memory across iterations is what makes Spark faster than other frameworks for these low-latency workloads.
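A minimal sketch of the idea, with a toy iterative loop standing in for a real machine learning algorithm (the data and the "model update" are purely illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeCachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IterativeCachingSketch").setMaster("local[*]"))

    // Toy "training data"; a real job would load this from HDFS or another store
    val points = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0))

    // cache() keeps the RDD in executor memory, so every iteration below
    // rereads it from RAM instead of recomputing it
    points.cache()

    var estimate = 0.0
    for (_ <- 1 to 10) {
      // A toy iteration that repeatedly scans the cached data
      estimate = points.map(p => (p + estimate) / 2).mean()
    }
    println(s"Final estimate: $estimate")

    sc.stop()
  }
}
```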

Q4. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

Ans- To trigger automatic clean-ups, the user needs to set the spark.cleaner.ttl parameter.
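A minimal sketch of setting the parameter when building the context is shown below; note that spark.cleaner.ttl belongs to older Spark releases, and newer versions handle metadata clean-up automatically through the context cleaner.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CleanerTtlSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CleanerTtlSketch")
      .setMaster("local[*]")
      // Clean up accumulated metadata older than one hour (value is in seconds)
      .set("spark.cleaner.ttl", "3600")

    val sc = new SparkContext(conf)
    // ... long-running work would go here ...
    sc.stop()
  }
}
```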

Q5. How can the user connect Spark to Apache Mesos?

Ans- There are a total of four steps through which the user can connect Spark to Apache Mesos (see the configuration sketch after this list):

  a) Configure the Spark driver program to connect with Apache Mesos.
  b) Put the binary Spark package in a location that is accessible by Apache Mesos.
  c) Install Spark in the same location as Apache Mesos.
  d) Configure the spark.mesos.executor.home property to point to the location where Spark is installed.
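A configuration sketch covering steps (a) and (d) is shown below; the Mesos master URL, the installation path, and the package URI are placeholders for an actual deployment:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MesosConnectionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MesosConnectionSketch")
      // Point the driver at the Mesos master (placeholder host and port)
      .setMaster("mesos://mesos-master.example.com:5050")
      // Where each Mesos agent can find the local Spark installation
      .set("spark.mesos.executor.home", "/opt/spark")
      // Alternatively, a binary Spark package reachable by every agent (placeholder URI)
      .set("spark.executor.uri", "hdfs://namenode:8020/packages/spark.tgz")

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}
```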

 

Q6. What is a Parquet file, and what are its advantages?

Ans- Parquet is a columnar format that is supported by several data processing systems. With Parquet files, a big data Spark application can easily perform both read and write operations.

The advantages of the Parquet file format are listed below, followed by a short read/write sketch:

  i) It enables the user to fetch only the specific columns they need to access.
  ii) It occupies less space than row-oriented formats.
  iii) It follows a type-specific encoding method.
  iv) It keeps input and output operations limited.
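The sketch below (using a local session and a made-up people dataset) writes a DataFrame to Parquet and then reads back only the columns that are needed, which is where the space and I/O savings come from:

```scala
import org.apache.spark.sql.SparkSession

object ParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParquetSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(("Alice", 34, "IN"), ("Bob", 45, "US"))
      .toDF("name", "age", "country")

    // Write: columnar layout with type-specific encoding, compact on disk
    people.write.mode("overwrite").parquet("/tmp/people.parquet")

    // Read back only the needed columns; the columnar layout means the
    // "country" column is never read from disk, reducing I/O
    spark.read.parquet("/tmp/people.parquet").select("name", "age").show()

    spark.stop()
  }
}
```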
