Apache Spark Basics FAQ


Big data has come a long way, and Apache Spark is one of the fastest big data computation engines. This article answers frequently asked questions about the basics of Apache Spark.

What problems does Apache Spark solve, and how does it solve them?

Big data computation problem

When data runs to terabytes, loading it into a single machine's memory and processing it there is slow and inefficient. Running such computations on high-end machines (large memory, multiple cores and processors) is also very expensive.

Apache Spark is a cluster-based parallel processing engine that runs efficiently on clusters of low-end machines. It can process data in memory as well as on disk.

Limitation in MapReduce processing

MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. It is the programming model Apache Hadoop uses for big data computation.

MapReduce processes everything on disk (a cluster of disks) in the following sequential steps:

  1. Read data from disk
  2. Map data
  3. Reduce data
  4. Write result on disk
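
The four steps above can be sketched on a single machine in plain Python. This is only an illustration of the programming model, not Hadoop's API: the function names (`map_phase`, `reduce_phase`, `run_job`) and the sample text are made up for this example, and a real job would spread the map and reduce work over many machines.

```python
import os
import tempfile
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_phase(pairs):
    """Shuffle and reduce: group the pairs by word and sum the counts."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_job(input_path, output_path):
    # Step 1: read data from disk.
    with open(input_path, encoding="utf-8") as src:
        lines = src.readlines()
    # Steps 2 and 3: map the data, then reduce it.
    counts = reduce_phase(map_phase(lines))
    # Step 4: write the result back to disk.
    with open(output_path, "w", encoding="utf-8") as dst:
        for word in sorted(counts):
            dst.write(f"{word}\t{counts[word]}\n")
    return counts


# Hypothetical sample input, written to a temporary directory.
workdir = tempfile.mkdtemp()
input_path = os.path.join(workdir, "input.txt")
output_path = os.path.join(workdir, "output.txt")
with open(input_path, "w", encoding="utf-8") as f:
    f.write("spark hadoop spark\nhadoop spark\n")

word_counts = run_job(input_path, output_path)
print(word_counts)  # {'spark': 3, 'hadoop': 2}
```

Note that a second job over the same input would have to start again from step 1, which is where the disk cost comes from.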

Disk I/O (input and output) takes most of the time of a MapReduce operation. It becomes really inefficient when a problem needs multiple iterations over the same data set, because every iteration reads from and writes to disk again. Graph manipulation, machine learning algorithms, and other problems need such iterations.

Apache Spark overcomes these MapReduce bottlenecks with its in-memory resilient distributed dataset (RDD) data structure, a read-only multiset of data items distributed over a cluster. The Spark project has claimed speeds up to 100x faster than Apache Hadoop MapReduce in memory and up to 10x faster on disk; actual gains depend on the workload.

With RDDs, it is practical to run iterative algorithms that visit the same data set in a loop, as well as repeated database-style queries.
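
To see why keeping a data set in memory helps iterative algorithms, here is a rough single-machine sketch. It does not use Spark: `DiskDataset` and `estimate_mean` are hypothetical names invented for this example, and the "algorithm" is a toy loop that nudges an estimate toward the mean of the data. The only point is to count how often the data is read from disk.

```python
import os
import tempfile


class DiskDataset:
    """Hypothetical data set that counts how often it is read from disk."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0
        self._cache = None

    def load(self, use_cache):
        if use_cache and self._cache is not None:
            return self._cache  # served from memory, no disk access
        self.disk_reads += 1
        with open(self.path, encoding="utf-8") as src:
            values = [float(line) for line in src]
        if use_cache:
            self._cache = values
        return values


def estimate_mean(dataset, iterations, use_cache):
    """Iteratively move an estimate toward the mean of the data set."""
    estimate = 0.0
    for _ in range(iterations):
        values = dataset.load(use_cache)
        gradient = sum(estimate - v for v in values) / len(values)
        estimate -= 0.5 * gradient
    return estimate


# Hypothetical sample data: four numbers whose mean is 2.5.
path = os.path.join(tempfile.mkdtemp(), "values.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("1\n2\n3\n4\n")

on_disk = DiskDataset(path)
in_memory = DiskDataset(path)
disk_result = estimate_mean(on_disk, iterations=10, use_cache=False)
memory_result = estimate_mean(in_memory, iterations=10, use_cache=True)
print(on_disk.disk_reads, in_memory.disk_reads)  # 10 1
```

Both runs compute the same estimate, but the uncached run reads the file once per iteration while the cached run reads it once in total.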

The complex big data ecosystem

Apache Hadoop, another big data computation platform, has grown into a complex ecosystem of tools and libraries for real-time streaming, structured data analysis, machine learning, and so on. Development teams usually have to adopt a new framework for each need, which increases the cost of maintenance.

Apache Spark provides a unified ecosystem: a low-level API along with high-level APIs and tools for real-time streaming, machine learning, and more.

What are the various components of Apache Spark?


[Figure: Apache Spark components]

Spark Core - Spark Core includes the RDD (Resilient Distributed Dataset) API, cluster management, scheduling, data source handling, memory management, fault tolerance, and other functionality. This fast, general-purpose computational core provides the foundation for building higher-level APIs for various purposes. The benefit of this tightly coupled architecture is that when the core improves, the higher-level APIs benefit as well.

Spark SQL, DataFrames, and Datasets - It provides an API for processing structured data (JSON, relational databases, and others). You can query a data set using SQL-like syntax.

Spark Streaming - It provides an API for processing real-time streams of data coming from sources such as Kafka, Flume, Kinesis, or TCP sockets. These streams can be processed further with machine learning, graph, and other algorithms.

Spark MLlib (Machine Learning) - It provides an API for running machine learning algorithms such as classification, regression, collaborative filtering, and others.

Spark GraphX - It provides an API for parallel computation on graph data (e.g., a Facebook friend graph). It also has built-in support for graph algorithms such as PageRank and triangle counting.
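
For readers new to graph algorithms, triangle counting simply asks how many groups of three mutually connected nodes a graph contains. The sketch below shows the idea on a tiny friend graph in plain Python. It is not GraphX code: `count_triangles` and the names in the sample graph are invented for this example, and GraphX would run the computation in parallel over a cluster.

```python
from itertools import combinations


def count_triangles(edges):
    """Count triangles in an undirected graph given as (node, node) pairs."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    corner_hits = 0
    for adjacent in neighbors.values():
        # Two neighbors of the same node that are also connected to each
        # other close a triangle with that node.
        for u, v in combinations(sorted(adjacent), 2):
            if v in neighbors[u]:
                corner_hits += 1
    # Every triangle was found once from each of its three corners.
    return corner_hits // 3


# Hypothetical friend graph: ann, bob, and cat all know each other,
# and dan only knows cat.
friendships = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")]
print(count_triangles(friendships))  # 1
```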

In which programming languages can you write Spark applications?

[Figure: Programming languages supported by Apache Spark]

Apache Spark provides libraries and tools for writing applications in the Scala, Java, Python, and R programming languages.

You can also interact with Apache Spark through its Scala, Python, and R command-line interfaces (CLIs) to run exploratory queries, which helps data scientists a lot.

How does an Apache Spark application execute?

[Figure: Lifecycle of Apache Spark program execution on a cluster]

The lifecycle for Spark program execution on the cluster:

  1. You write a Spark application, package it, and submit it to the cluster (to the main Spark server, not to a worker node). Your application's main program (the driver program) creates a SparkContext. The application runs on the cluster as an independent set of processes coordinated by that SparkContext.
  2. SparkContext can connect to several types of cluster managers, which allocate resources across applications.
  3. Once connected, SparkContext acquires executor processes on the worker nodes. These processes run computations and store data for the application.
  4. SparkContext sends the application code (JARs or Python files) to the executors.
  5. SparkContext sends tasks to the executors to run.
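
The division of work in the steps above can be imitated on one machine with Python's standard thread pool. This is a loose analogy rather than Spark's API: the pool stands in for a single executor that runs tasks in multiple threads, and `make_partitions`, `sum_of_squares`, and `run_driver` are hypothetical names chosen for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def make_partitions(data, num_partitions):
    """Driver side: split the data set into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def sum_of_squares(partition):
    """Task: the code the driver ships to an executor for one partition."""
    return sum(x * x for x in partition)


def run_driver(data, num_partitions=4, executor_threads=2):
    partitions = make_partitions(data, num_partitions)
    # Stand-in for acquiring an executor; it stays up for the whole run.
    with ThreadPoolExecutor(max_workers=executor_threads) as executor:
        # One task per partition, run on the executor's threads.
        partial_results = list(executor.map(sum_of_squares, partitions))
    # The driver combines the partial results into the final answer.
    return sum(partial_results)


total = run_driver(list(range(1, 11)))
print(total)  # 385
```

In a real cluster the executors are separate processes on worker nodes, so the task code and the partial results travel over the network instead of staying inside one Python process.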

A few special features of a Spark application architecture are:

  1. Each application gets its own executor processes. They stay up for the duration of the application and run tasks in multiple threads. This isolates Spark applications from one another, but it also means that data cannot be shared between different Spark applications without writing it to external storage.
  2. Spark can work with any cluster manager, as long as it can acquire executor processes through it and those processes can communicate with each other.
  3. The driver program must be reachable over the network from the worker nodes.
  4. Keeping the driver program and the worker nodes close to each other (preferably on the same LAN) reduces the latency of task scheduling.

What data storage does Apache Spark support?

Apache Spark doesn't have its own data storage layer. Instead, it supports several storage systems, such as Hadoop Distributed File System (HDFS), HBase, Cassandra, Apache Hive, Amazon S3, and others. It can also be extended with custom data backends.

Which cluster managers does Apache Spark support?

Apache Spark comes with its own standalone cluster manager, which is good for small deployments. It can also use Hadoop YARN or Apache Mesos as its cluster manager.
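
In practice, the cluster manager is chosen through the master URL given to Spark when an application starts. The helper below is a small sketch that maps the master URL formats for the cluster managers named above to a readable label. `cluster_manager_for` is a hypothetical function written for this article, not part of Spark, and the host names are placeholders.

```python
def cluster_manager_for(master_url):
    """Return the cluster manager implied by a Spark master URL."""
    if master_url == "yarn":
        return "Hadoop YARN"
    if master_url.startswith("spark://"):
        return "Spark standalone"
    if master_url.startswith("mesos://"):
        return "Apache Mesos"
    if master_url == "local" or master_url.startswith("local["):
        return "local mode (no cluster manager)"
    raise ValueError(f"unrecognized master URL: {master_url}")


# Placeholder hosts; 7077 and 5050 are the usual default ports.
for url in ["local[*]", "spark://master-host:7077", "yarn", "mesos://mesos-host:5050"]:
    print(url, "->", cluster_manager_for(url))
```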

Why is Apache Hadoop bundled with the Apache Spark distribution?

Apache Spark supports the Hadoop YARN and Apache Mesos cluster managers, and it uses Hadoop client libraries for HDFS and YARN, which is why Spark downloads come pre-packaged with a Hadoop version. You can also download a Spark build without the Hadoop bundle and point it at an existing Hadoop installation, for example one on the same machine as Apache Spark.
