Back to blog

Who needs hadoop when you got spark kid

Introduction

There has been a lot of debate between which is better a platform for Big Data Hadoop or Spark. Both are completely different platforms which were made for different purposes. I hope after reading this post you can understand the technology behind Hadoop and Spark and identify which platform better suits your company’s and clients’ needs.


What is Hadoop?

The first think you need to realize is that Hadoop is NOT a database.

Hadoop is not going to replace traditional database and data warehouse architectures. Hadoop is a distributed fault tolerant file sharing ecosystem.

What Hadoop offers is parallel processing which makes it convenient for you to run many jobs in parallel. But keep in mind all these jobs have to be run on disk. Hadoop uses MapReduce a computational model which breaks down and aggregates tasks across your database which speeds up the DB performance.

HDFS or Hadoop File System is used to reduce redundancy and include parallelism which might not exist in traditional RBMS. Traditional databases have ACID properties (Atomicity, Consistency, Isolation and Durability) something absent in Hadoop. So, every time you have to run a job you will have to code that manually in MapReduce. Ouch!

Is Hadoop faster in processing jobs than RBMS? Yes!

But,

Hadoop is also not fit for real-time data applications; its not fast in joining multiple data sets, has problems in scalability (operated by single master) and doesn’t have storage or network level encryption. With Hadoop you will always have to use a database systems like HBase, Impala or Cassandra and for analytics something like Hive on top.

Hive is primarily used on top of Hadoop for SQL abstractions of MapReduce as writing jobs in Java for MapReduce is hectic and time consuming.


What is Spark?

Spark is an engine for large scale data processing with a focus on data science and data analytics but has evolved for real-time streaming as well.

Compatible with major platforms

Why Use Spark?

  1. Spark uses a DAG (directed acyclic graph) engine which is 100x in-memory faster than MapReduce and 10x in-disk faster than Hadoop
  2. Spark is platform agnostic so it can work on top of Hadoop, Cassandra, Mesos, Hbase and is compatible with languages like SQL, Python, Java, Scala and R
  3. Spark is an all-in-one platform with; Spark SQL, Spark Streaming, MLib, GraphX and SparkR

Spark Architecture

Spark Core Engine and the apps on top of it

Spark Core Engine is the foundation block of the Spark Architecture and is responsible for task distribution, scheduling and input/output.

Spark introduced RDD (Resilient Distributed Dataset) which is an immutable fault-tolerant, distributed collection of objects that can be operated on in parallel an evolution on Hadoop’s MapReduce.

RDDs support two types of operations:

Transformation in Spark are lazy* *because they do not compute the result instantly, they just remember the operation performed on the data set.

Actions has to be used to compute the results.

Thus the transformations and actions technique makes Spark faster and more efficient in its computation.

Spark SQL supports querying data via SQL but before you can use SQL you need to develop the relevant table structures or data frames (pandas in python).

Spark Streaming supports real time processing of streaming data and converts the input data stream into batches. Spark Core generates the final stream of the processed data. These micro batches are supported by the lambda architecture* (it aggregates old data on top of new one).*

MLib is the machine learning library of Spark which supports all leading algorithms like classification, regression, clustering and collaborative filtering.

GraphX is a library for manipulating graphs and performing graph-parallel operations based on data frames. It provides a useful tool for ETL, exploratory analysis and iterative graph computations.

SparkR connects R language to the Spark framework, so you can use all the components of the R framework.


Why Spark is better?

Before 2013, Hadoop was the king of data processing but today Spark because of its fast in-memory processing is used by companies like Oracle, Hortonworks, Cisco, Verizon, Visa, Microsoft, Databricks and Amazon.

The advantages of Spark is the seamless integration of existing frameworks and a fusion of machine learning, business intelligence and analytics with real-time processing.

A poll from Apache Foundation asked companies why they used Spark

Answer:

So if you are a data engineer, data scientist or application developer and want to leverage Big Data Spark is the perfect platform for you.


Use Cases of Spark

  1. IoT applications in connected devices and sensors
  2. Predictive Analytics for real-time analytics and probabilistic forecasting of customers, products and partners
  3. Anomaly Detection for real-time anomalies in financial fraud, structural defects, potential medical conditions
  4. Personalization for delivering a custom-tailored experience in real-time for your customers

Conclusion

Spark is a platform which improved upon Hadoop and was made especially keeping in mind the growing demands of real-time Big Data.

However, you will still find Hadoop in a lot of companies who invested heavily in distributed computing before 2013 but the good thing is you can run Spark on top of Hadoop and get the best of both worlds.

Is Spark better than Hadoop, I think so. The ease of use, speed, multi-language support, multi-framework capacity and diverse use cases makes it a winner for me.

Winner: Spark!

Back to blog