
Sunday 11 September 2016

What is Hadoop? How is it different from other techniques?

Hey guys !!! Today we will see why Hadoop is popular and how it is different from other regular programming techniques.

Before the start of this article, we will see what the cons of the existing RDBMS (Relational Database Management System) approach are.

To start with, Hadoop is a technology built to cope with all the characteristics of Big Data; those are the 3 V's.

Volume    - Huge amounts of data coming from various sources like apps, sensors, social media...
Velocity  - Huge amounts of data are being produced every second, e.g. from sensors...
Variety   - Structured (data in table format with columns and rows), Semi-structured (XML, JSON), Unstructured (email contents, web pages, multimedia)

There are 8 V's in total, which we will discuss later. Hadoop handles all of these without much bother.

Coming to RDBMS,
It is best suited only for structured data. (According to surveys, at present roughly 20% of data is structured and the remaining 80% is unstructured.)
Cost (do you know how much a highly reliable server costs?).
No horizontal scalability (relational databases are designed to run on a single server in order to maintain the integrity of the table mappings, so you just can't add another server to allow more data).

What Hadoop provides,

(Hadoop Cluster - a group of commodity servers (normal machines) combined into one unit.)

Distributed Batch Processing - Hadoop works in distributed mode, which is best suited to handle Big Data.
How earlier techniques work - Say there is a file of size 1 GB and you write a program to process it. First you fetch the file from the database/server to the machine your program runs on (two different machines on the network), which requires network bandwidth, and then you start processing the file from top to bottom.
How Hadoop works - First, Hadoop breaks the 1 GB file into small pieces and stores them on different machines in your cluster; then it copies your program to all the machines where the pieces of the file are stored and runs the copies in parallel. So there is no need for high network bandwidth (see the word-count sketch after this list).
Fault Tolerant - If any server in the Hadoop cluster fails due to a hardware problem, it no longer stops our job, because each piece of the file is stored on more than one machine (see the replication sketch after this list).
Scalability - If in the middle you come to know that your cluster is out of storage, you can simply add another commodity server to the cluster.
Open source - Free Free Free ;)
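
To make "copy the program to the data" concrete, below is the classic word-count job written against Hadoop's Java MapReduce API, as a minimal sketch: the map function runs in parallel on every machine that holds a piece (block) of the input file, and the reduce function adds up the partial counts. The class names and the input/output paths taken from the command line are just placeholders for this example.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: runs on each machine that stores a piece of the input file,
  // emitting (word, 1) for every word it sees in its piece.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: combines the partial results coming from all the mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input file in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // results folder
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Notice that main never reads the 1 GB file itself; it only describes the job, and Hadoop ships the mapper to wherever the pieces of the file already live.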
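
For the fault-tolerance point, HDFS keeps more than one copy (a replica, 3 by default via the dfs.replication setting) of every piece of a file on different machines, so losing one server only means the work is redone where another copy lives. Here is a small sketch, assuming a file path is passed on the command line, that asks HDFS how many copies of a file exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path(args[0]); // path of a file already in HDFS
    FileStatus status = fs.getFileStatus(file);
    System.out.println(file + " is stored " + status.getReplication() + " times");

    // The number of copies can also be changed per file.
    fs.setReplication(file, (short) 3);
  }
}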

We will see basic UNIX commands & HDFS in the next topics.

