YARN: Yet Another Resource Negotiator - a cluster management technology
PSB.
What is this place? Bank lockers? Shelves? No, mate! This is a FACEBOOK data center. A good explanation of the FB data center is here.
If you look closely, there are many racks, and each rack holds commodity servers. These machines are connected to each other over the network, and similarly all the racks are connected to each other over the network; we call the whole thing a Cluster. In other words, a Cluster is a collection of Racks, and each Rack contains commodity servers.
Rack Awareness: the network bandwidth between two nodes in the same rack is greater than the bandwidth between two nodes on different racks.
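To make rack awareness concrete, here is a small sketch (not the actual Hadoop API) of how Hadoop-style network topology measures the "distance" between two nodes as hops through the network tree: same machine = 0, same rack = 2, different racks = 4. The node names and tuple representation are made up for illustration.

```python
# Toy sketch of Hadoop-style rack awareness (not real Hadoop code).
# Distance is measured in network hops through the topology tree:
# same node = 0, same rack = 2, different racks = 4.

def network_distance(node_a, node_b):
    """Each node is a (rack, host) tuple; returns hops between them."""
    rack_a, host_a = node_a
    rack_b, host_b = node_b
    if node_a == node_b:
        return 0   # same machine: no network hop at all
    if rack_a == rack_b:
        return 2   # up to the rack switch and back down
    return 4       # up to the core switch, across, and back down

n1 = ("rack1", "server1")
n2 = ("rack1", "server2")
n3 = ("rack2", "server1")
print(network_distance(n1, n1))  # 0
print(network_distance(n1, n2))  # 2
print(network_distance(n1, n3))  # 4
```

This is why Hadoop prefers to schedule work on the same node as the data, then the same rack, and only then a remote rack.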
So YARN takes care of all these resources: memory, disk, network, etc.
In Hadoop 1.x there was no separate cluster-management concept; it was managed by MapReduce alone. But MapReduce is slow (we discuss this more in the next post), it is Java-specific, and it was the only way to communicate with HDFS (a bottleneck situation). To avoid this situation and to let other programming techniques communicate with HDFS, YARN was introduced. The fundamental idea of YARN is to split the functionalities of resource management and job scheduling/monitoring into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
YARN has mainly 3 daemons:
1. Resource Manager
–Manages resources for the cluster
•Memory, disk, network, etc.
–Instructs Node Managers to allocate resources
–Applications negotiate for resources with the Resource Manager
–There is only one instance of the Resource Manager
2. Node Manager
–Manages the resources of a single node
–There is one instance per node in the cluster
3. Job History Server
–Archives jobs’ metrics and metadata
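The three daemons above can be sketched as a tiny toy model (this is not real YARN code; the class names mirror the daemons and the numbers are invented): many Node Managers report their node's resources to the single Resource Manager via heartbeats, and the Job History Server just archives finished jobs' metadata.

```python
# Toy model of the three YARN daemons' roles (not real YARN code).

class NodeManager:
    """Manages the resources of a single node; one per node."""
    def __init__(self, host, memory_mb, vcores):
        self.host, self.memory_mb, self.vcores = host, memory_mb, vcores

    def heartbeat(self):
        # In real YARN the heartbeat also carries container statuses.
        return {"host": self.host, "memory_mb": self.memory_mb,
                "vcores": self.vcores}

class ResourceManager:
    """Tracks every node's resources; only one instance per cluster."""
    def __init__(self):
        self.nodes = {}

    def receive_heartbeat(self, report):
        self.nodes[report["host"]] = report

    def cluster_memory_mb(self):
        return sum(n["memory_mb"] for n in self.nodes.values())

class JobHistoryServer:
    """Archives finished jobs' metrics and metadata."""
    def __init__(self):
        self.archive = []

    def record(self, job_id, status):
        self.archive.append({"job_id": job_id, "status": status})

rm = ResourceManager()
for nm in (NodeManager("node1", 8192, 4), NodeManager("node2", 8192, 4)):
    rm.receive_heartbeat(nm.heartbeat())
print(rm.cluster_memory_mb())  # 16384
```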
How does it work?
When a client submits a request, it first contacts the Resource Manager, which fetches the metadata of the DataNodes from the NameNode. Each DataNode has its own Node Manager, which sends periodic signals (heartbeats) to the Resource Manager about its readiness to run tasks. The input file is then broken down into a number of input splits (we discuss input splits more in the Map Reduce post; until then, think of a split as just a piece of the file), ready to be stored on the DataNodes. This whole process is a Job. For each job the Resource Manager assigns a Job Id, and the Job History Server maintains information about all jobs.
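How does a file break into input splits? A rough sketch, assuming the common default where the split size follows the HDFS block size (128 MB in Hadoop 2.x; the exact size is configurable, so treat the numbers as illustrative):

```python
# Hedged sketch of input splitting: by default the split size follows
# the HDFS block size, and the last split holds whatever remains.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default

def input_splits(file_size, split_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the whole file."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# A 300 MB file gives three splits: 128 MB, 128 MB, and 44 MB.
mb = 1024 * 1024
for offset, length in input_splits(300 * mb):
    print(offset // mb, length // mb)
```

Each split then becomes the unit of work that one container processes.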
(We discussed that there is only one Resource Manager per cluster, so to make the Resource Manager more robust a concept called the App Master was introduced. Instead of the Resource Manager, the App Master takes care of the job: the Resource Manager only launches one App Master per job, and the App Master takes care of everything else.)
So once the input splits are ready, the Resource Manager launches one App Master on some DataNode (the Resource Manager decides which one; we have no control over it). Once the App Master launches successfully, it registers with the Resource Manager. The App Master then negotiates with the Resource Manager for resources (DataNodes) and gives a launch specification to the Node Managers, and the Node Managers launch containers (JVMs) on the DataNodes and start running the process.
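The lifecycle just described can be walked through as a toy event trace (this is not the real YARN protocol, just the sequence from the paragraph above; the job id and container count are invented):

```python
# Toy walk-through of the App Master lifecycle described above
# (not the real YARN protocol, just the sequence of events).

def run_job(job_id, num_containers):
    events = []
    events.append(f"RM: launch ApplicationMaster for {job_id}")
    events.append("AM: register with ResourceManager")
    events.append(f"AM: request {num_containers} containers")
    for i in range(num_containers):
        events.append(f"NM: launch container {i} (JVM) and run the task")
        events.append(f"AM: container {i} finished")
    events.append("AM: report job complete to ResourceManager")
    events.append(f"RM: de-register and stop AM for {job_id}")
    return events

for line in run_job("job_001", 2):
    print(line)
```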
What just happened?
As discussed, one of Hadoop's features is that the program itself goes to the data's location for processing. Here our program is copied into the container, and the container launches a JVM to run the program.
Once the containers finish, the App Master is notified, and it in turn notifies the Resource Manager. The Resource Manager then de-registers the App Master and kills it.
So with this we have covered the base concepts of Hadoop.
If you look at the above image, HDFS is for distributed storage and YARN is for cluster management. So, as discussed, programming languages other than Java can also communicate with HDFS with the help of YARN.