Hadoop Performance Tuning Recommendations

The checklists and recommendations in this section will help you prepare for and apply MapReduce performance tuning.
The following is the checklist of memory recommendations:
• Adjust memory settings to avoid a job hanging due to insufficient memory (a configuration sketch follows this list)
• Define a JVM reuse policy
• Verify the JVM code cache and increase it if necessary
• Analyze garbage collector (GC) cycles using detailed GC logs; intensive cycles indicate that a large number of object instances are being created in memory. Also check the Hadoop framework's heap usage
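
As a starting point, the following minimal sketch shows how these memory settings could be applied using the classic Hadoop 1.x (mapred.*) property names; the heap size, GC flags, and class name are illustrative assumptions to adapt to your cluster and Hadoop version:

import org.apache.hadoop.conf.Configuration;

public class MemoryTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Give each child JVM enough heap to avoid hangs from insufficient
        // memory, and emit detailed GC logs for cycle analysis (Java 8 flags;
        // values here are illustrative, not prescriptive).
        conf.set("mapred.child.java.opts",
                 "-Xmx1024m -verbose:gc -XX:+PrintGCDetails");

        // Reuse each JVM for several tasks to amortize startup cost
        // (1, the default, disables reuse; -1 reuses without limit).
        conf.setInt("mapred.job.reuse.jvm.num.tasks", 5);
    }
}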

The following are the massive I/O tuning recommendations to ensure that there are no setbacks due to I/O operations:
• In the context of large input data, compress the source data to avoid/reduce massive I/O
• When map tasks produce a large number of spilled records, reduce spilling by tuning io.sort.mb, io.sort.record.percent, and io.sort.spill.percent (see the configuration sketch after this list)
• Compress the map output to minimize disk I/O operations

• Implement a Combiner to minimize massive I/O and network traffic. Add a Combiner with the following line of code:

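// Reuse the reducer as the Combiner; valid only when the reduce function is commutative and associative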
job.setCombinerClass(Reduce.class);

• Compress the MapReduce job output to minimize the effect of large output data. The relevant parameters are mapred.output.compress, mapred.output.compression.type, and mapred.output.compression.codec (map output compression is controlled separately by mapred.compress.map.output)
• Lower the replication parameter (dfs.replication) value to minimize network traffic and massive I/O disk operations
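
The following configuration sketch pulls these I/O knobs together, again using the classic mapred.* property names; the buffer sizes, codec choice, and replication value are illustrative assumptions (on Hadoop 2+/YARN most of these properties have mapreduce.* equivalents):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class IoTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // A larger map-side sort buffer and a higher spill threshold
        // reduce the number of spilled records.
        conf.setInt("io.sort.mb", 256);                 // default is 100 (MB)
        conf.setFloat("io.sort.spill.percent", 0.90f);  // default is 0.80

        // Compress intermediate map output to cut disk and shuffle I/O.
        conf.setBoolean("mapred.compress.map.output", true);
        conf.setClass("mapred.map.output.compression.codec",
                      SnappyCodec.class, CompressionCodec.class);

        // Compress the final job output to soften large-output effects.
        conf.setBoolean("mapred.output.compress", true);
        conf.set("mapred.output.compression.type",
                 SequenceFile.CompressionType.BLOCK.toString());

        // Fewer output replicas mean less network and disk traffic.
        conf.setInt("dfs.replication", 2);
    }
}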

The minimal Hadoop configuration checklist to validate hardware resources is as follows:
• Define the Hadoop ecosystem components that are required to be installed (and maintained)
• Define how you are going to install Hadoop: manually or with an automated deployment tool (such as Puppet or Yum)
• Choose the underlying core storage such as HDFS, HBase, and so on
• Check whether additional components are required for orchestration, job scheduling, and so on
• Check on third-party software dependencies such as JVM version
• Check the key parameter configuration of Hadoop, such as HDFS block size, replication factor, and compression
• Define the monitoring policy: what should be monitored and with which tool (for example, Ganglia)
• Install a monitoring tool, such as Nagios or Ganglia, to monitor your Hadoop cluster resources
• Identify (calculate) the amount of required disk space to store the job data
• Identify (calculate) the number of required nodes to perform the job
• Check whether NameNodes and DataNodes have the required minimal hardware resources, such as amount of RAM, number of CPUs, and network bandwidth
• Calculate the number of mapper and reducer tasks required to maximize CPU usage (a back-of-the-envelope sketch follows this list)
• Check the number of MapReduce tasks to ensure that sufficient tasks are running
• Avoid using virtual servers for the production environment; use them only for your MapReduce application development
• Eliminate map-side spills and reduce-side disk I/O
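
To make the task-count calculations concrete, here is a back-of-the-envelope sketch; it assumes the common rules of thumb of one map task per HDFS block and roughly 0.95 (or 1.75) x nodes x reduce slots per node for reducers, and all figures are illustrative:

public class TaskCountEstimate {
    public static void main(String[] args) {
        // Illustrative cluster figures -- replace with your own.
        long inputSizeBytes = 1024L * 1024 * 1024 * 1024; // 1 TB of input
        long blockSizeBytes = 128L * 1024 * 1024;         // 128 MB HDFS blocks
        int dataNodes = 20;
        int reduceSlotsPerNode = 2; // mapred.tasktracker.reduce.tasks.maximum

        // One map task per HDFS block is the usual starting point.
        long mapTasks = (inputSizeBytes + blockSizeBytes - 1) / blockSizeBytes;

        // The 0.95 factor lets every reducer run in a single wave;
        // 1.75 trades a second wave for better load balancing.
        long reduceTasks = Math.round(0.95 * dataNodes * reduceSlotsPerNode);

        System.out.println("Estimated map tasks:    " + mapTasks);    // 8192
        System.out.println("Estimated reduce tasks: " + reduceTasks); // 38
    }
}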