Hadoop Performance Tuning Recommendations
The checklists and recommendations in this section will help you prepare for and carry out MapReduce performance tuning.
The following checklist covers memory recommendations:
• Adjust memory settings to avoid a job hanging due to insufficient memory
• Set or define a JVM reuse policy
• Verify the JVM code cache and increase it if necessary
• Analyze garbage collector (GC) cycles using detailed GC logs; frequent or intensive cycles indicate that a large number of object instances are being created in memory. Also check the Hadoop framework's heap usage (a configuration sketch for the JVM items follows this list)
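As an illustration of the JVM-related items above, the following is a minimal sketch assuming the classic MRv1 parameter names (mapred.job.reuse.jvm.num.tasks and mapred.child.java.opts); the values shown are examples only, not recommendations:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Reuse each task JVM for an unlimited number of tasks (-1); the default is 1
conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
// Example 1 GB child heap plus detailed GC logging for cycle analysis
conf.set("mapred.child.java.opts", "-Xmx1024m -verbose:gc -XX:+PrintGCDetails");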
The following recommendations help you tune heavy I/O so that jobs suffer no setbacks due to I/O operations:
• For large input data, compress the source data to avoid or reduce heavy I/O
• Reduce the records spilled by map tasks when you observe a large number of spilled records
• Reduce spilled records by tuning io.sort.mb, io.sort.record.percent, and io.sort.spill.percent (see the configuration sketch after this list)
• Compress the map output to minimize I/O disk operations
• Implement a Combiner to minimize massive I/O and network traffic. Add a Combiner with the following line of code:
job.setCombinerClass(Reduce.class);
• Compress the MapReduce job output to minimize the effect of large output data. The relevant parameters are mapred.output.compress and mapred.output.compression.type
• Change the replication factor (the dfs.replication parameter) to minimize network traffic and massive disk I/O operations
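To bring the I/O items together, the following is a minimal configuration sketch, again assuming the classic MRv1 parameter names; every value is an illustrative starting point rather than a recommendation:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Larger sort buffer so map outputs spill to disk less often (default is 100 MB)
conf.set("io.sort.mb", "256");
// Fraction of io.sort.mb reserved for record-boundary metadata (default is 0.05)
conf.set("io.sort.record.percent", "0.15");
// Buffer fill level at which a background spill to disk starts (default is 0.80)
conf.set("io.sort.spill.percent", "0.90");
// Compress intermediate map output to cut disk and network I/O
conf.set("mapred.compress.map.output", "true");
// Compress the final job output; BLOCK compression applies to SequenceFile output
conf.set("mapred.output.compress", "true");
conf.set("mapred.output.compression.type", "BLOCK");
// Lower replication for re-creatable output to reduce write traffic (assumption)
conf.set("dfs.replication", "2");

Compressing the map output trades a small amount of CPU time for considerably less disk and network traffic, which is usually a good exchange for I/O-bound jobs.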
The following minimal Hadoop configuration checklist helps you validate the cluster's hardware resources:
• Define the Hadoop ecosystem components that are required to be installed (and maintained)
• Define how you are going to install Hadoop, manually or using an automated deployment tool (such as Puppet/Yum)
• Choose the underlying core storage such as HDFS, HBase, and so on
• Check whether additional components are required for orchestration, job scheduling, and so on
• Check on third-party software dependencies such as JVM version
• Check the key parameter configuration of Hadoop, such as HDFS block size, replication factor, and compression
• Define the monitoring policy: what should be monitored and with which tool (for example, Ganglia)
• Install a monitoring tool, such as Nagios or Ganglia, to monitor your Hadoop cluster resources
• Identify (calculate) the amount of disk space required to store the job data
• Identify (calculate) the number of nodes required to perform the job (see the sizing sketch after this list)
• Check whether NameNodes and DataNodes have the required minimal hardware resources, such as amount of RAM, number of CPUs, and network bandwidth
• Calculate the number of mapper and reducer tasks required to maximize CPU usage
• Check the number of MapReduce tasks to ensure that sufficient tasks are running
• Avoid using virtual servers in the production environment; use them only for MapReduce application development
• Eliminate map-side spills and minimize reduce-side disk I/O
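To make the two sizing items above concrete, the following back-of-the-envelope sketch estimates the required storage and node count; the class name, the 25 percent temporary-space overhead, and all input values are hypothetical assumptions, not figures from this checklist:

public class ClusterSizing {
    public static void main(String[] args) {
        double inputTb = 10.0;            // job data to store, in TB (assumption)
        int replication = 3;              // HDFS replication factor (dfs.replication)
        double tempOverhead = 0.25;       // headroom for intermediate/temporary data (assumption)
        double usableDiskPerNodeTb = 8.0; // usable disk per DataNode (assumption)

        // Required raw storage: data multiplied by replication, plus temporary headroom
        double requiredTb = inputTb * replication * (1 + tempOverhead);
        // DataNodes needed to hold that storage, rounded up
        int dataNodes = (int) Math.ceil(requiredTb / usableDiskPerNodeTb);

        System.out.printf("Required storage: %.1f TB, DataNodes: %d%n", requiredTb, dataNodes);
    }
}

With these example inputs, the sketch yields 37.5 TB of required storage and 5 DataNodes.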