Best Practices for Apache Hive Performance Tuning


If you’re wondering how to scale Apache Hive, here are the best ways to make the most of Hive performance.


1. Partitioning Tables :
                                    
Hive partitioning is an effective method to improve the query performance on larger tables.

2. Compress map/reduce output :

By enabling compression at various Levels we can achieve the performance improvement in Hive Queries.                                               

     Inside Hive-site.xml Add these Properties

             <property>
                  <name>mapred.compress.map.output</name>
                  <value>true</value>
             </property>

             <property>
                  <name>mapred.output.compress</name>
                  <value>true</value>
            </property>


3. Map join :
           
We can improve the performance of joins by enabling Auto Convert Map Joins & enabling optimization of skew joins.

When using this parameter, be sure the auto convert is enabled in the Hive environment.


     Inside Hive-site.xml Add these Properties 


             <property>
                  <name>hive.auto.convert.join</name>
                  <value>true</value>
             </property>


4. De-normalizing data:

Normalization is a standard process used to model your data tables with certain rules to take care with redundancy of data and anomalies. In simpler words, if you normalize your data sets, you end up creating multiple relational tables which can be joined at the run time to produce the results. Joins are expensive and difficult operations to perform and are one of the common reasons for performance issues . Because of that, it’s a good idea to avoid highly normalized table structures because they require join queries to derive the desired metrics.


5. Sampling:

Sampling allows users to take a subset of dataset and analyze it, without having to analyze the entire data set. If a representative sample is used, then a query can return meaningful results as well as finish quicker and consume fewer compute resources.


6.Use ORC File Format :

Using ORC (Optimized Record Columnar) file format we can improve the performance of Hive Queries very effectively.

Columnar formats allow you to reduce the read operations in analytics queries by allowing each column to be accessed individually. There are some other binary formats like Avro, sequence files, Thrift and ProtoBuf, which can be helpful in various use cases too.


7. Enable Parallel Execution :

Hadoop can execute MapReduce jobs in parallel, and several queries executed on Hive automatically use this parallelism.
A particular job may consist of some stages that are not dependent on each other and could be executed in parallel, possibly allowing the overall job to complete more quickly. Parallel execution can be enabled by setting below properties.


Inside Hive-site.xml Add these Properties 


             <property>
                  <name>hive.exec.parallel</name>
                  <value>true</value>
                                  <description>Whether to execute jobs in parallel</description>
             </property>
           
             <property>
                  <name>hive.exec.parallel.thread.number</name>
                  <value>8</value>
                                  <description>How many jobs at most can be executed in parallel</description>
             </property>


8. Bucketing :

Bucketing improves the join performance if the bucket key and join keys are common.Bucketing in Hive distributes the data in different buckets based on the hash results on the bucket key. It also reduces the I/O scans during the join process if the process is happening on the same keys (columns).

SET hive.exec.dynamic.partition =true;
SET hive.exec.dynamic.partition.mode = nonstrict;       

# This command changes the setting for single session

# for permanent change in setting add it to " HIVE-SITE.XML" file

Inside Hive-site.xml Add these Properties 


             <property>
                  <name>hive.exec.dynamic.partition</name>
                  <value>true</value>
             </property>


             <property>
                  <name>hive.exec.dynamic.partition.mode</name>
                  <value>nonstrict</value>
             </property>


9. Enable Tez Execution Engine :

Instead of running Hive queries on venerable Map-reduce engine, we can improve the performance of hive queries at least by 100% to 300 % by running on Tez execution engine. We can enable the Tez engine with below property from hive shell.


Inside Hive-site.xml Add these Properties 


             <property>
                  <name>hive.execution.engine</name>
                  <value>tez</value>
             </property>


10. Enable Cost Based Optimization :

one can achieve further optimizations based on query cost, resulting in potentially different decisions: how to order joins, which type of join to perform, degree of parallelism and others. cost based optimization can be enabled by setting below properties



Inside Hive-site.xml Add these Properties 


             <property>
                  <name>hive.cbo.enable</name>
                  <value>true</value>
             </property>







Thanks,
Have Fun!!!!

Please Leave your Comment Below If you Have Any Clarifications Regarding Installations or Any Suggestions.








Comments

Post a Comment

Popular posts from this blog

Quick Setup Steps For Installing & Integrating Spark,Hive,Hadoop Processess Together !!!

Twitter Analysis Using R

Performance Improvement in Hive (Properties in HIVE -SITE.XML)

Installation Steps for Hive

Installation Steps For MySql Server (version 5.7.23) For Centos

Facebook Data Analysis Using R

Installation Steps For Spark