Best Practices for Apache Hive Performance Tuning
If
you’re wondering how to scale Apache Hive, here are the best ways to make the most of Hive performance.
1. Partitioning Tables :
2. Compress map/reduce output :
Inside Hive-site.xml Add these Properties
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
3. Map join :
We can improve the performance of joins by enabling Auto Convert Map Joins & enabling optimization of skew joins.
When using this parameter, be sure the auto convert is enabled in the Hive environment.
4. De-normalizing data:
Normalization is a standard process used to model your data tables with certain rules to take care with redundancy of data and anomalies. In simpler words, if you normalize your data sets, you end up creating multiple relational tables which can be joined at the run time to produce the results. Joins are expensive and difficult operations to perform and are one of the common reasons for performance issues . Because of that, it’s a good idea to avoid highly normalized table structures because they require join queries to derive the desired metrics.
5. Sampling:
Sampling allows users to take a subset of dataset and analyze it, without having to analyze the entire data set. If a representative sample is used, then a query can return meaningful results as well as finish quicker and consume fewer compute resources.
6.Use ORC File Format :
Using ORC (Optimized Record Columnar) file format we can improve the performance of Hive Queries very effectively.
Columnar formats allow you to reduce the read operations in analytics queries by allowing each column to be accessed individually. There are some other binary formats like Avro, sequence files, Thrift and ProtoBuf, which can be helpful in various use cases too.
7. Enable Parallel Execution :
Hadoop can execute MapReduce jobs in parallel, and several queries executed on Hive automatically use this parallelism.
A particular job may consist of some stages that are not dependent on each other and could be executed in parallel, possibly allowing the overall job to complete more quickly. Parallel execution can be enabled by setting below properties.
<property>
<name>hive.exec.parallel</name>
<value>true</value>
<description>Whether to execute jobs in parallel</description>
</property>
<property>
<name>hive.exec.parallel.thread.number</name>
<value>8</value>
<description>How many jobs at most can be executed in parallel</description>
</property>
8. Bucketing :
Bucketing improves the join performance if the bucket key and join keys are common.Bucketing in Hive distributes the data in different buckets based on the hash results on the bucket key. It also reduces the I/O scans during the join process if the process is happening on the same keys (columns).
SET hive.exec.dynamic.partition =true;
SET hive.exec.dynamic.partition.mode = nonstrict;
# This command changes the setting for single session
# for permanent change in setting add it to " HIVE-SITE.XML" file
<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
9. Enable Tez Execution Engine :
Instead of running Hive queries on venerable Map-reduce engine, we can improve the performance of hive queries at least by 100% to 300 % by running on Tez execution engine. We can enable the Tez engine with below property from hive shell.
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
10. Enable Cost Based Optimization :
one can achieve further optimizations based on query cost, resulting in potentially different decisions: how to order joins, which type of join to perform, degree of parallelism and others. cost based optimization can be enabled by setting below properties
<property>
<name>hive.cbo.enable</name>
<value>true</value>
</property>
1. Partitioning Tables :
Hive partitioning is an effective method to improve the query performance on larger tables.
2. Compress map/reduce output :
By enabling compression at various Levels we can achieve the performance improvement in Hive Queries.
Inside Hive-site.xml Add these Properties
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>
3. Map join :
We can improve the performance of joins by enabling Auto Convert Map Joins & enabling optimization of skew joins.
When using this parameter, be sure the auto convert is enabled in the Hive environment.
Inside Hive-site.xml Add these Properties
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
</property>
4. De-normalizing data:
Normalization is a standard process used to model your data tables with certain rules to take care with redundancy of data and anomalies. In simpler words, if you normalize your data sets, you end up creating multiple relational tables which can be joined at the run time to produce the results. Joins are expensive and difficult operations to perform and are one of the common reasons for performance issues . Because of that, it’s a good idea to avoid highly normalized table structures because they require join queries to derive the desired metrics.
5. Sampling:
Sampling allows users to take a subset of dataset and analyze it, without having to analyze the entire data set. If a representative sample is used, then a query can return meaningful results as well as finish quicker and consume fewer compute resources.
6.Use ORC File Format :
Using ORC (Optimized Record Columnar) file format we can improve the performance of Hive Queries very effectively.
Columnar formats allow you to reduce the read operations in analytics queries by allowing each column to be accessed individually. There are some other binary formats like Avro, sequence files, Thrift and ProtoBuf, which can be helpful in various use cases too.
7. Enable Parallel Execution :
Hadoop can execute MapReduce jobs in parallel, and several queries executed on Hive automatically use this parallelism.
A particular job may consist of some stages that are not dependent on each other and could be executed in parallel, possibly allowing the overall job to complete more quickly. Parallel execution can be enabled by setting below properties.
Inside Hive-site.xml Add these Properties
<property>
<name>hive.exec.parallel</name>
<value>true</value>
<description>Whether to execute jobs in parallel</description>
</property>
<property>
<name>hive.exec.parallel.thread.number</name>
<value>8</value>
<description>How many jobs at most can be executed in parallel</description>
</property>
8. Bucketing :
Bucketing improves the join performance if the bucket key and join keys are common.Bucketing in Hive distributes the data in different buckets based on the hash results on the bucket key. It also reduces the I/O scans during the join process if the process is happening on the same keys (columns).
SET hive.exec.dynamic.partition =true;
SET hive.exec.dynamic.partition.mode = nonstrict;
# This command changes the setting for single session
# for permanent change in setting add it to " HIVE-SITE.XML" file
Inside Hive-site.xml Add these Properties
<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
9. Enable Tez Execution Engine :
Instead of running Hive queries on venerable Map-reduce engine, we can improve the performance of hive queries at least by 100% to 300 % by running on Tez execution engine. We can enable the Tez engine with below property from hive shell.
Inside Hive-site.xml Add these Properties
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
10. Enable Cost Based Optimization :
one can achieve further optimizations based on query cost, resulting in potentially different decisions: how to order joins, which type of join to perform, degree of parallelism and others. cost based optimization can be enabled by setting below properties
Inside Hive-site.xml Add these Properties
<property>
<name>hive.cbo.enable</name>
<value>true</value>
</property>
Thanks,
Have Fun!!!!
Please Leave your Comment Below If you Have Any Clarifications Regarding Installations or Any Suggestions.
Have Fun!!!!
Please Leave your Comment Below If you Have Any Clarifications Regarding Installations or Any Suggestions.
This comment has been removed by the author.
ReplyDelete