Introduction

Statistics such as the number of rows of a table or partition and the histograms of a particular interesting column are important in many ways. One of the key use cases of statistics is query optimization. Statistics serve as the input to the cost functions of the optimizer so that it can compare different plans and choose among them.

Hive’s Cost-Based Optimizer (CBO) is a core component in Hive’s query processing engine. Powered by Apache Calcite, the CBO optimizes and calculates the cost of various plans for a query. It attempts to cut down execution time and reduce resources by examining table and conditions specified in the query.

Missing column statistics is one of the leading causes of sub-optimal query performance on Hadoop.

Environment Setup

We will use the same tables as in How to Process Data with Apache Hive tutorial (drivers and timesheet).

For the time being, we will create tables without computing statistics. Later, we will comparing plans generated with and without statists.

1. Download driver-data file.

2. Upload drivers.csv and timesheet.csv to /usr/maria_dev on the Hortonworks Sandbox environment using Ambari -> Files View:

setup-fileview

3. From Hive View 2.0, run the following SQL to create drivers table (without statistics).

setup-driver

4. From Hive View 2.0, run the following SQL to create timesheet table (without statistics).

setup-timesheet

NOTE: For newly created tables and/or partitions (that are populated through the INSERT OVERWRITE command), statistics are automatically computed by default. The user has to explicitly set the boolean variable hive.stats.autogather to false so that statistics are not automatically computed and stored into Hive MetaStore.

Visual Explain without Statistics

As you may recall, the following query will summarize total hours and miles driven by driver. Insert the following query in the Hive View 2.0 editor. Instead of executing the query, let’s click on Visual Explain.

explain-query

Visual Explain makes it easy to find and fix expensive operations by providing a high-level view of the query’s execution plan that focuses on the cost of each step. Expensive operations like Merge Joins are immediately visible, letting you know when settings changes or query re-writes are appropriate. When you need more detail, drill down to any operator of the plan to see detailed cost, vectorization and execution engine information.

explain-nostats

Compute Statistics

To generate statistics from Hive View 2.0 for a specific table, we need:

  1. select TABLES tab
  2. select Database
  3. select table
  4. select STATISTICS tab
  5. select include columns
  6. click on Recompute

compute-stats

Let’s compute statistics for table drivers:

stats-drivers

Let’s compute statistics for table timesheet:

stats-timesheet

Visual Explain with Statistics

Now that we’ve computed statistics, let’s re-run Visual Explain using the same query:

explain-query

Notice anything different?

explain-stats

Compare Plans

No statistics:

explain-stats

With statistics:
explain-stats

For this particular query and dataset, the optimizer did a great job generating a plan with and without statistics. As you can see, the number of rows at each node is more accurate when statistics were computed.

As your queries become more complex, you need to make a habit of computing statistics.

Summary

Congratulations in completing this tutorial. You should see the benefits of generating statistics for tables and columns.

Other Big data and Hortonworks articles on Orbifold Consulting can be found here.]