Apache Presto vs Apache Hive

Overview

Presto, Hive and Impala are analytic engines that provide a similar service: SQL on Hadoop. A fourth contender is SparkSQL, which runs on Spark (surprise) and thus has very different characteristics. However, there are fundamental differences in how these engines go about the task. They are also backed by different organizations, and there’s plenty of competition in the field.

A quick overview of the teams involved -

  • Apache Hive is supported by Hortonworks.
  • Apache Impala is supported by Cloudera.
  • Apache Presto was created by Facebook and is supported by Teradata and Starburst.

Apache Hive

What is Apache Hive

Apache Hive is a data warehouse built on Hadoop providing a “SQL-like” interface. Traditionally, SQL queries had to be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Apache Hive was built to provide the necessary SQL abstraction without the need to implement queries in the low-level Java API. The “SQL-like” language that Hive supports is called HiveQL.
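
To make this concrete, here is a sketch of the kind of HiveQL a user writes instead of a hand-coded MapReduce job (the table and column names are made up for illustration):

```sql
-- Hypothetical table of web traffic.
CREATE TABLE page_views (user_id BIGINT, url STRING, view_time TIMESTAMP);

-- Hive compiles this into MapReduce (or Tez/Spark) jobs behind the scenes;
-- no Java API code is needed.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```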

Apache Hive implements schema-on-read, and can transparently convert HiveQL queries to MapReduce, Tez or Spark jobs. All three of these engines can run in YARN.
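
The execution engine is chosen per session via the `hive.execution.engine` property; Hive rewrites the same HiveQL into jobs for whichever engine is configured:

```sql
SET hive.execution.engine=mr;    -- classic MapReduce
SET hive.execution.engine=tez;   -- Tez DAGs
SET hive.execution.engine=spark; -- Spark jobs
```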

Other features that Hive supports are

  • Indexes, including Bitmap indexes.
  • Different storage types - plain text, RCFile, HBase, ORC and others.
  • Metadata storage in an RDBMS.
  • Can operate on compressed data - DEFLATE, BWT, Snappy, etc.
  • User-defined functions. Hive also supports extending the UDF set to handle use-cases not supported by built-in functions.
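
Custom UDFs are registered from HiveQL once the implementation is packaged into a jar. A sketch, where the jar path and class name are hypothetical:

```sql
-- Make the jar with the UDF implementation available to the session.
ADD JAR /path/to/my-udfs.jar;

-- Register the UDF under a name usable in queries
-- (com.example.hive.NormalizeUrlUDF is a placeholder class).
CREATE TEMPORARY FUNCTION normalize_url AS 'com.example.hive.NormalizeUrlUDF';

SELECT normalize_url(url) FROM page_views;
```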

The default metadata catalog for Hive is an embedded Apache Derby database, but other client/server DBs like MySQL can optionally be used.
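
Switching the metastore from embedded Derby to MySQL is done through JDBC connection properties in hive-site.xml. A sketch, with placeholder host and database names:

```xml
<!-- hive-site.xml: point the metastore at MySQL instead of embedded Derby -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-db.example.com/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```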

There are Hive plugins that now support querying the bitcoin blockchain.

Hive 0.10.0 was released on 11 January 2013. As of writing, the latest version of Hive available is 2.3.2, released on 18 November 2017.

Hive Architecture

Metastore

The Metastore holds metadata for each table, including its schema, location and partition metadata. This data is extremely important, and is typically backed up regularly.

Driver

This is the controller which receives the HiveQL statement. It drives statement execution by creating sessions, and it monitors the lifecycle and progress of the execution. The Driver also acts as a collection point for data or query results obtained after a Reduce operation. The driver consists of three components -

  • Compiler
  • Optimizer
  • Executor

Compiler

Compiles the HiveQL query into an Execution Plan. The Execution Plan contains the MapReduce tasks and steps that need to be performed to get the desired output. The compiler does this in a few stages. First it converts the query into an Abstract Syntax Tree (AST). After checking for compatibility and compile-time errors, it converts the AST into a Directed Acyclic Graph (DAG). The DAG represents MapReduce stages and tasks based on the input query and data.
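
You can inspect what the compiler produces with EXPLAIN, which prints the stage DAG and the map/reduce operator trees without running the query (table name illustrative):

```sql
EXPLAIN
SELECT url, COUNT(*) FROM page_views GROUP BY url;
```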

Optimizer

The Optimizer performs various transformations on the execution plan generated by the compiler to produce an optimized DAG. The optimizer might aggregate tasks, for example converting a pipeline of joins into a single join, or split tasks - such as applying a transformation on the data before running a reduce operation. The objective is better performance and scalability. As with all things in this architecture, the transformation logic can be modified, or a different optimizer plugged in altogether.

Executor

The executor runs the tasks. It interacts with the Hadoop job tracker to schedule tasks to be run, and takes care of pipelining tasks and handling their dependencies.

CLI/UI/Thrift

Various interfaces that allow external clients (or, in the case of Thrift, other applications and projects) to interact with Hive.

HiveQL

HiveQL does not strictly follow the full SQL-92 standard, and offers extensions like multi-table inserts and CTAS. However, it only offers basic support for indexes.
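
Both extensions are short to demonstrate. A sketch, with illustrative table names - CTAS creates a table from a query result, and a multi-table insert scans the source once while writing to several destinations:

```sql
-- CTAS: create a table directly from a query's result.
CREATE TABLE popular_urls AS
SELECT url, COUNT(*) AS views FROM page_views GROUP BY url;

-- Multi-table insert: one scan of page_views feeds two tables.
FROM page_views pv
INSERT OVERWRITE TABLE views_by_url  SELECT pv.url, COUNT(*) GROUP BY pv.url
INSERT OVERWRITE TABLE views_by_user SELECT pv.user_id, COUNT(*) GROUP BY pv.user_id;
```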

HiveQL traditionally lacked support for transactions and materialized views and offered only limited subquery support. Support for full ACID functionality was made available with release 0.14 in November 2014.

Hive Transaction Support

Up until Hive 0.13, Hive provided partition-level atomicity, consistency and durability, but not isolation. Isolation could be accomplished by turning on one of the available locking mechanisms (ZooKeeper or in-memory). With the addition of transactions in Hive 0.13, it became possible to provide full ACID semantics at the row level - this meant one application could add rows while another reads from the same partition, without interfering with each other. Hive 0.14 brought INSERT, UPDATE and DELETE statements that now allow true ACID row-level changes.
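
Row-level ACID operations require a transactional table, which in turn must be stored as ORC and bucketed. A sketch with an illustrative table:

```sql
CREATE TABLE user_events (id BIGINT, status STRING)
CLUSTERED BY (id) INTO 4 BUCKETS       -- bucketing is required for ACID tables
STORED AS ORC                          -- ACID support requires the ORC format
TBLPROPERTIES ('transactional'='true');

-- Row-level changes, available from Hive 0.14 onwards.
UPDATE user_events SET status = 'archived' WHERE id = 42;
DELETE FROM user_events WHERE status = 'deleted';
```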

What’s the sweet spot for Hive?

Hive is a great choice when low latency or multi-user support is not a requirement. It is great for batch processing and ETL.
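
A typical batch-ETL shape in Hive is overwriting one day's partition of a summary table from raw data (table names and the `dt` partition column are illustrative):

```sql
INSERT OVERWRITE TABLE daily_url_views PARTITION (dt = '2017-11-18')
SELECT url, COUNT(*)
FROM page_views
WHERE to_date(view_time) = '2017-11-18'
GROUP BY url;
```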

Hive and Tez

Modern Hive can run on Tez, whose computational model is similar to Spark's. This means that while staging continues to happen between map and reduce operations, intermediate results may no longer be written to disk. It also means that Hive on Tez is not as fault tolerant as Hive on MapReduce.

Apache Presto vs. Hive


Sidebar : ACID

If you’re into databases, you should already be familiar with ACID, and its very contrived cousin BASE.

Atomic
An operation either succeeds or fails. When you update a row, you do not end up with a partially updated row.
Consistent
Otherwise known as strong consistency. Once an application performs an operation, the results of that operation are visible to it in every subsequent operation. Contrast this with weak or eventual consistency models.
Isolated
Every “session” is isolated until commit. An incomplete operation by one user does not cause unexpected side effects for other users. This is usually accomplished by table and row locks. When sessions start trampling on each other, you get what are called “dirty reads” - reads where your session sees uncommitted data from another session’s in-flight operation. There are a few different types of isolation - query-time isolation provides snapshot isolation for the duration of the query, while traditional RDBMS systems provide transaction-time isolation, which provides snapshot isolation for the duration of the transaction.
Durable
Once an operation is completed, it will be preserved even in the face of machine or system failure.

Raw Notes

  • Hive - staged to disk.
  • Presto - In-memory.