Jan 6, 2022

Apache Hive

Updated: Jan 10, 2022

What is Hive?

Apache Hive is not a database. It is a distributed data warehouse system that processes large-scale data on Hadoop. A data warehouse provides a central store of information that can be easily analyzed to make informed, data-driven decisions. Hive is used to analyze large amounts of data stored in Hadoop HDFS and in compatible file systems such as Amazon S3 and Alluxio. Its query language, HiveQL (HQL), works much like SQL and is used to query data stored in HDFS and other file systems that integrate with Hadoop.

Apache Hive is built on top of Apache Hadoop, an open source framework for the distributed processing of large-scale data stored across a cluster of computers.

Hive is designed to work with very large datasets, on the order of terabytes and petabytes, and it is integrated with Hadoop. Hive is meant for structured data that can be stored in tables; it is not intended for unstructured data such as audio, video, and pictures. Hive is efficient for batch processing, where the concern is the amount of data to be processed rather than the time taken to produce the result. Hive supports various file formats, such as Parquet, text files (including compressed ones), SequenceFile, and ORC.

Why was Hive developed?

Hive was created for non-programmers who are familiar with SQL and need to work with large datasets. Hive reduces the complexity of MapReduce programming. Even a simple MapReduce program can run to around 100 lines of code, because separate mapper, reducer, and driver code has to be written even in the simplest cases. This is difficult for non-programmers. Hive was developed to reduce that effort: what takes roughly 100 lines of MapReduce code can often be expressed as a single Hive query. Because HiveQL is similar to SQL, analysts can use their existing SQL knowledge to write HiveQL queries.
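As a rough illustration of that difference, here is a minimal Python sketch (plain Python, not real Hadoop code) of the mapper, reducer, and driver pieces of a word count, followed by the kind of single HiveQL query that could express roughly the same job. The table name `docs` and its column `line` are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    """Reduce phase: sum all the counts collected for one word."""
    return word, sum(counts)


def run_job(lines):
    """Driver: run the map phase, group pairs by key, then run the reduce phase."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)
    return dict(reducer(word, counts) for word, counts in grouped.items())


# Roughly the same job as one HiveQL query.
# Assumption: a hypothetical table `docs` with a string column `line`.
WORD_COUNT_HQL = """
SELECT word, COUNT(*) AS total
FROM docs
LATERAL VIEW explode(split(line, ' ')) words AS word
GROUP BY word
"""

if __name__ == "__main__":
    print(run_job(["hive runs on hadoop", "hadoop stores data"]))
```

A real Hadoop job also needs job configuration, input and output formats, and packaging, which is where much of the extra code comes from.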

How does Hive work?

Hive's compiler converts each query into a MapReduce program, and the data is then processed by that program. The user does not write any of this code; Hive handles the conversion internally.
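To make the idea concrete, the following Python sketch shows one way the clauses of a simple aggregation query could line up with the map, shuffle, and reduce phases. It is a conceptual simplification only, and the plans Hive actually generates are more elaborate. The `employees` table, its columns, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical query being illustrated:
#   SELECT dept, SUM(salary) FROM employees WHERE salary > 1000 GROUP BY dept


def map_phase(rows):
    """Apply the WHERE filter and emit (GROUP BY key, value) pairs."""
    for row in rows:
        if row["salary"] > 1000:
            yield row["dept"], row["salary"]


def shuffle(pairs):
    """Group all values that share the same key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Apply the aggregate function (SUM) to each group."""
    return {key: sum(values) for key, values in groups.items()}


def run_query(rows):
    """Chain the three phases over an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))
```

In this reading, filtering and column selection happen on the map side, the GROUP BY column becomes the shuffle key, and the aggregate is computed on the reduce side.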

Benefits

Fast : Hive is designed to handle very large datasets, up to petabytes, using batch processing.
Familiar : Hive provides a SQL-like interface that is accessible to non-programmers.
Scalable : Hive is easy to distribute and scale based on your needs.

Why use Apache Hive?

Hive is mainly used for data analysis, querying, and summarization of large datasets. It also helps improve developer productivity. Because Hive queries are similar to SQL, analysts can use their SQL knowledge to write them. Compared with a traditional SQL database, Hive has the advantage on very large datasets, since that is the workload it was designed for. Hive also provides many built-in functions and supports user-defined functions (UDFs), which can help you solve problems quickly.

Hive integrates with other Hadoop ecosystem packages such as RHIPE, RHive, and Apache Mahout. It helps developers who work with complex analytical processing and challenging data formats. Tableau can also be integrated with Hive for data visualization. Running Hive on Apache Tez reduces query latency, which brings processing closer to real time.

Architecture of Apache Hive

The major components of Apache Hive are:

User Interface (UI)
Driver
Compiler
Metastore
Execution engine
Optimizer

User Interface (UI) : This is the component through which users interact with Hive. It is used to submit queries, monitor processes, and issue instructions. Hive queries can be submitted in three ways: the Hive CLI, the web interface, and the Thrift server.

Driver : This component receives the HiveQL query from the user interface. The driver has two main tasks. First, it provides the execute and fetch APIs needed for the query, which are modeled on the JDBC and ODBC interfaces. Second, it converts the Hive query into a MapReduce program with the help of the compiler.

Compiler : This component helps the driver convert the Hive query into a MapReduce program. It performs semantic analysis on the different query blocks and expressions, and eventually generates an execution plan with the help of metadata from the metastore.

Metastore : This component stores all the structural information about tables and partitions, such as the tables that exist, their columns and column data types, and the serializers and deserializers needed to read and write the data. It keeps track of the data and provides a backup of the metadata in case of data loss.
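The kind of information listed above can be pictured with a small Python sketch. This is only a toy model under stated assumptions, not the metastore's real schema or API: the class names `Column`, `TableInfo`, and `ToyMetastore`, the `sales` table, and its HDFS location are all hypothetical. The SerDe string is the class name of Hive's default text SerDe, used here only as a label.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    data_type: str


@dataclass
class TableInfo:
    name: str
    columns: list
    partition_keys: list = field(default_factory=list)
    serde: str = "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
    location: str = ""  # where the table's files live, e.g. an HDFS path


class ToyMetastore:
    """Hypothetical in-memory stand-in for the metadata the metastore keeps."""

    def __init__(self):
        self._tables = {}

    def register(self, table):
        self._tables[table.name] = table

    def table_count(self):
        return len(self._tables)

    def describe(self, name):
        """Return (column name, type) pairs, partition columns last."""
        table = self._tables[name]
        return [(c.name, c.data_type) for c in table.columns + table.partition_keys]


metastore = ToyMetastore()
metastore.register(
    TableInfo(
        name="sales",
        columns=[Column("id", "int"), Column("amount", "double")],
        partition_keys=[Column("sale_date", "string")],
        location="hdfs:///warehouse/sales",  # hypothetical path
    )
)
```

A lookup such as `metastore.describe("sales")` is the kind of question the compiler asks when it checks a query against table definitions.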

Execution engine : The execution engine interacts with the NameNode and ResourceManager of the Hadoop framework to locate the table data in HDFS and to have the tasks in the execution plan carried out. Once Hadoop completes the job, the result is returned and displayed to the user in the user interface.

Optimizer : This component performs transformation operations on the execution plan. It splits tasks in order to improve scalability and efficiency.

Apache Hive Limitations

Hive does not support online transaction processing (OLTP), but it can be used for online analytical processing (OLAP).
Data cannot be modified as freely as in a traditional SQL database; row-level updates and deletes are restricted.
Query latency in Apache Hive is high.

