Hive is a data warehouse tool to process structured data in Hadoop. It resides on top of Hadoop, and makes querying and analyzing easy.
Hive was initially developed by Facebook; the Apache Software Foundation later took it up and developed it further as open source software under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive consists mainly of 3 core parts: Hive clients, Hive services, and Hive storage and computing.
1. Hive Clients
Clients are the entry point to the Hive infrastructure. Hive provides various drivers for communicating with different types of applications:
* JDBC: Java-based applications
* ODBC: other applications
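As a rough illustration of the ODBC path, a non-Java application such as a Python script could reach Hive through an ODBC driver. The sketch below assumes that the third-party pyodbc package is installed and that a Hive ODBC driver has already been configured under a data source name. The DSN `HiveDSN` and both helper names are hypothetical, chosen only for this example.

```python
def build_connection_string(dsn):
    """Build a DSN-based ODBC connection string (the DSN name is an assumption)."""
    return f"DSN={dsn}"


def fetch_rows(sql, dsn="HiveDSN"):
    """Run one query through ODBC and return all result rows."""
    import pyodbc  # third-party package, imported lazily so the sketch loads without it

    # autocommit=True is an assumption: Hive ODBC drivers commonly lack transaction support
    conn = pyodbc.connect(build_connection_string(dsn), autocommit=True)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()
```

With a configured data source, a call such as `fetch_rows("SHOW TABLES")` would return the table names visible to the client.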
2. Hive Services
Hive services interact with the Hive clients: they receive each request and determine what kind of operations need to be performed.
3. Hive Storage and Computing
Hive services such as the Metastore in turn communicate with Hive storage and perform the following actions:
* Metadata for tables created in Hive is stored in the Hive “Meta storage database”.
* Query results and data loaded into the tables are stored in the Hadoop cluster on HDFS.
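The split between metadata and table data can be pictured with a small toy model. This is only a sketch: the class `ToyWarehouse`, its method names, and the sample table are hypothetical, and plain dictionaries stand in for the metastore database and for HDFS.

```python
class ToyWarehouse:
    """Toy stand-in: one dict plays the metastore database, another plays HDFS."""

    def __init__(self):
        self.metastore = {}  # table name -> {"columns": [...], "location": path}
        self.hdfs = {}       # path -> list of comma-separated text lines

    def create_table(self, name, columns, location):
        # DDL records metadata in the metastore; the table's directory starts out empty
        self.metastore[name] = {"columns": list(columns), "location": location}
        self.hdfs.setdefault(location, [])

    def load_rows(self, name, rows):
        # Loaded data goes to the table's location on "HDFS", not into the metastore
        location = self.metastore[name]["location"]
        self.hdfs[location].extend(",".join(str(v) for v in row) for row in rows)


warehouse = ToyWarehouse()
warehouse.create_table("employees", ["id", "name"], "/user/hive/warehouse/employees")
warehouse.load_rows("employees", [(1, "Ann"), (2, "Raj")])

print(warehouse.metastore["employees"]["columns"])       # ['id', 'name']
print(warehouse.hdfs["/user/hive/warehouse/employees"])  # ['1,Ann', '2,Raj']
```

The point of the model is that creating a table changes only the metadata store, while loading rows changes only the storage layer.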
Job execution flow
The data flow in Hive follows this pattern:
- A query is executed from the CLI or user interface
- The driver interacts with the compiler to get the execution plan, which involves basic validation of the query and gathering of its related metadata
- The compiler creates the plan for the job to be executed and sends a metadata request to the Metastore
- The Metastore sends the metadata information back to the compiler
- The compiler sends the proposed plan for executing the query to the driver
- The driver sends the execution plan to the execution engine
- The execution engine acts as a bridge between Hive and Hadoop to process the query.
- For DFS operations, the execution engine first contacts the Name Node and then the Data Nodes to get the values stored in tables.
- The execution engine fetches the desired records from the Data Nodes. The actual table data resides on the Data Nodes only; from the Name Node it fetches only the metadata information for the query.
- It collects the actual data related to the query from the Data Nodes
- The execution engine communicates bi-directionally with the Metastore in Hive to perform DDL (Data Definition Language) operations such as CREATE, DROP and ALTER on tables and databases. The Metastore stores only information such as database names, table names and column names. It fetches the data related to the given query.
- The execution engine in turn communicates with Hadoop daemons such as the Name Node, Data Nodes and Resource Manager to check resource availability and execute the query on top of the Hadoop file system
- The results are fetched from the execution engine
- The results are fetched from the driver
- The results are sent to the CLI or user interface
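The flow above can be sketched as a toy simulation. This is not Hive's actual code: the class and method names are hypothetical, the query is passed as a table name plus a column list instead of being parsed from HiveQL, and a dictionary stands in for the data held on the Data Nodes.

```python
class Metastore:
    """Holds table metadata only: column names and the data location."""

    def __init__(self):
        self.tables = {}

    def get_table(self, name):
        return self.tables[name]


class Compiler:
    """Validates the query and builds a plan from metastore metadata."""

    def __init__(self, metastore):
        self.metastore = metastore

    def compile(self, table, columns):
        meta = self.metastore.get_table(table)  # metadata request and response
        unknown = [c for c in columns if c not in meta["columns"]]
        if unknown:
            raise ValueError(f"unknown columns: {unknown}")
        return {"location": meta["location"], "schema": meta["columns"], "project": columns}


class ExecutionEngine:
    """Runs a plan against storage; a dict stands in for the Data Nodes."""

    def __init__(self, storage):
        self.storage = storage

    def execute(self, plan):
        positions = [plan["schema"].index(c) for c in plan["project"]]
        return [tuple(row[i] for i in positions) for row in self.storage[plan["location"]]]


class Driver:
    """Receives the query, asks the compiler for a plan, hands it to the engine."""

    def __init__(self, compiler, engine):
        self.compiler = compiler
        self.engine = engine

    def run(self, table, columns):
        plan = self.compiler.compile(table, columns)
        return self.engine.execute(plan)


metastore = Metastore()
metastore.tables["employees"] = {
    "columns": ["id", "name", "dept"],
    "location": "/user/hive/warehouse/employees",
}
storage = {"/user/hive/warehouse/employees": [(1, "Ann", "ops"), (2, "Raj", "dev")]}

driver = Driver(Compiler(metastore), ExecutionEngine(storage))
print(driver.run("employees", ["name", "dept"]))  # [('Ann', 'ops'), ('Raj', 'dev')]
```

In a real deployment the execution engine would submit work to the Hadoop cluster instead of reading an in-memory dictionary; the sketch only mirrors the order in which the components talk to each other.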