Introduction
Developed originally at Facebook, Hive adds structure to the data stored on HDFS. The schema of tables is stores in a separate metadata store. It converts SQL like semantics to multiple map reduce jobs running on HDFS in the backend.
Schema
Traditional databases follow the schema on write policy where once a schema is designed for a table, at the time of writing data itself, it is checked whether the data to be written conforms to the pre-defined schema. If it does not, the write is rejected.
In case of Hive, it is the opposite. It uses the schema on read policy. Both the policies have their own individual trade-offs:
In case of schema on write, load time is more and loads are slower because schema conformance is verified at the time of loading data. However, it provides faster query time because it can index data based on pre-defined columns in the schema.
However there may be cases where the indexing cannot be specified while populating the data initially
and this is where schema on read comes in handy. It provides the option to have 2 different schema present on the same underlying data depending on the kind of analysis required.
What is Hive suited for?
Hive is well suited for bulk access, updates of data as a new update requires a completely new table to be constructed. Also, query time is slower as compared to traditional databases because of the absence of indexing.
Hive Internals
- Hive stores the metadata into a relational database called the "Metastore".
- There are 2 kinds of tables in Hive:
- Managed tables - Where the data file for the table is predefined and is moved to the Hive warehouse directory on HDFS (in general, or any other Hadoop filesystem). When a table is deleted, in that case, the metadata and the data both are deleted from the filesystem.
- External tables - Here you can create data into the table lazily. There is no data moved to the Hive warehouse directory in this case and the schema/metadata is loosely coupled to the actual data. When a table is deleted, only the metadata gets deleted and the actual data is left untouched. It becomes helpful in cases if you want the data to be used by multiple databases. Another reason of using the same maybe when you need multiple schemas on the same underlying data.
- There is a provision to partition and sub-partition data in Hive based on a certain field. The advantage of doing this is that the data is grouped by that field and kept in one file/location for faster access. Hive performs input pruning so that only the relevant files are scanned when a SELECT query is issued.
- There is also the concept of buckets in Hive where it adds more structure to the data and faster retrieval times by effecting map side joins where if two tables are bucketed on the same columns, then only the relevant columns are picked from the second table as an input to the map.
- The partitions and buckets are effected through directory-file structuring like for example, one directory per partition and one file in a directory per bucket in the corresponding partition.