A common task in a data analyst's day-to-day job is running aggregations, typically summing and averaging columns under different filters. Once tables grow to hundreds of millions or billions of rows, these operations become extremely expensive, and the choice of database engine is crucial. Indeed, the more queries an analyst can run in a day, the better they can understand the data.
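To make the kind of query concrete, here is a minimal sketch of a filtered sum-and-average aggregation. It uses Python's built-in `sqlite3` module purely for illustration (SQLite is not one of the engines compared in this post), and the table and column names (`sales`, `region`, `year`, `amount`) are hypothetical.

```python
import sqlite3

# Hypothetical toy table; the names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("EU", 2012, 10.0),
        ("EU", 2013, 30.0),
        ("US", 2013, 20.0),
        ("US", 2013, 40.0),
    ],
)


def aggregate_by_region(conn, year):
    """Sum and average `amount` per region, keeping only rows for one year."""
    cur = conn.execute(
        "SELECT region, SUM(amount), AVG(amount) FROM sales "
        "WHERE year = ? GROUP BY region ORDER BY region",
        (year,),
    )
    return cur.fetchall()


result = aggregate_by_region(conn, 2013)
print(result)
```

On four rows this is instantaneous; the same `WHERE` / `GROUP BY` shape over billions of rows is where the storage layout of the engine starts to matter.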
In this post, we’re going to install five popular databases on Ubuntu Linux 12.04:
- MySQL / MariaDB 10.0: Row-based database
- MongoDB 2.4: NoSQL database
- Vertica Community Edition 6: Columnar database (similar to Infobright, InfiniDB, …)
- Hive 0.10: Data warehouse built on top of HDFS using MapReduce
- Impala 1.0: Database implemented on top of HDFS (compatible with Hive) based on Dremel that can use different data formats (raw CSV format, Parquet