Installing and comparing MySQL/MariaDB, MongoDB, Vertica, Hive and Impala (Part 1)


A common thing a data analyst does in his day-to-day job is to run aggregations of data, generally summing and averaging columns using different filters. When tables start to grow to hundreds of millions or billions of rows, these operations become extremely expensive and the choice of a database engine is crucial. Indeed, the more queries an analyst can run during the day, the better he can be at understanding the data.

In this post, we’re going to install 5 popular databases on Ubuntu Linux 12.04:

  • MySQL / MariaDB 10.0: Row-based database
  • MongoDB 2.4: NoSQL database
  • Vertica Community Edition 6: Columnar database (similar to Infobright, InfiniDB, …)
  • Hive 0.10: Data warehouse built on top of HDFS using MapReduce
  • Impala 1.0: Database implemented on top of HDFS (compatible with Hive) based on Dremel that can use different data formats (raw CSV format, Parquet

View original post 1,925 more words

Python [snippet/benchmark]: Redis insert test, PHP vs Python

I have two Redis test scripts, one in PHP and one in Python, and I want to see which one is faster.

Here is the PHP code:


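<?php
// Connection details are an assumption: a local Redis server on the default port.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$counter = 0;

// Add 100,000 integers to a set, counting each successful insert.
for ($i = 1; $i <= 100000; $i++) {
    if ($redis->sadd("php:key_test", $i)) {
        $counter += 1;
    }
}

echo "Added $counter data\n";

// Ask Redis for the size of the set to confirm the inserts.
$count = $redis->scard('php:key_test');

echo "count_data = $count \n";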

root@fajri-laptop:/home/fajri/php_project/php_redis# time php redis_test_insert.php
Added 100000 data
count_data = 100000

real    0m12.093s
user    0m1.672s
sys    0m2.488s

root@fajri-laptop:/home/fajri/php_project/php_redis# php -v
PHP 5.3.2-1ubuntu4.17 with Suhosin-Patch (cli) (built: Jun 19 2012 01:35:33)
Copyright (c) 1997-2009 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2010 Zend Technologies

Here is the Python code:

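import redis

# The host is left blank as in the original; set it to your Redis server's address.
redis_server = redis.Redis("")
counter = 0

# Add 100,000 integers to a set, counting each successful insert.
for i in range(100000):
    if redis_server.sadd('python:key_test', i):
        counter += 1

# Python 2 print statements, matching the 2.6.5 interpreter used here.
print "added %s data" % (counter)

# Ask Redis for the size of the set to confirm the inserts.
count = redis_server.scard('python:key_test')

print "count_data = %s" % (count)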

fajri@fajri-laptop:~/python_project/py_redis$ time python
added 100000 data
count_data = 100000

real    0m14.914s
user    0m6.100s
sys    0m2.104s

fajri@fajri-laptop:~/python_project/py_redis$ python -V
Python 2.6.5

As you can see, PHP is faster. So I googled around and chatted on ##python-friendly @ freenode, and got an idea: use xrange instead of range.
So I changed the loop to xrange, and it really works. Now Python is faster than PHP.
Why? In Python 2, range builds the whole list of 100,000 integers in memory before the loop starts, while xrange returns a lazy sequence object that produces each number only when the loop asks for it. It uses far less memory than the list, which can make the loop faster.
In the end, it took only 10 seconds.
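
To make the memory difference concrete, here is a minimal sketch. It is written for Python 3, where range already behaves like Python 2's xrange, so list(range(N)) stands in for the old list-building range. It does not talk to Redis, the variable names are only for illustration, and the exact byte counts depend on the interpreter.

```python
import sys

N = 100000

# Python 3's range is lazy, like Python 2's xrange:
# it stores only start, stop and step, whatever the length.
lazy = range(N)

# list(range(N)) creates every integer up front,
# which is what Python 2's range did.
eager = list(range(N))

lazy_size = sys.getsizeof(lazy)
# Size of the list object itself, not counting the integers it points to.
eager_size = sys.getsizeof(eager)

print("lazy range object: %d bytes" % lazy_size)
print("materialized list: %d bytes" % eager_size)

# Both produce exactly the same values when iterated.
same_values = all(a == b for a, b in zip(lazy, eager))
print("same values: %s" % same_values)
```

The lazy object stays the same small size whatever N is, while the list grows with N. This sketch shows the memory side only and says nothing about how much of the benchmark's time it accounts for.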

Hope this simple snippet is useful for you 😀