[Improvement] News Web Crawler combine with Redis

Redis. Wait, What?

Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.

Redis is simple, and I picked it up quickly because its data types map closely to Python's built-in data structures. My favorite type (for now) is the set, because it rejects duplicates automatically. That ties directly into my recent work: I want to make my news web crawler faster.

Introduction to Redis

The first things to know are Redis itself (a key-value store) and its data types: strings, hashes, lists, sets, and sorted sets. New to Redis? Just try the interactive Redis tutorial.

  • String
    A string is the simplest value type: one key mapped to one string value. You can store a JSON string, XML, HTML, or a paragraph of text. The command is set key value:

    redis 127.0.0.1:6379> set somekey "some value"
    OK
    redis 127.0.0.1:6379> get somekey
    "some value"
    redis 127.0.0.1:6379> set somekey "some long text some long text some long text some long text some long text some long text some long text some long text some long text some long text some long text some long text "
    OK
    redis 127.0.0.1:6379> get somekey
    "some long text some long text some long text some long text some long text some long text some long text some long text some long text some long text some long text some long text "
    redis 127.0.0.1:6379> set somekey '{"error": {"errors": [{"domain": "usageLimits","reason": "accessNotConfigured","message": "Access Not Configured"}],"code": 403,"message": "Access Not Configured"}}'
    OK
    redis 127.0.0.1:6379> get somekey
    "{\"error\": {\"errors\": [{\"domain\": \"usageLimits\",\"reason\": \"accessNotConfigured\",\"message\": \"Access Not Configured\"}],\"code\": 403,\"message\": \"Access Not Configured\"}}"
    

    Redis Doc String
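Since Redis types map so closely to Python's built-ins, the transcript above can be mirrored with a plain dict (no Redis server needed; `store`, `set_key`, and `get_key` are made-up names for illustration). The point it shows: every value is just a string, so structured data like JSON must be serialized on write and parsed on read.

```python
import json

store = {}  # stands in for the Redis keyspace: one key, one string value

def set_key(key, value):   # analogue of SET key value
    store[key] = value

def get_key(key):          # analogue of GET key
    return store.get(key)

set_key("somekey", "some value")
print(get_key("somekey"))  # some value

# a JSON document is just a long string to the store
set_key("somekey", json.dumps({"code": 403, "message": "Access Not Configured"}))
doc = json.loads(get_key("somekey"))
print(doc["code"])         # 403
```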

  • List
    A Redis list is much like a Python list; it supports commands such as push, insert, trim, and pop, among others.

    redis 127.0.0.1:6379> rpush somelist "value"
    (integer) 1
    redis 127.0.0.1:6379> rpush somelist "value"
    (integer) 2
    redis 127.0.0.1:6379> rpush somelist "value 1"
    (integer) 3
    redis 127.0.0.1:6379> rpush somelist "value 2"
    (integer) 4
    redis 127.0.0.1:6379> lrange somelist 0 -1
    1) "value"
    2) "value"
    3) "value 1"
    4) "value 2"
    redis 127.0.0.1:6379> lpush somelist "left value"
    (integer) 5
    redis 127.0.0.1:6379> rpop somelist
    "value 2"
    redis 127.0.0.1:6379> lrange somelist 0 -1
    1) "left value"
    2) "value"
    3) "value"
    4) "value 1"
    

    Redis Doc List
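The same session can be replayed with a Python list, which makes the left/right symmetry of the commands obvious (a pure-Python analogue, no server required; the mapping between list methods and Redis commands is noted in the comments):

```python
somelist = []

somelist.append("value")          # RPUSH somelist "value"
somelist.append("value")          # RPUSH again: duplicates are allowed
somelist.append("value 1")        # RPUSH somelist "value 1"
somelist.append("value 2")        # RPUSH somelist "value 2"
print(somelist)                   # LRANGE somelist 0 -1

somelist.insert(0, "left value")  # LPUSH pushes onto the left end
popped = somelist.pop()           # RPOP pops from the right end
print(popped)                     # value 2
print(somelist)                   # ['left value', 'value', 'value', 'value 1']
```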

  • Set
    The way I think of it, a set is just like a list, except it only stores unique values. And that is exactly what I wanted: a collection of unique data. Using MySQL INSERT IGNORE for this is too heavy; it pushed my server's I/O wait (wa) above 80%.

    redis 127.0.0.1:6379> sadd someset value
    (integer) 1
    redis 127.0.0.1:6379> smembers someset
    1) "value"
    redis 127.0.0.1:6379> sadd someset value
    (integer) 0
    redis 127.0.0.1:6379> smembers someset
    1) "value"
    redis 127.0.0.1:6379> sadd someset "http://google.co.id"
    (integer) 1
    redis 127.0.0.1:6379> sadd someset "http://google.co.id"
    (integer) 0
    redis 127.0.0.1:6379> smembers someset
    1) "http://google.co.id"
    2) "value"
    redis 127.0.0.1:6379> sadd someset "value 1"
    (integer) 1
    redis 127.0.0.1:6379> sadd someset "value 2"
    (integer) 1
    redis 127.0.0.1:6379> sadd someset "new value"
    (integer) 1
    redis 127.0.0.1:6379> smembers someset
    1) "value 2"
    2) "http://google.co.id"
    3) "value"
    4) "value 1"
    5) "new value"
    

    When I add the same value again, Redis ignores it automatically, and much faster than MySQL's INSERT IGNORE.
    Redis Doc Set
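The return value in the transcript is the useful part: SADD answers 1 for a new member and 0 for a duplicate. A Python set plus a small helper reproduces that contract (pure-Python analogue; the `sadd` helper name simply mirrors the command):

```python
someset = set()

def sadd(s, value):
    """Mimic Redis SADD: return 1 if the value was new, 0 if already present."""
    if value in s:
        return 0
    s.add(value)
    return 1

print(sadd(someset, "value"))                # 1
print(sadd(someset, "value"))                # 0, duplicate ignored
print(sadd(someset, "http://google.co.id"))  # 1
print(sadd(someset, "http://google.co.id"))  # 0
print(sorted(someset))
```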

  • Sorted Set
    It is similar to a regular set, but now each value has an associated score. This score is used to sort the elements in the set.

    > ZADD hackers 1940 "Alan Kay"
    true
    > ZADD hackers 1953 "Richard Stallman"
    true
    > ZADD hackers 1965 "Yukihiro Matsumoto"
    true
    > ZADD hackers 1916 "Claude Shannon"
    true
    > ZADD hackers 1969 "Linus Torvalds"
    true
    > ZADD hackers 1912 "Alan Turing"
    true
    > ZADD hackers 1969 "Linus Torvalds"
    false
    > ZRANGE hackers 0 -1
    ["Alan Turing","Claude Shannon","Alan Kay","Richard Stallman","Yukihiro Matsumoto","Linus Torvalds"]
    

    Redis Doc Sorted Set
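A sorted set is essentially a member-to-score mapping that is read back in score order, so a dict plus `sorted()` captures the semantics (pure-Python analogue; `zadd` is a made-up helper mirroring the command, returning True only for a new member, while a re-add just updates the score):

```python
hackers = {}  # member -> score

def zadd(zset, score, member):
    """Mimic ZADD: add or update a member's score; True only if the member is new."""
    is_new = member not in zset
    zset[member] = score
    return is_new

zadd(hackers, 1940, "Alan Kay")
zadd(hackers, 1953, "Richard Stallman")
zadd(hackers, 1965, "Yukihiro Matsumoto")
zadd(hackers, 1916, "Claude Shannon")
zadd(hackers, 1969, "Linus Torvalds")
zadd(hackers, 1912, "Alan Turing")
print(zadd(hackers, 1969, "Linus Torvalds"))  # False, already a member

# ZRANGE hackers 0 -1: members ordered by score
print(sorted(hackers, key=hackers.get))
```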

  • Hashes
    A hash maps fields to values under a single key. HSET sets field in the hash stored at key to value; if key does not exist, a new hash is created, and if field already exists, its value is overwritten.

    redis>  HSET myhash field1 "Hello"
    (integer) 1
    redis>  HGET myhash field1
    "Hello"
    redis> HSET myhash field1 "Hello"
    (integer) 0
    redis> HSET myhash field1 "world"
    (integer) 0
    redis> HSET myhash field2 "world"
    (integer) 1
    redis> HGET myhash field1
    "world"
    redis> HGET myhash
    ERR wrong number of arguments for 'hget' command
    redis> HGET myhash field2
    "world"
    

    Redis Doc Hash
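A Redis hash behaves like a Python dict nested under one key, and HSET's return value distinguishes creating a field from overwriting one. A small analogue (pure Python, no server; `hset` is a made-up helper mirroring the command):

```python
myhash = {}

def hset(h, field, value):
    """Mimic HSET: return 1 if the field is new, 0 if an existing field was overwritten."""
    is_new = field not in h
    h[field] = value
    return 1 if is_new else 0

print(hset(myhash, "field1", "Hello"))  # 1
print(myhash["field1"])                 # Hello   (HGET myhash field1)
print(hset(myhash, "field1", "world"))  # 0, overwritten
print(hset(myhash, "field2", "world"))  # 1
print(myhash["field1"])                 # world
```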

Some detail about news web crawler

The goal of the news web crawler is to build my own news search engine, so the url column in our database (MySQL) must not contain duplicate values. Before I knew about Redis, I used INSERT IGNORE and added a UNIQUE index to the url field to prevent duplicates, but that process drove my VPS I/O very high. So now I use Redis to check for duplicates before inserting into MySQL.

Why Redis? Because Redis does its work in memory; that is why it is so fast and barely touches the disk.

Make Me Faster

Validating that my data is unique with Redis is simple. I use the redis-py client library; to install it, just run pip install redis (or easy_install redis). Here is the logic:

    import redis

    r = redis.StrictRedis(host="localhost", port=6379, db=0)

    # sadd returns 1 when the value is new, 0 when it is already in the set
    if r.sadd("feeder:rss", data):
        # insert data into MySQL here:
        # use plain INSERT INTO, not INSERT IGNORE INTO,
        # and the UNIQUE index on url can be dropped
        ...
Simple? Yes. The check itself is one line, but it is very effective and makes the news crawler much faster.
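To see why the check works end to end, here is a sketch of the dedup path. The FakeRedis class and save_to_mysql stub are made up so the sketch runs without a live Redis or MySQL; against a real server you would connect with redis-py instead, and the if-branch would run a plain INSERT INTO.

```python
class FakeRedis:
    """Tiny in-memory stand-in for redis-py's sadd (illustration only)."""
    def __init__(self):
        self.sets = {}

    def sadd(self, key, value):
        s = self.sets.setdefault(key, set())
        if value in s:
            return 0   # duplicate, ignored
        s.add(value)
        return 1       # new member

inserted = []

def save_to_mysql(url):
    # stub: a real crawler would execute a plain INSERT INTO here
    inserted.append(url)

r = FakeRedis()  # real code: r = redis.StrictRedis(host="localhost", port=6379)

for url in ["http://a.example/1", "http://a.example/1", "http://a.example/2"]:
    if r.sadd("feeder:rss", url):  # 1 only the first time a URL is seen
        save_to_mysql(url)         # so duplicates never reach MySQL

print(inserted)  # ['http://a.example/1', 'http://a.example/2']
```

Because the set lives in memory, each membership check is a single fast Redis round trip instead of a disk-bound unique-index check in MySQL.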
