Scrapy [How to] Step by step crawling bhinneka.com – part 1

I want to show you how to scrape one of the biggest online stores in Indonesia, Bhinneka.com. I often check it to compare prices day by day, especially for computer peripherals. My goal for this crawler is to create an item comparison site, so you can track an item's price history day by day.

Let’s do it

  1. Make sure you have Scrapy installed on your machine (I use Ubuntu 10.04); see Installation.
  2. First, open up Bhinneka.com in your browser (I use Google Chrome) and find Shop by Categories.
  3. Then right-click on the magnifying icon and click "Inspect Element". Now you can see the element behind it; it shows us the next link that we want to scrape. If we hover over the icon, the link will appear, maybe something like this.
  4. Then save the link, and try Scrapy via the console with this command to get all the links: scrapy shell http://www.bhinneka.com/categories.aspx
    fajri@fajri-laptop:~$ scrapy shell http://www.bhinneka.com/categories.aspx
    2013-01-22 12:42:36+0700 [scrapy] INFO: Scrapy 0.14.3 started (bot: scrapybot)
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Enabled item pipelines:
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6025
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6082
    2013-01-22 12:42:37+0700 [default] INFO: Spider opened
    2013-01-22 12:43:03+0700 [default] DEBUG: Crawled (200) <GET http://www.bhinneka.com/categories.aspx> (referer: None)
    [s] Available Scrapy objects:
    [s]   hxs        <HtmlXPathSelector xpath=None data=u'<html ...'>
    [s]   item       {}
    [s]   request    <GET http://www.bhinneka.com/categories.aspx>
    [s]   response   <200 http://www.bhinneka.com/categories.aspx>
    [s]   settings   <CrawlerSettings module=None>
    [s]   spider     <BaseSpider 'default' at 0x92f002c>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    
  5. Then use this command to get all the links inside the page: hxs.select('//div[@id="ctl00_content_divContent"]//li[@class="item"]/a[2]/@href').extract(). The output looks something like this:
    [u'/aspx/products/pro_display_products.aspx?CategoryID=01VV',
     u'/aspx/products/pro_display_products.aspx?CategoryID=01VW',
     u'/aspx/products/pro_display_products.aspx?CategoryID=01VK',
     u'/aspx/products/pro_display_products.aspx?CategoryID=01VG',
     ...
    
  6. We will use those links to get the item details. But those links are relative, not absolute, so put this function in your IPython console (a more robust variant is sketched after this list):
    def complete_url(string):
        """Return complete url"""
        return "http://www.bhinneka.com" + string
    
  7. Then test our script again (a complete spider sketch that puts these steps together follows this list):
    items = hxs.select('//div[@id="ctl00_content_divContent"]//li[@class="item"]/a[2]/@href').extract()
    new_items = [complete_url(item) for item in items]
    print new_items
    [u'http://www.bhinneka.com/aspx/products/pro_display_products.aspx?CategoryID=01VV',
     u'http://www.bhinneka.com/aspx/products/pro_display_products.aspx?CategoryID=01VW',
     u'http://www.bhinneka.com/aspx/products/pro_display_products.aspx?CategoryID=01VK',
     u'http://www.bhinneka.com/aspx/products/pro_display_products.aspx?CategoryID=01VG',
     ...
    
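A quick note on step 6: joining URLs by string concatenation works here, but the standard library's urljoin is the more robust option, since it also copes with hrefs that are already absolute. A minimal sketch of the same helper:

    # Sketch of step 6 using the standard library instead of concatenation.
    from urlparse import urljoin

    def complete_url(string):
        """Return a complete url for a (possibly relative) href."""
        return urljoin("http://www.bhinneka.com", string)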
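And to round off part 1, here is a minimal sketch of how the shell experiments above could become a real spider. This is only my assumption of how the pieces fit together (the class name BhinnekaSpider and the parse_category callback are hypothetical); the actual spider gets built in the next parts:

    # Minimal spider sketch (Scrapy 0.14-era API), assembling the steps above.
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    class BhinnekaSpider(BaseSpider):
        name = "bhinneka"
        allowed_domains = ["bhinneka.com"]
        start_urls = ["http://www.bhinneka.com/categories.aspx"]

        def complete_url(self, string):
            """Return complete url"""
            return "http://www.bhinneka.com" + string

        def parse(self, response):
            # Step 5: extract the category links from the categories page.
            hxs = HtmlXPathSelector(response)
            links = hxs.select('//div[@id="ctl00_content_divContent"]'
                               '//li[@class="item"]/a[2]/@href').extract()
            for link in links:
                # Steps 6-7: complete each relative link and follow it.
                yield Request(self.complete_url(link),
                              callback=self.parse_category)

        def parse_category(self, response):
            # Extracting the actual products is covered in the next parts.
            pass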

Conclusion

Now you know how to use Scrapy. This is just the beginning, and a different take on the official Scrapy tutorial. If you can't wait for the next part of the tutorial, just fork it.

