Scrapy [How to] Step by step crawling – part 1

I want to show you how to scrape one of the biggest online stores in Indonesia. I often check it to compare prices day by day, especially for computer peripherals. My goal for this crawler is to build an item-comparison site, where you can track an item's price history day by day. [image: bhinneka_1]

Let’s do it

  1. Make sure you have Scrapy installed on your machine (I use Ubuntu 10.04). Installation
  2. First, open the site in your browser (I use Google Chrome) and find Shop by Categories. [image: Bhinneka_2]
  3. Then right-click on the magnifier icon and click “Inspect Element”. Now you can see the element behind it; it shows us the next link that we want to scrape. If we hover over the icon, the link appears, something like this. [image: Bhinneka_3]
  4. Then save the link, and try it in the Scrapy shell with this command to get all the links: scrapy shell
    fajri@fajri-laptop:~$ scrapy shell
    2013-01-22 12:42:36+0700 [scrapy] INFO: Scrapy 0.14.3 started (bot: scrapybot)
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Enabled item pipelines:
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Telnet console listening on
    2013-01-22 12:42:37+0700 [scrapy] DEBUG: Web service listening on
    2013-01-22 12:42:37+0700 [default] INFO: Spider opened
    2013-01-22 12:43:03+0700 [default] DEBUG: Crawled (200)  (referer: None)
    [s] Available Scrapy objects:
    [s]   hxs
    [s]   item       {}
    [s]   request
    [s]   response
    [s]   settings
    [s]   spider     <BaseSpider 'default' at 0x92f002c>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
  5. Then use this command to get all the links inside the page: hxs.select('//div[@id="ctl00_content_divContent"]//li[@class="item"]/a[2]/@href').extract(). The output is something like this.
  6. We use those links to get our details. But those links don’t look right (they are relative), so put this function in your IPython console:
    def complete_url(string):
        """Return complete url"""
        # the site's base URL goes between the quotes below
        return "" + string
  7. And test our script again:
    items = hxs.select('//div[@id="ctl00_content_divContent"]//li[@class="item"]/a[2]/@href').extract()
    new_items = [complete_url(item) for item in items]
    print new_items
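The complete_url helper above leaves the base URL as an empty string for you to fill in. A more robust way to complete relative links is urljoin from the standard library, which handles slashes and absolute links correctly. A minimal sketch (the base URL below is a placeholder, not the store's real address; on Python 2, the era of this tutorial, urljoin lives in the urlparse module instead):

```python
# Sketch: completing relative links with urljoin instead of string
# concatenation. BASE_URL is a placeholder for illustration only.
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

BASE_URL = "http://www.example.com/"  # placeholder, not the real store URL

def complete_url(path):
    """Return an absolute URL for a (possibly relative) link."""
    return urljoin(BASE_URL, path)

items = ["/products/item-1.aspx", "/products/item-2.aspx"]
new_items = [complete_url(item) for item in items]
print(new_items)
```

Unlike plain concatenation, urljoin also leaves already-absolute links untouched, so it is safe to apply to every extracted href.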


Now you know how to use Scrapy. This is just the beginning, and a companion to the original Scrapy tutorial. If you can’t wait for the next part of the tutorial, just fork it.
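To see what the XPath //div[...]//li[@class="item"]/a[2]/@href is actually doing, here is the same idea written with only Python's standard-library HTML parser. This is a sketch for illustration: the sample HTML below is made up, and the real page's markup is whatever the inspect-element step revealed.

```python
# Sketch: collect the href of the second <a> inside each <li class="item">,
# mimicking the XPath //li[@class="item"]/a[2]/@href. Sample HTML is invented.
from html.parser import HTMLParser

class ItemLinkParser(HTMLParser):
    """Collect the href of the second <a> in each <li class="item">."""
    def __init__(self):
        super().__init__()
        self.in_item = False      # are we inside an item <li>?
        self.anchor_count = 0     # which <a> within the current <li>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self.in_item = True
            self.anchor_count = 0
        elif tag == "a" and self.in_item:
            self.anchor_count += 1
            if self.anchor_count == 2 and "href" in attrs:
                self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

sample = """
<ul>
  <li class="item"><a href="#img">img</a><a href="/products/1.aspx">Item 1</a></li>
  <li class="item"><a href="#img">img</a><a href="/products/2.aspx">Item 2</a></li>
</ul>
"""
parser = ItemLinkParser()
parser.feed(sample)
print(parser.links)
```

The resulting relative links are exactly what the complete_url helper from step 6 is meant to turn into absolute URLs.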

