Recent Works [Python] – News Web Crawler

I was asked to crawl several news websites and build an analytics tool on top of the collected data. Previously my client (Mediawave) used a Google API to search for news data, but that API has since been shut down, so the only way to get the data now is to crawl the news providers directly. In my case (Indonesian news), I crawl just three providers: vivanews.com, detiknews.com and kompas.com. Everything was built in Python. Feel free to ask me if you want to see it in action.

Libraries I used:

  1. Feedparser (parses RSS feeds)
  2. Scrapy (crawling with XPath)
  3. Twisted (required by Scrapy; installed automatically by pip / easy_install)
  4. MySQLdb (library for connecting Python to MySQL)
  5. Django framework (web-based GUI)
  6. Highcharts (charting library)
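To give a rough idea of the RSS step, here is a minimal sketch, not the actual crawler code. It assumes a feed in plain RSS 2.0 format and uses the standard library's `xml.etree.ElementTree` as a stand-in for Feedparser so that it runs without any extra installs. The sample feed, the `example.com` links and the function name `extract_new_items` are all made up for illustration. It pulls the title and link out of each item and uses a set to skip links it has already seen.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed (assumption): a real crawler would download
# the provider's RSS feed instead of using a hard-coded string.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/news/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/news/2</link>
    </item>
    <item>
      <title>First headline (repeated)</title>
      <link>http://example.com/news/1</link>
    </item>
  </channel>
</rss>"""


def extract_new_items(rss_text, seen_links):
    """Return (title, link) pairs whose link is not yet in seen_links.

    seen_links is a set that is updated in place, so calling this again
    with the same set skips everything that was already collected.
    """
    root = ET.fromstring(rss_text)
    new_items = []
    # ElementTree supports a limited XPath subset, enough for RSS items.
    for item in root.findall("./channel/item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if not link or link in seen_links:
            continue  # skip items without a link and duplicates
        seen_links.add(link)
        new_items.append((title, link))
    return new_items


if __name__ == "__main__":
    seen = set()
    for title, link in extract_new_items(SAMPLE_RSS, seen):
        print(title, link)
```

In a real setup the returned links would be handed to the Scrapy spiders to fetch the full articles, and the set of seen links would have to be stored somewhere persistent (for example the MySQL database) so that duplicates are still detected after a restart.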

Here are some screenshots (open in a new tab for higher resolution):


4 thoughts on “Recent Works [Python] – News Web Crawler”

  1. […] sets, because, it can validate duplicate. It’s connected to my recent work, I want to make my news web crawler going […]

  2. […] Here I will show you a little trick using croniter, in the future, it will be combined with Redis Caching and News Web Crawler. […]

  3. Hi fajri, I am looking for a (news) web crawler for a project I am doing. Would you be able to share some insight or bits of your code on GitHub, or point me to some resources? Many thanks 🙂
