Crawl responsibly. Do not flood or DDoS websites.
I was asked to write this from my readers’ perspective, so in each section I’ll explain the idea first before showing what you should expect to see. In this post, you will learn to develop a web crawler using Python and Scrapy. If you do not have time to code along, you can look at my project on GitHub.
Contents
- Step 0: What is Scrapy?
- Step 1: Getting things ready (Setup Scrapy, MongoDB, Scrapy-Splash, basic configurations)
- Step 2: Building the crawler (Scrape pages and write item to MongoDB)
- Conclusion
Step 0: What is Scrapy?
Scrapy is an application framework designed for web scraping — crawling web sites and extracting structured data. It is useful to look at the architecture below to understand how Scrapy functions behind the scenes.
Step 1: Get things ready!
Install Python 3 & Scrapy
To install Scrapy, you first need to install Python. I strongly recommend installing Python 3, as Python 2 has reached end of life.
After installing Python, open Terminal (macOS) or cmd (Windows) to install Scrapy.
pip3 install scrapy
If for whatever reason you are unable to install Scrapy, refer to Scrapy’s installation guide here.
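To confirm the installation worked, you can print the installed version from the same terminal; it should output something like Scrapy 1.x or 2.x:
scrapy version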
Install MongoDB & MongoDB Compass
Next, you’ll need a database to store the crawled data. You can use any database you’re familiar with. For this post, I’ve chosen MongoDB.
You can install it locally or you can use Docker. I’ll show you the Docker way.
First, ensure you have Docker installed and running. If you do not have Docker installed, you can install it here.
After you have Docker installed, open a Terminal and enter these commands:
docker pull mongo
docker run -p 27017:27017 -d --mount type=bind,source=$PWD/data/bin,destination=/data/bin mongo
- docker pull mongo: Pulls the latest-tagged MongoDB Docker image
- docker run -p 27017:27017 -d --mount type=bind,source=$PWD/data/bin,destination=/data/bin mongo:
Runs a Docker container using the latest-tagged MongoDB image on port 27017 (-p 27017:27017), in the background (-d), with a host folder mounted into the container (source=$PWD/data/bin, destination=/data/bin)
$PWD = Present Working Directory (ensure that an existing /data/bin folder is present in your $PWD)
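Before moving on, you can confirm that the MongoDB container is actually running; the mongo image should appear in the output of:
docker ps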
After you have a running MongoDB Docker container, simply download and install MongoDB Compass to view the data in MongoDB. You should see this:
Identify the website to scrape
I chose Uniqlo’s Singapore online web store for 2 reasons:
- The structure of the website is commonly found on other websites, so you can easily replicate what you’ve learnt here.
- Webpages are not always loaded instantly. Although some webpages load their data instantaneously, others still require some time for their JavaScript to load.
Install Scrapy-Splash
To instruct your spider to wait for the JavaScript to load before scraping, we need to install Scrapy-Splash. Open a Terminal and enter these commands:
pip3 install scrapy-splash
docker pull scrapinghub/splash
docker run -p 8050:8050 -d scrapinghub/splash
- pip3 install scrapy-splash: Installs scrapy-splash. Scrapy-Splash uses the Splash HTTP API, so you need a running Splash instance. The next couple of steps will help you set up the Splash instance.
- docker pull scrapinghub/splash: Pulls the latest-tagged Splash Docker image
- docker run -p 8050:8050 -d scrapinghub/splash:
Runs a Docker container using the latest-tagged Splash image on port 8050 (-p 8050:8050), in the background (-d).
- Ensure that Splash is working by going to http://localhost:8050/.
You should see this:
Add crawler settings
In the project folder, you’ll find the settings.py file. You have to ADD the settings below to customize your Scrapy crawler’s behaviour so that it works with MongoDB and Scrapy-Splash.
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'uniqlo.pipelines.UniqloPipeline': 2,
}

# MongoDB connection settings (used by util.py and the item pipeline)
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'uniqlo'
MONGODB_COLNAME = 'items'

# Splash instance started earlier with Docker
SPLASH_URL = 'http://localhost:8050'
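The scrapy-splash documentation also recommends a Splash-aware dupe filter and cache storage so that request fingerprints account for Splash arguments. If you want those as well, you can add the two settings below (taken from the scrapy-splash README):
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'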
Set basic configuration
In the spiders folder, create a util.py file. This file handles basic configuration, such as creating and connecting to a Mongo client. It relies on pymongo, which you can install with pip3 install pymongo.
import collections
import pymongo

from uniqlo.settings import *


def set_mongo_server():
    conn = pymongo.MongoClient(host=MONGODB_HOST, port=MONGODB_PORT)
    print(conn)
    return conn[MONGODB_DBNAME]
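As a quick sanity check (assuming MongoDB is running locally on port 27017), you can open a Python shell in the project root, call the helper and list the collections in the database:
from uniqlo.spiders.util import set_mongo_server

db = set_mongo_server()
print(db.list_collection_names())  # empty for now; fills up once the spider writes items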
Plan how you would scrape the website
Before you embark on your crawling journey, it is always important to plan! Study the website and plan how you want your spider to visit the pages.
For beginners, always ask yourself how you would do it manually before you automate the process. Two guiding questions:
1. What do you want to scrape? — Your GOAL
2. How do you get to the webpage? — Your APPROACH
For Uniqlo, my goal was to scrape ALL product details, which can only be found on the product detail pages. I would start from the landing page, visit every category page and then visit every product detail page. The images below show how I studied Uniqlo’s website.
Step 2: Build your crawler
Create Scrapy project
Now that you have a better idea of how the website is structured and how you want to crawl it, you can create your Scrapy project.
In Terminal, cd to the directory you want to create your project in and enter the commands below. Next, open the folder using an IDE (e.g. Visual Studio Code) and start coding! *excited*
If you’re having permission issues, you can fix them with the sudo chmod -R 777 YOUR_PROJECT_NAME command.
scrapy startproject YOUR_PROJECT_NAME
cd YOUR_PROJECT_NAME
scrapy genspider products START_URL
In items.py, define the item and fields you want to crawl. Since my goal was to crawl Uniqlo’s product details, I defined the item and fields as follows:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class UniqloItem(Item):
    # define the fields for your item here like:
    name = Field()
    code = Field()
    currency = Field()
    price = Field()
    oldprice = Field()
    tags = Field()
    colors = Field()
    sizes = Field()
    imageUrl = Field()
    imageName = Field()
    material = Field()
    care = Field()
    description = Field()
    originalLink = Field()
    itemLink = Field()
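A Scrapy Item behaves much like a dict, which is what the pipeline relies on later when it calls dict(item). Here is a tiny illustrative sketch (the product values are made up):
# assumes UniqloItem from items.py above
item = UniqloItem(name='AIRism Cotton T-Shirt', price='19.90')
item['currency'] = 'SGD'  # fields can also be set with dict-style access
print(dict(item))  # {'name': 'AIRism Cotton T-Shirt', 'price': '19.90', 'currency': 'SGD'}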
In the spiders folder, you’ll find the products.py file. This file tells your spider how to explore the site.
- name: Name of the spider
- allowed_domains: Domains the spider is allowed to visit
- start_urls: Root URLs where the spider will start
- parse(self, response): Parses the contents of a visited page. In this case, it is https://www.uniqlo.com/sg/store/.
import scrapy


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['www.uniqlo.com']
    start_urls = ['https://www.uniqlo.com/sg/store//']

    def parse(self, response):
        pass
In our plan, I said we would start from the landing page, visit every category page and then visit every product detail page. Let’s break this down one step further: at the landing page, grab all category links and visit them; at each category page, grab all product detail links, visit them and scrape the product details.
Scrape landing page to extract category links
In your browser, at the landing page, right click and inspect element. The goal is to grab all category href links. You can either use XPath or CSS selectors. I prefer CSS selectors as they are similar to jQuery.
The parse function is called with the response for each URL you’ve assigned to start_urls. After getting a response, use Scrapy’s response.css method to get the category links.
1. Select the #navHeader element and get all .cateNaviLink elements.
2. Extract the href attribute contents of all <a> tags; this returns an array of strings.
3. Print and check each extracted link.
import scrapy


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['www.uniqlo.com']
    start_urls = ['https://www.uniqlo.com/sg/store//']

    def parse(self, response):
        links = response.css('#navHeader .cateNaviLink a::attr(href)').extract()
        for link in links:
            print(link)
You can run the spider using this command (remove --nolog to see the logs and crawl stats):
scrapy runspider --nolog products.py
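If you prefer XPath, the same links could be grabbed with something along these lines (a rough, untested equivalent of the CSS selector above):
links = response.xpath('//*[@id="navHeader"]//*[contains(@class, "cateNaviLink")]//a/@href').getall()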
Scrape category page to extract item links
After extracting the category links, we visit them by yielding a SplashRequest.
1. yield: yield is a keyword that is used like return, except the function returns a generator. Generators are iterators, a kind of iterable you can only iterate over once (see the short sketch after this list).
2. SplashRequest(link, self.parse_categoryLink, meta={'url': link}, args={'wait': 1}): Since the category webpages need some time for their JavaScript to load, we must use SplashRequest. Most of the time, we simply use Request to visit a link and parse its contents. This SplashRequest visits the category webpage (link) and waits for 1 second before grabbing all the item links (parse_categoryLink). If you noticed, the category link is also passed along as url metadata for future reference.
*Ensure the Splash Docker container (scrapinghub/splash) is running.
3. parse_categoryLink(self, response): Parses the contents of the category page. For example, the first category link would be https://www.uniqlo.com/sg/store/women/featured/new-arrivals.html. Similar to how I parsed the contents of the landing page, I’ve used response.css to get all the item links and print them for checking.
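Here is a minimal, standalone sketch of how yield and generators behave (plain Python, unrelated to Scrapy itself):
def count_up_to(n):
    # yield makes this function return a generator instead of a single value
    for i in range(1, n + 1):
        yield i

gen = count_up_to(3)
print(list(gen))  # [1, 2, 3]
print(list(gen))  # [] -- the generator is exhausted after one pass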
import scrapy
from scrapy_splash import SplashRequest


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['www.uniqlo.com']
    start_urls = ['https://www.uniqlo.com/sg/store//']

    def parse(self, response):
        links = response.css('#navHeader .cateNaviLink a::attr(href)').extract()
        for link in links:
            yield SplashRequest(link, self.parse_categoryLink, meta={'url': link}, args={'wait': 1})

    def parse_categoryLink(self, response):
        itemLinks = response.css(".item > a::attr(href)").extract()
        originalLink = response.meta['url']
        for itemLink in itemLinks:
            print(itemLink)
Scrape item page to extract item details
This time, let’s visit each item link using a plain Request.
1. parse_item(self, response): First, create a UniqloItem object to store what you’ll scrape from the page. Next, parse, extract and assign the contents you want from the page to the object’s respective fields. Finally, print and yield the object.
import scrapy
from scrapy import Request
from scrapy_splash import SplashRequest

from uniqlo.items import UniqloItem


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['www.uniqlo.com']
    start_urls = ['https://www.uniqlo.com/sg/store//']

    def parse(self, response):
        links = response.css('#navHeader .cateNaviLink a::attr(href)').extract()
        for link in links:
            yield SplashRequest(link, self.parse_categoryLink, meta={'url': link}, args={'wait': 1})

    def parse_categoryLink(self, response):
        itemLinks = response.css(".item > a::attr(href)").extract()
        originalLink = response.meta['url']
        for itemLink in itemLinks:
            yield Request(url=itemLink, meta={'originalLink': originalLink, 'itemLink': itemLink}, callback=self.parse_item)

    def parse_item(self, response):
        item = UniqloItem()
        name = ''.join(response.css("#prodInfo > h1 > span ::text").extract()).strip()
        code = ''.join(response.css("div.basicinfo_wrap > span > ul > li.number::text").extract()).replace("ITEM CODE: ", "").strip()
        currency = ''.join(response.css("div.basicinfo_wrap > span > ul > li:nth-child(2) > div.price-box > meta::attr(content)").extract()).strip()
        price = ''.join(response.css("div.basicinfo_wrap > span > ul > li:nth-child(2) > div > p.special-price > span.price::text").extract())
        oldPrice = ''.join(response.css("div.basicinfo_wrap > span > ul > li:nth-child(2) > div > p.old-price > span.price::text").extract())
        tags = response.css('#prodInfo > div > ul.special > li::text').getall()
        colors = response.css('#listChipColor > li > a::attr(title)').getall()
        sizes = response.css('#prodSelectAttribute > #prodSelectSize > #selectSizeDetail > #listChipSize > li > a > em::text').getall()
        imageUrl = response.css('#prodImgDefault > img::attr(src)').extract()
        imageName = response.css('#prodImgDefault > img::attr(alt)').extract()
        description = ''.join(response.css('#prodDetail > div::text').extract()).strip()
        material = response.css('#prodDetail > div.content > dl.spec > dd:nth-child(2)::text').extract()
        care = response.css('#prodDetail > div.content > dl.spec > dd:nth-child(6)::text').extract()

        item["name"] = name
        item["code"] = code
        item["currency"] = currency
        item["price"] = price
        item["oldprice"] = oldPrice
        item["tags"] = tags
        item["colors"] = colors
        item["sizes"] = sizes
        item["imageUrl"] = imageUrl
        item["imageName"] = imageName
        item["material"] = material
        item["care"] = care
        item["description"] = description
        item["originalLink"] = response.meta['originalLink']
        item["itemLink"] = response.meta['itemLink']

        print(item)
        yield item
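Before wiring the results into MongoDB, you can do a quick sanity check by running the spider from the project root and exporting the scraped items to a JSON feed (the -o flag is standard Scrapy; the file name is just an example):
scrapy crawl products -o items.json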
Write items into MongoDB
Hang in there! Last mile~ The final step to process the scraped item is to push it into an Item Pipeline (refer to step 8 in Scrapy’s architecture).
1. __init__(self): Initialises the connection to the MongoDB server.
2. process_item(self, item, spider): Converts the yielded item into a dict and inserts it into MongoDB.
from uniqlo.spiders import util
from uniqlo.settings import *


class UniqloPipeline(object):
    def __init__(self):
        # Initialise the MongoDB connection
        self.db = util.set_mongo_server()

    def process_item(self, item, spider):
        try:
            if "name" in item:
                self.db[MONGODB_COLNAME].insert_one(dict(item))
        except Exception as ex:
            spider.logger.warning('Pipeline Error (others): %s %s' % (str(ex), str(item)))
        return item
Run your completed spider! It will start crawling all the items from the website. Go to MongoDB Compass to look at what your spider crawled for you. (I managed to crawl 1,765 items in 3 minutes.)
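If you’d rather check from Python than from Compass, a quick count query works too (assuming the same connection settings used above):
import pymongo

client = pymongo.MongoClient('localhost', 27017)
print(client['uniqlo']['items'].count_documents({}))  # number of items the spider stored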
Conclusion
This article showed you a quick and simple way of using Scrapy. Learning the techniques above will help you crawl most websites today.
Thank you for reading and I hope you learnt something interesting today!