Crawl responsibly. Do not flood or DDoS websites.
I was asked to write this from my readers’ perspective, so in each section I’ll explain the idea first before showing what you should expect to see. In this post, you will learn to develop a web crawler using Python and Scrapy. If you do not have time to code along, you can look at my project on GitHub.
Contents
- Step 0: What is Scrapy?
- Step 1: Getting things ready (Setup Scrapy, MongoDB, Scrapy-Splash, basic configurations)
- Step 2: Building the crawler (Scrape pages and write item to MongoDB)
- Conclusion
Step 0: What is Scrapy?
Scrapy is an application framework designed for web scraping — crawling web sites and extracting structured data. It is useful to look at the architecture below to understand how Scrapy functions behind the scenes.
Step 1: Get things ready!
Install Python 3 & Scrapy
To install Scrapy, you first need to install Python. I strongly recommend installing Python 3, as Python 2 has reached end of life.
After installing Python, open Terminal (macOS) or cmd (Windows) to install Scrapy.
pip3 install scrapy
If for whatever reason you are unable to install Scrapy, refer to Scrapy’s installation guide here.
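To confirm the installation worked, you can print the installed version from the same terminal; it should output something like Scrapy 1.x or 2.x:
scrapy version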
Install MongoDB & MongoDB Compass
Next, you’ll need a database to store the crawled data. You can use any database you’re familiar with. For this post, I’ve chosen MongoDB.
You can install it locally or you can use Docker. I’ll show you the Docker way.
First, ensure you have Docker installed and running. If you do not have Docker installed, you can install it here.
After you have Docker installed, open a Terminal and enter these commands:
docker pull mongo
docker run -p 27017:27017 -d --mount type=bind,source=$PWD/data/bin,destination=/data/bin mongo
- docker pull mongo: Pulls the latest-tagged MongoDB Docker image
- docker run -p 27017:27017 -d --mount type=bind,source=$PWD/data/bin,destination=/data/bin mongo:
Runs a Docker container using the latest-tagged MongoDB image on port 27017 (-p 27017:27017), in the background (-d), with a host folder mounted into the container (source=$PWD/data/bin, destination=/data/bin)
$PWD = Present Working Directory (ensure that an existing /data/bin folder is present in your $PWD)
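Before moving on, you can confirm that the MongoDB container is actually running; the mongo image should appear in the output of:
docker ps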
After you have a running MongoDB Docker container, simply download and install MongoDB Compass to view the data in MongoDB. You should see this:
Identify the website to scrape
I chose Uniqlo’s Singapore online web store for 2 reasons:
- The structure of the website is commonly found on other websites, so you can easily replicate what you’ve learnt here.
- Webpages are not always loaded instantly. Although some webpages load their data instantaneously, others still require some time for their JavaScript to load.
Install Scrapy-Splash
To instruct your spider to wait for the JavaScript to load before scraping, we need to install Scrapy-Splash. Open a Terminal and enter these commands:
pip3 install scrapy-splash
docker pull scrapinghub/splash
docker run -p 8050:8050 -d scrapinghub/splash
- pip3 install scrapy-splash: Installs scrapy-splash. Scrapy-Splash uses the Splash HTTP API, so you need a running Splash instance. The next couple of steps will help you set up the Splash instance.
- docker pull scrapinghub/splash: Pulls the latest-tagged Splash Docker image
- docker run -p 8050:8050 -d scrapinghub/splash:
Runs a Docker container using the latest-tagged Splash image on port 8050 (-p 8050:8050), in the background (-d).
- Ensure that Splash is working by going to http://localhost:8050/.
You should see this:
Add crawler settings
In the project folder, you’ll find the settings.py file. You have to ADD the settings below to customize your Scrapy crawler’s behaviour so that it works with MongoDB and Scrapy-Splash.
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'uniqlo.pipelines.UniqloPipeline': 2,
}

# MongoDB connection settings (used by util.py and the item pipeline)
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'uniqlo'
MONGODB_COLNAME = 'items'

# Splash instance started earlier with Docker
SPLASH_URL = 'http://localhost:8050'
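The scrapy-splash documentation also recommends a Splash-aware dupe filter and cache storage so that request fingerprints account for Splash arguments. If you want those as well, you can add the two settings below (taken from the scrapy-splash README):
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'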
Set basic configuration
In the spiders folder, create a util.py file. This file handles basic configuration, such as creating and connecting to a Mongo client. It relies on pymongo, which you can install with pip3 install pymongo.
import collections
import pymongo

from uniqlo.settings import *


def set_mongo_server():
    conn = pymongo.MongoClient(host=MONGODB_HOST, port=MONGODB_PORT)
    print(conn)
    return conn[MONGODB_DBNAME]
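As a quick sanity check (assuming MongoDB is running locally on port 27017), you can open a Python shell in the project root, call the helper and list the collections in the database:
from uniqlo.spiders.util import set_mongo_server

db = set_mongo_server()
print(db.list_collection_names())  # empty for now; fills up once the spider writes items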
Plan how you would scrape the website
Before you embark on your crawling journey, it is always important to plan! Study the website and plan how you want your spider to visit the pages.
For beginners, always ask yourself how you would do it manually before you automate the process. Two guiding questions:
1. What do you want to scrape? — Your GOAL
2. How do you get to the webpage? — Your APPROACH
For Uniqlo, my goal was to scrape ALL product details, which can only be found on the product detail pages. I would start from the landing page, visit every category page and then visit every product detail page. The images below show how I studied Uniqlo’s website.
Step 2: Build your crawler
Create Scrapy project
Now that you have a better idea of how the website is structured and how you want to crawl it, you can create your Scrapy project.
In Terminal, cd to the directory you want to create your project in and enter the commands below. Next, open the folder using an IDE (e.g. Visual Studio Code) and start coding! *excited*
If you’re having permission issues, you can fix them with the sudo chmod -R 777 YOUR_PROJECT_NAME command.
scrapy startproject YOUR_PROJECT_NAME
cd YOUR_PROJECT_NAME
scrapy genspider products START_URL
In items.py, define the item and fields you want to crawl. Since my goal was to crawl Uniqlo’s product details, I defined the item and fields as follows:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class UniqloItem(Item):
    # define the fields for your item here like:
    name = Field()
    code = Field()
    currency = Field()
    price = Field()
    oldprice = Field()
    tags = Field()
    colors = Field()
    sizes = Field()
    imageUrl = Field()
    imageName = Field()
    material = Field()
    care = Field()
    description = Field()
    originalLink = Field()
    itemLink = Field()
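A Scrapy Item behaves much like a dict, which is what the pipeline relies on later when it calls dict(item). Here is a tiny illustrative sketch (the product values are made up):
# assumes UniqloItem from items.py above
item = UniqloItem(name='AIRism Cotton T-Shirt', price='19.90')
item['currency'] = 'SGD'  # fields can also be set with dict-style access
print(dict(item))  # {'name': 'AIRism Cotton T-Shirt', 'price': '19.90', 'currency': 'SGD'}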
In the spiders folder, you’ll find the products.py file. This file tells your spider how to explore the site.
- name: Name of the spider
- allowed_domains: Domains the spider is allowed to visit
- start_urls: Root URLs where the spider will start
- parse(self, response): Parses the contents of a visited page. In this case, it is https://www.uniqlo.com/sg/store/.
import scrapy


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['www.uniqlo.com']
    start_urls = ['https://www.uniqlo.com/sg/store//']

    def parse(self, response):
        pass
In our plan, I said we would start from the landing page, visit every category page and then visit every product detail page. Let’s break this down one step further: at the landing page, grab all category links and visit them; at each category page, grab all product detail links, visit them and scrape the product details.
Scrape landing page to extract category links
In your browser, at the landing page, right click and inspect element. The goal is to grab all category href links. You can either use XPath or CSS selectors. I prefer CSS selectors as they are similar to jQuery.
The parse function is called with the response for each URL you’ve assigned to start_urls. After getting a response, use Scrapy’s response.css method to get the category links.
1. Select the #navHeader element and get all .cateNaviLink elements.
2. Extract the href attribute contents of all <a> tags; this returns an array of strings.
3. Print and check each extracted link.
import scrapy


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['www.uniqlo.com']
    start_urls = ['https://www.uniqlo.com/sg/store//']

    def parse(self, response):
        links = response.css('#navHeader .cateNaviLink a::attr(href)').extract()
        for link in links:
            print(link)
You can run the spider using this command (remove --nolog to see the logs and crawl stats):
scrapy runspider --nolog products.py
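If you prefer XPath, the same links could be grabbed with something along these lines (a rough, untested equivalent of the CSS selector above):
links = response.xpath('//*[@id="navHeader"]//*[contains(@class, "cateNaviLink")]//a/@href').getall()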
Scrape category page to extract item links
After extracting the category links, we visit them by yielding a SplashRequest.
1. yield: yield is a keyword that is used like return, except the function returns a generator. Generators are iterators, a kind of iterable you can only iterate over once (see the short sketch after this list).
2. SplashRequest(link, self.parse_categoryLink, meta={'url': link}, args={'wait': 1}): Since the category webpages need some time for their JavaScript to load, we must use SplashRequest. Most of the time, we simply use Request to visit a link and parse its contents. This SplashRequest visits the category webpage (link) and waits for 1 second before grabbing all the item links (parse_categoryLink). If you noticed, the category link is also passed along as url metadata for future reference.
*Ensure the Splash Docker container (scrapinghub/splash) is running.
3. parse_categoryLink(self, response): Parses the contents of the category page. For example, the first category link would be https://www.uniqlo.com/sg/store/women/featured/new-arrivals.html. Similar to how I parsed the contents of the landing page, I’ve used response.css to get all the item links and print them for checking.
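Here is a minimal, standalone sketch of how yield and generators behave (plain Python, unrelated to Scrapy itself):
def count_up_to(n):
    # yield makes this function return a generator instead of a single value
    for i in range(1, n + 1):
        yield i

gen = count_up_to(3)
print(list(gen))  # [1, 2, 3]
print(list(gen))  # [] -- the generator is exhausted after one pass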
import scrapy
from scrapy_splash import SplashRequest


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['www.uniqlo.com']
    start_urls = ['https://www.uniqlo.com/sg/store//']

    def parse(self, response):
        links = response.css('#navHeader .cateNaviLink a::attr(href)').extract()
        for link in links:
            yield SplashRequest(link, self.parse_categoryLink, meta={'url': link}, args={'wait': 1})

    def parse_categoryLink(self, response):
        itemLinks = response.css(".item > a::attr(href)").extract()
        originalLink = response.meta['url']
        for itemLink in itemLinks:
            print(itemLink)
Scrape item page to extract item details
This time, let’s visit each item link using a plain Request.
1. parse_item(self, response): First, create a UniqloItem object to store what you’ll scrape from the page. Next, parse, extract and assign the contents you want from the page to the object’s respective fields. Finally, print and yield the object.
import scrapy
from scrapy import Request
from scrapy_splash import SplashRequest

from uniqlo.items import UniqloItem


class ProductsSpider(scrapy.Spider):
    name = 'products'
    allowed_domains = ['www.uniqlo.com']
    start_urls = ['https://www.uniqlo.com/sg/store//']

    def parse(self, response):
        links = response.css('#navHeader .cateNaviLink a::attr(href)').extract()
        for link in links:
            yield SplashRequest(link, self.parse_categoryLink, meta={'url': link}, args={'wait': 1})

    def parse_categoryLink(self, response):
        itemLinks = response.css(".item > a::attr(href)").extract()
        originalLink = response.meta['url']
        for itemLink in itemLinks:
            yield Request(url=itemLink, meta={'originalLink': originalLink, 'itemLink': itemLink}, callback=self.parse_item)

    def parse_item(self, response):
        item = UniqloItem()
        name = ''.join(response.css("#prodInfo > h1 > span ::text").extract()).strip()
        code = ''.join(response.css("div.basicinfo_wrap > span > ul > li.number::text").extract()).replace("ITEM CODE: ", "").strip()
        currency = ''.join(response.css("div.basicinfo_wrap > span > ul > li:nth-child(2) > div.price-box > meta::attr(content)").extract()).strip()
        price = ''.join(response.css("div.basicinfo_wrap > span > ul > li:nth-child(2) > div > p.special-price > span.price::text").extract())
        oldPrice = ''.join(response.css("div.basicinfo_wrap > span > ul > li:nth-child(2) > div > p.old-price > span.price::text").extract())
        tags = response.css('#prodInfo > div > ul.special > li::text').getall()
        colors = response.css('#listChipColor > li > a::attr(title)').getall()
        sizes = response.css('#prodSelectAttribute > #prodSelectSize > #selectSizeDetail > #listChipSize > li > a > em::text').getall()
        imageUrl = response.css('#prodImgDefault > img::attr(src)').extract()
        imageName = response.css('#prodImgDefault > img::attr(alt)').extract()
        description = ''.join(response.css('#prodDetail > div::text').extract()).strip()
        material = response.css('#prodDetail > div.content > dl.spec > dd:nth-child(2)::text').extract()
        care = response.css('#prodDetail > div.content > dl.spec > dd:nth-child(6)::text').extract()

        item["name"] = name
        item["code"] = code
        item["currency"] = currency
        item["price"] = price
        item["oldprice"] = oldPrice
        item["tags"] = tags
        item["colors"] = colors
        item["sizes"] = sizes
        item["imageUrl"] = imageUrl
        item["imageName"] = imageName
        item["material"] = material
        item["care"] = care
        item["description"] = description
        item["originalLink"] = response.meta['originalLink']
        item["itemLink"] = response.meta['itemLink']

        print(item)
        yield item
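Before wiring the results into MongoDB, you can do a quick sanity check by running the spider from the project root and exporting the scraped items to a JSON feed (the -o flag is standard Scrapy; the file name is just an example):
scrapy crawl products -o items.json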
Write items into MongoDB
Hang in there! Last mile~ The final step to process the scraped item is to push it into an Item Pipeline (refer to step 8 in Scrapy’s architecture).
1. __init__(self): Initialises the connection to the MongoDB server.
2. process_item(self, item, spider): Converts the yielded item into a dict and inserts it into MongoDB.
from uniqlo.spiders import util
from uniqlo.settings import *


class UniqloPipeline(object):
    def __init__(self):
        # Initialise the MongoDB connection
        self.db = util.set_mongo_server()

    def process_item(self, item, spider):
        try:
            if "name" in item:
                self.db[MONGODB_COLNAME].insert_one(dict(item))
        except Exception as ex:
            spider.logger.warning('Pipeline Error (others): %s %s' % (str(ex), str(item)))
        return item
Run your completed spider! It will start crawling all the items from the website. Go to MongoDB Compass to look at what your spider crawled for you. (I managed to crawl 1,765 items in 3 minutes.)
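If you’d rather check from Python than from Compass, a quick count query works too (assuming the same connection settings used above):
import pymongo

client = pymongo.MongoClient('localhost', 27017)
print(client['uniqlo']['items'].count_documents({}))  # number of items the spider stored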
Conclusion
This article showed you a quick and simple way of using Scrapy. Learning the techniques above will help you crawl most websites today.
Thank you for reading and I hope you learnt something interesting today!