Python web scraping (part 1).

What's web scraping?

Web scraping, also known as web harvesting or web crawling, can be defined simply and generally as data extraction from websites. It can be done either manually or automatically (which is our topic today) using a piece of software (or a script) conventionally called a bot, spider, spider bot or web crawler.

Web Scraping vs Web Parsing vs Web Crawling?

The three terms are sometimes inaccurately treated as the same thing, but each one has its own shade of meaning. Web scraping involves first fetching a page and then extracting data from it: fetching is the step that retrieves the page for processing, and the extraction step that follows is called parsing, which pulls out the data we want. Crawling is distinguished by the fact that it follows the tree structure (hierarchy) of links found on a page, scraping those pages as well until the required depth is reached.
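
To make the distinction concrete, below is a minimal, illustrative sketch (assuming the requests and lxml libraries introduced later in this article, and an arbitrary example start URL) that fetches a page, parses the links out of it and crawls them recursively down to a fixed depth:

import requests
from lxml import html

def crawl(url, depth=2, visited=None):
    # Fetch a page, parse its links, then follow those links recursively.
    if visited is None:
        visited = set()
    if depth == 0 or url in visited:
        return
    visited.add(url)
    response = requests.get(url, timeout=10)      # fetching (scraping, step 1)
    tree = html.fromstring(response.content)      # parsing (scraping, step 2)
    links = tree.xpath('//a/@href')               # the extracted data: links
    print(url, len(links), 'links found')
    for link in links:
        if link.startswith('http'):
            crawl(link, depth - 1, visited)       # crawling: follow the hierarchy of links

crawl('https://example.com', depth=2)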

Why web scraping?

The progress humanity has made with the internet and the abundance of every kind of data online have created a need to search, triage and compare data quickly, for all imaginable purposes, from web indexing to online shopping or climate tracking. Here is a non-exhaustive list: contact scraping, web indexing, data mining, price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashups...

Web scraping has raised controversy and legal uncertainty. The technology itself is legal, and intuitively you would not expect scraping publicly available data to cause any legal issues, but there have been lawsuits in this regard (especially in the US and the EU) in which courts ruled against web scraping, and some countries outlaw certain types of it. So any actor or business that relies on web scraping should get informed or, even better, get legal advice regarding the laws in force.

The main reasons for this ambiguous situation are that:

  • Web scrapers might send a huge number of requests, overwhelm the scraped website and even cause a crash (which would cause financial losses) that is hard to trace, since web scrapers often hide their identity.
  • Web scraping is sometimes used for business purposes to gain a competitive advantage.
  • It often ignores laws and terms of service.

Web scraping in Python

Python is one of the best languages when it comes to web scraping: it offers a large set of tools that easily cover all the common needs and most of the advanced ones. The only technology that might beat Python in this area, in my opinion, would be Node.js and its ecosystem.

Content retrieving

Python Requests

The requests library (http://docs.python-requests.org/en/master/) is well known to all Python developers and is installed with a simple pip install requests. It's usually enough for most scraping tasks, except for example those that require JavaScript rendering to work correctly.

>>> import requests
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
'{"type":"User"...'
>>> r.json()
{'private_gists': 419, 'total_private_repos': 77, ...}

PhantomJS

“PhantomJS is a headless web browser scriptable with JavaScript.” (http://phantomjs.org) It is usually used to retrieve pages that load content using JavaScript, i.e. pages that require JavaScript to render their content. Unfortunately, PhantomJS development is suspended (until further notice).
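
PhantomJS itself is scripted in JavaScript, but from Python it is usually driven through Selenium. A minimal, illustrative sketch (assuming the phantomjs binary and the selenium package are installed, and using an arbitrary example URL; note that recent Selenium releases deprecate the PhantomJS driver):

from selenium import webdriver

driver = webdriver.PhantomJS()            # start the headless browser (phantomjs must be on the PATH)
driver.get('https://example.com')         # the page is fetched and its JavaScript is executed
rendered_html = driver.page_source        # HTML after JavaScript rendering
print(rendered_html[:200])
driver.quit()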

Selenium WebDriver

Selenium is a portable software-testing framework for web applications (https://www.seleniumhq.org/projects/webdriver/). In the web scraping context it is used to perform complex tasks that require the most human-like behaviour, which can be simulated to some extent by driving regular browsers.
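
As a quick taste before the dedicated article, here is a minimal sketch (assuming Chrome and chromedriver are installed; the URL and selector are arbitrary examples, and the method names follow the Selenium 3 API):

from selenium import webdriver

driver = webdriver.Chrome()                         # drive a real Chrome browser
driver.get('https://example.com')                   # navigate like a human user would
title = driver.title                                # title of the fully rendered page
links = driver.find_elements_by_css_selector('a')   # locate elements in the rendered DOM
print(title, len(links))
driver.quit()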

Parsing

XPath

XPath expressions are used for XML content parsing. In Python, the lxml library is commonly used for that purpose (https://pypi.org/project/lxml/):

pip install lxml

Let's take this example (it could just as well be an HTML document, since lxml parses HTML too):

users_data.xml
<?xml version="1.0" encoding="UTF-8"?>
<users>
    <user id="1">
        <name>Alice</name>
        <age>28</age>
    </user>
    <user id="2">
        <name>Bob</name>
        <age>30</age>
    </user>
    <user id="3">
        <name>Celine</name>
        <age>21</age>
    </user>
    <user id="4">
        <name>Dan</name>
        <age>24</age>
    </user>
    <user id="5">
        <name>Eva</name>
        <age>33</age>
    </user>
</users>

Let's play with some XPath expressions to quickly demonstrate how this works:


# -*- coding: utf-8 -*-

>>> from lxml import etree
>>> tree = etree.parse("users_data.xml")
>>> # print all names
... for name in tree.xpath("/users/user/name/text()"):
...     print(name)
...
Alice
Bob
Celine
Dan
Eva
>>>
>>> # print user immediately after Celine
... print tree.xpath("/users/user[name/text() = 'Celine']/following::user[1]/name/text()")
['Dan']
>>> # or
... print tree.xpath("//user[name/text() = 'Celine']/following::user[1]/name")[0].text
Dan
>>>
>>> # print all ages
... print tree.xpath("//age/text()")
['28', '30', '21', '24', '33']
>>>
>>> # print ids of all users older than 25
... print tree.xpath("//user[age/text() > 25]//@id")
['1', '2', '5']
>>> # or
... print tree.xpath("//user[number(age/text()) > 25]//@id")
['1', '2', '5']
>>>
>>>
>>> # print names of all users older than 25 but younger than 30
... print(tree.xpath("//user[age/text() > 25 and age/text() < 30]/name/text()"))
['Alice']
>>>

To master XPath expressions, make sure to spend as much time as you can learning from https://www.w3schools.com/xml/xpath_syntax.asp (or any similar resource).

Regex

Regular expressions are used to select, filter and transform text; support for them is built into Python (https://docs.python.org/2/howto/regex.html).

A quick example:

Let's assume we want to extract some information from the following text:

>>> import re
>>> product_text = """ Spilihp H4 Vision. Headlighting Lamp +30% Light.
... by Spilihp
... Be the first to review this item
... Price: $6.48
... With 30% more light, Premium is the entry point of the Spilihp range.
... The quality level of the Premium lamps ensures excellent light beam performance for a very competitive price.
... Voltage: 12 V. Wattage: 60/ 55 W. Type: H4.
... """
>>> # extract product name
... print(re.search(r'^.+(?=\.\nby)', product_text).group())
 Spilihp H4 Vision. Headlighting Lamp +30% Light
>>>
>>> # extract product price
... print(re.search(r'(?<=Price: \$).+(?=\n)', product_text).group())
6.48
>>>
>>> # extract manufacturer
... print(re.search(r'(?<=by).+(?=\n)', product_text).group())
 Spilihp
>>>

For more, you can check https://developers.google.com/edu/python/regular-expressions#basic-examples (or any similar resources)

JSON parsing

A JSON encoder/decoder is included by default in Python (https://docs.python.org/2/library/json.html). You can use it to parse JSON data, which is increasingly used as web development evolves thanks to its usability and light weight compared to XML, for example.

json_data.json
[
    {
        "id" : 1,
        "name": "Alice",
        "age": "28"
    },
    {
        "id" : 2,
        "name": "Bob",
        "age": "9"
    },
    {
        "id" : 3,
        "name": "Celine",
        "age": "21"
    },
    {
        "id" : 4,
        "name": "Dane",
        "age": "24"
    },
    {
        "id" : 5,
        "name": "Eva",
        "age": "33"
    }
]
>>> import json
>>> with open('json_data.json', 'r') as f:
...     data = json.load(f)
...
>>> for x in data:
...     print(x['name'])
...
Alice
Bob
Celine
Dane
Eva

As you can see, we can use the parsed data as dictionaries, so we can do any regular dictionary selection and processing, or load it into a Python object.
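
For instance, here is a small sketch of both ideas (dictionary-style filtering and loading each record into a Python object via namedtuple), reusing the json_data.json file above; the age threshold is an arbitrary example:

import json
from collections import namedtuple

User = namedtuple('User', ['id', 'name', 'age'])

with open('json_data.json', 'r') as f:
    data = json.load(f)

# regular dictionary selection: users older than 25 (ages are stored as strings)
print([x['name'] for x in data if int(x['age']) > 25])    # ['Alice', 'Eva']

# load each record into a Python object
users = [User(x['id'], x['name'], int(x['age'])) for x in data]
print(users[0].name, users[0].age)                         # Alice 28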

Other tools

Scrapy

Scrapy is a web-crawling framework written in Python, built around "spiders": self-contained crawlers that are given a set of instructions. It tries to make it easier to build and scale large crawling projects.

https://doc.scrapy.org/en/latest/
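
As a preview of the dedicated article to come, a minimal spider could look like the following sketch (the spider name, start URL and XPath expressions are arbitrary examples; it could be run with scrapy runspider titles_spider.py):

import scrapy

class TitlesSpider(scrapy.Spider):
    # A self-contained crawler: Scrapy schedules the requests and calls parse() with each response.
    name = 'titles'
    start_urls = ['https://example.com']

    def parse(self, response):
        # extract data with XPath, just like with lxml above
        yield {'url': response.url, 'title': response.xpath('//title/text()').extract_first()}
        # follow the links found on the page and scrape them as well (crawling)
        for href in response.xpath('//a/@href').extract():
            yield response.follow(href, callback=self.parse)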

Challenges:

The main challenges you could face in web scraping are data obfuscation, JavaScript rendering (for websites that require it), and websites that try to prevent web scraping, usually by:

  • Blocking requests based on browser headers (e.g. if the site detects something like PhantomJS in the headers, it will know that it's a bot).
  • Banning your IPs or geolocation (this can be worked around by using proxies).
  • Throttling/rate limiting web requests (you'll need to evenly space your requests and retry after waiting for some time, as in the sketch after this list).
  • Disabling exposed APIs or hiding them or requiring some kind of authentication.
  • Using Captchas.
  • Relying on advanced mouse tracking algorithms to detect human/bot patterns.
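
For the throttling point above, here is a minimal sketch of evenly spaced requests with a simple wait-and-retry loop (the delays, the retry count and the URL list are arbitrary examples):

import time
import requests

def polite_get(url, retries=3, backoff=10.0):
    # Fetch a URL, waiting and retrying when the server rate-limits us.
    for attempt in range(retries):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:          # 429 = Too Many Requests
            return response
        time.sleep(backoff * (attempt + 1))      # wait longer after each refusal
    return response

for url in ['https://example.com/page/1', 'https://example.com/page/2']:
    page = polite_get(url)
    print(url, page.status_code)
    time.sleep(2.0)                              # evenly space the requests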

Next?

This has been a brief and concise first article of the "Python web scraping" series; it went through definitions, general aspects, main tools and challenges. The next two articles will be about Selenium WebDriver and Scrapy.

Moussa Idardar
2018-09-23 | 7 min read