What's web scraping?
Web scraping, also known as web harvesting or web crawling, can be defined simply and generally as data extraction from websites. It can be done either manually or, as is our topic today, automatically, using a piece of software (or a script) conventionally called a bot, spider, spider bot or web crawler.
Web Scraping vs Web Parsing vs Web Crawling?
The three terms are sometimes inaccurately treated as the same thing, but each has its own shade of meaning. Web scraping involves first fetching a page and then extracting data from it: fetching is the step that retrieves pages for processing, and the later step, called parsing, extracts the data we want. Crawling is distinguished by the fact that it follows the tree structure (hierarchy) of links on a given page and scrapes those pages as well, until the needed depth is reached.
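To make the fetch/parse distinction concrete, here is a minimal sketch using only the standard library; the sample markup and the LinkExtractor class are illustrative assumptions (a real scraper would first fetch the document over HTTP, e.g. with urllib.request):

```python
from html.parser import HTMLParser

# Stand-in for the fetching step: in practice this string would come
# from an HTTP request; here it is a static, made-up page.
page = "<html><body><h1>Offers</h1><a href='/item/1'>Item 1</a></body></html>"

class LinkExtractor(HTMLParser):
    """Parsing step: pull every href out of the fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkExtractor()
parser.feed(page)       # parse the fetched document
print(parser.links)     # the extracted data: ['/item/1']
```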
Why web scraping?
The progress humanity has made with the internet, and the abundance of every kind of data online, have created a need to search, triage and compare data quickly for all imaginable purposes, from web indexing to web shopping or climate tracking. Here's a non-exhaustive list: contact scraping, web indexing, data mining, price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashups…
Web scraping has raised controversy and legal uncertainty: the technology itself is legal, and you wouldn't intuitively expect that scraping publicly available data could generate legal issues, yet there have been lawsuits in this regard (especially in the US and the EU) in which courts ruled against web scraping, and some countries outlaw certain types of it. So any actor or business that relies on web scraping should obviously get informed or, even better, seek legal advice regarding the laws in force.
The main reasons for this ambiguous situation are that:
- Web scrapers might send a huge number of requests, overwhelm the scraped website and even cause a crash (leading to financial losses) that is hard to trace, since web scrapers often hide their identity.
- It is sometimes being used for business purposes to gain a competitive advantage.
- It often ignores laws and terms of service.
Web scraping in Python
Python is one of the best languages when it comes to web scraping: it offers a large set of tools that easily cover all the average needs and most of the advanced ones. The only technology that might rival Python, in my opinion, would be Node.js and its ecosystem.
Selenium is a portable software-testing framework for web applications. In the web scraping context, it's used to perform complex tasks that require the most human-like behavior, which can be simulated to some extent by driving regular browsers: https://www.seleniumhq.org/projects/webdriver/.
XPath expressions are used to parse XML content; in Python, lxml is used for that purpose: https://pypi.org/project/lxml/
Let's take this example (it could be an HTML document too, since lxml can parse HTML as well)
Let's play with some XPath expressions to quickly demonstrate how they work
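Since the sample document isn't reproduced here, the following is a minimal sketch over a made-up catalog; it uses the standard library's xml.etree.ElementTree, which supports a useful subset of XPath (lxml implements the full syntax):

```python
import xml.etree.ElementTree as ET

# Illustrative XML document (an assumption, not the article's original example)
doc = """
<catalog>
  <book id="b1"><title>Python 101</title><price>20</price></book>
  <book id="b2"><title>Scraping 201</title><price>35</price></book>
</catalog>
"""

root = ET.fromstring(doc)

# Equivalent of //book: every <book> element anywhere under the root
titles = [b.find("title").text for b in root.iter("book")]
print(titles)  # ['Python 101', 'Scraping 201']

# Predicate selection: the book whose id attribute is "b2"
book = root.find(".//book[@id='b2']")
print(book.find("title").text)  # 'Scraping 201'
```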
To master XPath expressions, make sure to spend as much time as you can learning from https://www.w3schools.com/xml/xpath_syntax.asp (or any similar resource)
Regular expressions are used to select, filter and transform text, and support for them comes built into Python: https://docs.python.org/2/howto/regex.html
Let's assume we want to extract some information from a given text
For more, you can check https://developers.google.com/edu/python/regular-expressions#basic-examples (or any similar resources)
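As the original sample text isn't shown here, the sketch below works on an assumed snippet; it extracts email addresses, a classic scraping task, and then transforms the matches with re.sub:

```python
import re

# Illustrative text (an assumption, not the article's original example)
text = "Contact us at sales@example.com or support@example.org for details."

# One capture group for the user part, one for the domain
pattern = re.compile(r"([\w.+-]+)@([\w-]+\.[\w.]+)")

# Selecting: findall returns one (user, domain) tuple per match
emails = pattern.findall(text)
print(emails)  # [('sales', 'example.com'), ('support', 'example.org')]

# Transforming: keep the user part, hide the domain
print(pattern.sub(r"\1@[hidden]", text))
```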
A JSON encoder/decoder is included in Python by default: https://docs.python.org/2/library/json.html. You can use it to parse JSON data, which is increasingly used as web development evolves, thanks to its usability and light weight compared to XML, for example.
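A minimal sketch, with an assumed JSON payload such as an API might return:

```python
import json

# Illustrative JSON payload (an assumption, not the article's original example)
raw = '{"products": [{"name": "lamp", "price": 9.5}, {"name": "desk", "price": 120}]}'

data = json.loads(raw)  # decodes into plain Python dicts and lists
cheap = [p["name"] for p in data["products"] if p["price"] < 100]
print(cheap)  # ['lamp']
```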
As you can see, we can use the parsed data as a dictionary, so we can do any regular dictionary selection and processing, or load it into a Python object.
Scrapy is a web-crawling framework written in Python, built around "spiders": self-contained crawlers that are given a set of instructions. It tries to make it easier to build and scale large crawling projects.
Websites deploy various countermeasures against scrapers, such as:
- Blocking requests based on browser headers (e.g. if the server detects something like PhantomJS in the headers, it will know it's a bot).
- Banning your IP addresses or geolocation (which can be overcome by using proxies).
- Throttling/rate-limiting web requests (you'll need to space your requests evenly and retry after waiting for some time).
- Disabling exposed APIs, hiding them, or requiring some kind of authentication.
- Using CAPTCHAs.
- Relying on advanced mouse-tracking algorithms to detect human/bot patterns.
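To cope with throttling, requests can be spaced out and retried with an increasing delay. A minimal sketch of exponential backoff, where fetch() is a hypothetical stand-in for the real HTTP call (here it fails twice to simulate rate limiting):

```python
import random
import time

def fetch(url):
    """Hypothetical HTTP call; a real scraper would use urllib or similar.
    Simulates a server that rejects the first two attempts."""
    fetch.calls += 1
    if fetch.calls < 3:
        raise IOError("429 Too Many Requests")
    return "<html>ok</html>"
fetch.calls = 0

def polite_get(url, retries=5, base_delay=0.01):
    """Retry with exponential backoff plus a little random jitter."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except IOError:
            # Wait longer after each failure: base, 2*base, 4*base, ...
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("giving up on " + url)

print(polite_get("https://example.com"))  # '<html>ok</html>' on the third attempt
```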
This has been a brief and concise first article of the "Python web scraping" series; it went through definitions, general aspects, main tools and challenges. The next two articles will be about Selenium WebDriver and Scrapy.