
03 Scraping

Web crawling, data crawling, and web scraping are all names for the process of data extraction. With this technique, data is extracted from various website pages and repositories, then saved and stored for further use and analysis. Basically, web scraping copies the content of a web page and delivers the raw data of your choice in a specific structured format.

Advantages

  1. Up-to-date data
  2. No rate limitations to scraping
  3. Customization and well-structured data
  4. Anonymous

Document Object Model (DOM)

“The W3C Document Object Model (DOM) is a platform and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure, and style of a document.”

In other words, it is a standard for how to get, change, add, or delete HTML elements.

When a web page is loaded, the browser creates a Document Object Model of the page.

In particular, it defines:

  • The HTML elements as objects
  • The properties of all HTML elements
  • The methods to access all HTML elements
  • The events for all HTML elements
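
The same tree-of-objects idea carries over to the parsers used for scraping. As a minimal sketch (the HTML snippet, tag names, and attributes below are made up for illustration), BeautifulSoup exposes every HTML element as a Python object with properties and methods, much like the DOM does in the browser:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration
html = "<html><body><h1 id='title'>Hello</h1><p class='intro'>First paragraph</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1")          # elements are objects
print(h1.name)                # property: tag name -> "h1"
print(h1["id"])               # property: attribute value -> "title"
print(h1.get_text())          # method: extract the text -> "Hello"

for p in soup.find_all("p"):  # method: access all matching elements
    print(p["class"], p.get_text())
```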

Protections

Some websites have protections against scraping. They are able to detect unusual behaviours (e.g., repeated scrolling) and block the scraper.

Static pages

We define static pages as HTML pages that do not change according to the user's interactions with the page (they do not have dynamic content).

The problem to solve is to correctly and efficiently process HTML pages to extract valuable information. Since HTML code is tagged text, any tool that can manipulate and process text could potentially be used to parse a page.

Scraping is done in two steps:

  • Retrieving a static page: many tools support this action:
    • curl, using command line
    • requests, using Python scripts
  • Parsing the page: extracting data that matches specific patterns (see the sketch after this list):
    • XPath
    • Regular Expressions
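
As a minimal sketch of both steps, assuming `requests` and `lxml` are installed and using a placeholder URL and a hypothetical class name, a static page can be retrieved and then parsed with XPath and a regular expression:

```python
import re
import requests
from lxml import html

# Step 1: retrieve the static page (placeholder URL, assumed to allow scraping)
response = requests.get("https://example.com/articles", timeout=10)
response.raise_for_status()

# Step 2a: parse with XPath, e.g. extract all article titles
tree = html.fromstring(response.text)
titles = tree.xpath("//h2[@class='title']/text()")  # hypothetical class name

# Step 2b: parse with a regular expression, e.g. extract all absolute links
links = re.findall(r'href="(https?://[^"]+)"', response.text)

print(titles)
print(links)
```

The retrieval step could equally be done from the command line, e.g. `curl https://example.com/articles > page.html`, and the saved file parsed afterwards.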

Dynamic Pages

What do we need to do if the page loads the content we need only after clicking a specific field or while scrolling the page itself (e.g., Facebook and Instagram posts)?

We need to simulate a full-blown browser that interacts with the page and, through programmable actions, reaches the point we want. Then we can finally download the HTML and parse the content.

Website updates

Most of the time, the HTML changes over time: the classes often change, and sometimes the structure of the HTML does too. Therefore, it is really important to keep the scraper updated.

Scraping in practice

Python is one of the best languages for building scrapers: it is easy to learn, fast to code in, and has plenty of libraries to get anything done efficiently.

Two modules are needed to set up a scraper (a combined sketch follows the list):

  • Selenium: a module developed for automating website testing. It simulates clicks, drags, scrolls, and any other possible interaction with a website.
  • BeautifulSoup: an HTML (and XML) parser module.
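
As a rough sketch of how the two modules fit together (the URL, CSS selector, and scroll count are placeholders, and a compatible browser driver is assumed to be installed), Selenium drives the page to load the dynamic content and BeautifulSoup parses the resulting HTML:

```python
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()             # assumes a Chrome driver is available
driver.get("https://example.com/feed")  # placeholder URL

# Simulate user interaction: scroll a few times so dynamic content loads
for _ in range(3):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)                       # wait for new content to appear

# Download the HTML as rendered by the browser and parse it
soup = BeautifulSoup(driver.page_source, "html.parser")
posts = [p.get_text(strip=True) for p in soup.select("div.post")]  # hypothetical selector
driver.quit()

print(posts)
```

The same parsing logic works on HTML obtained either way, so the Selenium step can be swapped for a plain `requests` call whenever the page is static.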