Web Scraping with Python
What is Web Scraping?
Web scraping is the process of extracting data from websites. All the job is carried out by a piece of code which is called a “scraper”. First, it sends a “GET” query to a specific website. Then, it parses an HTML document based on the received result. After it’s done, the scraper searches for the data you need within the document, and, finally, converts it into the specified format.
What is ETL?
ETL is a process that extracts the data from different source systems, then transforms the data (like applying calculations, concatenations, etc.) and finally loads the data into the Data Warehouse system. ETL is short for extract, transform, load, three database functions that are combined into one tool to pull data out of one database and place it into another database. The ETL process requires active inputs from various stakeholders including developers, analysts, testers, top executives and is technically challenging.
Why do you need ETL?
- Transactional databases cannot answer complex business questions that can be answered by ETL example.
- A Data Warehouse provides a common data repository
- ETL provides a method of moving the data from various sources into a data warehouse.
- ETL process allows sample data comparison between the source and the target system.
- ETL process can perform complex transformations and requires the extra area to store the data.
- ETL helps to Migrate data into a Data Warehouse. Convert to the various formats and types to adhere to one consistent system.
- ETL is a predefined process for accessing and manipulating source data into the target database.
- ETL in data warehouse offers deep historical context for the business.
What is Extract?
- Extract is the process of reading data from a database. In this stage, the data is collected, often from multiple and different types of sources
Step by step ETL process
1.0 Extract the Data
In this step of ETL architecture, data is extracted from the source system into the staging area. Transformations if any are done in the staging area so that performance of source system is not degraded.
The website used for this project is TUMI . This is one of those that is legal to do web scraping. There are many types of bags’ category from this website. For this project, we are going to use the category of travel backpack.
The attributes of the travel backpack we have extracted from website Tumi.my are:
The first step to extract data using scrapy is by installing the scrapy package.
Then, import the library.
Scrapy provides Item pipelines that enable us to write functions in the spider which it can process our data. So, to extract data from the website, create the spider. In the spider, create a name which is the URL of the website of data we want to extract and a selector which is the part that will be extracted.
2.0 Transform the Data
In this process, we will explain the transformation in general. Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.
Data extracted from the source server is raw and not usable in its original form. Therefore it needs to be cleansed, mapped, and transformed. In the transformation step, we are going to perform a few operations on the data.
Cleaning task 1: Convert Color to lowercase
For converting color dataset into lower case. The color attribute has unbalanced capital letters such as BLACK and NAVY. So we change it to lowercase to help read them smoothly.
Cleaning 2: convert width, depth and weight columns to float
Based on the column in width, depth, and weight it will write ‘in’ and ‘lbs’ for every row. This will deform the scenery of the row. So we decide to set them to float so that can be read clearly.
Cleaning 3: To check whether the column is null or not
In this cleaning part, we cant to check if there are any null data in our data set to confirm if the data become null. In Tumi.com they will update new products weekly and it may cause any data to become null.
Cleaning 4: Attributes identification
In this process, some of the attributes need to be shown as a confirmation so that we could see every attributes whether is it object, float, or string.
Cleaning 5 : Matching to correct inconsistent data entry
In this part, the consistency of data is to arrange data to be synchronized which is recycled ballistic nylon actually is a nylon type but with extra materials. With the correction of data this will help to set data to be stacked better.
2.1 Data visualization
Histogram: Bags against the depth of the bags.
Box Plot: The Comparison of Bags Width and Bags Depth
Line Plot : The Comparison of Bags Width and Bags Depth
Loading data into the target data warehouse database is the last step of the ETL process. The load is a process of writing the data into the target database. In a typical Data warehouse, a huge volume of data needs to be loaded in a relatively short period (nights). Hence, the load process should be optimized for performance. For the final process of this scraping and data cleaning session, we store the data in a CSV file and download it. The coding is as below:
Thank you for reading our sharing on data scrap based on Tumi.my. The complete code can be found here
Special thanks to our lecturer Dr Sabrina & Dr Lalilatul, who guided us throughout the subject and my friends Alif, Aqilah, Amira and Rahal, who were the backbone in this project alongside me.