Sound Decisions: Data Scraping and Cleaning 🎧
Lately, I’ve been on the hunt for a new pair of headphones. Instead of spending countless hours searching online for the best pair, I decided to put my data science skills to good use!
To begin, I scraped data from Amazon to create a dataset of headphone products that I could analyse to find the perfect pair.
In this blog post, I’ll share how I tackled web scraping, discuss why it’s such a valuable tool and explain the steps I took to clean the data. This will be the first post in a series where I’ll be sharing my journey in finding the perfect headphones.
Web Scraping
Web scraping is a method used to automatically collect, or ‘scrape’, data from websites. It involves writing scripts to extract specific information, allowing users to create custom datasets tailored to their needs.
While I’ve used many datasets from Kaggle to build my portfolio, I have come to recognise the importance of collecting my own data. Not only does it offer greater flexibility for projects like this one, but it has also been a great way to get hands-on experience with real-world data!
How To Web Scrape
Step 1: Assess Robots.txt
The first step in web scraping is to assess the robots.txt file of the website you plan to scrape from. This file outlines the permissions for what can and cannot be scraped, helping you avoid any violations.
To access the robots.txt file, simply go to the base URL of the website and append /robots.txt, like this: https://www.amazon.co.uk/robots.txt
After reviewing Amazon’s file, I found that scraping individual customer reviews was not allowed. This slightly changed the scope of my project, since I had initially planned to carry out sentiment analysis on those reviews.
Step 2: Define Base URL
Next, navigate to the specific URL you want to scrape from. For this project, I focused on the search results for ‘adult headphones’. Breaking the URL down:

- `https://www.amazon.co.uk/`: URL of the home page
- `s?keywords=adult+headphones`: search results for ‘adult headphones’
- `i=electronics`: searches within the ‘Electronics’ category
- `page=`: Amazon spreads its results across multiple pages, so I append `page=` to the URL and specify the page number later when scraping
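Putting these pieces together, the base URL looks something like this (the exact parameter order may vary):

```python
# Search results URL; the page number is appended for each request
base_url = "https://www.amazon.co.uk/s?keywords=adult+headphones&i=electronics&page="
```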
Step 3: Scrape!
To begin, I loop through the pages until I reach a predefined limit, which I set using the variable `num_pages`. For each page, I append the current page number to the base URL, which constructs the URL for each specific page I want to scrape. I also included print statements to track the progress of the scraping.
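A minimal sketch of that loop, assuming the `base_url` above and an illustrative `num_pages` of 10:

```python
num_pages = 10  # assumed page limit for this sketch

for page in range(1, num_pages + 1):
    url = base_url + str(page)
    print(f"Scraping page {page} of {num_pages}...")
    # request and parse the page here (see the following steps)
```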
I then define headers to mimic a request from a web browser. Without these headers, Amazon’s server may block the request, as it can detect automated scraping attempts.
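For illustration, such headers might look like this (the User-Agent string here is just an example of a typical browser string):

```python
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-GB,en;q=0.9",
}
```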
If the request is successful, I parse the content using the BeautifulSoup package. This library allows you to navigate and extract data from the page’s HTML.
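A sketch of the request-and-parse step, assuming the `url` and `headers` defined above:

```python
import requests
from bs4 import BeautifulSoup

response = requests.get(url, headers=headers)

if response.status_code == 200:
    # Parse the page's HTML so we can search it for product elements
    soup = BeautifulSoup(response.content, "html.parser")
else:
    print(f"Request failed with status code {response.status_code}")
```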
To scrape the products returned by the search, you need to specify which HTML elements BeautifulSoup should read. This requires inspecting the webpage to identify the elements that hold the product information.
For example, by using Inspect Element, I found the div containing the search results.
Next, I use `.find_all` from BeautifulSoup to find all div elements with a `data-component-type` attribute that identifies them as search result items. This is where all the product information is stored.
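In code, that looks roughly like this:

```python
# Each search result on the page is a div with this attribute value
all_headphones = soup.find_all("div", attrs={"data-component-type": "s-search-result"})
```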
The following code loops through each headphone in `all_headphones` and extracts the key product details: the product ID, description, price, rating and Prime eligibility.
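A simplified sketch of that loop; the tag and class names here are assumptions based on Amazon’s markup at the time of writing, so they may need adjusting:

```python
for headphone in all_headphones:
    # The ASIN is stored in the data-asin attribute of the result div
    product_id = headphone.get("data-asin")

    # Product title/description
    description_tag = headphone.find("h2")
    description = description_tag.get_text(strip=True) if description_tag else None

    # Price (the 'a-offscreen' span holds the full price text)
    price_tag = headphone.find("span", class_="a-offscreen")
    price = price_tag.get_text(strip=True) if price_tag else None

    # Rating text, e.g. '4.5 out of 5 stars'
    rating_tag = headphone.find("span", class_="a-icon-alt")
    rating = rating_tag.get_text(strip=True) if rating_tag else None

    # Prime eligibility (presence of the Prime badge icon)
    is_prime = headphone.find("i", class_="a-icon-prime") is not None
```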
To gather the HTML elements for the information I wanted to extract, I again used Inspect Element. By right-clicking on specific items like product descriptions, prices and ratings, I identified the relevant HTML tags and classes.
Step 4: Store the scraped data
I used a dictionary to store all the information I wanted in my dataset:
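Something along these lines, with one list per column in the final dataset (the column names mirror the data dictionary below):

```python
# One list per column; values are appended inside the extraction loop
data = {
    "product_id": [],
    "description": [],
    "price": [],
    "rating": [],
    "is_prime": [],
}

# e.g. inside the loop:
# data["product_id"].append(product_id)
```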
Step 5: Time delay
When scraping data, it’s best practice to introduce a time delay between each request (in this case, after each page) to avoid overloading the server. This delay also helps the web scraping script mimic more human behaviour—such as taking time to browse.
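A randomised delay tends to look more natural than a fixed one; a minimal sketch:

```python
import random
import time

# Pause for a few seconds between page requests to avoid overloading the server
time.sleep(random.uniform(2, 5))
```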
Step 6: Exporting results to CSV
Once done, I create a dataframe from the dictionary and export it to a CSV, ready for cleaning.
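A sketch of the export step, assuming the `data` dictionary above (the filename is just an example):

```python
import pandas as pd

df = pd.DataFrame(data)
df.to_csv("headphones_raw.csv", index=False)
```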
Data Cleaning
Once data has been scraped from a website like Amazon, the raw data is often messy and requires cleaning before any analysis to ensure insights are accurate and reliable.
First, I checked for missing values and duplicates in the dataset. I removed any duplicates and incomplete rows to ensure a complete dataset.
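A sketch of those checks using pandas, assuming the CSV exported earlier:

```python
import pandas as pd

df = pd.read_csv("headphones_raw.csv")  # assumed filename from the export step

# Inspect missing values and duplicates
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # number of duplicate rows

# Remove duplicates and incomplete rows
df = df.drop_duplicates().dropna().reset_index(drop=True)
```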
After the preliminary cleaning, I decided it was best to extract as much information as possible from the product description.
Feature Engineering using Product Description
Feature Engineering is the process of creating new columns from data already present in the dataset.
For each of the following features, I used regular expressions to extract the relevant details from the product descriptions (a simplified sketch follows the list):
Wireless
I searched the product description for the term ‘wireless’ and created a binary column to indicate whether the product was wireless or not.
Noise Cancelling
Similarly, I searched for the term ‘noise cancelling’ in the description to identify whether this feature was present.
Colour
Extracted colour information from the description using a list of common colours.
Battery Life
Extracted battery life in hours using regular expressions to find numerical values associated with time.
Microphone
Looked for terms like ‘mic’ or ‘microphone’ to identify if the product had a microphone.
Over Ear
Identified ‘over-ear’ products based on related keywords in the description.
Foldable
Flagged products as foldable based on terms like ‘foldable’.
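Illustrative versions of a few of these extractions, assuming a ‘Description’ column; the exact patterns I used are in the project repository:

```python
import re

# Binary flags from case-insensitive keyword searches
df["Wireless"] = df["Description"].str.contains(r"wireless", case=False, na=False).astype(int)
df["Noise Cancelling"] = df["Description"].str.contains(r"noise[ -]cancell?ing", case=False, na=False).astype(int)
df["Microphone"] = df["Description"].str.contains(r"\bmic(?:rophone)?\b", case=False, na=False).astype(int)

# Colour from a list of common colours
colours = ["black", "white", "blue", "red", "pink", "green", "grey"]
colour_pattern = r"\b(" + "|".join(colours) + r")\b"
df["Colour"] = df["Description"].str.extract(colour_pattern, flags=re.IGNORECASE)[0].str.lower()

# Battery life: first number followed by 'hours' or 'hrs'
df["Battery Life"] = df["Description"].str.extract(r"(\d+)\s*(?:hours|hrs)\b", flags=re.IGNORECASE)[0].astype(float)
```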
Brand
I also tried to extract the brand names from the product descriptions using SpaCy, an NLP library. However, most brand names in the dataset were either unrecognised or incorrectly identified as common words like ‘wireless’ or ‘bluetooth’. As a result, I excluded the product brand from the dataset, as the results were not reliable enough to be useful.
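For context, the attempt looked roughly like this (a sketch, not the exact implementation, which is in the repository):

```python
import spacy

# Small English model; must be downloaded first:
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_brand(description):
    """Return the first organisation entity spaCy finds, if any."""
    doc = nlp(description)
    for ent in doc.ents:
        if ent.label_ == "ORG":
            return ent.text
    return None

df["Brand"] = df["Description"].apply(extract_brand)
```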
For examples of the regular expressions used for feature extraction and details on SpaCy implementation, please refer to the project repository (link to be provided once project is complete).
Data Dictionary
| Column | Description |
|---|---|
| Product ID | Product ASIN code |
| Description | Product description |
| Price | Product price (£) |
| Rating | Product rating on a scale of 1 to 5 |
| Is Prime | Binary field showing Prime eligibility |
| Wireless | Binary field indicating if the product is wireless |
| Noise Cancelling | Binary field indicating if the product is noise cancelling |
| Colour | Product colour |
| Battery Life | Product battery life in hours |
| Microphone | Binary field indicating if the product has a microphone |
| Over Ear | Binary field indicating if the product is over-ear |
| Foldable | Binary field indicating if the product is foldable |
Summary
In Part 1 of the Sound Decisions series, I discussed what web scraping is, its relevance and how to do it. I also covered how I processed the scraped data and carried out feature engineering to enhance the dataset.
In the next post, I will focus on exploratory data analysis (EDA) and gathering insights from the data.