Web Scraping with the Puppeteer JavaScript Library
This week I’d like to talk about a library I just learned about that makes web scraping super simple. Most of the time web scraping is referenced in the context of Python, but for those who are more comfortable with using JavaScript libraries, Puppeteer exists. The process is fairly similar to how web scraping is done in Python, but it uses Node syntax to import and run.
To demonstrate how to perform web scraping with Puppeteer, I want to walk you through the way I learned it, by creating an array of all of the music compositions that the Count of Saint Germain produced in his lifetime (well, one of them, I suppose).
SETUP AND INSTALLATION
To get yourself set up with a new project for our Puppeteer demonstration, go ahead and make a new directory containing a single file, index.js. Next, navigate to this directory in your terminal and run the following commands:
npm init -y
npm i puppeteer
The first command, as you may know, initializes a new npm project, and the second adds Puppeteer (which downloads a bundled Chromium build for you). Note that puppeteer-core is a separate, lighter package that ships without a browser; you do not need to install both.
UTILIZING PUPPETEER
Inside the index.js file, add this code which appears on the documentation:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });
  await browser.close();
})();
This snippet takes a screenshot of a given page. It does so by first importing Puppeteer, then running an async function expression that launches a browser, opens a new page, navigates to the given URL, screenshots the page, and closes the browser.
To modify this code to accomplish what I mentioned above, obtaining the compositions of the Count of Saint Germain, I navigated to his Wikipedia page and swapped its URL into the page.goto() call. I changed the screenshot path to 'wiki.png' and passed { headless: false } to puppeteer.launch() so a visible browser window opens while the script runs. Next, all we have to do is the scraping!
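With those changes in place, the script looks something like this (the URL below is the Count's Wikipedia article as I write this; adjust it if the article has moved):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // headless: false opens a visible browser window so you can watch the scrape
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://en.wikipedia.org/wiki/Count_of_St._Germain');
  await page.screenshot({ path: 'wiki.png' });
  await browser.close();
})();
```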
SCRAPING PAGES
To scrape the page, head over to the “Elements” tab of your browser’s DevTools. Here we can dig into the HTML that makes up the page. On this particular page, we can see the “Music by the Count” section is made up of “li” elements:
However, there are many other “li” elements all over the page, so how did I go about getting just the music compositions? In the code, I made a page.evaluate() call to reach into the DOM. Inside it, all I had to do was select the “li” elements, spread the NodeList into an array, map over it to get each element’s inner text, and filter the result down to the entries containing “Op.”.
If you want to give this a shot on your own, feel free; if not, here is my solution:
const result = await page.evaluate(() => {
  const headers = document.querySelectorAll('li');
  const headerArray = [...headers];
  const textArray = headerArray.map(header => header.innerText);
  return textArray.filter(header => header.includes('Op.'));
});
console.log(result);
I added this between the page.screenshot() and browser.close() calls. Once you have it in your file, go ahead and head back to your terminal and run node index.js to check it out:
Just like that, a bunch of weird titles by an extremely weird guy!
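If you want to see why the filtering works without firing up a browser, the transformation inside page.evaluate() is plain JavaScript at heart. Here is the same pipeline run on a few made-up strings standing in for the inner text of the page’s list items (the titles below are hypothetical, not actual compositions):

```javascript
// Hypothetical stand-ins for the innerText of the page's <li> elements
const items = [
  'Sonata in G major, Op. 1',
  'External links',
  'Six Sonatas for Two Violins, Op. 2',
  'References',
];

// Same idea as inside page.evaluate(): keep only entries containing "Op."
const compositions = items.filter(item => item.includes('Op.'));
console.log(compositions);
// → ['Sonata in G major, Op. 1', 'Six Sonatas for Two Violins, Op. 2']
```

Navigation links like “External links” and “References” get dropped, and only the opus-numbered titles survive.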
I hope this has been an illuminating introduction to web scraping via Puppeteer for you! Feel free to ask any questions below!