Web Scraping with the Puppeteer JavaScript Library

This week I’d like to talk about a library I just learned about that makes web scraping super simple. Most of the time web scraping is discussed in the context of Python, but for those who are more comfortable with JavaScript, Puppeteer exists. The process is fairly similar to how web scraping is done in Python, but the library is imported and run with Node.js.

To demonstrate how to perform web scraping with Puppeteer, I want to walk you through the way I learned it, by creating an array of all of the music compositions that the Count of Saint Germain produced in his lifetime (well, one of them, I suppose).

SETUP AND INSTALLATION

To get yourself set up with a new project for our Puppeteer demonstration, go ahead and make a new directory with one file inside of it, an index.js. Next, navigate to this directory in your terminal and run the following commands:

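npm init -y
npm i puppeteer

The first command, as you may know, creates a package.json so that npm can manage the project's dependencies, and the second adds Puppeteer to the project. The puppeteer package downloads a compatible browser when it installs, so the separate puppeteer-core package (the same library without the browser download) is not needed here.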

UTILIZING PUPPETEER

Inside the index.js file, add this code, which comes from the Puppeteer documentation:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({ path: 'example.png' });

  await browser.close();
})();

This example takes a screenshot of a given page. It imports Puppeteer, then runs an async function that launches the browser, opens a new page, navigates to a given URL, saves a screenshot of the page, and closes the browser.

To adapt this code to what I mentioned above, collecting the compositions of the Count of Saint Germain, I navigated to his Wikipedia page and put that link in the page.goto() call on the 6th line of code. I changed the path of the screenshot to 'wiki.png' and passed the object { headless: false } to puppeteer.launch(), which opens a visible browser window so you can watch it work. Next, all we have to do is the scraping!
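Put together, those three changes might look something like the sketch below. The URL is my assumption of the English Wikipedia address for the Count, and captureWiki is a made-up helper name. It takes the Puppeteer module as an argument only so the flow can be tried with a stand-in object, which is a choice of this sketch and not something the documentation example does.

```javascript
// Assumption: the English Wikipedia URL for the Count of St. Germain.
// Swap in whichever page you are actually scraping.
const WIKI_URL = 'https://en.wikipedia.org/wiki/Count_of_St._Germain';

// Hypothetical helper: receives the Puppeteer module as an argument so the
// same steps can be run against the real library or a stand-in object.
async function captureWiki(puppeteerLib, url = WIKI_URL) {
  // headless: false opens a visible browser window.
  const browser = await puppeteerLib.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(url);
  await page.screenshot({ path: 'wiki.png' });

  await browser.close();
}

// In index.js you would call it with the real library:
// captureWiki(require('puppeteer'));
```

Writing the three changes directly into the documentation example, as described above, works just as well.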

SCRAPING PAGES

To find what to scrape, head over to the “Elements” tab of your browser’s DevTools. Here we can dig into the HTML that represents the page. On this particular page, we can see that the entries in the “Music by the Count” section are all “li” elements.

However, there are many other “li” elements all over the page. How did I go about getting just the music compositions? In the code, I made a page.evaluate() call, which runs a function inside the page itself and so can get at its elements. All I had to do inside of that function was select the “li” elements, spread them into an array, map over them to get their inner text, and filter the result down to the entries which contain “Op.”.

If you want to give this a shot on your own, feel free. If not, here is my solution:

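const result = await page.evaluate(() => {
  // Runs in the browser: collect every <li> element on the page
  const items = document.querySelectorAll('li');
  const itemArray = [...items];
  // Pull out the visible text of each element
  const textArray = itemArray.map(item => item.innerText);
  // Keep only the entries that mention an opus number
  return textArray.filter(text => text.includes('Op.'));
});
console.log(result);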
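If you would like to try the filtering step without launching a browser, one option is to pull it out into a plain function and run it on an ordinary array in Node. This is only a sketch: pickCompositions is a made-up name, and the sample strings are placeholders, not real titles from the page.

```javascript
// Hypothetical helper: keeps only the strings that contain the marker text.
function pickCompositions(texts, marker = 'Op.') {
  return texts.filter((text) => text.includes(marker));
}

// Made-up sample data standing in for the innerText of the page's <li> elements.
const sampleTexts = [
  'Placeholder sonata, Op. 1',
  'Main page',
  'Placeholder aria, Op. 4',
  'Privacy policy',
];

// Logs only the two placeholder entries that contain 'Op.'
console.log(pickCompositions(sampleTexts));
```

To use it in the script, page.evaluate() would return the whole textArray and the filter would then run in Node on the result. The helper cannot be called from inside the page.evaluate() callback, because that callback runs in the browser, where functions defined in index.js do not exist.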

I added this between the 7th & 9th lines of code, after the screenshot and before browser.close(). If you have this in your file, head back to your terminal and run node index.js to see the resulting array printed.

Just like that, a bunch of weird titles by an extremely weird guy!
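One thing I would flag if you keep experimenting: if the scraping step throws an error, the script never reaches browser.close(), and with { headless: false } that leaves a browser window hanging around. A minimal sketch of one way to guard against that, assuming a made-up helper called withBrowser that takes the Puppeteer module and a task function (neither name comes from the Puppeteer documentation):

```javascript
// Hypothetical helper: launches a browser, runs the task, and always closes
// the browser afterwards, whether the task finishes or throws.
async function withBrowser(puppeteerLib, task) {
  const browser = await puppeteerLib.launch({ headless: false });
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}

// Usage with the real library (commented out so this file loads without it):
// withBrowser(require('puppeteer'), async (browser) => {
//   const page = await browser.newPage();
//   await page.goto('https://example.com');
//   return page.title();
// }).then(console.log);
```

The try/finally block is plain JavaScript, so the same idea also works written directly into the async function in index.js.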

I hope this has been an illuminating introduction to web scraping via Puppeteer for you! Feel free to ask any questions below!

Lucas Thinnes
