Simple web scraping with Node.js / JavaScript

Following up on my popular tutorial on how to create an easy web crawler in Node.js, I decided to extend the idea a bit further by scraping a few popular websites. For now, I'll just append the results of web scraping to a .txt file, but in a future post I'll show you how to insert them into a database. Web scraping is a technique used to extract data from websites using a computer program that acts as a web browser. The program requests pages from web servers in the same way a web browser does, and it may even simulate a user logging in to obtain access.
If you’ve ever visited a website and thought the information was useful but the data wasn’t available through an API, well, I have some good news for you. You can scrape that website data using Node.js!
Web scraping refers to collecting data from a website without relying on its API or any other service. If you can visit a website in your browser, then you can visit that website through code.
All websites are built from HTML, CSS, and Javascript. If you open up the developer tools on a website, then you’ll see the HTML code of that website.
So to scrape the data from a website using web scraping methods, you’re getting HTML data from that website, and then extracting the content you want from the HTML.
This article will guide you through an introduction to web scraping using Javascript and Node.js. You’ll create a Node.js script that visits HackerNews and saves the post titles and links to a CSV file.
Important concepts
Before we get started, you should know about a few concepts relevant to web scraping. These concepts are DOM elements, query selectors, and the developer tools/inspector.
DOM Elements
DOM elements are the building blocks of HTML code. They make up the content of an HTML website and can be headings, paragraphs, images, and many other elements. When scraping websites, you find content by searching for the DOM elements it’s defined within.
Query Selectors
Query selectors are methods available in the browser and Javascript that allow you to select DOM elements. After you select them, you can read their data or manipulate them, such as changing the text or CSS properties. When scraping the web, you use query selectors to pick out the DOM elements you want to read from.
Developer Tools
Chrome, Firefox, and other browsers have developer tools built in that make it easier to work with websites. You can find the DOM elements of the content you want using the developer tools and then select them with code.
Different tools/libraries you can use for web scraping
There are many different tools and libraries you can use to scrape the web using Javascript and Node.js. Here is a list of some of the most popular tools.
Cheerio
Cheerio is a library that allows you to use jQuery-like syntax on the server. Cheerio is often paired with a library like request or request-promise to fetch HTML and then query it with jQuery-style selectors on the server.
Nightmare.js
Nightmare.js is a high-level browser automation library that can be used to interact with a website before you scrape it. For example, you may want to fill in a form and submit it before scraping the page. Nightmare.js lets you do this with an easy-to-use API.
Puppeteer
Puppeteer is a Node.js library that can run headless Chrome to do automation tasks. Puppeteer can do things such as:
- Generate screenshots and PDFs of pages.
- Automate form submission, UI testing, keyboard input, etc.
- Test Chrome Extensions.
- And more.
Axios
Axios is a popular library for making requests over the web. It’s robust and has many features, and making a simple request to get a website’s HTML content is easy. It’s often used in combination with a library like Cheerio for scraping the web.
Tutorial
In this tutorial, we’ll be scraping the front-page of HackerNews to get the post titles and links and save them to a CSV file.
Prerequisites
- Node.js installed on your computer.
- Basic understanding of Javascript and Node.js.
1. Project setup
To start, we’ll need to setup a Node.js project. In your terminal, change directories into an empty directory and type:
yarn init -y
Or
npm init -y
This initializes a new Node.js project. The -y flag skips all the questions that a new project asks you.
We’ll need to install two dependencies for this project: Cheerio and Axios.
In your terminal, type:
yarn add cheerio axios
That will install the packages in your project.
Now let’s get something printing on the screen.
Create a new file called scraper.js in your project directory and add the following code to the file:
```js
console.log('Hello world!');
```
Next, in your terminal run the command:
node scraper
And you should see the text Hello world! in your terminal.
2. See what DOM elements we need using the developer tools
Now that our project is set-up, we can visit HackerNews and inspect the code to see which DOM elements we need to target.
Visit HackerNews and right-click on the page and press “Inspect” to open the developer tools.
That’ll open up the developer tools panel.
Since we want the title and URL, we can search for their DOM elements by pressing Control + Shift + C to enter element-selection mode. When you then hover over an element on the website, the element will be highlighted and you can see information about it.
If you click the highlighted element then it will open up in the developer tools.
This anchor tag has all the data we need: it contains the title and the href of the link. It also has a class of storylink, so we need to select all the elements with a class of storylink in our code and then extract the data we want.
3. Use Cheerio and Axios to get HTML data from HackerNews
Now it’s time to start using Cheerio and Axios to scrape HackerNews.
Delete the hello world console log and require the packages at the top of your file.
```js
const cheerio = require('cheerio');
const axios = require('axios');
```
Next, we want to call axios using its get method to make a request to the HackerNews website to get the HTML data.
That code looks like this:
```js
axios.get('https://news.ycombinator.com/').then((response) => {
  console.log(response.data);
});
```
If you run your script now, then you should see a large string of HTML code.
Here is where Cheerio comes into play.
We want to load this HTML code into a Cheerio variable, and with that variable, we’ll be able to run jQuery-like methods on the HTML code.
That code looks like:
```js
axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);
});
```
The $ is the variable that contains the parsed HTML code ready for use.
Since we know that the .storylink class is where our data lies, we can find all of the elements that have a .storylink class using the $ variable. That looks like:
```js
axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);

  console.log($('.storylink'));
});
```
If you run your code now, you’ll see a large object that is a Cheerio object. Next, we will run methods on this Cheerio object to get the data we want.
4. Get the title and link using Cheerio
Since there are many DOM elements containing the class storylink, we want to loop over them and work with each individual one.
Cheerio makes this simple with an each method. This looks like:
```js
axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);

  $('.storylink').each((i, e) => {
    console.log(i);
    console.log(e);
  });
});
```
i is the index of the array, and e is the element object.
What this does is loop over all the elements containing the storylink class and within the loop, we can work with each individual element.
Since we want the title and URL, we can access them using text and attr methods provided by Cheerio. That looks like:
```js
axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);

  $('.storylink').each((i, e) => {
    let title = $(e).text();
    let link = $(e).attr('href');
    console.log(title);
    console.log(link);
  });
});
```
If you run your code now, you should see a large list of post titles and their URLs!
Next, we’ll save this data in a CSV file.
5. Save the title and link into a CSV file.
Creating CSV files in Node.js is easy. We just need to import a module called fs into our code and run some methods. fs is available with Node so we don’t have to install any new packages.
At the top of your code add the fs module and create a write stream.
```js
const fs = require('fs');
const writeStream = fs.createWriteStream('hackernews.csv');
```
This creates a file called hackernews.csv and prepares your code to write to it.
Next, we want to create some headers for the CSV file. This looks like:
```js
writeStream.write(`Title,Link\n`);
```
What we’re doing here is just writing a single line with the string `Title,Link` followed by a newline (`\n`). This prepares the CSV with headings.
What’s left is to write a line to the CSV file for every title and link. That looks like:
```js
axios.get('https://news.ycombinator.com/').then((response) => {
  let $ = cheerio.load(response.data);

  $('.storylink').each((i, e) => {
    let title = $(e).text();
    let link = $(e).attr('href');
    writeStream.write(`${title}, ${link}\n`);
  });
});
```
What we’re doing is writing a new line to the file that contains the title and link in their appropriate places, then ending it with a newline character so the next entry starts on its own line.
The string in use is a template literal, which is an easy way to embed variables in strings with nicer syntax.
If you run your code now, you should see a CSV file created in your directory with the title and link of all the posts from HackerNews.
Your final code should look like this:
Searching DuckDuckGo with Nightmare.js
In this tutorial, we'll be going over how to search DuckDuckGo with Nightmare.js and get the URLs of the first five results.
Nightmare.js is a browser automation library that uses Electron to mimic browser like behavior. Using Nightmare, you're able to automate actions like clicking, entering forms, going to another page, and everything you can do on a browser manually.
To do this, you use methods provided by Nightmare such as `goto`, `type`, `click`, `wait`, and many others that represent actions you would do with a mouse and keyboard.
Let's get started.
Prerequisites
- Node.js installed on your computer.
- Basic understanding of Javascript and Node.js.
- Basic understanding of the DOM.
1. Project setup
If you've initialized a Node project as outlined in the previous tutorial, you can simply create a new file in the same directory called `nightmare.js`.
If you haven't created a new Node project, follow Step 1 in the previous tutorial to see how to create a new Node.js project.
Next, we'll add the nightmare package. In your terminal, type:
yarn add nightmare
Next, add a console.log message in `nightmare.js` to get started.
Your `nightmare.js` file should look like:
```js
console.log('Hello from nightmare!');
```
If you run `node nightmare` in your terminal, you should see:
Hello from nightmare!
2. See what DOM elements we need using the developer tools
Next, let's visit [DuckDuckGo.com](https://duckduckgo.com/) and inspect the website to see which DOM elements we need to target.
Visit DuckDuckGo and open up the developer tools by right-clicking on the form and selecting `Inspect`.
And from the developer tools, we can see that the ID of the input form is `search_form_input_homepage`. Now we know to target this ID in our code.
Next, we need to click the search button to complete the action of entering a search term and then searching for it.
Right-click the search icon on the right side of the search input and click `Inspect`.
From the developer tools, we can see that the ID of the search button is `search_button_homepage`. This is the next element we need to target in our Nightmare script.
3. Search for a term in DuckDuckGo using Nightmare.js
Now we have our elements and we can start our Nightmare script.
In your nightmare.js file, delete the console.log message and add the following code:
```js
const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: true });

nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'web scraping')
  .click('#search_button_homepage')
  .then();
```
What we're doing here is first importing the Nightmare module, and then creating the nightmare object to work with.
The nightmare object takes in some options that you can see more of [here](https://github.com/segmentio/nightmare#nightmareoptions). The option we care about is `show: true` because this shows the electron instance and the actions being taken. You can hide this electron instance by setting `show` to `false`.
Next, we're telling the nightmare instance to take some actions. The actions are described using the methods `goto`, `type`, `click`, and `then`. They describe what we want nightmare to do.
First, we want it to go to the duckduckgo URL. Then, we want it to select the search form element and type 'web scraping'. Then, we want it to click the search button element. Then, we're calling `then` because this is what makes the instance run.
If you run this script, you should see Nightmare create an electron instance, go to duckduckgo.com, and then search for web scraping.
4. Get the URLs of the search results
The next step in this action is to get the URLs of the search results.
As you saw in the last step, Nightmare allows us to go to another page after taking an action like searching in a form, and then we can scrape the next page.
If you go to the browser and right-click a link in the search results page of DuckDuckGo, you'll see the element we need to target.
The class of the URL result we want is `result__url js-result-extras-url`.
To get DOM element data in Nightmare, we want to write our code in its `evaluate` method and return the data we want.
Update your script to look like this:
```js
nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'web scraping')
  .click('#search_button_homepage')
  .wait(3000)
  .evaluate(() => {
    const results = document.getElementsByClassName(
      'result__url js-result-extras-url'
    );
    return results;
  })
  .end()
  .then(console.log)
  .catch((error) => {
    console.error('Search failed:', error);
  });
```
What we added here is a `wait`, `evaluate`, `end`, `catch`, and a console.log to the `then`.
The `wait` gives the results page a few seconds to load so we don't scrape a page that hasn't finished loading.
Then `evaluate` is where we write our scraping code. Here, we're getting all the elements with a class of `result__url js-result-extras-url` and returning the results which will be used in the `then` call.
Then `end` is so the electron instance closes.
Then `then` is where we get the results that were returned from `evaluate` and we can work with it like any other Javascript code.
Then `catch` is where we catch errors and log them.
If you run this code, you should see an object logged.
```js
{
  '0': { jQuery1102006895228087119576: 151 },
  '1': { jQuery1102006895228087119576: 163 },
  '2': { jQuery1102006895228087119576: 202 },
  '3': { jQuery1102006895228087119576: 207 },
  '4': { jQuery1102006895228087119576: 212 },
  '5': { jQuery1102006895228087119576: 217 },
  '6': { jQuery1102006895228087119576: 222 },
  '7': { jQuery1102006895228087119576: 227 },
  '8': { jQuery1102006895228087119576: 232 },
  '9': { jQuery1102006895228087119576: 237 },
  '10': { jQuery1102006895228087119576: 242 },
  '11': { jQuery1102006895228087119576: 247 },
  '12': { jQuery1102006895228087119576: 188 }
}
```
This is the object returned from the evaluate method. These are all the elements selected by `document.getElementsByClassName('result__url js-result-extras-url');`.
We don't want to use this object; we want the URLs of the first five results.
To get the URL or href of one of these objects, we simply select it by index using `[]` and read its `href` attribute.
Update your code to look like this:
```js
nightmare
  .goto('https://duckduckgo.com')
  .type('#search_form_input_homepage', 'web scraping')
  .click('#search_button_homepage')
  .wait(3000)
  .evaluate(() => {
    const results = document.getElementsByClassName(
      'result__url js-result-extras-url'
    );
    const urls = [];
    urls.push(results[2].href);
    urls.push(results[3].href);
    urls.push(results[4].href);
    urls.push(results[5].href);
    urls.push(results[6].href);
    return urls;
  })
  .end()
  .then(console.log)
  .catch((error) => {
    console.error('Search failed:', error);
  });
```
Since the first two elements are URLs of ads, we can skip them and go to elements 2-6.
What we're doing here is creating an array called `urls` and pushing five hrefs into it. We select an element in the collection using `[]` and read its existing href attribute. Then we return the URLs to be used in the `then` method.
If you run your code now, you should see this log:
```js
[
  'https://en.wikipedia.org/wiki/Web_scraping',
  'https://www.guru99.com/web-scraping-tools.html',
  'https://www.edureka.co/blog/web-scraping-with-python/',
  'https://www.webharvy.com/articles/what-is-web-scraping.html',
  'https://realpython.com/tutorials/web-scraping/',
];
```
And this is how you get the first five URLs of a search in DuckDuckGo using Nightmare.js.
Your final code should look like this:
# What we covered
- Introduction to web scraping with Node.js
- Important concepts for web scraping.
- Popular web scraping libraries in Node.js
- A tutorial about how to scrape the HackerNews frontpage and save data to a CSV file.
- A tutorial about how to get the search results on DuckDuckGo using Nightmare.js.
Scraping MIDI files with jsdom

The internet has a wide variety of information for human consumption. But this data is often difficult to access programmatically if it doesn't come in the form of a dedicated REST API. With Node.js tools like jsdom, you can scrape and parse this data directly from web pages to use for your projects and applications.
Let's use the example of needing MIDI data to train a neural network that can generate classic Nintendo-sounding music. In order to do this, we'll need a set of MIDI music from old Nintendo games. Using jsdom we can scrape this data from the Video Game Music Archive.
Getting started and setting up dependencies
Before moving on, you will need to make sure you have an up-to-date version of Node.js and npm installed.
Navigate to the directory where you want this code to live and run the following command in your terminal to create a package for this project:
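Given the `--yes` flag explained next, that command is:

```shell
npm init --yes
```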
The `--yes` argument runs through all of the prompts that you would otherwise have to fill out or skip. Now we have a package.json for our app.

For making HTTP requests to get data from the web page we will use the Got library, and for parsing through the HTML we'll use jsdom.
Run the following command in your terminal to install these libraries:
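Presumably something along these lines; since this article's code uses `require`, a CommonJS-compatible Got release (v11 — later versions are ESM-only) is assumed:

```shell
npm install got@11 jsdom
```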
jsdom is a pure-JavaScript implementation of many web standards, making it a familiar tool to use for lots of JavaScript developers. Let's dive into how to use it.
Using Got to retrieve data to use with jsdom
First let's write some code to grab the HTML from the web page and look at how we can start parsing through it. The following code will send a `GET` request to the web page we want, and will create a jsdom object with the HTML from that page, which we'll name `dom`:
When you pass the `JSDOM` constructor a string, you will get back a JSDOM object, from which you can access a number of usable properties such as `window`. As seen in this code, you can navigate through the HTML and retrieve DOM elements for the data you want using a query selector. For example, `querySelector('title').textContent` will get you the text inside of the `<title>` tag on the page. If you save this code to a file named `index.js` and run it with the command `node index.js`, it will log the title of the web page to the console.

Using CSS Selectors with jsdom

If you want to get more specific in your query, there are a variety of selectors you can use to parse through the HTML. Two of the most common ones are to search for elements by class or ID. If you wanted to get a div with the ID of `menu` you would use `querySelectorAll('#menu')`, and if you wanted all of the header columns in the table of VGM MIDIs, you'd do `querySelectorAll('td.header')`.
What we want on this page are the hyperlinks to all of the MIDI files we need to download. We can start by getting every link on the page using `querySelectorAll('a')`. Add the following to your code in `index.js`:
This code logs the URL of every link on the page. We're able to look through all elements from a given selector using the `forEach` function. Iterating through every link on the page is great, but we're going to need to get a little more specific than that if we want to download all of the MIDI files.

Filtering through HTML elements
Before writing more code to parse the content that we want, let’s first take a look at the HTML that’s rendered by the browser. Every web page is different, and sometimes getting the right data out of them requires a bit of creativity, pattern recognition, and experimentation.
Our goal is to download a bunch of MIDI files, but there are a lot of duplicate tracks on this webpage, as well as remixes of songs. We only want one of each song, and because our ultimate goal is to use this data to train a neural network to generate accurate Nintendo music, we won't want to train it on user-created remixes.
When you're writing code to parse through a web page, it's usually helpful to use the developer tools available to you in most modern browsers. If you right-click on the element you're interested in, you can inspect the HTML behind that element to get more insight.
You can write filter functions to fine-tune which data you want from your selectors. These are functions that are run against each element for a given selector and return true or false depending on whether it should be included in the set.
If you looked through the data that was logged in the previous step, you might have noticed that there are quite a few links on the page that have no `href` attribute and therefore lead nowhere. We can be sure those are not the MIDIs we are looking for, so let's write a short function to filter those out and keep only the elements whose `href` leads to a `.mid` file.

Now we have the problem of not wanting to download duplicates or user generated remixes. For this we can use regular expressions to make sure we are only getting links whose text has no parentheses, as only the duplicates and remixes contain parentheses:
Try adding these to your code in `index.js` by creating an array out of the collection of HTML Element Nodes that are returned from `querySelectorAll` and applying our filter functions to it. Run this code again and it should only be printing `.mid` files, without duplicates of any particular song.

Downloading the MIDI files we want from the webpage
Now that we have working code to iterate through every MIDI file that we want, we have to write code to download all of them.
In the callback function for looping through all of the MIDI links, add this code to stream the MIDI download into a local file, complete with error checking:
Run this code from a directory where you want to save all of the MIDI files, and watch your terminal screen display all 2230 MIDI files that you downloaded (at the time of writing this). With that, we should be finished scraping all of the MIDI files we need.
Go through and listen to them and enjoy some Nintendo music!
The vast expanse of the World Wide Web
Now that you can programmatically grab things from web pages, you have access to a huge source of data for whatever your projects need. One thing to keep in mind is that changes to a web page’s HTML might break your code, so make sure to keep everything up to date if you're building applications on top of this. You might want to also try comparing the functionality of the jsdom library with other solutions by following tutorials for web scraping using Cheerio and headless browser scripting using Puppeteer or a similar library called Playwright.
If you're looking for something to do with the data you just grabbed from the Video Game Music Archive, you can try using Python libraries like Magenta to train a neural network with it.
I’m looking forward to seeing what you build. Feel free to reach out and share your experiences or ask any questions.
- Email: [email protected]
- Twitter: @Sagnewshreds
- Github: Sagnew
- Twitch (streaming live code): Sagnewshreds