How to get all Links from a Webpage using Node.js and Cheerio

Last updated on January 22, 2021 by A Goodman

Node.js is a JavaScript runtime environment that helps you create high-performance apps. Because Node.js I/O is non-blocking (many requests can be in flight at once while the event loop keeps working), it can handle large numbers of network operations without incurring the cost of thread context switching. For that reason, Node.js is a great fit for scraping web pages.
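To see what non-blocking means in practice, here is a minimal sketch (not part of the scraping project) that simulates two 200 ms "requests" with timers. Because both start immediately and the event loop waits on them concurrently, the total time is roughly 200 ms, not 400 ms:

```javascript
// Simulate a slow I/O operation with a timer
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const main = async () => {
  const start = Date.now();
  // Both "requests" are started at once; the event loop waits on them together
  await Promise.all([sleep(200), sleep(200)]);
  const elapsed = Date.now() - start;
  console.log(`elapsed: ${elapsed} ms`); // roughly 200 ms, not 400 ms
};

main();
```

This is the same property that lets a scraper download many pages concurrently with a single thread.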

In this article, we will crawl and extract all links (including “href” and “text”) from a webpage using Node.js and 2 packages: got and cheerio.

  • got is an easy-to-use and powerful HTTP request library for Node.js that will help download HTML from a webpage.
  • cheerio is a fast implementation of core jQuery designed specifically for the server that can help us parse HTML much easier.

The Example Project

In this example, we will get all links from the homepage of books.toscrape.com, a website built specifically for scraping practice, so we can scrape it without worrying about any legal issues.

1. Open your terminal, navigate to the folder you want your project to live in, then create a new file named index.js.

2. Install the required libraries:

npm i got cheerio

Note: got v12 and later are ESM-only, so require('got') will fail with those versions. If you hit that error, either install got@11 or switch your project to import syntax.

3. Add the following to your index.js:

// Kindacode.com
const got = require('got');
const cheerio = require('cheerio');

const extractLinks = async (url) => {
  try {
    // Fetching HTML
    const response = await got(url);
    const html = response.body;

    // Using cheerio to extract <a> tags
    const $ = cheerio.load(html);

    const linkObjects = $('a');
    // this is a Cheerio object, not a plain array

    // Collect the "text" and "href" of each link and add them to an array
    const links = [];
    linkObjects.each((index, element) => {
      links.push({
        text: $(element).text(), // get the text
        href: $(element).attr('href'), // get the href attribute
      });
    });

    console.log(links);
    // do something else here with these links, such as writing to a file or saving them to your database
  } catch (error) {
    // error.response only exists for HTTP errors; fall back to the message
    console.log(error.response ? error.response.body : error.message);
  }
};

// Try it
const URL = 'http://books.toscrape.com/';
extractLinks(URL);

4. Run your code:

node index.js

You should see something similar to this in your console (the output below is heavily truncated because the full list is long):

[
  ...
  {
    text: 'Libertarianism for Beginners',
    href: 'catalogue/libertarianism-for-beginners_982/index.html'
  },
  { text: '', href: 'catalogue/its-only-the-himalayas_981/index.html' },
  {
    text: "It's Only the Himalayas",
    href: 'catalogue/its-only-the-himalayas_981/index.html'
  },
  { text: 'next', href: 'catalogue/page-2.html' }
]
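Notice that the scraped hrefs are relative (e.g. 'catalogue/page-2.html'). If you want to follow or store them, you will usually need absolute URLs first. Here is a small sketch using Node's built-in WHATWG URL class to resolve them against the page URL (the sample hrefs are taken from the output above):

```javascript
// Resolve relative hrefs against the page they were scraped from
const base = 'http://books.toscrape.com/';
const relativeHrefs = ['catalogue/page-2.html', 'index.html'];

const absolute = relativeHrefs.map((href) => new URL(href, base).href);

console.log(absolute);
// [ 'http://books.toscrape.com/catalogue/page-2.html',
//   'http://books.toscrape.com/index.html' ]
```

new URL(href, base) handles all the edge cases (leading slashes, ../ segments, already-absolute URLs) that naive string concatenation gets wrong.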

Another Approach (Deprecated)

In this example, we’ll use request-promise instead of got. The implementation process is not much different from the example above.

Note: Because request-promise is now deprecated, you should not use it in new projects. This section is kept only as a reference for people still maintaining code that depends on it, and it will be removed in the future.

Installation:

npm install cheerio request-promise

Code:

const $ = require('cheerio');
const rp = require('request-promise');

const url = 'https://en.wikipedia.org/wiki/Main_Page';
// I use Wikipedia for the example, but you can use any other site you like

rp(url).then(html => {
    const linkObjects = $('a', html);
    // this is a Cheerio object, not a plain array

    const total = linkObjects.length;
    // the Cheerio object has a "length" property

    const links = [];
    // we only need the "href" and "title" of each link

    for (let i = 0; i < total; i++) {
        links.push({
            href: linkObjects[i].attribs.href,
            title: linkObjects[i].attribs.title,
        });
    }

    console.log(links);
    // do something else here with links
})
.catch(err => {
    console.log(err);
});

Conclusion

In this article, you learned how to extract all the links on a website using Node.js with the help of the got and cheerio libraries. From here, you can develop your own more complex web crawlers.

If you would like to learn more about Node.js, check out our Node.js category page for the latest tutorials and examples.
