How to get all Links from a Webpage using Node.js and Cheerio

Last updated on January 22, 2021 by A Goodman

Node.js is a JavaScript runtime environment that helps you create high-performance apps. Because Node.js I/O is non-blocking (many requests can be in flight at once while the event loop keeps working), it can handle large numbers of network operations without incurring the cost of thread context switching. For that reason, Node.js is a great fit for scraping web pages.
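To see what non-blocking means in practice, here is a minimal sketch (not part of the scraping project) that simulates two 200 ms "requests" with timers. Because both start immediately and the event loop waits on them concurrently, the total time is roughly 200 ms, not 400 ms:

```javascript
// Simulate a slow I/O operation with a timer
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const main = async () => {
  const start = Date.now();
  // Both "requests" are started at once; the event loop waits on them together
  await Promise.all([sleep(200), sleep(200)]);
  const elapsed = Date.now() - start;
  console.log(`elapsed: ${elapsed} ms`); // roughly 200 ms, not 400 ms
};

main();
```

This is the same property that lets a scraper download many pages concurrently with a single thread.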

In this article, we will crawl and extract all links (including “href” and “text”) from a webpage using Node.js and 2 packages: got and cheerio.

  • got is an easy-to-use and powerful HTTP request library for Node.js that will help download HTML from a webpage.
  • cheerio is a fast implementation of core jQuery designed specifically for the server that can help us parse HTML much easier.

The Example Project

In this example, we will get all links from the homepage of books.toscrape.com, a website built specifically for scraping practice, so we can scrape it without worrying about any legal issues.

1. Open your terminal, navigate to the folder you want your project to live in, then create a new file named index.js.

2. Install the required libraries:

npm i got cheerio

Note: got v12 and later are ESM-only, so require('got') will fail with those versions. If you hit that error, either install got@11 or switch your project to import syntax.

3. Add the following to your index.js:

// Kindacode.com
const got = require('got');
const cheerio = require('cheerio');

const extractLinks = async (url) => {
  try {
    // Fetching HTML
    const response = await got(url);
    const html = response.body;

    // Using cheerio to extract <a> tags
    const $ = cheerio.load(html);

    const linkObjects = $('a');
    // this is a Cheerio object, not a plain array

    // Collect the "text" and "href" of each link and add them to an array
    const links = [];
    linkObjects.each((index, element) => {
      links.push({
        text: $(element).text(), // get the text
        href: $(element).attr('href'), // get the href attribute
      });
    });

    console.log(links);
    // do something else here with these links, such as writing to a file or saving them to your database
  } catch (error) {
    // error.response only exists for HTTP errors; fall back to the message
    console.log(error.response ? error.response.body : error.message);
  }
};

// Try it
const URL = 'http://books.toscrape.com/';
extractLinks(URL);

4. Run your code:

node index.js

You should see something similar to this in your console (the output below is heavily truncated because the full list is long):

[
  ...
  {
    text: 'Libertarianism for Beginners',
    href: 'catalogue/libertarianism-for-beginners_982/index.html'
  },
  { text: '', href: 'catalogue/its-only-the-himalayas_981/index.html' },
  {
    text: "It's Only the Himalayas",
    href: 'catalogue/its-only-the-himalayas_981/index.html'
  },
  { text: 'next', href: 'catalogue/page-2.html' }
]
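Notice that the scraped hrefs are relative (e.g. 'catalogue/page-2.html'). If you want to follow or store them, you will usually need absolute URLs first. Here is a small sketch using Node's built-in WHATWG URL class to resolve them against the page URL (the sample hrefs are taken from the output above):

```javascript
// Resolve relative hrefs against the page they were scraped from
const base = 'http://books.toscrape.com/';
const relativeHrefs = ['catalogue/page-2.html', 'index.html'];

const absolute = relativeHrefs.map((href) => new URL(href, base).href);

console.log(absolute);
// [ 'http://books.toscrape.com/catalogue/page-2.html',
//   'http://books.toscrape.com/index.html' ]
```

new URL(href, base) handles all the edge cases (leading slashes, ../ segments, already-absolute URLs) that naive string concatenation gets wrong.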

Another Approach (Deprecated)

In this example, we’ll use request-promise instead of got. The implementation process is not much different from the example above.

Note: Because request-promise is now deprecated, you should not use it in new projects. This section is kept only as a reference for people still maintaining code that depends on it, and it will be removed in the future.

Installation:

npm install cheerio request-promise

Code:

const $ = require('cheerio');
const rp = require('request-promise');

const url = 'https://en.wikipedia.org/wiki/Main_Page';
// I use Wikipedia for the example, but you can use any other site you like

rp(url).then(html => {
    const linkObjects = $('a', html);
    // this is a Cheerio object, not a plain array

    const total = linkObjects.length;
    // the Cheerio object has a "length" property

    const links = [];
    // we only need the "href" and "title" of each link

    for (let i = 0; i < total; i++) {
        links.push({
            href: linkObjects[i].attribs.href,
            title: linkObjects[i].attribs.title,
        });
    }

    console.log(links);
    // do something else here with links
})
.catch(err => {
    console.log(err);
});

Conclusion

In this article, you learned how to extract all the links on a website using Node.js with the help of the got and cheerio libraries. From here, you can develop your own more complex web crawlers.

If you would like to learn more about Node.js, check out our Node.js category page for the latest tutorials and examples.
