How to get all links from a webpage using Node.js

February 10, 2020 · Goodman

In this article, we will collect all the links on a webpage (both their “href” and “title” attributes) using Node.js and two packages: cheerio and request-promise.

Installation:

npm install cheerio request-promise

Code:

const cheerio = require('cheerio');
const rp = require('request-promise');

const url = 'https://en.wikipedia.org/wiki/Main_Page';
// Wikipedia is used as the example here, but any site will work

rp(url).then(html => {
    const $ = cheerio.load(html);

    const linkObjects = $('a');
    // this is a cheerio object (array-like), not a real array

    const total = linkObjects.length;
    // the cheerio object has a "length" property

    const links = [];
    // we only need the "href" and "title" of each link

    for(let i = 0; i < total; i++){
        links.push({
            href: linkObjects[i].attribs.href,
            title: linkObjects[i].attribs.title
        });
    }

    console.log(links);
    // do something else here with links
})
.catch(err => {
    console.log(err);
});
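
Running the script prints an array of plain objects. The exact contents depend on the page at the moment you fetch it, but the output looks roughly like this (illustrative values, not real Wikipedia data):

[
  { href: '/wiki/Wikipedia', title: 'Wikipedia' },
  { href: '/wiki/Free_content', title: 'Free content' },
  ...
]

Note that anchors missing an “href” or “title” attribute will show up with undefined in the corresponding field.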

Simple as that. From here, you're good to go and can start building more complex web crawlers 🙂
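
If you prefer async/await, the same crawl can be written as a small helper. This is just a sketch: the function name getLinks is my own, and it adds a filter to skip anchors that have no href attribute (which the loop above records as undefined).

const cheerio = require('cheerio');
const rp = require('request-promise');

// getLinks is a hypothetical helper name, not part of either package
async function getLinks(url) {
    const html = await rp(url);
    const $ = cheerio.load(html);

    return $('a')
        .toArray() // convert the cheerio object into a plain array of elements
        .filter(el => el.attribs.href) // skip anchors without an href
        .map(el => ({
            href: el.attribs.href,
            title: el.attribs.title
        }));
}

getLinks('https://en.wikipedia.org/wiki/Main_Page')
    .then(links => console.log(`${links.length} links found`))
    .catch(err => console.log(err));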
