Extract all links from a webpage using Python and Beautiful Soup 4
This article shows you how to get all links from a webpage using Python 3, the Requests module, and the Beautiful Soup 4 module. For demonstration purposes, I will fetch the main page of Wikipedia and extract its links:
https://en.wikipedia.org/wiki/Main_Page
Please note that not all websites allow you to crawl content from them.
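One common way to check whether a site permits crawling is its robots.txt file. The sketch below uses the standard urllib.robotparser module. The rules and the user agent name my-link-bot are made up for illustration and are not Wikipedia's actual rules; a robots.txt check also does not replace reading a site's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A path that the sample rules do not mention
print(is_allowed(SAMPLE_ROBOTS_TXT, 'my-link-bot', 'https://example.com/wiki/Main_Page'))
# A path under the disallowed /private/ prefix
print(is_allowed(SAMPLE_ROBOTS_TXT, 'my-link-bot', 'https://example.com/private/data'))
```

For a live site, calling set_url() with the address of the robots.txt file and then read() on the parser loads the rules over the network instead of from a string.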
Getting Started
Install the required modules by running the following commands:
pip install requests
and:
pip install beautifulsoup4
If you’re using a Mac, you may need to type pip3 instead of pip.
The code with explanation:
import requests
# Beautiful Soup 4 is imported under the module name bs4
import bs4

URL = 'https://en.wikipedia.org/wiki/Main_Page'

# Fetch the HTML source of the page
response = requests.get(URL)

# Parse the HTML and select every <a> element
soup = bs4.BeautifulSoup(response.text, 'html.parser')
links = soup.select('a')

# Print out the result
for link in links:
    print(link.get_text())
    href = link.get('href')
    if href is not None:
        if 'https://' in href:
            print(href)
        else:
            # Convert a relative URL to an absolute URL by prefixing the site root.
            # This is naive: an href that starts with // ends up with a doubled host.
            print('https://en.wikipedia.org' + href)
    print('----------------------------')  # Just a separator line
When running that program, you should see something like the following:
Main page
https://en.wikipedia.org/wiki/Main_Page
----------------------------
Contents
https://en.wikipedia.org/wiki/Wikipedia:Contents
----------------------------
Current events
https://en.wikipedia.org/wiki/Portal:Current_events
----------------------------
Random article
https://en.wikipedia.org/wiki/Special:Random
----------------------------
About Wikipedia
https://en.wikipedia.org/wiki/Wikipedia:About
----------------------------
Contact us
https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Contact_us
----------------------------
Donate
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
----------------------------
Help
https://en.wikipedia.org/wiki/Help:Contents
----------------------------
Learn to edit
https://en.wikipedia.org/wiki/Help:Introduction
----------------------------
Community portal
https://en.wikipedia.org/wiki/Wikipedia:Community_portal
----------------------------
Recent changes
https://en.wikipedia.org/wiki/Special:RecentChanges
----------------------------
Upload file
https://en.wikipedia.org/wiki/Wikipedia:File_Upload_Wizard
----------------------------
What links here
https://en.wikipedia.org/wiki/Special:WhatLinksHere/Main_Page
----------------------------
Related changes
https://en.wikipedia.org/wiki/Special:RecentChangesLinked/Main_Page
----------------------------
Upload file
https://en.wikipedia.org/wiki/Wikipedia:File_Upload_Wizard
----------------------------
Special pages
https://en.wikipedia.org/wiki/Special:SpecialPages
----------------------------
Permanent link
https://en.wikipedia.org/w/index.php?title=Main_Page&oldid=987965326
----------------------------
Page information
https://en.wikipedia.org/w/index.php?title=Main_Page&action=info
----------------------------
Cite this page
https://en.wikipedia.org/w/index.php?title=Special:CiteThisPage&page=Main_Page&id=987965326&wpFormIdentifier=titleform
----------------------------
Wikidata item
https://www.wikidata.org/wiki/Special:EntityPage/Q5296
----------------------------
Download as PDF
https://en.wikipedia.org/w/index.php?title=Special:DownloadAsPdf&page=Main_Page&action=show-download-screen
----------------------------
Printable version
https://en.wikipedia.org/w/index.php?title=Main_Page&printable=yes
----------------------------
Wikimedia Commons
https://commons.wikimedia.org/wiki/Main_Page
----------------------------
MediaWiki
https://www.mediawiki.org/wiki/MediaWiki
----------------------------
Meta-Wiki
https://meta.wikimedia.org/wiki/Main_Page
----------------------------
Wikispecies
https://species.wikimedia.org/wiki/Main_Page
----------------------------
Wikibooks
https://en.wikibooks.org/wiki/Main_Page
----------------------------
Wikidata
https://www.wikidata.org/wiki/Wikidata:Main_Page
----------------------------
Wikimania
https://wikimania.wikimedia.org/wiki/Wikimania
----------------------------
Wikinews
https://en.wikinews.org/wiki/Main_Page
----------------------------
Wikiquote
https://en.wikiquote.org/wiki/Main_Page
----------------------------
Wikisource
https://en.wikisource.org/wiki/Main_Page
----------------------------
Wikiversity
https://en.wikiversity.org/wiki/Wikiversity:Main_Page
----------------------------
Wikivoyage
https://en.wikivoyage.org/wiki/Main_Page
----------------------------
Wiktionary
https://en.wiktionary.org/wiki/Wiktionary:Main_Page
----------------------------
العربية
https://ar.wikipedia.org/wiki/
----------------------------
Български
https://bg.wikipedia.org/wiki/
----------------------------
Bosanski
https://bs.wikipedia.org/wiki/
----------------------------
Català
https://ca.wikipedia.org/wiki/
----------------------------
Čeština
https://cs.wikipedia.org/wiki/
----------------------------
Dansk
https://da.wikipedia.org/wiki/
----------------------------
Deutsch
https://de.wikipedia.org/wiki/
----------------------------
Eesti
https://et.wikipedia.org/wiki/
----------------------------
Ελληνικά
https://el.wikipedia.org/wiki/
----------------------------
Español
https://es.wikipedia.org/wiki/
----------------------------
Esperanto
https://eo.wikipedia.org/wiki/
----------------------------
Euskara
https://eu.wikipedia.org/wiki/
----------------------------
فارسی
https://fa.wikipedia.org/wiki/
----------------------------
Français
https://fr.wikipedia.org/wiki/
----------------------------
Galego
https://gl.wikipedia.org/wiki/
----------------------------
한국어
https://ko.wikipedia.org/wiki/
----------------------------
Hrvatski
https://hr.wikipedia.org/wiki/
----------------------------
Bahasa Indonesia
https://id.wikipedia.org/wiki/
----------------------------
Italiano
https://it.wikipedia.org/wiki/
----------------------------
עברית
https://he.wikipedia.org/wiki/
----------------------------
ქართული
https://ka.wikipedia.org/wiki/
----------------------------
Latviešu
https://lv.wikipedia.org/wiki/
----------------------------
Lietuvių
https://lt.wikipedia.org/wiki/
----------------------------
Magyar
https://hu.wikipedia.org/wiki/
----------------------------
Македонски
https://mk.wikipedia.org/wiki/
----------------------------
Bahasa Melayu
https://ms.wikipedia.org/wiki/
----------------------------
Nederlands
https://nl.wikipedia.org/wiki/
----------------------------
日本語
https://ja.wikipedia.org/wiki/
----------------------------
Norsk bokmål
https://no.wikipedia.org/wiki/
----------------------------
Norsk nynorsk
https://nn.wikipedia.org/wiki/
----------------------------
Polski
https://pl.wikipedia.org/wiki/
----------------------------
Português
https://pt.wikipedia.org/wiki/
----------------------------
Română
https://ro.wikipedia.org/wiki/
----------------------------
Русский
https://ru.wikipedia.org/wiki/
----------------------------
Simple English
https://simple.wikipedia.org/wiki/
----------------------------
Slovenčina
https://sk.wikipedia.org/wiki/
----------------------------
Slovenščina
https://sl.wikipedia.org/wiki/
----------------------------
Српски / srpski
https://sr.wikipedia.org/wiki/
----------------------------
Srpskohrvatski / српскохрватски
https://sh.wikipedia.org/wiki/
----------------------------
Suomi
https://fi.wikipedia.org/wiki/
----------------------------
Svenska
https://sv.wikipedia.org/wiki/
----------------------------
ไทย
https://th.wikipedia.org/wiki/
----------------------------
Türkçe
https://tr.wikipedia.org/wiki/
----------------------------
Українська
https://uk.wikipedia.org/wiki/
----------------------------
Tiếng Việt
https://vi.wikipedia.org/wiki/
----------------------------
中文
https://zh.wikipedia.org/wiki/
----------------------------
Creative Commons Attribution-ShareAlike License
https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
----------------------------
https://en.wikipedia.org//creativecommons.org/licenses/by-sa/3.0/
----------------------------
Terms of Use
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Terms_of_Use
----------------------------
Privacy Policy
https://en.wikipedia.org//foundation.wikimedia.org/wiki/Privacy_policy
----------------------------
Wikimedia Foundation, Inc.
https://en.wikipedia.org//www.wikimediafoundation.org/
----------------------------
Privacy policy
https://foundation.wikimedia.org/wiki/Privacy_policy
----------------------------
About Wikipedia
https://en.wikipedia.org/wiki/Wikipedia:About
----------------------------
Disclaimers
https://en.wikipedia.org/wiki/Wikipedia:General_disclaimer
----------------------------
Contact Wikipedia
https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Contact_us
----------------------------
Mobile view
https://en.wikipedia.org//en.m.wikipedia.org/w/index.php?title=Main_Page&mobileaction=toggle_view_mobile
----------------------------
Developers
https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute
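Some URLs in the output above contain the host twice, for example https://en.wikipedia.org//en.wikipedia.org/wiki/Wikipedia:Contact_us. This happens because hrefs that start with // are protocol-relative: they already carry a host, so prefixing the site root duplicates it. One possible fix, sketched here with urljoin from the standard library, is to resolve each href against the page URL. The helper name to_absolute is made up for this example.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://en.wikipedia.org/wiki/Main_Page'

def to_absolute(href, base=PAGE_URL):
    """Resolve an href found on the page at base into an absolute URL."""
    return urljoin(base, href)

# Site-relative path: the scheme and host of the page are filled in
print(to_absolute('/wiki/Help:Contents'))
# Protocol-relative URL: only the scheme is filled in
print(to_absolute('//en.wikipedia.org/wiki/Wikipedia:Contact_us'))
# Already absolute: returned unchanged
print(to_absolute('https://www.mediawiki.org/wiki/MediaWiki'))
```

In the loop of the program above, a single print(to_absolute(href)) could stand in for the if/else that checks for 'https://'.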
Note that the content of Wikipedia’s main page changes over time, so your output will likely differ from mine. Also keep in mind that not all websites allow you to scrape their content.
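The program above selects every a element, including ones without an href, and it repeats links that occur more than once on the page (Upload file appears twice in the output). If you only want each distinct link once, a sketch like the following may help. It runs on a small hand-written HTML snippet rather than the live page, and the function name unique_hrefs is made up for illustration.

```python
import bs4

# A tiny hand-written HTML fragment (hypothetical, for illustration only)
SAMPLE_HTML = """
<ul>
  <li><a href="/wiki/Wikipedia:File_Upload_Wizard">Upload file</a></li>
  <li><a href="/wiki/Special:SpecialPages">Special pages</a></li>
  <li><a href="/wiki/Wikipedia:File_Upload_Wizard">Upload file</a></li>
  <li><a id="top">No href here</a></li>
</ul>
"""

def unique_hrefs(html):
    """Return each distinct href once, in order of first appearance."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # The CSS selector a[href] matches only anchors that have an href attribute
    hrefs = [a['href'] for a in soup.select('a[href]')]
    # dict.fromkeys drops duplicates while keeping insertion order
    return list(dict.fromkeys(hrefs))

print(unique_hrefs(SAMPLE_HTML))
```

Because a[href] skips anchors that lack the attribute, no None check is needed here.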