A quick and easy way to get the HTML source of a web page is Python's built-in urllib module. It can fetch and print the raw HTML of any publicly accessible URL. First, we check the HTTP status code returned by the page to confirm it is live and accepting requests (200 means the request succeeded). Then we read the complete HTML source into a variable and print it. The same variable can also be saved for later analysis. The code is shown below.
# =============================================================================
# retrieving source html code from a web page
# =============================================================================
import urllib.request


def main():
    # access a URL using urllib
    pageURL = urllib.request.urlopen("http://www.python.org")

    # print the HTTP status code of the target URL to test if it is live
    print("URL code: " + str(pageURL.getcode()))

    # read() returns bytes, so decode them to text before printing;
    # str() on bytes would print the b'...' representation instead
    sourceCode = pageURL.read().decode("utf-8", errors="replace")
    print("URL source code:\n" + sourceCode)


if __name__ == "__main__":
    main()
The output is the HTML source of the target URL set in the code, which is currently python.org.
It’s as simple as that!
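If you want to keep the source for later analysis, here is a minimal sketch of one way to do it. The helper names fetch_source and save_source, the 10-second timeout, and the UTF-8 fallback are my own assumptions for illustration, not part of the script above. The sketch also catches request errors, so a dead link prints a message instead of a traceback.

```python
import urllib.error
import urllib.request
from pathlib import Path


def fetch_source(url, timeout=10):
    """Return (status_code, html_text) for url, or (None, None) on failure.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    try:
        # the with-block closes the connection once we are done reading
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.getcode()
            # use the charset the server declares, falling back to UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
            html = response.read().decode(charset, errors="replace")
            return status, html
    except urllib.error.URLError as err:
        # URLError also covers HTTPError (for example a 404 response)
        print("Request failed: " + str(err))
        return None, None


def save_source(html, filename):
    """Write the page source to a text file (hypothetical helper)."""
    Path(filename).write_text(html, encoding="utf-8")
```

To try it, call fetch_source("http://www.python.org"), check that the returned HTML is not None, and pass it to save_source along with a file name of your choice, such as python_org.html.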
Please let me know your thoughts on this. The aim is to learn together. Tell me in the comments if I missed anything! 🙂