
A quick and easy way to get the HTML source of any web page is Python's built-in urllib module. It can fetch and print the raw HTML of any publicly accessible URL. First, we check the HTTP status code returned by the page to make sure it is live and accepting requests. Then we read the complete HTML source into a variable and print it. The source can also be saved for later analysis. The code is shown below.


# =============================================================================
# retrieving source html code from a web page
# =============================================================================
import urllib.request

def main():
  # open the URL; the with block closes the connection when we are done
  with urllib.request.urlopen("http://www.python.org") as page:
    # print the HTTP status code to test if the page is live (200 means OK)
    print("HTTP status code: " + str(page.getcode()))

    # read() returns raw bytes, so decode them into text before printing
    charset = page.headers.get_content_charset() or "utf-8"
    sourceCode = page.read().decode(charset, errors="replace")

  # print the page source code
  print("Page source code:\n" + sourceCode)

if __name__ == "__main__":
  main()

The output is the HTTP status code followed by the HTML source of the target URL set in the code, which is currently python.org.
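If you want to keep the source for later analysis, or the page might be down, something like the following could help. This is only a rough sketch and not part of the script above: the function names fetch_source and save_source, the 10 second timeout, and the UTF-8 fallback are all my own assumptions. One thing worth knowing is that urlopen raises an HTTPError for error responses such as 404, so checking the status code alone will not catch a dead page.

```python
import urllib.error
import urllib.request


def fetch_source(url, timeout=10):
  """Return the decoded page source, or None on an HTTPError or URLError.

  Hypothetical helper; the timeout value is an arbitrary choice.
  """
  try:
    with urllib.request.urlopen(url, timeout=timeout) as page:
      # fall back to UTF-8 (an assumption) if the server declares no charset
      charset = page.headers.get_content_charset() or "utf-8"
      return page.read().decode(charset, errors="replace")
  except urllib.error.HTTPError as err:
    # raised for error responses such as 404 or 500
    print("HTTP error: " + str(err.code))
  except urllib.error.URLError as err:
    # raised when the request cannot be made, e.g. the host is unreachable
    print("Could not open the URL: " + str(err.reason))
  return None


def save_source(text, path):
  """Write the page source to a UTF-8 text file (hypothetical helper)."""
  with open(path, "w", encoding="utf-8") as f:
    f.write(text)
```

To use it, call fetch_source("http://www.python.org"), check that the result is not None, and pass it to save_source along with a file name of your choice. HTTPError is caught before URLError because it is a subclass of URLError.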

It’s as simple as that!
Please let me know your thoughts on this. The aim is to learn together. Tell me in the comments if I missed anything! 🙂


Mohammad D.

Mohammad D. works with sentiment analysis, NLP and Python. He loves to blog about these and other related topics in his free time.