How to Quickly Check Status Code of URLs in Sitemaps using Python?
Analyze URLs in a sitemap at scale.
Before starting this snippet, I want to appreciate the work Elias Dabbas is doing with Advertools.
He is helping the digital marketing community a lot by making this awesome tool free to use.
So, in this snippet, I am going to use Advertools to check the status codes of 1000+ URLs in a sitemap.
I first thought of using Screaming Frog, but it would crawl and extract everything. Yes, I could untick those options, but I found Advertools faster and easier.
So let's first grab the sitemap. I am going to use the sitemap of the great Bill Slawski's website, SEO by the Sea.
Now that we have the sitemap let’s just jump right into the code.
The first step is to install and import the required packages.
Type the following in CMD to install packages -
pip install pandas
pip install advertools
Import the packages -
import pandas as pd
import advertools as adv
The next step is to convert the sitemap to a DataFrame and then to a list:
sitemap_url = 'https://www.seobythesea.com/post-sitemap1.xml'
sitemap_df = adv.sitemap_to_df(sitemap_url)
url_list = sitemap_df['loc'].tolist()
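If you want to sanity-check this conversion step without hitting the network, here is a minimal sketch using a toy DataFrame in place of the real `sitemap_to_df` output. The column name `loc` matches what Advertools returns for sitemap URLs; the example URLs and `lastmod` values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the DataFrame that adv.sitemap_to_df() returns;
# the 'loc' column holds the page URLs listed in the sitemap.
sitemap_df = pd.DataFrame({
    'loc': [
        'https://example.com/page-1',
        'https://example.com/page-2',
        'https://example.com/page-3',
    ],
    'lastmod': ['2021-01-01', '2021-02-01', '2021-03-01'],
})

# .tolist() turns the 'loc' column into a plain Python list of URLs,
# which is the shape the crawler expects.
url_list = sitemap_df['loc'].tolist()
print(url_list)
```

The same two lines work unchanged on the real sitemap DataFrame.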
Then we will feed this list of URLs to Advertools and check the response headers. We will then store the URLs and their status codes in a CSV file.
adv.crawl_headers(url_list, 'url_status.jl')
headers_df = pd.read_json('url_status.jl', lines = True)
print(headers_df[['url','status']])
headers_df[['url','status']].to_csv(r'C:\Users\ABShukla\Documents\url_status_checker.csv')
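If you only care about broken pages, you can filter the crawl results to non-200 responses before saving. Here is a minimal sketch with made-up data standing in for `headers_df`; the column names `url` and `status` match the crawl output used above:

```python
import pandas as pd

# Made-up crawl results standing in for headers_df from the crawl above.
headers_df = pd.DataFrame({
    'url': [
        'https://example.com/ok',
        'https://example.com/missing',
        'https://example.com/moved',
    ],
    'status': [200, 404, 301],
})

# Keep only URLs that did not return 200 OK.
problem_urls = headers_df[headers_df['status'] != 200]
print(problem_urls[['url', 'status']])
```

You could then write `problem_urls` to CSV instead of the full list, so the report only contains pages that need attention.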
Here is what the output looks like -
Full code snippet -
import pandas as pd
import advertools as adv
sitemap_url = 'https://www.seobythesea.com/post-sitemap1.xml'
sitemap_df = adv.sitemap_to_df(sitemap_url)
url_list = sitemap_df['loc'].tolist()
adv.crawl_headers(url_list, 'url_status.jl')
headers_df = pd.read_json('url_status.jl', lines = True)
print(headers_df[['url','status']])
headers_df[['url','status']].to_csv(r'C:\Users\ABShukla\Documents\url_status_checker.csv')
Hope this helped!
Thanks for reading.
Tweet @stanabk if you run into any issues.
Sharing is caring!