How to extract the sub-folder of 1000+ URLs using Google Sheets?
Fetch the category of all 1000+ URLs.
Let’s say you got a list of random 1000+ URLs, and you want to extract the sub-folder or 2nd path of these URLs.
There can be multiple reasons to do this -
You want to analyze any metric (e.g. page views) by sub-folder
You want to analyze the count of URLs in each category
You want to create disavow rules in the robots.txt file.
So, let’s take a random 1000+ URLs with their clicks in A & B column respectively from - shop.googlemerchandisestore.com
It would look something like this -
Now we will extract the category of all the URLs using the REGEXEXTRACT() formula of Google sheet.
Here is what the formula looks like -
=REGEXEXTRACT(A2,"shop.googlemerchandisestore.com/(.*?)\/")
Let’s decode this formula.
REGEX stands for Regular Expression. Regex is a very powerful tool to extract string/number patterns from any block of text.
Learn how to write any regex rules here.
shop.googlemerchandisestore.com/(.*?)\/
The above regex rule says - extract anything that comes between domainname.com/ and the first slash it detects.
So if we apply this rule on www.example.com/category/topic/ it will extract the “category” string.
.* is an expression to match any string. It starts grabbing everything after www.example.com/
? says stop when you detect the character mentioned outside the brace bracket (i.e. front slash)
\/ stands for a front slash. Front slash is also one of the regex expressions. So, to make sure regex considers front slash as front slash we need to add a backslash before it as an escape character.
Combining all 3 makes - (.*?)\/
So by applying this formula in the C column we can extract the sub-folder of all URLs.
The result will look something like this -
Now that we have the sub-folder of all the URLs, we can analyze the total clicks and count of each folder.
Which can help us to see which category is getting the most clicks.
Looks like google+redesign is getting the most traffic after home.
Here is a free Google spreadsheet you can use for your use. Please make a copy and avoid requesting edit access.
Thanks for reading.
Sharing is caring.
Tweet @stanabk if facing any issues.