Every time you post a new article or upload a new video, you are creating a new object on the web. Google will index its URL, and Facebook will add it to the Open Graph.
Keeping an inventory of your web objects helps you gauge how large your website is (the more content, the better) and check how it is indexed by Google, Facebook, or any other social network.
Scrapy is a pretty useful Python library for crawling domains. I’ve modified Kevin Jacobs’ implementation to extract only the links and titles from the domain you want to scan.
Here are the installation steps:
pip3 install scrapy
scrapy startproject links_mapper
cd links_mapper
scrapy genspider leocelis leocelis.com
Then you need to modify items.py and leocelis.py (or whatever you named your spider).
Once you are done, you can run:
scrapy crawl leocelis -o links.csv -t csv
This last command generates a links.csv file listing all the URLs and titles from the domain you’ve specified. (In recent Scrapy versions the output format is inferred from the .csv extension, so the -t csv flag is optional.)
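To size your inventory from that export, a few lines of stdlib Python are enough. Here I parse a CSV with the same url,title shape the spider emits — the sample rows are made up, and in practice you would pass open("links.csv") to the reader instead:

```python
import csv
from io import StringIO

# Stand-in for open("links.csv"): two hypothetical rows in url,title form
sample = StringIO(
    "url,title\n"
    "https://leocelis.com/,Home\n"
    "https://leocelis.com/about/,About\n"
)

rows = list(csv.DictReader(sample))
print(f"{len(rows)} web objects found")  # → 2 web objects found
for row in rows:
    print(row["url"], "-", row["title"])
```

The row count is your web-object total; you can then spot-check individual URLs in Google or Facebook’s debugger to see how each one is indexed.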