Every time you post a new article or upload a new video, you create a new object on the web. Google will index its URL, and Facebook will add it to its Open Graph.

Keeping an inventory of your web objects helps you size your website (the more content, the better) and check how it is indexed by Google, Facebook, or any other social network.

Scrapy is a pretty useful Python library for crawling domains. I’ve adapted Kevin Jacobs’ implementation to extract only the links and titles from the domain you want to scan.

Here are the installation steps:

pip3 install scrapy
scrapy startproject links_mapper
cd links_mapper
scrapy genspider leocelis leocelis.com

Then you need to modify items.py and leocelis.py (or whatever you named your spider).

Once you are done, you can run:

scrapy crawl leocelis -o links.csv -t csv

This last command will generate a links.csv file with a list of all the URLs and titles from the domain you’ve specified. (On recent Scrapy versions, the -t csv flag is optional: the output format is inferred from the file extension passed to -o.)
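To actually keep track of the inventory afterwards, the file can be loaded with Python’s standard csv module. A minimal sketch, assuming the two CSV columns are named link and title as in the item above:

```python
import csv


def load_inventory(path):
    """Read the links.csv produced by the crawl into a list of dicts,
    one per crawled page (keys come from the CSV header row)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Example usage:
#   pages = load_inventory("links.csv")
#   print(len(pages), "pages in the inventory")
#   for page in pages[:5]:
#       print(page["link"], "-", page["title"])
```

From there, you can diff the list between runs to see which objects are new, or feed the URLs to an indexing checker.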



Hi! My name is Leo Celis. I’m an entrepreneur and Python developer specializing in Ad Tech and MarTech.
