Guides & Tutorials
Step-by-step tutorials for getting the most out of crawler.sh.
How to Preprocess Web Content for RLHF Training Pairs
A step-by-step guide to crawling web content, cleaning it, and structuring it into preference pairs for RLHF reward model training.
How to Fetch a Single Page with CLI
Learn how to fetch a single URL using crawler.sh CLI without crawling the entire site. Get clean output with smart, path-based filenames.
How to Integrate crawler.sh into MLOps Pipelines
Learn how to use crawler.sh CLI in MLOps workflows to collect training data, validate documentation sites, and automate web crawling in CI/CD pipelines.
How to Find Orphan Pages on a Website with CLI
Learn how to detect orphan pages with zero incoming internal links using crawler.sh CLI. Identify isolated pages and fix your internal linking.
How to Crawl Data to Train an AI Model with CLI
Learn how to crawl website content and extract clean Markdown for AI training datasets using crawler.sh CLI. Export structured data for LLM fine-tuning.
How to Find Long Content with CLI
Learn how to detect pages with over 5,000 words using crawler.sh CLI. Find excessively long pages that may need to be split for better user experience and SEO.
How to Find Empty H1 Tags with CLI
Learn how to detect pages with empty H1 tags using crawler.sh CLI. Find headings that contain no text and fix them to improve SEO and page structure.
How to Find Broken Links on a Website with CLI
Learn how to detect broken links and dead pages on any website using crawler.sh CLI. Crawl your site, identify 4xx/5xx errors, and export a report.