Caterpillar is a PHP class intended for website crawling
and screen scraping. It handles parallel requests using a
modified and wrapped version of Josh Fraser's Rolling Curl
library, which uses PHP's curl_multi_*() functions efficiently.
Unlike most other curl_multi implementations, where you must wait
for the whole set of requests to complete before processing the batch, Rolling Curl
processes each request as soon as it has completed. This eliminates wasted
CPU cycles due to busy waiting. The library also has a queue implementation
for lining up future crawler requests. This ensures that the number of links
being crawled at any given time is as close to the maximum as possible.
Because requests are handled in parallel, the requests that complete first trigger the enqueuing of any newly found URLs, so the crawler runs continuously and efficiently. Rolling Curl caps the number of simultaneous connections so that you do not inadvertently mount a denial-of-service (DoS) attack against the requested host.
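The rolling-window idea described above can be sketched directly against PHP's curl_multi_* functions. This is a minimal illustration, not Rolling Curl's or Caterpillar's actual code: the function name rollingFetch, its parameters, the 30-second timeout, and the callback contract (return an array of newly found URLs) are all assumptions made for the example.

```php
<?php
/**
 * Hypothetical sketch: fetch URLs with at most $maxConcurrent transfers in
 * flight, calling $onComplete($url, $status, $body) as soon as each transfer
 * finishes. Any URLs the callback returns are queued if not seen before.
 * Returns the number of completed transfers.
 */
function rollingFetch(array $urls, int $maxConcurrent, callable $onComplete): int
{
    $maxConcurrent = max(1, $maxConcurrent);
    $queue = new SplQueue();
    $seen = [];
    foreach ($urls as $url) {
        if (!isset($seen[$url])) {
            $seen[$url] = true;
            $queue->enqueue($url);
        }
    }

    $multi = curl_multi_init();
    $active = 0; // transfers currently in flight
    $done = 0;   // transfers completed so far

    while ($active > 0 || !$queue->isEmpty()) {
        // Top the window back up to the connection cap.
        while ($active < $maxConcurrent && !$queue->isEmpty()) {
            $ch = curl_init($queue->dequeue());
            curl_setopt_array($ch, [
                CURLOPT_RETURNTRANSFER => true,
                CURLOPT_FOLLOWLOCATION => true,
                CURLOPT_TIMEOUT        => 30, // assumed value
            ]);
            curl_multi_add_handle($multi, $ch);
            $active++;
        }

        curl_multi_exec($multi, $running);
        // Sleep until there is socket activity instead of spinning.
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // select can fail immediately on some systems
        }

        // Handle every transfer that has finished, one at a time.
        while ($info = curl_multi_info_read($multi)) {
            $ch = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
            $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            $body = (string) curl_multi_getcontent($ch);
            curl_multi_remove_handle($multi, $ch);
            $active--;
            $done++;

            foreach ((array) $onComplete($url, $status, $body) as $found) {
                if (!isset($seen[$found])) {
                    $seen[$found] = true;
                    $queue->enqueue($found);
                }
            }
        }
    }

    // Handles are released when they go out of scope.
    return $done;
}
```

A caller would pass a starting URL, a connection cap, and a callback that parses the body for links, for example rollingFetch(['https://example.com/'], 5, $callback). The cap is the knob that keeps the crawler polite toward the target host.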
Given a starting URL, Caterpillar will crawl the entirety of a website's internal pages, indexing (sitemapping) each page it hits. When it encounters a link on a page, it checks for the link's existence in the database and either inserts the link or updates its inbound link count. It also creates a contenthash for each page to better determine when the page was last modified. Caterpillar can easily be used to facilitate the generation of a Google Sitemap XML file.
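The bookkeeping described above (insert-or-increment for links, a content hash for change detection, and sitemap output) can be sketched as follows. This is an illustration only: the PageIndex class, its method names, the changed_at field, and the choice of md5 are assumptions, and an in-memory array stands in for the MySQL table that Caterpillar actually uses.

```php
<?php
/** Hypothetical in-memory stand-in for the crawler's MySQL table. */
final class PageIndex
{
    /** @var array<string, array> keyed by URL */
    private array $pages = [];

    private function row(string $url): array
    {
        return $this->pages[$url]
            ?? ['inbound' => 0, 'contenthash' => null, 'changed_at' => null];
    }

    /** Insert the link if it is new, otherwise bump its inbound count. */
    public function recordLink(string $url): void
    {
        $row = $this->row($url);
        $row['inbound']++;
        $this->pages[$url] = $row;
    }

    /** Store a hash of the page body; returns true if the content changed. */
    public function recordContent(string $url, string $html, int $now): bool
    {
        $row = $this->row($url);
        $hash = md5($html); // md5 is an assumption; any stable hash works
        $changed = ($row['contenthash'] !== $hash);
        if ($changed) {
            $row['contenthash'] = $hash;
            $row['changed_at'] = $now;
        }
        $this->pages[$url] = $row;
        return $changed;
    }

    public function inboundCount(string $url): int
    {
        return $this->row($url)['inbound'];
    }

    /** Emit a sitemaps.org urlset for every page that has been crawled. */
    public function toSitemapXml(): string
    {
        $xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
             . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
        foreach ($this->pages as $url => $row) {
            if ($row['changed_at'] === null) {
                continue; // linked to, but never fetched
            }
            $xml .= '  <url><loc>'
                  . htmlspecialchars((string) $url, ENT_QUOTES | ENT_XML1, 'UTF-8')
                  . '</loc><lastmod>' . gmdate('Y-m-d', $row['changed_at'])
                  . '</lastmod></url>' . "\n";
        }
        return $xml . '</urlset>' . "\n";
    }
}
```

Calling recordContent() twice with the same body returns true and then false, so the stored timestamp only moves when the page content actually changes. In a database-backed version, recordLink() would typically map to a single insert-or-update statement.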
Caterpillar requires a small amount of legwork on your part to get up and running
because it stores its data in MySQL. Note that crawling a website
can be a memory-intensive activity. For that reason, you are advised to bump
up the PHP memory_limit to suit your needs.
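One way to raise the limit for a single crawl script, rather than globally in php.ini, is a call to ini_set() at the top of the script. The 512M value below is purely illustrative and not a recommendation from the project; pick a value based on the size of the site you are crawling.

```php
<?php
// Raise the memory limit for this script only (512M is an assumed value).
$previous = ini_set('memory_limit', '512M');

if ($previous === false) {
    // Some hosts lock this setting; fall back to editing php.ini.
    error_log('Could not change memory_limit at runtime.');
}
```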
Import the caterpillar.sql file into the database of your choice.
Ensure the database user has the CREATE TABLE, DROP TABLE, and TEMPORARY TABLES privileges.
Downloads are available via GitHub. The decision is all yours:
git clone git@github.com:cballou/caterpillar.git
git clone https://github.com/cballou/caterpillar.git
wget https://github.com/cballou/caterpillar/archive/master.zip
wget https://github.com/cballou/caterpillar/archive/master.tar.gz
If you have any problems with Caterpillar, please file a ticket/issue/bug on GitHub and I will attempt to address it at my earliest convenience.
Caterpillar Issues on GitHub
Caterpillar is licensed under the MIT License.
The MIT License is simple and easy to understand and it places almost no restrictions on what you can do with Caterpillar.
You are free to use Caterpillar in commercial projects as long as any copyright headers and license file are left intact.
contenthash - The hashed page content for checksumming.
filesize - The page filesize.
last_update - Timestamp used for deletion of removed pages after 2 weeks missing.
last_tested - The last time the page was crawled.
Corey Ballou is a full-stack web applications developer in Charlotte, NC with 9+ years of professional experience. He holds a bachelor's degree in Computer Science and has been working remotely since 2012. He specializes in LAMP/LEMP stack development with Laravel and WordPress. Corey is the owner and principal consultant at Craft Blue, a custom web applications development consultancy. He's also the co-organizer of the Queen City PHP meetup group in Charlotte. He is an entrepreneur, blogger, open source contributor, beer lover, startup advocate, chicken wrangler, hydroponics gardening dabbler, and homebrewer.