URL | Inserted at | Status | HTML Title | External Links | Google Analytics? | Click to crawl URL | Actions
http://www.thelastcodebender.com/ | 3 months ago | done | TheLastCodeBender | 35 | n/a | Crawl Url | Delete
http://www.wikipedia.org | 3 months ago | done | Wikipedia | 288 | n/a | Crawl Url | Delete
http://Www.ccvitaal.nl | 3 months ago | done | geregistreerd via Argeweb | 1 | n/a | Crawl Url | Delete

What is a Web Crawler?

A web crawler is a program, typically run by a search engine, that visits web pages and follows the significant paths or links it finds, usually starting from the index page of a website. This process is called web crawling or spidering.

Web crawlers can be used to gather specific kinds of information from web pages, such as harvesting e-mail addresses (usually for spam) or collecting the addresses that link to a specific website or app of your choice. Crawlers can also be used to automate maintenance tasks on a website, such as checking links or validating HTML code.

In the tutorial demo above, I created a simple web crawler that does the following:

  • 1. Fetch the HTML title of the given URL.
  • 2. Fetch the number of external links of the URL.
  • 3. Check if the URL has Google Analytics included in it.
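The three checks above can be sketched roughly as follows. This is not the code from the repository (which is PHP); it is a minimal Python illustration using only the standard library. The names `PageMetricsParser`, `normalize_host` and `extract_metrics` are made up for this example, and the Google Analytics check is an assumed heuristic that simply looks for well-known script hosts in the page source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageMetricsParser(HTMLParser):
    """Collects the <title> text and every <a href> value from a page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def normalize_host(host):
    """Lower-case a hostname and drop a leading 'www.' (an assumed convention)."""
    host = (host or "").lower()
    return host[4:] if host.startswith("www.") else host


def extract_metrics(page_url, html):
    """Return the title, external link count and a Google Analytics flag."""
    parser = PageMetricsParser()
    parser.feed(html)

    own_host = normalize_host(urlparse(page_url).hostname)
    external = 0
    for href in parser.hrefs:
        # Resolve relative links against the page URL before comparing hosts.
        target = urlparse(urljoin(page_url, href))
        if target.scheme in ("http", "https") and normalize_host(target.hostname) != own_host:
            external += 1

    # Heuristic only: look for common Google Analytics script hosts.
    ga_markers = ("google-analytics.com", "googletagmanager.com/gtag/js")
    return {
        "title": "".join(parser.title_parts).strip(),
        "external_links": external,
        "google_analytics": any(marker in html for marker in ga_markers),
    }
```

The function takes the HTML as a string, so the page could be downloaded with something like `urllib.request.urlopen` first. Whether a `www.` variant of the same host counts as external is a design choice; here it is treated as internal.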

The system allows the user to insert URLs via a form. After submission, the URLs are saved to a MySQL table.

After the URLs are saved, the user can see them in a table view. The user can also delete URLs. Every URL has a default status called “new”. Whenever the crawling of a URL is completed, the status changes to “done”.
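As a rough illustration of saving, listing and deleting URLs with a default status of “new”, here is a small Python sketch. It uses the standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a server. The table name `urls`, its columns, and the helper names `open_db`, `add_url`, `list_urls` and `delete_url` are all assumptions made for this example, not the schema used in the repository.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (assumed) urls table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS urls (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               url TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'new',
               inserted_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def add_url(conn, url):
    """Save a submitted URL; the status column defaults to 'new'."""
    cur = conn.execute("INSERT INTO urls (url) VALUES (?)", (url.strip(),))
    conn.commit()
    return cur.lastrowid


def list_urls(conn):
    """Return the rows shown in the table view."""
    return conn.execute("SELECT id, url, status FROM urls ORDER BY id").fetchall()


def delete_url(conn, url_id):
    """Remove a URL by its id."""
    conn.execute("DELETE FROM urls WHERE id = ?", (url_id,))
    conn.commit()
```

The queries use `?` placeholders rather than string concatenation, so form input is never interpolated directly into SQL.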

During the crawling process, the status is changed to “crawling”. Inside the table, the status of each URL is visible and the user can filter by status.

The result of each crawl is stored in the database table “urls_metrics”. When it is not possible to fetch the metrics (e.g. if the URL is offline), the URL status changes to “crawling failed”. If the URL doesn’t have Google Analytics, the Google Analytics result is set to n/a. The system also allows the user to fetch all URLs with the status “new”, “crawling” or “done”.
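The status flow described above (“new”, then “crawling”, then either “done” or “crawling failed”) could look roughly like the Python sketch below. Again `sqlite3` stands in for MySQL. Only the table name “urls_metrics” comes from this post; the `urls` table, all column names, the stored value "yes" for a detected Google Analytics tag, and the `fetch_metrics` callable are assumptions for illustration.

```python
import sqlite3

# Assumed schema: only the name "urls_metrics" is taken from the post.
SCHEMA = """
CREATE TABLE urls (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'new'
);
CREATE TABLE urls_metrics (
    url_id INTEGER NOT NULL REFERENCES urls(id),
    title TEXT,
    external_links INTEGER,
    google_analytics TEXT
);
"""


def set_status(conn, url_id, status):
    conn.execute("UPDATE urls SET status = ? WHERE id = ?", (status, url_id))
    conn.commit()


def crawl_url(conn, url_id, fetch_metrics):
    """Crawl one URL and record the outcome.

    fetch_metrics is a hypothetical callable that takes a URL and returns a
    dict with "title", "external_links" and "google_analytics" keys, and
    raises an exception when the page cannot be fetched.
    """
    (url,) = conn.execute("SELECT url FROM urls WHERE id = ?", (url_id,)).fetchone()
    set_status(conn, url_id, "crawling")
    try:
        metrics = fetch_metrics(url)
    except Exception:
        # Any fetch error (e.g. the site is offline) marks the URL as failed.
        set_status(conn, url_id, "crawling failed")
        return False
    conn.execute(
        "INSERT INTO urls_metrics (url_id, title, external_links, google_analytics) "
        "VALUES (?, ?, ?, ?)",
        (
            url_id,
            metrics["title"],
            metrics["external_links"],
            "yes" if metrics["google_analytics"] else "n/a",
        ),
    )
    set_status(conn, url_id, "done")  # also commits the metrics row
    return True


def urls_with_status(conn, status):
    """Filter URLs by status, e.g. 'new', 'crawling' or 'done'."""
    rows = conn.execute("SELECT url FROM urls WHERE status = ? ORDER BY id", (status,))
    return [row[0] for row in rows]
```

Passing the fetcher in as a parameter keeps the status handling separate from the network code, which makes the flow easy to try out with a fake fetcher.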

You can download the full code from my GitHub account: https://github.com/suleigolden/webcrawler

NOTE: The GitHub version is developed using my custom PHP MVC Framework. Learn more about my custom PHP MVC Framework here

You can send me a mail if you need the Laravel version; I will be happy to send it to you for free. suleimamman@gmail.com