A web crawler is a program, typically run by a search engine, that looks for significant paths or links starting from the index page of a website. This process is called web crawling or spidering.
Web crawlers can be used to gather specific kinds of information from web pages, such as e-mail addresses (usually harvested for spam) or links that point to a specific website or app of your choice. Crawlers can also automate maintenance tasks on a website, such as checking links or validating HTML code.
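As a rough illustration of the link-gathering step described above, here is a minimal Python sketch built on the standard library's `html.parser`. The class name `LinkExtractor`, the helper `extract_links`, and the sample page are assumptions made for this example; they are not taken from the demo itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative paths against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the given HTML, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical index page, used only for illustration.
sample_html = '<a href="/about">About</a> <a href="https://example.org/">Ext</a>'
links = extract_links(sample_html, "https://example.com/index.html")
print(links)
```

A real crawler would fetch the HTML over HTTP and feed each discovered link back into a queue; that part is left out here to keep the sketch self-contained.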
In the tutorial demo above, I created a simple web crawler that does the following:
The system allows the user to submit URLs via a form. After submission, the URLs are saved to a MySQL table.
Once the URLs are saved, the user can see them in a table view and can also delete them. Every URL has the default status “new”. When the crawling of a URL is completed, its status changes to “done”.
During the crawling process, the status is set to “crawling”. The status of each URL is visible in the table, and the user can filter by status.
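The URL storage and status handling described so far could be sketched as follows. This is only an approximation: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL, and the table name `urls`, its columns, and the helper functions are assumptions rather than the demo's actual schema.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; table and column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " url TEXT NOT NULL,"
    " status TEXT NOT NULL DEFAULT 'new')"
)


def add_url(url):
    """Save a submitted URL; the status defaults to 'new'."""
    cur = conn.execute("INSERT INTO urls (url) VALUES (?)", (url,))
    conn.commit()
    return cur.lastrowid


def set_status(url_id, status):
    """Move a URL to another status, e.g. 'crawling' or 'done'."""
    conn.execute("UPDATE urls SET status = ? WHERE id = ?", (status, url_id))
    conn.commit()


def delete_url(url_id):
    """Remove a URL from the table."""
    conn.execute("DELETE FROM urls WHERE id = ?", (url_id,))
    conn.commit()


def list_urls(status=None):
    """Return all URLs, or only those with the given status."""
    if status is None:
        rows = conn.execute("SELECT id, url, status FROM urls ORDER BY id")
    else:
        rows = conn.execute(
            "SELECT id, url, status FROM urls WHERE status = ? ORDER BY id",
            (status,),
        )
    return rows.fetchall()


first = add_url("https://example.com/")
second = add_url("https://example.org/")
set_status(first, "crawling")
print(list_urls("new"))
```

Parameterized queries (the `?` placeholders) keep user-submitted URLs from being interpreted as SQL, which matters because the values come straight from a form.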
The result of each crawl is stored in the database table “urls_metrics”. When the metrics cannot be fetched (e.g. if the URL is offline), the URL status changes to “crawling failed”. If the URL doesn’t have Google Analytics, the Google Analytics result is set to n/a. The system also allows the user to fetch all URLs with the status “new”, “crawling”, or “done”.
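The outcome handling for a single crawl might look roughly like the Python sketch below. Several things here are assumptions: the function name `crawl`, the shape of the metrics dictionary, the injected `fetch` callable, and above all the Google Analytics check, which is a rough heuristic that searches the HTML for something shaped like a tracking ID. The demo may well detect Google Analytics differently.

```python
import re

# Rough heuristic (an assumption): match a Universal Analytics or GA4-style ID.
GA_ID_PATTERN = re.compile(r"\b(UA-\d{4,}-\d+|G-[A-Z0-9]{6,})\b")


def crawl(url, fetch):
    """Return (status, metrics) for one URL.

    `fetch` is any callable that returns the page HTML, or raises OSError
    when the URL cannot be reached.
    """
    try:
        html = fetch(url)
    except OSError:
        # Metrics cannot be fetched, e.g. the URL is offline.
        return "crawling failed", None
    match = GA_ID_PATTERN.search(html)
    metrics = {
        "url": url,
        "google_analytics": match.group(1) if match else "n/a",
    }
    return "done", metrics


def fake_fetch(url):
    """Hypothetical fetcher with canned pages, so the sketch runs offline."""
    pages = {
        "https://example.com/": "<script>gtag('config', 'UA-12345-1');</script>",
        "https://example.org/": "<p>No tracking here</p>",
    }
    if url not in pages:
        raise OSError("host unreachable")
    return pages[url]


for target in ("https://example.com/", "https://example.org/", "https://offline.invalid/"):
    print(target, crawl(target, fake_fetch))
```

Passing the fetcher in as an argument keeps the status logic testable without network access. A real fetcher built on `urllib.request.urlopen` fits the same contract, since `urllib.error.URLError` is a subclass of `OSError`.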