Search This Blog

Wednesday, July 4, 2018

Implementing a Multithreaded Spider

When developing my website view generator, I wanted to enhance the application by adding a spider that crawls the given site; with this method, the user will get better results when checking their website's views.  However, I ran into a quick problem.  The classic web spider that ran with a single thread via iteration or recursion is simply to slow (especially recursion).  Hence, I decided to implement my own multithreaded spider (you can check the code out on my WVG repo on Github - just download LinkQueue.java, LinkGetter.java, and Spider.java but as of now, the code is messy and needs to be improved).  In this post, I will be sharing with you what I did to implement this extremely fast type of web spider.
Before I start, I decided to make the spider's depth be dependent on the threads and make the user provide a link limit (the link limit will not be used exactly but will prevent the multithreaded spider from running on too many links).  Also, before any threads start, there will be the first thread which only spiders the given link.  However, all the threads should start simultaneously (in a for loop with a thread array) and a thread should only end if the thread before it in the thread array has ended - all the threads will process the links in their given queue (I implemented the queues with linked lists), which are constantly updated with links if the previous threads are alive - the previous threads will be adding the links they find to the following thread's queue.  With this method, you can have many threads run simultaneously, constantly checking their specific queues for new links to spider.
In my program, I added new links to an ArrayList and the next thread's queue (the next thread's spider will then spider those links) and parsed the links retrieved from the webpages' html via regex ("<a\\s+(?:[^>]*?\\s+)?href+=[\"']([^\"']*)['\"]" with case ignored); although regex is not the best method, it does take up less code and space while also working for most cases.
This explanation may seem very complicated, so here is a general diagram.
This diagram will suffice for all threads except for the first one (the only difference for the first one is that it spiders only one link and has no queue of links).

I hope this explanation will help you understand how to implement such a tool now.  Also, Happy 4th of July!

No comments:

Post a Comment