GSOC: Search Module: how to evade the 30 second time limit?

  • Hi all,

    For those who don't know, I'm currently doing the Google Summer of Code project on improving the search capabilities of Zikula. See here for a link to the project as a whole: http://community.zikula.org/downloads/SOC2008/search.pdf.

    I'm currently working on the crawler, i.e. a class which simulates a web user and goes around the website picking up links to index later. It is essentially done. The crawler works by repeatedly sending HTTP requests to the server, then analyzing the returned HTML. For instance, it will start with zikula.org by sending the HTTP request http://zikula.org/, then scan the resulting HTML, pick out more links from it, and iteratively send HTTP requests to the links it finds.

    After analyzing the performance, I found that picking out links from a page takes about 3/100ths of a second. Initialising the connection itself takes even less time. Not too much of a problem.
    However, each HTTP request takes about 0.5 seconds, even though it's not really going anywhere. This is not too surprising, but 0.5 seconds is quite crippling. As far as I can tell, there's no way to lower this, as it's just the time it takes to run whatever script is called, and that is out of my control.
    So with PHP scripts having a limit of 30 seconds, and not all servers being Unix with cron jobs available, what is the best way to solve this?

    My idea is a bit scrappy, and I'm not really sure it will work, but essentially it is as follows. Your opinions on its viability and whether it would work in all environments would be appreciated:

    - Have a block that uses Ajax to run the crawler periodically (Mark West's scheduler module?)
    - Use a timer to make the script run for 20 seconds. When the timer runs out, save the state to the database, then have the script call itself again before 30 seconds elapse (I guess with an HTTP request).
    - Continue doing this until the crawl reaches completion.
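
    As a rough illustration of the run-for-20-seconds-then-resume idea, here is a minimal sketch. It is written in Python rather than PHP purely for illustration; `crawl_slice`, `fetch_links`, and the shape of the `state` dictionary are hypothetical names, not part of Zikula. The caller is assumed to store the returned state somewhere persistent (for example a database row) and pass it back in on the next request.

```python
import time
from collections import deque


def crawl_slice(state, fetch_links, budget_seconds=20.0, clock=time.monotonic):
    """Crawl queued URLs until the time budget is used up, then return the new state.

    state       -- dict with "pending" (URLs still to fetch) and "seen" (URLs already queued)
    fetch_links -- hypothetical callable: takes a URL, returns the links found on that page
    """
    pending = deque(state["pending"])
    seen = set(state["seen"])
    deadline = clock() + budget_seconds

    # Stop as soon as the queue is empty or the budget has run out.
    while pending and clock() < deadline:
        url = pending.popleft()
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                pending.append(link)

    # Plain lists so the state can be serialized and stored between runs.
    return {"pending": list(pending), "seen": sorted(seen), "done": not pending}
```

    Each run picks up exactly where the previous one stopped, so the only per-run overhead is loading and saving the two lists. The budget check happens between pages, so a single slow page can still overrun the budget, which is why 20 seconds rather than 30 leaves some headroom.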

    There are a lot of overheads in there, including saving and then reloading the state. I would be worried about the server slowing down a little with this script basically running constantly (each page requires 0.5 seconds, so a crawl of a reasonably sized website would take several minutes at least, and larger websites could take hours).
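
    To put rough numbers on that, here is a small back-of-the-envelope calculation. It only uses the figures above (0.5 seconds per request, 20-second runs); the function name `estimate_crawl` and the 1,000-page example are made up for illustration, and the state save/load overhead is ignored.

```python
import math


def estimate_crawl(pages, seconds_per_request=0.5, slice_seconds=20.0):
    """Return (total request time in seconds, number of 20-second runs needed)."""
    pages_per_slice = max(1, int(slice_seconds // seconds_per_request))
    total_seconds = pages * seconds_per_request
    slices = math.ceil(pages / pages_per_slice)
    return total_seconds, slices


# Hypothetical 1,000-page site: 500 seconds of requests spread over 25 runs.
total, runs = estimate_crawl(1000)
```

    So at 40 pages per run, a 1,000-page site needs about 8 minutes of request time, and a 100,000-page site needs roughly 14 hours.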

    I would greatly appreciate feedback in the following areas:

    - Would this work? Any advice on implementing it well?

    - Any other totally different ideas to solve the problem?

    Thanks for your time!
  • You might like to have a look at the search index rebuild script we did for our forum module PNphpBB2; the code is here. Simply put, after 100 HTTP requests we save the result to the database, pause, and then continue. That's a fine way to keep it running even on high-traffic sites, since it does not put too much load on the server.

    Greetings,
    Chris

    --
    an operating system must operate
    development is life
    my repo
  • Thank you Chris! :)
  • Take the following for what it's worth. I take a pretty different approach to site development and management than most. But, for what it's worth, here is an opinion from someone whose philosophy is that laziness, not necessity, is the mother of all invention:

    I personally think the best way to get around the 30-second time limit is, first and foremost, to make the best possible use of that 30 seconds. I think maybe a way to start would be to create a hook that queues modified and newly created pages in hook-aware content modules to be crawled and only crawl the queued pages within hook-aware modules, assuming that the rest are exactly the same as the last time you looked at them. That would eliminate a fair number of unproductive requests. This assumes, of course, that you offer the administrator the ability to do a full hand-initiated crawl of his/her site after your module is installed. Since the hand-initiated crawl of the entire site will presumably only be necessary once (or at least very, very infrequently), you don't have to give too much consideration to things like server load. An admin worth his weight in dog dirt should have the sense to perform operations such as this during off-peak hours.

    Obviously all modules aren't hook-aware, so you'll have to crawl those on some regular schedule. But you can probably even make the most of this by creating a block with no visible output that queues those pages in hook-unaware modules that are actually visited. How many people create a new content item or edit an existing one and then don't look at it? So give priority to those pages that have actually been visited since your last crawl, assuming that they are the most likely to have changed.

    Then let's assume that you determine you can safely send 100 http requests and have time to record the results without timing out. If you know that since your last crawl you have:

    14 new or modified pages from hook-aware modules
    28 visited pages from hook-unaware modules

    That's 42 priority requests, leaving you 58 to allocate in whatever way makes the most sense to you. In this way, you can slow your crawler way down. I might even get a little funky with it and use the block to determine when to actually initiate a crawl. A site that gets 100 visits a day probably doesn't need to be crawled continuously. I might go with the "unlucky visitor" model, wherein the invisible block, while it's sitting there doing very little, checks the size of the queue. When the queue reaches, say, 42 priority requests, it initiates a crawl by sending an HTTP request back to my module via a 1px iframe, which shouldn't significantly hurt the page load time for the visitor. If it takes, on average, 1000 unique page visits to a site to get the queue to that level, what are the odds that the same user will twice experience the lag caused by being the unlucky visitor who initiates the crawl? What I like about this approach is that the crawl rate is determined by how often a crawl is actually required. You can even put a nifty JavaScript slider in the admin menu to allow the admin to adjust the crawl rate if he/she feels like the crawler is operating too much or too little. The default queue trigger level sits at the middle of the slider, with a sensible min and max at either end.
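
    The allocation and trigger logic described above could be sketched roughly like this. It is Python for illustration only; `plan_crawl`, `should_trigger`, the three queue arguments, and the default values are hypothetical and simply reuse the example numbers (a budget of 100 requests, a trigger level of 42).

```python
def plan_crawl(hook_queue, visited_queue, backlog, budget=100):
    """Pick up to `budget` URLs for one crawl run, in priority order.

    hook_queue    -- pages reported as new/modified by hook-aware modules
    visited_queue -- pages in hook-unaware modules visited since the last crawl
    backlog       -- everything else, used only to fill the remaining budget
    """
    plan = []
    planned = set()  # guards against the same URL appearing in two queues
    for source in (hook_queue, visited_queue, backlog):
        for url in source:
            if len(plan) >= budget:
                return plan
            if url not in planned:
                planned.add(url)
                plan.append(url)
    return plan


def should_trigger(priority_queue_size, threshold=42):
    """True when the 'unlucky visitor' page load should kick off a crawl."""
    return priority_queue_size >= threshold
```

    With 14 hook-aware pages and 28 visited pages queued, the plan starts with those 42 and fills the other 58 slots from the backlog. The threshold is the value the admin slider would adjust.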

    I used a scheme similar to this on an e-commerce site to index added / removed / modified products. I found that the magic number in my case was something closer to 15,000 page loads between triggers, meaning that for someone to be the "unlucky visitor" twice within any reasonably memorable period of time would make them the sort of person you don't want to stand next to during a meteor shower. And hey, even the biggest and best sites bog down every once in a while. One slow page load will always be forgiven if your site has anything worth looking at (and all of our sites are worth looking at, right?).

    --
    Help Now! Fast and affordable help for do-it-yourself webmasters from Wicked Viral :: Chicago's Only Web Development Firm Specializing in Social Networking Integration
  • Blocks? The concept sounds good enough. But from a support standpoint, it will be nothing but problems. Having users maintain a 'do nothing' block, making sure they have it available for use and that it isn't put in an unused block zone, is potentially very problematic. This type of system would be very confusing to new users who are already overwhelmed with a new system. I don't think search should be something that is maintained at the 'presentation' level.

    Maybe as a search enhancement, but not a core functionality.

    Just my concerns. ;)

    --
    David Pahl
    Zikula Support Team
  • Yeah, maybe a block isn't the right way to go. I just used that because it had already been mentioned as an idea. I'm still in the .764 mindset when it comes to blocks, but now it's pretty easy to have a different set of blocks for any number of pages. So what about the same basic functionality in a theme plugin? Slap it just before the close of the body tag in all user templates (just master.htm and home.htm for most people). Of course, if everyone designed themes like I do, they could just put it in common/footer.htm, since I strip out any elements that are certain to be common to every template and include them instead of repeating 85% of the HTML in a bunch of templates.

    My thinking is that there has to be a way to initiate a crawl automatically without the use of cron jobs (since we've accepted that there are places that people host that don't offer them). And the best way I can think of to do that without having a process constantly running is to have it triggered by an event. The most obvious event that is universal to all sites is a user loading a page. I know some of the same concerns still apply. But it is sort of a set-it-and-forget-it setup.

    My next thought hit me over a beer just before bed. Why use HTTP requests at all to follow links you find? Sure, that's the traditional way to crawl a site, but we're talking about a Zikula site here. If you're writing for the system, take advantage of the system's API. It's not difficult to determine the module, function type, function, and parameters from the URL of the link. Wouldn't simply using pnModFunc to output the content of the page into a variable and reading it for links be a far more efficient way to go? I'm envisioning something like this...

    URL Found
    Parse the URL for module, type, func, and params
    Run pnModFunc to get content.
    - On error, assume it's not Zikula content and that's why it failed, so send an HTTP Request to get the content.
    Now you have the content. Grab all links and start over.

    I'm pretty sure that you can do that dozens of times in the .5 seconds it takes to send a single HTTP request.
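
    A rough sketch of that parse-then-dispatch flow, in Python for illustration only. It assumes links use the query-string form `index.php?module=...&type=...&func=...` and that a missing `type` or `func` falls back to `user` and `main`; both are assumptions, and short URLs would need their own parsing. `call_module` is a hypothetical stand-in for pnModFunc and `http_get` for the crawler's existing HTTP fetch.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_module_url(url):
    """Return (module, type, func, params) for a module-style URL, else None."""
    query = dict(parse_qsl(urlsplit(url).query))
    module = query.pop("module", None)
    if module is None:
        return None  # no module parameter, so probably not module content
    func_type = query.pop("type", "user")  # assumed default
    func = query.pop("func", "main")       # assumed default
    return module, func_type, func, query  # whatever is left are the params


def get_content(url, call_module, http_get):
    """Try the in-process module call first; fall back to a real HTTP request."""
    target = parse_module_url(url)
    if target is not None:
        try:
            return call_module(*target)
        except Exception:
            # On error, assume it is not module content after all.
            pass
    return http_get(url)
```

    One thing to watch with this approach: a direct module call returns only that module's output, so links that live in the theme or in blocks would not be seen unless at least one page is fetched over HTTP.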



  • dreamingmonkey wrote:

    "Of course, if everyone designed themes like I do, they could just put it in common/footer.htm, since I strip out any elements that are certain to be common to every template and include them instead of repeating 85% of the HTML in a bunch of templates."

    We exist :D I even released a stupid holiday one as an example. :D BlankTheme does this too, though not to the extent I think it should, but it does more than the traditional home/master/module setup. Code once, reuse!

    I do think using the API is a better solution than a template-level component. But this starts to move outside what I am qualified to comment on. :D But I am interested to see how this goes.

    --
    David Pahl
    Zikula Support Team
