Posted by mossy2100 on November 13, 2011 at 12:06am
Hi all
This is a little project I've been working on for the past few days: http://drupalfinder.com
It's just a web crawler that looks for Drupal sites, although you can use it to search any sites in the database by search string or host.
So far it's found nearly 1000 Drupal sites. The web crawler is powered by a cron job so there will be thousands more automatically added over the coming days.
Please let me know if you have any ideas or feedback.
At the moment it's running on a shared host in Texas (hostgator) so it's a little slow - if anyone would like to host for me on a nice fast server, please let me know!
Cheers,
Mossy

Comments
Nice idea
You might want to check your algorithm - I don't think ABC.net.au is Drupal!
don't forget
You need to check the "Drupal sites only" checkbox. I will try to make it clearer.
Very nice.
Smooth. Very nice.
Any chance of a feed? Especially for "Drupal sites hosted in Australia"?
Was doing social distancing before it was cool.
good idea
I will add it to the list :)
Thanks
Is it appropriate for me to offer a bounty for that feature?
sure
of course!
Well then.
When you get the time to progress an Aussie feed I will seed a bounty.
Might be good if you get a paypal donation widget happening somewhere? Else let me know your Paypal Email and I will donate that way. Else I will contact you to arrange another way.
Seems that to deliver all (identified) Aussie Drupal sites, feeds could be for:
a. Drupal Sites hosted in Australia.
b. Drupal site Host ends with .au or .au.com (other possibilities?).
c. a or b.
possible solutions
I was thinking about this a bit yesterday after you suggested it. I think the best approach could be to make the feed customisable, if such a thing is possible (I don't have much experience with RSS). So, someone could conduct a search, then select "Create a feed from this search", then they would get a URL like http://drupalfinder.com/rss.xml?host_pattern=ends_with&host=.au (for example). In my mind this would work, but I have to try it.
It's a good idea about the PayPal widget. I'm also going to put some Google Ads on there, which should help to fund further development - that is, if people use the site! It remains to be seen if people will find value in it, although I think it's very useful.
The main problem at the moment is the speed of the crawler: it's very slow, but I know this is because it's on a shared host. Hence the low number of sites in the database (although it is climbing slowly).
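For anyone curious how the "feed from a search" idea above could hang together, here is a rough Python sketch. It is not DrupalFinder's code (the site itself is PHP). Only the `rss.xml` URL and the `host_pattern=ends_with&host=.au` parameters come from the post; the function names and the `starts_with` and `contains` patterns are assumptions made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

FEED_BASE = "http://drupalfinder.com/rss.xml"  # example URL from the post


def build_feed_url(host_pattern, host):
    """Turn a search filter into a feed URL carrying the same parameters."""
    return FEED_BASE + "?" + urlencode({"host_pattern": host_pattern, "host": host})


def filter_from_url(feed_url):
    """Recover (host_pattern, host) from a feed URL built as above."""
    query = parse_qs(urlsplit(feed_url).query)
    return query["host_pattern"][0], query["host"][0]


def host_matches(hostname, host_pattern, host):
    """Apply the filter to one hostname. Only 'ends_with' appears in the
    post; the other two pattern names are hypothetical."""
    hostname, host = hostname.lower(), host.lower()
    if host_pattern == "ends_with":
        return hostname.endswith(host)
    if host_pattern == "starts_with":
        return hostname.startswith(host)
    if host_pattern == "contains":
        return host in hostname
    raise ValueError(f"unknown host_pattern: {host_pattern!r}")
```

The feed handler would then read the same two parameters from the request and run the ordinary search with them, so a search and its feed always return the same sites.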
done?
Hey John
Have a look at http://drupalfinder.com, run a search, grab the Atom feed link and see if it's what you want.
Shaun
Nice!
What algorithm do you use to determine a site is a drupal site?
regex
There are lots of regular expressions used throughout. To determine if it's Drupal I use a regex from Wappalyzer, which basically looks for drupal.js or Drupal.settings. Have a read of the About page :)
Groetjes
Shaun
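As a rough illustration of that kind of check, here is a minimal Python sketch. The pattern below is a simplified stand-in that looks for the two markers mentioned above; it is not the actual Wappalyzer regex, and the function name is made up for the example.

```python
import re

# Simplified stand-in, NOT the real Wappalyzer pattern: it only looks for
# the two markers mentioned in the comment above.
DRUPAL_HINT = re.compile(r"drupal\.js|Drupal\.settings")


def looks_like_drupal(html):
    """Return True if the page source contains either Drupal marker."""
    return DRUPAL_HINT.search(html) is not None
```

A check like this can miss sites that aggregate or rename their JavaScript, so it is best read as a hint rather than proof.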
Hosting available
Looks good Shaun... we can offer you hosting on one of our servers. Please contact me directly.
--
Chidium
www.openquarter.com
Another nice-to-have
Might be a tag cloud type categorising of scraped sites, just at time of scraping. Might make for some more interesting search options.
elaborate
Hi Ben, can you please explain a bit more what you mean?
Just imagining
I take it you are already scraping with a script that handles new node inserts into the Drupal db. Once you've retrieved the page content, you could also parse it for important keywords (density, meta tags, whatever you would look at for SEO, I imagine) and insert a few matching tags into Drupal (taxonomy in tags mode).
Then you could just use a tag cloud module to display it.
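A very rough Python sketch of the keyword step, assuming plain word frequency is good enough to start with. The stopword list, the three-letter minimum and the function name are all made up for the example; real keyword extraction would want more care.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "from"}


def extract_tags(html, limit=3):
    """Return the most frequent words in a page as candidate tags."""
    text = re.sub(r"<[^>]+>", " ", html)        # crude tag stripping
    words = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]
```

The returned words could then be saved as taxonomy terms at scrape time, which is what would feed the tag cloud.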
nice ...
nice concept!
Sree
merging entries
Could you combine/merge entries in the index so that there's only 1 match for www.site.com and site.com?
http://www.slv.vic.gov.au/
yes
It's on the todo list but hasn't been a priority as yet.
Cheers,
Mossy
Nice concept!!! Absolutely agree!!!!
Thanks!
Thanks for the positive feedback! I'm about to move it to a new VPS so it should run a bit faster and find sites a bit faster too.
Cheers,
Mossy
for your interest
DrupalFinder has now found over 10,000 Drupal sites :)
http://drupalfinder.com/stats.php
Cool, but...
That's cool mossy!
But when I tried to find mine I got:
Fatal error: Call to a member function fetchObject() on a non-object in /home/mossy/public_html/drupalfinder.com/inc/db.inc on line 276
really???
oops :) I have never seen that bug! I will check it out.
Finding and feeding.
And it is currently feeding into the top of http://drupal.com.au/
Thanks Shaun.
Smooth operation.
not drupal?
Cool site mossy! You didn't use Drupal for this?
Oh just read the about page. Looks like the stats page is getting slow though, database query? That could use some caching probably ....
yep
Yes, now with over 3 million rows it needs a little TLC to tighten it up! The search page is slow because it's scanning the database to find all the countries and Drupal versions, which is easy to cache. The stats could be updated once an hour.
I didn't do it as Drupal initially because it started out just simple, but I think I will convert it to a Drupal site so I can leverage a few features such as site stats and the contact form, etc.
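The "update once an hour" idea can be sketched as a small time-based cache. This is a Python illustration only, not the site's PHP code; the class name, the `compute` callback and the injectable clock are assumptions made for the example.

```python
import time


class TimedCache:
    """Keep a computed value for ttl_seconds, then recompute on next use."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._entries = {}           # key -> (stored_at, value)

    def get(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: skip the slow query
        value = compute()            # stale or missing: run the slow query
        self._entries[key] = (now, value)
        return value
```

With `TimedCache(3600)`, the expensive countries/versions scan would run at most once an hour per key, and every other page load would reuse the stored result.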
updates
Hey
I've added some caching so the Search and Stats pages load much faster now.
I've also addressed the issue that @richardhayward mentioned, so example.com and www.example.com are merged now. The spider will first test the site without the 'www', and if it can't load it then it tries the host with the 'www'.
Unfortunately after making this change I had to re-examine about 100,000 sites, so there are a lot fewer Drupal sites appearing in the search results now. But it should catch up in a few days.
Cheers,
Mossy
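The "try without www first, then with www" rule described above can be sketched in Python like this. It is an illustration, not the spider's real code; `can_load` is a hypothetical callback standing in for whatever actually fetches the page.

```python
def strip_www(host):
    """Lower-case the host and drop a leading 'www.' if present."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def canonical_host(host, can_load):
    """Pick the single host to store for a site.

    can_load is a hypothetical callable that returns True when the home
    page at the given host loads. The bare host is tried first and the
    'www.' variant is the fallback, as described in the post above.
    """
    bare = strip_www(host)
    if can_load(bare):
        return bare
    with_www = "www." + bare
    if can_load(with_www):
        return with_www
    return None  # neither variant responded
```

Storing only the returned host means site.com and www.site.com end up as one row in the index.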
What does your crawler identify as? We have some problems with crawlers because of some pages we can't cache, so it would be good to know how to recognise yours.
Although, you probably don't cause much of a problem because you don't index the whole site, yeah?
crawler user agent
The user agent was just 'spider' (because that was in the code I originally borrowed) but you've prompted me to change it, so it's now 'DrupalFinder'.
You're right, it doesn't index the whole site, just the home page. The crawler just requests the site without any path, i.e. http://example.com and never http://example.com/blah/blah.php. That's why it doesn't identify Drupal sites running in sub-directories. It's much simpler that way.
The only time it doesn't do that is when you submit a link. Then it will grab that link's page and scan it for links to other sites.
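A small Python sketch of the "home page only" behaviour and the new user agent, for illustration. The 'DrupalFinder' user agent string comes from the comment above; the function names are made up, and dropping any port number is a simplification chosen for the example.

```python
from urllib.parse import urlsplit
from urllib.request import Request

USER_AGENT = "DrupalFinder"  # user agent string mentioned in the comment above


def home_page_url(url):
    """Reduce any link to scheme + host, i.e. the site's home page."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    # Simplification: any port, path, query or fragment is dropped.
    return f"{parts.scheme}://{parts.hostname}"


def build_request(url):
    """Prepare (but do not send) a home-page request with the crawler's UA."""
    return Request(home_page_url(url), headers={"User-Agent": USER_AGENT})
```

Site owners who want to spot the crawler in their logs would then filter on the `DrupalFinder` user agent.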
robots.txt
@mossy2100 Do you honour the robots.txt and the robot meta tags in the page?
@sime Do you have a robots.txt to exclude the pages that cause you issues?
no
I don't look at those things at the moment.
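If robots.txt support were added later, Python's standard library already has a parser for it. The sketch below only shows the check itself and is not something DrupalFinder does today; the function name and the sample rules are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules a site owner might publish.
SAMPLE_RULES = "\n".join([
    "User-agent: DrupalFinder",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /private/",
])
```

With the sample rules, the crawler would skip the site entirely when identifying as DrupalFinder, while other agents would only be kept out of /private/.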
@dave Yes, but I haven't looked closely enough to see whether certain crawlers are honouring it, because we have enough problems on pages we want indexed. It's not the crawler's fault.
take drupalfinder.com
Hi
Would anyone like to take this little project over? I think it's quite useful, but I don't have time to work on it and I think it would be a shame to waste it.
It's running quite slow on a shared host atm, with the crawler only running for 5 minutes out of every 15 due to server constraints. However, RedyHost have been kind enough to provide a new hosting environment - the site just needs to be moved over, and there needs to be a RedyHost logo/ad/link displayed somewhere.
You would need PHP/MySQL skillz, and the site would benefit from being converted to Drupal. Some database tuning would be a good idea as well.
All I ask in return is a gift such as an itunes voucher or a nice steak, and a logo/ad/link from the site back to IWDA. You get the code, database, domain name and my list of ideas for future dev.
selling it now
Hi
I've had a few responses already, and I've decided to sell drupalfinder.com rather than give it away. Sorry if that's annoying! But I put a fair bit of work into it, the code is nice and clean, it has value to the Drupal community, and you can easily make the money back on Google ads. The price is $900 + GST, which is significantly less than the cost of the time I put into it.
Just a word of warning - make sure you have at least a little bit of time for it! The mistake I made was thinking that it would be a one-off simple programming exercise, then ended up working on it for a few days straight, then every weekend for a month... It's just one of those projects! But if you genuinely want to develop it and make it a valuable part of the Drupal ecosystem, then it will be worth your while.
Again apologies for adding a price. If I was rich I wouldn't have. But it does separate the people who are acting impulsively from those who are serious about it. If the price seems too high, feel free to make an offer.
Cheers,
Mossy
Supporting this in a small way.
Just adding that the app has received a small amount of financial support from me, and there is some potential for that to occur again.
The feed from drupalfinder has been featured prominently at the top of drupal.com.au for a while.
I periodically swap the top content block between this feed and the feed from Drupal Downunder.
I observe that whichever feed is at the top of the drupal.com.au home page does currently define the description text that google picks out for the drupal.com.au listing. That listing has been the first or near first on a search for 'drupal' for quite some years. 'Drupal Australia' as a keyword term is currently being optimised for and is currently also at or near the top of google page one.
The point being that, all going well, whoever takes over Drupal Finder also gets, as a bonus, excellent exposure for themselves via drupal.com.au.
Naturally, over the next month Drupal Downunder will get the top spot on the home page.
Ta.
When the Drupal Downunder feed is top of home page.
Google text at, or near, top of its listing for 'Drupal':
How Google constructs description text for drupal downunder.
Google constructs its description by selecting text from the blurb about Drupal Downunder, next to the feed, which currently reads:
When the Drupal Finder feed is at the top of the home page
Here is the exposure the app owner is receiving:
There is no promise that level, or any level, of exposure continues. Ta.
How Google constructs description text for Drupal Finder
Google constructs its description by selecting text from the blurb about Drupal Finder and Web Development Academy, next to the feed, which currently reads: