drupalfinder.com

mossy2100's picture

Hi all

This is a little project I've been working on for the past few days: http://drupalfinder.com

It's just a web crawler that looks for Drupal sites, although you can use it to search any sites in the database by search string or host.

So far it's found nearly 1000 Drupal sites. The web crawler is powered by a cron job so there will be thousands more automatically added over the coming days.

Please let me know if you have any ideas or feedback.

At the moment it's running on a shared host in Texas (HostGator) so it's a little slow - if anyone would like to host it for me on a nice fast server, please let me know!

Cheers,
Mossy

Comments

Nice idea

gollyg's picture

You might want to check your algorithm - I don't think ABC.net.au is Drupal!

don't forget

mossy2100's picture

You need to check the "Drupal sites only" checkbox. I will try to make it clearer.

Very nice.

jdsaward's picture

Smooth. Very nice.

Any chance of a feed? Especially on "Drupal sites hosted in Australia"?

Was doing social distancing before it was cool.

good idea

mossy2100's picture

I will add it to the list :)

Thanks

jdsaward's picture

Is it appropriate for me to offer a bounty for that feature?

sure

mossy2100's picture

of course!

Well then.

jdsaward's picture

When you get the time to progress an Aussie feed I will seed a bounty.

Might be good if you get a PayPal donation widget happening somewhere? Else let me know your PayPal email and I will donate that way. Else I will contact you to arrange another way.

Seems that to deliver all (identified) Aussie Drupal sites, feeds could be for:

a. Drupal Sites hosted in Australia.
b. Drupal site Host ends with .au or .au.com (other possibilities?).
c. a or b.
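
The three feed options above boil down to a filter over the known sites. Here is a minimal Python sketch, assuming each site record carries a host name and a hosting-country code. The `Site` class and its field names are made up for illustration and are not DrupalFinder's actual schema, and only the `.au` suffix is checked (other suffixes would be added the same way).

```python
from dataclasses import dataclass


@dataclass
class Site:
    # Hypothetical record shape, not DrupalFinder's real schema.
    host: str             # e.g. "example.com.au"
    hosting_country: str  # assumed ISO country code of the server, e.g. "AU"


def hosted_in_australia(site: Site) -> bool:
    """Option a: the server is located in Australia."""
    return site.hosting_country.upper() == "AU"


def has_au_host(site: Site) -> bool:
    """Option b: the host name ends with .au (covers .com.au, .net.au, ...)."""
    return site.host.lower().rstrip(".").endswith(".au")


def is_aussie(site: Site) -> bool:
    """Option c: a or b."""
    return hosted_in_australia(site) or has_au_host(site)
```

Option c catches Australian-hosted sites on generic domains as well as `.au` sites hosted overseas, so it is probably the broadest useful feed.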

possible solutions

mossy2100's picture

I was thinking about this a bit yesterday after you suggested it. I think the best approach could be to make the feed customisable, if such a thing is possible (I don't have much experience with RSS). So, someone could conduct a search, then select "Create a feed from this search", then they would get a URL like http://drupalfinder.com/rss.xml?host_pattern=ends_with&host=.au (for example). In my mind this would work, but I have to try it.
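
To make the "feed from a search" idea concrete, here is a rough Python sketch of building such a URL and applying its filter to a host name. Only the `ends_with` pattern and the `rss.xml` URL come from the post; the other pattern names (`starts_with`, `equals`, and the `contains` fallback) and both function names are assumptions for illustration, not how the site actually works.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

FEED_BASE = "http://drupalfinder.com/rss.xml"  # URL from the post above


def build_feed_url(host_pattern: str, host: str) -> str:
    """Turn search parameters into a feed URL with a query string."""
    query = urlencode({"host_pattern": host_pattern, "host": host})
    return f"{FEED_BASE}?{query}"


def host_matches(feed_url: str, candidate: str) -> bool:
    """Apply the filter encoded in a feed URL to one host name."""
    params = parse_qs(urlsplit(feed_url).query)
    # Pattern names other than "ends_with" are assumed for this sketch.
    pattern = params.get("host_pattern", ["contains"])[0]
    needle = params.get("host", [""])[0].lower()
    candidate = candidate.lower()
    if pattern == "ends_with":
        return candidate.endswith(needle)
    if pattern == "starts_with":
        return candidate.startswith(needle)
    if pattern == "equals":
        return candidate == needle
    return needle in candidate
```

Because the whole search lives in the query string, the feed needs no stored state on the server: anyone can bookmark or share the URL.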

It's a good idea about the PayPal widget. I'm going to put some Google Ads on there, that should help to fund further development - that is, if people use the site! It remains to be seen if people will find value in it, although I think it's very useful.

The main problem at the moment is the speed of the crawler: it's very slow, but I know this is because it's on a shared host. Hence the low number of sites in the database (although it is climbing slowly).

done?

mossy2100's picture

Hey John

Have a look at http://drupalfinder.com, run a search, grab the Atom feed link and see if it's what you want.

Shaun

Nice!

rdeboer's picture

What algorithm do you use to determine a site is a drupal site?

regex

mossy2100's picture

There are lots of regular expressions used throughout. To determine if it's Drupal I use a regex from Wappalyzer, which basically looks for drupal.js or Drupal.settings. Have a read of the About page :)

Groetjes
Shaun
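
For anyone curious what that kind of check looks like, here is a minimal Python sketch. The pattern below is a stand-in that looks for the two markers named above; it is not the actual Wappalyzer regex, and the function name is made up for illustration.

```python
import re

# Hypothetical pattern: NOT the real Wappalyzer rule, just the two
# markers mentioned in the post (drupal.js and Drupal.settings).
DRUPAL_MARKERS = re.compile(r"drupal\.js|Drupal\.settings")


def looks_like_drupal(html: str) -> bool:
    """Return True if the page source contains either Drupal marker."""
    return DRUPAL_MARKERS.search(html) is not None
```

A check like this will miss sites that aggregate or rename their JavaScript, and it could in principle be fooled by a page that merely quotes those strings, so it is a heuristic rather than proof.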

Hosting available

chidium's picture

Looks good Shaun... we can offer you hosting on one of our servers. Please contact me directly.

Another nice-to-have

benhelps's picture

Might be a tag cloud type categorising of scraped sites, just at time of scraping. Might make for some more interesting search options.

elaborate

mossy2100's picture

Hi Ben, can you please explain a bit more what you mean?

Just imagining

benhelps's picture

Take it you are already scraping with a script that handles new node inserts into drupal db. While you've retrieved the page content you could also parse it for important keywords (density, meta tags, whatever - like as for SEO I imagine) and also insert into drupal for a few matching tags (taxonomy in tags mode).

Then you could just use a tag cloud module to display it.
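
A rough Python sketch of the keyword idea, assuming plain word frequency is good enough for a first pass. The stopword list is a tiny illustrative sample, the tag stripping is deliberately crude (it does not skip script or style contents), and `suggest_tags` is a made-up name.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "you", "your", "from"}


def suggest_tags(html: str, limit: int = 5) -> list[str]:
    """Return the most frequent non-stopword words on a page as tag candidates."""
    text = re.sub(r"<[^>]+>", " ", html)            # crude removal of HTML tags
    words = re.findall(r"[a-z]{3,}", text.lower())  # words of 3+ letters
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]
```

The resulting words could then be stored as free-tagging terms at crawl time, which is what the tag cloud would be built from.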

nice ...

Sree's picture

nice concept!

Sree

merging entries

richardhayward's picture

Could you combine/merge entries in the index so that there's only 1 match for www.site.com and site.com?

yes

mossy2100's picture

It's on the todo list but hasn't been a priority as yet.

Cheers,
Mossy

Nice concept !!! Absolutely

Marasco's picture

Nice concept !!! Absolutely agree!!!!

Thanks!

mossy2100's picture

Thanks for the positive feedback! I'm about to move it to a new VPS so it should run a bit faster and find sites a bit faster too.

Cheers,
Mossy

for your interest

mossy2100's picture

DrupalFinder has now found over 10,000 Drupal sites :)
http://drupalfinder.com/stats.php

Cool, but...

rdeboer's picture

That's cool mossy!

But when I tried to find mine I got:

Fatal error: Call to a member function fetchObject() on a non-object in /home/mossy/public_html/drupalfinder.com/inc/db.inc on line 276

really???

mossy2100's picture

oops :) I have never seen that bug! I will check it out.

Finding and feeding.

jdsaward's picture

And it is currently feeding into the top of http://drupal.com.au/

Thanks Shaun.

Smooth operation.

not drupal?

sime's picture

Cool site mossy! You didn't use Drupal for this?

Oh just read the about page.

sime's picture

Oh just read the about page. Looks like the stats page is getting slow though, database query? That could use some caching probably ....

yep

mossy2100's picture

Yes, now with over 3 million rows it needs a little TLC to tighten it up! The search page is slow because it's scanning the database to find all the countries and Drupal versions, which is easy to cache. The stats could be updated once an hour.

I didn't do it as Drupal initially because it started out just simple, but I think I will convert it to a Drupal site so I can leverage a few features such as site stats and the contact form, etc.
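
The hourly refresh mentioned above is a classic time-to-live cache. Here is a minimal in-memory Python sketch of the idea; the class and method names are made up, and a real site would more likely cache in a file or a database table than in process memory.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Keep computed values for a fixed number of seconds, then recompute."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._store: dict[str, tuple[float, Any]] = {}

    def get(self, key: str, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]              # still fresh: skip the slow query
        value = compute()              # stale or missing: run the slow query
        self._store[key] = (now, value)
        return value
```

With `TTLCache(3600)`, the expensive scan for countries and Drupal versions would run at most once an hour, and every other page view would reuse the stored result.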

updates

mossy2100's picture

Hey

I've added some caching so the Search and Stats pages load much faster now.

I've also addressed the issue that @richardhayward mentioned, so example.com and www.example.com are merged now. The spider will first test the site without the 'www', and if it can't load it then it tries the host with the 'www'.

Unfortunately after making this change I had to re-examine about 100,000 sites, so there are a lot fewer Drupal sites appearing in the search results now. But it should catch up in a few days.

Cheers,
Mossy
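
The merge logic described above can be sketched in Python roughly like this. The function names are hypothetical, and `can_load` stands in for whatever the spider really does to fetch a page; it is passed in so the sketch stays independent of any network code.

```python
from typing import Callable, Optional


def canonical_host(host: str) -> str:
    """Normalise a host name and drop a leading 'www.'."""
    host = host.strip().lower().rstrip(".")
    return host[4:] if host.startswith("www.") else host


def resolve_host(host: str, can_load: Callable[[str], bool]) -> Optional[str]:
    """Try the bare host first, then the 'www.' variant; None if neither loads."""
    bare = canonical_host(host)
    for candidate in (bare, "www." + bare):
        if can_load(candidate):
            return candidate
    return None
```

Storing sites under `canonical_host(...)` gives one index entry per site, whichever form the crawler first came across.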

What does your crawler

sime's picture

What does your crawler identify as? We have some problems with crawlers because of some pages we can't cache, so it would be good to know how to recognise yours.

Although, you probably don't

sime's picture

Although, you probably don't cause much of a problem because you don't index the whole site yeah?

crawler user agent

mossy2100's picture

The user agent was just 'spider' (because that was in the code I originally borrowed) but you've prompted me to change it, so it's now 'DrupalFinder'.

You're right, it doesn't index the whole site, just the home page. The crawler just requests the site without any path, i.e. http://example.com and never http://example.com/blah/blah.php. That's why it doesn't identify Drupal sites running in sub-directories. It's much simpler that way.

The only time it doesn't do that is when you submit a link. Then it will grab that link's page and scan it for links to other sites.
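
As an illustration of the home-page-only behaviour, here is a small Python sketch that reduces any URL to its site root and attaches the user agent. The 'DrupalFinder' string is from the post; the function names are made up, and defaulting to `http://` when no scheme is given is an assumption.

```python
from urllib.parse import urlsplit
from urllib.request import Request

USER_AGENT = "DrupalFinder"  # user agent string named in the post


def home_page_url(url: str) -> str:
    """Strip path, query and fragment, keeping only scheme and host."""
    # Assumption: bare hosts such as "example.com" are treated as http.
    parts = urlsplit(url if "://" in url else "http://" + url)
    return f"{parts.scheme}://{parts.netloc}"


def build_request(url: str) -> Request:
    """Prepare (but do not send) a request for the site's home page."""
    return Request(home_page_url(url), headers={"User-Agent": USER_AGENT})
```

Site owners who want to spot the crawler in their logs would then look for that user agent on requests for the root path only.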

robots.txt

skwashd's picture

@mossy2100 Do you honour the robots.txt and the robot meta tags in the page?

@sime Do you have a robots.txt to exclude the pages that cause you issues?

no

mossy2100's picture

I don't look at those things at the moment.
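
For reference, honouring robots.txt does not take much code. This is a minimal Python sketch using the standard library's `urllib.robotparser`; it is not something DrupalFinder does today, it covers only robots.txt and not the robots meta tag, and the function name is made up.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check one URL against the text of a site's robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler would fetch `/robots.txt` once per host, pass its text in here along with its own user agent, and skip the home page if the answer is False.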

@dave yes but i haven't

sime's picture

@dave yes, but I haven't looked closely enough to see if certain crawlers are honouring it, because we have enough problems on pages we want indexed. It's not the crawler's fault.

take drupalfinder.com

mossy2100's picture

Hi

Would anyone like to take this little project over? I think it's quite useful, but I don't have time to work on it and I think it would be a shame to waste it.

It's running quite slow on a shared host atm, with the crawler only running for 5 minutes out of every 15 due to server constraints. However, RedyHost have been kind enough to provide a new hosting environment - the site just needs to be moved over, and there needs to be a RedyHost logo/ad/link displayed somewhere.

You would need PHP/MySQL skillz, and the site would benefit from being converted to Drupal. Some database tuning would be a good idea as well.

All I ask in return is a gift such as an iTunes voucher or a nice steak, and a logo/ad/link from the site back to IWDA. You get the code, database, domain name and my list of ideas for future dev.

selling it now

mossy2100's picture

Hi

I've had a few responses already, and I've decided to sell drupalfinder.com rather than give it away. Sorry if that's annoying! But I put a fair bit of work into it, the code is nice and clean, it has value to the Drupal community, and you can easily make the money back on Google ads. The price is $900 + GST, which is significantly less than the cost of the time I put into it.

Just a word of warning - make sure you have at least a little bit of time for it! The mistake I made was thinking that it would be a one-off simple programming exercise, then ended up working on it for a few days straight, then every weekend for a month... It's just one of those projects! But if you genuinely want to develop it and make it a valuable part of the Drupal ecosystem, then it will be worth your while.

Again apologies for adding a price. If I was rich I wouldn't have. But it does separate the people who are acting impulsively from those who are serious about it. If the price seems too high, feel free to make an offer.

Cheers,
Mossy

Supporting this in a small way.

jdsaward's picture

Just adding that the app has received a small amount of financial support from myself, and there is some possible potential for that to occur again.

The feed from drupalfinder has been featured prominently at the top of drupal.com.au for a while.

I periodically swap the top content block between this feed and the feed from Drupal Downunder.

I observe that whichever feed is at the top of the drupal.com.au home page does currently define the description text that google picks out for the drupal.com.au listing. That listing has been the first or near first on a search for 'drupal' for quite some years. 'Drupal Australia' as a keyword term is currently being optimised for and is currently also at or near the top of google page one.

The point being - that all going well - whoever takes over Drupal Finder also as a bonus ~~possibly also~~ gets excellent exposure through to themselves via drupal.com.au.

Naturally over the next month ~increasingly~ Drupal Downunder will get the top spot on the home page.

Ta.

When the Drupal Downunder feed is top of home page.

jdsaward's picture

Google text at, or near, top of its listing for 'Drupal':

drupal.com.au | Promoting Drupal in Australia
www.drupal.com.a The feed is provided into drupal.com.au via Drupal Downunder. Drupal Downunder is an Australian-run, non-profit conference for the people who write, use and ...

jdsaward's picture

Google constructs its description by selecting text from the blurb about Drupal Downunder, next to the feed, which currently reads:

The feed is provided into drupal.com.au via Drupal Downunder

Drupal Downunder is an Australian-run, non-profit conference for the people who write, use and support the Drupal Content Management System.

Drupal Downunder will feature sessions and panels from some of the most influential people and brightest minds within the local Drupal community.

"Bringing web professionals and enthusiasts together to connect, learn and create."

13-15 January 2012 at the Jasper Hotel, Melbourne.

When the Drupal Finder feed is at the top of the home page

jdsaward's picture

Here is the exposure the app owner is receiving:

drupal.com.au | Promoting Drupal in Australia
www.drupal.com.au 2 days ago – The feed is provided into drupal.com.au courtesy of the developer of ... Web Development Academy, which is a provider of Drupal Training. ...

There is no promise that level, or any level, of exposure continues. Ta.

How Google constructs description text for Drupal Finder

jdsaward's picture

Google constructs its description by selecting text from the blurb about Drupal Finder and Web Development Academy, next to the feed, which currently reads:

The feed is provided into drupal.com.au courtesy of the developer of DrupalFinder, Shaun Moss.

Shaun also runs International Web Development Academy, which is a provider of Drupal Training.

"We provide expert web development training to businesses, universities and government departments all over Australia."
