First release of http://www.drupalcalifornia.com

Posted by coderintherye on December 15, 2010 at 7:36pm

I am pleased to announce the launch of http://www.drupalcalifornia.com (running on Drupal 7).
The site is designed to be a showcase for Drupal sites on California university campuses (especially the CSU and UC systems).

I hope you will understand that the site is a work in progress, and it needs your help! You can help by adding your site to the Participant List wiki page or by sending me a message.

Everyone can participate in improving the site, from something as simple as helping draft an improved mission statement to something as complex as writing custom modules or creating a custom site theme. If you are a member of this group, just contact me and I will give you a content administrator role on the site. If you are not a member but are using Drupal at a California university, you are welcome to contact me as well (faculty, staff, and students only; while we respect the work of consultants, we also see a need to keep this internal).

On a side note, Stanford is currently overrepresented because they had a large list of Drupal sites. Other schools, definitely submit a list of all your sites to the Participant List so you can get more representation.

Some questions I compiled while putting this together:

Site layout design, love it/hate it? Comment here (Using Views with views random seed and scraped data)
Should the names of the developers/authors behind the site be included?
Any limit on # of sites to showcase per University?
Should the main showcase only show one site per university and then use taxonomy/tags to drill down to further sites from them?
How do people feel about automated scans? I'm thinking I could probably write a scraper that will cross-reference all listed California university domain names along with certain parameters to try to tease out all Drupal sites available, then automatically grab screenshots of them. It's all publicly available info so I imagine this shouldn't be a problem, but if anyone has objections to that, let me know. As it is, I used the existing site lists that have been posted in order to scrape (using a custom module) what you see here.
If you would like your site showcased, please add it to the wiki list: http://groups.drupal.org/node/64018
Alternatively, if you have a lot of sites and want to send them to me in a CSV or other format, you can do so by messaging me via the contact form here on groups.drupal.org.
We are planning on a possible system-wide UC/CSU Drupal Camp. If you have an interest in this, please send me a message.

Thanks everyone, I look forward to working with you on promoting Drupal. Cheers.

Comments

So some quick thoughts.

Posted by redndahead on December 15, 2010 at 8:10pm

So some quick thoughts. Thanks for putting this together; hopefully it will be a great tool for evaluators.

Site design-wise, looks fine. Too many sites on the home page for me.
No names of devs or authors. They will change and you can't keep up.
No limit, but they should be sectioned off by university.
No automated scans. Some people may not be ready for their site to be listed.
Add the ability to do a site showcase where universities can write up their experience building one of their particular sites.
If this is for CIOs, then at some point add a section showing why Drupal over others.

Congrats.

Adam

Thanks Adam

Posted by coderintherye on December 15, 2010 at 8:16pm

That is really useful feedback. Excellent suggestion on the ability to do a site showcase with a writeup. I think I'll create a webform for it and just manually approve submissions.

The audience is two-fold: the primary audience is CIOs and other executives who want to see what Drupal is capable of, and the secondary audience is the Drupal developers themselves, who want some recognition for their hard work. So doing a writeup on why Drupal is also a good suggestion. I think I can cobble something together from past presentations and available info on d.o., but then I will probably create a wiki page here for the writeup so that others can make edits if desired.

Thanks :)

Drupal evangelist.
www.CoderintheRye.com

I'm pro-scanning

Posted by bwood on December 16, 2010 at 3:00am

redndahead> # No automated scans. Some people may not be ready for their site to be listed.

If your site is live, why wouldn't you want it listed? It wouldn't necessarily be "featured" on drupalcalifornia and I doubt much traffic would be driven to the site given how many sites will be there.

I'm pro-scanning because we can't depend on people to list their sites, and listing all known sites makes this resource much more valuable. See my comments below.

Okay, so let's hear more of the argument against scanning.

Cross posting my response on the bdug list:

Great work Kevin! This is an excellent first step!

We are planning on a possible system-wide UC/CSU Drupal Camp. If you
have an interest in this, please send me a message

I'm interested.

Thanks for the mention on the About page.

Site layout design, love it/hate it? Comment here (Using Views with
views random seed and scraped data)

Like it.

Should the names of the developers/authors behind the site be
included?

Useful. Quickly see who did that site with feature X. What other sites did they do?

Any limit on # of sites to showcase per University?

Infinite!

Should the main show case only show one site per university and then
use taxonomy/tags to drill down to further sites from them?

I like the drill down feature. I think the default of showing all sites is okay. How about just exposing filters?

How do people feel about automated scans? I'm thinking I could
probably write a scraper that will cross-reference all listed
California university domain names along with certain parameters to try
to tease out all Drupal sites available, then automatically grab
screenshots of them.

Scanner should follow best/politest practices. It should set a user-agent that references drupalcalifornia.com. Ideally it just hits the front page and then bails if it's not a Drupal site.
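A minimal sketch of the polite-request idea above, in Python. The user-agent string and the function name are made up for illustration; this is not the scanner discussed in the thread, only one way such a request could be built.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request

# Hypothetical user-agent string; the point is only that it names the project.
USER_AGENT = "drupalcalifornia-scanner (+http://www.drupalcalifornia.com)"


def build_frontpage_request(url):
    """Build a request for a site's front page only, with an identifying user-agent."""
    parts = urlsplit(url)
    # Drop any path, query string, and fragment so that only "/" is requested.
    front_page = urlunsplit((parts.scheme, parts.netloc, "/", "", ""))
    return Request(front_page, headers={"User-Agent": USER_AGENT})


request = build_frontpage_request("http://example.edu/some/page?x=1")
print(request.full_url)
```

Passing the result to urllib.request.urlopen(request, timeout=10) would perform the single front-page fetch; the scanner can then stop if the response does not look like Drupal.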

On the main page it would be cool to see "10,000 Drupal sites discovered in the CSU/UC system (plus Stanford.edu) as of 12/15/10." Elsewhere, a graph of # of live Drupal URLs by date would be sweet.

It would be pretty interesting to know which company hosts the sites that are not on university servers. (I'm not sure that UCB has a grasp on our externally hosted Drupal sites.) This would allow viewers of the site to see the popularity of different hosting companies. This is probably a lot more work though...

What method would you use to fingerprint drupal sites?

Brian

I can see what you're saying,

Posted by redndahead on December 16, 2010 at 7:30am

I can see what you're saying, but just because you can get to the site via a public URL doesn't mean you want it listed on a website that showcases Drupal sites. An example would be a temporary site you set up for a chancellor search, or something that is meant for internal people where you don't mind the chance of Google crawling it. I think that if given the opportunity, people would like to add their sites.

That's the thought process I was going through.

Yeah, and I do completely

Posted by coderintherye on December 16, 2010 at 7:38am

Yeah, and I do completely agree with that process, because I have gone through it myself many times, so I'm sure someone that just throws a site out there as a lesson of working with Drupal probably doesn't want it showcased but doesn't even realize that they have made it public.

The question I guess for me is, could it really do harm though? (Aside from our egos, I mean. My ego has certainly been hurt in cases where a manager sends out a link to a dev site that I didn't even mean to release yet and I get some criticisms on missing pieces and such.) And then the question is, if it can do harm, will that outweigh the potential benefits? Hard to say, because I'm not sure how to easily measure the benefits, but I can say it has been pretty neat discovering a lot of what is out there.

Drupal evangelist.
www.CoderintheRye.com

HTTP restrictions

Posted by bwood on December 16, 2010 at 6:19pm

Good discussion. Continuing in the spirit of debate:

All of my -dev and -qa sites only accept HTTP requests from internal UCB subnets and our VPN subnets. This is very easily done using either 1) firewall restrictions or 2) Apache "Deny from all" followed by "Allow from $subnet" in .htaccess or elsewhere. Proof-of-concept sites use the same methods.
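The deny-by-default, allow-by-subnet logic described above can be sketched in Python as follows. This is only an illustration of the check that a firewall or Apache performs; the subnets are placeholder documentation ranges, not real UCB networks, and the function name is hypothetical.

```python
import ipaddress

# Hypothetical internal/VPN subnets (RFC 5737 documentation ranges).
ALLOWED_SUBNETS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(client_ip):
    """Deny by default; allow only clients inside one of the listed subnets."""
    address = ipaddress.ip_address(client_ip)
    return any(address in subnet for subnet in ALLOWED_SUBNETS)


print(is_allowed("192.0.2.17"))   # inside the first subnet
print(is_allowed("203.0.113.5"))  # outside both subnets
```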

If there is some reason that a developer doesn't want to restrict access to their POC site, they really should at least throw a robots.txt in there. I believe that if Google finds the same content at multiple URLs it can hurt your rankings. Besides, it's like littering the web; it's not being a good citizen.

I would strongly discourage my peers at UCB from making POC/dev sites available publicly. If they are throwing a site up on Dreamhost or somewhere, are we sure that the file system permissions are correct on settings.php? Are we sure they didn't leave a database.sql where a crawler or browser can find it? Are we sure they are going to remember to take the site down at some point before some insecure module is discovered and suddenly staff-dev.berkeley.edu is selling Cialis and Viagra?

robots.txt

Posted by bwood on December 16, 2010 at 3:05am

The scanner would obey robots.txt, so site authors can opt out.

If their site is live with a permissive robots.txt, it's going to be indexed by Google, Bing, and Yahoo, so why not drupalcalifornia?
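As a rough illustration of the opt-out, here is how a scanner could consult robots.txt using Python's standard library. The user-agent name and the helper function are assumptions made for the example; in practice the robots.txt text would be fetched from the site first.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "drupalcalifornia-scanner"  # hypothetical scanner name


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


opt_out = "User-agent: *\nDisallow: /\n"
print(allowed_by_robots(opt_out, "http://example.edu/"))  # False: the site opted out
print(allowed_by_robots("", "http://example.edu/"))       # True: nothing is disallowed
```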

Thanks for the response

Posted by coderintherye on December 16, 2010 at 4:57am

Excellent comments as always, Brian. You do make some good points regarding why scraping is good. So that is 1 for and 1 against, and we'll continue to gather data points. One point I have in favor of it is that we can keep the data up-to-date automatically this way. That is always a tough thing to do, and the reality is that some sites will die off or move to another platform. From the oh, 100+ sites that were submitted to the wiki list and scraped to list here, already 5 of them (give or take) were out of commission. I also think being able to say x # of sites use Drupal at California universities would be a really good way to get CIOs on board, establishing that Drupal is a well-used and deployed product. That said, this is very much a community-oriented effort, so whatever the community finds most useful will be what ends up getting done. Either way, no automatic scraping will be done for now until there has been plenty of time for discussion around it.

As far as how we would go about fingerprinting sites, we can rely on a couple of techniques. I'll discuss them here because webchick has talked about how to fingerprint Drupal sites in the past, and that gives me confidence that it is valid to discuss it openly. Now, I already knew some, but thanks to adulmec for drupaldetect we have a list that we can reference easily:

Check for Drupal specific JavaScript inclusion
Checking the inclusion of /misc/drupal.js or /misc/drupal?x.js
Check for 'Expires' value in page header
Check for Drupal specific textfiles at page root
Check for Drupal specific paths

I'll add some that weren't mentioned in that list: we can also look for cached CSS/JS files (they follow a css_[a-z0-9].css pattern). We can also look in the CDATA for Drupal.settings, and finally, with a more recent version of CTools, we have a CToolsAjax string to look for in cached pages.
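A rough Python sketch of how a few of the checks above might be combined. The function names and the two-signal threshold are assumptions for illustration, the Expires value is the date Drupal core is known to send (the thread does not spell it out), and the regular expression is one reading of the cached-CSS pattern. It is not drupaldetect or the custom module mentioned here.

```python
import re

# Expires value that Drupal core sends on its pages (not stated in the thread).
DRUPAL_EXPIRES = "Sun, 19 Nov 1978 05:00:00 GMT"
# Assumed reading of the cached-CSS pattern: "css_", a run of [a-z0-9], then ".css".
AGGREGATED_CSS = re.compile(r"css_[a-z0-9]+\.css")


def drupal_signals(html, headers):
    """Return the names of the Drupal fingerprints found in a page and its headers."""
    normalized = {name.lower(): value for name, value in headers.items()}
    signals = []
    if "/misc/drupal.js" in html:
        signals.append("drupal.js")
    if normalized.get("expires") == DRUPAL_EXPIRES:
        signals.append("expires")
    if "Drupal.settings" in html:
        signals.append("Drupal.settings")
    if "CToolsAjax" in html:
        signals.append("CToolsAjax")
    if AGGREGATED_CSS.search(html):
        signals.append("aggregated css")
    return signals


def looks_like_drupal(html, headers, min_signals=2):
    """Require several independent signals to keep false positives low."""
    return len(drupal_signals(html, headers)) >= min_signals


sample_html = (
    '<script src="/misc/drupal.js?x"></script>'
    "<script>jQuery.extend(Drupal.settings, {});</script>"
)
sample_headers = {"Expires": DRUPAL_EXPIRES}
print(drupal_signals(sample_html, sample_headers))
```

Requiring more than one signal is a design choice aimed at the low false positive rate discussed in this comment; a single stray string in a page is not treated as proof.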

Then on the other side, we have to come up with a list of California universities. This is a little trickier, but there are some datasets out there already listing most of the top level domains. So we can start there and work our way around.

In a pinch, we may even be able to rely completely on Google search results (though we'll fingerprint fewer sites this way, and it's a bit debatable if the way I see Drupal sites through Google will continue to remain viable/available), but that has the benefit of not having to hit web servers directly.

Using these methods above and a little more finagling on my part, I'm confident we can have a scraper with an extremely low false positive rate. It will probably still miss some sites, but anyone who cares to show off a missed site can add it to the Participant List in the wiki =) And of course, if we were to ever scrape, there could be a flag link for each site in the showcase allowing someone to flag it for review/removal.

So, again thanks everybody for all the good feedback, I hope that when I get some time this weekend and over the break that I can iterate out some more features. In the meantime, feel free to keep adding sites to the participant list and I will check back in on it.

Drupal evangelist.
www.CoderintheRye.com

Another vote against

Posted by heatherwoz on December 16, 2010 at 6:08pm

Another vote against scraping. I think that quality is more important than quantity in this case, especially if this site is going to be used to make a case for Drupal with CIOs and administrators. I think there should be some measure of control over what sites are listed. As already mentioned, you don't want dev sites or short-term sites. And what if there is a setup where every student group has their own multisite, or every class has a site? I don't think you really want all those individual sites listed, and you can't necessarily count on the site admins to set up a robots.txt to exclude them. And what if an install gets hacked and is serving up bad sites from an edu domain? You don't want that either.

I don’t think you should list the names of developers on the site, because that information changes so often, and can be hard to track down, and sometimes with development teams it can be difficult to decide who should and shouldn't be listed.

It would be helpful to see a list of modules used, and where the site is hosted, although there could be some security concerns with releasing that information.

Thanks for putting this together, can't wait to see it grow.

Heather, good points

Posted by coderintherye on December 16, 2010 at 6:21pm

Heather, good points regarding multisite setups and not wanting to list all of them. I do agree with regard to developers that it's probably not best for me to try to manually put together a list of who maintains what site. I do think, though, that we could have something beneficial in the way of self-reporting which sites are yours (and then I could verify). Then perhaps have a rule that sends out an e-mail every 6 months asking the developer to confirm they still work on that site. If so, they reply, mailhandler pulls in the reply, and the field stays fresh. If there is no reply within a set time limit (say 1 month), the rule would automatically update that node with a blank field for the author link. Just an idea.
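To make the timing of that rule concrete, here is a small Python sketch of the decision it would make for each site. The function name, the action labels, and the day counts used for "6 months" and "1 month" are all assumptions; the actual rule would live in Drupal, with mailhandler processing the replies.

```python
from datetime import date, timedelta

CONFIRM_EVERY = timedelta(days=182)  # roughly 6 months (assumed)
REPLY_WINDOW = timedelta(days=30)    # roughly 1 month (assumed)


def author_field_action(last_confirmed, reminder_sent, today):
    """Decide what to do with a site's author link today.

    reminder_sent is the date of the unanswered reminder, or None. A reply is
    assumed to reset it to None and to update last_confirmed.
    """
    if reminder_sent is not None:
        if today - reminder_sent > REPLY_WINDOW:
            return "clear_author"
        return "wait_for_reply"
    if today - last_confirmed >= CONFIRM_EVERY:
        return "send_reminder"
    return "keep"


confirmed = date(2010, 12, 16)
print(author_field_action(confirmed, None, date(2011, 1, 16)))
print(author_field_action(confirmed, date(2011, 6, 30), date(2011, 8, 15)))
```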

As to "It would be helpful to see a list of modules used, and where the site is hosted, although there could be some security concerns with releasing that information." Did you mean for all the sites or just for http://www.drupalcalifornia.com?

For DrupalCalifornia it is currently hosted on NearlyFreeSpeech, but thankfully SF State has agreed to host it =), so it will be moving over to our Drupal cluster sometime later this month.

As to the modules, it's running:

admin_menu
ctools
imce
imce_wysiwyg
pathauto
token
views
views_random_seed
wysiwyg
A custom module I wrote to do the imports

In addition, I run a virtual machine with a custom shell script to get the site info/screenshots. I'll be putting up a blog post soon on how this is done.

And thanks for the comments, this effort will continue to grow through the good feedback by people like you =)

Drupal evangelist.
www.CoderintheRye.com

With the email rule to keep

Posted by heatherwoz on December 16, 2010 at 10:48pm

With the email rule to keep it up-to-date the developer listing might work. You would definitely need a measure like that to be sure it doesn't become bad information. On the module listing, I meant that each site could list what modules it uses. In case you are looking at a site that does something cool, and you think, "Wonder how they did that?" you can look at the list and start to explore the modules. Also it's good to know what the most popular modules are for our audiences.

On that note, I just had an

Posted by coderintherye on December 17, 2010 at 12:20am

On that note, I just had an idea, but a longshot one, so just throwing it out there. Since there is some move towards wanting opt-in, another way to approach the up to date information problem could be that I write a 'university site status' module.

Currently, with the core 'update status' module, Drupal phones home and reports on what modules are being used. You can see the usage statistics that drupal.org collects by going to a module's project page. Well, if you wanted to opt in, you could install a 'university site status' module that would phone home to the drupalcalifornia server to list stats. That same module could also potentially have an administrative form that allows for entering extra information (name of developer, site's story, etc.). Just a thought.

Drupal evangelist.
www.CoderintheRye.com

Would people be motivated to install it? Crawling compromise

Posted by bwood on December 17, 2010 at 7:06pm

That would be a fun module to write. Given programmers' reputation for laziness, my worry would be that few would choose to install the module you worked so hard on.

If we don't want to crawl to find drupal sites, maybe a good approach would be to create a short webform that allows people to submit their site url if they want it included. You could also gather a couple other key data elements. I'd keep the form short so it can be completed in under a minute.

What about this compromise on the crawling idea: You crawl, but you don't publish the urls. Then you could display:

UC Merced:
* 789 drupal sites discovered.
* 43 drupal sites featured here.

One or more people would become "admins" for their school. They can view a list of URLs collected from their domains and select the ones that should be "featured." Nothing is featured without the admin's approval. We spam the admins every so often to ask them to review their school's list. This way we can 1) give site owners control of what sites are featured and 2) collect real stats that will be impressive when we need to make the case for Drupal.
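A minimal Python sketch of that compromise: every crawled site is counted, but only URLs an admin has marked as featured are ever published. The record layout, function names, and sample data are invented for illustration.

```python
from collections import defaultdict


def summarize(sites):
    """Count discovered and featured sites per university.

    sites: iterable of dicts with 'university', 'url', and 'featured' (bool) keys.
    """
    counts = defaultdict(lambda: {"discovered": 0, "featured": 0})
    for site in sites:
        entry = counts[site["university"]]
        entry["discovered"] += 1
        if site["featured"]:
            entry["featured"] += 1
    return dict(counts)


def public_listing(sites):
    """Only URLs that an admin has approved are ever published."""
    return [site["url"] for site in sites if site["featured"]]


# Made-up sample records, not real crawl results.
crawled = [
    {"university": "UC Merced", "url": "http://a.example.edu", "featured": True},
    {"university": "UC Merced", "url": "http://b.example.edu", "featured": False},
    {"university": "SF State", "url": "http://c.example.edu", "featured": False},
]
for university, count in summarize(crawled).items():
    print(f"{university}: {count['discovered']} discovered, {count['featured']} featured")
```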

Speaking of laziness and scraping data. Yesterday I wrote a script to scan db dumps for php snippets. In the midst of writing complex preg_matches using lookahead/lookbehind, and getting a headache, I discovered http://us3.php.net/manual/en/function.token-get-all.php. Pretty sweet...maybe you already know about that...

Brian, What an excellent idea

Posted by coderintherye on December 17, 2010 at 7:37pm

Brian,

What an excellent idea for a compromise. I imagine no one will have objections to such an approach of just listing the number of sites available, and self-reporting the sites you want to showcase, that's a good mix.

As per tokens, I had read about them earlier from phpadvent (one article on PHP every day till Christmas http://phpadvent.org) and there is an article that talks about tokens: http://phpadvent.org/2010/bits-and-phpieces-by-jo%C3%ABl-perras
I haven't found a use for them yet though, as I mostly seem to get by using querypath, strpos, and strstr. More complex regex stuff I usually do from a shell script or using boost_regex in C++, but I'm sure there are uses out there, just haven't found them yet.

Drupal evangelist.
www.CoderintheRye.com

Approval

Posted by bwood on December 17, 2010 at 8:05pm

What an excellent idea for a compromise. I imagine no one will have objections to such an approach of just listing the number of sites available, and self-reporting the sites you want to showcase, that's a good mix.

How can we get enough people to approve this idea to allow us to move ahead? Should we email people at various campuses and ask for sign off? Or should we just go for it and apologize later. :-)

Expose content type to

Posted by redndahead on December 17, 2010 at 9:58pm

Expose content type to anonymous. Good compromise. Apologize later. 789 lol

Well, first step, I'll add in

Posted by coderintherye on December 17, 2010 at 10:24pm

Well, first step, I'll add in the requested features that are outside of this particular scope we are discussing, stuff that everyone wants and is comfortable with, and then we can work out the details on this stuff. I definitely don't want to upset anyone, that would be counter to the goal =] Probably we can put up a (working) poll on this idea just to get some feedback.

I have to say, this idea has really caught on even better than I hoped. I think I've had e-mails from people at a dozen different universities already, and even just here word of it is moving up the chain. Thanks Adam and Brian for all of the discussion; this is the kind of stuff that will get us there by figuring out what we need to do and how to do it.

Drupal evangelist.
www.CoderintheRye.com

Whoops not anonymous. To

Posted by redndahead on December 17, 2010 at 10:00pm

Whoops not anonymous. To registered.

Maybe you saw this post by Dries

Posted by bwood on December 25, 2010 at 11:32pm

http://buytaert.net/building-blocks-of-a-scalable-web-crawler
which links to
http://buytaert.net/drupal-site-crawler-project

Yep, saw it last week, and am

Posted by coderintherye on December 26, 2010 at 2:34am

Yep, saw it last week, and am in contact with Marc (the student who did the crawler for his thesis project). The paper is an excellent read, but he is also going to see if we could use (or purchase the rights to use) the crawler itself, which of course would be awesome =)

Drupal evangelist.
www.CoderintheRye.com
