Is D7's default use of the Semantic Web ethical?

matt_paz's picture

Obviously, the title was designed to elicit discussion ... but I am seriously interested in the prospects for unintended consequences of D7's default use of RDF. D7 will be a great leap forward, and I'm genuinely excited about the prospects of D7's adoption of RDFa. Upon further reflection, I'm wondering if D7's embrace might actually make it easier for users to be tracked by organizations like Recorded Future. I understand that today's default settings can be "undone" ... but it is the default that has me most concerned ... precisely because it is the novice end user who isn't particularly savvy that will be put at most risk ... and is unlikely to even know what is going on "under the hood" ... I know, I know ... probably an exaggerated concern, but I thought I'd throw it out there and see if anyone might want to comment.

Comments

JohnChapman's picture

If someone installs D7, (or any software/package), and they don't read about usage, properties, features, etc; then they are the ones not being responsible.

This isn't even like an undocumented feature - and I contend that even if the use of RDFa (or equiv) was just something that worked out of the box, without notification, it still isn't anything unethical.

Obviously your title WAS designed to elicit discussion - but rather than throwing stuff out there, have you done any legal research to indicate that there is even a ghost of a concern?

Just making things easier

scor's picture

Interesting concerns, thanks for raising them! RDFa just makes extracting information from a website easier for machines, which don't have to guess where the title is, what the name of the comment's author is, etc. It does not surface more data than what's already on the page in HTML; RDFa is no more than extra semantic information wrapping bits of text. So a human could already scrape this data manually; RDFa just makes it easier. Another thing is that out of the box, there is no linkage to other sites out there, so if you are worried about your user account being linked to some other identity online, it won't happen unless you purposely do it via some custom code or contrib module. RSS has been in Drupal for years (with no way to disable it), and RSS was one means to get data out of your site for syndication purposes. RDFa is not more than RSS on steroids, with much more extensibility and embedded in the HTML, that's it. RDFa is enabled in the standard profile because we think it's the way forward for the Web.
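To make the "extra semantic information wrapping bits of text" point concrete, here is a minimal Python sketch. The HTML snippet is hypothetical (the markup Drupal 7 actually emits may differ), and `PropertyExtractor` and `extract_properties` are names made up for this illustration. It only shows that a machine can read labelled values straight off the `property` attributes instead of guessing, and that every value it finds is text already visible on the page.

```python
from html.parser import HTMLParser

# Hypothetical RDFa-annotated snippet (an assumption for illustration,
# not actual Drupal 7 output).
SNIPPET = """
<div about="/node/1" typeof="sioc:Item">
  <h2 property="dc:title">Hello world</h2>
  <span>Posted by <a href="/user/7" property="foaf:name">alice</a></span>
</div>
"""


class PropertyExtractor(HTMLParser):
    """Collect the text inside every element that has a `property` attribute.

    Simplification: assumes well-nested markup without void elements such
    as <img> or <br>, which have no end tag.
    """

    def __init__(self):
        super().__init__()
        self.found = []   # (property, text) pairs, in closing order
        self._open = []   # one (property or None, text parts) per open element

    def handle_starttag(self, tag, attrs):
        self._open.append((dict(attrs).get("property"), []))

    def handle_data(self, data):
        # Text belongs to every enclosing element that declares a property.
        for prop, parts in self._open:
            if prop is not None:
                parts.append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        prop, parts = self._open.pop()
        if prop is not None:
            self.found.append((prop, "".join(parts).strip()))


def extract_properties(html):
    parser = PropertyExtractor()
    parser.feed(html)
    parser.close()
    return parser.found


if __name__ == "__main__":
    for prop, text in extract_properties(SNIPPET):
        print(prop, "=", text)
```

Without the `property` attributes, a scraper would have to infer which heading is the title and which link is the author. With them, the same visible text simply arrives labelled.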

Responses ...

matt_paz's picture

have you done any legal research to indicate that there is even a ghost of a concern?

@chapster Ha. Not a lawyer, nor did I question the legality of it. Even then, there'd be the question of which legal regime one would use. The privacy regime in the EU, for instance, might offer many more interesting things to examine than that of say, the US. Again, my question was a little bit tongue-in-cheek.

I think this exaggerated concern is getting close to absurdity.

Heh, I guess time will tell whether or not privacy issues and the semantic web/RDFa have a relationship or not. I will say that my sister thinks it is absurd that I tell her to stay off FaceBook, so ya, I get dismissed a lot when I suggest a word of caution about these types of issues.

If someone installs D7, (or any software/package), and they don't read about usage,
properties, features, etc; then they are the ones not being responsible.

Hmm. Ya, I think that is more or less fair... of course, there is an issue of trust as well ... and again, as tech gets easier and more accessible, it is often easy for less savvy folks to get "caught up" in the action. And mind you, I believe Drupal has a good history on this front ... and can point to something as recent as the following http://drupal.org/node/867820 as something that reinforces trust.

RDFa just makes extracting information from a website easier for machines

Ya, I guess that is my point in a way. An audience of humans is not likely to connect dots that machines can ... not that bots don't try to guess the semantic meaning and patterns from content on the web today ... just that it will be a no brainer with RDFa.

if you are worried about your user account being linked to some other identity online ...

Ya, that's it. And I understand that, out-of-the-box, there are no linkages. I also have some feel for techniques that attempt to link disparate identities online even when they aren't formally linked ... I'm guessing this will make it easier. Again, I know I'm stretching here.

RSS has been in Drupal for years

Ya, I'm just thinking out of the box, there was not any RSS exposed when creating a simple web site ... not that I'd want to imagine life w/out RSS.

RDFa is not more than RSS on steroids

Heh, indeed ... and we all understand the side-effects of steroids
;) Sorry -- was just too good to pass up.

RDF is enabled in the standard profile because we think it's the way forward for the Web.

Yes, I do too. I think it has tremendous potential and D7's work with RDFa could well differentiate it among other players. The Googles and Yahoos of the world will love it. And mind you, I'm not saying scrap it all ... or that semantic web technologies are evil ... just a pause ... wondering about the potential for unintended consequences ... particularly as it relates to users (people) ... as compared to properties of things, etc. I can't imagine life w/out RSS today ... and I'm sure I'll say the same about the long term impact of RDFa. Perhaps the Recorded Future news was a bit of a tipping point for me, personally. Heh, maybe I'm just a Luddite in waiting ;)

Thanks for the discussion ... I'd be interested in any follow-up.
I understand and appreciate why the risks are perceived to be low and "casual" at best.

about installing software

brucewhealtonjr's picture

If someone installs D7, (or any software/package), and they don't read about usage,
properties, features, etc; then they are the ones not being responsible.

Hmm. Ya, I think that is more or less fair... of course, there is an issue of trust as well....

I have to take major issue here. I doubt that 99% of people who install software read all the possible features, uses and properties of what is installed. It is the idea of trust that does get to be problematic. We don't realize the risks or exposure that occurs. I'll probably keep using Facebook and live in the false security that nothing will happen... but that article about "Recorded Future" did seem seriously chilling.
Don't get me wrong, I'm all in favor of this, very much so, and excited about it. I just think that software users do tend to just click through and install things, trusting what they are installing and the developers. It doesn't seem like any software would be used if everyone worried about every possible implication of every possible feature or property. We should thus have laws about what groups should be able to do with our data... That's a dream that will never happen either.

There is an ethical question

linclark.research's picture

There is an ethical question in this work. I think that you kind of miss it, though, when you talk about the novice user. The ethical issue isn't whether any one site's information gets exposed. This RDFa is just exposing what is already exposed in HTML, and any novice user should understand if they put their information on a Web site and if they don't set access control on that site, that people are going to be able to access their content.

The real ethical issue with all of the SemWeb work is whether by providing a mass of data from a whole BUNCH of sites in an easy to exchange and process way, you are opening the door for Big Brother. But this is the same ethical issue with the Web and with the Internet themselves.... information was much harder for the CIA to process before that information was in HTML and on servers.

We have gained a lot from having the Web and Internet, and we have lost some privacy... the government can spy on us in new ways. But I think that the Web and the Internet are both net wins when it comes to what we've gained over what we've lost.

Similarly, I think the Semantic Web is going to be a net gain. Yes, the data is going to be out there for Big Brother to sniff. It is also going to be out there for people to exchange with each other. And, even more exciting, the government's data is going to be out there for people to analyze in new and powerful ways with a lot of the work that is going on in linked open government data.

In the end, if an individual site needs to protect its site-specific information, the owner is probably going to be aware of this need. And it is just as simple to protect the RDFa as it is the HTML.

SemWeb != everything is public and free

fgiasson's picture

Hi there,

The real ethical issue with all of the SemWeb work is whether by providing a mass of data from a whole BUNCH of sites in an easy to exchange and process way, you are opening the door for Big Brother.

I would like to make a simple point clearer here. This discussion is OK regarding RDFa, given its nature (embedding RDF within web pages, most of which are publicly available).

However, I certainly won't put the "SemWeb" on the same bandwagon. The Semantic Web is really just a set of technologies, techniques and concepts that help manage information and try to get more value out of it. Nobody (at least not me) said that the Semantic Web was about freely and openly publishing the world's data directly on the Web (that is the purpose of some groups such as the LOD community, Drupal's RDFa developers, and such).

Most of the technologies, techniques and concepts driving the "Semantic Web" are (and will be) used for all and any purposes, but only a small percentage is (and will be) noticeable (and publicly available).

The real value of the Semantic Web for businesses is not to publish all their data on the Web. That is one of its values... and only for some really specific use cases.

Thanks!

Take care,

Fred

Ya, but ...

matt_paz's picture

but only a small percentage is (and will be) noticeable (and publicly available)

Maybe that might be true for the broader case of SemTech, but it sounds more like a description of what is happening with SemTech today ... as compared to say, five years from now. Even then, I wonder what D7's use case for on by default is in practice ... I don't know, but I'd guess that the adoption is more likely to be lots of smaller sites that ARE publicly available ... or at least that would be my guess about the distribution of Drupal deployments ... 90% small to mid size sites ... 8% biggerish ... 2% really big. I wonder if anyone has any empirical data on that.

D7 is breaking new ground ...

matt_paz's picture

I guess I just want to note that I think Drupal is really breaking new ground ... I'm not aware of any other system that combines the reach of Drupal and its vision for the Semantic Web. I'd love to learn about 'em if I've missed it. It will be a big, fascinating experiment and I'm not sure anyone can predict how it will all play out.

Yes, one can make comparisons with simply publishing information on the web in HTML, but just as RSS has transformed content delivery and consumption, I believe RDFa might too. Like others, I believe it is big. I believe it is an order of magnitude bigger than simply supporting content distribution via RSS ... and that's why I wanted to start the thread. This new, hyper-efficient community plumbing has lots of transformational potential ... I just hope its downsides are as well understood as the upsides.

Speaking of which, can anyone point me to any discussion about the decisions supporting drupal's adoption of it ... I've found lots of information about what is being done (which is great), but I've been unable to find much about the ebb-and-flow of discussions that led up to the decision. Outside of early comments from Dries, this is the closest thing I've found thus far ...

http://groups.drupal.org/node/16597

... which included questions about the ability to turn-off the native support for RDFa

One more thought ... one might argue that the target audience of RSS is humans (even though it is machine readable) and the subject matter is content .... while they've been around for a long time, it seems to me that the real-world consumers of FOAF and SIOC aren't as obvious ... and their adoption could be considered more dubious as a result.

@linclark Yes, I think I get the point ... my emphasis on novice users was just because they simply don't understand the potential implications and as a result will have less incentive to turn off the defaults. Savvy users can make informed decisions ... I'm quite pleased with the ability to selectively turn off and/or extend Drupal's base support for RDFa ... just pausing to think about other users. We're so close to the technology, I just worry what happens when we make decisions like this for users ... and again, it is not that users don't have a choice ... but it is kinda like debating "real standards" vs. "de facto standards" ... and the circumstances that created one vs. the other.

The whole point of turning it on by default is to encourage adoption, yes? And to make it so that users don't have to "think" to participate in the Semantic Web. Perhaps my perception about the motives for on by default is incorrect tho'

any novice user should understand if they put their information on a Web site and if they don't set access control on that site, that people are going to be able to access their content

In principle, I'd want to agree. They should. The question is, is it true that they do? And even then, do they really grok that their audience is not just people accessing their content, but bots? I'm sure that they understand "people" can see it ... but are their perceptions of use bounded by the perception that "a person" is consuming the content ... and as a result, one's risk is bounded by perceptions about the audience for the content ... for example, while it is true that one might publish content on the open web, if the perception is that no one except a select few would be interested in it, then perhaps the perceived risk is lower.

I know it all seems benign, and it may well be, I'm just saying that I'm not sure that we can draw parallels between the early days of the web where people could put information on the web and maybe it would be found in a search engine -- and what's coming with the world of the semantic web as a result of systems like Drupal defaulting to exposure of SIOC, FOAF and the like.

And again, I want to emphasize that I'm not anti-semantic, but I do have serious reservations about widespread adoption of SIOC and FOAF. I'm just worried that Drupal might be creating a normative experience that could accompany more cons than pros.

If I haven't expressed myself clearly, I apologize.
I just wanted to get it out there ...

Think of SEO for example.

scor's picture

Think of SEO for example. Some folks pay crazy amounts of money to get a good ranking on Google. I believe RDF can help search engines be smarter at answering queries. It's a long process but we have to start somewhere. Now, imagine someone does a search on your name: you would expect Google to bring up a result with, say, your name, your title (say "Drupal genius"), a few words on your bio, and maybe a picture like what you might have on your Twitter profile. Imagine that instead of a picture of you, Google displays a random picture it found on the sidebar of your site. Wouldn't it be better if Google knew which picture was yours and displayed it correctly? With RDF, you can tell bots "This is my picture" and Google does not have to go through a guessing algorithm to choose a picture of you on your behalf (and pick the wrong one). Imagine there was a way to turn off the SEO of Drupal: should the standard profile come with SEO off? The standard profile enables the modules that most Drupal sites might want to have on. Why do people set up public websites? Isn't it to communicate with the rest of the internet? If you set up a website to talk about private life stuff with your mother and you don't want anyone else reading it, then you'd better have some access control in place if you don't want it to show up in the search engines, especially with all the SEO goodness Drupal has natively built in. Note that this happens even without RDF in Drupal 5 and 6! Search engine companies put a lot of effort into screen-scraping your HTML pages to extract data from them, so you might as well tell them what they should understand (provided they are given access to it).
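The "This is my picture" case can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the page markup is a simplified, hypothetical annotation (real RDFa processing rules are more involved, and Drupal's output may look different), and `guess_picture` and `declared_picture` are invented names. The sketch contrasts a bot that takes the first image it finds with one that looks for an image explicitly marked `foaf:img`.

```python
from html.parser import HTMLParser

# Hypothetical page: a sidebar banner appears before the profile picture.
PAGE = """
<div class="sidebar"><img src="/files/banner.png" alt="Ad"></div>
<div about="/user/7" typeof="foaf:Person">
  <img rel="foaf:img" src="/files/alice.jpg" alt="alice">
</div>
"""


class ImageCollector(HTMLParser):
    """Record (src, rel tokens) for every <img>, in document order."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            # `rel` may hold several space-separated tokens.
            self.images.append((a.get("src"), (a.get("rel") or "").split()))


def _collect(html):
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.images


def guess_picture(html):
    """Naive heuristic: assume the first image on the page is the person."""
    images = _collect(html)
    return images[0][0] if images else None


def declared_picture(html):
    """Use only an image the author explicitly marked as foaf:img."""
    for src, rel in _collect(html):
        if "foaf:img" in rel:
            return src
    return None


if __name__ == "__main__":
    print("guessed: ", guess_picture(PAGE))
    print("declared:", declared_picture(PAGE))
```

On this sample page the naive guess lands on the sidebar banner, while the annotated lookup returns the profile picture. Both images were publicly reachable either way.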

can anyone point me to any discussion about the decisions supporting drupal's adoption of it

On the decisions which led to RDF in core, Dries has always been keen on the idea. There was even an attempt to integrate RSS 1.0 back in 2001 but RSS 2.0 was later adopted. Then around 2005 several developers revived the topic by working on contributed modules around RDF. As RDFa matured and became a W3C recommendation, it became clear this was the way to go. There's also been a lot of discussion in the RDF in core issue queue.

SEO indeed!

matt_paz's picture

Yes, I'm excited about the prospects for Google Rich Snippets (and the like) ... and for the prospects for the type of faceted searching that will accompany support for RDFa. That is most certainly one of the areas that will be facilitated by this. Of course, that could be off by default too ... I believe anyone interested in SEO will be looking at systems and techniques to increase their findability ... and they will likely be savvy enough to turn this capability on if it were off by default.

D6 Core SEO < RDF?

matt_paz's picture

should the standard profile come with SEO off?

I guess I'd also counter that Drupal's out of the box SEO support is fairly limited ... and that anyone serious about SEO in Drupal is using a contrib module or two to maximize their exposure (as they will with RDFa too). I don't want to minimize or trivialize Drupal's architectural investment there, just saying that Drupal core doesn't try to be "all things SEO" ...

Another timely article ...

matt_paz's picture

There is probably an order-of-magnitude difference between the actions outlined in this article and those highlighted in our discussion, but I thought it was pretty timely and had to pass it along.

Microsoft built its browser so that users must deliberately turn on privacy settings ...

These companies now have a big say in how much information can be collected about individual users.

http://online.wsj.com/article/SB1000142405274870346730457538353043983856...

Sorry for all the hyperbole ... and for belaboring this thread. It would just be a bummer if folks are talking about Drupal in a similar context two or three years from now. Not likely, but who knows.

I'm with ya

dojo733's picture

I agree with matt_paz. And unfortunately this is the world we live in and we have to figure out how to adjust. True, it would be nice to tell a Google bot which picture is yours and stuff, but certainly one can agree that there are other sites that are not as conscious about playing nice with people's info. I imagine the average person might be disconcerted when they see a note at the bottom of a page that they have not logged in to that says, "Hey, you are from City ABC, visit this place/ad or whatever concerning your area..." Or worse yet, "Hey, First Last Name", which I've also seen. In addition, regarding setting "access control", let's everyone remember that these programs are written by humans, and humans do make mistakes. The programmers at Facebook apparently made some modifications, and so a deleted account with already-locked-down access control points suddenly became visible. Hello! Therefore, what kind of mistakes will be revealed with Semantic Web? The bigger the power, the bigger the error?

"Hey, you are from City ABC,

linclark.research's picture

"Hey, you are from City ABC, visit this place/ad or whatever concerning your area..." Or worse yet, "Hey, First Last Name"

RDFa has nothing to do with this, I'm not sure what you are getting at here.

And the point with access control is this: If you don't have access control set on your site, then Google (or whoever) is going to be able to scrape the information anyway and it will start to show up in search engines or however people want to use it. Once again, RDFa has nothing to do with that. You have the same issue whether you have RDFa on the site or not.

I'm not saying I think that the case is closed on this issue, but these aren't really relevant points.

Leveraging Drupal's access control

scor's picture

In addition, regarding setting "access control", let's everyone remember that these programs are written by humans, and humans do make mistakes.

The advantage of RDFa is that it sits within the HTML of your site and as a result relies totally on the existing access control in place (Drupal). We don't implement alternate menu items (paths) à la RSS, no, it's all in the existing HTML that your site would output if you didn't have RDFa, so we don't have to deal with any kind of extra access control: no HTML means no RDFa, period.

It's great that you are asking these questions, and please if there is more that needs to be clarified feel free to ask!

Access Control and/or Robots.txt

matt_paz's picture

I hear 'ya about access control (and/or robots.txt) ... again, for me it goes back to accommodating the lowest common denominator ... which is something I think we're both trying to accomplish.

Looking through one lens, one might suggest that a default of ON would empower users without them having to think about what it means to participate in the Semantic Web and why they might want to do it (or not). Another lens might suggest that a default of OFF would provide greater protection to, forgive me, folks that don't understand what they're getting into ... folks that might not appreciate access controls or how to use them effectively. As Drupal gets easier to use, it will attract more users ... not just the savvy ones. I guess you could use my argument against me too ... suggesting that if I can make this leap for RDFa, maybe I should just be espousing that out-of-the-box Drupal should not allow anonymous access ... and I guess I might not argue for that.

Still, I can totally appreciate why one would want the default to be ON ... for the good of both Drupal and to propagate the ideals of the semantic web more broadly ... just not sure that it is good for the newbies to get caught in an undercurrent of the semantic web (should one develop).

I believe we have similar motives ... both admirable ... just different orientations on the issue ... one of optimism/altruism and one of more skepticism/cynicism ... and while it is hard to argue against optimism, I felt like I had to ... even if it goes nowhere as we edge ever closer to D7's release. Just felt like I had to be on the record for a default of "OFF" ... and if nothing else, to instigate a bit of pause before proceeding down this route. I genuinely appreciate the discourse and discussion.

What positive uses can we encourage?

ddegler's picture

While there are always concerns about use of data (and I agree with Lin's comment above that structuring the data can extend its availability and use), this thread argues for simple, in-context guidance to admin users so that decisions are clear and you don't have to hunt for info.

There are also things that would be good to see "baked in" to help D7 support the larger linked data movement. For example, a couple simple and useful things:

  1. RDFa expression of rights/copyright on every page. If machines are going to be scraping the data, then every data set (node) should carry the intent of the author/publisher about what appropriate uses of that data can be. This encourages site owners to consider that, express that in concise and standard ways, and provides more predictable expressions for machines to read.

  2. RDFa expression of a persistent URL. While the URL itself should remain stable, ideally, it is possible that there could be multiple views of a node, multiple aliases, and parameters that are carried for programmatic reasons. This may make the address bar URL context-dependent or changing, even if the underlying resource is persistent. Clearly expressing the persistent URL in the metadata is one way of providing a more stable data experience to the linked data web.
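As a rough sketch of what the two suggestions above might look like in markup, here is a small Python helper that builds a footer fragment for a node. This is not anything Drupal provides: `rights_footer` is a hypothetical name, and the choice of the `about` attribute for the persistent URL, `rel="license"` for the rights statement, and `rel="bookmark"` for the permalink is an assumption made for illustration, not something decided in this thread.

```python
from html import escape


def rights_footer(persistent_url, license_url, license_name):
    """Return an HTML fragment stating a node's license and persistent URL.

    Hypothetical helper. The attribute choices (about, rel="license",
    rel="bookmark") are illustrative assumptions.
    """
    url = escape(persistent_url, quote=True)
    lic = escape(license_url, quote=True)
    return (
        f'<div about="{url}">'
        f'<a rel="license" href="{lic}">{escape(license_name)}</a> '
        f'<a rel="bookmark" href="{url}">Permanent link</a>'
        f"</div>"
    )


if __name__ == "__main__":
    print(
        rights_footer(
            "http://example.org/node/1",
            "http://creativecommons.org/licenses/by/3.0/",
            "CC BY 3.0",
        )
    )
```

Whatever view, alias, or query string the page was reached through, the fragment always names the same persistent URL, and a machine reading the page can find the stated license without parsing free-text copyright notices.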

As has been said by others here, there are lots of weaknesses and misuses of the existing web, and some might risk being further enabled by having more structured data. So maybe we can think of other ways to support good use of RDF data from Drupal's wider community?
