The movement towards semantic data
There is a large and growing movement in the front-end development community centered on semantic, machine-readable content. Projects exist to build Markdown editors and other tools that aid the process of creating semantic content. At the same time there is a nascent movement in academia to develop a WYSIWYM (what you see is what you mean) editor, currently limited in scope to extracting potential schema.org entities from content as a user enters it and prompting them to clarify things like meaning and taxonomy (this already exists as a plugin for TinyMCE).
The rise of effortless content authoring
On the opposite end of the spectrum there is a lot of work being done both in Drupal and in the enterprise to streamline and remove friction from the content authoring experience. This type of approach can be seen in tools like Barley and Squarespace. Here the idea of semantic data is largely ignored and the focus is mainly on giving the user the ability to easily create content that “automagically” looks great. The “great” appearance comes from under-the-hood theming, but also a significant narrowing of user options.
How Drupal has approached this problem space
We have always taken the approach of abstracting content from presentation, both by separating the theme layer from the data and by using filters in the WYSIWYG to remove “messy” user-generated markup. This approach has proved quite effective, giving themers very granular control over the way a site looks and giving site builders the ability to structure data and re-use it in a multitude of ways.
An opportunity to move in a new direction
Now that WYSIWYG editors have matured and have much more robust ways of both writing cleaner, more semantic markup and parsing pasted markup to remove unwanted cruft, a new opportunity is emerging: automatic semantic authoring. The value we throw away in Drupal when we filter out all the “messy” formatting users enter is that each and every user decision to visually structure content carries semantic meaning. We already recognize this when we favor "em" and "strong" over "i" and "b", but it goes much deeper. Decisions users make about size, color, order, indentation, even font (scripts connote formality and emotion, serifs are neutral and official) are all imbued with meaning that can be parsed, indexed and processed to extract the larger meaning of the document.
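The "em"-over-"i" case above can be sketched in a few lines. This is a minimal illustration, not any existing Drupal filter: it uses Python's stdlib html.parser, and the presentational-to-semantic mapping is an assumption limited to the two pairs the text mentions.

```python
from html.parser import HTMLParser

# Illustrative mapping only: presentational tag -> semantic equivalent.
SEMANTIC_MAP = {"b": "strong", "i": "em"}

class SemanticNormalizer(HTMLParser):
    """Rewrites presentational tags to semantic counterparts.

    Attributes are dropped for brevity; a real filter would preserve them.
    """
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(f"<{SEMANTIC_MAP.get(tag, tag)}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{SEMANTIC_MAP.get(tag, tag)}>")

    def handle_data(self, data):
        self.out.append(data)

def normalize(html: str) -> str:
    """Return the input markup with presentational tags made semantic."""
    parser = SemanticNormalizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(normalize("This is <b>important</b> and <i>nuanced</i>."))
# -> This is <strong>important</strong> and <em>nuanced</em>.
```

The point of the sketch is that the transformation is lossless in the visual sense but gains machine-readable meaning; the same pattern could extend to size, color and indentation cues.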
The broader potential impact
Looking at this from the larger perspective of the entire web: when we consider the dominance of centralized data stores like Google and their movement toward artificial intelligence, there is a certain asymmetry that can be exploited.
On one side is the search engine, using artificial intelligence to index and then attempt to interpret the meaning of every document on the web and feed that interpretation back to us. Setting aside the fact that massive amounts of data centralized in a few organizations is a recipe for abuse, there are other, more subtle negatives. We often forget that the infrastructure of search mediates all of our interactions: the engine controls the interpretation of data, and we rely on that interpretation to make connections. Therefore, in most cases, we don’t have a direct connection from data source to data source.
On the other side, moving toward semantic content at the point of authorship would begin to establish a common, machine-and-human-readable baseline of content/data that could be queried from anywhere by anyone. It would certainly make a search engine’s job easier, but it could also make search engines irrelevant. When content is data you no longer need a mediator; you (or your site) can find what you want on your own. That’s the big idea; this is the first small step in that direction.
The goal is to bring together a diverse group of thought leaders, including Drupal developers, WYSIWYG developers, UX designers, usability researchers, content strategists, and semantic data and natural language experts, and to facilitate an open and divergent brainstorming session exploring the following questions:
- How would we structure and design a system that transparently mediates the established mode of content authoring to extract and store every level of semantic meaning in the content being created?
- If we can extract the data, are there ways we can further structure it based on heuristics and rules, and how can we inform the user about the process as it occurs?
- Once data is extracted, and parsed, how and where could it be stored? (DB, Inline?, cached XML or YML?)
- Once the data is stored, what are some of the ways a user or a machine might access and use it? (Drupal “pseudo-fields”? SPARQL?)
- What are some ways we can balance a more permissive approach to author generated markup with the needs of designers and themers?
- Where should the separation fall between parts of this that operate in the WYSIWYG, in Drupal (or another CMS), in the browser, or in other external libraries or applications (Mahout, OpenNLP, Solr, WordNet, etc.)?
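To make the storage question above concrete, here is a hypothetical sketch of one option raised in the list: caching extracted semantics as a machine-readable sidecar document alongside a content node. The field names ("entities", "emphasis") and the `to_sidecar` helper are illustrative assumptions, not a proposed standard or an existing Drupal API; JSON stands in for whatever serialization (DB row, XML, YAML) is ultimately chosen.

```python
import json

def to_sidecar(node_id: int, extracted: dict) -> str:
    """Serialize semantic data extracted from a content node as JSON.

    The wrapping structure is an assumption for illustration: a node
    reference plus a free-form bag of extracted semantics.
    """
    return json.dumps({"node": node_id, "semantics": extracted}, indent=2)

# Hypothetical output of an extraction pass over authored content.
doc = to_sidecar(42, {
    "entities": [{"type": "Organization", "text": "Drupal"}],
    "emphasis": [{"tag": "strong", "text": "important"}],
})
print(doc)
```

Because the sidecar is plain structured data, either a human-facing feature (a Drupal pseudo-field) or an external consumer (a SPARQL endpoint, a Solr index) could read it without re-parsing the original markup.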