Virtual Field Collection

CHiLi.HH's picture

A brief summary of the idea:

Data within entities that does not need to be processed individually in a view (or, more precisely, never needs to be addressed individually) should be stored in a single database table rather than spread over N tables, as it would be if stored by the Field API "the usual way". This reduces the number of (sub)queries and follow-up queries, and thereby the load on the site's database.
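The difference can be sketched in a few lines (Python and JSON stand in here for brevity; the module itself is Drupal/PHP and would use PHP's serialize(). All table and field names below are illustrative, not the module's actual schema):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

FIELDS = ("amount", "unit", "prefix", "name", "suffix")

# "The usual way": the Field API stores every field in its own table,
# so loading one entity touches one table per field.
for field in FIELDS:
    cur.execute(f"CREATE TABLE field_{field} (entity_id INTEGER, value TEXT)")
    cur.execute(f"INSERT INTO field_{field} VALUES (1, ?)", (f"{field}-value",))

# The proposed way: all values for the entity serialized into one column.
cur.execute("CREATE TABLE virtual_field (entity_id INTEGER, data BLOB)")
values = {f: f"{f}-value" for f in FIELDS}
cur.execute("INSERT INTO virtual_field VALUES (1, ?)", (json.dumps(values),))

# Loading the entity: five queries versus one.
usual = {
    f: cur.execute(f"SELECT value FROM field_{f} WHERE entity_id = 1").fetchone()[0]
    for f in FIELDS
}
row = cur.execute("SELECT data FROM virtual_field WHERE entity_id = 1").fetchone()
serialized = json.loads(row[0])

assert usual == serialized  # same data, one query instead of five
```

The trade-off is the one discussed throughout this thread: the serialized column cannot be filtered or joined on by the database, so it only suits fields that are never addressed individually.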


In a recent client's project - a big and heavily visited recipe site - we ran into massive performance issues, partially caused by the huge number of fields attached to a specific node type. As part of the resulting optimizations, I had the idea to reduce the number of "real" FAPI fields by storing them in serialized form in the database, thus reducing the number of queries sent to the site's database.

In detail: in that scenario we had a field collection containing 5 different fields, nested inside another field collection that contained one additional field. Both field collections were set to allow an unlimited number of entries. On top of that, two of the fields were Term and Entity reference fields, causing a bunch of follow-up queries of their own...

This data structure led to a huge number of complex queries full of joins, plus even more follow-up queries, just to load a single node. But - you may have guessed it already - these nodes weren't only displayed one by one; they were frequently aggregated in views, which made the situation even worse... Since most of the fields involved were of no use without the containing node itself, the idea came up to serialize them and thereby significantly reduce the number of queries needed to load a node. We implemented this as a custom-tailored solution, and it turned out to work very well. So I set out to create a more generic and reusable version of this technique.

The idea

Create a new custom field type able to hold the serialized data; provide a hook system that sub-modules can plug into in order to supply new field types; provide a field settings widget to create "virtual" fields and groups; an entity edit widget offering a simple, intuitive and user-friendly interface to enter the data "the usual way"; and finally a field formatter to properly display the field wherever it should appear.
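The sub-module hook pattern described above might look roughly like this (sketched in Python for brevity; the real module would use Drupal 7 hooks, and every name here - register_field_type, the "text" type, the callback keys - is an illustrative assumption, not taken from the sandbox code):

```python
# Registry the core module maintains; "sub-modules" fill it by
# announcing their field types together with their callbacks.
field_types = {}

def register_field_type(name, **callbacks):
    """A sub-module registers a virtual field type and its callbacks."""
    field_types[name] = callbacks

# A hypothetical "text" sub-module hooks in with validate and format
# callbacks, mirroring the (Plain) Text field type listed below.
register_field_type(
    "text",
    validate=lambda v: isinstance(v, str) and len(v) <= 255,
    format=lambda v: f"<span>{v}</span>",
)

def validate_entity(values):
    """Core module: run each submitted value through its type's validator,
    collecting errors for the edit widget to mark and message."""
    errors = []
    for field_type, value in values:
        if not field_types[field_type]["validate"](value):
            errors.append((field_type, value))
    return errors

# Valid input passes; a non-string value is flagged for the form.
assert validate_entity([("text", "2 cups")]) == []
assert validate_entity([("text", 123)]) == [("text", 123)]
```

The point of the registry is that the core module never needs to know the concrete field types; each sub-module brings its own widget, validation and formatting logic, just as Drupal's own hook system lets modules extend one another.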

I started developing this about two weeks ago and have come pretty far so far...

Key facts of the actual development state:

  • The custom field type: created using Drupal's Field API, utilizing a blob field that can hold up to 4 GB of data.
  • The hook ecosystem: fundamentally outlined and implemented.
  • The field settings widget: ready to use, already calling into and integrating sub-modules.
  • The entity edit form widget: ready to use, offering form validation, error markings and messages, and a centralized autocomplete feature that sub-modules can easily hook into.
  • The field formatter: yet to be done; currently just outlined and roughly implemented.

Existing sub-modules / field types:

  • (Plain) Text fields: existing and working.
  • Text areas: roughly outlined and implemented.
  • Taxonomy Term reference: existing and working with a select and an autocomplete widget.
  • Node reference: roughly outlined and implemented.
  • User reference: roughly outlined and implemented.
  • Options (checkboxes, radio sets and select fields): planned.
  • Numbers: planned.
  • File- and image uploads: planned.

Like to check it out?

If you would like to check it out, here is the link to the sandbox. Or grab a git clone, ready to go:

git clone --branch 7.x-1.x-dev virtual_field_collection
cd virtual_field_collection

Your opinion?

I would love to hear what you think!



It's an interesting idea, but

Garrett Albright's picture

It's an interesting idea, but I think you might be solving the wrong problem. The complexity of the fields system is a feature, not a bug; it's what makes it so flexible.

I think it may be a better idea to implement caching solutions such as Boost (for simple cases) or Varnish (for more complex ones) so that all of your database queries have to run less frequently in the first place. Did you consider something like that previously, and if so, why did you dismiss it for this solution?

What if...

CHiLi.HH's picture

Hey Garrett, first I'd like to thank you for your comment!

But what if caching is not an option? When you have to deal with logged-in users, for example, Varnish is of nearly no use. Varnish and Boost may also be inappropriate if your data changes at a high frequency. And most caches expire at some point and need to be rebuilt - depending on your site, this can add up to significant load, too.

On the other hand, caching does not solve the problem - it only makes the symptoms less painful. The goal of my module is not to replace the Field API - in fact, I'm using it myself. ;-) It simply provides a way to deal with the real problem by reducing the number of FAPI fields for data that doesn't need to be that flexible. And to be honest, we often deal with fields that do not need to be addressed individually, right?

In our case, the serialized fields were part of a broader set of optimizations that also included caching with Varnish, cloud-based hosting for media files, and replacing several views with custom queries (which changed the game a lot, too).


Use the module to avoid bloated data

Kars-T's picture

The idea for this module came from a problem we had with one of our projects. We had to collect ingredients for recipes, and the ingredients were grouped into different steps. What we did at first was:

  • Ingredient Group -> Field Collection
    • Name -> Text Field
    • Ingredients -> Field Collection
      • Count / Amount
      • Unit -> Entity Reference
      • Prefix -> Text Field
      • Ingredient Name -> Taxonomy Reference
      • Suffix -> Text Field

This resulted in over 16 tables (two per field - a data and a revision table - for the 8 fields involved, plus some more tables for the field collections) and loads of data. The database quickly grew to 2 GB, and views would fail because some queries required a full table scan. Yes, it is arguable that we could have optimized the queries and moved anything search-related to Solr - and we did that later. But there was still so much unneeded stuff in the DB that we wanted to get rid of.

So we built a first custom implementation of the same thing this module does. With it, we could cut the data down to 2 tables and shed around 1 GB of data.

You should see this module more like the Matrix CCK field. You can easily define a large data structure and put it into a DB blob. That's not good for searching or joining within the DB itself, but it is very efficient in terms of storage and DB structure.

To us, the module Christian developed from that idea is a big relief and a great advantage on some occasions. :)