Game plan for UUID support in core

gdd

One of the key aspects of solving the configuration management conundrum is working out how to uniquely identify data between servers. To this end I propose implementing UUIDs in core for the identification of content, and providing an API through which UUIDs can be generated and used by contrib modules. While we are still talking through many big-picture architectural issues, I think it is possible to get moving on this and several other tasks to get the real work started. So here is my proposed game plan, created after some discussion with Dave Hall (skwashd), who has recently put a lot of work into porting the UUID module to D7. He has done several implementations of this now, and I think a lot of the current code can be reused or at least serve as inspiration (see http://drupalcode.org/project/uuid.git/tree/refs/heads/7.x-1.x). This is not set in stone by any means; it's just a proposal, and I'd love to hear what people think.

  1. Figure out how we want to generate UUIDs and implement an API function to generate them. I propose using the UUID module's current functionality as a starting point (http://drupalcode.org/project/uuid.git/blob/refs/heads/7.x-1.x:/uuid.inc). I'm unsure about the benefits of having multiple generators vs just the PHP implementation. On the plus side, the PECL and Windows-specific generators are much faster than the pure PHP implementation. On the downside, I'm not sure it is appropriate for core to be adding platform-specific exceptions like this. It seems like it sets a bad precedent. Ideally we can make this pluggable - we ship with the PHP implementation by default and you can replace it if you want/need to.
  2. We will need some simple APIs for translating UUIDs to serial keys and vice versa.
  3. I propose that UUIDs should live alongside serial keys, with Drupal using the serial keys internally for most uses while UUIDs can be used where necessary. The UUID module has implemented a lot of this already through the use of hook_schema_alter(), but in core we can obviously just add the fields directly. Some tables will be excluded. Off the top of my head, variable and system seem like places where this will be unnecessary due to their use of unique text strings as identifiers. I'm sure there will be more and we should make a list, the general rule being that anything with a 'machine name' can be excluded. This will also involve some discussion about when and where we use UUIDs internally. For instance, should Nodereference/Relation/etc refer to content based on its serial ID or its UUID? What about user IDs in the node object? Can we node_load() using the UUIDs? etc. We need to establish some guidelines around this and adjust as needed, with particular attention to performance implications.
  4. Once all that is sorted we can start with the patches adding appropriate fields to the tables in hook_schema(), and the update functions to add them in as well as (probably) a batch update to add UUIDs to all your existing data.
  5. We will also want to update the _save() functions to acknowledge the existence of a UUID as an indication that this is not a new user/node/etc, as well as adding UUID generation to the _insert() functions. We need to make sure and not regenerate UUIDs for objects that have them already (since the assumption becomes that they are coming from an external system.) This will have to be done carefully, but may be an opportunity to start working in some entity CRUD functionality at the same time.
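
The save behaviour described in steps 3 to 5 can be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory array stands in for a table that has both a serial key and a uuid column, and example_save() and its $generate_uuid callback are hypothetical names, not part of core or the UUID module.

```php
<?php
/**
 * Saves a record into $table, which is keyed by serial ID.
 *
 * Hypothetical sketch: $generate_uuid is any callable returning a new UUID.
 */
function example_save(array &$table, array $record, callable $generate_uuid) {
  // An incoming UUID that already exists locally means "update", not "insert".
  if (empty($record['id']) && !empty($record['uuid'])) {
    foreach ($table as $id => $row) {
      if ($row['uuid'] === $record['uuid']) {
        $record['id'] = $id;
        break;
      }
    }
  }
  // Still no serial ID: this is an insert, so allocate the next key.
  if (empty($record['id'])) {
    $record['id'] = $table ? max(array_keys($table)) + 1 : 1;
  }
  // Never regenerate: prefer the incoming UUID, then the stored one, and
  // only generate a fresh UUID when neither exists.
  if (empty($record['uuid'])) {
    $record['uuid'] = isset($table[$record['id']])
      ? $table[$record['id']]['uuid']
      : $generate_uuid();
  }
  $table[$record['id']] = $record;
  return $record;
}
```

Passing the generator in as a callable keeps the sketch independent of whichever generator (PECL, Windows or pure PHP) ends up being used.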

As we encounter bugs or roadblocks for the above along the way, file issues and tag them "uuid" and/or "change management" so we can keep a good list going of where we're at.

What have I forgotten? What part is broken? Let me know!

Comments

Dave Reid

Also I'd like to get feedback on http://drupalcode.org/sandbox/davereid/1119768.git/tree/refs/heads/7.x-1.x. For D8, ideally we'd want to store the UUIDs with the base tables, but for D7 contrib I don't think it's a sustainable solution.

Senior Drupal Developer for Lullabot | www.davereid.net | @davereid

Various comments

Crell

+1 for moving forward on UUIDs. They're an "easy win". The existing UUID module is a decent starting point for the API portion, although with UUIDs baked into core its hook implementations of course won't be necessary. I haven't looked at Dave's code yet.

We absolutely should allow for C-level UUID generation. In fact, we should design it so that if you install the appropriate PECL module, nothing changes except that stuff gets faster. That's an approach we should be using throughout core a lot more: Model on a PECL. That provides a known, non-Drupal-specific API (easier for new people coming to Drupal) as well as a very easy way for Sites With Root(tm) to speed up their Drupal site.

As far as what gets UUIDs, I largely agree with "things that now have an auto-inc key but not things with a machine name." That's what we've been discussing since before Chicago, and it makes sense.

The question for me is how we handle entities in the general case. I do think we should aim for "all entities have a pkey and a UUID". However, for non-Drupal-native entities that may get weird. I'm not quite sure how we deal with that at the moment. Certainly for node, user, term, etc. we can just throw a UUID column into the table and call it a day.

I do think we should be able to load by entity type and UUID, even if it's not the preferred way. If not, then from a UUID you first have to do a local key lookup and then load, which would be no faster and probably slower than just loading by UUID directly.

How would entity UUIDs interact with the proposed move from CRUD to CRAP (which I also support)? The notes here about saving make sense to me.

This actually sounds like a perfect case for a sandbox fork of core, owned by Greg, that we can work against before it gets merged back to core en masse.

PECL

gdd

That is what the UUID module does right now. It checks whether you have the PECL extension installed and uses it; if you don't, it falls back to a homegrown PHP generator. I'm fine with that, I just don't know how everyone else feels about it.

I don't know enough about the CRUD/CRAP question to comment at this point, however I feel that we shouldn't worry about it either since it may or may not actually happen. If it happens then we will adjust as needed.

PECL++

catch

I like being able to do this. The only question for me is whether we hard-code fallbacks to PECL, or allow configuration to use another backend as well (these aren't mutually exclusive).

PECL not OSSP

skwashd

Ubuntu and Debian ship the OSSP UUID module as php5-uuid. The OSSP UUID module is a horrid and buggy implementation. It also isn't compatible with the official PECL module's implementation.

The D7 UUID module already has some logic to ensure that it doesn't treat OSSP UUID as the official PECL module.

Sylvain Lecoy

Crell wrote: "In fact, we should design it so that if you install the appropriate PECL module, nothing changes except that stuff gets faster. That's an approach we should be using throughout core a lot more: Model on a PECL. That provides a known, non-Drupal-specific API (easier for new people coming to Drupal) as well as a very easy way for Sites With Root(tm) to speed up their Drupal site."

Wow... I agree 200% with that. I agree so much that I quoted you here: Integrate OAuth to core.

pounard

The problem with UUIDs comes from the fact that they are 128-bit integers and can be the worst indexes you can get on a DBMS. With a real DBMS such as PostgreSQL they can be totally fine (a UUID type exists and is handled well), but with MySQL they become varchar indexes, which is about as bad as it gets.
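
For a sense of scale: a UUID is 128 bits, i.e. 16 bytes, while its canonical text form is 36 characters. The sketch below only illustrates that size difference by packing the text form into a 16-byte string and back. The helper names are hypothetical, and whether a packed column would actually index better on MySQL was not measured in this thread.

```php
<?php
/**
 * Packs a canonical 36-character UUID string into 16 raw bytes.
 * Hypothetical helper, for illustration only.
 */
function example_uuid_to_binary($uuid) {
  return hex2bin(str_replace('-', '', $uuid));
}

/**
 * Expands 16 raw bytes back into the canonical 8-4-4-4-12 text form.
 */
function example_binary_to_uuid($binary) {
  // 32 hex characters split into eight groups of four.
  return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($binary), 4));
}

$uuid = '123e4567-e89b-12d3-a456-426614174000';
$binary = example_uuid_to_binary($uuid);
printf("text: %d bytes, binary: %d bytes\n", strlen($uuid), strlen($binary));
```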

Pierre.

gdd

Yes this is why I recommend keeping the serial primary keys in place. We can continue using them for almost everything in Drupal as we do now, we just leave the UUIDs in place for when we need them (which is probably not going to be very common in the grand scheme of things.)

pounard

Wise enough.
Very common as soon as you want multiple Drupal instances to communicate with each other (or even with other pieces of software).
I did UUID-based content synchronization and the like in the yamm module (which is actually in pretty bad shape, but the core code is fine).

Pierre.

entity_load_uuid()

skwashd

I still think that it would be useful to implement an API for retrieving entities via UUID such as entity_load_uuid(), which would be identical to entity_load() except the second argument would be an array of UUIDs. This will mostly be used for IO interactions, not normal Drupal operation.
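
A rough sketch of what such a function could look like, under loud assumptions: an in-memory array stands in for an entity base table with a uuid column, the entity type argument of entity_load() is left out for brevity, and every example_* name is hypothetical.

```php
<?php
/**
 * Hypothetical stand-in for an entity base table keyed by serial ID.
 */
function example_entity_table() {
  return array(
    1 => array('id' => 1, 'uuid' => '11111111-1111-4111-8111-111111111111', 'title' => 'One'),
    2 => array('id' => 2, 'uuid' => '22222222-2222-4222-8222-222222222222', 'title' => 'Two'),
    3 => array('id' => 3, 'uuid' => '33333333-3333-4333-8333-333333333333', 'title' => 'Three'),
  );
}

/**
 * Translates UUIDs to serial IDs; unknown UUIDs are simply omitted.
 */
function example_uuids_to_ids(array $uuids) {
  $ids = array();
  foreach (example_entity_table() as $id => $row) {
    if (in_array($row['uuid'], $uuids, TRUE)) {
      $ids[$row['uuid']] = $id;
    }
  }
  return $ids;
}

/**
 * Stand-in for entity_load(): returns rows keyed by serial ID.
 */
function example_entity_load(array $ids) {
  return array_intersect_key(example_entity_table(), array_flip($ids));
}

/**
 * Same result shape as example_entity_load(), but takes an array of UUIDs.
 */
function example_entity_load_uuid(array $uuids) {
  return example_entity_load(array_values(example_uuids_to_ids($uuids)));
}
```

This version does the two-step lookup (UUID to serial ID, then load). With the uuid column in the base table, a real implementation could query on that column directly and skip the extra step.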

pounard

Agreed, this is mandatory, otherwise you wouldn't need UUIDs at all :)

Pierre.

Dave Reid

Yep, I have such a function in my entity_uuid sandbox. It could possibly be extended to work if the uuid columns are in the entity base tables.

Current state of D7

skwashd

I've been sitting back to see how this conversation would develop as I have spent a lot of time discussing the D7 implementation already.

I need to spend some time writing simple tests and docs for the D7 module, then I think it is ready to go. I don't think it would take much effort to take this as a first-cut patch for D8. The problems of missing hooks etc. that I've had to deal with in D7 contrib would be avoided in D8 core. Overall I'm looking at a leaner implementation for D8; the D7 port keeps most of the existing features of the module.

I haven't benchmarked this yet, but I suspect having a hook for uuid generation is going to be slower than just calling the PHP userspace function, so for that reason I think that we should only offer something like this in core for generating the string:

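<?php
/**
 * Generates a Universally Unique IDentifier.
 *
 * @return string
 *   The unique identifier as a formatted 128 bit string.
 */
function drupal_uuid() {
  // Prefer the PECL uuid extension. The uuid_make() check screens out the
  // OSSP extension, which also defines uuid_create() but is incompatible.
  if (function_exists('uuid_create') && !function_exists('uuid_make')) {
    return uuid_create(UUID_TYPE_DEFAULT);
  }
  // On Windows, use the COM extension and strip the surrounding braces.
  elseif (function_exists('com_create_guid')) {
    return drupal_strtolower(trim(com_create_guid(), '{}'));
  }

  // PHP user space code here
}
?>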
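
The elided user-space branch could look something like the sketch below. It is a version 4 (random) generator in the sense of RFC 4122 and is not the UUID module's actual code. example_uuid_v4() is a hypothetical name, and random_bytes() requires PHP 7 or later, which postdates this thread.

```php
<?php
/**
 * Generates a random (version 4) UUID in canonical 8-4-4-4-12 form.
 * Hypothetical sketch, not the UUID module's implementation.
 */
function example_uuid_v4() {
  $bytes = random_bytes(16);
  // Set the four version bits to 0100 (version 4).
  $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40);
  // Set the two variant bits to 10 (RFC 4122 variant).
  $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);
  $hex = bin2hex($bytes);
  return sprintf(
    '%s-%s-%s-%s-%s',
    substr($hex, 0, 8),
    substr($hex, 8, 4),
    substr($hex, 12, 4),
    substr($hex, 16, 4),
    substr($hex, 20, 12)
  );
}
```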

The performance issue with UUIDs was discussed on IRC in relation to the D7 port - see IRC log. UUIDs are already supported in Drizzle and IIRC there are plans to add them to MariaDB - sorry, I can't dig up the reference. If the CRAP initiative takes off we could move the UUID generation to the storage systems.

pounard

What about an OO model? Such as:

<?php
interface UuidBackendInterface {
  public function create();
}

class UuidBackendPecl implements UuidBackendInterface {
  public function create() {
    return uuid_create(SOME_TYPE);
  }
}

class UuidBackendFallback implements UuidBackendInterface {
  public function create() {
    // Some evil PHP code here
  }
}

class Uuid {
  // Singleton pattern: holds the backend selected on first use.
  private static $__instance;

  public static function getInstance() {
    if (!isset(self::$__instance)) {
      // Use PECL when present, but not the incompatible OSSP extension.
      if (function_exists('uuid_create') && !function_exists('uuid_make')) {
        self::$__instance = new UuidBackendPecl;
      }
      else {
        self::$__instance = new UuidBackendFallback;
      }
    }
    return self::$__instance;
  }
}

// Usage
$uuid = Uuid::getInstance()->create();
?>

Pierre.

gdd

I'd like to follow Drupal trends in general, and see where we go with the WSCCI initiative before getting too far down the OO path. Entities and Drupal in general are still really procedural, and I don't want to diverge from that unless we're getting some real benefits from it which I don't really see. This is something that could pretty easily be refactored down the road if we decide to.

Probably not a plugin

Crell

I don't know that it's wise to make UUID generation completely pluggable via the WSCCI plugin system (whatever it ends up being). For one thing, it's probably a bad idea to let every site define UUIDs in its own way. That's an area where you want some consistency across all sites to help avoid collisions and to ease maintenance. (Variable length UUIDs would be a mess to maintain.) I'd rather see a hard-coded UUID strategy (or set of strategies) that all sites use consistently.

If it were to be fully pluggable, then it should use whatever WSCCI comes up with. Which, I'm rather certain, would not involve a static instance method. ;-)

pounard

UUID is a standard, so it's not about every site defining its own UUIDs, but about providing a UUID handling library that depends on what's available on each system. Drupal can't depend on a PEAR third party library which won't be available everywhere.

Pierre.

Which UUID version?

skwashd

I have spent some time reading over RFC 4122 and it seems that we need to decide on which UUID version we wish to use. The current D7 implementation is using a bastardised implementation of version 4. I'm working on a compliant version 1 implementation as a plugin.
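
For reference, RFC 4122 encodes the version in the first hex digit of the third group and the variant in the top bits of the fourth group, so a string can be checked for which version it claims to be. The helper below is a hypothetical sketch of such a check, not code from the UUID module.

```php
<?php
/**
 * Returns the RFC 4122 version (1 to 5) encoded in a UUID string.
 *
 * Returns FALSE when the string is not in canonical 8-4-4-4-12 form, has no
 * valid version digit, or does not use the RFC 4122 variant (8, 9, a or b as
 * the first digit of the fourth group). Hypothetical helper.
 */
function example_uuid_version($uuid) {
  $pattern = '/^[0-9a-f]{8}-[0-9a-f]{4}-([1-5])[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i';
  if (!preg_match($pattern, $uuid, $matches)) {
    return FALSE;
  }
  return (int) $matches[1];
}
```

A generator that produces 32 random hex digits without setting these bits would fail this check most of the time, which is one way a version 4 implementation can be non-compliant.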

pounard

I remember using V4 random UUIDs for pretty much all I needed, but this might not be good at all.

Pierre.

+1 for v4

Dave Reid

+1 for v4

UUID as field

Mile23

Personally, it suits me if UUID is a field attached to some other entity. This would use the existing built-in systems to allow users or modules to figure out which entities need a UUID and which don't. It also allows you to do a FieldQuery to find the entity by UUID, and also allows for multiple UUIDs per entity (which I kinda need, since I'm using Drupal to refer to external objects that have UUIDs).

So basically: Don't sweat it. Keep UUID as a field. The MySQL bottleneck pounard talks about is offloaded to the Field API, which could take advantage of PostgreSQL's built-in UUID in some way. Over time, use-cases will dictate whether this is efficient enough or not, and the onus will be on the Field API to speed up, which benefits everyone.

From the original post: "For instance, should Nodereference/Relation/etc refer to content based on its serial ID or its UUID?"

Well, see, that's the $64,000 question, so let's try some design fiction: You install Drupal, and make sure the UUID core module is enabled. UUID core module will automatically add a field to all entities called 'core_uuid', and a widget that says, 'Click here to copy node UUID to clipboard.' You start generating content and each entity has both an internal serial id and a UUID. You know that you want this site to have content synchronized with other sites, so you flip a switch on Nodereference that says: Use UUIDs. When the user erroneously enters a node id, Nodereference suggests the UUID automatically for the user's approval.

Or.... If you have a site full of content, with no core_uuids, you enable the core UUID module and it starts generating them in cron. If you switched on 'Use UUIDs,' Nodereference would then be responsible for re-filtering all the content, replacing node ids with UUIDs. This would also probably happen in cron.

Or, most obviously: Nodereference forks, and NodereferenceUUID makes core UUID its dependency.

More: "We need to make sure and not regenerate UUIDs for objects that have them already (since the assumption becomes that they are coming from an external system.) This will have to be done carefully, but may be an opportunity to start working in some entity CRUD functionality at the same time."

This is not all that tricky. Code doing synchronization is going to look up an entity based on its UUID, then compare its timestamp to figure out if it needs to be updated. Alternately, if it doesn't find the UUID, it will create a new entity and add the foreign UUID. Core UUID module will then implement hook_entity_insert() and notice that this entity already has a UUID so it won't add one.
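
The synchronization logic described here can be sketched as below. The array layout, the 'changed' timestamp key and the example_* function names are assumptions made for illustration, not an existing API.

```php
<?php
/**
 * Decides what a sync should do with one incoming record.
 *
 * $local_by_uuid holds local records keyed by UUID; every record carries a
 * 'uuid' and a 'changed' timestamp. Returns 'create', 'update' or 'skip'.
 */
function example_sync_action(array $local_by_uuid, array $incoming) {
  if (!isset($local_by_uuid[$incoming['uuid']])) {
    // Unknown UUID: create a new entity and keep the foreign UUID.
    return 'create';
  }
  $local = $local_by_uuid[$incoming['uuid']];
  // Known UUID: only update when the incoming copy is newer.
  return $incoming['changed'] > $local['changed'] ? 'update' : 'skip';
}

/**
 * Applies incoming records and returns the new local state.
 */
function example_sync(array $local_by_uuid, array $incoming_records) {
  foreach ($incoming_records as $incoming) {
    if (example_sync_action($local_by_uuid, $incoming) !== 'skip') {
      // The record is stored under its own UUID; nothing is regenerated.
      $local_by_uuid[$incoming['uuid']] = $incoming;
    }
  }
  return $local_by_uuid;
}
```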

Ultimately I think it's important that UUIDs be a feature you can turn off if you don't want to burn 36 extra characters per entity in your database. This is most easily and flexibly accomplished by implementing it as a field.

Dave Reid

Using FieldAPI comes with a performance hit - and a UUID does not strike me as something that needs to be a field (with a widget or formatter) rather than an internal property. Having UUIDs as a field will also make entity lookups by UUID even worse. If a module doesn't want UUIDs for its entities, then it should specify 'uuid' => FALSE in its hook_entity_info() implementation.
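
As a sketch of that opt-out: the 'uuid' key below is the proposal from this comment and not an existing hook_entity_info() key. The module name, the entity types and the example_uuid_entity_types() helper are all hypothetical, and UUIDs are assumed to be on by default.

```php
<?php
/**
 * Implements hook_entity_info() for a hypothetical module "mymodule".
 */
function mymodule_entity_info() {
  return array(
    'mymodule_log_entry' => array(
      'label' => 'Log entry',
      'base table' => 'mymodule_log_entry',
      'entity keys' => array('id' => 'lid'),
      // Proposed opt-out: this entity type does not want UUIDs.
      'uuid' => FALSE,
    ),
    'mymodule_document' => array(
      'label' => 'Document',
      'base table' => 'mymodule_document',
      'entity keys' => array('id' => 'did'),
      // No 'uuid' key: UUIDs stay enabled.
    ),
  );
}

/**
 * Returns the entity types that should get a UUID.
 *
 * Assumes UUIDs are on by default and only an explicit FALSE opts out.
 */
function example_uuid_entity_types(array $entity_info) {
  $types = array();
  foreach ($entity_info as $type => $info) {
    if (!isset($info['uuid']) || $info['uuid'] !== FALSE) {
      $types[] = $type;
    }
  }
  return $types;
}
```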

No

skwashd

During the D7 port there was a lot of discussion about how to implement UUID. One of the iterations was UUID as fields. This was rejected as impractical.

The current behaviour of the D7 UUID implementation, which is likely to be retained for D8, is that if a UUID is present when creating an entity, the provided UUID is retained. This approach should suffice for the use case you've identified with external entities with UUIDs.

The current discussions suggest that UUIDs will be enabled by default for all core entities. Node Queue in D7 is likely to depend on UUID in contrib at some point in the future - with this functionality being carried over to D8. I don't use References so can't comment on that, but it should be pretty easy for them to switch to using UUIDs rather than nids.

Any updates on UUID core implementation?

Dave Reid

Any issues or patches that can be reviewed? There doesn't seem to be any activity yet in the uuid-8.x branch of the initiative sandbox.

Lack of time

skwashd

The main reason why I haven't done anything on this is lack of time. I want to get the tests done for UUID for D7 contrib and get it out the door. After a discussion with heyrocker I've identified some issues with the D7 contrib > D8 core upgrade path that I need to fix. Once that is all done I plan to get include/uuid.inc into the sandbox for review. I hope to have more time by the middle of July to work on this stuff.