inline internal links and unpublishing/deleting

alexkb's picture

For those using modules like linkit or ckeditor_link (which let you link to internal pages from within your body text), is there any module that tracks these inline links?

What I'm getting at is that if the inline links are tracked, then we can do something before the linked node is unpublished or deleted, e.g. display a warning, or ask the user to remove the link and rewrite the content of the referring page.

It seems like a fairly important thing; otherwise large sites might find themselves with lots of broken links around the place.

There is the linkchecker module, which indicates to the editor that there are broken links when editing the page. But this notice isn't shown until the referring page is edited, or unless someone actively monitors the broken link report.

Should a more active solution be patched into linkchecker, or is this kind of inline link relationship tracking something that a new module should do?

Comments

@alexkb - I too would like to

gmclelland's picture

@alexkb - I too would like to know the answer to that question. The only things I have seen that do this are the file_entity and media modules. I have heard a story of someone using the rules module to parse the node body for inline links and save them to an entity reference field so that they could build a view that shows the backlinks.

There's no such module, no.

Garrett Albright's picture

There's no such module, no. As the developer of Pathologic, I can tell you that one problem is that picking URLs out of content and determining whether they link to internal site content or not is far more difficult than it sounds, especially if we're talking about a multi-lingual site.

simple html dom

alexkb's picture

Hey Garrett Albright,

It's quite easy using the simple html dom library, although maybe I'm overlooking something with multilingual content.

I've defined an action (for Rules, triggered when a page is published) which detects all the links and stores them in its own lookup table.
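A lookup table of this kind could be sketched roughly as follows. This is only an illustration in Python with SQLite, not the sandbox module's code; the table name `inline_links` and the helpers `record_links` and `referrers_of` are hypothetical names chosen for the example.

```python
import sqlite3


def open_link_table(path=":memory:"):
    """Open (or create) a table mapping referring nodes to linked nodes."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS inline_links ("
        " source_nid INTEGER NOT NULL,"
        " target_nid INTEGER NOT NULL,"
        " PRIMARY KEY (source_nid, target_nid))"
    )
    return db


def record_links(db, source_nid, target_nids):
    """Replace the stored links for one referring node (call on publish/save)."""
    with db:  # commits on success, rolls back on error
        db.execute("DELETE FROM inline_links WHERE source_nid = ?", (source_nid,))
        db.executemany(
            "INSERT OR IGNORE INTO inline_links VALUES (?, ?)",
            [(source_nid, target) for target in set(target_nids)],
        )


def referrers_of(db, target_nid):
    """Return the nodes whose body links to target_nid (call before unpublish)."""
    rows = db.execute(
        "SELECT source_nid FROM inline_links WHERE target_nid = ? ORDER BY source_nid",
        (target_nid,),
    )
    return [row[0] for row in rows]
```

Replacing all rows for a referring node on every save keeps the table in step with the body text, at the cost of a delete and re-insert each time.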

I'm just working on the unpublish action now, and then I might try to get a reworded version up as a sandbox release if people are interested.

Let me clarify that finding

Garrett Albright's picture

Let me clarify that finding paths in content is not the hard part, whether you use DOM objects or regular expressions (as I do; going the DOM object route has been under consideration for a long time, but the current regex approach ain't broke). The hard part is figuring out if a link points to internal content.

Let's say your site is set up at http://example.com/ and Drupal is at the root level of the web root (so no subdirectory). You have a node 13 which has the alias "my-web-page". What paths can you use to link to it?
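To make the question concrete, here is a rough Python sketch (not Pathologic's actual logic) of resolving several plausible forms of a link to node 13. The alias table `ALIASES` and the function `resolve_nid` are hypothetical, and which forms actually work depends on the site's configuration, for instance whether clean URLs are enabled.

```python
import re
from urllib.parse import parse_qs, urlsplit

SITE_HOST = "example.com"      # assumption: the site from the example above
ALIASES = {"my-web-page": 13}  # hypothetical alias lookup table


def resolve_nid(href, aliases=ALIASES, site_host=SITE_HOST):
    """Return the node ID a link points to, or None if it is not a known node."""
    parts = urlsplit(href)
    if parts.scheme not in ("", "http", "https"):
        return None  # mailto:, ftp:, and so on
    host = parts.hostname
    if host and host not in (site_host, "www." + site_host):
        return None  # link to another site
    path = parts.path.strip("/")
    # Non-clean URLs carry the internal path in the "q" query parameter.
    if path in ("", "index.php"):
        q = parse_qs(parts.query).get("q")
        if q:
            path = q[0].strip("/")
    match = re.fullmatch(r"node/(\d+)", path)
    if match:
        return int(match.group(1))
    return aliases.get(path)
```

With this sketch, `node/13`, `/node/13`, `my-web-page`, `http://example.com/my-web-page`, `//example.com/node/13`, `?q=node/13` and `index.php?q=my-web-page` all resolve to 13, which shows how many spellings a tracker has to recognise.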

Okay, what about links to things on your site that aren't nodes - files, Views, etc. Do you track those? How do you filter them out?

What happens when you change a node's path - from "my-web-page" to "my-cool-web-page"?

Many multi-lingual sites use a path prefix to determine language, like "en/foo" or "jp/bar." How do you know when the first part of a given path is a language prefix and not part of the path itself?
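One way to approach the prefix question, sketched in Python under the assumption that the set of enabled language prefixes is known up front (the function name `split_language_prefix` is made up for illustration):

```python
def split_language_prefix(path, enabled_prefixes):
    """Split a path into (language prefix or None, remaining path).

    Only a first segment that exactly matches an enabled prefix is treated
    as a language prefix; anything else is left as part of the path.
    """
    trimmed = path.strip("/")
    segments = trimmed.split("/", 1)
    if segments[0] in enabled_prefixes:
        rest = segments[1] if len(segments) > 1 else ""
        return segments[0], rest
    return None, trimmed
```

This does not remove the ambiguity entirely: if a page has an alias that itself begins with an enabled prefix, a lookup of the full path is still needed to tell the two cases apart.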

I've managed to solve some of these problems for most cases. I encourage you to check out Pathologic's source and poach anything that you might find useful. (You might even find it useful to make Pathologic a requirement for your module.)

Good point - there are lots

alexkb's picture

Good point - there are lots of edge cases to consider that I hadn't thought of.

For now, I'm just focusing on node links so we can cleanly handle unpublishing. Once I've got the links (via simple html dom), I use the parse_url() function to break up each URL, which makes it easier to focus on the important parts.
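The same two steps can be sketched in Python, with the standard library's `html.parser` standing in for simple html dom and `urlsplit` standing in for PHP's parse_url(). The names `LinkCollector` and `extract_link_paths` are hypothetical, and this is not the module's code.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a chunk of body markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_link_paths(html):
    """Return the path component of each link, dropping query and fragment."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [urlsplit(href).path for href in collector.hrefs]
```

Keeping only the path discards the host, so a real check would still need to confirm that each link points at the current site before treating it as internal.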

Thinking about it more, perhaps a custom form validation handler could ensure people don't put in links via other means or formats. We could even accept aliased URLs (if an alias lookup returns correctly) and adjust the links in the body markup back to node/, which in effect will let Pathologic do its magic (great module, by the way!).

Definitely some good ideas here.. will have to get some code up soon. Cheers.

Ok, I've started off a very

alexkb's picture

Ok, I've started off a very raw version in a sandbox here:
https://drupal.org/sandbox/alexkb/1922328

It's not been tested, but the basics are there: we have a defined rule to detect and record the relationships of Pathologic URLs (i.e. node/) in the body field of nodes. It also checks for the simple html dom requirements during install. There's a lot more to do, which I'll get onto in a few days:

  • add configuration to trigger the module's features without using a rule (perhaps). The configuration option could be to either inform the content editor or enforce removal of referring links during a delete or unpublish operation. Or do nothing at all, of course.
  • based on configuration, set a validate handler via hook_form_alter on node deletions.
  • provide a maintenance function that lets you build the internal relationships table up from existing node data. There is code there to do some kind of population during module install, but this needs work.
  • intelligently work out which fields are rich text instead of assuming it's only the body field.
  • make the internal relationship URL checking more versatile, as per Garrett's suggestions.
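
The inform/enforce/do-nothing option from the first two items could look something like this minimal Python sketch. The mode names and the function `check_unpublish` are assumptions made for illustration, not what the sandbox code does.

```python
def check_unpublish(target_nid, referrers, mode="inform"):
    """Decide whether a node may be unpublished or deleted.

    Returns (allowed, message). `referrers` is the list of node IDs whose
    body links to the target; `mode` is "ignore", "inform" or "enforce".
    """
    if mode not in ("ignore", "inform", "enforce"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "ignore" or not referrers:
        return True, ""
    listing = ", ".join(f"node/{nid}" for nid in sorted(referrers))
    if mode == "enforce":
        # Block the operation until the referring links are removed.
        return False, f"node/{target_nid} is still linked from: {listing}."
    # Inform only: allow the operation but warn the editor.
    return True, f"Warning: node/{target_nid} is linked from: {listing}."
```

A validate handler would call something like this with the referrers found in the lookup table and set a form error when `allowed` is false.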

Like I said, very raw. Criticisms welcome! :)

@alexkb - I think you need to

gmclelland's picture

@alexkb - I think you need to fix the link to your sandbox. I'm guessing it is this http://drupal.org/sandbox/alexkb/1922328 ?