For those using modules like linkit or ckeditor_link (which let you link to internal pages from within a node's body), is there any module that tracks these inline links?
What I'm getting at is that if the inline links are tracked, we can do something before the linked node is unpublished or deleted, e.g. display a warning, or ask the user to remove the link and rewrite the content of the referring page.
It seems like a fairly important thing; otherwise large sites might find themselves with lots of broken links around the place.
There is the linkchecker module, which tells the editor about broken links when editing the page. But this notice isn't shown until the referring page is edited, or unless someone actively monitors the broken link report.
Should a more active solution be patched into linkchecker, or is this kind of inline link relationship tracking something that a new module should do?
Comments
@alexkb - I too would like to
@alexkb - I too would like to know the answer to that question. The only modules I have seen that do this are file_entity and media. I have heard a story of someone using the rules module to parse the node body for inline links and save them to an entity reference field, so that they could build a view that shows the backlinks.
There's no such module, no.
There's no such module, no. As the developer of Pathologic, I can tell you that one problem is that picking URLs out of content and determining whether they link to internal site content is far more difficult than it sounds, especially on a multi-lingual site.
The Boise Drupal Guy!
simple html dom
Hey Garrett Albright,
It's quite easy using the simple html dom library, although maybe I'm overlooking something with multilingual content.
I've defined an action (for Rules, triggered when a page is published) which detects all the links and stores them in its own lookup table.
Just working on the unpublish action now, and then I might try to get a reworded version up as a sandbox release if people like.
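For anyone following along, here is a rough sketch of that idea: pull every link out of a body on publish, store it in a lookup table, and query the table before unpublishing. It is written in Python with only the standard library rather than as a Drupal action, and it is not the sandbox code. The names `record_links`, `referrers_of` and the in-memory `backlinks` table are made up for illustration; a real module would use a database table.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(body_html):
    """Return all link targets found in the body, in document order."""
    parser = LinkCollector()
    parser.feed(body_html)
    parser.close()
    return parser.hrefs


# Hypothetical lookup table: link target -> ids of the nodes that link to it.
backlinks = defaultdict(set)


def record_links(referrer_nid, body_html):
    """On publish: index every link found in the node's body."""
    for href in extract_links(body_html):
        backlinks[href].add(referrer_nid)


def referrers_of(target_path):
    """Before unpublish: list the nodes that still link to this path."""
    return sorted(backlinks.get(target_path, ()))
```

With nodes 7 and 9 both linking to `node/13`, `referrers_of("node/13")` returns both ids, which is enough to show a warning. Note the sketch never removes stale rows when a referring node is edited or republished; that would need handling too.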
Let me clarify that finding
Let me clarify that finding paths in content is not the hard part, whether you use DOM objects or regular expressions (as I do; going the DOM object route has been under consideration for a long time, but the current regex approach ain't broke). The hard part is figuring out if a link points to internal content.
Let's say your site is set up at http://example.com/ and Drupal is at the root level of the web root (so no subdirectory). You have a node 13 which has the alias "my-web-page". What paths can you use to link to it?
Okay, what about links to things on your site that aren't nodes - files, Views, etc. Do you track those? How do you filter them out?
What happens when you change a node's path - from "my-web-page" to "my-cool-web-page"?
Many multi-lingual sites use a path prefix to determine language, like "en/foo" or "jp/bar." How do you know when the first part of a given path is a language prefix and not part of the path itself?
I've managed to solve some of these problems for most cases. I encourage you to check out Pathologic's source and poach anything that you might find useful. (You might even find it useful to make Pathologic a requirement for your module.)
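To make the questions above concrete, here is a minimal sketch of the "is this link internal, and which node is it?" step. It is Python, not Pathologic's code, and it only covers the cases raised here: absolute versus relative links, `?q=` style paths, aliases, and language prefixes. `BASE_HOST`, `ALIASES` and `LANG_PREFIXES` are stand-ins for what a real site would look up from its configuration and alias table.

```python
from urllib.parse import parse_qs, urlsplit

BASE_HOST = "example.com"             # the example site from this thread
ALIASES = {"my-web-page": "node/13"}  # hypothetical alias table
LANG_PREFIXES = {"en", "jp"}          # the prefixes used as examples above


def resolve(path):
    """Map an alias or a node/<digits> path to its system path, else None."""
    if path in ALIASES:
        return ALIASES[path]
    segments = path.split("/")
    if len(segments) == 2 and segments[0] == "node" and segments[1].isdecimal():
        return path
    return None  # files, Views pages, etc. are not tracked in this sketch


def internal_path(href):
    """Return e.g. 'node/13' if href points at a node on this site, else None."""
    parts = urlsplit(href)
    if parts.scheme and parts.scheme not in ("http", "https"):
        return None  # mailto:, javascript:, and so on
    if parts.netloc and parts.netloc.lower() != BASE_HOST:
        return None  # a link to some other host
    # Without clean URLs the path travels in the query string: ?q=node/13
    query = parse_qs(parts.query)
    path = (query["q"][0] if "q" in query else parts.path).strip("/")
    # Treat the first segment as a language prefix only if the rest resolves;
    # otherwise "en" could be a genuine part of the path.
    first, _, rest = path.partition("/")
    if first in LANG_PREFIXES and rest and resolve(rest) is not None:
        return resolve(rest)
    return resolve(path)
```

Under these assumptions `node/13`, `/node/13`, `?q=node/13`, `http://example.com/my-web-page` and `en/my-web-page` all come back as `node/13`. Because the result is the system path and not the alias, renaming the alias later does not invalidate what was stored.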
The Boise Drupal Guy!
Good point - there are lots
Good point - there are lots of edge cases to consider that I hadn't thought of.
For now, I'm just focusing on node links so we can cleanly handle unpublishing. Once you've got the links (via simple html dom), I'm using the parse_url() function to break up each URL, which makes it easier to focus on the important parts.
Thinking about it more, perhaps a custom form validation handler could ensure people don't put in links via other means or formats. We could even accept the aliased URLs (if an alias lookup returns correctly) and adjust the links in the body markup back to node/, which in effect will let Pathologic do its magic (great module by the way!).
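A rough sketch of that rewriting step, again in Python and not as a Drupal validation handler: find each href, and if an alias lookup succeeds, swap in the system path. `ALIASES` is a hypothetical stand-in for the site's alias table, and the regex only handles double-quoted href attributes, so treat it as an illustration only.

```python
import re

ALIASES = {"my-web-page": "node/13"}  # hypothetical alias lookup

# Captures: everything up to the opening quote, the href value, the closing quote.
HREF_RE = re.compile(r'(<a\b[^>]*?\bhref=")([^"]*)(")', re.IGNORECASE)


def canonicalize_links(body_html):
    """Rewrite aliased hrefs to their system path; leave other links untouched."""

    def swap(match):
        href = match.group(2)
        target = ALIASES.get(href.strip("/"))
        return match.group(1) + (target or href) + match.group(3)

    return HREF_RE.sub(swap, body_html)
```

Here `<a href="/my-web-page">` becomes `<a href="node/13">`, while links with no matching alias pass through unchanged.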
Definitely some good ideas here.. will have to get some code up soon. Cheers.
Ok, I've started off a very
Ok, I've started off a very raw version in a sandbox here:
https://drupal.org/sandbox/alexkb/1922328
It's not been tested, but the basics are there: we have a defined rule to detect and record the relationships of Pathologic-style URLs (i.e. node/) in the body field of nodes. It also checks for simple html dom requirements during install. There's a lot more to do, which I'll get onto in a few days:
Like I said, very raw. Criticisms welcome! :)
@alexkb - I think you need to
@alexkb - I think you need to fix the link to your sandbox. I'm guessing it is this http://drupal.org/sandbox/alexkb/1922328 ?