I'm hoping to get some local help in my continuing effort to parse and import an RSS feed. I've made progress since I presented my impasses at the last UG meeting. My latest (and I hope last) major hurdle is caused by a limitation of the XPath Feeds module described here: http://drupal.org/node/1459870#comment-7435964 . Very simply, the RSS feed I'm attempting to parse has encoded HTML wrapped in XML. Here's a bit of the encoded HTML from the RSS feed:
&lt;div class=&quot;field field-type-content-taxonomy field-field-institutions&quot;&gt; &lt;div class=&quot;field-label&quot;&gt; &lt;h3&gt; Associated institutions:&nbsp; &lt;/h3&gt; &lt;/div&gt; &lt;div class=&quot;field-items&quot;&gt; &lt;div class=&quot;field-item odd&quot;&gt; Cincinnati Children&#039;s Hospital Medical Center &lt;/div&gt;
I can parse through the XML, but when I hit "&lt;" rather than "<", or "&gt;" rather than ">", marking the tags of the encoded HTML, I can't parse through it.
In other words, I need a way to transform "&lt;" to "<" and "&gt;" to ">" in the input. I'm wondering if I can use tidy_repair_string to do that, and whether setting a configuration parameter here, in FeedsXPathParserXML.inc, might do the trick:
class FeedsXPathParserXML extends FeedsXPathParserBase {
  /**
   * Implements FeedsXPathParserBase::setup().
   */
  protected function setup($source_config, FeedsFetcherResult $fetcher_result) {
    if (!empty($source_config['exp']['tidy'])) {
      $config = array(
        'input-xml' => TRUE,
        'wrap' => 0,
        'tidy-mark' => FALSE,
      );
      // Default tidy encoding is UTF8.
      $encoding = $source_config['exp']['tidy_encoding'];
      $raw = tidy_repair_string(trim($fetcher_result->getRaw()), $config, $encoding);
    }
    else {
      $raw = $fetcher_result->getRaw();
    }

But I'm unsure what parameters to set to what, or if I'm barking up the wrong tree. I'd very much welcome any help. Thanks very much.
Comments
Used str_replace instead of tidy_repair_string. Advice?
I couldn't get what I needed from tidy_repair_string, so I ended up using str_replace to replace the offending encoded HTML with literal HTML in the $raw input RSS feed. I hacked/patched the code into the FeedsXPathParserXML.inc file as follows (the three lines following the MAK comment below). I'd appreciate any advice on how to do this more legitimately. Should I create a formal patch? Thanks very much for any advice.

class FeedsXPathParserXML extends FeedsXPathParserBase {
  /**
   * Implements FeedsXPathParserBase::setup().
   */
  protected function setup($source_config, FeedsFetcherResult $fetcher_result) {
    if (!empty($source_config['exp']['tidy'])) {
      $config = array(
        'input-xml' => TRUE,
        'wrap' => 0,
        'tidy-mark' => FALSE,
      );
      // Default tidy encoding is UTF8.
      $encoding = $source_config['exp']['tidy_encoding'];
      $raw = tidy_repair_string(trim($fetcher_result->getRaw()), $config, $encoding);
    }
    else {
      $raw = $fetcher_result->getRaw();
    }
    /* MAK 052213 Unencode embedded HTML so that it can be parsed using XPath */
    $encoded_html = array("&lt;", "&gt;", "&quot;", "&nbsp;");
    $unencoded_html = array("<", ">", "\"", " ");
    $raw = str_replace($encoded_html, $unencoded_html, $raw);
    $doc = new DOMDocument();
    $use = $this->errorStart();
    $success = $doc->loadXML($raw);
    unset($raw);
    $this->errorStop($use, $source_config['exp']['errors']);
    if (!$success) {
      throw new Exception(t('There was an error parsing the XML document.'));
    }
    return $doc;
  }

  protected function getRaw(DOMNode $node) {
    return $this->doc->saveXML($node);
  }
}
I need a way to transform
You need PHP's htmlspecialchars_decode().
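For anyone who wants to see what that function does, here is a minimal sketch. The sample string is made up for illustration and is not taken from the actual feed. Note that htmlspecialchars_decode() only handles a small set of entities (&amp;, &lt;, &gt; and the quote entities), so something like &nbsp; is left alone; html_entity_decode() covers the wider set of named entities.

```php
<?php
// Hypothetical sample of escaped HTML, similar in shape to the feed content above.
$encoded = '&lt;div class=&quot;field-item odd&quot;&gt;Cincinnati Children&#039;s&nbsp;Hospital&lt;/div&gt;';

// ENT_QUOTES asks for both double- and single-quote entities to be decoded.
$decoded = htmlspecialchars_decode($encoded, ENT_QUOTES);

// html_entity_decode() additionally turns named entities such as &nbsp;
// into the corresponding character (here U+00A0, a non-breaking space).
$fully_decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";
```

In this sketch $decoded still contains the literal text &nbsp;, which matters if the result is fed back into an XML parser, because &nbsp; is not a predefined XML entity.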
Thanks!
Thanks!
Solution?
Milan, did you come up with a solution for this? I'm having the same issue, but I'm not very fluent in PHP. I would really appreciate some guidance if you've figured it out.
I'm working with Views RSS and hopefully a template solution. Thanks!
Solution?
I did come up with a solution, as I described in my May 23, 2013 comment. I used the str_replace function to change "&lt;" to "<", etc. The trick was to find the right spot in the XML parser's PHP code to insert the str_replace call and modify the raw feed held in the $raw variable.
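For readers who would rather not edit the module file, one possible alternative is to leave the feed alone and parse in two passes: let the XML parser decode the element's text, then load that text as HTML and run a second XPath query on it. This is only a sketch in plain PHP, run outside Feeds; the feed fragment, the element names and the XPath expressions are assumptions for illustration, and how to wire this into the Feeds XPath Parser is not shown.

```php
<?php
// Hypothetical feed fragment; the real feed's structure may differ.
$rss = <<<XML
<rss version="2.0"><channel><item>
<description>&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item odd&quot;&gt;Cincinnati Children&amp;#039;s Hospital Medical Center&lt;/div&gt;&lt;/div&gt;</description>
</item></channel></rss>
XML;

// First pass: parse the feed as XML.
$feed = new DOMDocument();
$feed->loadXML($rss);
$feed_xpath = new DOMXPath($feed);

// textContent is already entity-decoded by the XML parser,
// so this string contains real HTML tags.
$html = $feed_xpath->query('//item/description')->item(0)->textContent;

// Second pass: parse the decoded string as HTML.
// Collect libxml warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$fragment = new DOMDocument();
$fragment->loadHTML($html);
libxml_clear_errors();

// Match divs whose class list includes the exact token "field-item".
$html_xpath = new DOMXPath($fragment);
$nodes = $html_xpath->query(
  '//div[contains(concat(" ", normalize-space(@class), " "), " field-item ")]'
);
$institution = trim($nodes->item(0)->textContent);

echo $institution, "\n";
```

The trade-off compared with str_replace on the whole $raw string is that only the chosen element is decoded, so escaped characters elsewhere in the feed are not turned into markup by accident.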