We are getting the following unexpected output when parsing HTML:
Input:
<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title></title>
<meta content="width=device-width, initial-scale=1" name="viewport" />
</head>
<body about="" prefix="rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# schema: http://schema.org/">
<main>
<article>
<div datatype="rdf:HTML" id="content" property="schema:description">
<p>foo</p>
<div rel="schema:hasPart" resource="#bar">
<p property="schema:description" datatype="rdf:HTML"><span>bar</span></p>
</div>
</div>
</article>
</main>
</body>
</html>
Output:
<https://dokie.li/tmp/test.html#bar> <http://schema.org/description> "<span xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:schema=\"http://schema.org/\">bar</span>"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .
<https://dokie.li/tmp/test.html> <http://schema.org/description> "\n <p xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:schema=\"http://schema.org/\">foo</p>\n <div rel=\"schema:hasPart\" resource=\"#bar\" xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:schema=\"http://schema.org/\">\n \n </div>\n"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .
<https://dokie.li/tmp/test.html> <http://schema.org/hasPart> <https://dokie.li/tmp/test.html#bar> .
Expected (from http://rdf.greggkellogg.net/distiller):
<http://example.org/> <http://schema.org/description> "\n <p>foo</p>\n <div rel=\"schema:hasPart\" resource=\"#bar\">\n <p property=\"schema:description\" datatype=\"rdf:HTML\"><span>bar</span></p>\n </div>\n"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .
<http://example.org/> <http://schema.org/hasPart> <http://example.org/#bar> .
<http://example.org/#bar> <http://schema.org/description> "<span>bar</span>"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .
Note that in our output the schema:description literal for the document is missing the markup and content inside the div (\n <p property=\"schema:description\" datatype=\"rdf:HTML\"><span>bar</span></p>\n ).
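To make the discrepancy concrete, here is a minimal sketch in plain Python (it uses neither parser). The literal values are copied from the Output and Expected sections above with the N-Triples escapes decoded. The helper `strip_xmlns` and the variable names are made up for this illustration. The sketch removes the injected xmlns declarations from our output and checks that the only remaining difference is the nested paragraph.

```python
import re

# Namespace declarations that appear on every element in our output.
NS = (
    ' xmlns="http://www.w3.org/1999/xhtml"'
    ' xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    ' xmlns:schema="http://schema.org/"'
)

# Document-level schema:description literal from our output.
actual = (
    '\n <p' + NS + '>foo</p>'
    '\n <div rel="schema:hasPart" resource="#bar"' + NS + '>\n \n </div>\n'
)

# The same literal as reported by the distiller.
expected = (
    '\n <p>foo</p>'
    '\n <div rel="schema:hasPart" resource="#bar">'
    '\n <p property="schema:description" datatype="rdf:HTML"><span>bar</span></p>'
    '\n </div>\n'
)

# The nested element that should have been kept inside the outer literal.
nested = '<p property="schema:description" datatype="rdf:HTML"><span>bar</span></p>'


def strip_xmlns(markup: str) -> str:
    """Remove xmlns and xmlns:prefix attributes (hypothetical helper)."""
    return re.sub(r'\s+xmlns(?::[\w-]+)?="[^"]*"', '', markup)


normalized = strip_xmlns(actual)

print("nested <p> in expected literal:", nested in expected)
print("nested <p> in our literal:     ", nested in normalized)
print("only difference is nested <p>: ", expected.replace(nested, '') == normalized)
```

If the last check holds, the namespace declarations are only a serialization difference, and the actual problem is that the nested property element is dropped from the enclosing rdf:HTML literal.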
Is this a bug in rdf-ext / rdfa-streaming-parser, or does the issue perhaps lie on our end somehow? It'd be great if you could reproduce / confirm.
Originally posted by @csarven in #66