data.xml

parse can be extremely slow for certain input data

Details

  • Type: Defect
  • Status: Resolved
  • Priority: Major
  • Resolution: Declined
  • Affects Version/s: None
  • Fix Version/s: None
  • Component/s: None
  • Labels: None

Description

I'm still doing some experiments, but parse seems to take a very long time on the feed at http://www.cybletechnologies.com/?feed=rss2. I wonder if it's due to the huge CDATA section containing JS code?

I'll do some more experimentation to narrow it down, but I wanted to get at least a placeholder bug filed in case this is a known issue.

Activity

Ryan Senior added a comment -

I profiled this. The problem looks to be with DTD handling. I haven't worked much with DTDs, but it looks like the StAX parser is making a series of HTTP requests for resources referenced by the DTD. First it fetches http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd; my guess is that it then resolves the further entities referenced from that DTD. The parse did eventually finish, but it took around 10 minutes on my laptop.

If I pass in :support-dtd false to the parse call, it returns very quickly for me, around 4 milliseconds.

(parse (java.io.FileInputStream. "path/to/file.html") :support-dtd false)

I'm going to close this, as the behavior seems to be correct from a StAX perspective and :support-dtd false seems to be a pretty reasonable workaround.
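
For anyone hitting the same thing directly from Java, here is a rough sketch of what the workaround looks like at the StAX level. It assumes that :support-dtd false corresponds to the standard XMLInputFactory.SUPPORT_DTD property, and that the JDK's built-in StAX implementation is the one in use. The DtdFreeParse class and its countElements method are hypothetical names made up for this illustration; they are not part of data.xml.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of data.xml.
final class DtdFreeParse {

    // Counts start-element events with DTD support switched off, so the
    // parser should not go out and fetch the DOCTYPE's external subset.
    static int countElements(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: this is the StAX property behind :support-dtd false.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        try {
            XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(xml));
            int elements = 0;
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        elements++;
                    }
                }
            } finally {
                reader.close();
            }
            return elements;
        } catch (XMLStreamException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Same external DTD that the profiler showed being downloaded.
        String xml =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML+RDFa 1.0//EN\" "
            + "\"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd\">"
            + "<html><body/></html>";
        System.out.println(countElements(xml));
    }
}
```

One caveat with this approach: with DTD support off, entities that are declared only in the DTD are no longer defined, so a document that relies on them may fail to parse instead of being expanded.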

