I guess this will not help you, but a standalone ampersand is not valid XML. It introduces an entity reference, so a literal ampersand in the text must be written as &amp;. Hence I would not expect any XML tokenizer or parser implementation to accept it.
HTML is more relaxed about this, so a standalone ampersand is valid, but you would need some kind of HTMLTokenizer, and I do not know whether there is such a thing for Squeak. Does anyone else know of one?
Best regards Jakob
2015-06-01 20:05 GMT+02:00 karl ramberg karlramberg@gmail.com:
Hi, I'm parsing some HTML docs, but the XMLTokenizer chokes on a '&' followed by a space in a string. I guess '&' is used for things other than 'and' in HTML, and it causes an error when used in plain text.
Does anybody have a fix for this?
Karl
Hi, thanks for the info. I guess I need an HTMLTokenizer for what I'm doing. I had issues with &nbsp; as well with the current XMLTokenizer.
Karl
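The difference discussed above can be reproduced outside Squeak. The following is only a minimal sketch in Python's standard library, not Squeak's XMLTokenizer or Soup; the helper names `xml_accepts`, `html_text` and `TextCollector` are made up for illustration. It shows a strict XML parser rejecting a bare '&' and an HTML-only named entity (XML predefines only amp, lt, gt, apos and quot), while a lenient HTML parser accepts both.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the character data of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; in text nodes.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def xml_accepts(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def html_text(markup):
    """Return the text a lenient HTML parser extracts from the markup."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


if __name__ == "__main__":
    for sample in ("<p>Tom &amp; Jerry</p>", "<p>Tom & Jerry</p>", "<p>a&nbsp;b</p>"):
        print(repr(sample), "XML ok:", xml_accepts(sample), "HTML text:", repr(html_text(sample)))
```

Only the escaped form passes the XML parser; the HTML parser returns the text for all three samples, which is why an HTML-aware parser is the better fit for real-world pages.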
XMLTokenizer is not suitable for parsing HTML documents. XML and HTML may look similar, but they are very different. We used to use Soup[1] to parse HTML pages.
Levente
[1] http://squeaksource.com/Soup.html (watch out for versions which may not be Squeak-compatible)
On Mon, Jun 1, 2015 at 9:10 PM, Levente Uzonyi leves@elte.hu wrote:
XMLTokenizer is not suitable to parse HTML documents. XML and HTML may look similar, but are very different. We used to use Soup[1] to parse HTML pages.
Have you used Todd Blanchard's "HTML & CSS Validating Parser" [1]? If so, how does it compare to Soup?
Hi, I tested three different HTML parsers and found Soup to work best for my needs. Thank you all.
Karl
On Tue, Jun 2, 2015 at 6:17 PM, Chris Muller asqueaker@gmail.com wrote:
On Mon, Jun 1, 2015 at 9:10 PM, Levente Uzonyi leves@elte.hu wrote:
XMLTokenizer is not suitable to parse HTML documents. XML and HTML may look similar, but are very different. We used to use Soup[1] to parse HTML pages.
Have you used Todd Blanchard's "HTML & CSS Validating Parser" [1]? If so, how does it compare to Soup?
squeak-dev@lists.squeakfoundation.org