Re: [squeak-dev] XMLTokenizer problem with ampersand

Jakob Reschke

1 Jun 2015

11:01 p.m.

I guess this will not help you, but a standalone ampersand is not valid XML. The ampersand introduces an entity reference, so if you want a literal ampersand in the text, the markup must be &amp;. Hence I would not expect any XML tokenizer or parser implementation to accept it.
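The same rule can be checked outside Squeak. The following is a minimal sketch in Python, assuming the standard library's `xml.etree.ElementTree` as a stand-in for a conforming XML parser (it is not XMLTokenizer); the helper name `parses_as_xml` is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise.

    Hypothetical helper for this illustration only.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A bare ampersand followed by a space is rejected.
print(parses_as_xml("<p>Tom & Jerry</p>"))

# The escaped form is accepted, and the parser decodes it back to '&'.
print(parses_as_xml("<p>Tom &amp; Jerry</p>"))
print(ET.fromstring("<p>Tom &amp; Jerry</p>").text)
```

The point of the sketch is only that a strict XML parser refuses the input rather than guessing what the author meant.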

HTML is more relaxed about this, so a standalone ampersand is accepted, but you would need some kind of HTMLTokenizer, and I do not know whether there is such a thing for Squeak. Does anyone else know of one?
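For comparison, here is a rough sketch of how a lenient HTML tokenizer treats the same input, assuming Python's standard `html.parser.HTMLParser` as the example (not a Squeak class). `TextCollector` and `html_text` are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text content of an HTML fragment (illustrative only)."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &amp; inside text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text(markup):
    """Return the concatenated text of markup, ignoring the tags."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# The bare ampersand passes through as ordinary text instead of raising.
print(html_text("<p>Tom & Jerry</p>"))

# A proper entity reference is decoded as usual.
print(html_text("<p>Tom &amp; Jerry</p>"))
```

Both calls yield the same text, which is the tolerance described above.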

Best regards Jakob

2015-06-01 20:05 GMT+02:00 karl ramberg karlramberg@gmail.com:

Hi, I'm parsing some HTML docs, but the XMLTokenizer chokes on a '&' followed by a space in a string. I guess '&' is used for something other than 'and' in HTML, and it causes an error when used in plain text.

Does anybody have a fix for this?

Karl

karl ramberg

2 Jun

12:20 a.m.

Hi, thanks for the info. I guess I need an HTMLTokenizer for what I'm doing. I had issues with &nbsp as well with the current XMLTokenizer.
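The &nbsp issue has a related cause: XML predefines only five named entities (amp, lt, gt, apos, quot), so nbsp is undefined unless a DTD declares it, whereas HTML defines it. A small Python sketch of that difference follows, assuming the standard `xml.etree.ElementTree` and `html` modules as stand-ins; `xml_error` is a hypothetical helper.

```python
import html
import xml.etree.ElementTree as ET


def xml_error(text):
    """Return the parser's error message, or None if text is well-formed XML.

    Hypothetical helper for this illustration only.
    """
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as exc:
        return str(exc)


# The named entity is not defined in plain XML, so parsing fails.
print(xml_error("<p>a&nbsp;b</p>"))

# The numeric character reference for the same character is fine in XML.
print(xml_error("<p>a&#160;b</p>"))

# HTML entity handling knows nbsp and maps it to U+00A0.
print(repr(html.unescape("a&nbsp;b")))
```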

Karl

Levente Uzonyi

4:10 a.m.

XMLTokenizer is not suitable to parse HTML documents. XML and HTML may look similar, but are very different. We used to use Soup[1] to parse HTML pages.

Levente

[1] http://squeaksource.com/Soup.html (watch out for versions which may not be Squeak-compatible)

Chris Muller

6:17 p.m.

On Mon, Jun 1, 2015 at 9:10 PM, Levente Uzonyi leves@elte.hu wrote:

XMLTokenizer is not suitable to parse HTML documents. XML and HTML may look similar, but are very different. We used to use Soup[1] to parse HTML pages.

Have you used Todd Blanchard's "HTML & CSS Validating Parser" [1]? If so, how does it compare to Soup?

[1] -- http://www.squeaksource.com/htmlcssparser.html

karl ramberg

4 Jun

6:23 p.m.

Hi, I tested three different HTML parsers and found Soup to work best for my needs. Thank you all.

Karl

squeak-dev@lists.squeakfoundation.org