Rewrite newsletter HTML instead of changing it #7

Open
opened 2022-08-10 14:31:55 +00:00 by jjkiers · 4 comments
Owner

One of the goals of this project is to convert a newsletter into clean, readable text, without unnecessary clutter and without tracking (by default).

The first attempt used [kuchiki] and [sanitize-html-rs][sanitize] to clean and edit the HTML in place. That did not work well.

So now the [readability] crate ([GitHub][readability-github]) should be investigated to see whether it returns reasonable output. See [this issue][custom-extract] for info on how to split up the extraction function.

If that works, it becomes possible to generate custom HTML for the newsletter, resulting in a much cleaner newsletter.

For the HTML export, it would then also become possible to add a real title, and custom buttons or similar to load an image.

[kuchiki]: https://docs.rs/kuchiki/ "kuchiki"
[sanitize]: https://docs.rs/sanitize-html-rs "sanitize-html-rs"
[readability]: https://docs.rs/readability/ "readability crate"
[readability-github]: https://github.com/kumabook/readability "github"
[custom-extract]: https://github.com/kumabook/readability/issues/16 "this issue"
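A minimal sketch of what "generate custom HTML with a real title" could look like, assuming the extraction step has already produced cleaned body HTML. The names `escape_html` and `render_newsletter` are hypothetical and not part of this project or of any of the crates mentioned; only the standard library is used.

```rust
/// Escape text so it is safe inside HTML text nodes and attribute values.
fn escape_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(c),
        }
    }
    out
}

/// Wrap already-cleaned body HTML in a small page with a real title.
/// Assumption: `clean_body_html` is trusted output of the extraction step,
/// so it is inserted as-is; only the title is escaped.
fn render_newsletter(title: &str, clean_body_html: &str) -> String {
    let title = escape_html(title);
    format!(
        "<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n<title>{0}</title>\n</head>\n<body>\n<article>\n<h1>{0}</h1>\n{1}\n</article>\n</body>\n</html>\n",
        title, clean_body_html
    )
}
```

Calling `render_newsletter("Weekly News", "<p>Hello</p>")` returns a complete page that uses the title both in `<title>` and as the `<h1>`. Buttons for loading images on demand would need extra markup on top of this.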
jjkiers added the feature label 2022-08-10 14:31:55 +00:00
Author
Owner

Apparently, there is a newer crate that implements this algorithm: [mozilla-readability](https://crates.io/crates/mozilla-readability).

It is loosely based on [Mozilla's JS version](https://github.com/mozilla/readability), which is actively maintained.

So another alternative _might_ be to compile it into WASM and then use it in the code as a module.
Author
Owner

Another alternative is to convert the HTML to Markdown first, and then convert it back to HTML. That will lose most of the original styling, but could prove much easier to implement.

Example crate: [html2md](https://crates.io/crates/html2md)
Author
Owner

One concern about rewriting is that the parsers may not be able to turn the original email into good text. In that case, the output may be severely distorted.

That is not ideal, and needs to be mitigated.

One idea would be to extract all plain text from the email's HTML, and compare its length with the length of the converted text. If the two do not differ too much, the conversion can be assumed to have reasonable fidelity to the original.
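The length comparison could be sketched roughly as follows, using only the standard library. Everything here is an assumption for illustration: the function names, the naive tag stripping (a real implementation would use a proper HTML parser, skip `<script>`/`<style>` content, and decode entities), and the idea of expressing "not too different" as a ratio threshold such as `0.8`.

```rust
/// Very naive text extraction: drops everything between '<' and '>'.
/// Assumption: good enough for a rough length comparison only.
fn naive_text(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                // Keep words from adjacent elements apart.
                out.push(' ');
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out
}

/// Count non-whitespace characters, so layout differences do not matter.
fn visible_len(text: &str) -> usize {
    text.chars().filter(|c| !c.is_whitespace()).count()
}

/// Returns true if the converted text is within the allowed length ratio
/// of the original's plain text. `min_ratio` is expected to be in (0, 1];
/// the upper bound is its reciprocal.
fn looks_faithful(original_html: &str, converted_text: &str, min_ratio: f64) -> bool {
    let original = visible_len(&naive_text(original_html));
    let converted = visible_len(converted_text);
    if original == 0 {
        return converted == 0;
    }
    let ratio = converted as f64 / original as f64;
    ratio >= min_ratio && ratio <= 1.0 / min_ratio
}
```

If `looks_faithful` returns false, the tool could fall back to the original (sanitized) HTML instead of the rewritten version. Note that length alone cannot detect reordered or garbled text of similar size, so this is a cheap first check rather than a guarantee.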
Author
Owner

According to [this article](https://emschwartz.me/comparing-13-rust-crates-for-extracting-text-from-html/), there are two more great contenders:

[fast_html2md](https://crates.io/crates/fast_html2md):

> This library does a reasonable job transforming the HTML into markdown while being among the fastest performers and maintaining extremely low memory usage.
>
> In the tests above, it kept its memory footprint between 5-6kb, independent of the input size. This is impressive but unsurprising given that the underlying HTML library, lol_html, lets you tune the memory settings.
>
> The blog post A History of HTML Parsing at Cloudflare: Part 2 gives more detail on the history and architecture of lol_html. If you're doing any kind of HTML manipulation, I would recommend reading that post and trying out their library.

[dom_smoothie](https://crates.io/crates/dom_smoothie):

> While this has much higher memory usage than fast_html2md, and far fewer downloads at the time of writing, it is the only Readability implementation that correctly found the main text in the very limited subset of websites I tested. If you want to make sure you only include the text and none of the headers or other content, this might be the crate for you.
Reference: newsletter-to-web/newsletter-to-web#7