Rewrite newsletter HTML instead of changing it #7

jjkiers · 2022-08-10T14:31:55Z

jjkiers commented

2022-08-10 14:31:55 +00:00

One of the goals of this project is to convert a newsletter into clean, readable text, without unnecessary clutter and without tracking (by default).

The first try used [kuchiki](https://docs.rs/kuchiki/) and [sanitize-html-rs](https://docs.rs/sanitize-html-rs) to clean and edit the HTML. That did not work well.

So now, the [readability](https://docs.rs/readability/) crate ([GitHub](https://github.com/kumabook/readability)) should be investigated to see whether it returns reasonable output. See [this issue](https://github.com/kumabook/readability/issues/16) for info on how to split up the extraction function.

If that works, it becomes possible to generate custom HTML that presents a much cleaner version of the newsletter.

For the HTML export, it would then also become possible to add a real title and custom controls, such as a button to load an image.

jjkiers added the feature label 2022-08-10 14:31:55 +00:00

jjkiers commented

2022-12-14 10:32:02 +00:00

Apparently, there is a newer crate that implements this algorithm: [mozilla-readability](https://crates.io/crates/mozilla-readability).

It is based loosely on [Mozilla's JS version](https://github.com/mozilla/readability), which is actively maintained.

So another alternative might be to compile it into WASM and then use it in the code as a module.

jjkiers commented

2022-12-29 11:48:19 +00:00

Another alternative is to convert the HTML to Markdown first, and then convert back to HTML. That will lose most of the original styling, but could prove to be much easier to implement.

Example crate: [html2md](https://crates.io/crates/html2md)

jjkiers commented

2022-12-29 11:57:19 +00:00

One concern about rewriting is that the parsers may fail to turn the original email into good text. In that case, the output may be severely distorted.

That is not ideal, and needs to be mitigated.

One idea would be to extract all plain text from the email's HTML and compare its length with the length of the converted text. If the two lengths are not too different, the conversion could be assumed to have reasonable fidelity to the original.
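A rough sketch of that length comparison, using only the standard library. Everything here is an assumption for illustration: the helper names (`naive_plain_text`, `text_len`, `length_ratio`, `looks_faithful`) are hypothetical, the tag stripper is deliberately naive (it does not handle `<script>`/`<style>` bodies, comments, or entities), and the tolerance value is a guess that would need tuning on real newsletters.

```rust
/// Hypothetical helper: collect the text outside of `<...>` tags.
/// Naive on purpose; a real implementation would use an HTML parser.
fn naive_plain_text(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                // Keep words from adjacent elements apart.
                out.push(' ');
            }
            _ if !in_tag => out.push(ch),
            _ => {}
        }
    }
    out
}

/// Count non-whitespace characters, so layout differences do not matter.
fn text_len(s: &str) -> usize {
    s.chars().filter(|c| !c.is_whitespace()).count()
}

/// Ratio of the converted text length to the original plain-text length.
fn length_ratio(original_html: &str, converted_text: &str) -> f64 {
    let original = text_len(&naive_plain_text(original_html));
    let converted = text_len(converted_text);
    if original == 0 {
        // Nothing to compare against: only an empty result counts as equal.
        return if converted == 0 { 1.0 } else { f64::INFINITY };
    }
    converted as f64 / original as f64
}

/// Assume reasonable fidelity when the ratio is within `tolerance` of 1.0.
fn looks_faithful(original_html: &str, converted_text: &str, tolerance: f64) -> bool {
    (length_ratio(original_html, converted_text) - 1.0).abs() <= tolerance
}

fn main() {
    let html = "<h1>Weekly news</h1><p>Hello <b>world</b>, read more.</p>";
    let good = "Weekly news\n\nHello world, read more.";
    let bad = "Hello";
    // 0.25 is an arbitrary tolerance chosen for this example.
    println!("good conversion faithful: {}", looks_faithful(html, good, 0.25));
    println!("bad conversion faithful: {}", looks_faithful(html, bad, 0.25));
}
```

If the converted output is Markdown or HTML rather than plain text, its syntax characters inflate the length, so it would probably make sense to reduce it to plain text as well before comparing.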

jjkiers commented

2025-01-24 21:49:27 +00:00

According to [this article](https://emschwartz.me/comparing-13-rust-crates-for-extracting-text-from-html/), there are two more great contenders:

[fast_html2md](https://crates.io/crates/fast_html2md):

> This library does a reasonable job transforming the HTML into markdown while being among the fastest performers and maintaining extremely low memory usage.
>
> In the tests above, it kept its memory footprint between 5-6kb, independent of the input size. This is impressive but unsurprising given that the underlying HTML library, lol_html, lets you tune the memory settings.
>
> The blog post A History of HTML Parsing at Cloudflare: Part 2 gives more detail on the history and architecture of lol_html. If you're doing any kind of HTML manipulation, I would recommend reading that post and trying out their library.

[dom_smoothie](https://crates.io/crates/dom_smoothie):

> While this has much higher memory usage than fast_html2md, and far fewer downloads at the time of writing, it is the only Readability implementation that correctly found the main text in the very limited subset of websites I tested. If you want to make sure you only include the text and none of the headers or other content, this might be the crate for you.
