Rewrite newsletter HTML instead of changing it #7

Open
opened 2022-08-10 14:31:55 +00:00 by jjkiers · 3 comments
Owner

One of the goals of this project is to convert a newsletter into clean, readable text, without unnecessary clutter and without tracking (by default).

The first attempt used [kuchiki](https://docs.rs/kuchiki/) and [sanitize-html-rs](https://docs.rs/sanitize-html-rs) to clean and edit the HTML in place, but that did not work well.

The next step is to investigate whether the [readability](https://docs.rs/readability/) crate ([GitHub](https://github.com/kumabook/readability)) returns reasonable output. See [this issue](https://github.com/kumabook/readability/issues/16) for information on how to split up the extraction function.

If that works, it becomes possible to generate custom HTML for the newsletter, resulting in a much cleaner newsletter.

The HTML export could then also include a real title and custom controls, such as a button to load an image.
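As a rough illustration of what generating custom HTML could look like, here is a minimal sketch that wraps already-extracted content in a fresh page with a real title, and replaces an image with a button that only records where the image lives. The function names (`escape_html`, `render_newsletter`, `image_placeholder`) and the `data-src` convention are hypothetical and not part of this project; `body_html` is assumed to be trusted output from whichever extractor ends up being used.

```rust
/// Escape text so it can be placed safely inside HTML text or attribute values.
fn escape_html(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(c),
        }
    }
    out
}

/// Hypothetical placeholder for an image: a button carrying the image URL,
/// so nothing is fetched (and nothing is tracked) until the reader asks for it.
fn image_placeholder(src: &str) -> String {
    format!(
        "<button type=\"button\" data-src=\"{}\">Load image</button>",
        escape_html(src)
    )
}

/// Wrap extracted content in a minimal page with a real title.
/// `body_html` is inserted as-is and is assumed to be clean already.
fn render_newsletter(title: &str, body_html: &str) -> String {
    format!(
        "<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"utf-8\">\n<title>{title}</title>\n</head>\n<body>\n<h1>{title}</h1>\n{body}\n</body>\n</html>\n",
        title = escape_html(title),
        body = body_html,
    )
}
```

A small script on the exported page would still be needed to turn the `data-src` value into an actual image on click; that part is left out here.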

jjkiers added the feature label 2022-08-10 14:31:55 +00:00
Author
Owner

Apparently, there is a newer crate that implements this algorithm: [mozilla-readability](https://crates.io/crates/mozilla-readability).

It is based loosely on [Mozilla's JS version](https://github.com/mozilla/readability), which is actively maintained.

So another alternative _might_ be to compile the JS version to WASM and then use it in the code as a module.

Author
Owner

Another alternative is to convert the HTML to Markdown first, and then convert it back to HTML. That loses most of the original styling, but could prove much easier to implement.

Example crate: [html2md](https://crates.io/crates/html2md)

Author
Owner

One concern about rewriting is that the parsers may fail to turn the original email into good text. In that case, the output may be severely distorted.

That is not ideal, and needs to be mitigated.

One idea is to extract all plain text from the email's HTML and compare its length with the length of the converted text. If the two do not differ too much, the conversion can be assumed to have reasonable fidelity to the original.
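The length comparison could be sketched roughly as follows. This is only an illustration under several assumptions: the tag stripper is deliberately naive (it does not decode entities or skip `<script>` and `<style>` content), the function names are hypothetical, and the tolerance is an arbitrary value that would need tuning against real newsletters.

```rust
/// Naively extract text from HTML: drop everything between '<' and '>',
/// then collapse runs of whitespace into single spaces.
fn visible_text(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' if in_tag => {
                in_tag = false;
                // Keep words separated where a tag used to be.
                out.push(' ');
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Ratio of converted text length to original text length, in characters.
/// Returns 1.0 when both are empty and 0.0 when only the original is empty.
fn length_ratio(original_html: &str, converted_text: &str) -> f64 {
    let original = visible_text(original_html).chars().count();
    let converted = converted_text
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .chars()
        .count();
    if original == 0 {
        return if converted == 0 { 1.0 } else { 0.0 };
    }
    converted as f64 / original as f64
}

/// Treat the conversion as plausible when the lengths differ by at most
/// `tolerance` (for example, 0.2 allows a 20% difference either way).
fn looks_faithful(original_html: &str, converted_text: &str, tolerance: f64) -> bool {
    (length_ratio(original_html, converted_text) - 1.0).abs() <= tolerance
}
```

A length check like this only catches large losses or duplications. It cannot tell whether the remaining text is in the right order, so it works best as a cheap first filter that decides when to fall back to the original HTML.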

Reference: newsletter-to-web/newsletter-to-web#7