Rewrite newsletter HTML instead of changing it #7
One of the goals of this project is to convert a newsletter into clean, readable text, without unnecessary clutter and without tracking (by default).
The first attempt used kuchiki and sanitize-html-rs to clean and edit the HTML in place. That did not work well.
The next step is to investigate whether the readability crate (GitHub) returns reasonable output. See this issue for info on how to split up the extraction function.
If that works, then it becomes possible to generate custom HTML for the newsletter, containing a much cleaner newsletter.
For the HTML export, it would then also become possible to add a real title and custom controls, such as a button to load an image.
Apparently, there is a newer crate that implements this algorithm: mozilla-readability.
It is based loosely on Mozilla's JS version, which is actively maintained.
So another alternative might be to compile it into WASM and then use it in the code as a module.
Another alternative is to convert the HTML to Markdown first, and then convert back to HTML. That will lose most of the original styling, but could prove to be much easier to implement.
Example crate: html2md
One concern with rewriting is that the parsers may fail to turn the original email into usable text. In that case, the output may be severely distorted.
That is not ideal and needs to be mitigated.
One idea would be to extract all plain text from the email's HTML and compare its length with the length of the converted text. If the two do not differ by much, the conversion can be assumed to have reasonable fidelity to the original.
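A minimal sketch of that length check, using only the standard library. The function names and the `min_ratio` threshold are hypothetical and not part of the project. The tag stripping is deliberately naive: it does not skip `<script>` or `<style>` bodies and does not decode entities. It also assumes the converted output is available as plain text.

```rust
/// Count non-whitespace characters that sit outside of `<...>` tags.
/// Naive on purpose: script/style bodies, comments and entities are
/// not treated specially (assumption for this sketch).
fn html_text_len(html: &str) -> usize {
    let mut in_tag = false;
    let mut count = 0;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            c if !in_tag && !c.is_whitespace() => count += 1,
            _ => {}
        }
    }
    count
}

/// Count non-whitespace characters in already converted plain text.
fn text_len(text: &str) -> usize {
    text.chars().filter(|c| !c.is_whitespace()).count()
}

/// Hypothetical fidelity check: the converted text counts as faithful if
/// its length is within a factor of `min_ratio` of the original's text
/// length, in either direction. `min_ratio` is expected in (0.0, 1.0].
fn looks_faithful(original_html: &str, converted_text: &str, min_ratio: f64) -> bool {
    let original = html_text_len(original_html);
    let converted = text_len(converted_text);
    if original == 0 {
        // Nothing to lose: only an empty conversion matches an empty source.
        return converted == 0;
    }
    let ratio = converted as f64 / original as f64;
    ratio >= min_ratio && ratio <= 1.0 / min_ratio
}
```

Ignoring whitespace keeps the comparison stable when the converter reflows lines. A failed check could be used to fall back to the original HTML, though a suitable threshold would have to be found by testing against real newsletters.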