Remove all markup but retain text #569

lonix1 · 2025-03-13T04:57:32Z

Suppose my users can post comments, but the comments must contain plain text only: no HTML or any other markup. I did this:

var s = new HtmlSanitizer();
s.AllowedTags.Clear(); // disallow every tag

If the input is: foo bar <span>123</span> baz qux
Then the output is: foo bar baz qux

How can I configure it so it gives me: foo bar 123 baz qux

I realise this creates a problem if the input is foo bar <script>alert('xss')</script> baz qux. Any advice on how to handle this?

For example in HtmlRuleSanitizer there is a "tag flattening" feature.

Answered by mganss


mganss (Maintainer) · 2025-03-13T12:01:20Z

For your specific use case I would suggest using AngleSharp directly instead of HtmlSanitizer:

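using AngleSharp.Html.Parser;

var parser = new HtmlParser();
var html = "foo bar <span>123</span> baz <script>alert('xss')</script> qux";
var doc = parser.ParseDocument(html);
// TextContent concatenates all descendant text nodes, including script bodies.
var text = doc.Body.TextContent;
// "foo bar 123 baz alert('xss') qux"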
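On the XSS concern raised in the question: the value returned by TextContent is plain text, so a script body such as alert('xss') is inert as long as it is HTML-encoded wherever it is later written into a page. Below is a minimal sketch of that encoding step, assuming the comment is emitted into HTML by hand (templating engines such as Razor encode by default). `CommentText` and `ToHtml` are hypothetical names used only for illustration; `WebUtility.HtmlEncode` is part of the .NET base class library.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of HtmlSanitizer or AngleSharp.
public static class CommentText
{
    // Encodes plain text so it can be embedded in an HTML page
    // without being interpreted as markup.
    public static string ToHtml(string plainText)
    {
        // Treat null as empty so callers always get a non-null string back.
        if (string.IsNullOrEmpty(plainText))
        {
            return string.Empty;
        }

        // Characters such as <, > and & are replaced with entities.
        return WebUtility.HtmlEncode(plainText);
    }
}
```

For example, `CommentText.ToHtml("<b>hi</b>")` returns `&lt;b&gt;hi&lt;/b&gt;`, which a browser displays as literal text rather than as bold markup.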

1 reply

lonix1 Mar 13, 2025
Author

Thank you, that is an excellent approach!

(I'm already using HtmlAgilityPack, but I'm sure it has similar functionality.)

tiesont · 2025-03-13T23:20:53Z

If you really wanted to use HtmlSanitizer to accomplish this, I have this in a utility class (for pretty much the same reason):

// Requires: using System.Collections.Generic; using Ganss.Xss;
/// <summary>
/// Sanitizes the specified HTML body fragment. Allows no markup.
/// </summary>
/// <param name="markup">An HTML body fragment.</param>
/// <param name="keepText">Whether or not to retain the text content from removed markup.</param>
/// <returns>The sanitized HTML body fragment.</returns>
public static string WhitewashMarkup(string markup, bool keepText = false)
{
    if (!string.IsNullOrWhiteSpace(markup))
    {
        var options = new HtmlSanitizerOptions
        {
            AllowedTags = new HashSet<string>()
        };

        var sanitizer = new HtmlSanitizer(options)
        {
            KeepChildNodes = keepText
        };

        markup = sanitizer.Sanitize(markup);
    }

    return markup;
}

I'm sure there are easier ways, but this has worked for me for a while now.
