how to ignore the opening script tag and closing script tag

navagomeza (Member; joined Jan 13, 2014; 12 messages; programming experience: 1-3)

Hello,

I would like to ask for help in writing a small check for a string that contains HTML tags and can also contain JavaScript tags.
I want to be able to check the string and skip any content that is within <script></script>.



I am stripping extra spaces, tabs and line breaks from the HTML, and some of the HTML has inline JavaScript, which is not ideal, but right now I have to work around it.

Here is my small method, where I would like to add the new check:

VB.NET:
    Public Shared Function MinifyStringContent(ByVal strContent As String) As String
        Dim rtnStringContent As String = ""
        If Not IsNothing(strContent) AndAlso strContent <> "" Then
            rtnStringContent = strContent.ToString.Replace(vbCrLf, "").Replace("  ", " ").Replace(vbTab, "")
        End If
        Return rtnStringContent.Trim()
    End Function


Thank you for your help.

navagomeza.
 
For reading or manipulating HTML I would in almost all cases use the free library Html Agility Pack. Having the HTML element tree parsed for you, so that you can handle elements as elements and not as string fragments, is usually preferable.

Here is a sample using that library that does what I think you are asking. It selects all text nodes except the ones in script tags and applies the same replacements:
        Dim doc As New HtmlAgilityPack.HtmlDocument
        doc.Load("input.htm")

        For Each textnode In doc.DocumentNode.SelectNodes("//text()")
            If Not textnode.ParentNode.Name = "script" Then
                textnode.InnerHtml = textnode.InnerText.Replace(vbCrLf, "").Replace("  ", " ").Replace(vbTab, "")
            End If
        Next
        doc.Save("output.htm")
 
Hello John, and thank you for pointing me to HtmlAgilityPack. I did not know it existed. I am attempting to use your example in my code, but I do not know how to return the value from the function after calling doc.Save(string).


This is what I have been able to do with your code:


VB.NET:
    Public Shared Function MinifyStringContent(ByVal strContent As String) As String
        Dim doc As New HtmlAgilityPack.HtmlDocument
        doc.Load(strContent)
        If Not IsNothing(strContent) AndAlso strContent <> "" Then
            For Each textnode In doc.DocumentNode.SelectNodes("//text()")
                If Not textnode.ParentNode.Name = "script" Then
                    textnode.InnerHtml = textnode.InnerText.Replace(vbCrLf, "").Replace("  ", " ").Replace(vbTab, "")
                End If
            Next
        End If
        Return doc.DocumentNode.OuterHtml
    End Function


But the code breaks at this line:


VB.NET:
    doc.Load(strContent)


The error reads: Illegal characters in path.


The method is called from the Render method in BasePage.vb:


VB.NET:
Protected Overrides Sub Render(ByVal writer As System.Web.UI.HtmlTextWriter)
    Dim stringWriter As System.IO.StringWriter = New System.IO.StringWriter
    Dim htmlWriter As HtmlTextWriter = New HtmlTextWriter(stringWriter)


    MyBase.Render(htmlWriter)
    Dim html As String = stringWriter.ToString()
    html = Functions.MinifyStringContent(html)
    writer.Write(html)
End Sub




Would you help me to identify what I am doing incorrectly? Thank you for your help.




navagomeza
 
The Load method accepts a path, stream, or reader. If you want to load from string content, use the LoadHtml method instead.

As for output, the Save method accepts a similar path, stream, or writer, where the writer is an IO.TextWriter. HtmlTextWriter happens to be one, so you can pass it to the method and save straight to the writer if you want to: doc.Save(writer)

Moved thread to the ASP.Net section of the forums. Someone there might have a better idea than HtmlAgilityPack, something more "native" to the ASP.Net platform for this kind of thing.
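
As a minimal sketch of the LoadHtml and Save(writer) suggestion above: the method name MinifyToWriter and its writer parameter are hypothetical, and the replacements are copied from the earlier sample. It also assumes that SelectNodes returns Nothing when the XPath matches no nodes, so the loop is guarded.

```vbnet
Imports System.IO

Public Class Functions
    ' Hypothetical variant of MinifyStringContent that writes the result
    ' to a TextWriter instead of returning a String.
    Public Shared Sub MinifyToWriter(ByVal strContent As String, ByVal writer As TextWriter)
        If String.IsNullOrEmpty(strContent) Then Return

        Dim doc As New HtmlAgilityPack.HtmlDocument
        ' LoadHtml parses markup held in a string; Load expects a path, stream, or reader.
        doc.LoadHtml(strContent)

        ' Guard against Nothing, which SelectNodes returns when no node matches.
        Dim textNodes As HtmlAgilityPack.HtmlNodeCollection = doc.DocumentNode.SelectNodes("//text()")
        If textNodes IsNot Nothing Then
            For Each textNode As HtmlAgilityPack.HtmlNode In textNodes
                ' Leave inline JavaScript untouched.
                If Not textNode.ParentNode.Name = "script" Then
                    textNode.InnerHtml = textNode.InnerText.Replace(vbCrLf, "").Replace("  ", " ").Replace(vbTab, "")
                End If
            Next
        End If

        ' HtmlTextWriter derives from TextWriter, so it can be passed in directly.
        doc.Save(writer)
    End Sub
End Class
```

From the Render override, this could be called as Functions.MinifyToWriter(stringWriter.ToString(), writer), which replaces the separate writer.Write(html) step.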
 
Hi John, a quick question, will this be the correct syntax to also check for the style element?

If (Not textnode.ParentNode.Name = "script") OrElse (textnode.ParentNode.Name = "style")

Thank you.
 
To understand how those two expressions are evaluated (or not evaluated), read the documentation page: OrElse Operator (Visual Basic)
 
I missed a Not in my check:

VB.NET:
If (Not textnode.ParentNode.Name = "script") OrElse (Not textnode.ParentNode.Name = "style")
 
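I would say you have that wrong. If Name="style", the first expression is True, the second expression is not evaluated, and the code block is entered. If Name="script", the first expression is False but the second is True, so the code block is entered as well. The condition is True for every name, so nothing is ever skipped.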

It would seem more likely that you want to enter code block when Name is neither "script" nor "style":
If Not (textnode.ParentNode.Name = "script" OrElse textnode.ParentNode.Name = "style") Then

What happens when Name="script"? (True+not evaluated)=True > Not (True) = False, code block is not entered.
What happens when Name="style"? (False+True)=True > Not (True) = False, code block is not entered.
What happens when Name="div"? (False+False)=False > Not (False) = True, code block is entered.
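
The three cases above can be checked with a small console sketch. The helper name ShouldMinify and the module name are hypothetical; the condition itself is the one shown above.

```vbnet
Module ParentNameCheck
    ' Hypothetical helper: True when the text node's parent is neither script nor style.
    Function ShouldMinify(ByVal parentName As String) As Boolean
        Return Not (parentName = "script" OrElse parentName = "style")
    End Function

    Sub Main()
        Dim names As String() = New String() {"script", "style", "div"}
        For Each name As String In names
            ' Prints False for script and style, True for div.
            Console.WriteLine("{0}: {1}", name, ShouldMinify(name))
        Next
    End Sub
End Module
```

An equivalent way to write the same condition is parentName <> "script" AndAlso parentName <> "style".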
 
Hello John, you are correct again.

What I need is to skip any text whose ParentNode.Name is script or style, and, when it is neither script nor style, to remove carriage returns, extra spaces, tabs, and new lines from the HTML.

Perhaps I would do better this way:


VB.NET:
                If Not (textnode.ParentNode.Name = "script" OrElse textnode.ParentNode.Name = "style") Then
                    textnode.InnerHtml = textnode.InnerText.Replace(vbCrLf, "").Replace("  ", " ").Replace(vbTab, "").Replace(vbNewLine, "")
                End If

Or,

VB.NET:
                If Not (textnode.ParentNode.Name = "script") Then
                    If Not (textnode.ParentNode.Name = "style") Then
                        textnode.InnerHtml = textnode.InnerText.Replace(vbCrLf, "").Replace("  ", " ").Replace(vbTab, "").Replace(vbNewLine, "")
                    End If
                End If

What do you think? By the way, thank you for your help and your patience in explaining.

Best regards.
 
Hi JohnH.

I have been enjoying using the Html Agility Pack to minify my page's HTML. Today I ran into an issue: the closing option tag gets removed by the Html Agility Pack, and that affects some of the other functions in the page.

Do you know if there is a configuration I need to add to my code so this tag gets included?

This is how my method currently looks:

VB.NET:
        Dim htmlDoc As New HtmlAgilityPack.HtmlDocument


        htmlDoc.OptionDefaultStreamEncoding = Encoding.UTF8
        Dim strNoSpaces As String = Regex.Replace(strContent, "\s{2,}", " ")
        htmlDoc.LoadHtml(strNoSpaces)
        If Not IsNothing(strNoSpaces) AndAlso strNoSpaces <> "" Then
            For Each textnode In htmlDoc.DocumentNode.SelectNodes("//text()")
                If Not (textnode.ParentNode.Name = "script" OrElse textnode.ParentNode.Name = "style") Then
                    textnode.InnerHtml = textnode.InnerText.Replace(vbCrLf, "").Replace("  ", " ").Replace(vbTab, "").Replace(vbNewLine, "")
                End If
            Next
        End If
        Return htmlDoc.DocumentNode.OuterHtml


Thank you very much for your help.

Ale.
 