Question chinese webpage

Zexor

I am trying to copy the page source from the WebBrowser control into a TextBox, for pages that contain Chinese characters. The characters show up as a diamond shape with a question mark inside it. Chinese characters show up fine in my TextBox if I copy and paste them, but when they come from the page source they all show up wrong. How can I fix it?
 
I'd probably need to see the relevant code for how you get the text; I'm just guessing, but there could be a problem with that.
I also see that HtmlDocument has a DefaultEncoding property (read-only, as detected) and an Encoding property that can also be set, so if there is no problem with the code these should be something to look into.
 
If I use IE to open the URL and view the page source, everything looks right.
I use WebBrowser1.Navigate to go to the URL, and in WebBrowser1_DocumentCompleted I get WebBrowser1.DocumentText.
 
I loaded a Wiki page displaying Chinese chars in the WebBrowser and copied DocumentText to a TextBox.Text, and there were no problems with that preserving the content encoding, the Chinese chars displayed correctly in the TextBox.
 
OK, I narrowed it down to webpages that have charset=gbk in the meta tag. Those pages give me the diamond shapes, for example Woqudu.
 
There appears to be an encoding mismatch: the document reports the encoding used as "gb2312", while the meta tag and, for example, Firefox report "gbk". It should be possible to convert, though I could not figure that out right away. You can also get, for example, .Document.Body.OuterHtml, which returns correctly encoded text.
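
To show what such a mismatch does at the byte level, here is a minimal sketch in Python rather than VB.NET. It is not the WebBrowser code; the sample string and the helper name decode_page are made up for illustration. It encodes some text as GBK and then decodes the same bytes with three different charsets. The diamond with a question mark is the Unicode replacement character (U+FFFD), which a decoder emits for bytes it cannot map, so this is presumably what the TextBox is showing.

```python
# Illustration only: how a wrong charset turns valid GBK bytes into U+FFFD.
REPLACEMENT = "\ufffd"  # rendered as a diamond with a question mark


def decode_page(raw: bytes, charset: str) -> str:
    """Decode raw page bytes, substituting U+FFFD for undecodable bytes."""
    return raw.decode(charset, errors="replace")


# Hypothetical sample: two simplified characters plus one traditional
# character that GBK can encode but the older GB2312 set does not contain.
text = "中文 體"
raw = text.encode("gbk")

as_utf8 = decode_page(raw, "utf-8")     # wrong charset entirely
as_gb2312 = decode_page(raw, "gb2312")  # subset of GBK, partly works
as_gbk = decode_page(raw, "gbk")        # the charset the page declares

print("utf-8 has replacement chars: ", REPLACEMENT in as_utf8)
print("gb2312 has replacement chars:", REPLACEMENT in as_gb2312)
print("gbk round-trips correctly:   ", as_gbk == text)
```

GBK is a superset of GB2312, so decoding as "gb2312" only breaks on the characters that GB2312 lacks, while decoding as an unrelated charset breaks on everything. Decoding the raw bytes with the charset the page declares is the general idea behind any conversion.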
 
What is OuterHtml? It seems to show some extra labeling and formatting tags compared to the DocumentText version, and the tags got capitalized.
Also, when I navigate to a webpage, WebBrowser1_DocumentCompleted fires about 6 times instead of just once. Why is that?
 
The Document object (type HtmlDocument) exposes the parsed document object tree. The example there got the Body element (HtmlElement), and its OuterHtml is the complete HTML source as the WebBrowser sees it, including the outer Body tags.
WebBrowser1_DocumentCompleted section activated like 6 times instead of just once. Why is that?
The event is raised at many stages of loading, and there may also be sub-documents loading. I think the most I've seen so far was around 25 events for a single "page". Check the ReadyState property; the event should be raised once with the state Complete.
 