Question: Please help! Multiple (http) WebRequests

xGhost

Member
Joined
Feb 8, 2010
Messages
21
Programming Experience
1-3
I have done quite some debugging on this one.
What I want to achieve:
1) Log in to a website & store the cookies in a container
2) With that cookie, request many pages and parse them.
3) Make the above efficient & fast.

Problem:
When I do one request at a time, it takes ages to request & parse x number of pages.

So what I wanted/want:
Multiple HttpWebRequests at the same time (or executing partially serial/parallel). I know that with sockets you can create an array and fire many requests at the same time (and quickly), but I'm not using sockets.

So my first thought was:
Create multiple threads, each doing a request, with ThreadPool.QueueUserWorkItem
-> I got a lot of errors, and half of the time I saw in the HTML that I wasn't logged in
-> changed my (sequential) requests into asynchronous ones
-> many errors disappeared (it still happens once in a while that I'm not logged in).

Conclusion:
My results are about 50% faster (than running one request at a time in a normal sequential loop). Here I began to think that only 2 requests are actually executed at the same time, even though more than one thread is started.

My second thought:
Hey, let's try the .NET 4.0 Parallel.For
-> Not many threads are actually started
-> gives the same performance as using the ThreadPool.

My conclusion is:
-> I think that just 2 requests are executed/started at the same time, one on CPU core 1 and another on CPU core 2

This is nothing like the performance of sockets, even on one core.
So my question is: how can I request and parse multiple pages at the same time (simultaneously, or mixed parallel/sequential, with results like an array of sockets gives)? It can't be true that only 2 parallel or simultaneous connections (requests) can be made.

My structure for this is now:
login
parallel loop (execute a method in the parallel loop)


method:
VB.NET:
        Try
            Dim httpStateRequest As HttpState = New HttpState

            httpStateRequest.httpRequest = WebRequest.Create(url)
            httpStateRequest.httpRequest.CookieContainer = cookies
            httpStateRequest.httpRequest.KeepAlive = False

            ' Get the response object
            Dim ar As IAsyncResult
            ar = httpStateRequest.httpRequest.BeginGetResponse(AddressOf HttpResponseCallback, httpStateRequest)
        Catch wex As WebException
            Console.WriteLine("Exception occurred on request: {0}", wex.Message)
        End Try
VB.NET:
    Private Sub HttpResponseCallback(ByVal ar As IAsyncResult)
        running = running & ("running & I'm busy with the request on: " & Date.Now.ToString & " in thread: " & Thread.CurrentThread.ManagedThreadId) & vbCrLf
        Try
            Dim httpRequestState As HttpState = ar.AsyncState

            ' Complete the asynchronous request
            httpRequestState.httpResponse = httpRequestState.httpRequest.EndGetResponse(ar)
            ' Read the response into a Stream object.
            'Dim httpResponseStream As Stream = httpRequestState.httpResponse.GetResponseStream()
            Dim httpResponseStreamReader As New StreamReader(httpRequestState.httpResponse.GetResponseStream())
            Dim result2 = httpResponseStreamReader.ReadToEnd.Trim
            ' Done reading, close the reader (and the underlying response stream)
            httpResponseStreamReader.Close()
            Dim doc2 As New HtmlAgilityPack.HtmlDocument()
            doc2.LoadHtml(result2)
            rootNode = doc2.DocumentNode

            'do something with the result
            Return
        Catch ex As Exception
            Console.WriteLine("Exception: {0}", ex.Message)
        End Try
    End Sub
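For completeness, the HttpState class is just a small state holder that carries the request and the response between BeginGetResponse and the callback, roughly like this:
VB.NET:
' Small state holder passed through BeginGetResponse / EndGetResponse.
' (Assumes Imports System.Net like the snippets above.)
Public Class HttpState
    Public httpRequest As HttpWebRequest
    Public httpResponse As HttpWebResponse
End Class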

Parallel loop:
VB.NET:
Parallel.For(begin, iterations, Sub(i)
                                    counter += 1
                                    strThreads = strThreads & "I'm: " & counter & " and in thread: " & Thread.CurrentThread.ManagedThreadId & vbCrLf
                                    executeAmethodWhichDoesAnAsyncRequest(params come here)
                                End Sub)

A screenshot of the thread start times is attached: 2eki5qf.jpg

What you see in the first multiline textbox are threads which are possibly put in a waiting state (for example, "I'm: 221 and in thread: 6" means page 221 is to be requested in thread 6).
Btw, I have a dual-core 2.8 GHz CPU & a 15+ Mbps broadband connection.
 
Conclusion:
My results are about 50% faster; here I began to think that only 2 requests are actually executed at the same time, even though more than one thread is started
Most web servers limit you to two simultaneous connections from one client (HTTP 1.1 protocol specification).
Multiple HttpWebRequests at the same time (or executing partially serial/parallel). I know that with sockets you can create an array and fire many requests at the same time (and quickly), but I'm not using sockets.
HttpWebRequest is a wrapper class for common operations on a Socket, so you are using sockets.

With Parallel not out yet I can only guess that it uses the ThreadPool (that looks right: Task Parallel Library), which would explain why only two thread ids are used (the ThreadPool reuses threads). The method you call here (the one that calls BeginGetResponse) finishes almost immediately, so there is actually no need for you to "parallel" this part.

I don't think there is anything you can do to speed things up here; everything is done asynchronously and the limit is the web server's responses.
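For example, a plain loop like the sketch below already starts all the downloads concurrently (RequestPage here is a placeholder for the method from your first post that creates the request and calls BeginGetResponse, and the URL is made up):
VB.NET:
' Sketch: BeginGetResponse returns right away, so this plain loop queues all
' the requests almost instantly; the downloads then overlap on their own,
' limited only by the connection limit towards the server.
For i As Integer = 1 To 50
    RequestPage("http://www.somedomain.com/page.php?id=" & i) ' placeholder URL/helper
Next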
 
Thanks for the response :).

Is there not a way to do many connections quickly in a sequential way, then?
In other words, I could do the requests on one CPU core and do the parsing on another.
I remember that in VB6 you could create a (Winsock) socket() array variable and those ran very fast; I received/parsed my results much faster then.
 
KeepAlive for HTTP also defaults to True, so if this is what it appears to be, multiple requests to the same server, then you're not even wasting time connecting/disconnecting; the two underlying socket connections to this server are kept up during all the requests - given that the web server allows it (most web servers do).
 
I tested it out by setting the property to true (it's false in the code), but then I see that when I request a page (linked to a CookieContainer with cookies) I'm not logged in. If I set it to false, I'm logged in. Probably something I'm not seeing.

So what I need to do is declare 2 head request/response objects, link the CookieContainer to them, set the KeepAlive property to true, and then do all the threading with those 2 head objects? I'm a bit confused :p. (Probably because I've been staring at the code for quite some time over the last hours.)
 
As I said, the KeepAlive default value is True, so there is no need to set it. I can't imagine this affecting the requests; I only mentioned it to explain that this is something that can't be optimized further. Whether the request is transported over one or another socket connection does not change the request or the response being sent. KeepAlive is a dynamic thing: if during a series of requests there is a delay long enough for a connection to time out, then for the next request a new connection is simply established. This happens at the transport level and is transparent at the application level. This is something that goes on all the time during regular browsing sessions.
 
Thanks for the answers btw.

It does affect it, I think. In the above code I've set KeepAlive to false. If I change false to true or comment out the line, only 1 out of 20 requests works. I mean... every request works, but I'm not logged in, which means I can't parse the page.

If I reset it to false it can happen that with the first 5 requests I'm not logged in, but then I see in the HTML result that every connection is logged in (which is also weird, btw).

Before I go parsing like hell, I log in once and fill a CookieContainer.
Then when I do the requests for the parsing, I link that container to the request object for every URL.

Too bad I can't tell the server that I just want one div instead of a whole/full page :p. It just seems slow: 1.5 seconds on average to request/parse one page on a 15+ Mbps broadband connection, and 5 seconds/page on a 100 kbps connection.
 
I'm messing around a bit with the properties of the request object. It seems that this line speeds it up big time.
VB.NET:
httpStateRequest.httpRequest.AutomaticDecompression = DecompressionMethods.GZip

From 102 seconds to 34 seconds for 20 pages on a ~100 kbps avg wireless line. It really does :p. That's 1/3 of the time.
 
xGhost said:
KeepAlive = False
I actually didn't notice you had that set in the first post. Not setting it (True) should speed things up; this can be as much as 10-30% for multiple requests in succession.
It does affect it, I think. In the above code I've set KeepAlive to false. If I change false to true or comment out the line, only 1 out of 20 requests works. I mean... every request works, but I'm not logged in, which means I can't parse the page.
Logically it does not, and I'm also not able to reproduce any such effect. That you say some of the responses appear not logged in either way indicates a different problem; what that could be I don't know.
It seems that this line speeds it up big time.
The AutomaticDecompression property causes the Accept-Encoding header to be set; if the server supports it, it sends the content compressed, and with the property set the response object also automatically decompresses the response stream. In some cases that may improve transmission time by more than the extra time it takes to compress+decompress the content.
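A minimal synchronous sketch of that, with a placeholder URL:
VB.NET:
' Sketch: AutomaticDecompression makes the request send "Accept-Encoding: gzip".
' If the server compresses the body, the response stream is decompressed
' transparently, so it is read like any other response.
Dim req As Net.HttpWebRequest = CType(Net.WebRequest.Create("http://www.somedomain.com/"), Net.HttpWebRequest)
req.AutomaticDecompression = Net.DecompressionMethods.GZip
Using resp As Net.HttpWebResponse = CType(req.GetResponse(), Net.HttpWebResponse)
    Using sr As New IO.StreamReader(resp.GetResponseStream())
        Console.WriteLine("Received {0} characters", sr.ReadToEnd().Length)
    End Using
End Using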
 
Thanks for the responses.
Yes, the cookies aren't always accepted. If KeepAlive = false and I do 20 requests, the results of the first 5~7 requests show that I'm not logged in; after that I see that all the remaining requests have a successful login (sometimes I see that 20/20 are logged in, sometimes not). If I set it to true, just 1/20 results shows me logged in. Yes, this is quite weird.

What I just do is:
In the login method:
-> initialise a new CookieContainer
-> do the POST
-> for each cookie in the response object, I put it in the CookieContainer
-> in the asynchronous method, right after the request .Create, I set request.CookieContainer = myCookieContainer

Also, if I set KeepAlive = true, I probably can't create a new request object every time & would need to work with a global request object (like a field) instead, but that would cause trouble with the async requests (since the request could still be busy with a BeginGetResponse).

So what I need to figure out is why my cookie handling is not 100% waterproof.


Besides that I have another idea, but I do not know if it's possible with the WebRequest class. When you connect to a URL, the first thing that is done is looking up the IP address, in other words a DNS resolve. Then a handshake follows between the client and the server (with ACKs and flags being set). Basically, in my case, can't I just do the resolve once with a GetHostEntry and connect to an ip:port every time instead of giving a URL in WebRequest.Create()? That would give a performance boost if I'm not mistaken. The only problem I have then is finding the appropriate port for the IP and the syntax to put in the .Create, for example: WebRequest.Create("http://" & ip & ":" & port & "/page.php"). But this is just an idea :p.

Scrap this idea; it is possible :p. It's just that with the IPs I've found, when you connect to them (even in a browser), all you get as text in the HTML page is -> Online ... (probably because the server hosts multiple sites and needs the hostname from the URL).
--------
The AutomaticDecompression property causes the Accept-Encoding header to be set; if the server supports it, it sends the content compressed, and with the property set the response object also automatically decompresses the response stream. In some cases that may improve transmission time by more than the extra time it takes to compress+decompress the content.


Yes, I've tested it on a slow connection and a fast connection. It gives a benefit for both. With the slow connection it sped things up to 1/3 of the original time. With the fast connection (which normally took around 30 seconds for 20 pages) it gave an average time of 23~24 seconds. It seems that the benefit is bigger for a slower connection.
 
-> for each cookie in the response object, I put it in the CookieContainer
You don't need to do this; just set a CookieContainer object on the first request and reuse it for the subsequent requests.
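A sketch of that pattern (the URL and form fields are placeholders, assuming a normal forms login POST; the container fills itself from the login response and is then simply reused):
VB.NET:
Dim cookies As New Net.CookieContainer()

' Login: with the container attached, any Set-Cookie headers in the
' response are stored in it automatically - no manual copying needed.
Dim login As Net.HttpWebRequest = CType(Net.WebRequest.Create("http://www.somedomain.com/login.php"), Net.HttpWebRequest)
login.Method = "POST"
login.ContentType = "application/x-www-form-urlencoded"
login.CookieContainer = cookies
Dim postData As Byte() = System.Text.Encoding.ASCII.GetBytes("user=me&pass=secret") ' placeholder form fields
login.ContentLength = postData.Length
Using s As IO.Stream = login.GetRequestStream()
    s.Write(postData, 0, postData.Length)
End Using
login.GetResponse().Close()

' Subsequent requests: just reuse the same container.
Dim req As Net.HttpWebRequest = CType(Net.WebRequest.Create("http://www.somedomain.com/page.php"), Net.HttpWebRequest)
req.CookieContainer = cookies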
It gives a benefit for both.
(AutomaticDecompression) That will differ between servers, and most importantly between types of content. Already-compressed content, for example, would just waste time with another round of compress/decompress. In your case, downloading text, compression will be a benefit if the content is of some size, but here too it depends on the processing capacity at both ends and the transfer speed. There is no general rule for this.
can't I just do the resolve once
From what I can tell, with KeepAlive the internal code looks up the connection by URI, not by ip/port. Also keep in mind that the HTTP 1.1 protocol defines KeepAlive as the default behaviour; it was one of the major improvements to HTTP that just makes things overall more efficient.

Btw, here are two tools that you can use to get a better view of what is going on, both with the requests and the sockets:
TCPView for Windows
Fiddler Web Debugger - A free web debugging tool
 
Based on the last few days of research I have discovered some new information. The latest IE and FF browsers have decided to raise the persistent connections-per-server limit from 2 to 6, see for example AJAX - Connectivity Enhancements in Internet Explorer 8. FF also has a default of 15 non-persistent connections per server.

So how can this be configured with the WebRequest? By accessing the ServicePoint property. With the ServicePoint Class (System.Net) you can configure the connection pooling behaviour. The number of connections is set with the ConnectionLimit property; if following IE/FF this should be a maximum of 6 persistent (KeepAlive) or 15 non-persistent. KeepAlive really has some benefits, and I will post some stats and test code. Remember that with KeepAlive the server may restrict you to 2 connections regardless, but it also may not.

This is the test code. The requests here are all done to the same address, getting a 25kb text document. The first request is simplified here, just getting the cookies and configuring the ServicePoint; then 50 async requests are done (the GetResponse handler just reads the stream and closes - a rough sketch of it follows below the code).
VB.NET:
Dim u As String = "http://www.somedomain.com/"
Dim req As Net.HttpWebRequest = CType(Net.WebRequest.Create(u), Net.HttpWebRequest)
Dim cc As New Net.CookieContainer
req.CookieContainer = cc
req.Method = Net.WebRequestMethods.Http.Head
req.ServicePoint.ConnectionLimit = 8
req.ServicePoint.MaxIdleTime = 500
req.GetResponse.Close()
'watch = Stopwatch.StartNew
For i As Integer = 1 To 50
    req = CType(Net.WebRequest.Create(u), Net.HttpWebRequest)
    req.CookieContainer = cc
    'req.AutomaticDecompression = Net.DecompressionMethods.GZip
    'req.KeepAlive = False
    req.BeginGetResponse(AddressOf GetResponse, req)
Next
I commented out a few lines there that I toggled to test the different settings. I used a Stopwatch to measure the time for the 50 requests until the last one had finished processing its response. MaxIdleTime is only meaningful with KeepAlive; since there is no human-interaction delay between requests here, 500 ms is more than enough to keep the connections around for reuse, while they all close quickly once all the requests are done.
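A rough sketch of such a handler, as described (the real one also kept a count so the Stopwatch could be stopped when the last response finished):
VB.NET:
Private Sub GetResponse(ByVal ar As IAsyncResult)
    ' The request object was passed as state to BeginGetResponse above.
    Dim req As Net.HttpWebRequest = CType(ar.AsyncState, Net.HttpWebRequest)
    Using resp As Net.HttpWebResponse = CType(req.EndGetResponse(ar), Net.HttpWebResponse)
        Using sr As New IO.StreamReader(resp.GetResponseStream())
            sr.ReadToEnd() ' just read the response and let Using close everything
        End Using
    End Using
End Sub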

These are the stats, time in milliseconds:
3063 '6 connections, KeepAlive
4209 '2 connections, KeepAlive
5081 '2 connections, KeepAlive, gzip

9354 '6 connections, no KeepAlive
25990 '2 connections, no KeepAlive

2341 '8 connections, KeepAlive (for 500ms)
2468 '15 connections, KeepAlive (for 500ms)
4526 '15 connections, no KeepAlive
The first three are with KeepAlive (3, 4 and 5 seconds): the default 2 connections gives a fair result, using 6 connections is slightly faster, and notice that 2 connections with gzip is 1 second slower than without gzip.

The next two are without KeepAlive: 2 connections is really slow (26 seconds); it takes a lot of time connecting. 6 connections is faster, but no match for KeepAlive.

Then I tried 8 and 15 connections. While KeepAlive here is clearly way outside the protocol, the MaxIdleTime of 500 ms ensures a quick disconnect, still with the benefit of reusing the socket immediately. The 15 non-persistent connections are comparable to the default 2 connections with KeepAlive at around 4+ seconds, while the difference between 8 and 15 reused sockets is insignificant; 15 is actually slower due to the costly connection time. 2.3 seconds for processing 50 requests with 8 "quick" persistent connections is nearly twice as fast as the default settings. If the server allows it (as this one did) I would have no hesitation using 8 connections marked as KeepAlive, when I knew they didn't take up server resources by idling for a minute after use.

I did verify with TcpView that the number of connections really was used, and that MaxIdleTime worked as expected. It was actually with TcpView that I also noticed the new IE/FF behaviour just recently.

edit: Though the above targets a single ServicePoint, these settings can also be applied globally using ServicePointManager. There you can also see the 'connection by URI' lookup method I mentioned being used: the FindServicePoint method.
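A short sketch of the global variant (placeholder URL):
VB.NET:
' Applies to ServicePoints created after this point:
Net.ServicePointManager.DefaultConnectionLimit = 6

' FindServicePoint does the same 'by URI' lookup the requests use internally,
' so a single server's ServicePoint can also be configured without a request object:
Dim sp As Net.ServicePoint = Net.ServicePointManager.FindServicePoint(New Uri("http://www.somedomain.com/"))
sp.ConnectionLimit = 8
sp.MaxIdleTime = 500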
 
This is some really valuable information. Really. Thanks for this.

I'll first try to get the cookie handling right.
So what I need to do is:
set KeepAlive to true
set the ServicePoint properties

Does it matter if I use a local variable for the request (inside the loop)? Since with threading you'll get issues otherwise, like that there is already a BeginGetResponse started, etc.

These are really nice stats :p
 
Does it matter if I use a local variable for the request (inside the loop)? Since with threading you'll get issues otherwise, like that there is already a BeginGetResponse started, etc.
It is not the variable that matters, it is the object referenced by it; in this case the request object returned by the Create function.
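In other words, a local variable inside the loop is fine; each pass gets a fresh request object and the variable just points to the latest one (same pattern as the test code above, with u and cc as before):
VB.NET:
For i As Integer = 1 To 50
    ' Create returns a brand new request object each iteration; the requests that
    ' were already started keep running regardless of what the variable points to now.
    Dim req As Net.HttpWebRequest = CType(Net.WebRequest.Create(u), Net.HttpWebRequest)
    req.CookieContainer = cc
    req.BeginGetResponse(AddressOf GetResponse, req)
Next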
 
The cookie handling seems solved now, and the KeepAlive set to true seems solved as well. As you said, with KeepAlive at true it goes quite a bit faster. I'll now test the ServicePoint properties and fire up TCPView.

I've learned a lot through this process & this topic. Mostly information which is not so easily accessible on the internet. Also thanks for the comparison chart.
 