Best way for concurrent HTTP connections?

littlebigman

Hello

I need to download a bunch of pages from a web server, i.e. spidering. I know that servers are typically configured to allow only a couple of concurrent connections from a given IP, but even two parallel downloads would roughly halve the total running time compared with downloading one page at a time.

At first sight, I guess there are two ways to do this:
- multi-threading
- non-blocking, async HTTP connections

Before I go ahead and investigate, has someone already done this and could share some code?

Thank you.
 
I've never done threading before. Here's some pseudo-code I thought of. Could someone experienced with this sort of thing tell me whether I'm heading in the right direction, or whether I should do things very differently?

VB.NET:
Sub Main()
	Dim ItemsToDownload() as String
	Dim index as Integer
	Dim subindex as Integer
	
	'------ 1. Get all web page IDs from DB
	ItemsToDownload = SQLite.Query("SELECT id FROM companies")
	
	'------ 2. Take ten items at a time, and create ten threads each time
	For index = 0 to ItemsToDownload.Count-1 Step 10
		URL = "www.acme.com/search.php?id=" & ItemsToDownload(index)
		
		'Launch ten threads to download web pages concurrently
		For subindex=0 to 9
			Dim t As Thread
			t = New Thread(AddressOf Me.BackgroundProcess)
			'How to pass item ID to thread?
			t.Start()
		Next subindex	
		
		'Wait for ten threads to be done
	Next index
	
End Sub

'------ 3. Routine called by threads to download/parse web page, and save infos into DB
Private Sub BackgroundProcess(URL as String)
	Dim response as String
	
	response = Download(URL)
	'use regex to extract information from web page
	'and update SQLite with infos
End Sub

Thank you for any hint.
 
The existing async functionality for sockets is usually the better option: for example HttpWebRequest.BeginGetResponse, as mentioned in the other thread, or the Async methods of WebClient (see WebClient Methods). Other socket classes have similar async functions, which usually perform better and are easier to handle.
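
As a rough sketch of the BeginGetResponse route, here is a console program that fetches a single hard-coded URL. The module name, OnResponse and the ManualResetEvent are illustrative choices of mine, not code from the other thread:

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Threading

Module BeginGetResponseSketch
    ' Signalled by the callback so Main knows when the request has finished.
    Private ReadOnly done As New ManualResetEvent(False)

    Sub Main()
        Dim request As HttpWebRequest = _
            DirectCast(WebRequest.Create("http://www.google.com"), HttpWebRequest)

        ' Returns immediately; OnResponse runs later on a worker thread.
        ' The request itself is passed along as the state object.
        request.BeginGetResponse(AddressOf OnResponse, request)

        done.WaitOne()
    End Sub

    Private Sub OnResponse(ByVal ar As IAsyncResult)
        Dim request As HttpWebRequest = DirectCast(ar.AsyncState, HttpWebRequest)
        Try
            ' EndGetResponse completes the call started by BeginGetResponse.
            Using response As WebResponse = request.EndGetResponse(ar)
                Using reader As New StreamReader(response.GetResponseStream())
                    Dim body As String = reader.ReadToEnd()
                    Console.WriteLine(body.Length.ToString() & " characters")
                End Using
            End Using
        Catch ex As WebException
            Console.WriteLine("Request failed: " & ex.Message)
        Finally
            done.Set()
        End Try
    End Sub
End Module
```

The WaitOne call is only there to keep the console program alive until the callback has run; in a form you would update the UI from the callback instead (marshalled to the UI thread).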
 
OK, I'll check how to write a loop to fetch the next ID and launch a new asynchronous connection every few seconds to handle it. Thanks for the help.
 
I spent the evening googling for newbie-accessible examples. The closest I found is this:

BeginGetResponse Method

It seems pretty complicated just to get a web page into a string, but I didn't find a VB.Net example of how to call a WebClient object asynchronously.
 
WebClient is simpler; its purpose is to simplify web requests. Generally, with the async calls you call the method and handle the matching Completed event, or as the help puts it, "When the download completes, the DownloadStringCompleted event is raised". Here are code samples using WebClient's DownloadStringAsync method and DownloadStringCompleted event: DownloadStringCompletedEventHandler Delegate (System.Net)
 
Thanks for the link. WebClient seems good enough for what I'm trying to do.

One last thing, though: in DownloadStringCallback2(), I don't seem to be allowed to access the UI (in this case, I'm trying to copy the web page contents into a RichTextBox widget). How can the callback function return the web page to the calling function so that I can actually do something with it?


VB.NET:
Imports System.Net

Public Class Form1
    Private Shared Sub DownloadStringCallback2(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
        If e.Cancelled = False AndAlso e.Error Is Nothing Then
            Dim textString As String = CStr(e.Result)

            'Console.WriteLine(textString)
            'Why can't call RTB1.Text?
            RichTextBox1.
        End If
    End Sub

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Dim client As WebClient = New WebClient()

        'TODO : merge threading code to call 5,000 URL's concurrently

        AddHandler client.DownloadStringCompleted, AddressOf DownloadStringCallback2
        Dim uri As Uri = New Uri("http://www.google.com")

        client.DownloadStringAsync(uri)
    End Sub
End Class

Thank you.
 
Accessing the UI from a secondary thread requires you to marshal the call to the UI thread, which can be done with Control.Invoke. See How to: Make Thread-Safe Calls to Windows Forms Controls.
This is the short version:
VB.NET:
Private Delegate Sub SetTextCallback(ByVal text As String)

Private Sub SetText(ByVal text As String)
    Me.RichTextBox1.Text = text
End Sub
Sample call from secondary thread:
VB.NET:
Me.Invoke(New SetTextCallback(AddressOf SetText), "the text")
 
Thanks for the tip. Actually, someone told me elsewhere that I simply had to remove the Shared keyword in the async function to be able to access the UI from within:

VB.NET:
Imports System.Net
Imports System.Text.RegularExpressions

Public Class Form1
    Dim busy As Boolean = False

    'Instance method (no Shared keyword), so it can access the form's controls
    Private Sub AlertStringDownloaded(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
        If e.Cancelled = False AndAlso e.Error Is Nothing Then
            RichTextBox1.Text = CStr(e.Result)
            RichTextBox1.Refresh()
            busy = False
        End If
    End Sub

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        Button1.Enabled = False

        ListBox1.Items.Clear()
        RichTextBox1.Clear()

        Dim title As Regex = New Regex("<title>(.+?)</title>")
        Dim m As Match

        Dim webClient As New WebClient

        AddHandler webClient.DownloadStringCompleted, AddressOf AlertStringDownloaded

        Dim URLArray As New ArrayList
        URLArray.Add("http://www.google.com")
        URLArray.Add("http://www.yahoo.com")

        'Why delay between update of ListBox and RichTextBox?
        Dim URL As String
        For Each URL In URLArray
            ListBox1.Items.Add("Downloading " & URL)
            ListBox1.Refresh()

            busy = True
            webClient.DownloadStringAsync(New Uri(URL))

            'Better way to wait until downloaded?
            While busy
                Application.DoEvents()
            End While
        Next

        Button1.Enabled = True

    End Sub
End Class

Thank you.
 
Oh, I forgot: WebClient automatically raises the async events on the calling thread. You're right, the event handler should not be Shared in your case; I didn't notice you had that there.

You should remove the 'While busy' loop. The point of the async calls is that you don't have to wait for completion; the event notifies you. I think the problem you're trying to avoid is that one WebClient can service only one request at a time, so create one WebClient for each request in the loop instead of sharing a single instance.
VB.NET:
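For Each url As String In URLArray
    'One WebClient per request: a single instance can run only one async call at a time
    Dim client As New WebClient()
    AddHandler client.DownloadStringCompleted, AddressOf AlertStringDownloaded
    client.DownloadStringAsync(New Uri(url))
Next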
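
To make that concrete, here is a minimal console sketch, assuming .NET 4 for CountdownEvent. The module name, the URL list and OnDownloaded are made up for illustration. It passes each URL as the user token so the handler knows which download finished, and sets ServicePointManager.DefaultConnectionLimit, which caps simultaneous connections per host (2 by default on the .NET Framework; further requests are queued):

```vbnet
Imports System
Imports System.Net
Imports System.Threading

Module SpiderSketch
    ' Counts outstanding downloads so Main can wait for all of them.
    Private pending As CountdownEvent

    Sub Main()
        ' Hypothetical URL list; in the real script these would come from the database.
        Dim urls() As String = {"http://www.google.com", "http://www.yahoo.com"}

        ' Maximum simultaneous connections per host.
        ServicePointManager.DefaultConnectionLimit = 2

        pending = New CountdownEvent(urls.Length)

        For Each url As String In urls
            ' One WebClient per request.
            Dim client As New WebClient()
            AddHandler client.DownloadStringCompleted, AddressOf OnDownloaded
            ' The second argument comes back as e.UserState in the handler.
            client.DownloadStringAsync(New Uri(url), url)
        Next

        ' Block until every handler has signalled.
        pending.Wait()
    End Sub

    Private Sub OnDownloaded(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
        Dim url As String = CStr(e.UserState)
        If e.Cancelled Then
            Console.WriteLine("Cancelled: " & url)
        ElseIf e.Error IsNot Nothing Then
            Console.WriteLine("Failed: " & url & " (" & e.Error.Message & ")")
        Else
            Console.WriteLine(url & ": " & e.Result.Length.ToString() & " characters")
        End If
        DirectCast(sender, WebClient).Dispose()
        pending.Signal()
    End Sub
End Module
```

In a console program there is no UI thread, so the completed events arrive on thread-pool threads and blocking in Main is safe. Don't copy the Wait() call into a form: there the events are raised on the UI thread, and blocking it would keep them from ever being delivered.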
 
Thanks for the tip. What I'm really trying to avoid is freezing the UI, since a blocking call like downloading a web page on the UI thread makes the UI unusable. Using the async version of the WebClient calls makes for simpler code than using a BackgroundWorker object.

Someone recommended Andrew Troelsen's "Pro VB 2008 and the .NET 3.5 Platform (Windows.Net)". I'll go through this book so I stop asking newbie questions on VB.Net ;-)

Thanks again for the great help.
 
What I'm really trying to avoid is freezing the UI
The blocking loop you added is what freezes the UI; the DoEvents calls are what make the UI respond to events again. You need neither.
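
Here is a sketch of how the form could look without the loop. To keep it self-contained I'm assuming a form that builds its own controls in code rather than in the designer; DownloadForm, startButton, log and remaining are hypothetical names. The click handler just starts the downloads and returns, and the last completed event re-enables the button:

```vbnet
Imports System
Imports System.Net
Imports System.Windows.Forms

Public Class DownloadForm
    Inherits Form

    Private ReadOnly startButton As New Button()
    Private ReadOnly log As New ListBox()
    ' Downloads still in flight; only ever touched on the UI thread.
    Private remaining As Integer

    Public Sub New()
        startButton.Text = "Download"
        startButton.Dock = DockStyle.Top
        log.Dock = DockStyle.Fill
        Controls.Add(log)
        Controls.Add(startButton)
        AddHandler startButton.Click, AddressOf StartDownloads
    End Sub

    Private Sub StartDownloads(ByVal sender As Object, ByVal e As EventArgs)
        Dim urls() As String = {"http://www.google.com", "http://www.yahoo.com"}
        startButton.Enabled = False
        remaining = urls.Length

        For Each url As String In urls
            log.Items.Add("Downloading " & url)
            Dim client As New WebClient()
            AddHandler client.DownloadStringCompleted, AddressOf OnDownloaded
            client.DownloadStringAsync(New Uri(url), url)
        Next
        ' No wait loop and no DoEvents: the handler returns right away.
    End Sub

    Private Sub OnDownloaded(ByVal sender As Object, ByVal e As DownloadStringCompletedEventArgs)
        ' Raised on the UI thread because DownloadStringAsync was called from it,
        ' so the controls can be used directly.
        Dim url As String = CStr(e.UserState)
        If e.Cancelled OrElse e.Error IsNot Nothing Then
            log.Items.Add("Failed " & url)
        Else
            log.Items.Add("Done " & url & " (" & e.Result.Length.ToString() & " chars)")
        End If
        DirectCast(sender, WebClient).Dispose()

        remaining -= 1
        If remaining = 0 Then startButton.Enabled = True
    End Sub

    <STAThread()> _
    Public Shared Sub Main()
        Application.EnableVisualStyles()
        Application.Run(New DownloadForm())
    End Sub
End Class
```

The counter needs no locking here because every handler runs on the same UI thread.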
 
I meant that using synchronous/blocking routines, the UI freezes, so I had to look into using either a Timer, BackgroundWorker, or the async versions in case they were available. DownloadStringAsync() is perfect for what I need. Thank you.
 