Dictionary, Mutithreading, Resource Locking General Question (Long)

christuckeruf

New member
Joined
May 7, 2011
Messages
3
Programming Experience
5-10
I have a rather long question regarding the fastest and most accurate way to implement a solution to the following problem (I’ll explain my attempted solution below): I cannot describe the actual data (for business security reasons) so I will use an accurate example. Let’s say that an auto insurance company is passing all their claims for the previous month to us. Our job is to grab the data and transform it. This allows us to send the final form of the data to the warehouse so it can be efficiently searched in extremely fast ways. Each company has different ways of providing data and almost everyone has different formats. I have a stream of data that is being passed to my VB.NET code which must be manipulated before being directed to a destination SQL table. The source, for this example, is an SQL query pointed to an outside server. The data set is very large (20-200+ million records). I have to link more information to the data set by a unique key in the data. For instance, each “claim” will have a VIN number of the auto involved. I have a large lookup table (50 million records) that has more information about the VIN. For instance, it will have the manufacturer, the year, the model, etc. This information is missing from the source and must be joined with the incoming stream. The old solution was to throw the entire source into its own table and then join it with the lookup table by the key value. However, even on an 8 CPU server with 12 gigs of ram this join took hours (sometimes days with very large data sets) to complete. This was even with proper indexing and throwing the server into simple mode (in order to limit the transaction log). It necessitated extra storage and required projects to be queued on separate servers to maximize cpu cycles. As you can image, this bottleneck was frowned upon and we have been ordered to speed up turnaround time. My solution was to order the incoming data by the key ascending and creating a dictionary object in an SSIS package that holds the first 100,000 records of the lookup table and do a lookup by key. I did this to prevent the dictionary object from consuming too much RAM. The “item” in the dictionary object is an ArrayList that holds the extra values. Since both are ordered by key the first n number of incoming rows can be searched quickly. When an incoming key is greater than the maximum key in the current dictionary, the dictionary clears itself and grabs the next 100,000 records using the incoming key as the minimum value. This ensures that if there are key “gaps” in the source then the dictionary will not keep grabbing incorrect lookup data. For instance, if the source has values (1,1,1,1,2,2,2,2,6,6,6…) and the lookup table has (1,2,3,4,5,6….) the dictionary will not grab 3…then 4….then 5…etc. I have also created a last key value cache. Basically, if the new key equals the last searched key the last ArrayList is returned. This improves performance by only using the dictionary on new values. The problem is that when it “reloads” the dictionary there is a pause as the script component waits for the next 100,000 records. I had attempted to create an asynchronous thread to grab the next 100,000 records in a cache dictionary while the task was still processing the old dictionary. When the new incoming key didn’t match in the dictionary, it would check the cache dictionary and if there was a match it would clear the old dictionary and clone the cache. The problem is that with all the resource locking and thread management the cache idea took longer to finish than without it. My question is about improving my cache dictionary idea. Whenever it updates the dictionary with the cache dictionary it locks both the cache dictionary and the current dictionary. The update cache method locks the cache as it is being built. I have limited the number of asynchronous threads to one. I am using the monitor object to lock the dictionary objects. If I used a dummy synchronization object instead of locking the actual resource will it speed it up? Is using that idea good or should I stick with locking the actual objects themselves. Are there faster (and still accurate) ways of locking the resources? Also, I am cloning the dictionary by using a “for each” loop. Are there faster ways to do a deep clone of a dictionary? I have head of using an ICloneable dictionary type but I don’t know if that will speed things up all. What have your experience been? Lastly, the reason I gave such a thorough background is I wanted to know if there were faster ways of doing this. The old process was measured in hours. This new process is measured in minutes. However, I’m always looking for better implementations. Thanks for taking the time to read this message.
 
It posted without paragraph breaks and I couldn't find the edit button....

It didn't post itself. If it has no paragraphs then you didn't post it with any paragraphs. New members are not able to edit their posts. You must have at least a particular number of posts (I think 10, but not sure) before you can edit. You would repost with an explanation/apology. If you report your original post afterwards then the mods can delete it.
 
Back
Top