Alternative to dataset for large data imports

mikethecoder

I have plenty of programming experience, but unfortunately not as much with .NET, and a project has been dropped on me at the last minute. I can code in the language, but this issue deals with performance and choosing the best solution... and not knowing the language inside and out, I don't know which approach is best.

We have a program that runs 24/7 running imports. It creates a new importer object which takes a tab-separated text file from a UNC path on our server and uses it to build a dataset representing the data in the text file. The text file data is processed into one dataset per table (4 tables), and we then run the .Update method for each dataset to insert the data into an Access .mdb file. So the process is essentially to update an existing Access database file based on changes in a tab-separated .txt file.

The problem is that these imports run very frequently, and recently we got someone with a much larger database than ever before. The datasets have been shown to be the cause when running the update: they use too much server memory and time out other imports. I was told to come up with an alternative to using datasets.

Any ideas what the best method would be, performance-wise, to avoid storing the data from the .txt file in datasets and updating the datasets back to the .mdb file? Since we need to keep server memory use down, should the solution just be to run individual SQL INSERT commands against the database, one per record?

The average import, across all 4 tables together, processes about 60,000 records. Thanks!!!!!!
 
Should the solution, since we need to avoid using any server memory, just be to run multiple SQL insert commands against it for each record?
Yes. That's all the dataset route is going to do anyway; the TableAdapter (DataAdapter) will call INSERT for every row it encounters that is new. You can leave out the dataset level of indirection and just use the TableAdapter's .Insert() method, or write a parameterized query yourself and set the parameters on every pass of the loop.
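For example, a minimal sketch of the hand-rolled parameterized loop (the Options table, its two columns, and the tab layout are placeholders, not your actual schema):

Imports System.Data.OleDb
Imports System.IO

Module ImportSketch
    ' Reads the tab-separated file one line at a time and fires a parameterized
    ' INSERT per line. Nothing is held in memory beyond the current line.
    Sub ImportOptions(ByVal txtPath As String, ByVal connString As String)
        Using conn As New OleDbConnection(connString)
            conn.Open()
            ' OleDb uses positional ? markers; parameters are matched by order.
            Using cmd As New OleDbCommand("INSERT INTO Options (Code, Description) VALUES (?, ?)", conn)
                cmd.Parameters.Add("@Code", OleDbType.VarChar, 50)
                cmd.Parameters.Add("@Description", OleDbType.VarChar, 255)
                Using reader As New StreamReader(txtPath)
                    Dim line As String = reader.ReadLine()
                    Do While line IsNot Nothing
                        Dim fields() As String = line.Split(ControlChars.Tab)
                        cmd.Parameters(0).Value = fields(0)
                        cmd.Parameters(1).Value = fields(1)
                        cmd.ExecuteNonQuery()
                        line = reader.ReadLine()
                    Loop
                End Using
            End Using
        End Using
    End Sub
End Module

One command object, prepared once, with only the parameter values changing per row.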

Some observations:

You're using a crappy database technology for the kind of load you describe putting on it. Using a real DB (I use Oracle) and bulk transfer methods, you should see a load rate of around 3,000 records per second.

60,000 records shouldn't be too much of a problem for a dataset; are you running with constraints or keys enabled?
 
Found the problem

Unfortunately Access is used for a reason... it won't be for long, though. I absolutely hate it, of course.

The issue is that some of these tables potentially have 120,000 records or more. I thought the problem was the update... but we only update the affected records out of those 120K, which is maybe 800 at most. The real issue, I think (correct me if I'm wrong), is filling the dataset.

We are using queries like:
daOptions = New OleDbDataAdapter("Select * from Options", connString)
daOptions.MissingSchemaAction = MissingSchemaAction.AddWithKey
daOptions.Fill(dsOptions, "Options")

When filling the dataset we use a query that gets every record, so we have to temporarily store thousands of records in memory. Sorry, I am not too familiar with VB.NET, but is there a way to:

1. Create a blank dataset, add records to it, and append them to the table; or
2. Somehow change the SQL query above so the dataset fills without pulling every record, letting me keep the rest of the code as is without losing any records. So I could do something like:
daOptions = New OleDbDataAdapter("Select top 1 * from Options", connString)
but still have the update leave the 120K records I never collected untouched.
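Something like this is roughly what I'm picturing, though I'm only guessing at the syntax and the column names are made up:

' Fill only the schema (no rows), add the new records, then let the adapter
' insert just those. The existing 120K rows never get pulled into memory.
daOptions = New OleDbDataAdapter("Select * from Options where 1 = 0", connString)
Dim builder As New OleDbCommandBuilder(daOptions)   ' builds the INSERT command for me
daOptions.FillSchema(dsOptions, SchemaType.Source, "Options")

Dim newRow As DataRow = dsOptions.Tables("Options").NewRow()
newRow("Code") = "ABC"              ' made-up column values
newRow("Description") = "whatever came out of the text file"
dsOptions.Tables("Options").Rows.Add(newRow)

' Update only issues INSERTs for the Added rows; nothing else is touched.
daOptions.Update(dsOptions, "Options")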

Ideas? Thanks very much!!
 
Hang on... are you telling me that you:

Use a data adapter to download your entire database into a datatable within a dataset
Update some of the records
Persist the changes back to the DB via DataAdapter.Update


Eek. That's pretty much the worst way you can do your data access, and I can't believe you'd adopt it. So much so that I'd like you to clarify exactly what it is you do, including the purpose of all the databases, how they are updated, etc.
 
Yeahhhhh

Yeah, it's one of those legacy scripts that works, so no one complains. But recently a few things broke it... mainly much larger databases turning up and causing memory problems from, well, the obvious.

Knowing that's what it does now, what do you think is the most efficient way to just create a bunch of rows and update, without doing the ridiculous dump of the whole table into a dataset?
 
Read the file line by line.
For each line, fire off an INSERT or an UPDATE statement, depending on whether the line is to be inserted or updated.

The data stays in the DB where it belongs and is amended by an external process. This is very different from downloading the entire DB into the external process, making changes, and sending the results back again.
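e.g. something like this inside the per-line loop (a sketch only; the column names are made up, updateCmd is "UPDATE Options SET Description = ? WHERE Code = ?" and insertCmd is the matching INSERT, both prepared once before the loop):

' If the file doesn't say which rows are new, try the UPDATE first and
' fall back to an INSERT when nothing was affected.
Dim fields() As String = line.Split(ControlChars.Tab)
updateCmd.Parameters(0).Value = fields(1)   ' SET column
updateCmd.Parameters(1).Value = fields(0)   ' WHERE key
If updateCmd.ExecuteNonQuery() = 0 Then
    ' no existing row matched the key, so this line is an insert
    insertCmd.Parameters(0).Value = fields(0)
    insertCmd.Parameters(1).Value = fields(1)
    insertCmd.ExecuteNonQuery()
End If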
 
Read the file line by line.
For each line, fire off an INSERT or an UPDATE statement, depending on whether the line is to be inserted or updated.

The data stays in the DB where it belongs and is amended by an external process. This is very different from downloading the entire DB into the external process, making changes, and sending the results back again.

Can you imagine the system resources utilized when you read line by line? And writing each time... so on average 6,000 writes... that will be too much for the server, I guess. I guess the dataset solution is the best, as far as I can see :)
 
Can you imagine the system resources utilized when you read line by line?

Minimal. Read several lines if you care, but somewhere along the pipe from drive to memory you'll find a cache. Reading a file line by line from disk is a very fast and trivial op. Try it for yourself. Line by line also necessarily uses less memory because you're only reading and keeping one line at a time.

Be careful, when presenting arguments, to use wording that is sensible and understandable. Line by line is a disk hit; whole file is a memory hit. Both use system resources. Reading a 1 GB file into memory when you only have 512 MB of RAM is clearly a terrible idea, with overall massively higher resource usage due to the paging.

so on average 6,000 writes... that will be too much for the server, I guess
You guess? So you don't actually know what you're talking about?

I guess the dataset solution is the best, as far as I can see :)

I "guess" ;) you should investigate how DataTable/DataAdapter works. You may be surprised at how ill-informed your comment is.


If you're after the fastest way to load info into or out of a DB, you take a look at the relevant documentation. In Oracle I would use bulk array transfer mode, though this wasn't available on a server I recently worked on, so I implemented an improvement that mimicked the bulk transfer by sending 80 records of 50 characters' width to the server at a time in a single VARCHAR2(4000). Dataset performance was 600 lines/second. The new method, using a direct stored procedure (which does nothing other than break apart the 4000-character string and call INSERT 80 times), gets around 8,000 lines per second.

While that's not as fast as Oracle's properly designed bulk transfer at 10,000 lines/second (doing 500 records at a time, where I'm limited to 80; that 10,000 is itself an improvement over the previous 3,000, since the source PC is a proper dual-CPU server on gigabit Ethernet rather than a Centrino laptop on 100 Mbit), it's close enough not to warrant installing ODP.NET on a server that otherwise has no need for it.
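For the curious, the client side of that trick looks roughly like this; the LOAD_CHUNK procedure name, the 50-character layout, and the use of the old System.Data.OracleClient provider are just for illustration:

Imports System.Collections.Generic
Imports System.Data
Imports System.Data.OracleClient
Imports System.Text

    ' Packs 80 fixed-width records into one VARCHAR2(4000) and hands it to a
    ' stored procedure, which splits it apart and INSERTs server-side.
    Sub SendChunks(ByVal records As IList(Of String), ByVal conn As OracleConnection)
        Using cmd As New OracleCommand("LOAD_CHUNK", conn)
            cmd.CommandType = CommandType.StoredProcedure
            Dim p As OracleParameter = cmd.Parameters.Add("p_data", OracleType.VarChar, 4000)
            Dim sb As New StringBuilder(4000)
            Dim inChunk As Integer = 0
            For Each rec As String In records
                sb.Append(rec.PadRight(50).Substring(0, 50))   ' fixed 50-char slots
                inChunk += 1
                If inChunk = 80 Then                           ' 80 x 50 = 4000 characters
                    p.Value = sb.ToString()
                    cmd.ExecuteNonQuery()
                    sb.Length = 0
                    inChunk = 0
                End If
            Next
            If inChunk > 0 Then   ' final short chunk; the procedure must cope with it
                p.Value = sb.ToString()
                cmd.ExecuteNonQuery()
            End If
        End Using
    End Sub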
 
Mike, what is the reason for filling the dataset with records from the database before the import? Are you trying to compare records to see if they already exist before inserting new ones?

Also, this doesn't have anything to do with eliminating your speed issue for filling the dataset, but you may want to take a look at the SqlDataAdapter.UpdateBatchSize property. This can increase application performance by reducing the number of round-trips to the server during the update itself.
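For example (SQL Server only, so it won't help against the .mdb directly; conn and importTable are assumed to already exist, the table/columns are placeholders, and this needs Imports System.Data.SqlClient):

' Batch the inserts instead of making one round-trip per row.
Dim da As New SqlDataAdapter()
da.InsertCommand = New SqlCommand("INSERT INTO Options (Code, Description) VALUES (@Code, @Description)", conn)
da.InsertCommand.Parameters.Add("@Code", SqlDbType.VarChar, 50, "Code")
da.InsertCommand.Parameters.Add("@Description", SqlDbType.VarChar, 255, "Description")
da.InsertCommand.UpdatedRowSource = UpdateRowSource.None   ' required for batching
da.UpdateBatchSize = 500     ' 0 = send as many per round-trip as the server will take
da.Update(importTable)       ' importTable contains only the new (Added) rows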
 
cjard is on target, and you really need to move to a serious database ASAP. Either Oracle or SQL Server can do bulk transfers. I archive about 120,000 records per hour from an AS400 to SQL Server (2 tables), and it takes about 15 seconds every 5 minutes to do this, with much of that time taken by the AS400 response.

Also, to help clarify, you don't need to hold the file output in a dataset or datatable. Just build your insert string directly with a StringBuilder and run your command. Line by line is the best you are going to be able to do with Access.
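Something like this is all I mean, inside the loop that reads the file (column names are made up; fields and conn come from the surrounding code, and you'd want to escape quotes in the values or just use parameters):

Dim sb As New StringBuilder()
sb.Append("INSERT INTO Options (Code, Description) VALUES ('")
sb.Append(fields(0).Replace("'", "''"))   ' double up quotes so the literal stays valid
sb.Append("', '")
sb.Append(fields(1).Replace("'", "''"))
sb.Append("')")
Using cmd As New OleDbCommand(sb.ToString(), conn)
    cmd.ExecuteNonQuery()
End Using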

Letting a DataAdapter do the updates is one of the worst ways to do it.

You might also want to run a thread to read the file while doing the insert to gain a little throughput.
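A rough sketch of that idea (InsertLine is a stand-in for whatever insert logic you end up with, and a production version would want proper signalling rather than a plain Boolean flag):

Imports System.Collections.Generic
Imports System.IO
Imports System.Threading

Module ThreadedImportSketch
    Private pending As New Queue(Of String)()
    Private finishedReading As Boolean = False
    Private sourcePath As String

    ' Producer: runs on its own thread, pulling lines off the disk into the queue.
    Private Sub ReadLines()
        Using sr As New StreamReader(sourcePath)
            Dim line As String = sr.ReadLine()
            Do While line IsNot Nothing
                SyncLock pending
                    pending.Enqueue(line)
                End SyncLock
                line = sr.ReadLine()
            Loop
        End Using
        finishedReading = True
    End Sub

    ' Consumer: the main thread dequeues lines and does the inserts.
    Sub RunImport(ByVal path As String)
        sourcePath = path
        Dim readerThread As New Thread(AddressOf ReadLines)
        readerThread.Start()
        Do
            Dim line As String = Nothing
            SyncLock pending
                If pending.Count > 0 Then line = pending.Dequeue()
            End SyncLock
            If line IsNot Nothing Then
                InsertLine(line)
            ElseIf finishedReading Then
                SyncLock pending
                    If pending.Count = 0 Then Exit Do   ' reader done and queue drained
                End SyncLock
            Else
                Thread.Sleep(1)   ' give the reader a moment to catch up
            End If
        Loop
    End Sub

    Private Sub InsertLine(ByVal line As String)
        ' placeholder: parse the tab-separated line and run the insert/update here
    End Sub
End Module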

Oh Yeah, a great resource here:
ConnectionStrings.com - How to connect to Textfile


I haven't used Access in 8 or 9 years, but can't you just do this directly with an import macro if the files are well-formed?
 
First, I'm still curious to see why Mike was filling the dataset to start with; it's not needed to do an update, with or without a DataAdapter.

Hi Steve:

Perhaps I'm misunderstanding your post, but I would have to disagree regarding the use of a DataAdapter for inserts. I'm not sure if there is a difference with an Access DB; I'm speaking solely of my use of it with SQL Server.

When it comes down to it, both ways are doing thousands of single-record inserts, but by setting the DataAdapter's UpdateBatchSize you can control the number of rows that are processed in each round-trip to the server.

This can be set to X inserts (e.g. 500, 1,000, or the total record count), or set to 0 to automatically send as many inserts as the server can handle at a time.

If you and/or cjard would like, we could agree upon a neutral database such as Northwind or Pubs and a shared XML import file (or something similar) with dummy records, and run some speed-test comparisons using our different methods.

Just to add, this is a friendly offer; I have no problem being proven wrong and would always be happy to find better and more efficient methods of programming. :)
 
Tom,
Much of what I see in the questions being asked is because people just don't know enough SQL to make full use of its power. DataAdapters are fine when things are simple and the coder doesn't need to know a lot about SQL, but my world is filled with many complex databases serving enterprise requirements that DataAdapters are inadequate for handling.

To quote MSDN:
------------------------
.NET Framework Class Library
DbDataAdapter.Update Method

Updated: November 2007

Calls the respective INSERT, UPDATE, or DELETE statements for each inserted, updated, or deleted row in the DataSet.
------------------------------

My emphasis on the each. I would agree with you if you are modifying a large number of rows at one time, but then I would still probably use SqlBulkCopy. Generally, when editing, we want our changes to the DB to commit as soon as possible. The DataAdapter's methods just aren't as smart, in that it must check for changes, build the query, then run it, and can't use stored procedures. With a command object in code in a data access layer, you can just send your changes as SqlParameters to stored procedures and get the best of both worlds.
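A bare-bones sketch of that pattern; the usp_Options_Update procedure and its parameters are made up:

Imports System.Data
Imports System.Data.SqlClient

Public Class OptionsDal
    Private ReadOnly _connString As String

    Public Sub New(ByVal connString As String)
        _connString = connString
    End Sub

    ' Sends one change straight to a stored procedure; nothing is cached client-side.
    Public Sub UpdateOption(ByVal code As String, ByVal description As String)
        Using conn As New SqlConnection(_connString)
            Using cmd As New SqlCommand("usp_Options_Update", conn)
                cmd.CommandType = CommandType.StoredProcedure
                cmd.Parameters.AddWithValue("@Code", code)
                cmd.Parameters.AddWithValue("@Description", description)
                conn.Open()
                cmd.ExecuteNonQuery()
            End Using
        End Using
    End Sub
End Class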

It's probably six of one and half a dozen of the other for speed in most cases where a DataAdapter will work. I just quit using them back in .NET 1.1 because they suck when you have normalized, relational databases and employ joins frequently. Probably 90% of my SELECTs use joins, so a DataAdapter can't do the update. Instead I use a reusable DAL class for the general functions and build a second one for the sprocs used by the app.

I find SqlBulkCopy useful for archiving between different databases (e.g. DB2 to SQL Server), and it would serve well for Mike's app if he is just doing inserts.
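For the insert-only case it's only a few lines; the destination table name is a placeholder, and of course this only applies once the data lives in SQL Server rather than the .mdb:

Imports System.Data
Imports System.Data.SqlClient

Sub BulkLoad(ByVal importTable As DataTable, ByVal connString As String)
    Using bulk As New SqlBulkCopy(connString)
        bulk.DestinationTableName = "Options"
        bulk.BatchSize = 5000             ' rows per round-trip
        bulk.WriteToServer(importTable)   ' importTable holds only the rows parsed from the file
    End Using
End Sub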
 
I hope you don't think I'm being argumentative; all my existing data imports are actually built with methods similar to what you describe above and currently don't use an adapter. I also have no trouble understanding SQL.

Recently, experimenting with the UpdateBatchSize property, my limited testing of it so far does show it to be faster and more efficient.

I'm not sure what you're referring to in saying a DataAdapter cannot use stored procedures.

To quote MSDN:
SqlDataAdapter.UpdateBatchSize Property (new in the .NET Framework version 2.0):
Gets or sets the number of rows that are processed in each round-trip to the server. SqlDataAdapter.UpdateBatchSize Property (System.Data.SqlClient)

I do agree with SqlBulkCopy for transferring between two databases; I'm referring here to importing from files such as XML, delimited, fixed-length, etc.

On a personal note, I have a flight to catch soon and might not be back online for a day or two.

Again, if you're willing, we can run some friendly comparison tests together and compare the results.

Have a good weekend Steve.
 
No problem, I can tell you know plenty. I just think new coders should be encouraged to explore all methods.

Can't use stored procedures if you update via the adapter, can you? It's really been a long time since I've used an adapter - 2003, I think.

I'm out for a long weekend too!
 