Question: Large database approach

BryanBentz

New member
Joined
Mar 12, 2012
Messages
1
Programming Experience
10+
I'm faced with a (fun, actually) data mining problem: I have raw ASCII files from instruments, and I want to move that data (~400 GB) into a database, then be able to run various algorithms - determining correlations between time series, etc. I would like to write the mining algorithms in Visual Basic (.NET, VS 2010 right now), and be able to do visualizations with VB code I have in hand.
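
For concreteness, the kind of per-pair math I mean is along the lines of Pearson correlation over two aligned series - a rough sketch only (the function and names are just mine, and it assumes the two vectors are already aligned with no holes):

Module SeriesMath
    ' Pearson correlation of two aligned series.
    ' Assumes x and y have the same length and no missing points.
    Function PearsonCorrelation(x As Double(), y As Double()) As Double
        Dim n As Integer = x.Length
        Dim meanX As Double = 0.0, meanY As Double = 0.0
        For i As Integer = 0 To n - 1
            meanX += x(i)
            meanY += y(i)
        Next
        meanX /= n
        meanY /= n
        Dim cov As Double = 0.0, varX As Double = 0.0, varY As Double = 0.0
        For i As Integer = 0 To n - 1
            Dim dx As Double = x(i) - meanX
            Dim dy As Double = y(i) - meanY
            cov += dx * dy
            varX += dx * dx
            varY += dy * dy
        Next
        Return cov / Math.Sqrt(varX * varY)
    End Function
End Module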

On the nature of the data: think of a set of several thousand devices, each recording a measurement at a given interval - so I'm talking about time-series vectors. It's no more complex than that, though I may have vectors with holes, etc. - I'm not sure what problems of that sort lurk in the data.
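
To make the shape of a single point concrete, it's roughly this (a sketch; the names are just mine, with a nullable value where a series has a hole):

' One reading; Value is Nothing where the series has a hole.
Structure Reading
    Public DeviceId As Integer
    Public Stamp As DateTime
    Public Value As Double?    ' Nullable(Of Double) for missing points
End Structure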

I spent today re-acquainting myself with VB.NET's interface to (in one case) a Microsoft Access database. What used to be fairly simple - DAO, I think it was - involved tables, recordsets, etc. (and that would likely be fine). Now I seem to be required to use a weird variety of generally useless objects, e.g. 'adapters', 'datasets', etc. The problem is that I know exactly what I need, and all this extraneous stuff just gets in the way (certainly in coding complexity and opaqueness, and likely in efficiency as well). If any of these mechanisms gave me a kind of virtual access to the entire dataset and let me control caching parameters, etc., it might be great, but I found nothing along those lines. It seems like useless bloat, though I suppose it must be useful to someone.
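
For reference, the most direct route I've found in ADO.NET is to skip the adapters/datasets entirely and just stream rows with a DataReader. A minimal sketch against an Access file - the connection string, table, and column names are placeholders, not my real schema:

Imports System.Data.OleDb

Module DirectRead
    Sub StreamReadings()
        ' Placeholder connection string for an Access (.accdb) file.
        Dim connStr As String =
            "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\readings.accdb"
        Using conn As New OleDbConnection(connStr)
            conn.Open()
            Using cmd As New OleDbCommand(
                "SELECT DeviceId, Stamp, [Value] FROM Readings ORDER BY DeviceId, Stamp", conn)
                Using reader As OleDbDataReader = cmd.ExecuteReader()
                    While reader.Read()
                        Dim deviceId As Integer = reader.GetInt32(0)
                        Dim stamp As DateTime = reader.GetDateTime(1)
                        ' Holes come back as DBNull; check before reading.
                        If Not reader.IsDBNull(2) Then
                            Dim value As Double = reader.GetDouble(2)
                            ' ... feed (deviceId, stamp, value) into the math here ...
                        End If
                    End While
                End Using
            End Using
        End Using
    End Sub
End Module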

Anyway, I tried a number of different approaches, and none seemed at all aimed at what I need to do: efficiently do math on a large dataset. I can't believe I'm the first to have this problem, but I can find no useful wisdom out there. I'd be comfortable with pretty much any underlying database mechanism - MySQL, SQL Server, MS Access - ideally something generally SQL-based (I may eventually have to transition this entire system to draw from a client's SQL database, though that's not an overriding concern now). Other than that I want simplicity and efficiency. I thought my old ODBC techniques would work, and to some extent they do, though modifying tables had bizarre problems (no errors, but no modifications either).
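
On the 'no errors, no modifications' problem: a sketch of what I should probably be checking - ExecuteNonQuery returns the number of rows affected, so an UPDATE that silently matches nothing at least becomes visible. (Names are placeholders again; I gather one classic Access gotcha is that Visual Studio can copy the database file into bin\Debug, so the update lands in a copy you're not looking at.)

Imports System.Data.OleDb

Module UpdateCheck
    Sub ClearReading(deviceId As Integer, stamp As DateTime)
        Dim connStr As String =
            "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\readings.accdb"
        Using conn As New OleDbConnection(connStr)
            conn.Open()
            Using cmd As New OleDbCommand(
                "UPDATE Readings SET [Value] = NULL WHERE DeviceId = ? AND Stamp = ?", conn)
                ' OLE DB parameters are positional; order must match the question marks.
                cmd.Parameters.Add("@deviceId", OleDbType.Integer).Value = deviceId
                cmd.Parameters.Add("@stamp", OleDbType.Date).Value = stamp
                Dim rows As Integer = cmd.ExecuteNonQuery()
                If rows = 0 Then
                    Console.WriteLine("Update matched no rows - wrong key, or wrong copy of the file?")
                End If
            End Using
        End Using
    End Sub
End Module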

I do have a fairly aggressive deadline to show some algorithm results, so my focus is to get something reasonable working *in* the short term - in other words, it's less important to me to pick the 'fastest' relational database than to pick one that lets me focus on coding the algorithms rather than working through tedious data-access code. If the dataset were smaller, I'd have tried to do it all in memory, at least for proving concepts; I don't want to have to learn an entire jargon and approach just to be able to retrieve data points.

Perhaps I'll need to bite the bullet and just write something myself - a .dll, perhaps, just to save and restore large time-series vectors. It seems a bit frightening that one would have to do this in this day and age, with all the database systems out there, but I don't have much time to work through arcane interface logic.
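
If I do go that route, the save/restore part at least isn't much code - a rough sketch with BinaryWriter/BinaryReader, one flat file per device, with NaN standing in for holes (the format is obviously ad hoc):

Imports System.IO

Module SeriesStore
    ' Ad hoc flat format: a count, then (ticks, value) pairs.
    ' Double.NaN stands in for a hole in the series.
    Sub SaveSeries(path As String, stamps As DateTime(), values As Double())
        Using w As New BinaryWriter(File.Create(path))
            w.Write(values.Length)
            For i As Integer = 0 To values.Length - 1
                w.Write(stamps(i).Ticks)
                w.Write(values(i))
            Next
        End Using
    End Sub

    Sub LoadSeries(path As String, ByRef stamps As DateTime(), ByRef values As Double())
        Using r As New BinaryReader(File.OpenRead(path))
            Dim n As Integer = r.ReadInt32()
            ReDim stamps(n - 1)
            ReDim values(n - 1)
            For i As Integer = 0 To n - 1
                stamps(i) = New DateTime(r.ReadInt64())
                values(i) = r.ReadDouble()
            Next
        End Using
    End Sub
End Module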

I apologize for the generally frustrated tone of the above; if anyone has any suggestions I'd be very interested in hearing them.
 
What you're describing sounds like a good fit for Business Intelligence (BI) systems - apologies if you're aware of that, but I don't see any mention of it in your post.

MS SQL Server has its own implementation of BI in SQL Server Analysis Services (SSAS). Developing and building within SSAS is done through a set of custom Visual Studio templates known as the Business Intelligence Development Studio (BIDS). I'm pretty sure, though, that the Visual Studio version is dependent on the version of SQL Server you use - so SQL Server 2008 installs VS 2008 with the BIDS templates.

There are a number of BI alternatives out there though.
 