Insane Execution Times...MySQL Related?

Johnny Dubs

Member
Joined
Jan 17, 2009
Messages
7
Programming Experience
Beginner
Hi there,

Currently I'm trying to de-duplicate the contents of about half a million documents. An MD5 hash value for each document is stored in a MySQL database in a field named 'md5'.

What I've tried doing is selecting the first document in the database, reading its MD5 value, comparing it to all other MD5 values in the database, and then storing the ID of each document that matches in a field named 'dupliace_id'.

Each time a document is found with a matching md5 value, the 'duplicate_id' field of the document being used is read, the id of the duplicate document is added to that, and then the duplicate_id field of the document being used is re-updated.

My first question is whether you can run an update command in MySQL that just adds, rather than overwrites? That way, each time a matching MD5 value is found I wouldn't have to select, add to and then update the duplicate_id field each time.

The pseudo code for what I've implemented is:

VB.NET:
for i = first id to the last id of each file {

if the id isn't equal to the variable storing ignored ids

select the MD5 value for that hash where id = i



for j = first id to the last id of each file {

providing j isn't equal to i

select the MD5 value for that hash where id = j

if the j md5 = the i md5

select the duplicate_id from i

add j to duplicate_id variable

update the duplicate_id with the duplicate_id variable

add j to the variable storing ignored ids

}

}


I hope that makes sense.

Anyway, currently that's executes unbelievably slowly. I'd say it only does at most 10 different duplicate updates per hour for the first ten ids in the database (of half a million!) Admittedly that's still 5 million operations...so maybe there really is nothing I can do...but can anyone think of a way to improve this or speed it up???


Thanks for any help!
 
I might be able to help, but pseudo codes never been my thing. Reminds me of word math problems that only complicate a simple formula. Could you provide sample data and table design that has the problem and then show me what you want it to look like at the end of the operation?

Just make up the data if your worried about sharing top secret totally classified data on some deep dark project involving pancake sacrifices to a god named raph. Though I would understand it regardless.

Also, are you using direct SQL or some programing language besides SQL?
 
Back
Top