Johnny Dubs
Member
- Joined
- Jan 17, 2009
- Messages
- 7
- Programming Experience
- Beginner
Hi there,
Currently I'm trying to de-duplicate the contents of about half a million documents. An MD5 hash value for each document is stored in a MySQL database in a field named 'md5'.
What I've tried doing is selecting the first document in the database, reading its MD5 value, comparing it to all other MD5 values in the database, and then storing the ID of each document that matches in a field named 'dupliace_id'.
Each time a document is found with a matching md5 value, the 'duplicate_id' field of the document being used is read, the id of the duplicate document is added to that, and then the duplicate_id field of the document being used is re-updated.
My first question is whether you can run an update command in MySQL that just adds, rather than overwrites? That way, each time a matching MD5 value is found I wouldn't have to select, add to and then update the duplicate_id field each time.
The pseudo code for what I've implemented is:
I hope that makes sense.
Anyway, currently that's executes unbelievably slowly. I'd say it only does at most 10 different duplicate updates per hour for the first ten ids in the database (of half a million!) Admittedly that's still 5 million operations...so maybe there really is nothing I can do...but can anyone think of a way to improve this or speed it up???
Thanks for any help!
Currently I'm trying to de-duplicate the contents of about half a million documents. An MD5 hash value for each document is stored in a MySQL database in a field named 'md5'.
What I've tried doing is selecting the first document in the database, reading its MD5 value, comparing it to all other MD5 values in the database, and then storing the ID of each document that matches in a field named 'dupliace_id'.
Each time a document is found with a matching md5 value, the 'duplicate_id' field of the document being used is read, the id of the duplicate document is added to that, and then the duplicate_id field of the document being used is re-updated.
My first question is whether you can run an update command in MySQL that just adds, rather than overwrites? That way, each time a matching MD5 value is found I wouldn't have to select, add to and then update the duplicate_id field each time.
The pseudo code for what I've implemented is:
VB.NET:
for i = first id to the last id of each file {
if the id isn't equal to the variable storing ignored ids
select the MD5 value for that hash where id = i
for j = first id to the last id of each file {
providing j isn't equal to i
select the MD5 value for that hash where id = j
if the j md5 = the i md5
select the duplicate_id from i
add j to duplicate_id variable
update the duplicate_id with the duplicate_id variable
add j to the variable storing ignored ids
}
}
I hope that makes sense.
Anyway, currently that's executes unbelievably slowly. I'd say it only does at most 10 different duplicate updates per hour for the first ten ids in the database (of half a million!) Admittedly that's still 5 million operations...so maybe there really is nothing I can do...but can anyone think of a way to improve this or speed it up???
Thanks for any help!