Word frequencies in many documents

lep

New member
Joined
Jan 8, 2007
Messages
1
Programming Experience
10+
I have many documents (basically short abstracts of roughly 10 sentences each). I first need to identify all the unique words found across all the documents, and then determine the frequency of each word in each document.

My plan for approaching this problem is:

1. Assign all the words in a document to a single string.
2. Split the string on whitespace with the split command, then strip special characters (punctuation, for example) from the resulting words with regular expressions.
3. Insert the words from all documents (after splitting) into a column of a SQL Server or Access table.
4. Use qry = "SELECT DISTINCT wordfield FROM mywordtable" to get a data reader containing the unique words.
5. Last, count the number of times each word is found in each document.
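
To make the steps concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions: it works in memory with a set and per-document counters instead of a database table, and the sample documents and the names `tokenize` and `word_frequencies` are made up for the example.

```python
import re
from collections import Counter

# Hypothetical sample data; real input would be the abstracts themselves.
documents = {
    "doc1": "The cat sat on the mat. The mat was red.",
    "doc2": "A dog sat on the log; the dog barked!",
}

# Assumption: a "word" is a run of letters, digits, or apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and return its words, dropping punctuation."""
    return WORD_RE.findall(text.lower())


def word_frequencies(docs):
    """Return (sorted vocabulary, {doc_id: Counter of word counts})."""
    # Step 5: one Counter per document holds that document's word counts.
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    # Step 4: the unique words are the union of the words of every document.
    vocabulary = sorted(set().union(*counts.values()))
    return vocabulary, counts


vocabulary, counts = word_frequencies(documents)
for doc_id, counter in counts.items():
    # Show only the words that actually occur in this document.
    print(doc_id, [(word, counter[word]) for word in vocabulary if counter[word]])
```

A Counter returns 0 for a word it has never seen, so every document can be looked up against the full vocabulary without special cases.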


The tricky part is how to do this with MS Analysis Services in the SQL engine, since I think that would be the fastest way. Is that right?

lep
 