Compare two text files, get differences and missing lines

team929

Objective
I have two text files (actually XML files, but I'll get to that in a minute). I want to compare them and get the differences: both lines that differ between the files and lines that are missing from one of them.

Input
file1:
<node>
<ele0 attr="0"/>
<ele1 attr="1"/>
<ele2 attr="2"/>
</node>

file2:
<node>
<ele0 attr="0"/>
<ele2 attr="2a"/>
<ele3 attr="3"/>
</node>

Output
This could be anything: another text (XML) file, kept in memory, output to a control, etc. To make it more concrete, I'm looking for something like:
File1|File2
<ele1 attr="1"/>|
<ele2 attr="2"/>|<ele2 attr="2a"/>
|<ele3 attr="3"/>
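
To pin down the output format above, here is a minimal sketch of one way to produce it. It assumes Python and its standard difflib module, neither of which is a requirement of mine, and the function names `side_by_side` and `_pair_block` and the 0.75 similarity cutoff are made up for illustration. Within each changed block it pairs the most similar lines, so a modified line appears next to its counterpart and lines without a counterpart get an empty cell.

```python
import difflib
from typing import Iterator, List, Tuple

Row = Tuple[str, str]  # (line from file1, line from file2); "" means missing


def _pair_block(old: List[str], new: List[str], cutoff: float = 0.75) -> Iterator[Row]:
    """Align one changed block by pairing the most similar lines first."""
    best_ratio, best_i, best_j = cutoff, -1, -1
    for i, left in enumerate(old):
        for j, right in enumerate(new):
            ratio = difflib.SequenceMatcher(None, left, right, autojunk=False).ratio()
            if ratio > best_ratio:
                best_ratio, best_i, best_j = ratio, i, j
    if best_i < 0:
        # No pair is similar enough: report removals, then additions.
        for left in old:
            yield (left, "")
        for right in new:
            yield ("", right)
        return
    # Keep line order: handle what precedes the best pair, the pair, then the rest.
    yield from _pair_block(old[:best_i], new[:best_j], cutoff)
    yield (old[best_i], new[best_j])
    yield from _pair_block(old[best_i + 1:], new[best_j + 1:], cutoff)


def side_by_side(old: List[str], new: List[str]) -> List[Row]:
    """Return only the rows that differ between the two line lists."""
    rows: List[Row] = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            rows.extend(_pair_block(old[i1:i2], new[j1:j2]))
    return rows
```

Joining each row with `"|".join(row)` for the two sample files gives the three lines shown above. The pairing step compares every line in a changed block against every other, so it is only a sketch for small blocks, not a tuned implementation.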

Details
I have an existing process that produces XML output files, and I am replacing it with a new one. As a test, I'll run thousands of old input files through both processes and compare each pair of outputs. I know I can compare a single pair with existing off-the-shelf programs, but those are limited to manually comparing one pair of files at a time and do not offer batch processing (or is there one I don't know about?).
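
The batch step I have in mind could look roughly like the sketch below. It assumes Python, and it assumes the old and new processes write their outputs to two directories using the same file names; the function name `classify_outputs` and the `*.xml` pattern are hypothetical. It uses the standard filecmp module to do a byte-for-byte check, so only pairs that really differ need a line-level comparison afterwards.

```python
import filecmp
from pathlib import Path
from typing import Dict, List


def classify_outputs(old_dir: str, new_dir: str, pattern: str = "*.xml") -> Dict[str, List[str]]:
    """Pair files by name and sort them into identical / different / unpaired."""
    old_names = {p.name for p in Path(old_dir).glob(pattern)}
    new_names = {p.name for p in Path(new_dir).glob(pattern)}
    result: Dict[str, List[str]] = {
        "identical": [],
        "different": [],
        "only_old": sorted(old_names - new_names),  # new process produced no file
        "only_new": sorted(new_names - old_names),  # old process produced no file
    }
    for name in sorted(old_names & new_names):
        # shallow=False forces a comparison of the file contents,
        # not just of size and modification time.
        same = filecmp.cmp(Path(old_dir) / name, Path(new_dir) / name, shallow=False)
        result["identical" if same else "different"].append(name)
    return result
```

Only the names under `"different"` would then go through the line-by-line comparison, which should cut the work down a lot if most outputs match.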

I mentioned that these are XML files, but I'd like to process them as text. XML files are, technically, text files in XML format, and processing them as XML (XPath queries, etc.) gives me no added value. It can also cause confusion when more than one element or node has the same name, or when the elements are in a different order.

What I've tried already
I've tried the XML method but, again, ran into issues: differences in element ordering were not picked up properly, and multiple elements with the same name caused confusion. I've also tried using arrays (lists, etc.), but they become far too cumbersome. I have scalability concerns about processing thousands of files with multiple arrays holding hundreds of thousands of lines in memory. Even if I discard each set of arrays after each file, there has to be a better way than loading a pair of 100,000-line files into memory and comparing them line by line.
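
For reference, the "one pair at a time" array approach I describe boils down to something like this sketch. It is written in Python with the standard difflib module purely as an assumption for illustration, and `write_diff_report` and its parameters are hypothetical names. Each pair is read, compared, written to a report file and then dropped, so only two files' worth of lines is held in memory at any moment.

```python
import difflib
from pathlib import Path
from typing import Iterable, Tuple


def write_diff_report(pairs: Iterable[Tuple[str, str]], report_path: str) -> int:
    """Write a unified diff for every differing pair; return how many pairs differed."""
    differing = 0
    with open(report_path, "w", encoding="utf-8") as report:
        for old_path, new_path in pairs:
            # These two lists are rebound on the next iteration, so the
            # previous pair's lines become eligible for garbage collection.
            old_lines = Path(old_path).read_text(encoding="utf-8").splitlines()
            new_lines = Path(new_path).read_text(encoding="utf-8").splitlines()
            diff = list(difflib.unified_diff(
                old_lines, new_lines,
                fromfile=str(old_path), tofile=str(new_path),
                n=0, lineterm="",  # n=0: no context lines, only the changes
            ))
            if diff:
                differing += 1
                report.write("\n".join(diff) + "\n")
    return differing
```

This still loads both files of a pair in full, which is exactly the part I'm asking whether I can avoid.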

The Question
Is there an easier, more scalable way of doing this without resorting to arrays? I'm more than willing to do any XML method if that would give a quicker, more scalable result.

Thanks in advance all.
 