So merge whole records instead of single lines; that's not a problem. It doesn't even matter much if you sort on only one field, because eliminating the many trivial matches makes the problem far smaller. At that point you can run the existing, expensive, but precise algorithm on what remains.
See Richard's very interesting description of this solution, though it depends on being comfortable with the Unix toolbox.
I think for this particular problem, if you need to do this task often, changing the algorithm will give you great results. As a general rule, and as a better answer to your challenge, exploring what work has been done on using BDB might be more generally useful.
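As a hypothetical illustration of the persistence idea, Python's standard `dbm` module can stand in for BDB here: entries live on disk keyed by dn, so lookups don't need the whole data set in RAM. The file name and sample dn below are made up:

```python
import dbm
import os
import tempfile

# Store each entry keyed by its dn; later lookups then hit disk, not RAM.
db_path = os.path.join(tempfile.mkdtemp(), "entries_a.db")

with dbm.open(db_path, "n") as db:            # "n": create a new database
    db[b"dn: uid=yoel,dc=example"] = b"dn: uid=yoel,dc=example\nuid: yoel\n"

with dbm.open(db_path, "r") as db:            # "r": read-only reopen
    entry = db[b"dn: uid=yoel,dc=example"]
```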
Have fun. Daniel
Yoel Jacobsen yoel@emet.co.il wrote:
This is not correct, since I need to compare entry to entry, not pair to pair. Sorting will only damage the LDIF files.
Yoel
danielv@netvision.net.il wrote:
- Any good idea about how to make it practical for 450K entries (18M
lines)? What should I use for persistence?
Assuming that the entries have to be string equal to be equal and thus "not differences" and thus boring:
1. Use some generic sorting utility like Unix 'sort' to sort both inputs. They're pretty good at doing this for big files.
2. Do something akin to a phase in merge sort: read both files in a synched manner. Any lines that match in both files, ignore. Any lines without a match, keep. If you have many matches, don't keep them in memory, but write them to a file.
This should be fast, and more useful than the python code.
Thanks
Yoel
Daniel
squeak-dev@lists.squeakfoundation.org