So merge whole records instead of single lines; that's not a problem. It doesn't even matter much if you sort on only one field, because eliminating the many trivial matches makes the problem far smaller. At that point you can run the existing, expensive, but precise algorithm on what remains.
See Richard's very interesting description of this solution, though it depends on being comfortable with the Unix toolbox.
I think for this particular problem, if you need to do this task often, changing the algorithm will give you great results. As a general rule, and as a better answer to your challenge, exploring what work has been done on using BDB might be more generally useful.
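As a hypothetical illustration of the persistence idea, Python's standard `dbm` module can stand in for BDB here: entries live on disk keyed by dn, so lookups don't need the whole data set in RAM. The file name and sample dn below are made up:

```python
import dbm
import os
import tempfile

# Store each entry keyed by its dn; later lookups then hit disk, not RAM.
db_path = os.path.join(tempfile.mkdtemp(), "entries_a.db")

with dbm.open(db_path, "n") as db:            # "n": create a new database
    db[b"dn: uid=yoel,dc=example"] = b"dn: uid=yoel,dc=example\nuid: yoel\n"

with dbm.open(db_path, "r") as db:            # "r": read-only reopen
    entry = db[b"dn: uid=yoel,dc=example"]
```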
Have fun. Daniel
Yoel Jacobsen yoel@emet.co.il wrote:
This is not correct, since I need to compare entry to entry, not pair to pair. Sorting will only damage the LDIF files.
Yoel
danielv@netvision.net.il wrote:
- Any good idea about how to make it practical for 450K entries (18M
lines)? What should I use for persistence?
Assuming that the entries have to be string equal to be equal and thus "not differences" and thus boring:
1. Use some generic sorting utility like Unix 'sort' to sort both inputs. They're pretty good at doing this for big files.
2. Do something akin to a phase in merge sort: read both files in a synched manner. Any lines that match in both files, ignore. Any lines without a match, keep. If you have many matches, don't keep them in memory, but write them to a file.
This should be fast, and more useful than the python code.
Thanks
Yoel
Daniel
squeak-dev@lists.squeakfoundation.org