How to speed up Perl processing of two very big text files
There are two very big text files, a few million lines each: fileA and fileB. The Perl task is to pick out the lines whose content appears in both files.
For instance, line 3 of fileA reads
abdce fghijklmnop\n
and in fileB the same line
abdce fghijklmnop\n
happens to be line 30,000.
The Perl script should pick out those lines and print them.
That should be easy: exhaustively search fileB for each line of fileA.
But the processing time would be very long for very big fileA and fileB.
Is there a way in Perl to speed up the processing?
Splitting up one file and processing the pieces in parallel on multiple CPUs should be one way (a sketch of that idea follows below).
Are there other ways, in Perl?
Please help.
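For reference, here is a minimal sketch of the split-and-parallelize idea from the question, using plain fork(). The worker count and filenames are placeholders, and the per-line membership test via a hash of fileA's lines is an assumption added for illustration, not a detail from the thread; matched lines may print in any order across workers.

use strict;
use warnings;
use POSIX qw(ceil);

my $workers = 4;    # hypothetical worker count

# Index fileA once in the parent; after fork() the children
# share this hash as a largely read-only copy-on-write structure.
open my $fh_a, '<', 'fileA' or die "Cannot open fileA: $!";
my %in_a;
while ( my $line = <$fh_a> ) {
    chomp $line;
    $in_a{$line} = 1;
}
close $fh_a;

# Read fileB and hand each worker a contiguous chunk of its lines.
open my $fh_b, '<', 'fileB' or die "Cannot open fileB: $!";
my @lines_b = <$fh_b>;
close $fh_b;
chomp @lines_b;

my $chunk = ceil( @lines_b / $workers );
for my $w ( 0 .. $workers - 1 ) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    next if $pid;    # parent keeps forking
    my $first = $w * $chunk;
    my $last  = $first + $chunk - 1;
    $last = $#lines_b if $last > $#lines_b;
    for my $i ( $first .. $last ) {
        print "$lines_b[$i]\n" if $in_a{ $lines_b[$i] };
    }
    exit 0;          # child is done with its chunk
}
wait() for 1 .. $workers;    # reap all children

Building the hash before fork() means the children share one copy of it on systems with copy-on-write fork, so memory use does not multiply with the worker count.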
The first thing is to read the whole file into memory, assuming you have enough memory; line-by-line methods such as readline can be very slow on non-SSD drives.
This takes only very simple Perl programming skills.
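To illustrate that first step, a minimal sketch of loading a whole file into memory; the filename is a placeholder. Reading the filehandle in list context pulls in every line with a single statement:

use strict;
use warnings;

open my $fh, '<', 'fileA' or die "Cannot open fileA: $!";
my @lines = <$fh>;    # list context loads the whole file at once
close $fh;
printf "read %d lines\n", scalar @lines;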
Wouldn't diff handle this? Why use Perl at all?
The Perl algorithm should be:
Store the lines into two arrays:

use strict;
use warnings;

open my $fh_a, '<', 'fileA' or die "Cannot open fileA: $!";
my @lines_a;
while ( <$fh_a> ) {
    chomp;
    push @lines_a, $_;
}
close $fh_a;

open my $fh_b, '<', 'fileB' or die "Cannot open fileB: $!";
my @lines_b;
while ( <$fh_b> ) {
    chomp;
    push @lines_b, $_;
}
close $fh_b;

# Then compare @lines_a and @lines_b.
# You can use a regexp: join the lines with |.
# quotemeta escapes any regex metacharacters in the data, and
# the anchors turn the alternation into a whole-line match.
my $regex_a = join '|', map { quotemeta } @lines_a;
my $qr_a    = qr/^(?:$regex_a)$/;
for my $var ( @lines_b ) {
    if ( $var =~ $qr_a ) {
        print "$var\n";
    }
}

Looping over @lines_a inside the @lines_b loop definitely will not do; it is far too slow,
effectively a double loop.
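None of the replies spell it out, but the usual way to avoid both the double loop and a giant alternation regex (joined from millions of lines, the pattern itself becomes enormous and may fail to compile) is a hash lookup. The sketch below is an addition, not code from the thread, and the filenames are placeholders: one pass indexes fileA, one pass scans fileB, and each lookup is O(1).

use strict;
use warnings;

# One pass to index fileA; hash lookups are O(1) per line.
open my $fh_a, '<', 'fileA' or die "Cannot open fileA: $!";
my %seen;
while ( my $line = <$fh_a> ) {
    chomp $line;
    $seen{$line} = 1;
}
close $fh_a;

# One pass over fileB; print each line that also appeared in fileA.
open my $fh_b, '<', 'fileB' or die "Cannot open fileB: $!";
while ( my $line = <$fh_b> ) {
    chomp $line;
    print "$line\n" if $seen{$line};
}
close $fh_b;

If neither file fits in memory, the usual fallback is to sort both files externally and merge the sorted streams line by line.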
Thank you very much for the response
I am going to test it out.