Tuesday, February 22, 2005

In the land of clickety click...

the robot is king. Here is the script I wrote to collect and parse data from businessweek.com. I still have to write a few more, and this one isn't finished, but what's here works. Next I have to figure out how to authenticate my bot to the businessweek server so I can parse the list of URLs the bot finds. Much gratitude thus far to Ethan Zuckerman, who sent me some code to look at from his GAP project. And many, many thanks to Henry Wasserman, who helped me with my Perl syntax and with interfacing with the web. Yippee, my Perl books should be here soon.

My code is very ghetto, mostly because I don't understand some of the better HTML parsing modules. But this is the best I could do. Nerds, please tell me what you would change. If this code is useful to you, steal away. OpenSource4L. One catch: I replaced the HTML tags in the code with descriptions in caps, so swap the real tags back in before you run it. Put in whichever search keywords you wish:

use LWP::Simple;
use HTML::SimpleParse;
use Win32API::File 0.08 qw( :ALL );   # Windows-only; provides DeleteFile
$| = 1;   # unbuffered output

# Search keywords, one results page is fetched per keyword
my @words = ('FOO','BAR','BLAH');
my $ref = -1;

foreach (@words) {
    $ref++;
    # Fetch the search results page for this keyword
    $index[$ref] = get("http://search.businessweek.com/Search?searchTerm=$words[$ref]&skin=BusinessWeek&x=9&y=5");
    die "Couldn't fetch results for $words[$ref]\n" unless defined $index[$ref];
    $p = new HTML::SimpleParse( $index[$ref] );
    open(OUTFILE, ">output[$ref].txt") or die "Can't open output[$ref].txt: $!";

    $flag = 0;   # 1 while we are inside the results section
    $test = 0;   # number of chunks saved

    # Save only the part of the page between "Results " and "Result page"
    foreach ($p->tree) {
        if ($p->execute($_) =~ /Results /) {
            $flag = 1;
        }
        if ($flag == 1) {
            $test++;
            print OUTFILE $p->execute($_);
            if ($p->execute($_) =~ /Result page/) {
                $flag = 0;
            }
        }
    }
    print "There were $test lines saved for parsing for $words[$ref] \n";
    close OUTFILE;

    open(INFILE, "output[$ref].txt") or die "Can't open output[$ref].txt: $!";
    open(OUTFILE, ">goodies[$ref].txt") or die "Can't open goodies[$ref].txt: $!";

    # Pull the URL, the bolded title, and the date out of the saved chunk
    while (<INFILE>) {
        if ($_ =~ /BRACKET THEN A HREF/ ) {
            ($url,$BetweenTheBold) = $_ =~ /.*'(.*)'.*BOLD TAG(.*)CLOSE BOLD TAG AND ESCAPE/ ;
            print OUTFILE "$url\n";
            print OUTFILE "$BetweenTheBold\n";
        }
        elsif ($_ =~ /\d{2}/ ) {
            ($date) = $_ =~ /-.*((January|February|September|November|December|March|April|May|June|July|August|October).{2}.*\d{4}).*/ ;
            print OUTFILE "$date\n\n";
        }
    }
    close INFILE;
    close OUTFILE;
}

# Merge the per-keyword results into total.txt, then delete the temp files
my $var = -1;
open(OUTFILE, ">total.txt") or die "Can't open total.txt: $!";
while ($var < $ref) {
    $var++;
    open(INFILE, "goodies[$var].txt") or die "Can't open goodies[$var].txt: $!";
    while (<INFILE>) {
        # Skip blank lines
        if ($_ =~ /\w/) {
            print OUTFILE $_;
        }
    }
    close INFILE;
    DeleteFile ("goodies[$var].txt");
    DeleteFile ("output[$var].txt");
}
close OUTFILE;
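
On the "what would you change" question, one option is to let a link-aware parser find the URLs instead of matching tags by hand. The following is only a rough sketch using HTML::LinkExtor (part of the HTML::Parser distribution). The extract_links helper, the sample markup, and the example.com addresses are made up for illustration and are not the real BusinessWeek page layout.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Return the href of every <a> tag in $html, made absolute against $base.
# (extract_links is a hypothetical helper name, not part of any module.)
sub extract_links {
    my ($html, $base) = @_;
    my @urls;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attr) = @_;
            # The callback also fires for img, link, etc.; keep anchors only.
            push @urls, "$attr{href}" if $tag eq 'a' && defined $attr{href};
        },
        $base,
    );
    $parser->parse($html);
    $parser->eof;
    return @urls;
}

# Hypothetical sample markup standing in for a saved results page.
my $sample = '<p><a href="/story/one.htm"><b>First</b></a> '
           . '<img src="x.gif"> '
           . '<a href="http://example.com/two.htm">Second</a></p>';

print "$_\n" for extract_links($sample, 'http://www.example.com/');
```

Because the parser reads the tags itself, this no longer depends on the quoting style or line layout of the page, which is where the hand-written regexes are most fragile.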

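For the authentication step mentioned at the top, LWP::UserAgent is the usual step up from LWP::Simple, since it can carry cookies from a login over to later requests. This is a sketch under assumptions: how the businessweek server actually authenticates is not known here, so the login URL, form field names, and bot name below are all placeholders, and make_agent and fetch_page are hypothetical helper names.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent that remembers cookies between requests, so a login
# made once is sent along with every later fetch.
sub make_agent {
    my ($name) = @_;
    return LWP::UserAgent->new(
        agent      => $name,
        cookie_jar => {},    # in-memory cookie jar
        timeout    => 30,
    );
}

# Fetch one URL; return the page body, or undef (with a warning) on failure.
sub fetch_page {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    unless ($response->is_success) {
        warn "Failed to fetch $url: " . $response->status_line . "\n";
        return undef;
    }
    return $response->decoded_content;
}

my $ua = make_agent('MyResearchBot/0.1');    # hypothetical bot name

# PLACEHOLDER login, assuming a plain form post. The URL and field names
# are invented and must be replaced with whatever the real site expects.
# my $login = $ua->post('http://www.example.com/login',
#                       { username => 'me', password => 'secret' });
# my $html  = fetch_page($ua, 'http://www.example.com/members/story.htm');
```

If the server uses HTTP basic authentication rather than a login form, the agent's credentials method would take the place of the form post.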