Updated Nov 23, 2009 by stefan.petrea
architecture_of_music_crawler  
A pragmatic approach to writing a music crawler

A pragmatic approach to writing an MP3 crawler in Perl

MP3 search engines are common and useful nowadays, so why not write a crawler, which would be the heart of such a search engine anyway?

To name a few popular ones:

So where do we start?

We need a language, and we have already chosen that one: Perl.

We need a database engine; let that be MySQL.

We will design the crawler to run on Linux.

Considerations

What we want to get is a crawler that:

  1. is reliable
  2. knows how to save time
  3. is fast
  4. is modularised
  5. will never finish its job (yes, this is one of the few cases where we want that: a crawler should never idle, it should continuously crawl the web for new results)
  6. is able to collect the ID3 tag information in order to build a nice database

So let's get on with building the database structure:

Database

We want three tables: Albums, Artists and Mp3s.

The database structure is fairly intuitive, so we lay it out quickly:

-- Database: `music_crawler`
-- 

-- --------------------------------------------------------

-- 
-- Table structure for table `Albums`
-- 

CREATE TABLE `Albums` (
  `id` int(11) NOT NULL auto_increment,
  `artist_id` int(11) NOT NULL,
  `genre_id` int(11) NOT NULL,
  `name` varchar(255) NOT NULL,
  `year` int(11) NOT NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `artist_id` (`artist_id`,`name`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 AUTO_INCREMENT=43991 ;

-- --------------------------------------------------------

-- 
-- Table structure for table `Artists`
-- 

CREATE TABLE `Artists` (
  `id` int(11) NOT NULL auto_increment,
  `name` varchar(80) default NULL,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `name` (`name`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 AUTO_INCREMENT=73157 ;

-- --------------------------------------------------------

-- 
-- Table structure for table `Mp3s`
-- 

CREATE TABLE `Mp3s` (
  `id` int(11) NOT NULL auto_increment,
  `album_id` int(11) NOT NULL,
  `artist_id` int(11) NOT NULL,
  `title` varchar(2000) NOT NULL,
  `url` varchar(2000) NOT NULL,
  `link_text` varchar(2000) NOT NULL,
  `date` datetime NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 AUTO_INCREMENT=184462 ;

So it's simple: Artists have many Albums, and Albums in turn have many Mp3s.

We have already treated the problem of scraping result pages here, by building an iterator that gives us easy access to them.

Main crawler

The $search_string variable below holds a special search string that helps Google find as many mp3s as possible:

#!/usr/bin/perl
use strict;
use warnings;
use Google;    # the result page iterator mentioned above

# Use the query given on the command line, or fall back to the default one.
my $search_string = shift @ARGV
                    || 'intitle:"index.of" "parent directory" "size" "last modified" "description" [snd] (mp4|mp3|avi) -inurl:(html|htm|mp3s|mp3|index) -gallery      -intitle:"last modified" -intitle:"intitle"';
print "SEARCH STRING = \n $search_string\n";

my $n = 0;    # number of result links handed to crawl_page.pl
do {
    for my $link ( Google::get_next_page($search_string) ) {
        $n++;
        print "PROCESSING link $link\n";
        # The list form of system() bypasses the shell, so special
        # characters in the URL cannot break the command line.
        system 'perl', 'crawl_page.pl', $link;
    }
    sleep 20 + int( rand(10) );    # pause between result pages
} until ( Google::is_last_search_page() );
 26 

Now we have to implement the crawl_page.pl script. It takes a page, finds all the mp3s linked on it and processes them.

Note that we are using a PartialGET module, which we'll implement later. This module is responsible for detecting whether a file really is an mp3 (the isMP3 method) and for actually fetching the mp3 (the getURLtoPartialFile method).
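
As an illustration of what such a check can look at, here is a minimal sketch that inspects the first bytes of a download. The name looks_like_mp3 is hypothetical and this is not the PartialGET::isMP3 code (which takes a file name); it only assumes the two common ways an mp3 file starts: an "ID3" marker, or an MPEG audio frame header whose first 11 bits are all set.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PartialGET): guess from the first
# bytes of a download whether the data is MP3 audio.
sub looks_like_mp3 {
    my ($bytes) = @_;
    return 0 if !defined $bytes || length($bytes) < 3;

    # An ID3v2 tag, when present, starts the file with the marker "ID3".
    return 1 if substr( $bytes, 0, 3 ) eq 'ID3';

    # Otherwise expect an MPEG audio frame header: the first 11 bits
    # (all of byte 0 and the top 3 bits of byte 1) are set.
    my ( $b0, $b1 ) = unpack 'C2', $bytes;
    return ( $b0 == 0xFF && ( $b1 & 0xE0 ) == 0xE0 ) ? 1 : 0;
}
```

A fake link typically returns an HTML page, which starts with "<" and fails both tests.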

Many pages that claim to have mp3s on them are fake, auto-generated pages with links to other sites. We do not want those, and our crawler must not lose time on them, so we use a $maxFakeMp3 threshold: a page with more fake links than that is skipped.

We also need to keep in mind that the crawler must be fast, so we measure how much time has passed while working through the links. If the first $linksTimeSample links took more than $linksTimeSample * $maxTimePerLink seconds in total, we abort the page. In addition, each download has a timeout, set inside the PartialGET module, that interrupts a transfer that takes too long.
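
The two abort rules can be written down as one small decision function. This is only a sketch: abort_reason and its named parameters are hypothetical, and the script itself keeps the checks inline in its main loop.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a reason string when the current page
# should be abandoned, or undef when crawling may continue.
sub abort_reason {
    my (%s) = @_;

    # Rule 1: too many links that turned out not to be mp3s.
    return 'too many fake links'
        if $s{fake_links} > $s{max_fake};

    # Rule 2: after the first sample_size links, compare the elapsed
    # time (in seconds) with the budget for that many links.
    return 'sample too slow'
        if $s{processed} == $s{sample_size}
        && $s{elapsed} > $s{sample_size} * $s{max_time_per_link};

    return;
}
```

With a sample of 10 links and 7 seconds allowed per link, the page is dropped if those 10 links took more than 70 seconds.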

If all goes well, mp3s are inserted into the database with the insertMP3 method of the MP3::Database module.

We also skip an mp3 whose link is already in the database, using the urlPresent method of the same module.

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use PartialGET qw/$TimeOut/;
use MP3::ID3Data;
use MP3::Database;
use Data::Dumper;

# checking this link http://hammerweb.mine.nu/Mp3/?C=S;O=A

my $URL = shift @ARGV or die "usage: $0 URL\n";

MP3::Database::connect();

my $s = WWW::Mechanize->new;
# agent() takes a full User-Agent string; agent_alias() only accepts
# predefined short names such as 'Windows IE 6'.
$s->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 4.10 (build 01952))');

# Fetch $url and return the absolute URLs of all links ending in .mp3.
sub getMP3links {
    my ($url) = @_;
    my @mp3_links;
    print "ENTERED GET_MP3LINKS!\n";
    $s->get($url);

    for my $link ( $s->links ) {
        printf "CANDIDATE link = %s\n", $link->url_abs;
        next if $link->url !~ /\.mp3$/i;
        push @mp3_links, $link->url_abs;
    }

    print "GOT THE PAGE $url\n";

    return @mp3_links;
}

my $mp3_temp_file = "temp.mp3";

my $n = 0;    # this counts the links processed so far

# Plain epoch seconds: the ->seconds of a DateTime duration is only the
# remainder after whole minutes, so it can never reach these limits.
my $startTime = time();

my $maxTimePerLink  = $TimeOut * 0.7;
my $linksTimeSample = 10;      # used in tandem with $PartialGET::TimeOut, the timeout for a request
my $fakeMp3Links    = 0;
my $maxFakeMp3      = 6;
my $maxTimeSpent    = 3600;    # maximum time spent on this page, in seconds

for my $mp3link ( getMP3links($URL) ) {
    $n++;
    unlink $mp3_temp_file;    # remove the previous download, if any
    print "PROCESSING MP3 AT $mp3link\n";
    next if MP3::Database::urlPresent($mp3link);    # if we have already processed this link we skip it
    my $err = PartialGET::getURLtoPartialFile( $mp3link, $mp3_temp_file );

    # time to check if we spent too much time on a fake mp3 page
    if ( $n == $linksTimeSample ) {
        exit 0 if time() - $startTime > $linksTimeSample * $maxTimePerLink;
    }

    next if $err;    # err happens when the file has not been fetched
    unless ( PartialGET::isMP3($mp3_temp_file) ) {
        $fakeMp3Links++;
        exit 0 if $fakeMp3Links > $maxFakeMp3;    # we exit if we had more than $maxFakeMp3 fake mp3 links
        next;    # if the file is not an mp3 there is no need to go further
    }

    # from here on it's clear we have a good mp3; we analyse it and put it in the db
    my $taginfo = MP3::ID3Data::getData($mp3_temp_file);
    $taginfo->{url}   = $mp3link;
    $taginfo->{genre} = 0 if ref $taginfo->{genre};
    next if $taginfo->{title} eq 'temp';    # we have an mp3 without id3 tags so we eliminate it
    print Dumper $taginfo;
    MP3::Database::insertMP3($taginfo);

    exit 0 if time() - $startTime > $maxTimeSpent;
}

Now we need to implement the wrapper we used for the database.

It is not essential to the exposition, so we just point to where it can be found: MP3::Database.

We will, however, discuss the PartialGET module, and in particular this piece of code:

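use LWP::UserAgent;
use HTTP::Request;

# Fetch the first $bytes bytes of $url; returns undef on failure.
# $TimeOut is the module-level request timeout of PartialGET.
sub getFirstBytes {
    my ( $url, $bytes ) = @_;
    my $ua = LWP::UserAgent->new;
    $ua->timeout($TimeOut);
    my $req = HTTP::Request->new( GET => $url );
    # Byte ranges are inclusive, so the first $bytes bytes are 0 .. $bytes-1.
    $req->header( Range => sprintf( "bytes=0-%d", $bytes - 1 ) );
    my $response = $ua->request($req);
    return undef unless $response->is_success;
    # A server that ignores Range answers with the whole file;
    # keep only the part we asked for.
    return substr( $response->content, 0, $bytes );
}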

The main idea is that we do NOT need the whole mp3 file in order to get to the ID3 tags (there are two kinds of tags and we consider both, ID3v1 and ID3v2).

getFirstBytes fetches the first $bytes bytes of $url; an ID3v2 tag, when present, sits at the very beginning of the file.

We proceed in a similar way for the ID3v1 tag, which occupies the last 128 bytes of the file (the getLastBytes method).
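
To show why so few bytes are enough, here is a minimal sketch that decodes an ID3v1 tag from the tail of a file. In the ID3v1 format the tag is a fixed 128-byte block at the end of the file that starts with the marker "TAG". The function name parse_id3v1 is hypothetical; the crawler itself relies on MP3::ID3Data for this step.

```perl
use strict;
use warnings;

# Hypothetical helper: decode an ID3v1 tag from a buffer that holds
# (at least) the last 128 bytes of an mp3 file.
# Returns a hash reference, or undef if no tag is present.
sub parse_id3v1 {
    my ($tail) = @_;
    return if !defined $tail || length($tail) < 128;

    my $tag = substr $tail, -128;

    # Layout: "TAG", title(30), artist(30), album(30), year(4),
    # comment(30), genre(1). Z30 reads 30 bytes up to the first NUL.
    my ( $magic, $title, $artist, $album, $year, $comment, $genre ) =
        unpack 'a3 Z30 Z30 Z30 a4 Z30 C', $tag;
    return if $magic ne 'TAG';

    # Some taggers pad the text fields with spaces instead of NULs.
    s/\s+$// for $title, $artist, $album, $comment;

    return {
        title   => $title,
        artist  => $artist,
        album   => $album,
        year    => $year,
        comment => $comment,
        genre   => $genre,
    };
}
```

The buffer passed in would be whatever a getLastBytes-style request for the final 128 bytes returns.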

So we use the HTTP Range request header to fetch exactly the part of the file we want.

We find examples of how these headers look here and here.
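
For reference, the two header values needed here can be built as follows. This is a small sketch with hypothetical helper names; it only produces the value of the Range header and leaves sending the request to the caller.

```perl
use strict;
use warnings;

# Hypothetical helpers that build the value of an HTTP Range header.

# First $n bytes: byte positions are inclusive and start at 0.
sub range_first {
    my ($n) = @_;
    return sprintf 'bytes=0-%d', $n - 1;
}

# Last $n bytes: a suffix range has no start position.
sub range_last {
    my ($n) = @_;
    return sprintf 'bytes=-%d', $n;
}
```

With an HTTP::Request object this would be used as $req->header( Range => range_last(128) ). A server that honours the header answers with status 206 (Partial Content); one that ignores it answers 200 and sends the whole file.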

How to use it

The database is obviously empty at the moment.

We need at least some artists as a starting point, so we leave the main crawler to run unattended for a while.

Once we have some artists, we can use the following script to get a practically unlimited supply of search results.

We will call this script crawl_existing_artists.pl.

#!/usr/bin/perl

use strict;
use warnings;
use MP3::Database;

MP3::Database::connect();
my @artists = MP3::Database::getAllArtists();

# Directory listing query, to be narrowed down by artist name.
my $query = 'intitle:"index of" +"last modified" +"parent directory" +description +size +(mp3)';

for my $artist (@artists) {
    # The list form of system() bypasses the shell, so quotes or
    # apostrophes in an artist name cannot break the command line.
    system 'perl', 'main_crawler.pl', qq{$query "$artist"};    # artist as an exact phrase
    system 'perl', 'main_crawler.pl', qq{$query $artist};      # artist as loose keywords
}
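
Artist names come from ID3 tags found in the wild, so they may contain double quotes or stray whitespace that would distort the exact-phrase query. A minimal sketch of a query builder that cleans the name first follows; artist_queries is a hypothetical helper, not part of the published source.

```perl
use strict;
use warnings;

# Hypothetical helper: build the two search queries for one artist,
# an exact-phrase variant and a loose-keyword variant.
sub artist_queries {
    my ($artist) = @_;
    my $base = 'intitle:"index of" +"last modified" +"parent directory" +description +size +(mp3)';

    ( my $clean = $artist ) =~ s/"//g;    # a double quote would end the phrase early
    $clean =~ s/^\s+|\s+$//g;             # trim leading and trailing whitespace

    return ( qq{$base "$clean"}, qq{$base $clean} );
}
```

Each returned string can be passed to main_crawler.pl as its single argument.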

Conclusions

The source code can be found here.

Use it as you wish


Comment by moritz.lenz, Sep 15, 2008

When you write or use a crawler, please make sure that it adheres to the robots.txt exclusion standard.

Comment by talexb, Sep 19, 2008

How about using the Perl builtin unlink instead of shelling out to do an rm? You get a) better error handling, b) no expense (time, memory) of starting a shell to run just one command whose output you're ignoring.

Alex ps And 'loose' should be 'lose'.

Comment by peteris.krumins, Sep 19, 2008

not really, if it is an mp3 crawler, than going beyond robots.txt is allowed!

Comment by svantes...@gmail.com, Jun 09, 2009

ID3v1 resides at the beginning of a file, ID3v2 is at the end. Not the other way around.

