abot - issue #107

Suggested feature: More configurable logging w/ log4net


Posted on May 29, 2013 by Swift Giraffe

This is great - I've been toying around with it for a few hours and I'm really loving how fast it is to get up and configure.

1) My needs for a crawler are to crawl a local version of a site with 5000+ pages, and I only really care about the non-200 responses (404, 500, etc.). Abot is currently not configurable in this way (that I could find) - it logs everything. I'd love to see a configuration option like "LogInformationalMessages" (bool) and another like "LogAllNon200HttpStatusResponses" (bool). I've modded my source to have those, and when using log4net to write to SQL it makes it far easier to view 100 rows (the 404/500/etc. responses) than to write complex queries to sift through 5000+ rows each time.
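For illustration, the config entries I added in my fork look roughly like this (these two attributes are my own additions and do not exist in stock Abot - the surrounding section layout is just a sketch of a typical app.config):

```xml
<!-- Hypothetical additions to Abot's crawlBehavior config section.
     logInformationalMessages / logAllNon200HttpStatusResponses
     exist only in my modified source, not in the shipped library. -->
<abot>
  <crawlBehavior
    logInformationalMessages="false"
    logAllNon200HttpStatusResponses="true" />
</abot>
```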

2) My other request would be to atomize the logging so we can have separate columns for each piece of information. Currently a 404 looks like this in the log:

Page crawl complete, Status:[404] Url:[http://localhost:1000/MyFolder/MyFile.aspx?MyQS=165&AnotherQS=2800] Parent:[http://localhost:1000/MyFolder/123/MyFile.aspx]

That's just very, very hard to parse in SQL. I'd prefer to have it with a SQL table structure that's more like this:

CREATE TABLE dbo.log4NetResult
(
    LineId     int           IDENTITY(1,1) NOT NULL,
    Date       datetime      NOT NULL,
    Thread     varchar(255)  NOT NULL,
    Level      varchar(50)   NOT NULL,
    Logger     varchar(255)  NOT NULL,
    Message    varchar(4000) NOT NULL,
    Exception  varchar(2000) NULL,
    FullUrl    varchar(1024) NOT NULL,
    UriStem    varchar(512)  NOT NULL,
    UriQuery   varchar(512)  NULL,
    Port       int           NOT NULL,
    ParentPage varchar(1024) NULL,
    HttpStatus int           NOT NULL,
    TimeTaken  int           NOT NULL,
    CONSTRAINT PK_log4NetResult PRIMARY KEY NONCLUSTERED (LineId)
)
GO

This way I could easily query and say, "Show me all 404s in the past day" or "Show me all 404s that had a uri querystring that contains the value '123'". It would just be far easier.
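With a schema like that, those two questions become one-liner queries (these assume the dbo.log4NetResult table sketched above, in T-SQL):

```sql
-- All 404s logged in the past day
SELECT FullUrl, ParentPage, Date
FROM dbo.log4NetResult
WHERE HttpStatus = 404
  AND Date >= DATEADD(day, -1, GETDATE());

-- All 404s whose querystring contains '123'
SELECT FullUrl, UriQuery
FROM dbo.log4NetResult
WHERE HttpStatus = 404
  AND UriQuery LIKE '%123%';
```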

Hope this is helpful. Keep up the great work!

Comment #1

Posted on May 30, 2013 by Helpful Dog

Thanks for taking the time to give feedback!

My understanding of your requirement:
- Store non-200 http responses in the db

Have you considered:
- Subscribing to the PageCrawlCompleted event
- In the event handler, checking e.CrawlPage.HttpResponse.Status to see if it's ok/200
- If it's non-200, inserting whatever data about the page you want into your db
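A minimal sketch of that approach (property and event names are from memory and may differ slightly between Abot versions, e.g. CrawledPage/HttpWebResponse vs. the names above - check them against the version you're compiling against; the localhost URL is just the example from the original post):

```csharp
using System;
using System.Net;
using Abot.Crawler;
using Abot.Poco;

class Non200Logger
{
    static void Main()
    {
        var crawler = new PoliteWebCrawler();

        // Fires once for each page after its crawl completes
        crawler.PageCrawlCompleted += (sender, e) =>
        {
            CrawledPage page = e.CrawledPage;

            // Ignore pages that returned 200 OK
            if (page.HttpWebResponse != null &&
                page.HttpWebResponse.StatusCode == HttpStatusCode.OK)
                return;

            // Replace this with a direct call to your data access class
            Console.WriteLine("Non-200: [{0}] Url:[{1}] Parent:[{2}]",
                page.HttpWebResponse == null ? 0 : (int)page.HttpWebResponse.StatusCode,
                page.Uri,
                page.ParentUri);
        };

        crawler.Crawl(new Uri("http://localhost:1000/"));
    }
}
```

The upside of this over a log4net db appender is that you get the structured fields (url, parent, status) directly from the CrawledPage object instead of parsing them back out of a formatted message string.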

Seems like you're trying to use the current logger (which should mostly be used for debugging) and adding a db appender that parses the log data, when it would be easier to just call your data access class directly from the PageCrawlCompleted event handler.

Comment #2

Posted on May 30, 2013 by Helpful Dog

(No comment was entered for this change.)

Comment #3

Posted on May 30, 2013 by Swift Giraffe

Good info - thanks. Got it.

Comment #4

Posted on Jun 29, 2013 by Helpful Dog

As my suggested approach seems to address the user's use case, marking this as WontFix.

Status: WontFix

Labels:
Type-Feature Priority-Medium