smuto - issue #3

Broken movie descriptions on filmweb cause errors in scraper


Posted on Oct 19, 2010 by Happy Panda

What steps will reproduce the problem?
1. Scrape "Niedokonczone Zycie"
2. Scrape "Inland Empire"

What is the expected output? What do you see instead? The movie description is cut off too early. For example, for "Niedokonczone Zycie" it contains only the phrase "Stary ranczer, Einar Gilkyson (".

What version of the product are you using? On what operating system? svn 17

Please provide any additional information below. The truncated description is caused by malformed movie descriptions on filmweb: the text contains HTML-encoded tags, which the scraper does not remove because the tag delimiters are encoded as character entities.

My solution is to add the 'fixchars="1"' option to each expression used for selecting the description. That way the 'bad tags' are decoded first and then removed by the next regular expression.
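
To illustrate the idea outside the scraper itself, here is a minimal Python sketch of "decode first, then strip tags". It is not the scraper's code and not the attached patch. The function names and the sample string are made up for illustration, and html.unescape merely stands in for what the fixchars option is described as doing above.

```python
import html
import re

# Matches a literal HTML tag such as <b> or </b>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Remove literal HTML tags; entity-encoded tags are left untouched."""
    return TAG_RE.sub("", text)


def clean_description(raw: str) -> str:
    """Decode character entities first, then strip the now-literal tags."""
    decoded = html.unescape(raw)  # "&lt;b&gt;" becomes "<b>"
    return strip_tags(decoded)


# Hypothetical sample, not the real filmweb text.
sample = "Some plot text (&lt;b&gt;actor name&lt;/b&gt;) continues here."

without_decoding = strip_tags(sample)      # encoded tags survive unchanged
with_decoding = clean_description(sample)  # tags are decoded and removed
```

Stripping tags alone leaves the encoded markers in place, which matches the situation described above. Decoding before stripping lets the existing tag-removal step see them as ordinary tags.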

Patch attached.

Attachments

Comment #1

Posted on Oct 27, 2010 by Swift Camel

I applied your suggestions with small modifications. Thanks a lot.

Status: Fixed

Labels:
Type-Defect Priority-Medium