lxml module (Feature Request) [35874381]

Fixed

Feature Request

Status Update

No update yet.

Description

li...@durin42.com

created issue #1

Apr 8, 2008 02:17PM

It would be nice to have the lxml module (http://codespeak.net/lxml/) available, rather than having
to use the builtin xml module. lxml is faster and uses less RAM than the stock one.

Comments

ku...@gmail.com <ku...@gmail.com> #2Apr 9, 2008 08:07AM

Agreed; however, the problem (as I see it) is that lxml's awesomeness primarily
derives from its roots in libxml2 (a C library). You can get most of the API via the
standard library's ElementTree implementation. The major exceptions to this (AFAIK)
are HTML parsing support and lxml's complete implementation of XPath. If you need
another, more specific feature from lxml, could you share?

As far as performance goes, a simple way to help in the short term is to store your
parsed results in a global variable (to take advantage of the caching framework) or
persist them using the datastore. Parsing on each request (if expensive enough to notice
a real difference between lxml and the shipped ElementTree) should probably be
avoided. If it's just parsing AJAX data or the like, the speed difference probably
isn't noticeable. Or, if you're not super-attached to XML, simplejson is always another
option.

Just some opinions, as lxml is a pretty big module, and not something I'd expect to
see implemented in the near future, but I could be wrong.

Cheers!
Kevin

mi...@gmail.com <mi...@gmail.com> #3Apr 10, 2008 06:00PM

For my own code, it's the parsing and validation (lxml can build parsers from any of
the following: DTD, RelaxNG, XMLSchema, or Schematron).

But although I could certainly write my own code to avoid lxml, the more critical
issue is external libraries and frameworks that I use which depend on lxml. That's
actually quite a lot of unnecessary wheel-reinvention.

mi...@gmail.com <mi...@gmail.com> #4Apr 10, 2008 06:06PM

BTW, switching to ElementTree is rather costly, CPU-wise, compared to cElementTree
(which is likewise missing).

ta...@gmail.com <ta...@gmail.com> #5Apr 11, 2008 04:54AM

[Comment deleted]

dj...@gmail.com <dj...@gmail.com> #6Apr 24, 2008 02:24PM

[Comment deleted]

du...@gmail.com <du...@gmail.com> #7Apr 24, 2008 02:30PM

Please do *not* post +1 comments - they don't accomplish anything. Starring issues is how you should vote for
an issue.
Thanks!

we...@gmail.com <we...@gmail.com> #8May 8, 2008 10:26AM

Can someone rename this request to something like "xml, xpath, xslt support (lxml)"?
I think more people would look into that and we could get more stars.

pr...@gmail.com <pr...@gmail.com> #9Jun 4, 2008 05:08AM

Agreed that this should be a request for XML and XPath support, rather than a particular solution. I'd personally
like to see Amara-like parsing of an incoming data stream which then gets marshalled into the datastore model.

ma...@gmail.com <ma...@gmail.com> #10Jun 4, 2008 05:32AM

I disagree, lxml is lxml and nothing else can replace it (see lxml.html for example)

ia...@gmail.com <ia...@gmail.com> #11Jun 4, 2008 05:41AM

Well, we could split out the features. lxml has these features:

* Fast, memory efficient parsing of XML
* Fast, memory efficient parsing of HTML
* Fast XPath
* Fast XSLT
* CSS Selector queries
* Some libraries (like lxml.html) built on top of that

There are other fast, fairly memory efficient XML parsers. There are no other fast
or memory efficient HTML parsers. I'm not sure about other XPath or XSLT
implementations, though I doubt Amara is fast.

The libraries like lxml.html could be ported, but they tend to rely on some other
features. lxml.html in particular uses XPath quite a bit, and also makes use of the
parent pointer that lxml has but ElementTree does not. And of course it is useful
because there's a good HTML parser in lxml.

It would be quite possible to create a more lxml-like model based on ElementTree.
Porting over some existing Python XPath implementation to that might work -- not as
fast, but maybe it could be fast enough. If Google provided a fast HTML parser
(maybe in the form of an HTML to HTML-as-XML translator, so you could pipe the HTML
to a subprocess (a subprocess Google provides, similar to how urlfetch works) and get
back easily parsed HTML-as-XML) then that would be a big help. I say HTML-as-XML as
there are semantic differences in XHTML that I have no interest in, and I'm more
interested in parsing HTML than "cleaning" it somehow. Something more like XHTML 5.
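
To illustrate the parent-pointer gap mentioned above: ElementTree elements do not know their parent, but a lookup table can be built in one pass over the tree. This is only a sketch of one way to approximate lxml's behaviour on top of ElementTree; `build_parent_map` is a hypothetical helper name, and the map has to be rebuilt if the tree is modified.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element in the tree to its parent (the root has no entry)."""
    return {child: parent for parent in root.iter() for child in parent}


doc = ET.fromstring("<html><body><div><p>hello</p></div></body></html>")
parents = build_parent_map(doc)

p = doc.find(".//p")
div = parents[p]      # the <div> that contains <p>
body = parents[div]   # walking further up the tree
```

With lxml the same navigation is a method call on the element itself, which is part of why libraries such as lxml.html are awkward to port.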

ab...@gmail.com <ab...@gmail.com> #12Jun 12, 2008 07:39PM

Full XPath 1.0 and XSLT 1.0 support are what I need most.

el...@gmail.com <el...@gmail.com> #13Jul 8, 2008 08:41PM

XPath 1.0 and XSLT 1.0 at least. XSLT 2.0 would be even better. XSLT 2.0 with
user-definable extensions in Python would be nirvana.

jd...@gmail.com <jd...@gmail.com> #14Jul 25, 2008 09:42AM

XSLT 2.0 would be ideal for me, but I'll be elated just to have XSLT 1.0. (That and XPath, natch.)

As of now, I need to use Amazon EC2, but for what I'm looking to do it's like taking the space shuttle for a
joyride. GAE could do the job I need far more elegantly ... if only! (Crossing fingers.)

ma...@gmail.com <ma...@gmail.com> #15Sep 10, 2008 08:49PM

I don't see any other way to validate XML Schema files without spending a lot of time.

am...@gmail.com <am...@gmail.com> Sep 25, 2008 10:29PM

Assigned to am...@gmail.com.

de...@gtempaccount.com <de...@gtempaccount.com> #16Oct 6, 2008 08:59PM

co...@gmail.com <co...@gmail.com> #17Aug 29, 2009 03:55AM

I'm interested in the css selector aspect of lxml. Any implementation of CSS3
selectors will satisfy me.

ro...@gmail.com <ro...@gmail.com> #18Sep 5, 2009 07:47PM

App Engine needs decent XML tools. Please add the lxml module XPath and XSLT support!
And the overall speed and awesomeness!

dt...@gmail.com <dt...@gmail.com> #19Sep 11, 2009 08:17PM

I agree - XSLT support is sorely needed, not to mention better XPATH support.

ma...@gmail.com <ma...@gmail.com> #20Sep 21, 2009 10:19PM

App Engine needs fast XPath parsing; that's my opinion.

te...@gmail.com <te...@gmail.com> #21Sep 29, 2009 09:29PM

I'm signed up for a GAE account, but owing to its nature, I really can't start coding
my app until lxml is supported. I'll be waiting for this feature to be added.

ne...@gtempaccount.com <ne...@gtempaccount.com> #22Oct 1, 2009 08:58PM

I really need this too. Can't imagine a web-based system without XSLT!

ek...@gmail.com <ek...@gmail.com> #23Nov 11, 2009 05:38AM

I'd love to be able to use the lxml.html module:

http://codespeak.net/lxml/lxmlhtml.html

mi...@gmail.com <mi...@gmail.com> #24Feb 6, 2010 01:43AM

Oh look, yet another cool Python library which requires lxml that I can't use on
Google App Engine:

http://pyquery.org/

dh...@gmail.com <dh...@gmail.com> #25Feb 10, 2010 02:37PM

I'm in the same situation as tedjaniszewski: a decent XML module is really needed.
I need to parse XML as well as HTML and perform XPath and CSS selector queries. This
can be done using Python's default modules (HTMLParser and minidom), but it would
mean that I need to totally rewrite my code, that it would be much slower and, last
but not least, not as handy to use as lxml. Both HTMLParser and minidom are really
inconvenient.

ga...@gmail.com <ga...@gmail.com> #26Feb 10, 2010 08:30PM

It is quite easy to add elementtree to your project.
It's not the fast c version, but the api is better...
And it's reasonably fast and memory efficient.
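
For reference, a short sketch of what ElementTree's path support can do without lxml. It assumes an ElementTree version that supports attribute and position predicates (1.3 or later, as bundled with Python 2.7 and Python 3); older versions accept only simpler paths. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<catalog>"
    "<book id='1' lang='en'><title>Alpha</title></book>"
    "<book id='2' lang='fr'><title>Beta</title></book>"
    "<book id='3' lang='en'><title>Gamma</title></book>"
    "</catalog>"
)

# Attribute predicate: every <book> whose lang attribute is 'en'.
english_titles = [b.findtext("title") for b in doc.findall(".//book[@lang='en']")]

# Position predicate: the second <book> child of the root (1-based).
second_book = doc.find("book[2]")
```

This covers only a small subset of XPath 1.0: there are no functions, no axes such as `ancestor::`, and no XSLT, which is the gap most of this thread is about.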

io...@gmail.com <io...@gmail.com> #27Feb 10, 2010 09:00PM

ElementTree is merely an XML parser and generator. lxml is a fully featured XML
toolkit - it has literally *everything*. However, that comes at a price: the library
has tons of security holes that the GAE team needs to close before making it available
for GAE.

Keep starring this issue.

ph...@driggle.com <ph...@driggle.com> #28Apr 11, 2010 09:25AM

Maybe you could implement the XML and HTML parsing functions Googlebot uses. I'd
assume they are among the fastest and most fault-tolerant available.

bk...@activemind.com <bk...@activemind.com> #29May 6, 2010 04:04PM

I agree with the others who say lxml support is critical. There are too many other
important libraries that rely on it.

I was planning on rolling out a new app to App Engine, but it looks like I might
have to consider using Amazon's services instead.

dg...@gmail.com <dg...@gmail.com> #30May 12, 2010 03:39AM

I keep coming back to this issue when searching for how to use XPath with whatever
library is available in App Engine, but it seems to be a big gap in functionality
(libxml2 is not supported, and minidom doesn't cut it either).

ch...@gmail.com <ch...@gmail.com> #31May 12, 2010 03:57AM

I was able to get along with BeautifulSoup, but my app was small enough to allow
rewriting. Would really prefer to have lxml and all the modules that plug on it.

da...@gmail.com <da...@gmail.com> #32May 13, 2010 10:28AM

@dgtlmoon

There is a pure-Python XPath implementation that works with the standard minidom:
http://code.google.com/p/py-dom-xpath/

ga...@gmail.com <ga...@gmail.com> #33May 29, 2010 02:21PM

IMHO, support for XML with *any* full-featured library is a critical step for some
implementations of semantic apps that aim to be part of a Linked Data ecosystem.
Think RDF (http://www.w3.org/TR/2004/REC-rdf-primer-20040210/).

Yes, there are alternatives and workarounds and "roundabouts" :) ... but they don't
seem to be sustainable in the long term.

Our project, e.g., implements a kind of "xml-chain", starting from raw pieces of data
provided by individual users, through a mashup of structured data stored in a remote
triple store and consumed by some front ends.

For now, the only front-end *out of the chain* is appengine, due to the lack of good
xml support.

OK, fellows, I'm not an expert. But I think there are (and there will be) a lot of
projects with similar requirements in the next few years.

And the developers look like geeks in love with lxml. And I think they have good
reasons for that geek love :)

cheers

ga...@gmail.com <ga...@gmail.com> #34Jun 9, 2010 02:08AM

[Comment deleted]

ga...@gmail.com <ga...@gmail.com> #35Jun 9, 2010 02:09AM

thanks @daevaorn

py-dom-xpath rocks, and now I'm working on a wrapper that allows CSS selectors :)

http://github.com/gabrielfalcao/dominic

ma...@gmail.com <ma...@gmail.com> #36Jun 12, 2010 03:20PM

There are a lot of amazing 3rd party python-source based libraries, and custom code libraries that are based on lxml - it would be great to have it supported on app-engine (namely pyquery!!). Hope this gets pushed in soon.

ma...@gmail.com <ma...@gmail.com> #37Jun 12, 2010 03:20PM

How do we raise priority?

ry...@gtempaccount.com <ry...@gtempaccount.com> #38Jun 23, 2010 07:45PM

It's been over two years since this request was filed. Is there any chance of at least getting a response from Google, either some kind of ETA or a "no, it'll never be supported", for those still considering GAE? Does the fact that it's acknowledged with a medium priority mean it will be supported... eventually?

ma...@gmail.com <ma...@gmail.com> #39Jun 24, 2010 11:40AM

Guys, in case this turns out to be ultimately unsolvable, I'm considering writing a pure-Python port of pyquery. Is anyone interested in this project?

se...@gmail.com <se...@gmail.com> #40Jun 24, 2010 12:25PM

This IS solvable, and the solution is to port Native Client to App Engine.

http://code.google.com/p/nativeclient/

Google devs are smart people; I believe they are going to solve this in the near future.

m3...@gmail.com <m3...@gmail.com> #41Jul 17, 2010 07:44PM

I did find a port of PyQuery that claims to run on the html5 library. Haven't tried it, however.

http://pypi.python.org/pypi/pq.html5/0.5

pr...@gmail.com <pr...@gmail.com> #42Aug 16, 2010 12:14PM

I just found out that I cannot realize my project on GAE, because BeautifulSoup is ultimately too slow.

I wrote about using CherryPy on GAE to dynamically serve images (http://blog.dispatched.ch/2010/08/13/serving-images-dynamically-with-cherrypy-on-google-appengine/), because I am very happy with GAE.

Then I realized that BeautifulSoup is too slow, so I switched to lxml, which I also wrote about: http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/

Now I read that lxml is not supported. Would you please allow lxml or point to a Python only HTML parser that does not suck performance-wise?

jo...@gmail.com <jo...@gmail.com> #43Aug 16, 2010 02:44PM

I use html5lib for parsing HTML. It doesn't have any query support though.

Here's an example:

http://www.johntantalo.com/blog/strip-tags-with-html5lib/

la...@gmail.com <la...@gmail.com> #44Oct 7, 2010 01:42AM

It is more than 2 years since this issue was entered. Is it really such a big problem to add this library to AppEngine???

pd...@gmail.com <pd...@gmail.com> #45Nov 4, 2010 05:24PM

I'd like this too. I was surprised how slow BeautifulSoup, while nice and intuitive, is, when I profiled it.

[Deleted User] <[Deleted User]> #46Nov 15, 2010 08:28PM

I like GAE and I love Python.
I don't particularly like to parse html and xml, but I sometimes need to. And lxml is the best of breed.

If I can't have lxml, I will have to work my way around, but I hope it is possible to have a tiny answer to this question:

Are You working on letting lxml into the fine company?

no...@gmail.com <no...@gmail.com> #47Nov 16, 2010 02:24PM

lxml is the best way to parse data from the web. Why doesn't GAE provide it?

qb...@gmail.com <qb...@gmail.com> #48Dec 13, 2010 06:54AM

So guys, what is your solution for parsing HTML/XML on GAE?

[Deleted User] <[Deleted User]> #49Dec 13, 2010 04:58PM

I keep dreaming of doing it via google.docs and gdata, but I haven't had the need yet
That would probably only amount to xml, but like I said;-)

Would like to hear other approaches though - interesting subject that has to find a solution

[Deleted User] <[Deleted User]> #50Dec 13, 2010 08:59PM

Happy holidays:

http://docs.google.com/support/bin/answer.py?hl=en&answer=75507
gives us at least XPath queries!

ty...@gmail.com <ty...@gmail.com> #51Feb 2, 2011 06:41PM

The Pythonic objectify API is also a valuable benefit of having access to the lxml library.

http://codespeak.net/lxml/objectify.html

ni...@gmail.com <ni...@gmail.com> #52Jul 25, 2011 09:20PM

I'm thrilled that this has been accepted - anyone have a guesstimate of how long it usually takes for the google team to implement accepted requests?

pr...@google.com <pr...@google.com> #53Jul 26, 2011 08:34AM

Accepted means the engineering team is aware of it. We usually don't provide ETA for new features.

ep...@gmail.com <ep...@gmail.com> #54Jul 26, 2011 10:05AM

Some issues were accepted two years ago yet are still open, so I guess this does not mean anything.

Maybe the 2.7 runtime will change this, who knows. This is PaaS anyway.

pr...@google.com <pr...@google.com> #55Jul 26, 2011 10:24AM

It does mean that it got the attention of the engineering team.

The Priority should also be set accordingly.

el...@gmail.com <el...@gmail.com> #56Jul 27, 2011 12:03PM

Any chance someone on this engineering team could provide an ETA?

ro...@gmail.com <ro...@gmail.com> #57Jul 28, 2011 07:14PM

Hey guys,

I'm not a googler but hope this can help. We know that googlers never give an official ETA for features, so asking doesn't work.

But unofficially they commented during office hours on IRC that lxml (and cPickle, cjson) would be coming when the Python 2.7 runtime is released. That'll happen in the upcoming months. No, no ETA.

You can read the full transcript here:

https://groups.google.com/forum/#!searchin/google-appengine/office$20hours%7Csort:date/google-appengine/ZzUDaN226bU/TVY9Z8M06QsJ

ep...@gmail.com <ep...@gmail.com> #58Jul 29, 2011 02:40AM

That would be great, especially since ElementTree seems to be leaking.

bq...@google.com <bq...@google.com> Aug 19, 2011 03:56AM

Accepted by am...@gmail.com.

ap...@gmail.com <ap...@gmail.com> #59Sep 9, 2011 08:42PM

lxml is REALLY needed

ro...@gmail.com <ro...@gmail.com> #60Oct 4, 2011 06:36PM

Prerelease SDK 1.5.5 available for download!

http://groups.google.com/group/google-appengine-python/browse_thread/thread/7fd615a6502546ce
Has anyone tried this yet? :)

ci...@gmail.com <ci...@gmail.com> #61Oct 21, 2011 09:33PM

I recently converted an old application to use the new Python 2.7 appengine with lxml. It works great!

High Replication Storage is a requirement for using Python 2.7 and my old application instance didn't use that, so I needed to create a new application.

jf...@gmail.com <jf...@gmail.com> #62Nov 11, 2011 05:20PM

I am trying to use lxml with the new Python 2.7 App Engine runtime, but I could not find in the documentation where to import the module from. Any ideas?
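
A hedged sketch of the usual import pattern: as far as I know, the Python 2.7 runtime expected third-party libraries such as lxml to be declared under `libraries:` in app.yaml, after which the module is imported under its normal name. The fallback to the standard library below is an assumption for code that must also run where lxml is absent, and `USING_LXML` is a hypothetical flag name.

```python
try:
    # Works only where lxml is installed and enabled for the application.
    from lxml import etree
    USING_LXML = True
except ImportError:
    # Standard-library fallback with a largely compatible core API.
    import xml.etree.ElementTree as etree
    USING_LXML = False

root = etree.fromstring("<root><item>a</item><item>b</item></root>")
items = [el.text for el in root.findall("item")]
```

Code written against the shared subset (`fromstring`, `find`, `findall`, `text`) runs with either backend; lxml-only features such as full XPath or XSLT should be guarded with the flag.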

bq...@google.com <bq...@google.com> Feb 28, 2012 05:56AM

Marked as fixed.
