Fixed
Status Update
Comments
ku...@gmail.com <ku...@gmail.com> #2
Agreed; however, the problem (as I see it) is that lxml's awesomeness primarily
derives from its roots in libxml (a C library). You can get most of the API via the
standard library's ElementTree implementations. The major exceptions to this (AFAIK)
are HTML parsing support and lxml's complete implementation of XPath. If you need
another, more specific feature from lxml, could you share?
As far as performance goes, a simple way to help in the short term is to store your
parsed results in a global variable (to take advantage of the caching framework) or
persist them using the datastore. Parsing on each request (if expensive enough to notice
a real difference between lxml and the shipped ElementTree) should probably be
avoided. If it's just parsing AJAX data or the like, the speed difference probably
isn't noticeable. Or, if you're not super-attached to XML, simplejson is always another
option.
Just some opinions, as lxml is a pretty big module, and not something I'd expect to
see implemented in the near future, but I could be wrong.
Cheers!
Kevin
mi...@gmail.com <mi...@gmail.com> #3
For my own code, it's the parsing and validation (lxml can build parsers from any of
the following: DTD, RelaxNG, XMLSchema, or Schematron).
But although I could certainly write my own code to avoid lxml, the more critical
issue is external libraries and frameworks that I use which depend on lxml. That's
actually quite a lot of unnecessary wheel-reinvention.
mi...@gmail.com <mi...@gmail.com> #4
BTW, switching to ElementTree is rather costly, CPU-wise, compared to cElementTree
(which is likewise missing).
ta...@gmail.com <ta...@gmail.com> #5
[Comment deleted]
dj...@gmail.com <dj...@gmail.com> #6
[Comment deleted]
du...@gmail.com <du...@gmail.com> #7
Please do *not* post +1 comments - they don't accomplish anything. Starring issues is how you should vote for
an issue.
Thanks!
we...@gmail.com <we...@gmail.com> #8
Can someone rename this request to something like "xml, xpath, xslt support (lxml)"?
I think more people would look into that and we could get more stars.
pr...@gmail.com <pr...@gmail.com> #9
Agreed that this should be a request for XML and XPath support, rather than a particular solution. I'd personally
like to see Amara-like parsing of an incoming data stream, which then gets marshalled into the datastore model.
ma...@gmail.com <ma...@gmail.com> #10
I disagree, lxml is lxml and nothing else can replace it (see lxml.html for example)
ia...@gmail.com <ia...@gmail.com> #11
Well, we could split out the features. lxml has these features:
* Fast, memory efficient parsing of XML
* Fast, memory efficient parsing of HTML
* Fast XPath
* Fast XSLT
* CSS Selector queries
* Some libraries (like lxml.html) built on top of that
There are other fast, fairly memory efficient XML parsers. There are no other fast
or memory efficient HTML parsers. I'm not sure about other XPath or XSLT
implementations, though I doubt Amara is fast.
The libraries like lxml.html could be ported, but they tend to rely on some other
features. lxml.html in particular uses XPath quite a bit, and also makes use of the
parent pointer that lxml has but ElementTree does not. And of course it is useful
because there's a good HTML parser in lxml.
It would be quite possible to create a more lxml-like model based on ElementTree.
Porting over some existing Python XPath implementation to that might work -- not as
fast, but maybe it could be fast enough. If Google provided a fast HTML parser
(maybe in the form of an HTML to HTML-as-XML translator, so you could pipe the HTML
to a subprocess (a subprocess Google provides, similar to how urlfetch works) and get
back easily parsed HTML-as-XML) then that would be a big help. I say HTML-as-XML as
there are semantic differences in XHTML that I have no interest in, and I'm more
interested in parsing HTML than "cleaning" it somehow. Something more like XHTML 5.
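On the parent-pointer point: a rough sketch of how the missing parent link can be emulated on top of plain ElementTree by building a child-to-parent map once after parsing. This is only a workaround under the assumption that the tree is not modified afterwards; the name `parent_of` is made up here:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<html><body><p>Hi <b>there</b></p></body></html>")

# Walk every element once and record which element each child hangs under.
# The root has no parent, so it never appears as a key.
parent_of = {child: parent for parent in root.iter() for child in parent}

b = root.find(".//b")
p = parent_of[b]          # the <p> element that contains <b>
body = parent_of[p]       # one more step up the tree
```

The map goes stale as soon as elements are moved or removed, which is part of why a real parent pointer (as in lxml) is more convenient.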
ab...@gmail.com <ab...@gmail.com> #12
Full XPath 1.0 and XSLT 1.0 support are what I need most.
el...@gmail.com <el...@gmail.com> #13
XPath 1.0 and XSLT 1.0 at least. XSLT 2.0 would be even better. XSLT 2.0 with
user-definable extensions in Python would be nirvana.
jd...@gmail.com <jd...@gmail.com> #14
XSLT 2.0 would be ideal for me, but I'll be elated just to have XSLT 1.0. (That and XPath, natch.)
As of now, I need to use Amazon EC2, but for what I'm looking to do it's like taking the space shuttle for a
joyride. GAE could do the job I need far more elegantly ... if only! (Crossing fingers.)
ma...@gmail.com <ma...@gmail.com> #15
I don't see any other way to validate XML schema files without spending a lot of time.
am...@gmail.com <am...@gmail.com>
de...@gtempaccount.com <de...@gtempaccount.com> #16
co...@gmail.com <co...@gmail.com> #17
I'm interested in the css selector aspect of lxml. Any implementation of CSS3
selectors will satisfy me.
ro...@gmail.com <ro...@gmail.com> #18
App Engine needs decent XML tools. Please add the lxml module XPath and XSLT support!
And the overall speed and awesomeness!
dt...@gmail.com <dt...@gmail.com> #19
I agree - XSLT support is sorely needed, not to mention better XPATH support.
ma...@gmail.com <ma...@gmail.com> #20
App Engine needs fast XPath parsing; that's my opinion.
te...@gmail.com <te...@gmail.com> #21
I'm signed up for a GAE account, but owing to its nature, I really can't start coding
my app until lxml is supported. I'll be waiting for this feature to be added.
ne...@gtempaccount.com <ne...@gtempaccount.com> #22
I really need this too. Can't imagine a web-based system without XSLT!
ek...@gmail.com <ek...@gmail.com> #23
mi...@gmail.com <mi...@gmail.com> #24
Oh look, yet another cool Python library which requires lxml that I can't use on
Google App Engine: http://pyquery.org/
dh...@gmail.com <dh...@gmail.com> #25
I'm in the same situation as tedjaniszewski: a decent XML module is really needed.
I need to parse XML as well as HTML and perform XPath and CSS selector queries. This
can be done using Python's default modules (HTMLParser and minidom), but it would
mean that I need to totally rewrite my code, and the result would be much slower and, last
but not least, not as handy to use as lxml. Both HTMLParser and minidom are really
inconvenient.
ga...@gmail.com <ga...@gmail.com> #26
It is quite easy to add elementtree to your project.
It's not the fast c version, but the api is better...
And it's reasonably fast and memory efficient.
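For anyone weighing this option, here is a small sketch of what the stdlib ElementTree can do query-wise. It only supports a limited XPath subset (paths and simple predicates such as the attribute test below, in newer ElementTree versions), not full XPath 1.0. The sample document is invented for illustration:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<catalog>"
    "<book id='a1'><title>First</title></book>"
    "<book id='b2'><title>Second</title></book>"
    "</catalog>"
)

# Simple child path: every <title> directly under a <book> under the root.
titles = [t.text for t in doc.findall("./book/title")]

# Attribute predicate: one of the few XPath features ElementTree understands.
second = doc.find(".//book[@id='b2']/title").text
```

Anything beyond this subset (XPath functions, axes like `ancestor::`, XSLT) is where lxml or a separate XPath library would still be needed.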
io...@gmail.com <io...@gmail.com> #27
ElementTree is merely an XML parser and generator. lxml is a fully featured XML
toolkit - it has literally *everything*. However, that comes at a price: the library
has tons of security holes that the GAE team needs to close before making it available
for GAE.
Keep starring this issue.
ph...@driggle.com <ph...@driggle.com> #28
Maybe you could implement the XML and HTML parsing functions Googlebot uses; I'd
assume they are among the fastest and most fault-tolerant available.
bk...@activemind.com <bk...@activemind.com> #29
I agree with the others who say lxml support is critical. There are too many other
important libraries that rely on it.
I was planning on rolling out a new app to App Engine, but it looks like I might
have to reconsider and use Amazon Services instead.
dg...@gmail.com <dg...@gmail.com> #30
I keep coming back to this issue when searching for how to do XPath using whatever
library is available in App Engine, but it seems to be a big gap in functionality
(libxml2 is not supported, and minidom doesn't cut it either).
ch...@gmail.com <ch...@gmail.com> #31
I was able to get along with BeautifulSoup, but my app was small enough to allow
rewriting. Would really prefer to have lxml and all the modules that plug into it.
da...@gmail.com <da...@gmail.com> #32
@dgtlmoon
There is a pure-Python XPath implementation that works with the standard minidom --
http://code.google.com/p/py-dom-xpath/
ga...@gmail.com <ga...@gmail.com> #33
IMHO, support for XML with *any* full-featured library is a critical step for some
implementations of semantic apps that aim to be part of a Linked Data ecosystem.
Think RDF (http://www.w3.org/TR/2004/REC-rdf-primer-20040210/ )
Yes, there are alternatives and workarounds and "roundabouts" :) ... but that doesn't
seem to be sustainable in the long term.
Our project, e.g., implements a kind of "xml-chain", starting from raw pieces of data
provided by individual users, through a mashup of structured data stored in a remote
triple store and consumed by some front ends.
For now, the only front end *out of the chain* is App Engine, due to the lack of good
XML support.
OK, fellows, I'm not an expert. But I think there are (and there will be) a lot of
projects with similar requirements in the coming years.
And the developers look like geeks-in-love with lxml. And I think they have good
reasons for that geek love :)
cheers
ga...@gmail.com <ga...@gmail.com> #34
[Comment deleted]
ga...@gmail.com <ga...@gmail.com> #35
thanks @daevaorn
py-dom-xpath rocks, and now I'm working on a wrapper that allows CSS selectors :)
http://github.com/gabrielfalcao/dominic
ma...@gmail.com <ma...@gmail.com> #36
There are a lot of amazing 3rd party python-source based libraries, and custom code libraries that are based on lxml - it would be great to have it supported on app-engine (namely pyquery!!). Hope this gets pushed in soon.
ma...@gmail.com <ma...@gmail.com> #37
How do we raise priority?
ry...@gtempaccount.com <ry...@gtempaccount.com> #38
It's been over two years since this request was filed. Is there any chance of at least getting a response from Google, either some kind of ETA or a "no, it'll never be supported", for those still considering GAE? Does the fact that it's acknowledged with a medium priority mean it will be supported... eventually?
ma...@gmail.com <ma...@gmail.com> #39
Guys, given that this looks ultimately unsolvable, I'm considering writing a pure-Python port of pyquery. Any takers interested in this project?
se...@gmail.com <se...@gmail.com> #40
This IS solvable, and the solution is to port Native Client to App Engine.
http://code.google.com/p/nativeclient/
Google devs are smart people; I believe they are going to solve this in the near future.
m3...@gmail.com <m3...@gmail.com> #41
I did find a port of PyQuery that claims to run on the html5 library. Haven't tried it, however.
http://pypi.python.org/pypi/pq.html5/0.5
pr...@gmail.com <pr...@gmail.com> #42
I just found out that I cannot realize my project on GAE, because BeautifulSoup is ultimately too slow.
I wrote about using CherryPy on GAE to dynamically serve images (http://blog.dispatched.ch/2010/08/13/serving-images-dynamically-with-cherrypy-on-google-appengine/), because I am very happy with GAE.
Then I realized that BeautifulSoup is too slow, so I switched to lxml, which I also wrote about (http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/).
Now I read that lxml is not supported. Would you please allow lxml or point to a Python-only HTML parser that does not suck performance-wise?
jo...@gmail.com <jo...@gmail.com> #43
I use html5lib for parsing HTML. It doesn't have any query support though.
Here's an example: http://www.johntantalo.com/blog/strip-tags-with-html5lib/
la...@gmail.com <la...@gmail.com> #44
It is more than 2 years since this issue was entered. Is it really such a big problem to add this library to AppEngine???
pd...@gmail.com <pd...@gmail.com> #45
I'd like this too. I was surprised how slow BeautifulSoup, while nice and intuitive, is, when I profiled it.
[Deleted User] <[Deleted User]> #46
I like GAE and I love Python.
I don't particularly like to parse html and xml, but I sometimes need to. And lxml is the best of breed.
If I can't have lxml, I will have to work my way around, but I hope it is possible to have a tiny answer to this question:
Are you working on letting lxml into the fine company?
no...@gmail.com <no...@gmail.com> #47
lxml is the best way to parse data from the web. Why doesn't GAE provide it?
qb...@gmail.com <qb...@gmail.com> #48
So guys, what is your solution for parsing HTML/XML on GAE?
[Deleted User] <[Deleted User]> #49
I keep dreaming of doing it via google.docs and gdata, but I haven't had the need yet
That would probably only amount to xml, but like I said;-)
Would like to hear other approaches though - interesting subject that has to find a solution
ty...@gmail.com <ty...@gmail.com> #51
The Pythonic objectify API is also a valuable benefit of having access to the lxml library.
http://codespeak.net/lxml/objectify.html
ni...@gmail.com <ni...@gmail.com> #52
I'm thrilled that this has been accepted - anyone have a guesstimate of how long it usually takes for the google team to implement accepted requests?
pr...@google.com <pr...@google.com> #53
Accepted means the engineering team is aware of it. We usually don't provide ETA for new features.
ep...@gmail.com <ep...@gmail.com> #54
Some issues were accepted two years ago yet are still open, so I guess this does not mean anything.
Maybe the 2.7 runtime will change this, who knows. This is PaaS anyway.
pr...@google.com <pr...@google.com> #55
It does mean that it got the attention of the engineering team.
The Priority should also be set accordingly.
el...@gmail.com <el...@gmail.com> #56
Any chance someone on this engineering team could provide an ETA?
ro...@gmail.com <ro...@gmail.com> #57
Hey guys,
I'm not a googler but hope this can help. We know that googlers never give an official ETA for features, so asking doesn't work.
But unofficially they commented during office hours on IRC that lxml (and cPickle, cjson) would be coming when the Python 2.7 runtime is released. That'll happen in the upcoming months. No, no ETA.
You can read the full transcript here:
https://groups.google.com/forum/#!searchin/google-appengine/office$20hours%7Csort:date/google-appengine/ZzUDaN226bU/TVY9Z8M06QsJ
ep...@gmail.com <ep...@gmail.com> #58
That would be great, especially since ElementTree seems to be leaking.
bq...@google.com <bq...@google.com>
ap...@gmail.com <ap...@gmail.com> #59
lxml is REALLY needed
ci...@gmail.com <ci...@gmail.com> #61
I recently converted an old application to use the new Python 2.7 appengine with lxml. It works great!
High Replication Storage is a requirement for using Python 2.7 and my old application instance didn't use that, so I needed to create a new application.
jf...@gmail.com <jf...@gmail.com> #62
I am trying to use lxml with the new Python 2.7 appengine but I did not find in the documentation from where to import the module. Any idea ?
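A hedged sketch in reply to the import question: once lxml is enabled for the app, the usual import is `from lxml import etree`. As far as I know, the Python 2.7 runtime expects third-party libraries like lxml to be requested in the `libraries:` section of app.yaml first, but treat that as an assumption to check against the runtime docs. The fallback below keeps the code running where lxml is not available:

```python
try:
    # Preferred: lxml, where it is installed or enabled for the runtime.
    from lxml import etree
except ImportError:
    # Fallback: the stdlib ElementTree, which covers a smaller API.
    import xml.etree.ElementTree as etree

root = etree.fromstring("<root><item>1</item></root>")
item_text = root.find("item").text
```

Only the calls shared by both libraries (`fromstring`, `find`, `.text`) are used here; lxml-only features such as full XPath would need the first import to succeed.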
Description
to use the builtin xml module. lxml is faster and uses less RAM than the stock one.