Frequently Asked Questions
==========================

See also the notes on compatibility_ with ElementTree_.

.. _compatibility: compatibility.html
.. _ElementTree:   http://effbot.org/zone/element-index.htm


#) Is there a tutorial?

   There is a `tutorial for ElementTree`_ which also works for lxml.etree.
   The `API documentation`_ also contains many examples.

   .. _`tutorial for ElementTree`: http://effbot.org/zone/element.htm
   .. _`API documentation`:        api.html


#) Where can I find more documentation about lxml?

   lxml implements the well-known `ElementTree API`_ and tries to follow its
   documentation as closely as possible, so most of the ElementTree
   documentation applies to lxml as well.  There are a few issues where lxml
   cannot maintain compatibility.  They are described in the compatibility_
   documentation.  The lxml-specific extensions to the API are described by
   individual files in the ``doc`` directory of the distribution and on
   `the web page`_.

   .. _`ElementTree API`: http://effbot.org/zone/element-index.htm
   .. _`the web page`:    http://codespeak.net/lxml/#documentation


#) My application crashes! Why does lxml.etree do that?

   One of the goals of lxml is "no segfaults", so if there is no clear warning
   in the documentation that you were doing something potentially harmful, you
   have found a bug and we would like to hear about it.  Please report this
   bug to the mailing list.  See the next section on how to do that.


#) I think I have found a bug in lxml. What should I do?

   a) First, you should look at the `current developer changelog`_ to see if
      this is a known problem that has already been fixed in the SVN trunk.

      .. _`current developer changelog`: http://codespeak.net/svn/lxml/trunk/CHANGES.txt

   b) If you are using threads, please see the following section to check if
      you touch on one of the potential pitfalls.

   c) Otherwise, we would really like to hear about it.  Please report it to
      the `mailing list`_ so that we can fix it.  It is very helpful in this
      case if you can come up with a short code snippet that demonstrates your
      problem.  Please also report the version of lxml, libxml2 and libxslt
      that you are using by calling this::

          from lxml import etree
          print("lxml.etree:       ", etree.LXML_VERSION)
          print("libxml used:      ", etree.LIBXML_VERSION)
          print("libxml compiled:  ", etree.LIBXML_COMPILED_VERSION)
          print("libxslt used:     ", etree.LIBXSLT_VERSION)
          print("libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION)

      .. _`mailing list`: http://codespeak.net/mailman/listinfo/lxml-dev


#) Can I use threads to concurrently access the lxml API?

   Yes, although not carelessly.

   lxml frees the GIL (Python's global interpreter lock) internally when
   parsing from disk and memory, as long as you use either the default parser
   (which is replicated for each thread) or create a parser for each thread
   yourself.  lxml also allows concurrency during validation (RelaxNG and
   XMLSchema) and XSL transformation.  You can share RelaxNG, XMLSchema and
   XSLT objects between threads.  While you can also share parsers between
   threads, this will serialize the access to each of them, so it is better to
   copy() parsers or to use the default parser.  Note that access to the XML()
   and HTML() functions is always serialized.  If you need to parse from
   strings, use StringIO.
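
   As a minimal sketch of the per-thread parser approach (the sample
   documents and the ``count_children`` helper are made up for illustration,
   and ``BytesIO`` is assumed here as the Python 3 counterpart of ``StringIO``
   for byte strings), each thread below creates its own parser and only reads
   from the tree it parsed::

       import threading
       from io import BytesIO
       from lxml import etree

       # Hypothetical in-memory documents, one per worker thread.
       documents = [b"<root><a/></root>", b"<root><a/><b/></root>", b"<root/>"]
       child_counts = {}

       def count_children(index, data):
           # Each thread builds its own parser instead of sharing one.
           parser = etree.XMLParser()
           tree = etree.parse(BytesIO(data), parser)
           # The tree is only read, and it never leaves this thread.
           child_counts[index] = len(tree.getroot())

       threads = [threading.Thread(target=count_children, args=(i, data))
                  for i, data in enumerate(documents)]
       for thread in threads:
           thread.start()
       for thread in threads:
           thread.join()

   Only plain integers are handed back to the main thread here, so no tree is
   shared or modified across threads.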

   Warning: You should generally avoid modifying a tree in any thread other
   than the one it was generated in.  Although this should work in many cases,
   there are certain scenarios where the termination of a thread that parsed a
   tree can crash the application if subtrees of this tree are moved to other
   documents.  You should be on the safe side when passing trees between
   threads if you either

   a) do not modify these trees and do not move their elements to other trees, or
   b) do not terminate threads while the trees they parsed are still in use


#) Why doesn't the ``pretty_print`` option reformat my XML output?

   Pretty printing (or formatting) an XML document means adding white space to
   the content.  These modifications are harmless if they only impact elements
   in the document that do not carry (text) data.  They corrupt your data if
   they impact elements that contain data.  If lxml cannot distinguish between
   whitespace and data, it will not alter your data.  Whitespace is therefore
   only added between nodes that do not contain data.  This is always the case
   for trees constructed element-by-element, so no problems should be expected
   here.  For parsed trees, a good way to ensure that no conflicting
   whitespace is left in the tree is the ``remove_blank_text`` option::

       >>> parser = etree.XMLParser(remove_blank_text=True)
       >>> tree = etree.parse(file, parser)

   This will allow the parser to drop blank text nodes when constructing the
   tree.  If you now call a serialization function to pretty print this tree,
   lxml can add fresh whitespace to the XML tree to indent it.
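
   A small sketch of the whole round trip, using a made-up in-memory document
   in place of a file::

       from io import BytesIO
       from lxml import etree

       # Hypothetical input that already contains some indentation whitespace.
       xml = b"<root>\n <a><b/></a>\n</root>"

       # Let the parser drop the blank text nodes while building the tree.
       parser = etree.XMLParser(remove_blank_text=True)
       tree = etree.parse(BytesIO(xml), parser)

       # With the blank text gone, the serializer is free to indent the tree.
       pretty = etree.tostring(tree, pretty_print=True)
       print(pretty.decode("ascii"))

   Parsing the same input with a default parser keeps the original whitespace
   as text, which restricts where new indentation can be added.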


#) Why can't lxml parse my XML from unicode strings?

   lxml can read Python unicode strings and even tries to support them if
   libxml2 does not.  However, if the unicode string declares an XML encoding
   internally (``<?xml encoding="..."?>``), parsing is bound to fail, as this
   encoding is most likely not the real encoding used in Python unicode.  The
   same is true for HTML unicode strings that contain charset meta tags.  Note
   that Python uses different encodings for unicode on different platforms, so
   even specifying the real internal unicode encoding is not portable between
   Python interpreters.  Don't do it.

   Python unicode strings with XML data or HTML data that carry encoding
   information are broken.  lxml will not parse them.  You must provide
   parsable data in a valid encoding.
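
   The difference can be sketched as follows.  The document content is made
   up, and the exception type is an assumption based on the behaviour of
   current lxml releases under Python 3::

       from lxml import etree

       declared = '<?xml version="1.0" encoding="UTF-8"?><root>text</root>'

       # A unicode string that carries an encoding declaration is rejected.
       try:
           etree.fromstring(declared)
           rejected = False
       except ValueError:
           rejected = True

       # The same document as an encoded byte string is parsable ...
       root_from_bytes = etree.fromstring(declared.encode("utf-8"))

       # ... and so is a unicode string without the declaration.
       root_from_text = etree.fromstring("<root>text</root>")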


#) How can I find out which namespace prefixes are used in a document?

   You can traverse the document (``getiterator()``) and collect the prefix
   attributes from all Elements into a set.  However, it is unlikely that you
   really want to do that.  You do not need these prefixes, honestly.  You
   only need the namespace URIs.  All namespace comparisons use these, so feel
   free to make up your own prefixes when you use XPath expressions or
   extension functions.

   The only place where you might consider specifying prefixes is the
   serialization of Elements that were created through the API.  Here, you can
   specify a prefix mapping through the ``nsmap`` argument when creating the
   root Element.  Its children will then inherit this prefix for
   serialization.
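
   A minimal sketch of both points, assuming a made-up namespace URI and the
   prefix ``ex``.  It uses ``iter()`` for the traversal::

       from lxml import etree

       NS = "http://example.org/ns"   # hypothetical namespace URI

       # Declare the prefix once, on the root Element.
       root = etree.Element("{%s}root" % NS, nsmap={"ex": NS})
       child = etree.SubElement(root, "{%s}child" % NS)

       # The child reuses the prefix declared on the root.
       serialized = etree.tostring(root)

       # Collect the prefixes used by all Elements in the tree.
       prefixes = set(el.prefix for el in root.iter())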


#) How can I specify a default namespace for XPath expressions?

   You can't.  In XPath, there is no such thing as a default namespace.  Just
   use an arbitrary prefix and let the namespace dictionary of the XPath
   evaluators map it to your namespace.  See also the question above.
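
   For illustration, a short sketch with a made-up namespace URI and the
   arbitrary prefix ``x``::

       from lxml import etree

       # The document uses a default namespace without any prefix.
       root = etree.XML(b'<root xmlns="http://example.org/ns"><child/></root>')

       # An unprefixed name in XPath does not match a namespaced element.
       unprefixed = root.xpath("//child")

       # Any prefix works as long as it maps to the right namespace URI.
       prefixed = root.xpath("//x:child",
                             namespaces={"x": "http://example.org/ns"})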


#) What are the ``findall()`` and ``xpath()`` methods on Element(Tree)?

   ``findall()`` is part of the original `ElementTree API`_.  It supports a
   `simple subset of the XPath language`_, without predicates, conditions and
   other advanced features.  It is very handy for finding specific tags in a
   tree.  Another important difference is namespace handling, which uses the
   ``{namespace}tagname`` notation.  This is not supported by XPath.  The
   ``findall()``, ``find()`` and ``findtext()`` methods are compatible with
   other ElementTree implementations and allow writing portable code that
   runs on ElementTree, cElementTree and lxml.etree.

   ``xpath()``, on the other hand, supports the complete power of the XPath
   language, including predicates, XPath functions and Python extension
   functions.  The syntax is defined by the `XPath specification`_.  If you
   need the expressiveness and selectivity of XPath, the ``xpath()`` method,
   the ``XPath`` class and the ``XPathEvaluator`` are the best choice_.

   .. _`simple subset of the XPath language`: http://effbot.org/zone/element-xpath.htm
   .. _`XPath specification`:                 http://www.w3.org/TR/xpath
   .. _choice:                                performance.html#xpath
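
   The two notations side by side, as a sketch on a made-up document::

       from lxml import etree

       root = etree.XML(
           b'<root xmlns:n="http://example.org/ns">'
           b'<n:item id="1"/><n:item id="2"/><other/>'
           b'</root>')

       # findall(): ElementTree path syntax with {namespace}tagname.
       items = root.findall("{http://example.org/ns}item")

       # xpath(): full XPath with a predicate and an explicit prefix mapping.
       second = root.xpath("n:item[@id = '2']",
                           namespaces={"n": "http://example.org/ns"})

   The prefix ``n`` in the XPath expression is chosen freely.  Only the
   namespace URI it maps to has to match the document.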


#) Why doesn't ``findall()`` support full XPath expressions?

   It was decided that it is more important to keep compatibility with
   ElementTree_ to simplify code migration between the libraries.  The main
   difference compared to XPath is the ``{namespace}tagname`` notation used in
   ``findall()``, which is not valid XPath.

   ElementTree and lxml.etree use the same ``findall()`` implementation, which
   ensures 100% compatibility.  Note that ``findall()`` is `so fast`_ in lxml
   that a native implementation would not bring any performance benefits.

   .. _`so fast`: performance.html#tree-traversal


#) What is the difference between str(xslt(doc)) and xslt(doc).write()?

   The str() implementation of the XSLTResultTree class (a subclass of
   ElementTree) knows about the output method chosen in the stylesheet
   (xsl:output); write() does not.  If you call write(), the result will be a
   normal XML tree serialization in the requested encoding.  Calling this
   method may also fail for XSLT results that are not XML trees (e.g. string
   results).

   If you call str(), it will return the serialized result as specified by the
   XSL transform.  This correctly serializes string results to encoded Python
   strings and honours ``xsl:output`` options like ``indent``.  This almost
   certainly does what you want, so you should only use ``write()`` if you are
   sure that the XSLT result is an XML tree and you want to override the
   encoding and indentation options requested by the stylesheet.
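
   A minimal sketch of the str() case under Python 3, with a made-up
   stylesheet that requests text output (the names ``stylesheet``,
   ``transform`` and ``doc`` are chosen for illustration only)::

       from lxml import etree

       stylesheet = etree.XML(b"""\
       <xsl:stylesheet version="1.0"
           xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
         <xsl:output method="text"/>
         <xsl:template match="/">
           <xsl:text>Hello, </xsl:text>
           <xsl:value-of select="/root/name"/>
         </xsl:template>
       </xsl:stylesheet>
       """)
       transform = etree.XSLT(stylesheet)

       doc = etree.XML(b"<root><name>world</name></root>")
       result = transform(doc)

       # str() honours xsl:output, so this is the plain text result
       # and not an XML serialization.
       text = str(result)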


#) Why is my application so slow?

   lxml.etree is a very fast library for processing XML.  There are, however,
   `a few caveats`_ involved in the mapping of the powerful libxml2 library to
   the simple and convenient ElementTree API.  Not all operations are as fast
   as the simplicity of the API might suggest.  The `benchmark page`_ has a
   comparison to other ElementTree implementations and a number of tips for
   performance tweaking.

   .. _`a few caveats`:  performance.html#the-elementtree-api
   .. _`benchmark page`: performance.html

