Unicode
Since the early days of Python 2, unicode has been part of all default Python builds. It allows developers to write applications that deal with non-ASCII characters in a straightforward way. But working with unicode requires basic knowledge about the matter, especially when working with libraries that do not support it.
Werkzeug uses unicode internally everywhere text data is assumed, even though the HTTP standard is not unicode aware. Basically all incoming data is decoded from the specified charset (utf-8 by default) so that you don't operate on bytestrings any more. Outgoing unicode data is then encoded into the target charset again.
Unicode in Python
In Python 2 there are two basic string types: str and unicode. str may carry encoded unicode data, but it is always represented in bytes, whereas the unicode type does not contain bytes but code points. What does this mean? Imagine you have the German umlaut ö. In ASCII you cannot represent that character, but in the latin-1 and utf-8 character sets you can; however, the encoded representations look different:
>>> u'ö'.encode('latin1')
'\xf6'
>>> u'ö'.encode('utf-8')
'\xc3\xb6'
So an ö might look totally different depending on the encoding, which makes it hard to work with. The solution is to use the unicode type (as we did above, note the u prefix before the string). The unicode type does not store the bytes for ö but the information that this is a LATIN SMALL LETTER O WITH DIAERESIS.
Doing len(u'ö') will always give us the expected 1, but len('ö') might give different results depending on the encoding of 'ö'.
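The difference can be illustrated directly; a small sketch (the byte counts assume the encodings shown):

```python
# The umlaut as a single unicode code point (U+00F6).
text = u'\xf6'

# One code point, regardless of how it would be encoded.
assert len(text) == 1

# The byte length depends on the encoding.
assert len(text.encode('utf-8')) == 2    # two bytes: '\xc3\xb6'
assert len(text.encode('latin1')) == 1   # one byte: '\xf6'
```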
Unicode in HTTP
The problem with unicode is that HTTP does not know what unicode is. HTTP is limited to bytes, but this is not a big problem as Werkzeug automatically decodes and encodes all incoming and outgoing data for us. Basically, this means that data sent from the browser to the web application is by default decoded from a utf-8 bytestring into a unicode string. Data sent from the application back to the browser that is not yet a bytestring is then encoded back to utf-8.
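Conceptually, the round trip looks like this (a sketch with hand-encoded UTF-8 bytes, not actual Werkzeug internals):

```python
# Bytes as they arrive from the browser (UTF-8 encoded "Grüße").
incoming = b'Gr\xc3\xbc\xc3\x9fe'

# Decoding on the way in: inside the application we only see unicode.
text = incoming.decode('utf-8')
assert text == u'Gr\xfc\xdfe'

# On the way out, the unicode string is encoded back to bytes.
outgoing = text.encode('utf-8')
assert outgoing == incoming
```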
Usually this “just works” and we don't have to worry about it, but there are situations where this behavior is problematic. For example, the Python 2 IO layer is not unicode aware. This means that whenever you work with data from the file system you have to decode it properly. The correct way to load a text file from the file system looks like this:
f = open('/path/to/the_file.txt', 'rb')
try:
    text = f.read().decode('utf-8')  # assuming the file is utf-8 encoded
finally:
    f.close()
There is also the codecs module, which provides an open function that decodes automatically from the given encoding.
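A minimal sketch of that approach; it writes a temporary file first so the example is self-contained (the file name is made up):

```python
import codecs
import os
import tempfile

# Create a UTF-8 encoded text file to read back.
path = os.path.join(tempfile.mkdtemp(), 'the_file.txt')
with codecs.open(path, 'w', encoding='utf-8') as f:
    f.write(u'Gr\xfc\xdfe')

# codecs.open decodes transparently while reading, so `text`
# is already a unicode string.
with codecs.open(path, 'r', encoding='utf-8') as f:
    text = f.read()
assert text == u'Gr\xfc\xdfe'
```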
Error Handling
From Werkzeug 0.3 onwards you can further control the way Werkzeug works with unicode. In the past, Werkzeug silently ignored encoding errors on incoming data. This decision was made to avoid internal server errors if the user tampered with the submitted data. However, there are situations where you want to abort with a 400 BAD REQUEST instead of silently ignoring the error.
All the functions that do internal decoding now accept an errors keyword argument that behaves like the errors parameter of the builtin string method decode. The following values are possible:
ignore
    This is the default behavior and tells the codec to silently ignore
    characters that it doesn't understand.
replace
    The codec will replace unknown characters with a replacement character
    (U+FFFD REPLACEMENT CHARACTER).
strict
    Raise an exception if decoding fails.
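The three modes behave as follows on bytes that are not valid utf-8 (a sketch using plain bytestring decoding rather than Werkzeug's wrappers):

```python
# 0xff can never appear in valid UTF-8.
data = b'f\xff'

# ignore: the offending byte is silently dropped.
assert data.decode('utf-8', 'ignore') == u'f'

# replace: the offending byte becomes U+FFFD.
assert data.decode('utf-8', 'replace') == u'f\ufffd'

# strict: decoding raises an exception.
try:
    data.decode('utf-8', 'strict')
except UnicodeDecodeError:
    pass
else:
    raise AssertionError('expected a UnicodeDecodeError')
```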
Unlike regular Python decoding, Werkzeug does not raise a UnicodeDecodeError [http://docs.python.org/dev/library/exceptions.html#UnicodeDecodeError] if the decoding fails, but an HTTPUnicodeError, which is a direct subclass of UnicodeError and the BadRequest HTTP exception. The reason is that if this exception is not caught by the application but a catch-all for HTTP exceptions exists, a default 400 BAD REQUEST error page is displayed.
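The effect of the dual inheritance can be sketched with stand-in classes (these are not Werkzeug's actual implementations):

```python
class BadRequestSketch(Exception):
    # Stand-in for werkzeug.exceptions.BadRequest.
    code = 400

class HTTPUnicodeErrorSketch(UnicodeError, BadRequestSketch):
    # Stand-in for Werkzeug's HTTPUnicodeError: it is both a
    # UnicodeError and an HTTP exception at the same time.
    pass

# A catch-all for HTTP exceptions also catches the decoding error,
# so a default 400 page can be rendered.
try:
    raise HTTPUnicodeErrorSketch('decoding failed')
except BadRequestSketch as exc:
    assert exc.code == 400
```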
There is an additional error handling mode available, a Werkzeug extension to the regular codec error handling, called fallback. Often you want to use utf-8 but also support latin1 as a legacy encoding if decoding fails. For this case you can use the fallback error handling. For example, you can specify 'fallback:iso-8859-15' to tell Werkzeug it should try iso-8859-15 if utf-8 fails. If this decoding fails too (which should not happen for most legacy charsets such as iso-8859-15), the error is silently ignored as if the error handling was ignore.
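Werkzeug implements this internally; the idea can be sketched with a hypothetical helper (decode_with_fallback is not a Werkzeug function):

```python
def decode_with_fallback(data, charset='utf-8', fallback='iso-8859-15'):
    """Hypothetical helper mirroring the 'fallback:' error handling:
    try the primary charset, then the legacy one, and finally give up
    gracefully as if the error handling were 'ignore'."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        try:
            return data.decode(fallback)
        except UnicodeDecodeError:
            return data.decode(charset, 'ignore')

# 0xf6 is invalid UTF-8 on its own, but is an 'ö' in iso-8859-15.
assert decode_with_fallback(b'\xf6') == u'\xf6'
```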
Further details are available as part of the API documentation of the concrete implementations of the functions or classes working with unicode.
Request and Response Objects
As request and response objects usually are the central entities of Werkzeug-powered applications, you can change the default encoding Werkzeug operates on by subclassing these two classes. For example, you can easily set the application to utf-7 and strict error handling:
from werkzeug.wrappers import BaseRequest, BaseResponse

class Request(BaseRequest):
    charset = 'utf-7'
    encoding_errors = 'strict'

class Response(BaseResponse):
    charset = 'utf-7'
Keep in mind that the error handling is only customizable for decoding, not encoding. If Werkzeug encounters an encoding error it will raise a UnicodeEncodeError [http://docs.python.org/dev/library/exceptions.html#UnicodeEncodeError]. It's your responsibility not to create data that cannot be represented in the target charset (a non-issue with unicode encodings such as utf-8).
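For example, encoding a character that does not exist in the target charset fails (a sketch):

```python
# ASCII cannot represent the umlaut, so strict encoding fails ...
try:
    u'\xf6'.encode('ascii')
except UnicodeEncodeError:
    pass
else:
    raise AssertionError('expected a UnicodeEncodeError')

# ... while a unicode encoding such as UTF-8 always succeeds.
assert u'\xf6'.encode('utf-8') == b'\xc3\xb6'
```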