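A few concepts
ASCII:
Standard ASCII uses only 7 bits per character, so it can encode at most 128 characters. Extended ASCII uses 8 bits per character, so it can still encode at most 256 characters.
Unicode:
Assigns every character a unique number (a code point). Common encoded forms use 2 or even 4 bytes per character, so all of the world's characters can be encoded in one unified scheme.
UTF:
Unicode Transformation Format. It specifies how to turn Unicode code points into a byte sequence suitable for file storage and network transmission (unicode -> str). Other encodings such as gb2312, gb18030, and big5 serve the same purpose as UTF; only the encoding scheme differs.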
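To make the role of an encoding concrete, here is a minimal sketch in which the same two characters become different byte sequences under UTF-8 and gb18030. It is written for Python 3 (an assumption, since this post describes Python 2), where the Python 2 unicode type is called str and the Python 2 str type corresponds to bytes. The variable names are illustrative only.

```python
# Python 3 sketch: one text, two encodings, two different byte sequences.
text = "\u4e2d\u6587"  # the two characters "中文", written as escapes

utf8_bytes = text.encode("utf-8")    # 3 bytes per character for this text
gb_bytes = text.encode("gb18030")    # 2 bytes per character for this text

print(len(text), len(utf8_bytes), len(gb_bytes))  # 2 6 4

# Decoding with the matching codec recovers the same text in both cases.
assert utf8_bytes.decode("utf-8") == gb_bytes.decode("gb18030") == text
```

The code points stay the same; only the byte representation chosen for storage or transmission changes.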
Python has two data types for strings: one is str, the other is unicode. Both are sequence types, as described in the Python Language Ref:
Strings
The items of a string are characters. There is no separate
character type; a character is represented by a string of one item.
Characters represent (at least) 8-bit bytes. The built-in functions
chr() and ord() convert between characters and nonnegative integers
representing the byte values. Bytes with the values 0-127 usually
represent the corresponding ASCII values, but the interpretation of
values is up to the program. The string data type is also used to
represent arrays of bytes, e.g., to hold data read from a file.
(On systems whose native character set is not ASCII, strings
may use EBCDIC in their internal representation, provided the
functions chr() and ord() implement a mapping between ASCII and
EBCDIC, and string comparison preserves the ASCII order. Or perhaps
someone can propose a better rule?)
Unicode
The items of a Unicode object are Unicode code units. A
Unicode code unit is represented by a Unicode object of one item and
can hold either a 16-bit or 32-bit value representing a Unicode
ordinal (the maximum value for the ordinal is given in sys.maxunicode,
and depends on how Python is configured at compile time). Surrogate
pairs may be present in the Unicode object, and will be reported as
two separate items. The built-in functions unichr() and ord() convert
between code units and nonnegative integers representing the Unicode
ordinals as defined in the Unicode Standard 3.0. Conversion from and
to other encodings are possible through the Unicode method encode()
and the built-in function unicode().
A few sentences in there are worth singling out:
"The items of a string are characters", "The items of a Unicode object
are Unicode code units", "The string data type is also used to
represent arrays of bytes, e.g., to hold data read from a file."
The first two sentences say what the items of str and unicode are (both are sequences). A sequence's __len__ function returns exactly the number of items in the sequence. With that in mind, len('abcd') == 4 and len(u'我是中文') == 4 are easy to understand.
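The difference between counting characters and counting bytes can be sketched as follows. This is a Python 3 analogue (an assumption; in the Python 2 described here the two objects would be a unicode and a str), and the names are illustrative.

```python
# Python 3 sketch: len() counts items, and the item type differs.
text = "\u6211\u662f\u4e2d\u6587"  # "我是中文", four characters
encoded = text.encode("utf-8")      # the same text as UTF-8 bytes

text_len = len(text)      # number of characters
byte_len = len(encoded)   # number of bytes

print(text_len, byte_len)  # 4 12
```

Each of these four characters takes three bytes in UTF-8, which is why the byte string reports a length of 12.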
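The third sentence tells us that data read from or written to a file is represented as a str, that is, as an array of bytes. This is not limited to file operations; I believe the same applies to network transmission. That is why a unicode string has to be encoded before it is written to a file or sent over the network.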
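A minimal sketch of that round trip, assuming Python 3 and a temporary file (the post itself targets Python 2, where the bytes object below would be a str): the text is encoded before it is written, and decoded after it is read back.

```python
import os
import tempfile

# Python 3 sketch: files opened in binary mode carry bytes, not text.
text = "\u4e2d\u6587"  # "中文"

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))   # encode: text -> bytes
    with open(path, "rb") as f:
        raw = f.read()                  # what comes back is bytes
finally:
    os.remove(path)

restored = raw.decode("utf-8")          # decode: bytes -> text
print(type(raw).__name__, len(raw), restored == text)  # bytes 6 True
```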
Encoding and decoding in Python are simply conversions between these two forms, unicode and str. Encoding is unicode -> str; conversely, decoding is str -> unicode.
The remaining question is when to encode or decode. Some libraries work in unicode, so when we transmit their return values or write them to a file, we have to encode them into a suitable type.
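About the "coding declaration" at the top of a file, that is, the # -*- coding: -*- line: Python 2 assumes script files are ASCII-encoded by default, so when a file contains characters outside the ASCII range, a coding declaration is needed to correct this.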
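About the default encoding (the value returned by sys.getdefaultencoding()): it is used whenever decoding happens without an explicitly specified codec. For example, take the following code: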
#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'  # note that s here is of type str, not unicode
s.encode('gb18030')
This line re-encodes s as gb18030, i.e. it performs a unicode -> str conversion. Because s is already a str, Python first decodes s to unicode automatically and then encodes the result as gb18030. Since the decoding is done implicitly and we did not specify a codec, Python uses the default encoding. In most cases the default encoding is ASCII, and if s is not encoded that way, an error occurs.
In the case above, my default encoding is ascii, while s is encoded the same way as the file, in utf8, so it fails:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position
0: ordinal not in range(128)
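The failing step is the implicit ASCII decode. A minimal sketch that triggers the same decode explicitly, assuming Python 3 (where the implicit conversion of Python 2 does not exist, so the decode call is written out by hand):

```python
# Python 3 sketch: decoding UTF-8 bytes with the ASCII codec fails
# on the first byte outside the 0-127 range.
raw = "\u4e2d\u6587".encode("utf-8")  # UTF-8 bytes of "中文"

try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The first byte of the UTF-8 form of "中" is 0xe4, which matches the byte and position reported in the traceback above.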
There are two ways to fix this error:
The first is to state the encoding of s explicitly
#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
# Decode explicitly as utf-8 first, then encode as gb18030
s.decode('utf-8').encode('gb18030')
The second is to change the default encoding to match the file's encoding
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)  # Python 2.5 removes sys.setdefaultencoding after initialization, so sys has to be reloaded
sys.setdefaultencoding('utf-8')
s = '中文'  # named s rather than str to avoid shadowing the built-in type
s.encode('gb18030')
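The first fix (decode explicitly, then encode) carries over directly to Python 3. The sketch below assumes Python 3 and wraps the two steps in a hypothetical helper named transcode, which is not part of any library:

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes by decoding to text first (hypothetical helper)."""
    return data.decode(source_encoding).encode(target_encoding)


utf8_bytes = "\u4e2d\u6587".encode("utf-8")  # UTF-8 bytes of "中文"
gb_bytes = transcode(utf8_bytes, "utf-8", "gb18030")

print(gb_bytes)  # b'\xd6\xd0\xce\xc4'
```

Naming both codecs at the call site keeps the conversion independent of any process-wide default.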