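m**********r Posts: 122 | 1 I have a folder containing about 1,000 files. Running the Python statements below produces the error that follows. It seems to be a special-character problem; I have tried other approaches, but none of them fixed it.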
DIR = r'C:\Users\Desktop\data\rec.sport.hockey'
posts = [open(os.path.join(DIR,f)).read() for f in os.listdir(DIR)]
x_train = vectorizer.fit_transform(posts)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte
Traceback (most recent call last):
  File "C:/Users/PycharmProjects/Project3/demo10.py", line 16, in <module>
    x_train = vectorizer.fit_transform(posts)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 739, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 236, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\AppData\Roaming\Python\Python27\site-packages\sklearn\feature_extraction\text.py", line 113, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte
DIR = r'C:\Users\Desktop\data\rec.sport.hockey'
posts = [codecs.open(os.path.join(DIR,f),'r','utf-8') for f in os.listdir(DIR)]
x_train = vectorizer.fit_transform(posts)
Traceback (most recent call last):
  File "C:/Users/PycharmProjects/Project3/demo10.py", line 15, in <module>
    posts = [codecs.open(os.path.join(DIR,f),'r','utf-8') for f in os.listdir(DIR)]
  File "C:\Python27\lib\codecs.py", line 878, in open
    file = __builtin__.open(filename, mode, buffering)
IOError: [Errno 24] Too many open files: 'C:\Users\Desktop\data\rec.sport.hockey\53909'
| Y****a Posts: 243 | 2 I'm not sure; these are just a few suggestions:
Have you tried UTF-16?
Check whether your path is missing a /.
Close each file promptly once you are done with it.
| h*********d Posts: 109 | 3
[Quoting m**********r's post:]
: I have a folder containing about 1,000 files. Running the Python statements below produces the error that follows.
: It seems to be a special-character problem; I have tried other approaches, but none of them fixed it.
: DIR = 'C:\Users\Desktop\data\rec.sport.hockey'
: posts = [open(os.path.join(DIR,f)).read() for f in os.listdir(DIR)]
: x_train = vectorizer.fit_transform(posts)
: UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 240: invalid start byte
: Traceback (most recent call last):
:   File "C:/Users/PycharmProjects/Project3/demo10.py", line 16, in <module>
:     x_train = vectorizer.fit_transform(posts)
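One way to read the second traceback: the list comprehension stores the file objects returned by codecs.open rather than their contents, so none of the roughly 1,000 files is closed before the next one is opened. Below is a minimal sketch of reading each file and closing it straight away. The helper name read_posts and its parameters are made up for illustration, and io.open is used on the assumption that the script should behave the same on Python 2.7 and Python 3.

```python
import io
import os


def read_posts(directory, encoding="utf-8", errors="strict"):
    """Return the decoded text of every regular file in `directory`.

    Hypothetical helper, not part of the original script.
    """
    posts = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        # The `with` block closes the file as soon as it has been read,
        # so only one file is open at any moment.
        with io.open(path, "r", encoding=encoding, errors=errors) as handle:
            posts.append(handle.read())
    return posts
```

With the default errors="strict", a file containing byte 0x80 still raises UnicodeDecodeError. Passing errors="replace" or errors="ignore" is one way past that, at the cost of losing those bytes.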
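| b******g Posts: 88 | 4 It depends on your design: how many files contain special characters, and whether those files matter. Either deal with the encoding, or catch and ignore the exception:
except UnicodeDecodeError: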
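The two options in the last reply could be sketched roughly as follows, assuming the files have been read as raw bytes. The names decode_post, decode_all and the on_error switch are hypothetical, chosen only for this example: "skip" drops a post that fails to decode, and any other value keeps it with the bad bytes replaced by U+FFFD.

```python
def decode_post(raw, encoding="utf-8", on_error="skip"):
    """Decode raw bytes, handling UnicodeDecodeError in one of two ways.

    on_error="skip": return None so the caller can drop the post.
    any other value: keep the post, replacing undecodable bytes with U+FFFD.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        if on_error == "skip":
            return None
        return raw.decode(encoding, "replace")


def decode_all(raw_posts, **kwargs):
    """Decode every post and leave out the ones that were skipped."""
    decoded = (decode_post(raw, **kwargs) for raw in raw_posts)
    return [text for text in decoded if text is not None]
```

The first traceback shows the vectorizer itself calling doc.decode(self.encoding, self.decode_error). If your scikit-learn version exposes these as constructor arguments (recent releases document encoding and decode_error on CountVectorizer), setting decode_error to 'ignore' or 'replace' there may give the same result without a custom loop.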