I have trouble reading the csv file by python. My csv file has Korean and numbers.
Below is my python code.
import csv
import codecs
csvreader = csv.reader(codecs.open('1.csv', 'rU', 'utf-16'))
for row in csvreader: print(row)First, there was a UnicodeDecodeError when I enter "for row in csvreader" line in the above code.
So I used the code below then the problem seemed to be solved
csvreader = csv.reader(codecs.open('1.csv', 'rU', 'utf-16'))Then I ran into NULL byte error. Then I can't figure out what's wrong with the csv file.
[update] I don't think I changed anything from the previous code but my program shows "UnicodeError: UTF-16 stream does not start with BOM"
When I open the csv by excel I can see the table in proper format (image attached at the botton) but when I open it in sublime Text, below is a snippet of what I get.
504b 0304 1400 0600 0800 0000 2100 6322
f979 7701 0000 d405 0000 1300 0802 5b43
6f6e 7465 6e74 5f54 7970 6573 5d2e 786d
6c20 a204 0228 a000 0200 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000If you need more information about my file, let me know!
I appreciate your help. Thanks in advance :)
csv file shown in excel
csv file shown in sublime text
4 Answers
The problem is that your input file apparently doesn’t start with a BOM (a special character that gets recognizably encoded differently for little-endian vs. big-endian utf-16), so you can’t just use “utf-16” as the encoding, you have to explicitly use “utf-16-le” or “utf-16-be”.
If you don’t do that, codecs will guess, and if it guesses wrong, it’ll try to read each code point backward and get illegal values.
If your posted sample starts at an even offset and contains a bunch of ASCII, it’s little-ending, so use the -le version. (But of course it’s better to look at what it actually is than to guess.)
2Now that you’ve included more of the file in your question, that isn’t a CSV file at all. My guess is that it’s an old-style binary XLS file, but that’s just a guess. If you’re just renaming spam.xls to spam.csv, you can’t do that; you need to export it to CSV format. (If you need help with that, ask on another site that offers help with Excel instead of with programming.)
If you can’t do that for some reason, there are libraries on PyPI to parse XLS files—but if you wanted CSV, and you can export CSV, that’s a better idea.
1After hours of struggling with such an issue, I came to learn that Excel exports data in Multiple CSV formats.
From Excel, please make sure to use 'CSV UTF-8 (Comma delimited)' option while exporting. (You often may want to use this type than the other CSV options).
Once you are sure of the UTF-type, in this case, 'UTF-8', go back to your python script and change encoding to 'UTF-8', though I found skipping this parameter also works.
with open('schools_dataset.csv', encoding='utf-8') as csv_file:
# continue opening the file The file begins with a PKZIP signature so it is actually an XLSX file.
This is great because instead of a CSV file, where you would have to know the character encoding, headers, column types, delimiter, text quoting and escape rules, and line endings, you can just open it and programs can see the structure of the data.