a “code” for translating between binary and characters
e.g. 1101001 == i
ASCII
American Standard Code for Information Interchange
first published 1963
characteristics: 7-bit, 128 code points (0–127), including 33 control characters
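The notes don't name a programming language, so here is a minimal sketch in Python of the 7-bit mapping, using the `1101001 == i` example from above:

```python
# ASCII maps each character to a 7-bit number (0-127).
code = ord("i")             # code point of "i" -> 105
bits = format(code, "07b")  # 105 as 7 binary digits -> "1101001"
ch = chr(int(bits, 2))      # back from binary to the character
print(code, bits, ch)       # 105 1101001 i
```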
ASCII Table
Where are they used?
reading/writing text based files
console/terminal/shell display
data (e.g. CSV), source code, config files
What encodings exist?
many!
ASCII (American Standard Code for Information Interchange)
Windows-1252
UTF-8 (Unicode Transformation Format – 8-bit)
created in 1992
uses 1-4 8-bit bytes: capable of encoding all 1,112,064 Unicode characters
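A quick Python sketch of the variable-length property (Python chosen only for illustration; the characters are arbitrary examples):

```python
# UTF-8 uses 1-4 bytes per character; plain ASCII stays 1 byte,
# which is what makes UTF-8 backwards-compatible with ASCII.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded)
# "A" -> 1 byte, "é" -> 2, "€" -> 3, "😀" -> 4
```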
What happened? Why is it a problem?
localized/nationalized versions of ASCII would share most characters in common, but assign other locally useful characters to several code points reserved for “national use”
without a standard to stick to, lots of overlap occurred
new standards like Windows-1252 and UTF-8 did the same: they used ASCII as a basis for backwards compatibility, then added more characters on top of it, which made overlaps possible (even likely)
BOM (byte order mark) - a potential solution, but it has its own problems
as a result, it is often impossible to know for certain which encoding was used to encode a file!
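The ambiguity can be demonstrated in a few lines of Python (a sketch, not specific to any one tool): the same bytes are valid in more than one encoding, and nothing in the file says which was intended.

```python
# The same byte sequence decodes to different text under different
# encodings -- the file itself carries no label saying which is right.
data = "café".encode("utf-8")        # b'caf\xc3\xa9'
print(data.decode("utf-8"))          # café        (correct guess)
print(data.decode("windows-1252"))   # cafÃ©       (wrong guess -> mojibake)

# A UTF-8 BOM can mark a file, but not all tools write or expect it:
with_bom = "hi".encode("utf-8-sig")  # prepends the bytes EF BB BF
print(with_bom)                      # b'\xef\xbb\xbfhi'
```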
What problems am I likely to see?
text you read in is garbled, i.e. Mojibake
characters replaced by very similar looking alternate characters
text you see on screen in the Console or Viewer doesn’t match what you read in or write out
comparing strings fails to match when it “should”
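The last problem above is easy to reproduce; a hedged Python sketch (the "café" strings are hypothetical examples) using the standard `unicodedata` module:

```python
import unicodedata

# Two visually identical strings can differ at the code-point level:
s1 = "caf\u00e9"    # "café" with é as a single code point
s2 = "cafe\u0301"   # "café" as e + combining acute accent
print(s1 == s2)     # False -- the comparison fails when it "should" match
# Normalizing both to the same form (here NFC) makes them comparable:
print(unicodedata.normalize("NFC", s2) == s1)  # True
```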
What can I do about it?
Set default text encoding (for saving) to UTF-8
Set file encoding explicitly when reading or writing files
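In Python (shown only as one possible environment; the file path is a throwaway example), setting the encoding explicitly looks like this:

```python
import tempfile
from pathlib import Path

# Always name the encoding instead of relying on the platform default,
# which varies across systems (especially on older Windows setups).
path = Path(tempfile.gettempdir()) / "encoding_demo.txt"
path.write_text("naïve café", encoding="utf-8")   # explicit on write
text = path.read_text(encoding="utf-8")           # explicit on read
print(text)                                       # naïve café
path.unlink()                                     # clean up
```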