Text Encoding

CJ Yetman

Communicating with computers

many text encodings exist! For example:
ASCII (American Standard Code for Information Interchange)
Windows-1252
UTF-8 (Unicode Transformation Format – 8-bit)
- created in 1992
- uses one to four 8-bit bytes per character: capable of encoding all 1,112,064 valid Unicode code points
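To make the variable-width idea concrete, here is a minimal Python sketch (Python is used only for illustration; the sample characters and the helper name `utf8_length` are assumptions, not part of the original material). It encodes one character from each length class and checks where the 1,112,064 figure comes from: the full code point range minus the surrogate range.

```python
# Sample characters chosen for illustration only (not from the original slides):
# a Latin letter, an accented letter, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs to encode a single character."""
    return len(ch.encode("utf-8"))


# Code points run from U+0000 to U+10FFFF (0x110000 values); the 0x800
# surrogate code points (U+D800 to U+DFFF) cannot be encoded in UTF-8.
ENCODABLE_CODE_POINTS = 0x110000 - 0x800

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s) -> {encoded.hex(' ')}")

print(ENCODABLE_CODE_POINTS)
```

Characters in the ASCII range take a single byte with the same value as in ASCII, which is what makes UTF-8 backwards compatible with ASCII.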

localized/national variants of ASCII shared most characters, but assigned other locally useful characters to the few code points reserved for “national use”
with no single standard to follow, the same code point often ended up meaning different characters in different encodings
newer encodings like Windows-1252 and UTF-8 did the same: they kept ASCII as a base for the sake of backwards compatibility and added more on top, which made such conflicts possible/likely
BOM (byte order mark): a potential solution, but it has its own problems
as a result, it is now very hard to know for certain which encoding was used to encode a file!
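The points above can be sketched in a few lines of Python (an illustrative sketch only; the helper `decode_with`, the sample bytes, and the choice of encodings are assumptions). The same bytes are decoded under several encodings: pure-ASCII bytes agree everywhere, while a single byte above 0x7F is read differently, or rejected, depending on the encoding. The last part shows a UTF-8 BOM and one of its pitfalls: a decoder that does not expect it leaves an invisible U+FEFF character at the start of the text.

```python
import codecs


def decode_with(raw: bytes, encodings):
    """Decode the same bytes under several encodings; None marks a decoding error."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


encodings = ["ascii", "cp1252", "latin-1", "cp437", "utf-8"]

ascii_only = b"cafe"     # every byte is in the 7-bit ASCII range
high_byte = b"caf\xe9"   # 0xE9 is outside ASCII, so its meaning depends on the encoding

ascii_results = decode_with(ascii_only, encodings)
high_results = decode_with(high_byte, encodings)
print(ascii_results)
print(high_results)

# A UTF-8 BOM is the byte sequence EF BB BF at the start of a file.
with_bom = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
kept = with_bom.decode("utf-8")          # BOM survives as a leading U+FEFF
stripped = with_bom.decode("utf-8-sig")  # this codec removes a leading BOM
print(repr(kept), repr(stripped))
```

Several of the decodes succeed without any error while producing different text, which is why the bytes alone rarely tell you which encoding was intended.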

text you read in is garbled, i.e. Mojibake
characters replaced by very similar-looking alternative characters
text you see on screen in the Console or Viewer doesn’t match what you read in or write out
string comparisons fail to match when they “should”
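These symptoms can be reproduced in a short Python sketch (illustrative only; the sample strings and the helper `same_text` are assumptions, and the causes shown are common ones rather than an exhaustive list). It produces mojibake by decoding UTF-8 bytes as Windows-1252, shows two lookalike characters that are different code points, and shows two visually identical strings that compare unequal until both are normalized.

```python
import unicodedata

# Mojibake: UTF-8 bytes decoded with the wrong encoding (here Windows-1252).
original = "caf\u00e9"
mojibake = original.encode("utf-8").decode("cp1252")
# If the wrong guess is known, the damage can sometimes be reversed.
repaired = mojibake.encode("cp1252").decode("utf-8")

# Lookalikes: Latin capital A and Cyrillic capital A render almost identically.
latin_a = "A"          # U+0041
cyrillic_a = "\u0410"  # U+0410

# Same appearance, different code points: precomposed vs. combining accent.
composed = "caf\u00e9"     # e-acute as a single code point
decomposed = "cafe\u0301"  # plain e followed by a combining acute accent


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(repr(mojibake), repr(repaired))
print(latin_a == cyrillic_a)
print(composed == decomposed, same_text(composed, decomposed))
```

Normalization only helps with the composed/decomposed case; lookalike characters from different scripts remain distinct after normalization.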

live_demos/2025-10-21_text_encoding_demo/encodings_experiment.R

ASCII article on Wikipedia: https://en.wikipedia.org/wiki/ASCII
ASCII table shown above: https://en.wikipedia.org/wiki/ASCII#/media/File:ASCII_Table_(suitable_for_printing).svg
UNICODE
Mojibake - https://en.wikipedia.org/wiki/Mojibake