Clean Text in Python3

A pain in the ass. This post summarizes “best” approaches to clean text data in Python3.

It will not cover deprecated Python2 syntax. For example, string.maketrans has a different usage in Python2; it is not discussed here.

Is that a string or a Unicode?

Reference here.

When you see a string in Python2, there are two possibilities:

  • ASCII strings: every character in the string is one byte. Look up its hexadecimal value in the ASCII table.
  • Unicode strings: every character is a code point in the Unicode table and is stored as one or more bytes, depending on the encoding. There are several encodings; the most popular one is UTF-8.
    • Example: 猫 is represented by three bytes in UTF-8. When Python2 reads this as a plain string, it gets it wrong: it counts the three bytes as three characters when there is in fact just one, because Python2 uses ASCII to decode by default.
      • To produce the correct representation, use x.decode('utf-8')

String vs sequence of bytes in Python2

  • A string is a sequence of Unicode code points, i.e. a sequence of characters. Code points are abstract concepts and cannot be stored on disk directly.
  • Bytes are actual numbers; they can be stored on disk.
  • Anything has to be mapped to bytes to be stored on a computer.
  • To map code points to bytes, use an encoding such as UTF-8.
  • To convert bytes back to a string, use decoding.

String vs sequences of bytes in Python3

  • In sum, in Python 3, str represents a Unicode string, while the bytes type is a sequence of bytes. This naming scheme makes a lot more intuitive sense.
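
To make the str/bytes distinction concrete, here is a minimal sketch in Python 3 using the 猫 example from earlier. The variable names are only for illustration.

```python
# Python 3: str holds Unicode code points, bytes holds raw byte values.
text = "猫"                     # a str containing one code point
raw = text.encode("utf-8")      # the same character as UTF-8 bytes

print(type(text).__name__, len(text))   # str 1
print(type(raw).__name__, len(raw))     # bytes 3
print(raw)                              # b'\xe7\x8c\xab'
```

len() counts code points for a str and bytes for a bytes object, which is why the two numbers differ.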

Encode vs Decode

  • Representing a Unicode string as a sequence of bytes is known as encoding; turning bytes back into a string is decoding.
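
A small round trip may help here. This is a sketch only: to_text is a hypothetical helper name, and UTF-8 is assumed as the encoding of the incoming bytes.

```python
def to_text(data, encoding="utf-8"):
    """Return a str, decoding only when the input is bytes (hypothetical helper)."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data

original = "naïve café 猫"
encoded = original.encode("utf-8")   # str -> bytes
decoded = encoded.decode("utf-8")    # bytes -> str

print(decoded == original)           # True
print(to_text(encoded) == original)  # True

# Decoding with the wrong codec fails as soon as a non-ASCII byte shows up.
try:
    encoded.decode("ascii")
except UnicodeDecodeError as exc:
    print("ascii cannot decode this:", exc.reason)
```

If the bytes come from a file or a web page, the right encoding depends on the source; UTF-8 is a common default but not a guarantee.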

Remove punctuation

The best answer (I think) from Stack Overflow:


import string

# Thanks to Martijn Pieters for this improved version

# This uses the 3-argument version of str.maketrans
# with arguments (x, y, z) where 'x' and 'y'
# must be equal-length strings and characters in 'x'
# are replaced by characters in 'y'. 'z'
# is a string (string.punctuation here)
# where each character in the string is mapped
# to None
translator = str.maketrans('', '', string.punctuation)

# This is an alternative that creates a dictionary mapping
# of every character from string.punctuation to None (this will
# also work)
#translator = str.maketrans(dict.fromkeys(string.punctuation))

s = 'string with "punctuation" inside of it! Does this work? I hope so.'

# pass the translator to the string's translate method.
print(s.translate(translator))

The code above removes punctuation by deleting it outright, with nothing put in its place.
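
One consequence of deleting rather than replacing is that words joined only by punctuation get merged. A quick sketch of that effect, wrapping the translator in a hypothetical helper called remove_punctuation:

```python
import string

# Build the table once and reuse it; it maps each punctuation character to None.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Delete ASCII punctuation from text (hypothetical helper name)."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation("Does this work? I hope so."))  # Does this work I hope so
print(remove_punctuation("well-known, isn't it"))        # wellknown isnt it
```

Note that string.punctuation only contains ASCII punctuation, so characters such as curly quotes are left untouched by this approach.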

Replace punctuation with a blank

This method uses a regular expression. I find it better than using a translator!

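import re
text = 'string with "punctuation" inside of it!'
# [^\w] matches anything that is not a letter, digit, or underscore
re.sub(r'[^\w]', ' ', text)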
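
To see what the substitution does on a real sentence, here is a short sketch. The function name punctuation_to_spaces is made up for this example; repr() is used so the leftover spaces are visible.

```python
import re

def punctuation_to_spaces(text):
    """Replace every non-word character with a space (hypothetical helper)."""
    return re.sub(r"[^\w]", " ", text)

print(repr(punctuation_to_spaces("well-known, isn't it?")))
# 'well known  isn t it '

# Digits and underscores count as word characters, so they survive.
print(repr(punctuation_to_spaces("route_66!")))
# 'route_66 '
```

The words stay separated, but the result can contain doubled and trailing blanks, which is what the next section deals with.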

Dealing with all kinds of blanks

Again borrowing from this awesome stackoverflow post…

# Remove leading and ending spaces, use str.strip():
sentence = ' hello  apple'
sentence.strip()
# 'hello  apple'

# Remove all spaces, use str.replace():
sentence = ' hello  apple'
sentence.replace(" ", "")
# 'helloapple'

# Remove duplicated spaces, use str.split(), then join the words together:
sentence = ' hello  apple'
" ".join(sentence.split())
# 'hello apple'
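
Blanks are not always plain spaces. The sketch below, with a made-up input string, shows how the same idioms behave when tabs, newlines, and a non-breaking space are mixed in.

```python
import re

messy = "\thello \n apple\u00a0pie  "

# split() with no arguments splits on any run of whitespace.
print(" ".join(messy.split()))             # hello apple pie

# replace(" ", "") only removes the plain space character.
print(repr(messy.replace(" ", "")))        # '\thello\napple\xa0pie'

# A regex version of the split/join idiom.
print(re.sub(r"\s+", " ", messy).strip())  # hello apple pie
```

So for text scraped from the web or read from files, the split/join idiom (or \s+ in a regex) is usually the safer choice over replacing the space character alone.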

Summary of workflow:

  1. Decide whether you have a string (str) or a sequence of bytes. This determines whether you need to decode or not. Ultimately we want a string, i.e. the “str” type.
  2. Import the re module. IMHO this is the most convenient way to remove punctuation.
    1. re.sub can replace punctuation with blanks, so two unrelated words are not joined together when only punctuation connects them. Note that the pattern [^\w] keeps digits and underscores; to blank those out as well, use a pattern such as [\W\d_].
  3. Use s.lower() to change a string to lower case. Note that s should be a str here, not bytes.
  4. Use s.strip() to strip leading and trailing blanks.
  5. Use " ".join(s.split()) to join the words back together with only one blank between them, if needed.
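
Putting the steps together, the workflow might look roughly like this. It is a sketch, not a definitive implementation: the function name clean_text, the UTF-8 default, and the choice to blank out digits and underscores via [\W\d_] are all assumptions made for this example.

```python
import re

def clean_text(data, encoding="utf-8"):
    """Sketch of the workflow above; name and defaults are assumptions."""
    # 1. Make sure we have a str, decoding bytes if needed.
    text = data.decode(encoding) if isinstance(data, bytes) else data
    # 2. Replace punctuation, digits, and underscores with blanks.
    text = re.sub(r"[\W\d_]", " ", text)
    # 3. Lower-case the string.
    text = text.lower()
    # 4-5. Strip the ends and collapse runs of blanks into one.
    return " ".join(text.split())

print(clean_text(b"  Hello, World! It's 2024... well-known\tstuff "))
# hello world it s well known stuff
```

Because " ".join(text.split()) already drops leading and trailing whitespace, a separate strip() call is not needed in this version.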