Text Categorization or Classification


Though the automated classification (categorization) of texts has been flourishing in the last decade or so, it has a history which dates back to about 1960. The enormous increase in online documents, mostly due to the expanding internet, has renewed interest in automated document classification and data mining. While text classification was at first based mainly on heuristic methods, i.e. applying a set of rules based on expert knowledge, the focus has now turned to fully automatic learning and even clustering methods.

Definition of Text Classification:

Let C = { c1, c2, ..., cm } be a set of categories (classes) and D = { d1, d2, ..., dn } a set of documents.

The task of text classification consists in assigning to each pair ( ci, dj ) of C x D (with 1 ≤ i ≤ m and 1 ≤ j ≤ n) a value of 0 or 1, i.e. the value 1 if the document dj belongs to ci, and the value 0 if it doesn't.

This mapping is sometimes referred to as the decision matrix:



        d1    ...   dj    ...   dn
c1      a11   ...   a1j   ...   a1n
...     ...   ...   ...   ...   ...
ci      ai1   ...   aij   ...   ain
...     ...   ...   ...   ...   ...
cm      am1   ...   amj   ...   amn
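
To make the decision matrix concrete, here is a minimal sketch in Python. The categories, documents and keyword rules are all invented for illustration, and the keyword lookup is only a stand-in for a real classifier (it resembles the early rule-based approach rather than a learned model). Each entry a_ij is 1 if document dj is assigned to category ci, and 0 otherwise.

```python
# Hypothetical categories and documents, invented for illustration only.
categories = ["sports", "politics"]
documents = {
    "d1": "the team won the match",
    "d2": "parliament passed the budget",
    "d3": "the minister attended the match",
}

# Hypothetical keyword rules standing in for a real classifier.
keywords = {
    "sports": {"team", "match"},
    "politics": {"parliament", "minister", "budget"},
}


def decision_matrix(categories, documents, keywords):
    """Return a dict mapping (category, document id) to 0 or 1."""
    matrix = {}
    for c in categories:
        for d, text in documents.items():
            words = set(text.split())
            # 1 if the document shares at least one keyword with the category
            matrix[(c, d)] = 1 if words & keywords[c] else 0
    return matrix


matrix = decision_matrix(categories, documents, keywords)

# Print one row per category, one column per document.
for c in categories:
    print(c, [matrix[(c, d)] for d in documents])
```

Note that in this sketch a document may belong to several categories at once (d3 matches both), which the definition above allows, since every pair ( ci, dj ) gets its own value.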


The main approaches to solving this task are:
  • Naive Bayes
  • Support Vector Machine
  • Nearest Neighbour
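
As an illustration of the first approach in the list above, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is not the implementation from the Python Course chapter; the function names `train` and `classify` and the tiny training set are assumptions made up for this example, and the tokenization is a plain whitespace split.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Build a model from a list of (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of documents
    vocabulary = set()
    for text, category in labelled_docs:
        words = text.lower().split()
        word_counts[category].update(words)
        doc_counts[category] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log posterior score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids zero probabilities.
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training data, invented for illustration only.
training = [
    ("the team won the match", "sports"),
    ("a late goal decided the match", "sports"),
    ("parliament passed the budget", "politics"),
    ("the minister presented the budget", "politics"),
]
model = train(training)
print(classify("the team scored a goal", model))
```

With this toy training set the example prints `sports`. Working with sums of logarithms instead of products of probabilities is a common choice, because the products become very small for longer documents.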


You can find more about this topic in the chapter Text Categorization and Classification of our Python Course, which also includes an implementation of a Naive Bayes classifier in Python.

You can find an interesting and exhaustive bibliography on this topic:
Articles on text classification


If you are interested in writing your own text classification system and are looking for a seminar with an expert in both Python and natural language processing, you can attend one of my courses on "Natural Language Processing" with Python at Bodenseo.


The class on Text Classification is taught at our training centre in Toronto as well. Please check our website Python-training-courses.com



© Copyright 1996 - 2012, Bernd Klein