Text Categorization or Classification


Although the automated classification (categorization) of texts has flourished over the last decade or so, it has a history that dates back to about 1960. The enormous growth in online documents, driven mostly by the expanding internet, has renewed interest in automated document classification and data mining. Early text classification relied mainly on heuristic methods, i.e. applying a set of rules based on expert knowledge; today the focus has shifted to fully automatic learning and even clustering methods.


Definition of Text Classification:


Let C = { c1, c2, ..., cm } be a set of categories (classes) and D = { d1, d2, ..., dn } a set of documents.

The task of text classification consists in assigning to each pair ( ci, dj ) of C x D (with 1 ≤ i ≤ m and 1 ≤ j ≤ n) a value of 0 or 1, i.e. the value 1 if the document dj belongs to the category ci, and the value 0 if it does not.


This mapping is sometimes referred to as the decision matrix, where the entry aij is the value assigned to the pair ( ci, dj ):



        d1     ...    dj     ...    dn
c1      a11    ...    a1j    ...    a1n
...     ...    ...    ...    ...    ...
ci      ai1    ...    aij    ...    ain
...     ...    ...    ...    ...    ...
cm      am1    ...    amj    ...    amn
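
To make the definition concrete, here is a minimal Python sketch that builds such a decision matrix as a list of rows, one row per category and one column per document. The function name decision_matrix, the keyword-based membership test and the toy data are assumptions made purely for illustration; a real classifier would replace the belongs_to function.

```python
def decision_matrix(categories, documents, belongs_to):
    """Return a matrix a with a[i][j] == 1 if documents[j] belongs
    to categories[i], and 0 otherwise."""
    return [[1 if belongs_to(c, d) else 0 for d in documents]
            for c in categories]


# Hypothetical toy data, not taken from any real corpus.
categories = ["sports", "politics"]
documents = [
    "the team won the match",
    "parliament passed the law",
    "the minister attended the match",
]

# Assumed stand-in for a classifier: a document belongs to a category
# if it contains at least one of that category's keywords.
keywords = {
    "sports": {"team", "match"},
    "politics": {"parliament", "minister", "law"},
}


def belongs_to(category, document):
    return bool(keywords[category] & set(document.split()))


matrix = decision_matrix(categories, documents, belongs_to)
for category, row in zip(categories, matrix):
    print(category, row)
```

Note that the third document receives a 1 in both rows: the definition does not restrict a document to a single category.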





The main approaches to solving this task are:
  • Naive Bayes
  • Support Vector Machine
  • Nearest Neighbour
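
As an illustration of the first approach in the list above, the following is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It assumes whitespace tokenization, add-one (Laplace) smoothing, and that words unseen during training are simply skipped; the class name NaiveBayesClassifier and the training sentences are hypothetical. It is meant to show the idea, not to serve as a production implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.docs_per_class = Counter()           # documents seen per category
        self.word_counts = defaultdict(Counter)   # word frequencies per category
        self.vocabulary = set()                   # all words seen in training

    def train(self, labelled_documents):
        """labelled_documents is an iterable of (text, category) pairs."""
        for text, category in labelled_documents:
            words = text.lower().split()
            self.docs_per_class[category] += 1
            self.word_counts[category].update(words)
            self.vocabulary.update(words)

    def log_score(self, text, category):
        """Return log P(category) + sum of log P(word | category)."""
        total_docs = sum(self.docs_per_class.values())
        score = math.log(self.docs_per_class[category] / total_docs)
        denominator = (sum(self.word_counts[category].values())
                       + len(self.vocabulary))
        for word in text.lower().split():
            if word in self.vocabulary:  # assumption: skip unknown words
                count = self.word_counts[category][word]
                score += math.log((count + 1) / denominator)
        return score

    def classify(self, text):
        """Return the category with the highest log score."""
        return max(self.docs_per_class,
                   key=lambda category: self.log_score(text, category))


# Hypothetical training data for illustration only.
training_data = [
    ("the team won the match", "sports"),
    ("a great goal in the final match", "sports"),
    ("parliament passed the new law", "politics"),
    ("the minister defended the law", "politics"),
]

classifier = NaiveBayesClassifier()
classifier.train(training_data)
print(classifier.classify("the team scored a goal"))
print(classifier.classify("the minister passed a law"))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on longer documents. Classifying a document this way assigns it to exactly one category, i.e. it fills one column of the decision matrix with a single 1.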
You can find an interesting and exhaustive bibliography on this topic:
Articles on text classification




Whatever you do will be insignificant, but it is very important that you do it.
(Mahatma Gandhi)


Bernd Klein

© Copyright 1996 - 2009, Bernd Klein