I’ll start with a quote from Wikipedia.
Pattern Recognition is a sub-topic of machine learning. It is „the act of taking in raw data and taking an action based on the category of the data“. Most research in pattern recognition is about methods for supervised learning and unsupervised learning. Pattern recognition aims to classify data (patterns) based either on a priori knowledge or on statistical information extracted from the patterns. The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate multidimensional space. This is in contrast to pattern matching, where the pattern is rigidly specified.
Note the use of the words data, observation and measurement. This is the ghost of Lord Kelvin, speaking from afar. Note, too, the use of appropriate. The multidimensional space is chosen in advance by careful judgement. It exists prior to the data. Almost automatically we are led to assume measurements on the real axis, embeddings in a real vector space. The data points form a discrete point cloud. All the power of Matlab is just a fingertip away.
But beware! Let’s take a more general view.
A classifier is a map f from a large topological space X into a small discrete set of symbols denoted Y. We can regard Y as a finite topological space with discrete topology.
A function f: X⟶Y is, by definition, continuous iff the inverse image of every open set is open. Since every set in Y is open, the inverse images of the points in Y must be open.
This condition is also sufficient because every inverse image is the union of inverse images of points. Let’s rephrase slightly: Every point x∊X has an open neighborhood f^{-1}(f(x)) on which f is constant. A function f is continuous iff it is locally constant.
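In symbols, with Y carrying the discrete topology, the whole argument compresses to a chain of equivalences:

```latex
\begin{aligned}
f \text{ continuous}
&\iff f^{-1}(U) \text{ open for every } U \subseteq Y \\
&\iff f^{-1}(\{y\}) \text{ open for every } y \in Y
   \quad \text{since } f^{-1}(U) = \textstyle\bigcup_{y \in U} f^{-1}(\{y\}) \\
&\iff \text{every } x \in X \text{ has the open neighborhood } f^{-1}(f(x))
   \text{ on which } f \text{ is constant.}
\end{aligned}
```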
Sets of examples, i.e. points in X with prescribed images in Y, are often called „training examples“, and many learning algorithms start from them. I strongly prefer the term „reference points“ or „reference examples“ because I’m convinced that you cannot learn from examples alone. In order to learn anything you have to interact with the world, to move around in X, to explore the spaces of the unknown. The much-discussed „No Free Lunch“ theorems convey the same message.
Any example worth mentioning has a small neighborhood in which the class does not change. In other words, it is a germ in the sheaf of locally constant functions f: X⟶Y. Any learning algorithm must grow these germs, extend them to large open subsets, large enough to be useful. The simple nearest-neighbor method takes the given metric on X and extends the classifier to open balls with increasing radii. In our example, x_{3} and x_{4} belong to the same class. The disks will glue into one domain where f is constant. Domains of different classes will be separated by a border. If we take X as R^{n}, with Euclidean distance, the borders will be parts of hyperplanes. Eventually, X is split up into Voronoi cells.
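A minimal sketch of this growth process, with a few made-up reference points in the plane (the points and labels are my assumptions, not the x_{3}, x_{4} of the figure). Each reference point’s germ grows into its full Voronoi cell:

```python
import numpy as np

# Hypothetical reference points in X = R^2 with prescribed classes in Y.
refs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 0, 1])

def classify(x):
    """Assign x the class of the nearest reference point (Euclidean metric)."""
    d = np.linalg.norm(refs - x, axis=1)
    return labels[np.argmin(d)]

# The border between the classes is a piece of the hyperplane equidistant
# from (1,0) and (0,1), i.e. the line y = x.
print(classify(np.array([0.6, 0.4])))  # 0: below the border
print(classify(np.array([0.4, 0.6])))  # 1: above the border
```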
The images to the left show one potential problem with this approach: The letters are points in X = R^{23×19} with the usual metric. The distance between two versions of ‚e‘ is larger than the distance between ‚e‘ and ‚c‘. Although the metric of X will give the correct topology, it fails miserably at larger distances.
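The effect is easy to reproduce with toy images (the strokes below are stand-ins I made up, not the actual letter bitmaps): a one-pixel shift of the same glyph moves it farther away, in pixel metric, than a genuinely different glyph that happens to overlap well.

```python
import numpy as np

a = np.zeros((23, 19)); a[4:19, 5] = 1.0      # a vertical stroke
b = np.zeros((23, 19)); b[4:19, 6] = 1.0      # the same stroke, shifted one pixel
c = a.copy(); c[4, 5] = 0.0; c[18, 5] = 0.0   # a slightly different glyph, same position

d_ab = np.linalg.norm(a - b)   # "two versions of the same letter"
d_ac = np.linalg.norm(a - c)   # "two different letters"
print(d_ab > d_ac)             # True: the shifted copy is the distant one
```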
Tangent space methods use a modified metric, adapted to the local geometry. The spheres of Euclidean distance become ellipsoids, and the first few principal axes span a tangent space. If we do this for many points, the tangent spaces will fit together to form a submanifold of X. Although it will be nonsingular at our reference points, there may be singular points where they do not fit well. Even the dimension may vary from point to point. The usual notion of a manifold does not cover this behavior.
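A sketch of the local-PCA idea behind these methods, on an assumed toy dataset (a noisy circle in R^3, not real letter data): the singular vectors of a small neighborhood give the local principal axes, and the dominant ones span the tangent space.

```python
import numpy as np

# A noisy circle in R^3: intrinsically one-dimensional data in a bigger space.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
cloud = np.column_stack([np.cos(t), np.sin(t), 0.01 * rng.standard_normal(200)])

x0 = cloud[0]
nbrs = cloud[np.linalg.norm(cloud - x0, axis=1) < 0.4]  # small neighborhood of x0
centered = nbrs - nbrs.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)

# One singular value dominates: locally the data is one-dimensional, and
# vt[0] spans the tangent direction at x0; the ellipsoid is long along vt[0].
print(s / s[0])
```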
Instead of modifying the metric we can simply add reference points. If an instance of ‚e‘ gets misclassified, we simply add another example. Memory is cheap, after all. We can even use object constancy to infer class membership, so we can generate lots of examples fast.
The borders, however, start looking strange. As the number of examples grows, the rate of misclassification drops slowly, if at all. As exception upon exception occurs, the borders exhibit a fine structure. Covering the interior with disks is, hmm, difficult. Do we really know that category borders are smooth with codimension one?
If we do not even try to map the category borders in fine detail, if we smooth out the fractal structure, we have to resort either to probability or to fuzzy logic. Instead of a continuous indicator function f: X ⟶ Y = {0,1} (a discrete set) we have a continuous function f: X ⟶ [0,1] ⊂ R. It measures either a degree of set membership or the probability of the event „it is an e“.
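A minimal sketch of such a soft classifier. The reference points and the logistic fall-off are my choices, not prescribed by anything above; the point is only the shape of the map X ⟶ [0,1]:

```python
import numpy as np

# Hypothetical reference points for the two classes.
refs_e = np.array([[0.0, 0.0], [1.0, 1.0]])
refs_c = np.array([[3.0, 0.0]])

def p_is_e(x, scale=1.0):
    """Soft membership in [0,1]: logistic in the gap between the distance
    to the nearest 'e' reference and the nearest 'c' reference."""
    d_e = np.min(np.linalg.norm(refs_e - x, axis=1))
    d_c = np.min(np.linalg.norm(refs_c - x, axis=1))
    return 1.0 / (1.0 + np.exp((d_e - d_c) / scale))

print(p_is_e(np.array([0.1, 0.0])))  # near 1 close to an 'e' example
print(p_is_e(np.array([2.9, 0.0])))  # near 0 close to the 'c' example
```

The hard border of the nearest-neighbor rule becomes the level set p = 1/2, and the fractal fine structure is smoothed away by the scale parameter.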
The two-dimensional renderings of the space X are misleading. The images of ‚c‘ and ‚e‘ shown above have 23×19 grayscale pixels each. They are points in a real affine space of dimension 437. Our intuitions of distance, volume and connectedness, derived from three-space, are simply wrong in this case. (Look here for some illustrations and comments.) The spheres surrounding each reference point extend in 437 dimensions. Most of the volume is concentrated in a thin shell just below the surface. We do not know anything about deep space far from any example. As far as experience goes, these parts of our space do not exist.
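The shell concentration is easy to check: the fraction of a d-ball’s volume inside radius 1 − ε is (1 − ε)^d, so in 437 dimensions almost everything lies in a thin shell at the surface.

```python
# Volume fraction of a 437-ball lying in the outer shell of thickness eps.
d = 437
for eps in (0.01, 0.05, 0.10):
    inner = (1 - eps) ** d        # fraction inside the concentric inner ball
    print(eps, 1 - inner)         # fraction in the shell; already ~0.99 at eps=0.01
```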
Do we really want to extend our classifier to the entire space X? After all, a random point of X looks rather dull. It is just low-contrast noise. Two independent points look the same although they are separated by a large distance. Something is wrong here.
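A quick check with uniform noise in 437 dimensions: per pixel the two samples differ only a little, yet their Euclidean distance is large, roughly √d times the per-pixel spread.

```python
import numpy as np

# Two independent noise "images" in R^437.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 437)
y = rng.uniform(0, 1, 437)

print(np.abs(x - y).mean())    # small per-pixel difference, about 1/3
print(np.linalg.norm(x - y))   # large distance, about sqrt(437/6) ≈ 8.5
```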
We are left with an apparent dilemma.
I am not sure about the „correct“ category, however. A classifier is a nonexpanding map from a large subset of X to Y. The category of nonexpanding maps looks most hopeful.
Next: Factorization.