I’ve taken some screenshots of my simplistic OCR application. It serves as a testbed and driver for the pattern engine.
On close inspection you will notice that the bounding boxes jump around from one run of the program to the next. Some of this variation comes from sensor noise, but most of it is added deliberately. Removing this variability is usually the first step in an OCR system. That is misguided, because it destroys the gluing data. Without microsaccades, mammalian vision would be impossible.
Each rectangular box gives an input to the pattern engine.
The locally constant functions h₀, …, hₙ₋₁ calculate an address vector of type
const uint64_t x[n]
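As a minimal sketch of what such locally constant functions might look like, assume each hᵢ samples a fixed subset of pixels from the binarized glyph and packs them into a 64-bit address. The `AddressMaps` type and the pixel subsets are invented for illustration; the program's real hᵢ are not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: each locally constant function h_i reads a fixed
// subset of pixels of the binarized glyph and packs them into one
// 64-bit address. Together the n addresses form the vector x[0..n-1].
struct AddressMaps {
    std::vector<std::vector<int>> pixelSubsets; // one pixel subset per h_i

    std::vector<uint64_t> addresses(const std::vector<uint8_t>& bitmap) const {
        std::vector<uint64_t> x;
        for (const auto& subset : pixelSubsets) {
            uint64_t a = 0;
            for (int p : subset)                  // pack the sampled pixels
                a = (a << 1) | (bitmap[p] ? 1u : 0u);
            x.push_back(a);
        }
        return x;                                 // plays the role of x[n]
    }
};
```

Because each hᵢ ignores all pixels outside its subset, small perturbations of the glyph leave most addresses unchanged, which is what makes the deliberately jittered bounding boxes usable.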
The dark-blue color of the boxes indicates that the memory units are empty. Let’s operate the engine in unsupervised mode. Whenever the output is undefined, we create a new symbol. The light-blue boxes indicate places where the majority approaches 100%. Additional iterations will occasionally create new symbols, but most of the time an existing symbol will spread. If the veto threshold β is not zero, symbols may become extinct, too!
A few iterations later:
The light-blue areas have majority > 80%. Yellow and green indicate 60% and 40%. The red areas have at least 20% veto. Most of them exhibit just a few votes and one dissenting opinion. The dark-blue capital letters are so rare that the initial germs are growing slowly. Eventually, they’ll fuse into a few classes, too.
So far, the engine has run completely unsupervised. It has clustered the letters into stable categories. We can assign conventional labels to the equivalence classes with mouse and keyboard. The light-blue boxes turn white; the other colors remain unchanged.
The final state after millions of examples shows only white and light-blue. The singularities have been squeezed out. You should also note that the occasional oversegmented ‘n’ does not get misclassified. The partial strokes are so characteristic and common that they have acquired a large class of their own.
This kind of performance is delivered by a single pattern engine with nine fixed maps hᵢ: X ⟶ Xᵢ and about 4.6 × 10⁵ memory entries. In order to separate the “long s” and “f”, the resolution must be set to about 8×8 pixels. This is inefficient for common letters like ‘e’, which are easily recognized at a resolution of 5×5 pixels, and there are still dozens of different categories for less common letters.
A hierarchical system obviates this difficulty. Instead of a single pattern engine, we use several smaller ones whose receptive fields cover the bounding box. They specialize in the morphemes that constitute a letter. Each output is a locally constant function, and different products of subsets of these functions form the input for a single pattern engine at the second level. The receptive fields of the pattern engines at the lower level overlap heavily; there may even be engines whose receptive fields are identical. A small amount of randomness in symbol creation, however, leads to singularities developing at different points of the input space. Forming products of outputs at this level allows the second layer to resolve the singularities.
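The product-forming step could be sketched like this, assuming for illustration that "products of subsets" means combining pairs of first-level symbol outputs into single second-level addresses. The pairing scheme and the bit-mixing used here are invented, not the program's actual wiring.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: build the second-level address vector from pairs of
// first-level outputs. Each pair of locally constant outputs is mixed into
// one 64-bit address, so the second-level engine sees their product.
std::vector<uint64_t> secondLevelAddresses(
        const std::vector<uint64_t>& firstLevelOut,
        const std::vector<std::pair<int, int>>& pairs) {
    std::vector<uint64_t> x;
    for (auto [i, j] : pairs)
        x.push_back((firstLevelOut[i] << 32) ^ firstLevelOut[j]);
    return x;
}
```

Because each second-level address depends on two lower-level symbols at once, two inputs that collide in one lower-level engine can still be told apart whenever any paired engine distinguishes them.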
In this way, wholeness may arise from parts. The higher level will often derive the correct result (and a distress signal that may prove useful, too!) even if a letter is partially obscured by a blot.
The screenshot below shows a more advanced program version.
It is a true multilayer engine whose internal layers are initialized with random data. After a few thousand unsupervised runs through sample pages, only a few hundred cluster codes survive at the third layer, and the vote is mostly unanimous. The last layer is just a QMap<uint64_t,QString>; it maps hash codes to strings.
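The last layer is then a plain dictionary lookup. A minimal sketch, using std::map in place of QMap so it compiles without Qt: an unknown hash code yields an empty label, which stands in for an undefined output.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Sketch of the last layer: the program uses a QMap<uint64_t,QString>;
// std::map<uint64_t,std::string> plays the same role here.
std::string lookupLabel(const std::map<uint64_t, std::string>& lastLayer,
                        uint64_t hashCode) {
    auto it = lastLayer.find(hashCode);
    return it == lastLayer.end() ? std::string() : it->second;
}
```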