Approaches
There are two main approaches to detecting any human-readable representation of a discrete quantity within text.
1. Detect well-known, stable patterns in the input stream and determine the output stream from their adjacency.
2. Window through the text in the input stream and detect the quantities directly.
Other approaches exist, as do hybrids that combine these two with each other or with those other approaches, but these two are theoretically the most straightforward and the most likely to produce both reliability and accuracy.
Re-entrant Learning
Whether the training involves re-entrant learning techniques, such as reinforcement, is a tangential issue that this answer will not address. Know, however, that it is an architectural decision whether all training is completed as part of deployment or whether adaptation and/or convergence also occurs in real time.
Practical Concerns
Practically, the outputs of each recognition are as follows.
- Starting index
- Ending index
- Integer year or null
- Integer day of year or null
- Integer hour in military time or null
- Minute or null
- Second or null
- Time zone or null
- Probability the recognition unit was correctly identified
- Probability the recognition produced accurate results
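As a minimal sketch, the output fields listed above could be carried in a single record. The class name `Recognition`, the field names, and the `to_date` helper are hypothetical and chosen only for illustration; null is represented as `None`.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Recognition:
    """One recognized date/time unit (hypothetical layout)."""
    start: int                          # starting index in the input stream
    end: int                            # ending index in the input stream
    year: Optional[int] = None
    day_of_year: Optional[int] = None   # 1 = January 1
    hour: Optional[int] = None          # 24-hour clock, 0 to 23
    minute: Optional[int] = None
    second: Optional[int] = None
    time_zone: Optional[str] = None
    p_identified: float = 0.0           # probability the unit was correctly identified
    p_accurate: float = 0.0             # probability the extracted values are accurate

    def to_date(self) -> Optional[date]:
        # A calendar date exists only when both year and day of year are known.
        if self.year is None or self.day_of_year is None:
            return None
        return date(self.year, 1, 1) + timedelta(days=self.day_of_year - 1)


example = Recognition(start=10, end=22, year=2019, day_of_year=32,
                      p_identified=0.97, p_accurate=0.91)
```

Keeping every temporal field optional lets a partial recognition, such as a bare year, use the same record as a full timestamp.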
Also practically, the input must either be from within one particular locale's norms in terms of
- Calendar,
- Time,
- Written language,
- Character encoding, and
- Collation,
... or ...
- The learning must occur using training sets that include the locales that will be encountered during system use
... or ...
- The input must be normalized across locales before it reaches the network, so that Filipino and Icelandic names for the first month of the year enter the artificial network as the same binary pattern.
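One way to picture the idea of different locales' month names reaching the network as the same pattern is a lookup that maps each name to a shared code. This is only a sketch under assumptions: the table `MONTH_CODES` and the function `month_code` are hypothetical, and the table holds just three spellings of the first month (English, Filipino, Icelandic).

```python
import unicodedata
from typing import Optional


def _key(word: str) -> str:
    # Normalize Unicode composition and case so equivalent spellings compare equal.
    return unicodedata.normalize("NFC", word).casefold()


# Hypothetical, deliberately tiny table: every entry is the first month of the year.
MONTH_CODES = {_key(name): 1 for name in ("January", "Enero", "janúar")}


def month_code(word: str) -> Optional[int]:
    """Return the shared month number for a known month name, else None."""
    return MONTH_CODES.get(_key(word))
```

With this table, `month_code("Enero")` and `month_code("JANÚAR")` both return 1, so downstream encoding sees a single value regardless of the source language.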
Date and Time Specifically
The first approach above is semi-heuristic in nature. Assuming that the locale is entirely en-US.utf-8, the case-insensitive patterns for a PCRE library or equivalent to use as a search orientation heuristic include the following.
(^|[^0-9a-z])((19|20|21)[0-9][0-9])([^0-9a-z]|$)
(^|[^0-9a-z])(Mon|Monday|Tue|Tues|Tuesday|Wed|Wednesday|Thu|Thur|Thurs|Thursday|Fri|Friday|Sat|Saturday|Sun|Sunday)([^0-9a-z]|$)
(^|[^0-9a-z])(Jan|January|Feb|February|Mar|March|Apr|April|May|Jun|June|Jul|July|Aug|August|Sep|Sept|September|Oct|October|Nov|November|Dec|December)([^0-9a-z]|$)
(^|[^0-9a-z])(Today|Yesterday|Tomorrow)([^0-9a-z]|$)
(^|[^0-9])([AP]M|[AP][.]M[.]|Noon|Midnight)([^0-9a-z]|$)
(^|[^a-z])([01]?[0-9]|2[0-3])(:[0-5][0-9]){1,2}([^a-z]|$)
There should be others for additional time formats, hyphenated or slash-delimited dates, and time zones.
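A minimal sketch of how such patterns might be applied, using Python's `re` module in place of a PCRE library. Only three of the patterns are included, the year pattern is written with its parentheses balanced, and the names `PATTERNS` and `find_anchors` are hypothetical.

```python
import re

# Group 2 of each pattern is the artifact itself; groups 1 and the last group
# are the boundary characters on either side.
PATTERNS = {
    "year": r"(^|[^0-9a-z])((19|20|21)[0-9][0-9])([^0-9a-z]|$)",
    "month": (r"(^|[^0-9a-z])(Jan|January|Feb|February|Mar|March|Apr|April|May"
              r"|Jun|June|Jul|July|Aug|August|Sep|Sept|September|Oct|October"
              r"|Nov|November|Dec|December)([^0-9a-z]|$)"),
    "relative": r"(^|[^0-9a-z])(Today|Yesterday|Tomorrow)([^0-9a-z]|$)",
}
COMPILED = {kind: re.compile(p, re.IGNORECASE) for kind, p in PATTERNS.items()}


def find_anchors(text):
    """Return (start, end, kind, matched_text) tuples sorted by position."""
    hits = []
    for kind, pattern in COMPILED.items():
        for m in pattern.finditer(text):
            hits.append((m.start(2), m.end(2), kind, m.group(2)))
    return sorted(hits)


text = "Meet me on January 5, 2019 or tomorrow."
anchors = find_anchors(text)
```

Because the boundary groups consume a character, two hits of the same pattern separated by a single character would not both be found this way; lookahead and lookbehind assertions are one possible alternative.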
The positions and normalized encoding of these date and time artifacts are then substituted into the artificial network inputs instead of the original text in the stream, reducing redundancy and improving both the speed of training and the resulting accuracy and reliability of recognition.
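The substitution step might look like the following sketch. The token format (`<MONTH:09>`, `<YEAR:2021>`), the names `TOKEN_RE`, `MONTH_NUMBER`, and `substitute`, and the restriction to months and years are all assumptions made for illustration; word boundaries (`\b`) are used here instead of the explicit boundary groups.

```python
import re

MONTH_NUMBER = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
                "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# One combined pattern so that all positions refer to the original text.
TOKEN_RE = re.compile(
    r"\b(?:(?P<month>january|jan|february|feb|march|mar|april|apr|may|june|jun"
    r"|july|jul|august|aug|september|sept|sep|october|oct|november|nov"
    r"|december|dec)|(?P<year>(?:19|20|21)[0-9][0-9]))\b",
    re.IGNORECASE,
)


def substitute(text):
    """Replace month and year artifacts with normalized tokens.

    Returns the rewritten text and a list of (start, end, token) tuples,
    where start and end index into the ORIGINAL text.
    """
    pieces, artifacts, last = [], [], 0
    for m in TOKEN_RE.finditer(text):
        if m.group("month"):
            token = "<MONTH:%02d>" % MONTH_NUMBER[m.group("month")[:3].lower()]
        else:
            token = "<YEAR:%s>" % m.group("year")
        artifacts.append((m.start(), m.end(), token))
        pieces.append(text[last:m.start()])
        pieces.append(token)
        last = m.end()
    pieces.append(text[last:])
    return "".join(pieces), artifacts


normalized, found = substitute("Due Sept 3, 2021.")
```

Every spelling of a month collapses to one token, which is the redundancy reduction described above. Note that a heuristic like this will also tag the modal verb "may", which the network would then have to learn to discount.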
In the second approach above, the entire burden of recognition is left to the artificial network. The advantage is less reliance on date and time conventions. The disadvantage is a much larger burden placed on training data variety and training epochs, meaning a much higher burden on computing resources and on the patience of the project's stakeholders.
Windowing
An overlapping windowing strategy is necessary. Unlike FFT spectral analysis in real time, the windowing must be rectangular, because the size of the window is the width of the input layer of the artificial network. Experimenting with the normalization of input, that is, with the encoding of the text and of the date and time components entering the input layer, could greatly vary the results in terms of training speed, recognition accuracy, reliability, and adaptability to varying statistical distributions of date and time instances and relationships.
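A minimal sketch of overlapping rectangular windowing over a character stream. The function name `windows`, the `stride` parameter, and the space used for padding are assumptions; `width` stands for the width of the network's input layer.

```python
def windows(text, width, stride, pad=" "):
    """Yield (start, chunk) pairs of fixed-width, overlapping windows.

    The window is rectangular: every character in it is passed through
    unweighted. The final window is padded so each chunk has exactly
    `width` characters.
    """
    if width <= 0 or not 0 < stride <= width:
        raise ValueError("need width > 0 and 0 < stride <= width")
    start = 0
    while True:
        yield start, text[start:start + width].ljust(width, pad)
        if start + width >= len(text):
            break  # this window reached the end of the text
        start += stride


chunks = list(windows("abcdefghijk", width=4, stride=3))
# chunks == [(0, "abcd"), (3, "defg"), (6, "ghij"), (9, "jk  ")]
```

A stride smaller than the width produces the overlap, so a date that straddles one window boundary still appears whole in a neighboring window, provided the overlap is at least as long as the date.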