5

I want to explore and experiment with ways in which I could use a neural network to identify patterns in text.

Examples:

  1. Prices of XYZ stock went down at 11:00 am today
  2. Retrieve a list of items exchanged on 03/04/2018
  3. Show error logs between 3 - 5 am yesterday.
  4. Reserve a flight for 3rd October.
  5. Do I have any meetings this Friday?
  6. Remind me to wake up early tue, 4th sept

This is for a project, so I am not using regular expressions. Papers, projects, and ideas are all welcome, but I want to approach this as feature extraction/pattern detection: training a model that can identify patterns it has already seen.

  • Can you clarify: the patterns that you are looking for, are they fixed in size (do they always extend across the same number of time steps)? Also, have you looked at sliding/neural nets for timeseries prediction approaches? – quintumnia Sep 02 '18 at 11:48
  • "are they fixed in size" no, they could be in any of the generally used formats, I'll add that to the question. "have you looked at sliding/neural nets for timeseries prediction" no, I am fairly new to NLP :) – Amresh Venugopal Sep 02 '18 at 11:55
  • Sharing your research helps everyone. Tell us what you've tried and why it didn't meet your needs instead of giving us examples based on what you assume. This demonstrates that you've taken the time to try to help yourself, saves us from reiterating obvious answers, and, most of all, helps you get a more specific and relevant answer. – quintumnia Sep 02 '18 at 17:21
  • This question seems to be too broad. In the body of the post, you talk about "patterns". In the title, you talk about "datetime patterns". Moreover, as suggested above, [before asking a question, it's expected that users do a little bit of research](https://ai.stackexchange.com/help/how-to-ask). – nbro Jan 06 '22 at 13:13

3 Answers

1

If you want to use deep learning approaches, you should look at recurrent neural networks (RNNs). Recurrent networks take temporal dependencies into account and could detect that "this" in "this Friday" belongs to a datetime expression, whereas "this" in "this apple" does not.

As a simple model, you could create a model with a bidirectional LSTM layer (a type of RNN):

  • Input: the sequence of characters.
  • Output: whether the character belongs to datetime or not.

The most time-consuming part will be gathering many sentences with their corresponding labels to create a training/testing dataset. Keras might be a good framework to start playing around with, and it comes with many examples.
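As a rough illustration of the input/output format described above, here is a minimal sketch of how one annotated sentence could be turned into a training pair: one integer id per character as input and one 0/1 label per character as output. The function names `char_labels` and `encode_chars` and the `(start, end)` span convention are assumptions made for this sketch, not part of Keras or any other library.

```python
def char_labels(sentence, spans):
    """Return one label per character: 1 inside a datetime span, else 0.

    `spans` is a list of (start, end) index pairs, end exclusive
    (a hypothetical annotation format chosen for this sketch).
    """
    labels = [0] * len(sentence)
    for start, end in spans:
        if not 0 <= start <= end <= len(sentence):
            raise ValueError(f"span {(start, end)} is outside the sentence")
        for i in range(start, end):
            labels[i] = 1
    return labels


def encode_chars(sentence, vocab):
    """Map each character to an integer id, growing `vocab` as needed.

    Id 0 is left unused so it can serve as a padding value later.
    """
    return [vocab.setdefault(ch, len(vocab) + 1) for ch in sentence]


sentence = "Reserve a flight for 3rd October."
start = sentence.index("3rd October")
spans = [(start, start + len("3rd October"))]

vocab = {}
x = encode_chars(sentence, vocab)   # model input: one id per character
y = char_labels(sentence, spans)    # model target: one 0/1 label per character
```

Sequences prepared this way would still need padding to a common length before being fed to a recurrent layer.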

Daniel GL
  • Most RNNs for NLP use word embeddings as input features. It would be challenging to design such embeddings for numbers, hyphens, or other punctuation. Maybe a character-level RNN is needed. – user12075 Sep 10 '18 at 20:37
  • @DouglasDaseeco, the question didn't involve getting the value of the time. With the same structure, the output could be enriched to codify absolute (03/04/2018) and relative (this Friday) dates. – Daniel GL Sep 11 '18 at 08:35
1

Approaches

There are two main approaches to detecting any human-readable representation of a discrete quantity within text.

  1. Detect well-known and stable patterns in the input stream and, by adjacency, determine the output stream.
  2. Window through the text in the input stream and detect the quantities directly.

There are other approaches, as well as hybrids of these two or of one of them with other approaches, but these two are theoretically the most straightforward and the most likely to produce both reliability and accuracy.

Re-entrant Learning

Whether the training involves re-entrant learning techniques, such as reinforcement, is a tangential issue that this answer will not address. Be aware, though, that whether all training is completed before deployment or whether adaptation and/or convergence continues in real time is an architectural decision to be made.

Practical Concerns

Practically, the outputs of each recognition are as follows.

  • Starting index
  • Ending index
  • Integer year or null
  • Integer day of year or null
  • Integer hour in military time or null
  • Minute or null
  • Second or null
  • Time zone or null
  • Probability the recognition unit was correctly identified
  • Probability the recognition produced accurate results
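One possible way to hold these outputs in code is a small record type with optional fields. The sketch below is only an illustration: the class name `Recognition` and all field names are hypothetical, and the probabilities in the usage example are made-up placeholder values, not model output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recognition:
    """One recognized date/time unit; None means 'not present in the text'."""
    start_index: int                    # index of the first character
    end_index: int                      # index one past the last character
    year: Optional[int] = None
    day_of_year: Optional[int] = None   # 1..366
    hour: Optional[int] = None          # 0..23 (24-hour clock)
    minute: Optional[int] = None
    second: Optional[int] = None
    time_zone: Optional[str] = None
    p_identified: float = 0.0           # probability the unit was correctly identified
    p_accurate: float = 0.0             # probability the extracted values are accurate


text = "Prices of XYZ stock went down at 11:00 am today"
start = text.index("11:00 am")
rec = Recognition(
    start_index=start,
    end_index=start + len("11:00 am"),
    hour=11,
    minute=0,
    p_identified=0.9,   # placeholder value for illustration only
    p_accurate=0.8,     # placeholder value for illustration only
)
```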

Also practically, the input must either be from within one particular locale's norms in terms of

  • Calendar,
  • Time,
  • Written language,
  • Character encoding, and
  • Collation,

... or ...

  • The learning must occur using training sets that include the locales that will be encountered during system use

... or ...

  • Much of the locale specific syntax must be normalized to a general date and time language such as this:

    जनवरी --> D_01

    Enero --> D_01

    Janúar --> D_01

so that the Hindi, Spanish (or Filipino), and Icelandic names for the first month of the year enter the artificial network as the same binary pattern.
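A minimal sketch of such a normalization step is a lookup table applied word by word. The table below contains only the three names shown above plus the English one; a real table would need every month in every supported locale. The names `MONTH_TOKENS` and `normalize_months` are assumptions made for this sketch.

```python
import unicodedata


def _key(word):
    """Canonical lookup key: NFC-normalized and case-folded."""
    return unicodedata.normalize("NFC", word).casefold()


# Tiny illustrative table; the D_01 token follows the example above.
MONTH_TOKENS = {
    _key(name): "D_01"
    for name in ("जनवरी", "Enero", "Janúar", "January")
}


def normalize_months(text):
    """Replace known month names with their locale-independent token.

    Splits on whitespace only, so a name with attached punctuation
    (for example "Enero,") is left unchanged in this simple sketch.
    """
    return " ".join(MONTH_TOKENS.get(_key(w), w) for w in text.split())
```

For example, `normalize_months("21 de Enero")` and `normalize_months("21 de January")` both yield `"21 de D_01"`.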

Date and Time Specifically

In the case of approach 1 above, which is semi-heuristic in nature, and assuming that the locale is entirely en_US.UTF-8, the case-insensitive patterns that a PCRE library or equivalent could use as a search heuristic include the following.

(^|[^0-9a-z])((19|20|21)[0-9][0-9])([^0-9a-z]|$)
(^|[^0-9a-z])(Mon|Monday|Tue|Tues|Tuesday|Wed|Wednesday|Thu|Thur|Thurs|Thursday|Fri|Friday|Sat|Saturday|Sun|Sunday)([^0-9a-z]|$)
(^|[^0-9a-z])(Jan|January|Feb|February|Mar|March|Apr|April|May|Jun|June|Jul|July|Aug|August|Sep|Sept|September|Oct|October|Nov|November|Dec|December)([^0-9a-z]|$)
(^|[^0-9a-z])(Today|Yesterday|Tomorrow)([^0-9a-z]|$)
(^|[^0-9])([AP]M|[AP][.]M[.]|Noon|Midnight)([^0-9a-z]|$)
(^|[^a-z])([01]?[0-9]|2[0-3])(:[0-5][0-9]){1,2}([^a-z]|$)
Additional patterns would be needed for other time formats, hyphen- or slash-delimited dates, and time zones.

The positions and normalized encoding of these date and time artifacts are then substituted into the artificial network inputs instead of the original text in the stream, reducing redundancy and improving both the speed of training and the resulting accuracy and reliability of recognition.
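The substitution step can be sketched with Python's `re` module. The two patterns below are adapted from the list above; the token names `WEEKDAY` and `RELATIVE_DAY` and the angle-bracket placeholder format are assumptions made for this sketch, standing in for whatever normalized encoding the network would actually receive.

```python
import re

# Two of the heuristic patterns, compiled case-insensitively.
# Group 2 is the artifact itself; groups 1 and 3 are the boundaries.
PATTERNS = {
    "WEEKDAY": re.compile(
        r"(^|[^0-9a-z])(Mon|Monday|Tue|Tues|Tuesday|Wed|Wednesday|Thu|Thur|"
        r"Thurs|Thursday|Fri|Friday|Sat|Saturday|Sun|Sunday)([^0-9a-z]|$)",
        re.IGNORECASE,
    ),
    "RELATIVE_DAY": re.compile(
        r"(^|[^0-9a-z])(Today|Yesterday|Tomorrow)([^0-9a-z]|$)",
        re.IGNORECASE,
    ),
}


def find_artifacts(text):
    """Return sorted (start, end, kind, matched_text) tuples."""
    found = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(2), m.end(2), kind, m.group(2)))
    return sorted(found)


def substitute(text):
    """Replace each artifact with a placeholder token such as <WEEKDAY>."""
    out = text
    # Work right to left so earlier indices stay valid after each replacement.
    for start, end, kind, _ in reversed(find_artifacts(text)):
        out = out[:start] + "<" + kind + ">" + out[end:]
    return out
```

Because the boundary groups consume a character, two artifacts of the same kind separated by a single character (for example "Mon Tue") would not both be found; lookbehind and lookahead assertions would avoid that.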

In the case of approach 2 above, the entire burden of recognition is left to the artificial network. The advantage is less reliance on date and time conventions. The disadvantage is a much larger burden placed on training data variety and training epochs, meaning a much higher demand on computing resources and on the patience of the project's stakeholders.

Windowing

An overlapping windowing strategy is necessary. Unlike FFT spectral analysis in real time, the windowing must be rectangular, because the size of the window is the width of the input layer of the artificial network. Experimenting with how the input is normalized, that is, how the text and the date and time components are encoded before entering the input layer, could greatly vary the results in terms of training speed, recognition accuracy, reliability, and adaptability to varying statistical distributions of date and time instances and relationships.
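A minimal sketch of overlapping rectangular windowing over a character stream follows. The function name `sliding_windows`, the `stride` parameter, and the choice of space as the padding character are assumptions made for this illustration; the window width would be set to the width of the network's input layer.

```python
def sliding_windows(text, width, stride, pad=" "):
    """Yield (start, window) pairs of exactly `width` characters.

    Windows overlap whenever stride < width. The final window is
    right-padded with `pad` if the text ends before the window does.
    """
    if width <= 0 or stride <= 0:
        raise ValueError("width and stride must be positive")
    last_start = max(len(text) - width, 0)
    start = 0
    while True:
        yield start, text[start:start + width].ljust(width, pad)
        if start >= last_start:
            break
        start += stride


windows = list(sliding_windows("abcdefghij", width=4, stride=2))
```

With a width of 4 and a stride of 2, consecutive windows share two characters, so a date or time expression that straddles one window boundary still appears whole in a neighbouring window, provided it is no longer than the overlap.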

Douglas Daseeco
0

If you don't want to use machine learning, you can use a date/time parser in Python, such as dateparser. A few examples are given below. It returns a datetime object parsed from the given string, and it supports many languages.

>>> import dateparser
>>> dateparser.parse('12/12/12')
datetime.datetime(2012, 12, 12, 0, 0)
>>> dateparser.parse(u'Fri, 12 Dec 2014 10:55:50')
datetime.datetime(2014, 12, 12, 10, 55, 50)
>>> dateparser.parse(u'Martes 21 de Octubre de 2014')  # Spanish (Tuesday 21 October 2014)
datetime.datetime(2014, 10, 21, 0, 0)
>>> dateparser.parse(u'Le 11 Décembre 2014 à 09:00')  # French (11 December 2014 at 09:00)
datetime.datetime(2014, 12, 11, 9, 0)
>>> dateparser.parse(u'13 января 2015 г. в 13:34')  # Russian (13 January 2015 at 13:34)
datetime.datetime(2015, 1, 13, 13, 34)
>>> dateparser.parse(u'1 เดือนตุลาคม 2005, 1:00 AM')  # Thai (1 October 2005, 1:00 AM)
datetime.datetime(2005, 10, 1, 1, 0)
Patel Sunil