Modulated string searching

Alberto Apostolico, Péter L. Erdos, I. Miklós, Johannes Siemons

Research output: Contribution to journal › Article

Abstract

In his 1987 paper entitled Generalized String Matching, Abrahamson introduced the concept of pattern matching with character classes and provided the first efficient algorithm to solve this problem. The best known solution to date is due to Linhart and Shamir (2009). Another broad yet comparatively less intensively studied class of string matching problems is numerical string searching, such as "less-than" or L1-norm string searching. The best known solutions for problems in this class are based on FFT convolution after some suitable re-encoding. The present paper introduces modulated string searching as a unified framework for string matching problems where the numerical conditions can be combined with some Boolean/numerical decision conditions on the character classes. One example problem in this class is the locally bounded L1-norm matching problem with parameters b and τ: here the pattern "matches" a text of the same length if their L1-distance is at most b and if, furthermore, there is no position where the text element and pattern element differ by more than the local bound τ. A more general setup is that where the pattern positions contain character classes and/or each position has its own private local bound. While the first variant can clearly be handled by adaptation of the classic FFT method, the second one is far too complicated for this treatment. The algorithm we propose in this paper can solve all such problems efficiently. The proposed framework contains two nested procedures. The first one, based on Karatsuba's fast multiplication algorithm, solves pattern matching with character classes within time O(n m^0.585), where n and m are the text and pattern lengths, respectively (under some reasonable conventions). This is slightly better than the complexity of Abrahamson's algorithm for generalized string matching but worse than algorithms based on FFT.
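For orientation, the exponent 0.585 in the bound for the first procedure equals log2(3) - 1: Karatsuba's method multiplies two length-m operands using three half-size products, for O(m^1.585) work. The sketch below is a generic Karatsuba multiplication on coefficient lists, not the authors' procedure; the abstract does not say how the multiplication is applied to text and pattern, and the function names are illustrative only. One reading consistent with the stated bound, offered purely as an assumption, is that roughly n/m blocks of length m are each processed in O(m^1.585) time.

```python
import math


def _karatsuba_pow2(a, b):
    """Multiply coefficient lists of equal power-of-two length."""
    n = len(a)
    if n == 1:
        return [a[0] * b[0]]
    h = n // 2
    a_lo, a_hi = a[:h], a[h:]
    b_lo, b_hi = b[:h], b[h:]
    # Three recursive half-size products instead of four.
    lo = _karatsuba_pow2(a_lo, b_lo)
    hi = _karatsuba_pow2(a_hi, b_hi)
    mid = _karatsuba_pow2(
        [x + y for x, y in zip(a_lo, a_hi)],
        [x + y for x, y in zip(b_lo, b_hi)],
    )
    # product = lo + x^h * (mid - lo - hi) + x^(2h) * hi
    out = [0] * (2 * n - 1)
    for i, v in enumerate(lo):
        out[i] += v
        out[i + h] -= v
    for i, v in enumerate(hi):
        out[i + 2 * h] += v
        out[i + h] -= v
    for i, v in enumerate(mid):
        out[i + h] += v
    return out


def karatsuba(a, b):
    """Multiply two non-empty coefficient lists (lowest degree first)."""
    size = 1
    while size < max(len(a), len(b)):
        size *= 2
    # Zero-pad both operands to a common power-of-two length.
    pa = list(a) + [0] * (size - len(a))
    pb = list(b) + [0] * (size - len(b))
    return _karatsuba_pow2(pa, pb)[: len(a) + len(b) - 1]


if __name__ == "__main__":
    # (1 + 2x + 3x^2) * (4 + 5x)
    print(karatsuba([1, 2, 3], [4, 5]))
    # Exponent of Karatsuba's running time, about 1.585
    print(math.log2(3))
```

The recursion T(m) = 3 T(m/2) + O(m) gives T(m) = O(m^log2(3)), which is where the fractional exponent comes from.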
The second procedure, which works as a plug-in within the first one and is tailored to the specific problem variant at hand, solves the numerical and/or Boolean matching problem with high efficiency. Some of the previously known constructions can be adapted to match or outperform several (but not all) problem variations handled by the construction proposed here. The latter aims to be a general tool that provides a unified solution for all problems of this kind.
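The locally bounded L1-norm matching problem described in the abstract can be pinned down with a brute-force reference check. This is a minimal O(nm) sketch of the problem definition only, assuming numeric sequences for text and pattern; it is not the algorithm proposed in the paper, and the function name and argument layout are hypothetical. Passing a list for tau models the variant in which each pattern position has its own local bound.

```python
def locally_bounded_l1_matches(text, pattern, b, tau):
    """Return start positions where pattern matches a window of text.

    A window matches if the sum of absolute differences is at most b
    and no single position differs by more than its local bound.
    tau may be one number (shared bound) or a list with one bound
    per pattern position.
    """
    n, m = len(text), len(pattern)
    bounds = list(tau) if isinstance(tau, (list, tuple)) else [tau] * m
    if len(bounds) != m:
        raise ValueError("need one local bound per pattern position")
    hits = []
    for i in range(n - m + 1):
        total = 0
        ok = True
        for j in range(m):
            d = abs(text[i + j] - pattern[j])
            if d > bounds[j]:      # local bound violated
                ok = False
                break
            total += d
        if ok and total <= b:      # global L1 bound satisfied
            hits.append(i)
    return hits


if __name__ == "__main__":
    text = [1, 2, 3, 4, 5, 9]
    pattern = [2, 3, 4]
    print(locally_bounded_l1_matches(text, pattern, b=2, tau=1))  # [1]
    print(locally_bounded_l1_matches(text, pattern, b=3, tau=1))  # [0, 1, 2]
```

A reference check of this kind is mainly useful for testing a faster implementation against small inputs.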

Original language: English
Pages (from-to): 23-29
Number of pages: 7
Journal: Theoretical Computer Science
Volume: 525
DOIs: 10.1016/j.tcs.2013.10.013
Publication status: Published - Mar 13 2014

Fingerprint

Strings
String Matching
Matching Problem
Fast Fourier transforms
Pattern matching
L1-norm
Convolution
Plug-in
Class
High Efficiency
Multiplication
Encoding
Efficient Algorithms
Character
Text
Framework

Keywords

  • Karatsuba's fast multiplication algorithm
  • Locally bounded L1-norm string matching on character classes
  • Pattern matching with character classes
  • Truncated L1-norm string matching on character classes

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Modulated string searching. / Apostolico, Alberto; Erdos, Péter L.; Miklós, I.; Siemons, Johannes.

In: Theoretical Computer Science, Vol. 525, 13.03.2014, p. 23-29.

@article{9348691ffe3d4b4aa3528a337f36947e,
title = "Modulated string searching",
keywords = "Karatsuba's fast multiplication algorithm, Locally bounded L1-norm string matching on character classes, Pattern matching with character classes, Truncated L1-norm string matching on character classes",
author = "Alberto Apostolico and Erdos, {P{\'e}ter L.} and I. Mikl{\'o}s and Johannes Siemons",
year = "2014",
month = "3",
day = "13",
doi = "10.1016/j.tcs.2013.10.013",
language = "English",
volume = "525",
pages = "23--29",
journal = "Theoretical Computer Science",
issn = "0304-3975",
publisher = "Elsevier",

}

TY - JOUR

T1 - Modulated string searching

AU - Apostolico, Alberto

AU - Erdos, Péter L.

AU - Miklós, I.

AU - Siemons, Johannes

PY - 2014/3/13

Y1 - 2014/3/13

KW - Karatsuba's fast multiplication algorithm

KW - Locally bounded L1-norm string matching on character classes

KW - Pattern matching with character classes

KW - Truncated L1-norm string matching on character classes

UR - http://www.scopus.com/inward/record.url?scp=84895930908&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84895930908&partnerID=8YFLogxK

U2 - 10.1016/j.tcs.2013.10.013

DO - 10.1016/j.tcs.2013.10.013

M3 - Article

AN - SCOPUS:84895930908

VL - 525

SP - 23

EP - 29

JO - Theoretical Computer Science

JF - Theoretical Computer Science

SN - 0304-3975

ER -