Module BC_StrMatch


module BC_StrMatch: sig  end
String matching algorithms
Author(s): Hideo Bannai

exception NotFound


substring matching



type suffix_tree
suffix tree type (not implemented)


type cdawg
compact direct acyclic word graph

val suffix_tree : string -> suffix_tree
creates a suffix_tree from a string (a dummy function which is not implemented)
val cdawg : string -> cdawg
creates a CDAWG from a string. The implementation is based on a persistent algorithm (used in mascdawg), and the run time is NOT linear to the length of the string. The text should consiste of printable ASCII characters. (At least it shouldn't contain \000)
val strstr : string -> string -> int
Substring matching. text -> pattern -> position return -1 if there is no match
val strstr_match : string -> string -> bool
Substring matching text -> pattern -> (position != -1)


VLDC matching



type mascdawg
minimum all suffix compact direct acyclic word graph

val mascdawg : string -> mascdawg
creates a MASCDAWG from a string. The run time is linear in terms of the number of states in the resulting graph. The text should consiste of printable ASCII characters. (At least it shouldn't contain \000)
val vldc_match : mascdawg -> ?k:int -> string -> bool
vldc_match mascdawg pattern returns true if the pattern matches (substring matching with wildcard) the text represented by mascdawg. The pattern should consist of printable ASCII characters (At least it shouldn't contain \000) '*' in the pattern represents a wildcard, which matches k or more characters. (Escaping '*' is currently not supported)

k : the minimum match length for a wild card.


approximate matching



type astrstr_mismatch = {
   sub : bool;
   ins : bool;
   del : bool;
}
mismatch types


type astrstr_pathtype =
| Text
| Pat
| Both
type for `backtrack path' of approximate matching

val string_of_astrstrmismatch : astrstr_mismatch -> string
convert the mismatch type to string
val astrstr_get_all_hit : int -> astrstr_mismatch -> string -> string -> int32 array array
Return match matrix mismatch_allowance -> text -> pattern -> match matrix of patterns in the string
val astrstr_mismatch_table : astrstr_mismatch -> string -> string -> int array array
Return match matrix mismatch_type -> text -> pattern -> min_mismatch table
val astrstr_path : astrstr_mismatch ->
string ->
string -> int * (int * int * int) * astrstr_pathtype list
  mismatch_type -> text -> pattern ->
  minimum mismatch * (text_beginx, textend, pat_beginy)
  text_begnix: a value from 0 to String.length text (inclusive)
                where in the text the match starts (text starts from 1)
                a 0 means that the first (few) letters of the pattern
                were deleted so that the suffix of the pattern matches
                the prefix of the test.
  pat_beginy: either 0 or 1. a 0 means that the first letter
               of the pattern was deleted when matching from text_beginx.
               1 means otherwise. note that this value and text_beginx cannot 
               be 0 at the same time.

val astrstr_match : int -> astrstr_mismatch -> string -> string -> bool
mismatch_allowance -> text -> pattern -> bool
val find_best_mismatch : astrstr_mismatch ->
int ->
string list ->
string list ->
(int -> int -> int -> int -> float) ->
string -> float * float * int * astrstr_mismatch
finds the best mismatch for a given string, for dividing two sets of strings according to some score function. The score function is assumed to follow the following property: (scoref xmax ymax xmax y > scoref scoref xmax ymax xmax y' when y < y') the astrstr_mismatch in the input is a mismatch mask, where only the combination of mismatch operations set to true can be performed.


subsequence matching



type dasg
direct acyclic subsequence graph

val dasg_aux : string -> int array -> string -> dasg
alphabet -> character map -> text -> DASG
val dasg : string -> dasg
text -> DASG
val dasg_match : dasg -> string -> bool
DASG -> pattern -> bool
val subseq_match : string -> string -> bool
text -> pattern -> bool


episode matching
(subsequence matching with length constraints)


type edasg
episode DASG

val edasg_aux : string ->
int array ->
int array array -> int array array -> string -> edasg
alphabet character map -> text -> EDASG
val edasg : string -> edasg
text -> EDASG
val edasg_mint : edasg -> string -> int
text(EDASG) -> pattern -> minimum threshold for match
val edasg_match : int -> edasg -> string -> bool
threshold -> text(EDASG) -> if match length was less than threshold
val episode : string -> string -> int
text -> pattern -> minimum size window threshold
val episode_match : int -> string -> string -> bool
threshold -> text -> pattern -> minwindow > threshold?
val find_best_mint : edasg list ->
edasg list ->
(int -> int -> int -> int -> float) ->
string -> int * (float * int * int) * (float * int * int)
finds the best threshold for a given string, for dividing two sets of strings with episode matching, according to some score function. The score function is assumed to follow the following property: (scoref xmax ymax xmax y > scoref scoref xmax ymax xmax y' when y < y')
val dasg_of_edasg : edasg -> dasg
conversion of EDASG to DASG
val edasg_to_dasag : edasg -> dasg
conversion of EDASG to DASG
val find_bestpat : pd:'a ->
nd:'b ->
alph:string ->
maxlen:int ->
scoref:'c ->
comppat:('d -> 'd -> int) ->
copypat:('d -> 'd) ->
maxval:float ->
makepat:('a -> 'b -> 'c -> string -> float * float * 'd) -> float * 'd
branch-bound algorithm for finding best pattern two divide two lists with pattern based on strings
  'a: type of text1
  'b: type of text2
  'c: type of pattern
  a pattern is some data type which is based on a string. 
           (e.g.: Approximate pattern: (mismatch,string))
  pd  : positive data  nd: negative data
  alph: a string containing all characters of the alphabet to generate patterns from
  maxlen: maximum length of pattern
  scoref: score function which calculates a score (float) and is given arguments:
          let pdy: the list of elements of pd, which matches with matcher
          let ndy: the list of elements of pd, which matches with matcher
          the length of lists in order: pd nd pdy ndy.
          A higher score is judged to be better.
  comppat: comparison function for a pattern, used in case the score is the same (e.g. shorter patterns are better)
  copypat: copy function for a pattern (it should copy any non-persistent data types contained (e.g.: String.copy))
  makepat: a function which takes a string and makes the best pattern which distinguishes the given texts. returns (pdmatchcount, ndmatchcount, score, pattern)