module BC_StrMatch: sig end
String matching algorithms
Author(s): Hideo Bannai
exception NotFound
type suffix_tree
suffix tree type (not implemented)
type cdawg
compact direct acyclic word graph
val suffix_tree : string -> suffix_tree
creates a suffix_tree from a string (a dummy function which is not implemented)
val cdawg : string -> cdawg
creates a CDAWG from a string. The implementation is based on a persistent
algorithm (used in mascdawg), and the run time is NOT linear to the
length of the string.
The text should consiste of printable ASCII characters.
(At least it shouldn't contain \000)
val strstr : string -> string -> int
Substring matching.
text -> pattern -> position
return -1 if there is no match
val strstr_match : string -> string -> bool
Substring matching
text -> pattern -> (position != -1)
type mascdawg
minimum all suffix compact direct acyclic word graph
val mascdawg : string -> mascdawg
creates a MASCDAWG from a string. The run time is linear in terms
of the number of states in the resulting graph.
The text should consiste of printable ASCII characters.
(At least it shouldn't contain \000)
val vldc_match : mascdawg -> ?k:int -> string -> bool
vldc_match mascdawg pattern returns true if the
pattern matches (substring matching with wildcard)
the text represented by mascdawg.
The pattern should consist of printable ASCII characters
(At least it shouldn't contain \000)
'*' in the pattern represents a wildcard, which matches k or more characters.
(Escaping '*' is currently not supported)
k : the minimum match length for a wild card.
type astrstr_mismatch = {
|
sub : bool; |
|
ins : bool; |
|
del : bool; |
}
mismatch types
type astrstr_pathtype =
type for `backtrack path' of approximate matching
val string_of_astrstrmismatch : astrstr_mismatch -> string
convert the mismatch type to string
val astrstr_get_all_hit : int -> astrstr_mismatch -> string -> string -> int32 array array
Return match matrix
mismatch_allowance -> text -> pattern -> match matrix of patterns in the string
val astrstr_mismatch_table : astrstr_mismatch -> string -> string -> int array array
Return match matrix
mismatch_type -> text -> pattern -> min_mismatch table
val astrstr_path : astrstr_mismatch ->
string ->
string -> int * (int * int * int) * astrstr_pathtype list
mismatch_type -> text -> pattern ->
minimum mismatch * (text_beginx, textend, pat_beginy)
text_begnix: a value from 0 to String.length text (inclusive)
where in the text the match starts (text starts from 1)
a 0 means that the first (few) letters of the pattern
were deleted so that the suffix of the pattern matches
the prefix of the test.
pat_beginy: either 0 or 1. a 0 means that the first letter
of the pattern was deleted when matching from text_beginx.
1 means otherwise. note that this value and text_beginx cannot
be 0 at the same time.
val astrstr_match : int -> astrstr_mismatch -> string -> string -> bool
mismatch_allowance -> text -> pattern -> bool
val find_best_mismatch : astrstr_mismatch ->
int ->
string list ->
string list ->
(int -> int -> int -> int -> float) ->
string -> float * float * int * astrstr_mismatch
finds the best mismatch for a given string, for dividing two sets
of strings according to some score function.
The score function is assumed to follow the following property:
(scoref xmax ymax xmax y > scoref scoref xmax ymax xmax y' when y < y')
the astrstr_mismatch in the input is a mismatch mask, where only the combination
of mismatch operations set to true can be performed.
type dasg
direct acyclic subsequence graph
val dasg_aux : string -> int array -> string -> dasg
alphabet -> character map -> text -> DASG
val dasg : string -> dasg
text -> DASG
val dasg_match : dasg -> string -> bool
DASG -> pattern -> bool
val subseq_match : string -> string -> bool
text -> pattern -> bool
(subsequence matching with length constraints)
type edasg
episode DASG
val edasg_aux : string ->
int array ->
int array array -> int array array -> string -> edasg
alphabet character map -> text -> EDASG
val edasg : string -> edasg
text -> EDASG
val edasg_mint : edasg -> string -> int
text(EDASG) -> pattern -> minimum threshold for match
val edasg_match : int -> edasg -> string -> bool
threshold -> text(EDASG) -> if match length was less than threshold
val episode : string -> string -> int
text -> pattern -> minimum size window threshold
val episode_match : int -> string -> string -> bool
threshold -> text -> pattern -> minwindow > threshold?
val find_best_mint : edasg list ->
edasg list ->
(int -> int -> int -> int -> float) ->
string -> int * (float * int * int) * (float * int * int)
finds the best threshold for a given string, for dividing two sets
of strings with episode matching, according to some score function.
The score function is assumed to follow the following property:
(scoref xmax ymax xmax y > scoref scoref xmax ymax xmax y' when y < y')
val dasg_of_edasg : edasg -> dasg
conversion of EDASG to DASG
val edasg_to_dasag : edasg -> dasg
conversion of EDASG to DASG
val find_bestpat : pd:'a ->
nd:'b ->
alph:string ->
maxlen:int ->
scoref:'c ->
comppat:('d -> 'd -> int) ->
copypat:('d -> 'd) ->
maxval:float ->
makepat:('a -> 'b -> 'c -> string -> float * float * 'd) -> float * 'd
branch-bound algorithm for finding best pattern two divide two lists with
pattern based on strings
'a: type of text1
'b: type of text2
'c: type of pattern
a pattern is some data type which is based on a string.
(e.g.: Approximate pattern: (mismatch,string))
pd : positive data nd: negative data
alph: a string containing all characters of the alphabet to generate patterns from
maxlen: maximum length of pattern
scoref: score function which calculates a score (float) and is given arguments:
let pdy: the list of elements of pd, which matches with matcher
let ndy: the list of elements of pd, which matches with matcher
the length of lists in order: pd nd pdy ndy.
A higher score is judged to be better.
comppat: comparison function for a pattern, used in case the score is the same (e.g. shorter patterns are better)
copypat: copy function for a pattern (it should copy any non-persistent data types contained (e.g.: String.copy))
makepat: a function which takes a string and makes the best pattern which distinguishes the given texts. returns (pdmatchcount, ndmatchcount, score, pattern)