FASTA files. The FASTA family of file formats has several incompatible descriptions (1, 2, 3, 4, etc.). Roughly, FASTA files have the following format:
    # comment
    # comment
    ...
    >description
    sequence
    >description
    sequence
    ...
Comment lines are allowed at the top of the file. Comments usually start with a '#' character but sometimes with a ';' character. The Biocaml_fasta.fmt properties configure which of these are accepted during parsing and printing.
Description lines begin with the '>' character. Various conventions are used for the content but there is no requirement. We simply return the string following the '>' character.
Sequences are most often a sequence of characters denoting nucleotides or amino acids, and thus an item's sequence field is set to a string. Sequences may span multiple lines.
However, sequence lines are sometimes used to provide quality scores, either as space-separated integers or as ASCII-encoded scores. To support the former case, we provide the Biocaml_fasta.sequence_to_int_list function. For the latter case, see modules Phred_score and Solexa_score.
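To make the space-separated integer case concrete, here is a minimal sketch of such a conversion. It is not the library's implementation: the name int_list_of_sequence is hypothetical, it uses only the OCaml standard library, and it returns a plain result value rather than Core.Std.Or_error.t. Tolerating repeated spaces is also an assumption.

```ocaml
(* Hypothetical stand-in for a sequence-to-int-list conversion.
   Assumption: tokens are separated by one or more spaces. *)
let int_list_of_sequence (s : string) : (int list, string) result =
  let tokens =
    String.split_on_char ' ' s
    |> List.filter (fun tok -> tok <> "")  (* drop empty tokens from repeated spaces *)
  in
  let rec loop acc = function
    | [] -> Ok (List.rev acc)
    | tok :: rest ->
      (match int_of_string_opt tok with
       | Some n -> loop (n :: acc) rest
       | None -> Error (Printf.sprintf "invalid integer: %S" tok))
  in
  loop [] tokens
```

For example, applying this function to the line "40 40 38 12" yields Ok [40; 40; 38; 12], while a line containing a non-numeric token yields an Error.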
FASTA files are used to provide both short sequences and very large sequences, e.g. a genome. In the latter case, the main API of this module, which returns each sequence as an in-memory string, might be too costly. Consider using the Biocaml_fasta.read0 function instead, which does not merge multiple sequence lines into one string. This API is slightly more difficult to use but perhaps a worthwhile trade-off.
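The following sketch illustrates why working with unmerged sequence lines can be cheaper. It computes the length of each sequence without ever concatenating its pieces. Several things here are assumptions made so that the example compiles without Biocaml: the local item0 type only mirrors the variants documented for this module, the input is a plain list rather than the stream of Or_error values that read0 returns, and sequence_lengths is a hypothetical helper.

```ocaml
(* Local mirror of the documented item0 variants (assumption: the real type
   is private and provided by Biocaml_fasta). *)
type item0 =
  [ `Comment of string
  | `Description of string
  | `Empty_line
  | `Partial_sequence of string ]

(* Return (description, total sequence length) pairs, in file order,
   without building any merged sequence string. *)
let sequence_lengths (items : item0 list) : (string * int) list =
  let flush current acc =
    match current with
    | None -> acc
    | Some entry -> entry :: acc
  in
  let current, acc =
    List.fold_left
      (fun (current, acc) item ->
         match item with
         | `Description d -> (Some (d, 0), flush current acc)
         | `Partial_sequence s ->
           (match current with
            | Some (d, n) -> (Some (d, n + String.length s), acc)
            | None -> (None, acc))  (* sequence data before any description: ignored in this sketch *)
         | `Comment _ | `Empty_line -> (current, acc))
      (None, []) items
  in
  List.rev (flush current acc)
```

Only one running counter per item is kept in memory, which is the kind of trade-off the paragraph above describes.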
Some FASTA files include very large sequences on a single line. This is discouraged and not well supported by this module. Functions in this module require memory proportional to the length of a line, so a whole chromosomal sequence on a single line will consume a large amount of memory. In practice this may not be a problem, given the RAM available on most computers.
Format Specifiers:
Variations in the format are controlled by the following settings, all of which have a default value. These properties are combined into the Biocaml_fasta.fmt type for convenience and the defaults into Biocaml_fasta.default_fmt.
allow_sharp_comments: Allow comment lines beginning with a '#' character. Default: true.
allow_semicolon_comments: Allow comment lines beginning with a ';' character. Default: false.
Setting both allow_sharp_comments and allow_semicolon_comments to true allows both comment styles. Setting both to false disallows comment lines.
allow_empty_lines: Allow lines with only whitespace anywhere in the file. Default: false.
comments_only_at_top: Allow comments only at the top of the file. If false, comment lines can occur anywhere but only the ones at the top are returned. The rest are ignored. Default: true.
max_line_length: Require sequence lines to be shorter than the given length. None means there is no restriction. Note this does not restrict the length of an item's sequence field because this can span multiple lines. Default: None.
alphabet: Require sequence characters to be at most those in the given string. None means any character is allowed. Default: None.
module Biocaml_fasta:
sig
type header = private string list
type item = private {
  description : string;
  sequence : string;
}
type fmt = {
  allow_sharp_comments : bool;
  allow_semicolon_comments : bool;
  allow_empty_lines : bool;
  comments_only_at_top : bool;
  max_line_length : int option;
  alphabet : string option;
}
val default_fmt : fmt
val sequence_to_int_list : string -> int list Core.Std.Or_error.t
type item0 = private
[< `Comment of string
| `Description of string
| `Empty_line
| `Partial_sequence of string ]
item0 is more raw than item. It is useful for parsing files with large sequences because you get the sequence in smaller pieces.
`Comment _ - Single comment line without the final newline. Initial comment char is retained.
`Empty_line - Got a line with only whitespace characters. The contents are not provided.
`Description _ - Single description line without the initial '>' or the final newline.
`Partial_sequence _ - Multiple sequential partial sequences comprise the sequence of a single item.
val parse_item0 : ?allow_sharp_comments:bool ->
?allow_semicolon_comments:bool ->
?allow_empty_lines:bool ->
?max_line_length:int ->
?alphabet:string ->
Biocaml_internal_utils.Line.t -> item0 Core.Std.Or_error.t
val read0 : ?start:Biocaml_internal_utils.Pos.t ->
?allow_sharp_comments:bool ->
?allow_semicolon_comments:bool ->
?allow_empty_lines:bool ->
?max_line_length:int ->
?alphabet:string ->
Pervasives.in_channel ->
item0 Core.Std.Or_error.t Biocaml_internal_utils.Stream.t
end
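As an illustration of how the format settings and the item0 variants fit together, here is a rough sketch of a line classifier in the spirit of parse_item0. It is not the library's code. The name classify_line, the local item0 type, the use of a plain result with string errors, and the input as a bare string (rather than Biocaml_internal_utils.Line.t) are all assumptions. So are two behaviors the documentation does not specify: a comment character that is not allowed is reported as an error, and max_line_length is read literally as "strictly shorter than".

```ocaml
(* Local mirror of the documented item0 variants (assumption). *)
type item0 =
  [ `Comment of string
  | `Description of string
  | `Empty_line
  | `Partial_sequence of string ]

(* True if the line contains only whitespace (or is empty). *)
let is_blank (s : string) : bool =
  let blank = ref true in
  String.iter
    (fun c -> if not (c = ' ' || c = '\t' || c = '\r') then blank := false)
    s;
  !blank

(* Hypothetical classifier; the optional arguments use the documented defaults. *)
let classify_line
    ?(allow_sharp_comments = true)
    ?(allow_semicolon_comments = false)
    ?(allow_empty_lines = false)
    ?max_line_length
    ?alphabet
    (line : string) : (item0, string) result =
  let n = String.length line in
  if is_blank line then
    (if allow_empty_lines then Ok `Empty_line
     else Error "empty line not allowed")
  else
    match line.[0] with
    | '>' -> Ok (`Description (String.sub line 1 (n - 1)))  (* drop the leading '>' *)
    | '#' when allow_sharp_comments -> Ok (`Comment line)   (* comment char retained *)
    | ';' when allow_semicolon_comments -> Ok (`Comment line)
    | '#' | ';' -> Error "comment style not allowed"        (* assumption: treated as an error *)
    | _ ->
      let too_long =
        match max_line_length with
        | Some m -> n >= m  (* assumption: line must be strictly shorter than m *)
        | None -> false
      in
      let bad_char =
        match alphabet with
        | Some a ->
          let bad = ref false in
          String.iter (fun c -> if not (String.contains a c) then bad := true) line;
          !bad
        | None -> false
      in
      if too_long then Error "sequence line too long"
      else if bad_char then Error "character outside alphabet"
      else Ok (`Partial_sequence line)
```

With the defaults, a line such as ">seq1 human" is classified as `Description "seq1 human", and passing ~alphabet:"ACGT" rejects a sequence line containing any other character.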