Module Biocaml_fasta (.ml)

FASTA files. The FASTA family of file formats has several mutually incompatible descriptions (1, 2, 3, 4, etc.). Roughly, FASTA files have the following format:

    # comment
    # comment
    ...
    >description
    sequence
    >description
    sequence
    ...
   

Comment lines are allowed at the top of the file. Comments usually start with a '#' character, but sometimes with a ';' character. The Biocaml_fasta.fmt properties configure which of the two is accepted during parsing and printing.

Description lines begin with the '>' character. Various conventions are used for the content but there is no requirement. We simply return the string following the '>' character.

Sequences are most often a sequence of characters denoting nucleotides or amino acids, and thus an item's sequence field is set to a string. Sequences may span multiple lines.
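As an illustration of how multi-line sequences collapse into a single string per item, here is a minimal sketch that uses only the OCaml standard library. The record type and the function records_of_lines are hypothetical names chosen for this example; they are not part of Biocaml_fasta and do not claim to reproduce its parser. Lines that appear before the first description line (such as comments) are simply skipped here.

```ocaml
(* Illustrative stand-in for an item; not the library's private type. *)
type record = { description : string; sequence : string }

(* Merge the sequence lines that follow each '>' line into one string. *)
let records_of_lines (lines : string list) : record list =
  let finish descr buf acc =
    { description = descr; sequence = Buffer.contents buf } :: acc
  in
  let rec go current acc = function
    | [] ->
      (match current with
       | None -> List.rev acc
       | Some (d, buf) -> List.rev (finish d buf acc))
    | line :: rest when String.length line > 0 && line.[0] = '>' ->
      (* A new description closes the item being built, if any. *)
      let acc =
        match current with
        | None -> acc
        | Some (d, buf) -> finish d buf acc
      in
      let descr = String.sub line 1 (String.length line - 1) in
      go (Some (descr, Buffer.create 64)) acc rest
    | line :: rest ->
      (match current with
       | Some (_, buf) -> Buffer.add_string buf line
       | None -> ()); (* before the first description: skipped in this sketch *)
      go current acc rest
  in
  go None [] lines
```

For example, the lines ">a", "ACGT", "AC" yield one record whose sequence is the concatenation of the two sequence lines.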

However, sequence lines are sometimes used to provide quality scores, either as space-separated integers or as ASCII-encoded scores. To support the former case, we provide the Biocaml_fasta.sequence_to_int_list function. For the latter case, see modules Phred_score and Solexa_score.
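The conversion from space-separated integers can be sketched with the standard library alone. The function ints_of_sequence below is a hypothetical stand-in written for this example: it returns the built-in result type rather than Core's Or_error.t, and its handling of repeated spaces and its error message are assumptions, not documented behavior of Biocaml_fasta.sequence_to_int_list.

```ocaml
(* Hypothetical stand-in for sequence_to_int_list, stdlib only. *)
let ints_of_sequence (s : string) : (int list, string) result =
  let tokens =
    String.split_on_char ' ' s
    |> List.filter (fun tok -> tok <> "") (* assumption: tolerate repeated spaces *)
  in
  let rec convert acc = function
    | [] -> Ok (List.rev acc)
    | tok :: rest ->
      (match int_of_string_opt tok with
       | Some n -> convert (n :: acc) rest
       | None -> Error (Printf.sprintf "not an integer: %S" tok))
  in
  convert [] tokens
```

With this sketch, a quality line such as "40 40 38 12" becomes Ok [40; 40; 38; 12], while a line containing a non-numeric token produces an Error.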

FASTA files are used to provide both short sequences and very long sequences, e.g. a genome. In the latter case, the main API of this module, which returns each sequence as an in-memory string, might be too costly. Consider instead using the Biocaml_fasta.read0 function, which does not merge multiple sequence lines into one string. This API is slightly more difficult to use, but that may be a worthwhile trade-off.
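To show why unmerged pieces can be enough, the sketch below computes the length of each sequence without ever concatenating its lines. It is only an illustration under stated assumptions: raw_item is a local type that mirrors the shape of item0, a plain list stands in for the stream that read0 returns, and sequence_lengths is a hypothetical helper, not a function of this module.

```ocaml
(* Local mirror of the low-level item shape; not the library's type. *)
type raw_item =
  | Comment of string
  | Description of string
  | Empty_line
  | Partial_sequence of string

(* Produce (description, total length) pairs, one piece at a time. *)
let sequence_lengths (items : raw_item list) : (string * int) list =
  let flush current acc =
    match current with
    | None -> acc
    | Some (descr, len) -> (descr, len) :: acc
  in
  let current, acc =
    List.fold_left
      (fun (current, acc) item ->
         match item, current with
         | Description d, _ -> (Some (d, 0), flush current acc)
         | Partial_sequence s, Some (d, len) ->
           (Some (d, len + String.length s), acc)
         | Partial_sequence _, None ->
           (current, acc) (* piece before any description: ignored here *)
         | (Comment _ | Empty_line), _ -> (current, acc))
      (None, []) items
  in
  List.rev (flush current acc)
```

Only a running counter is kept per item, so memory use depends on the longest single line rather than on the full sequence.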

Some FASTA files include very large sequences on a single line. This is discouraged and not well supported by this module. Functions in this module require memory proportional to the length of a line. Thus, a whole chromosomal sequence on a single line will consume a large amount of memory. In practice, this might not be a problem given the RAM available on most computers.

Format Specifiers:

Variations in the format are controlled by the following settings, all of which have a default value. These properties are combined into the Biocaml_fasta.fmt type for convenience and the defaults into Biocaml_fasta.default_fmt.

Setting both allow_sharp_comments and allow_semicolon_comments to true allows both comment styles. Setting both to false disallows comment lines.
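The interaction of the two comment settings can be sketched as a small predicate. The function is_allowed_comment is hypothetical and written only for this example; how the actual parser reacts to a comment line that is not allowed is not described here, so the sketch only answers whether a line counts as an accepted comment.

```ocaml
(* Hypothetical helper: does [line] count as an accepted comment line
   under the two flags? Mirrors the field names of fmt, nothing more. *)
let is_allowed_comment ~allow_sharp_comments ~allow_semicolon_comments line =
  String.length line > 0
  && (match line.[0] with
      | '#' -> allow_sharp_comments
      | ';' -> allow_semicolon_comments
      | _ -> false)
```

With both flags set to false, no line is accepted as a comment, which matches the statement above that comment lines are then disallowed.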


module Biocaml_fasta: 
sig
type header = private string list 
A header is a list of comment lines.
type item = private {
   description : string;
   sequence : string;
}
type fmt = {
   allow_sharp_comments : bool;
   allow_semicolon_comments : bool;
   allow_empty_lines : bool;
   comments_only_at_top : bool;
   max_line_length : int option;
   alphabet : string option;
}
val default_fmt : fmt
val sequence_to_int_list : string -> int list Core.Std.Or_error.t
Parse a space-separated list of integers.

Low-level Parsing


type item0 = private [< `Comment of string
| `Description of string
| `Empty_line
| `Partial_sequence of string ]
An item0 is a lower-level representation than item. It is useful for parsing files with large sequences because you get the sequence in smaller pieces.

  • `Comment _ - Single comment line without the final newline. Initial comment char is retained.
  • `Empty_line - Got a line with only whitespace characters. The contents are not provided.
  • `Description _ - Single description line without the initial '>' or the final newline.
  • `Partial_sequence _ - Multiple sequential partial sequences comprise the sequence of a single item.
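The four cases above can be illustrated by a stdlib-only classifier for a single line that has already had its newline removed. The function classify_line is a hypothetical sketch, not the implementation of parse_item0: it accepts both comment characters unconditionally and ignores the max_line_length and alphabet checks.

```ocaml
(* Hypothetical sketch: map one line (newline already stripped) to a
   polymorphic variant shaped like item0. *)
let classify_line (line : string) =
  if String.trim line = "" then `Empty_line
  else
    match line.[0] with
    | '>' -> `Description (String.sub line 1 (String.length line - 1))
    | '#' | ';' -> `Comment line (* initial comment char is retained *)
    | _ -> `Partial_sequence line
```

For instance, ">seq1 human" maps to `Description "seq1 human", while "# note" maps to `Comment "# note" with the leading '#' kept.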

val parse_item0 : ?allow_sharp_comments:bool ->
?allow_semicolon_comments:bool ->
?allow_empty_lines:bool ->
?max_line_length:int ->
?alphabet:string ->
Biocaml_internal_utils.Line.t -> item0 Core.Std.Or_error.t
val read0 : ?start:Biocaml_internal_utils.Pos.t ->
?allow_sharp_comments:bool ->
?allow_semicolon_comments:bool ->
?allow_empty_lines:bool ->
?max_line_length:int ->
?alphabet:string ->
Pervasives.in_channel ->
item0 Core.Std.Or_error.t Biocaml_internal_utils.Stream.t

Input/Output


val read : ?start:Biocaml_internal_utils.Pos.t ->
?fmt:fmt ->
Pervasives.in_channel ->
(header *
item Core.Std.Or_error.t Biocaml_internal_utils.Stream.t)
Core.Std.Or_error.t
val with_file : ?fmt:fmt ->
string ->
f:(header ->
item Core.Std.Or_error.t Biocaml_internal_utils.Stream.t ->
'a Core.Std.Or_error.t) ->
'a Core.Std.Or_error.t
end