FASTA files. The FASTA family of file formats has several mutually incompatible descriptions (1, 2, 3, etc.). Roughly, FASTA files have the following format:
# comment
# comment
...
>header
sequence
>header
sequence
...
where the sequence may span multiple lines, and a ';' may be used instead of '#' to start comments.
Header lines begin with the '>' character. By convention, all characters up to the first whitespace define the name of the content, and any characters after it carry additional information in a format specific to the file provider.
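The name/description convention described above can be sketched with the standard library alone. The function name split_header is hypothetical and is not part of Biocaml_fasta; it assumes the leading '>' has already been removed.

```ocaml
(* Hypothetical helper, not part of Biocaml_fasta.
   Split a header (without the leading '>') into (name, description).
   The name is everything up to the first space or tab; the description
   is the remainder with surrounding whitespace trimmed. *)
let split_header (header : string) : string * string =
  let is_ws c = c = ' ' || c = '\t' in
  let n = String.length header in
  (* Index of the first whitespace character, or n if there is none. *)
  let rec first_ws i =
    if i >= n || is_ws header.[i] then i else first_ws (i + 1)
  in
  let i = first_ws 0 in
  let name = String.sub header 0 i in
  let description = String.trim (String.sub header i (n - i)) in
  (name, description)
```

A header with no whitespace yields an empty description, so callers that only need the name can ignore the second component.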
Sequences are most often strings of characters denoting
nucleotides or amino acids. However, FASTA files sometimes provide
quality scores instead, either ASCII-encoded, e.g. as supported by
modules Biocaml_phred_score and Biocaml_solexa_score, or as space-separated integers.
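As an illustration of the space-separated integer form, the following sketch parses one sequence line into an int list. The name int_seq_of_line is an assumption made for this example and does not come from Biocaml_fasta.

```ocaml
(* Hypothetical helper, not part of Biocaml_fasta.
   Parse one line of space-separated integers into an int list.
   Returns Error with the offending token if a field is not an integer.
   Note that int_of_string_opt also accepts OCaml literal forms such
   as 0x1F, which a stricter parser might want to reject. *)
let int_seq_of_line (line : string) : (int list, string) result =
  let tokens =
    String.split_on_char ' ' line |> List.filter (fun t -> t <> "")
  in
  let rec go acc = function
    | [] -> Ok (List.rev acc)
    | t :: rest ->
      (match int_of_string_opt t with
       | Some n -> go (n :: acc) rest
       | None -> Error t)
  in
  go [] tokens
```

Runs of several spaces are tolerated because empty tokens are filtered out before conversion.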
Thus, the FASTA format is really a family of formats with a fairly loose specification of the header and content formats. The only consistently followed meaning of the format is:
We use the term sequence to generically
mean either kind of data found in the sequence lines, char_seq
to mean specifically a sequence of characters, and int_seq to
mean specifically a sequence of integers.
module Biocaml_fasta: sig
type char_seq = string
type int_seq = int list
type 'a item = {
  header :
  sequence :
}
module Tags: sig
type t = {
  forbid_empty_lines :
  only_header_comment :
  sharp_comments :
  semicolon_comments :
  max_items_per_line :
  sequence :
}
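val char_sequence_default : t
Default tags for character sequences (char_seq).
val int_sequence_default : t
Default tags for integer sequences (int_seq).
val is_char_sequence : t -> bool
Tests the value of t.sequence.
val is_int_sequence : t -> bool
Tests the value of t.sequence.
end
module Error: sig
The errors of the Fasta module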
are defined here. Type t is the union of all errors, and subsets
of this are defined as needed to specify precise return types for
various functions.
`empty_line pos - an empty line was found in a position pos
where it is not allowed.
`incomplete_input (lines,s) - the input ended
prematurely. Trailing contents, which cannot be used to fully
construct an item, are provided: lines are the complete lines
parsed and s is any final string not ending in a newline.
`malformed_partial_sequence s - indicates that s could not
be parsed into a valid (partial) sequence value.
`sequence_is_too_long s - indicates that s is longer than
allowed by `max_items_per_line.
`unnamed_char_seq x - a char_seq value x was found without
a preceding header section.
`unnamed_int_seq x - an int_seq value x was found without
a preceding header section.
type string_to_raw_item = [ `empty_line of Biocaml_pos.t
| `incomplete_input of Biocaml_pos.t * string list * string option
| `malformed_partial_sequence of Biocaml_pos.t * string
| `sequence_is_too_long of Biocaml_pos.t * string ]
Errors that can occur when parsing a string into a Biocaml_fasta.raw_item.
type t = [ `empty_line of Biocaml_pos.t
| `incomplete_input of Biocaml_pos.t * string list * string option
| `malformed_partial_sequence of Biocaml_pos.t * string
| `sequence_is_too_long of Biocaml_pos.t * string
| `unnamed_char_seq of Biocaml_fasta.char_seq
| `unnamed_int_seq of Biocaml_fasta.int_seq ]
val sexp_of_string_to_raw_item : string_to_raw_item -> Sexplib.Sexp.t
val string_to_raw_item_of_sexp : Sexplib.Sexp.t -> string_to_raw_item
end
In_channel Functions
exception Error of Error.t
val in_channel_to_char_seq_item_stream : ?buffer_size:int ->
?filename:string ->
?tags:Tags.t ->
Pervasives.in_channel ->
(char_seq item, [> Error.t ])
Core.Result.t Stream.t
Returns a stream of char_seq item results.
val in_channel_to_int_seq_item_stream : ?buffer_size:int ->
?filename:string ->
?tags:Tags.t ->
Pervasives.in_channel ->
(int_seq item, [> Error.t ])
Core.Result.t Stream.t
Returns a stream of int_seq item results.
val in_channel_to_char_seq_item_stream_exn : ?buffer_size:int ->
?filename:string ->
?tags:Tags.t ->
Pervasives.in_channel -> char_seq item Stream.t
Returns a stream of char_seq items. Comments are
discarded. Stream.next will raise Error _ in case of any error.
val in_channel_to_int_seq_item_stream_exn : ?buffer_size:int ->
?filename:string ->
?tags:Tags.t ->
Pervasives.in_channel -> int_seq item Stream.t
Returns a stream of int_seq items. Comments are
discarded. Stream.next will raise Error _ in case of any error.
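The relationship between the result-returning functions and their _exn counterparts can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard Seq module instead of Stream, a string instead of Error.t, and the names Parse_error and to_exn are hypothetical.

```ocaml
(* Hypothetical stand-in for the module's Error exception. *)
exception Parse_error of string

(* Turn a sequence of results into a sequence of plain values.
   The exception is raised lazily, when the erroneous element is forced,
   which mirrors an error surfacing on a later call to Stream.next. *)
let to_exn (s : ('a, string) result Seq.t) : 'a Seq.t =
  Seq.map
    (function
      | Ok x -> x
      | Error e -> raise (Parse_error e))
    s
```

With the result-based form, the caller decides per element how to handle an error; with the exception-based form, the first error interrupts the traversal.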
A raw_item represents an intermediate level of parsing, between
a plain string and an item. Working with raw_items can be
useful for various reasons. You may want to work with comments,
which are not kept in items. Also, the extra work required to
parse to an item may be unnecessary for the analysis you will
do, so using raw_items can be more efficient.
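The two parsing levels described above can be sketched with a simplified, char-sequence-only stand-in for raw_item. All names here (raw_item constructors, raw_item_of_line, items_of_raw_items) are assumptions made for this example; the real module uses polymorphic variants, tags and richer error values.

```ocaml
(* Simplified stand-in for the raw_item type, char sequences only. *)
type raw_item =
  | Comment of string
  | Header of string
  | Partial_sequence of string

(* Classify one line (without its newline). Empty lines yield None. *)
let raw_item_of_line (line : string) : raw_item option =
  let n = String.length line in
  if n = 0 then None
  else
    match line.[0] with
    | '#' | ';' -> Some (Comment (String.sub line 1 (n - 1)))
    (* Strip the '>' and any surrounding whitespace. *)
    | '>' -> Some (Header (String.trim (String.sub line 1 (n - 1))))
    | _ -> Some (Partial_sequence line)

(* Group raw items into (header, sequence) pairs: partial sequences are
   concatenated and comments are dropped. A sequence with no preceding
   header is reported as Error, in the spirit of `unnamed_char_seq. *)
let items_of_raw_items (raws : raw_item list)
  : ((string * string) list, string) result =
  let close_item current acc =
    match current with
    | None -> acc
    | Some (h, buf) -> (h, Buffer.contents buf) :: acc
  in
  let rec go current acc = function
    | [] -> Ok (List.rev (close_item current acc))
    | Comment _ :: rest -> go current acc rest
    | Header h :: rest ->
      go (Some (h, Buffer.create 64)) (close_item current acc) rest
    | Partial_sequence s :: rest ->
      (match current with
       | None -> Error s
       | Some (_, buf) -> Buffer.add_string buf s; go current acc rest)
  in
  go None [] raws
```

Code that only needs comments or headers can stop at raw_item_of_line and skip the concatenation done in items_of_raw_items.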
type 'a raw_item = [ `comment of string | `header of string | `partial_sequence of 'a ]
`comment _ - a single comment line without the final
newline.
`header _ - a single header line without the initial '>',
the whitespace following it, or the final newline.
`partial_sequence _ - either a sequence of characters,
represented as a string, or a sequence of space-separated
integers, represented by an int list. The value does not
necessarily carry the complete content associated with a
header. It may be only part of the sequence, which can be useful
for files with large sequences (e.g. genomic sequence
files).
val in_channel_to_char_seq_raw_item_stream : ?buffer_size:int ->
?filename:string ->
?tags:Tags.t ->
Pervasives.in_channel ->
(char_seq raw_item, [> Error.t ])
Core.Result.t Stream.t
Returns a stream of char_seq raw_item results.
val in_channel_to_int_seq_raw_item_stream : ?buffer_size:int ->
?filename:string ->
?tags:Tags.t ->
Pervasives.in_channel ->
(int_seq raw_item, [> Error.t ])
Core.Result.t Stream.t
Returns a stream of int_seq raw_item results.
val in_channel_to_char_seq_raw_item_stream_exn : ?buffer_size:int ->
?filename:string ->
?tags:Tags.t ->
Pervasives.in_channel ->
char_seq raw_item Stream.t
Returns a stream of char_seq raw_items. Comments are
discarded. Stream.next will raise Error _ in case of any error.
val in_channel_to_int_seq_raw_item_stream_exn : ?buffer_size:int ->
?filename:string ->
?tags:Tags.t ->
Pervasives.in_channel ->
int_seq raw_item Stream.t
Returns a stream of int_seq raw_items. Comments are discarded.
Stream.next will raise Error _ in case of any error.
val char_seq_raw_item_to_string : char_seq raw_item -> string
Converts a raw_item to a string (comments are ignored). End-of-line
characters are included.
val int_seq_raw_item_to_string : int_seq raw_item -> string
Converts a raw_item to a string (comments are ignored). End-of-line
characters are included.
module Transform: sig
char_seq Items
val string_to_char_seq_raw_item : ?filename:string ->
?tags:Biocaml_fasta.Tags.t ->
unit ->
(string,
(Biocaml_fasta.char_seq Biocaml_fasta.raw_item, [> Biocaml_fasta.Error.t ])
Core.Result.t)
Biocaml_transform.t
val char_seq_raw_item_to_item : unit ->
(Biocaml_fasta.char_seq Biocaml_fasta.raw_item,
(Biocaml_fasta.char_seq Biocaml_fasta.item,
[> `unnamed_char_seq of Biocaml_fasta.char_seq ])
Core.Result.t)
Biocaml_transform.t
Transforms char_seq raw_items into char_seq
items. Comments are discarded.
char_seq Items
val char_seq_item_to_raw_item : ?tags:Biocaml_fasta.Tags.t ->
unit ->
(Biocaml_fasta.char_seq Biocaml_fasta.item,
Biocaml_fasta.char_seq Biocaml_fasta.raw_item)
Biocaml_transform.t
Transforms char_seq items into a stream of char_seq
raw_items, where lines are cut at items_per_line
characters (items_per_line is defined with the
`max_items_per_line _ tag; if not specified, the default is
80).
val char_seq_raw_item_to_string : ?tags:Biocaml_fasta.Tags.t ->
unit ->
(Biocaml_fasta.char_seq Biocaml_fasta.raw_item, string) Biocaml_transform.t
Prints char_seq items. Comments will be ignored if
neither of the tags `sharp_comments or
`semicolon_comments is provided.
int_seq Items
val string_to_int_seq_raw_item : ?filename:string ->
?tags:Biocaml_fasta.Tags.t ->
unit ->
(string,
(Biocaml_fasta.int_seq Biocaml_fasta.raw_item, [> Biocaml_fasta.Error.t ])
Core.Result.t)
Biocaml_transform.t
val int_seq_raw_item_to_item : unit ->
(Biocaml_fasta.int_seq Biocaml_fasta.raw_item,
(Biocaml_fasta.int_seq Biocaml_fasta.item,
[> `unnamed_int_seq of Biocaml_fasta.int_seq ])
Core.Result.t)
Biocaml_transform.t
Transforms int_seq raw_items into int_seq
items. Comments are discarded.
int_seq Items
val int_seq_item_to_raw_item : ?tags:Biocaml_fasta.Tags.t ->
unit ->
(Biocaml_fasta.int_seq Biocaml_fasta.item,
Biocaml_fasta.int_seq Biocaml_fasta.raw_item)
Biocaml_transform.t
Transforms int_seq items into a stream of int_seq
raw_items; the default line-cutting threshold is 27
(cf. Biocaml_fasta.Tags.t).
val int_seq_raw_item_to_string : ?tags:Biocaml_fasta.Tags.t ->
unit ->
(Biocaml_fasta.int_seq Biocaml_fasta.raw_item, string) Biocaml_transform.t
Prints int_seq items. Comments will be ignored if no
*_comments tag is provided.
end
module Random: sig
type specification = [ `non_sequence_probability of float | `tags of Biocaml_fasta.Tags.t ]
The specification for generating random 'a raw_item
values is a list of specification values.
`non_sequence_probability f means that the output will not be a `partial_sequence _ item with probability f.
`tags t specifies which Tags.t should be respected.
val specification_of_string : string ->
(specification list,
[> `fasta of [> `parse_specification of exn ] ])
Core.Std.Result.t
Parses a specification from a string. Right now, the DSL is
based on S-Expressions.
: [> specification ] list -> Biocaml_fasta.Tags.t option
Gets the Tags.t in the specification, if any.
val unit_to_random_char_seq_raw_item : [> specification ] list ->
((unit, Biocaml_fasta.char_seq Biocaml_fasta.raw_item) Biocaml_transform.t,
[> `inconsistent_tags of [> `int_sequence ] ])
Core.Result.t
Generates random char_seq
raw_item values according to the specification.
end
val sexp_of_char_seq : char_seq -> Sexplib.Sexp.t
val char_seq_of_sexp : Sexplib.Sexp.t -> char_seq
val sexp_of_int_seq : int_seq -> Sexplib.Sexp.t
val int_seq_of_sexp : Sexplib.Sexp.t -> int_seq
val sexp_of_item : ('a -> Sexplib.Sexp.t) -> 'a item -> Sexplib.Sexp.t
val item_of_sexp : (Sexplib.Sexp.t -> 'a) -> Sexplib.Sexp.t -> 'a item
val sexp_of_raw_item : ('a -> Sexplib.Sexp.t) -> 'a raw_item -> Sexplib.Sexp.t
val raw_item_of_sexp : (Sexplib.Sexp.t -> 'a) -> Sexplib.Sexp.t -> 'a raw_item
end
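The line cutting that the Transform module documents for char_seq_item_to_raw_item (lines cut at items_per_line characters, default 80) can be illustrated with a small standalone function. The name wrap_sequence and its ?width parameter are hypothetical; this is a sketch of the idea, not the library's implementation.

```ocaml
(* Hypothetical helper, not part of Biocaml_fasta.
   Cut a character sequence into chunks of at most [width] characters.
   The default of 80 follows the default stated for char_seq items. *)
let wrap_sequence ?(width = 80) (seq : string) : string list =
  if width <= 0 then invalid_arg "wrap_sequence: width must be positive";
  let n = String.length seq in
  let rec go pos acc =
    if pos >= n then List.rev acc
    else
      let len = min width (n - pos) in
      go (pos + len) (String.sub seq pos len :: acc)
  in
  go 0 []
```

Each returned chunk would correspond to one partial sequence in the output, with only the last chunk allowed to be shorter than the width.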