FASTA files. The FASTA family of file formats has several mutually incompatible descriptions (1, 2, 3, etc.). Roughly, FASTA files are in the format:
    # comment
    # comment
    ...
    >header
    sequence
    >header
    sequence
    ...
where the sequence may span multiple lines, and a ';' may be used instead of '#' to start comments.
Header lines begin with the '>' character. It is often considered that all characters until the first whitespace define the name of the content, and any characters beyond that define additional information in a format specific to the file provider.
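The name convention described above can be sketched in a few lines. This is an illustration only: `split_header` is a hypothetical helper, not a Biocaml function, and it assumes the header has already been stripped of its leading '>' and that only spaces and tabs count as whitespace.

```ocaml
(* Hypothetical helper, not part of Biocaml: split a header line (without
   the leading '>') into a name and an optional description at the first
   space or tab. *)
let split_header (header : string) : string * string option =
  let is_blank c = c = ' ' || c = '\t' in
  let n = String.length header in
  (* Index of the first whitespace character, or [n] if there is none. *)
  let rec first_blank i =
    if i >= n || is_blank header.[i] then i else first_blank (i + 1)
  in
  let i = first_blank 0 in
  if i = n then (header, None)
  else
    let rest = String.trim (String.sub header (i + 1) (n - i - 1)) in
    (String.sub header 0 i, if rest = "" then None else Some rest)

let () =
  let name, description =
    split_header "sp|P69905|HBA_HUMAN Hemoglobin subunit alpha"
  in
  Printf.printf "name=%s description=%s\n" name
    (match description with Some d -> d | None -> "(none)")
```

The description is returned as an uninterpreted string because, as noted above, its format is specific to the file provider.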
Sequences are most often strings of characters denoting nucleotides or amino acids. However, sometimes FASTA files provide quality scores, either ASCII encoded, e.g. as supported by modules Biocaml_phred_score and Biocaml_solexa_score, or as space-separated integers.
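To make the two quality-score encodings concrete, here is a minimal sketch using only the standard library. Both functions are hypothetical helpers, not Biocaml functions, and the ASCII offset of 33 is an assumption (a common convention); other offsets exist and can be passed explicitly.

```ocaml
(* Hypothetical helpers, not Biocaml functions. *)

(* Parse a line such as "40 40 38 12" into a list of integers.
   Returns [Error token] on the first token that is not an integer. *)
let parse_int_line (line : string) : (int list, string) result =
  let tokens =
    String.split_on_char ' ' line |> List.filter (fun t -> t <> "")
  in
  let rec go acc = function
    | [] -> Ok (List.rev acc)
    | t :: rest ->
      (match int_of_string_opt t with
       | Some i -> go (i :: acc) rest
       | None -> Error t)
  in
  go [] tokens

(* Decode ASCII-encoded scores by subtracting an offset from each
   character code. The default offset of 33 is an assumption. *)
let decode_ascii_scores ?(offset = 33) (line : string) : int list =
  List.init (String.length line) (fun i -> Char.code line.[i] - offset)
```

Both representations end up as an `int list`, which is the same shape as the `int_seq` type used for integer sequences in this module.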
Thus, the FASTA format is really a family of formats with a fairly loose specification of the header and content formats. The only consistently followed meaning of the format is:
Names in this module use sequence to generically mean either kind of data found in the sequence lines, char_seq to mean specifically a sequence of characters, and int_seq to mean specifically a sequence of integers.

module Biocaml_fasta:
sig
type char_seq = string
type int_seq = int list
type 'a item = {
  header : string;
  sequence : 'a;
}
module Tags:
sig
type t = {
  forbid_empty_lines :
  only_header_comment :
  sharp_comments :
  semicolon_comments :
  max_items_per_line :
  sequence :
}
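val char_sequence_default : t
Default tags for character sequences (char_seq).
val int_sequence_default : t
Default tags for integer sequences (int_seq).
val is_char_sequence : t -> bool
Tests whether t.sequence denotes a character sequence.
val is_int_sequence : t -> bool
Tests whether t.sequence denotes an integer sequence.
end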
module Error:
sig
All errors generated by the Fasta module are defined here. Type t is the union of all errors, and subsets of this are defined as needed to specify precise return types for various functions.

`empty_line pos - an empty line was found at a position pos where it is not allowed.
`incomplete_input (lines,s) - the input ended prematurely. Trailing contents, which cannot be used to fully construct an item, are provided: lines holds the complete lines parsed and s is any final string not ending in a newline.
`malformed_partial_sequence s - indicates that s could not be parsed into a valid (partial) sequence value.
`sequence_is_too_long s - indicates that s is longer than allowed by `max_items_per_line.
`unnamed_char_seq x - a char_seq value x was found without a preceding header section.
`unnamed_int_seq x - an int_seq value x was found without a preceding header section.

type string_to_raw_item =
[ `empty_line of Biocaml_pos.t
| `incomplete_input of Biocaml_pos.t * string list * string option
| `malformed_partial_sequence of Biocaml_pos.t * string
| `sequence_is_too_long of Biocaml_pos.t * string ]
Errors raised when converting a string to a Biocaml_fasta.raw_item.
type t =
[ `empty_line of Biocaml_pos.t
| `incomplete_input of Biocaml_pos.t * string list * string option
| `malformed_partial_sequence of Biocaml_pos.t * string
| `sequence_is_too_long of Biocaml_pos.t * string
| `unnamed_char_seq of Biocaml_fasta.char_seq
| `unnamed_int_seq of Biocaml_fasta.int_seq ]
val sexp_of_string_to_raw_item : string_to_raw_item -> Sexplib.Sexp.t
val string_to_raw_item_of_sexp : Sexplib.Sexp.t -> string_to_raw_item
end
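Because the errors are polymorphic variants, a caller can handle them with an ordinary pattern match. The sketch below is self-contained and makes several assumptions: `pos` is a hypothetical stand-in for Biocaml_pos.t (whose definition is not shown here), `error` is a local copy of the variants listed in the Error module, and the message wording is invented for illustration.

```ocaml
(* Hypothetical stand-in for Biocaml_pos.t; its real definition is not
   shown in this section. *)
type pos = { file : string option; line : int option }

(* Local copy of the documented error variants, using [pos] above. *)
type error =
  [ `empty_line of pos
  | `incomplete_input of pos * string list * string option
  | `malformed_partial_sequence of pos * string
  | `sequence_is_too_long of pos * string
  | `unnamed_char_seq of string
  | `unnamed_int_seq of int list ]

let pos_to_string { file; line } =
  Printf.sprintf "%s:%s"
    (match file with Some f -> f | None -> "<unknown>")
    (match line with Some l -> string_of_int l | None -> "?")

(* Render each error as a one-line, human-readable message. *)
let error_to_string : error -> string = function
  | `empty_line p -> pos_to_string p ^ ": empty line not allowed here"
  | `incomplete_input (p, lines, rest) ->
    Printf.sprintf "%s: input ended prematurely (%d complete lines%s)"
      (pos_to_string p) (List.length lines)
      (match rest with Some _ -> ", plus a partial line" | None -> "")
  | `malformed_partial_sequence (p, s) ->
    Printf.sprintf "%s: cannot parse sequence %S" (pos_to_string p) s
  | `sequence_is_too_long (p, s) ->
    Printf.sprintf "%s: sequence line of length %d is too long"
      (pos_to_string p) (String.length s)
  | `unnamed_char_seq s ->
    Printf.sprintf "sequence %S has no preceding header" s
  | `unnamed_int_seq l ->
    Printf.sprintf "integer sequence of length %d has no preceding header"
      (List.length l)
```

Since the match covers the full union, the compiler reports any variant left unhandled, which is one practical benefit of defining the errors as a closed set.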
In_channel Functions
exception Error of Error.t
val in_channel_to_char_seq_item_stream : ?buffer_size:int ->
  ?filename:string ->
  ?tags:Tags.t ->
  Pervasives.in_channel ->
  (char_seq item, [> Error.t ]) Core.Result.t Stream.t
Returns a stream of char_seq item results.
val in_channel_to_int_seq_item_stream : ?buffer_size:int ->
  ?filename:string ->
  ?tags:Tags.t ->
  Pervasives.in_channel ->
  (int_seq item, [> Error.t ]) Core.Result.t Stream.t
Returns a stream of int_seq item results.
val in_channel_to_char_seq_item_stream_exn : ?buffer_size:int ->
  ?filename:string ->
  ?tags:Tags.t ->
  Pervasives.in_channel -> char_seq item Stream.t
Returns a stream of char_seq items. Comments are discarded. Stream.next will raise Error _ in case of any error.
val in_channel_to_int_seq_item_stream_exn : ?buffer_size:int ->
  ?filename:string ->
  ?tags:Tags.t ->
  Pervasives.in_channel -> int_seq item Stream.t
Returns a stream of int_seq items. Comments are discarded. Stream.next will raise Error _ in case of any error.
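The relationship between the result-returning functions and their _exn counterparts can be sketched as follows. This is not Biocaml's implementation: `parse_error`, `Parse_error` and `ok_exn` are hypothetical names, the error type is reduced to two variants with simplified payloads, and the standard library's Seq.t stands in for Stream.t.

```ocaml
(* Sketch only: [parse_error] and [Parse_error] are hypothetical names, and
   Seq.t stands in for Stream.t. *)
type parse_error = [ `empty_line of int | `unnamed_char_seq of string ]

exception Parse_error of parse_error

(* Turn a sequence of results into a sequence of plain values that raises
   [Parse_error] lazily, i.e. only when the failing element is demanded. *)
let ok_exn (s : ('a, parse_error) result Seq.t) : 'a Seq.t =
  Seq.map (function Ok x -> x | Error e -> raise (Parse_error e)) s

let () =
  let results = List.to_seq [ Ok "ACGT"; Error (`empty_line 2); Ok "TTGA" ] in
  try Seq.iter print_endline (ok_exn results)
  with
  | Parse_error (`empty_line n) -> Printf.printf "empty line at %d\n" n
  | Parse_error (`unnamed_char_seq s) -> Printf.printf "unnamed sequence %s\n" s
```

The result-based form lets a caller skip or collect bad records and continue, while the raising form is shorter when any error should abort processing.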
A raw_item represents an intermediate level of parsing, between a plain string and an item. Working with raw_items can be useful for various reasons. You may want to work with comments, which are not kept in items. Also, the extra work required to parse to an item may be unnecessary for the analysis you will do, so using raw_items can be more efficient.
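As an illustration of why the intermediate level can be enough, the sketch below counts residues without ever assembling an item. It is self-contained and rests on assumptions: the `raw_item` type is a local mirror of the one documented in this module, `raw_item_of_line` is a hypothetical classifier (not Biocaml's parser) that treats both '#' and ';' as comment markers, drops the marker from the comment text, and maps an empty line to an empty partial sequence.

```ocaml
(* Local mirror of the raw_item type documented in this module. *)
type 'a raw_item =
  [ `comment of string | `header of string | `partial_sequence of 'a ]

(* Hypothetical line classifier for character sequences; assumes both '#'
   and ';' start comments and that the line has no trailing newline. *)
let raw_item_of_line (line : string) : string raw_item =
  let tail () = String.sub line 1 (String.length line - 1) in
  if line = "" then `partial_sequence ""
  else
    match line.[0] with
    | '#' | ';' -> `comment (tail ())
    | '>' -> `header (String.trim (tail ()))
    | _ -> `partial_sequence line

(* Counting residues needs no item assembly: each partial sequence is
   consumed as soon as it is read. *)
let count_residues (lines : string list) : int =
  List.fold_left
    (fun n line ->
       match raw_item_of_line line with
       | `partial_sequence s -> n + String.length s
       | `comment _ | `header _ -> n)
    0 lines
```

Comments remain visible at this level as `comment values, so a caller that needs them can match on that case instead of discarding it.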
type 'a raw_item = [ `comment of string | `header of string | `partial_sequence of 'a ]

`comment _ - a single comment line without the final newline.
`header _ - a single header line, without the initial '>', the whitespace following it, or the final newline.
`partial_sequence _ - either a sequence of characters, represented as a string, or a sequence of space-separated integers, represented by an int list. The value does not necessarily carry the complete content associated with a header. It may be only part of the sequence, which can be useful for files with large sequences (e.g. genomic sequence files).

val in_channel_to_char_seq_raw_item_stream : ?buffer_size:int ->
  ?filename:string ->
  ?tags:Tags.t ->
  Pervasives.in_channel ->
  (char_seq raw_item, [> Error.t ]) Core.Result.t Stream.t
Returns a stream of char_seq raw_item results.
val in_channel_to_int_seq_raw_item_stream : ?buffer_size:int ->
  ?filename:string ->
  ?tags:Tags.t ->
  Pervasives.in_channel ->
  (int_seq raw_item, [> Error.t ]) Core.Result.t Stream.t
Returns a stream of int_seq raw_item results.
val in_channel_to_char_seq_raw_item_stream_exn : ?buffer_size:int ->
  ?filename:string ->
  ?tags:Tags.t ->
  Pervasives.in_channel ->
  char_seq raw_item Stream.t
Returns a stream of char_seq raw_items. Comments are discarded. Stream.next will raise Error _ in case of any error.
val in_channel_to_int_seq_raw_item_stream_exn : ?buffer_size:int ->
  ?filename:string ->
  ?tags:Tags.t ->
  Pervasives.in_channel ->
  int_seq raw_item Stream.t
Returns a stream of int_seq raw_items. Comments are discarded. Stream.next will raise Error _ in case of any error.
val char_seq_raw_item_to_string : char_seq raw_item -> string
Converts a raw_item to a string (ignoring comments). End-of-line characters are included.
val int_seq_raw_item_to_string : int_seq raw_item -> string
Converts a raw_item to a string (ignoring comments). End-of-line characters are included.
module Transform:
sig
Parsing char_seq Items
val string_to_char_seq_raw_item : ?filename:string ->
  ?tags:Biocaml_fasta.Tags.t ->
  unit ->
  (string,
   (Biocaml_fasta.char_seq Biocaml_fasta.raw_item, [> Biocaml_fasta.Error.t ])
   Core.Result.t)
  Biocaml_transform.t
val char_seq_raw_item_to_item : unit ->
  (Biocaml_fasta.char_seq Biocaml_fasta.raw_item,
   (Biocaml_fasta.char_seq Biocaml_fasta.item,
    [> `unnamed_char_seq of Biocaml_fasta.char_seq ])
   Core.Result.t)
  Biocaml_transform.t
Aggregates a stream of char_seq raw_items into char_seq items. Comments are discarded.

Printing char_seq Items
val char_seq_item_to_raw_item : ?tags:Biocaml_fasta.Tags.t ->
  unit ->
  (Biocaml_fasta.char_seq Biocaml_fasta.item,
   Biocaml_fasta.char_seq Biocaml_fasta.raw_item)
  Biocaml_transform.t
Converts a stream of char_seq items into a stream of char_seq raw_items, where lines are cut at items_per_line characters (items_per_line is defined with the `max_items_per_line _ tag; if not specified, the default is 80).
val char_seq_raw_item_to_string : ?tags:Biocaml_fasta.Tags.t ->
  unit ->
  (Biocaml_fasta.char_seq Biocaml_fasta.raw_item, string) Biocaml_transform.t
Prints char_seq items. Comments will be ignored if neither of the tags `sharp_comments nor `semicolon_comments is provided.

Parsing int_seq Items
val string_to_int_seq_raw_item : ?filename:string ->
  ?tags:Biocaml_fasta.Tags.t ->
  unit ->
  (string,
   (Biocaml_fasta.int_seq Biocaml_fasta.raw_item, [> Biocaml_fasta.Error.t ])
   Core.Result.t)
  Biocaml_transform.t
val int_seq_raw_item_to_item : unit ->
  (Biocaml_fasta.int_seq Biocaml_fasta.raw_item,
   (Biocaml_fasta.int_seq Biocaml_fasta.item,
    [> `unnamed_int_seq of Biocaml_fasta.int_seq ])
   Core.Result.t)
  Biocaml_transform.t
Aggregates a stream of int_seq raw_items into int_seq items. Comments are discarded.

Printing int_seq Items
val int_seq_item_to_raw_item : ?tags:Biocaml_fasta.Tags.t ->
  unit ->
  (Biocaml_fasta.int_seq Biocaml_fasta.item,
   Biocaml_fasta.int_seq Biocaml_fasta.raw_item)
  Biocaml_transform.t
Converts a stream of int_seq items into a stream of int_seq raw_items. The default line-cutting threshold is 27 (cf. Biocaml_fasta.Tags.t).
val int_seq_raw_item_to_string : ?tags:Biocaml_fasta.Tags.t ->
  unit ->
  (Biocaml_fasta.int_seq Biocaml_fasta.raw_item, string) Biocaml_transform.t
Prints int_seq items. Comments will be ignored if no *_comments tag is provided.
end
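The two directions between raw items and items can be sketched on plain lists. This is a simplified stand-in under stated assumptions, not the Biocaml_transform.t machinery: the types are local mirrors of the documented ones, `raw_items_to_items` and `item_to_raw_items` are hypothetical names, and only character sequences are handled.

```ocaml
(* Local mirrors of the documented types; the functions are hypothetical
   list-based stand-ins for the transforms in this module. *)
type 'a raw_item =
  [ `comment of string | `header of string | `partial_sequence of 'a ]

type 'a item = { header : string; sequence : 'a }

(* Group raw items into items: comments are dropped, consecutive partial
   sequences are concatenated, and a sequence with no preceding header is
   reported as [`unnamed_char_seq]. *)
let raw_items_to_items (raws : string raw_item list)
  : (string item list, [ `unnamed_char_seq of string ]) result =
  let flush current acc =
    match current with
    | None -> acc
    | Some (header, parts) ->
      { header; sequence = String.concat "" (List.rev parts) } :: acc
  in
  let rec go current acc = function
    | [] -> Ok (List.rev (flush current acc))
    | `comment _ :: rest -> go current acc rest
    | `header h :: rest -> go (Some (h, [])) (flush current acc) rest
    | `partial_sequence s :: rest ->
      (match current with
       | None -> Error (`unnamed_char_seq s)
       | Some (h, parts) -> go (Some (h, s :: parts)) acc rest)
  in
  go None [] raws

(* Inverse direction: one header followed by chunks of at most
   [items_per_line] characters (80 by default, as documented above). *)
let item_to_raw_items ?(items_per_line = 80) (it : string item)
  : string raw_item list =
  let width = max 1 items_per_line in
  let n = String.length it.sequence in
  let rec chunks pos =
    if pos >= n then []
    else
      let len = min width (n - pos) in
      `partial_sequence (String.sub it.sequence pos len) :: chunks (pos + len)
  in
  `header it.header :: chunks 0
```

Cutting an item and regrouping the pieces yields the original item, which is the property that lets parsing and printing be composed freely.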
module Random:
sig
type specification =
[ `non_sequence_probability of float | `tags of Biocaml_fasta.Tags.t ]
The specification of random 'a raw_item values is a list of specification values. `non_sequence_probability f means that the output will not be a `partial_sequence _ item with probability f. `tags t specifies which Tags.t should be respected.
val specification_of_string : string ->
  (specification list,
   [> `fasta of [> `parse_specification of exn ] ])
  Core.Std.Result.t
Parses a specification from a string. Right now, the DSL is based on S-Expressions.
[> specification ] list -> Biocaml_fasta.Tags.t option
Returns the Tags.t in the specification, if any.
val unit_to_random_char_seq_raw_item : [> specification ] list ->
  ((unit, Biocaml_fasta.char_seq Biocaml_fasta.raw_item) Biocaml_transform.t,
   [> `inconsistent_tags of [> `int_sequence ] ])
  Core.Result.t
Creates a transformation that generates random char_seq raw_item values according to the specification.
end
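The meaning of `non_sequence_probability can be illustrated with a small generator built on the standard library's Random module. This is a hypothetical sketch, not Biocaml's generator: the function name, the `max_len` parameter, the nucleotide alphabet and the generated header and comment texts are all assumptions made for the example.

```ocaml
(* Local mirror of the raw_item type documented in this module. *)
type 'a raw_item =
  [ `comment of string | `header of string | `partial_sequence of 'a ]

(* Hypothetical generator, not Biocaml's: with probability
   [non_sequence_probability] produce a header or a comment, otherwise a
   partial sequence of 1 to [max_len] random nucleotides. *)
let random_raw_item ?(max_len = 80) ~non_sequence_probability ()
  : string raw_item =
  if Random.float 1.0 < non_sequence_probability then
    if Random.bool () then `header (Printf.sprintf "seq%d" (Random.int 1000))
    else `comment "random comment"
  else
    let len = 1 + Random.int max_len in
    `partial_sequence (String.init len (fun _ -> "ACGT".[Random.int 4]))
```

With a probability of 0.0 every generated value is a `partial_sequence _, which makes that setting convenient for testing sequence handling in isolation.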
val sexp_of_char_seq : char_seq -> Sexplib.Sexp.t
val char_seq_of_sexp : Sexplib.Sexp.t -> char_seq
val sexp_of_int_seq : int_seq -> Sexplib.Sexp.t
val int_seq_of_sexp : Sexplib.Sexp.t -> int_seq
val sexp_of_item : ('a -> Sexplib.Sexp.t) -> 'a item -> Sexplib.Sexp.t
val item_of_sexp : (Sexplib.Sexp.t -> 'a) -> Sexplib.Sexp.t -> 'a item
val sexp_of_raw_item : ('a -> Sexplib.Sexp.t) -> 'a raw_item -> Sexplib.Sexp.t
val raw_item_of_sexp : (Sexplib.Sexp.t -> 'a) -> Sexplib.Sexp.t -> 'a raw_item
end