FASTQ files. The FASTQ file format is a repeated sequence of 4 lines:
    @name
    sequence
    +comment
    qualities
    ...
The name line begins with an @ character, which is omitted in the parsed Biocaml_fastq.item type provided by this module. Any spaces after the @ are retained, although the specification implies that there shouldn't be any such spaces. Trailing whitespace is also retained, since well-formed files should not normally contain any.
The comment line, which begins with a +, is handled similarly. The purpose of the comment line is unclear and it is rarely used. Also, "comment" may not be the correct term for this line.
The name line may be structured into two parts: a sequence identifier and an optional description. We provide a function Biocaml_fastq.split_name to parse such a value. However, an item's name field contains the unparsed string because it is unclear whether FASTQ files really follow this convention. Also, the format of the description is unspecified. When a description is provided, it usually has some additional structure, so the minimal amount of parsing done by Biocaml_fastq.split_name isn't too useful anyway.
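The identifier/description split described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: split_name_sketch is a hypothetical name, and it assumes the identifier ends at the first space character.

```ocaml
(* Hypothetical stand-in for Biocaml_fastq.split_name.
   Assumption: the sequence identifier ends at the first space. *)
let split_name_sketch (name : string) : string * string option =
  match String.index_opt name ' ' with
  | None -> (name, None)
  | Some i ->
    let id = String.sub name 0 i in
    let descr = String.sub name (i + 1) (String.length name - i - 1) in
    (id, Some descr)

let () =
  let id, descr = split_name_sketch "read1 some description" in
  Printf.printf "id=%s descr=%s\n" id
    (match descr with None -> "<none>" | Some d -> d)
```

The description is returned as a single unparsed string, which mirrors the point made above: any further structure in it is left to the caller.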
Illumina uses a systematic format for the name line that serves as a unique sequence identifier. Use Biocaml_fastq.Illumina.sequence_id_of_string to parse an item's name field when you have FASTQ files produced by Casava version >= 1.8. Earlier versions of Casava produced a different format, which is not currently supported in this module (it could be easily added).
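As a rough illustration of what parsing such a name involves, the sketch below splits a Casava >= 1.8 style identifier into its fields. It assumes the layout instrument:run:flowcell:lane:tile:x:y, then a space, then read:filtered:control:index, with the filter flag written as Y or N. The record seq_id_sketch and its field types are hypothetical choices for this example and are not the module's sequence_id type; the tile is left as an unparsed string.

```ocaml
(* Hypothetical record for illustration; field types are assumptions. *)
type seq_id_sketch = {
  instrument : string;
  run_number : int;
  flowcell_id : string;
  lane : int;
  tile : string; (* left unparsed in this sketch *)
  x_pos : int;
  y_pos : int;
  read : int;
  is_filtered : bool;
  control_number : int;
  index : string;
}

(* Assumed layout: "instr:run:flowcell:lane:tile:x:y read:Y|N:control:index",
   without the leading '@'. *)
let seq_id_of_string_sketch (s : string) : (seq_id_sketch, string) result =
  let bad () = Error ("unrecognized sequence id: " ^ s) in
  match String.split_on_char ' ' s with
  | [left; right] ->
    (match String.split_on_char ':' left, String.split_on_char ':' right with
     | [instrument; run; flowcell_id; lane; tile; x; y],
       [read; filt; ctrl; index] ->
       (match
          int_of_string_opt run, int_of_string_opt lane,
          int_of_string_opt x, int_of_string_opt y,
          int_of_string_opt read, int_of_string_opt ctrl, filt
        with
        | Some run_number, Some lane, Some x_pos, Some y_pos,
          Some read, Some control_number, ("Y" | "N") ->
          Ok { instrument; run_number; flowcell_id; lane; tile;
               x_pos; y_pos; read; is_filtered = (filt = "Y");
               control_number; index }
        | _ -> bad ())
     | _ -> bad ())
  | _ -> bad ()
```

Returning a result rather than raising keeps malformed names (for example, names from older Casava versions) as ordinary values the caller can inspect.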
The qualities line is returned as a plain string, but it is required to be decodable as either Phred or Solexa scores. Modules Phred_score and Solexa_score can be used to parse as needed.
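To make the decoding step concrete, here is a small standalone sketch that turns a qualities string into integer Phred scores. It assumes the common Phred+33 ASCII encoding (score = character code - 33, scores 0 to 93); it does not use or reproduce the Phred_score module's API, and the function names are hypothetical.

```ocaml
(* Assumption: Phred+33 encoding. Not the Phred_score module's API. *)
let phred33_scores (qualities : string) : (int list, string) result =
  (* Walk from the last character to the first so the list is built in order. *)
  let rec go i acc =
    if i < 0 then Ok acc
    else
      let q = Char.code qualities.[i] - 33 in
      if q < 0 || q > 93 then
        Error (Printf.sprintf "invalid quality character at position %d" i)
      else go (i - 1) (q :: acc)
  in
  go (String.length qualities - 1) []

(* Error probability implied by a Phred score: p = 10^(-Q/10). *)
let error_probability (q : int) : float =
  10. ** (-. float_of_int q /. 10.)
```

A file using a different offset or Solexa scaling would need a different conversion, which is why the item type keeps the raw string.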
Older FASTQ files allowed the sequence and qualities strings to
span multiple lines. This is discouraged and is not supported by
this module.
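The strict four-lines-per-record layout can be sketched as follows. This is an illustration using only the standard library, not the module's parser: item_sketch, strip_prefix and items_of_lines are hypothetical names, the input is assumed to be already split into lines without end-of-line characters, and the length check between sequence and qualities is applied unconditionally.

```ocaml
(* Hypothetical record mirroring the four FASTQ lines. *)
type item_sketch = {
  name : string;
  sequence : string;
  comment : string;
  qualities : string;
}

(* Remove the required leading character, or report the offending line. *)
let strip_prefix (c : char) (line : string) : (string, string) result =
  if String.length line > 0 && line.[0] = c
  then Ok (String.sub line 1 (String.length line - 1))
  else Error (Printf.sprintf "expected line starting with '%c': %S" c line)

(* Consume lines four at a time; multi-line sequences are not handled. *)
let rec items_of_lines (lines : string list)
  : (item_sketch list, string) result =
  match lines with
  | [] -> Ok []
  | n :: sequence :: c :: qualities :: rest ->
    (match strip_prefix '@' n, strip_prefix '+' c with
     | Ok name, Ok comment ->
       if String.length sequence <> String.length qualities then
         Error "sequence and qualities lengths differ"
       else
         (match items_of_lines rest with
          | Ok items -> Ok ({ name; sequence; comment; qualities } :: items)
          | Error _ as e -> e)
     | Error e, _ | _, Error e -> Error e)
  | _ -> Error "incomplete record: expected 4 lines"
```

Because exactly one line is read for the sequence and one for the qualities, a file that wraps either across several lines is rejected rather than silently misparsed.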
module Biocaml_fastq:
sig
type item = {
  name : string;
  sequence : string;
  comment : string;
  qualities : string;
}
val split_name : string -> string * string option
Assumes the given string is from an item's name field, i.e. that it doesn't contain a leading @ char.
module MakeIO:
functor (Future : Future.S) ->
sig
val read : Future.Reader.t -> Biocaml_fastq.item Core.Std.Or_error.t Future.Pipe.Reader.t
val write : Future.Writer.t -> Biocaml_fastq.item Future.Pipe.Reader.t -> unit Future.Deferred.t
val write_file : ?perm:int -> ?append:bool -> string -> Biocaml_fastq.item Future.Pipe.Reader.t -> unit Future.Deferred.t
end
include ??
module Illumina:
sig
type surface = [ `Bottom | `Top ]
type tile = private {
  surface :
  swath : (* 1, 2, or 3 *)
  number : (* 1 - 99, but usually 1 - 8 *)
}
val tile_of_string : string -> tile Core.Std.Or_error.t
Parses a tile string such as "2304".
val tile_to_string : tile -> string
Inverse of tile_of_string.
type sequence_id = private {
  instrument :
  run_number :
  flowcell_id :
  lane :
  tile :
  x_pos :
  y_pos :
  read :
  is_filtered :
  control_number :
  index :
}
val sequence_id_of_string : string -> sequence_id Core.Std.Or_error.t
Assumes the given string is from an item's name field, i.e. that it doesn't contain a leading @ char.
end
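For readers unfamiliar with tile strings, the sketch below shows one way a four-digit tile such as "2304" could be decoded and re-encoded. It assumes the first digit is the surface (1 for top, 2 for bottom), the second digit is the swath, and the last two digits are the tile number. The record tile_sketch, its int fields, and both function names are hypothetical and are not taken from this module.

```ocaml
type surface = [ `Top | `Bottom ]

(* Hypothetical record; the int field types are assumptions. *)
type tile_sketch = { surface : surface; swath : int; number : int }

(* Assumed layout: <surface digit><swath digit><two-digit number>. *)
let tile_of_string_sketch (s : string) : (tile_sketch, string) result =
  let bad () = Error ("invalid tile: " ^ s) in
  if String.length s <> 4 then bad ()
  else
    let surface =
      match s.[0] with
      | '1' -> Some `Top
      | '2' -> Some `Bottom
      | _ -> None
    in
    let swath =
      match s.[1] with
      | '1' -> Some 1
      | '2' -> Some 2
      | '3' -> Some 3
      | _ -> None
    in
    let number =
      match s.[2], s.[3] with
      | ('0'..'9' as a), ('0'..'9' as b) ->
        let n = (Char.code a - 48) * 10 + (Char.code b - 48) in
        if n >= 1 then Some n else None
      | _ -> None
    in
    match surface, swath, number with
    | Some surface, Some swath, Some number -> Ok { surface; swath; number }
    | _ -> bad ()

(* Inverse of the sketch above: zero-pads the tile number to two digits. *)
let tile_to_string_sketch (t : tile_sketch) : string =
  Printf.sprintf "%d%d%02d"
    (match t.surface with `Top -> 1 | `Bottom -> 2)
    t.swath t.number
```

Making the record private in the real signature, as above, forces construction through a validating function of this kind, so out-of-range swaths or tile numbers cannot be built by hand.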
val item_to_string : item -> string
Converts item values to strings that can be dumped to a file, i.e. they contain full lines, including all end-of-line characters.
val name_of_line : ?pos:Biocaml_internal_utils.Pos.t -> Biocaml_internal_utils.Line.t -> string Core.Std.Or_error.t
val sequence_of_line : ?pos:Biocaml_internal_utils.Pos.t -> Biocaml_internal_utils.Line.t -> string
val comment_of_line : ?pos:Biocaml_internal_utils.Pos.t -> Biocaml_internal_utils.Line.t -> string Core.Std.Or_error.t
val qualities_of_line : ?pos:Biocaml_internal_utils.Pos.t -> ?sequence:string -> Biocaml_internal_utils.Line.t -> string Core.Std.Or_error.t
qualities_of_line ?sequence line parses the given qualities line in the context of a previously parsed sequence. The sequence is needed to ensure that the correct number of quality scores is provided. If it is not provided, this check is omitted.
val item_of_sexp : Sexplib.Sexp.t -> item
val sexp_of_item : item -> Sexplib.Sexp.t
end
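The two behaviours documented for item_to_string and qualities_of_line can be sketched in isolation. This is an illustration under stated assumptions, not the module's code: item_sketch and both function names are hypothetical, lines are plain strings rather than Biocaml_internal_utils.Line.t values, the end-of-line sequence is assumed to be a single "\n", and errors are plain strings rather than Or_error values.

```ocaml
(* Hypothetical record mirroring the four FASTQ lines. *)
type item_sketch = {
  name : string;
  sequence : string;
  comment : string;
  qualities : string;
}

(* Re-add the '@' and '+' markers and terminate every line with "\n". *)
let item_to_string_sketch (r : item_sketch) : string =
  String.concat ""
    [ "@"; r.name; "\n"; r.sequence; "\n"; "+"; r.comment; "\n";
      r.qualities; "\n" ]

(* Length check only runs when a previously parsed sequence is supplied. *)
let qualities_of_line_sketch ?sequence (line : string)
  : (string, string) result =
  match sequence with
  | Some s when String.length s <> String.length line ->
    Error
      (Printf.sprintf "expected %d quality scores but found %d"
         (String.length s) (String.length line))
  | _ -> Ok line
```

For example, qualities_of_line_sketch "III" succeeds because no sequence is given, while qualities_of_line_sketch ~sequence:"ACGT" "III" reports a length mismatch.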