This submission guide gives an overview of generating, validating, and submitting seqspec documents to the IGVF data portal.
If you are submitting raw sequencing data (such as Illumina, PacBio, or Nanopore), the corresponding sequence specification (seqspec) YAML configuration files should be submitted along with the sequence files. Furthermore, seqspec YAML files are required to uniformly process single-cell assays (such as scRNA-seq, 10x Multiome, SHARE-seq, and MULTI-seq).
These machine-readable YAML files describe your genomic library sequence and structure for standardized data processing (see Bioinformatics, Volume 40, Issue 4, April 2024). The files should be submitted as configuration files and linked to the metadata of the corresponding sequence files via seqspec_of.
Detailed documentation on seqspec installation, generation, validation, and usage is available in the seqspec docs. Additional tutorials are available from a 2024 seqspec Jamboree session, with all content on Google Drive.
Note that seqspec submission requirements vary depending on whether the sequencing data will be processed by the single-cell uniform pipeline. Please refer to the corresponding subsections in the section below on seqspec submission.
The seqspec code (for YAML file generation and other functionality) is available from the DACC fork of the seqspec repository on the IGVF-DACC GitHub. It can also be installed with pip:
pip install git+https://github.com/IGVF-DACC/seqspec.git@v25-05-06
Furthermore, please check the seqspec JSON schema for the list of accepted seqspec enums, such as assay terms, region types, and primer IDs.
For additional help generating a YAML file, please contact Sina Booeshaghi and Lior Pachter. For any other questions about submitting a configuration file, please contact your wranglers.
The seqspec YAML files are required to uniformly process single-cell assays (see assay terms below). The files should be submitted as configuration files and linked to the metadata of the corresponding sequence files via seqspec_of. The seqspec YAML files need to be gzipped prior to submission.
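The gzip step can be sketched as follows. This is a minimal illustration, not an official procedure: the file name my_seqspec.yaml and its placeholder content are hypothetical, and the scratch directory is used only so that the demo does not touch real files.

```shell
set -eu

# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
spec="$workdir/my_seqspec.yaml"   # hypothetical file name

# Placeholder content standing in for a real seqspec YAML file.
printf '# placeholder seqspec content\n' > "$spec"

# Compress to my_seqspec.yaml.gz; -c writes to stdout, so the original is kept.
gzip -c "$spec" > "$spec.gz"

# Check that the archive is intact and decompresses to the same bytes.
gzip -t "$spec.gz"
gzip -dc "$spec.gz" | cmp -s - "$spec" && echo "gzip round-trip OK"
```

For a real submission, only the gzip -c line is needed, pointed at your own YAML file.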
If a seqspec has onlist file references in library_spec sections that have the property region_type:barcode (see section above), submitters must also list the same files in the onlist_files property of the linked Measurement Sets and specify the onlist_method.
The schema for submitting seqspec YAML files for single-cell and non-single-cell assays is illustrated in the subsection "Schema for submitting seqspec YAML files for single cell and non-single assays" below.
There are a few important seqspec YAML style guidelines to follow when generating and submitting seqspec YAML files.
read_id, file_id, and url for FASTQ and onlist files, if used in a seqspec, must use IGVF data portal accessions.
[...] library_spec portion will be different, a new seqspec describing this new library structure is needed.
/assay-terms/OBI_0002762/, single-nucleus ATAC-seq
/assay-terms/OBI_0003109/, single-nucleus RNA sequencing assay
/assay-terms/OBI_0002631/, single-cell RNA sequencing assay
/assay-terms/OBI_0002764/, single-cell ATAC-seq
/assay-terms/OBI_0003660/, in vitro CRISPR screen using single-cell RNA-seq
[...] file_set of a seqspec YAML file matches the file_set of the sequence files in the seqspec_of content.
[...] library_spec example below), please use existing onlist files on the IGVF data portal, which can be found under Tabular Files with content_type: barcode onlist. If your onlist file(s) are not on the portal, please work with your wranglers to submit them to the portal first. Additionally, the following guidelines must be followed when referencing barcode onlist files.
[...] region_type:index5, onlist:!Onlist), there isn't a strict format requirement as long as the referenced barcode onlist files are on the portal under Tabular File with a content type of barcode onlist (accepted onlist examples: IGVFFI0791WXDC and IGVFFI4565KANH).
[...] attachment to the document object. The document_type in this case will be library structure seqspec. This Document can then be linked to individual Sequence Files using the property seqspec_document on the sequence file object. Any seqspec file submitted via this option will not be thoroughly checked by the IGVF check files validation system. Nonetheless, submitters are encouraged to self-validate (refer to the last section) to ensure that there are no errors in the library_spec section of the seqspec YAML file.
[...] !Read sections in the sequence_spec section of the seqspec YAML file as much as possible. The !File subsections under !Read sections and the Onlist info under library_spec are optional and can be left blank.
The example seqspec used is IGVFFI1157AYPH.
This section describes the experimental setup of one or more sequencing runs. Some information listed in this section is also collected as metadata when submitting sequence files and the corresponding file sets to the data portal. Therefore, the seqspec validation check skips this section except for seqspec_version.
seqspec_version in this section is expected to be v0.3.0 if a seqspec YAML file will be submitted to the IGVF data portal.
The modality section should only contain one modality. Please do not combine multiple modalities (e.g., RNA and ATAC) into the same seqspec YAML file.
The sequence_spec section describes the relevant read FASTQ files generated by sequencing runs. When referencing sequencing FASTQ files in this section, the files and the associated read_id must 1) use IGVF accessions, and 2) have valid IGVF file download URLs.
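Before running the full validation, a quick text search can flag read and file references that do not look like IGVF accessions. The sketch below is only a rough pre-check under stated assumptions: it assumes file accessions contain the prefix IGVFFI (as in the examples on this page), and the file name, the accession IGVFFI0000AAAA, and the local FASTQ name are all hypothetical placeholders.

```shell
set -eu

workdir=$(mktemp -d)
spec="$workdir/example_seqspec.yaml.gz"   # hypothetical file name

# Placeholder fragment with hypothetical values; a real seqspec has far more content.
printf '%s\n' \
  '  read_id: IGVFFI0000AAAA' \
  '    file_id: my_local_reads.fastq.gz' \
  | gzip -c > "$spec"

# List read_id / file_id / url lines that do not mention an IGVFFI accession.
# "|| true" keeps the script going when nothing suspicious is found.
suspect=$(gzip -dc "$spec" \
  | grep -E '^[[:space:]-]*(read_id|file_id|url):' \
  | grep -v 'IGVFFI' || true)

echo "References to review:"
echo "$suspect"
```

Any line printed here is worth checking by hand. An empty result does not guarantee that the accessions or URLs are valid; that is what the validation levels described below are for.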
The library_spec section describes in detail the structure and regions of sequencing libraries.
Submitters are encouraged to self-validate their seqspec YAML files before submitting them to the IGVF data portal. The portal performs two levels of validation.
Level 1 validation: It is applied to all submitted seqspec YAML files. The process includes seqspec schema check, content check, and read FASTQ file URLs validation.
Level 2 validation: It is currently only applicable to seqspec YAML files used for single cell assays (see the assay terms above), in which the onlist file URLs will be validated.
There are two options for validating seqspec files yourself before uploading them to the portal. The first is to run seqspec check directly:
# To validate on Level 1 (schema, content, and fastq file references)
seqspec check -s igvf_onlist_skip yaml.gz
# To validate on Level 2 (schema, content, fastq, and onlist file references)
seqspec check -s igvf yaml.gz
The second option is to install IGVF-DACC checkfiles from https://github.com/IGVF-DACC/checkfiles.git and follow the instructions on how to run local checkfiles. This option runs the same file validation system as the IGVF data portal on your local computer. If you are using it only to validate seqspec YAML files, installing the requirements with the command in Step 1 is sufficient, and you may skip the additional dependencies. Then follow the commands for local checkfiles validation of seqspec YAML files listed under the section titled "Validate seqspec yaml file while skip onlist files check".
# Clone the repo and move into it (the paths below are relative to the repo root)
git clone https://github.com/IGVF-DACC/checkfiles.git
cd checkfiles
# Install requirements after creating a virtual environment
pip install -r src/checkfiles/requirements.txt
# Run seqspec yaml validation on Level 1 (schema, content, and fastq file references)
python src/checkfiles/checkfiles_local.py --input_file_path src/tests/data/seqspec_valid.yaml.gz --file_format yaml --content_type seqspec --onlist_skip --md5sum f1859dd9d60554a8f8ab63b65b458267
# Run seqspec yaml validation on Level 2 (schema, content, fastq, and onlist file references)
python src/checkfiles/checkfiles_local.py --input_file_path src/tests/data/seqspec_valid.yaml.gz --file_format yaml --content_type seqspec --md5sum f1859dd9d60554a8f8ab63b65b458267
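The commands above pass an MD5 checksum through --md5sum. For your own file, the value can be computed as sketched below. This assumes --md5sum expects the MD5 of the file given to --input_file_path, and that the GNU coreutils md5sum tool is available (on macOS, md5 -q is the usual equivalent). The file path and its content are hypothetical placeholders.

```shell
set -eu

workdir=$(mktemp -d)
file="$workdir/example_seqspec.yaml.gz"   # hypothetical path

# Placeholder gzipped file standing in for a real seqspec YAML.
printf '# placeholder seqspec content\n' | gzip -c > "$file"

# md5sum prints "<hash>  <path>"; keep only the hash.
md5=$(md5sum "$file" | cut -d ' ' -f 1)
echo "$md5"

# The hash can then be passed to the local checkfiles command shown above, e.g.:
# python src/checkfiles/checkfiles_local.py --input_file_path "$file" \
#   --file_format yaml --content_type seqspec --onlist_skip --md5sum "$md5"
```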