Sequence Specification (seqspec) overview

This page gives an overview of how to generate, submit, and validate seqspec documents for the IGVF data portal.

General information

If you are submitting raw sequencing data (for example, from Illumina, PacBio, or Nanopore platforms), the corresponding sequence specification (seqspec) YAML configuration files should be submitted along with the sequence files. Furthermore, seqspec YAML files are required to uniformly process single-cell assays (such as scRNA-seq, 10X multiome, SHARE-seq, and MULTI-seq).

These machine-readable YAML files describe the sequence and structure of your genomic library for standardized data processing (see Bioinformatics, Volume 40, Issue 4, April 2024). The files should be submitted as configuration files and linked to the corresponding sequence files' metadata through seqspec_of.

Detailed documentation on seqspec installation, generation, validation, and usage is included in the seqspec docs. Additional tutorials are available from a 2024 seqspec Jamboree session, with all content on Google Drive.

Note that seqspec submission requirements vary depending on whether the sequencing data will be processed by the single-cell uniform pipeline. Please refer to the corresponding subsections in the section below on seqspec submission.

Schema for submitting seqspec YAML files for single-cell and non-single-cell assays

seqspec submission schema

Seqspec YAML files generation and submission

Generation

The seqspec code (for YAML file generation and other functionality) is available from the IGVF-DACC fork of the seqspec repository on GitHub. It can also be installed with pip: pip install git+https://github.com/IGVF-DACC/seqspec.git@v25-05-06. Please also check the seqspec JSON schema for the list of accepted seqspec enums, such as assay terms, region types, and primer IDs.

For additional help generating a YAML file, please contact Sina Booeshaghi and Lior Pachter. For any other questions about submitting a Configuration File, please contact your wranglers.

Submission

The seqspec YAML files are required to uniformly process single-cell assays (see assay terms below). The files should be submitted as configuration files and linked to the corresponding sequence files metadata as seqspec_of. The seqspec YAML files need to be gzipped prior to submission.
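The gzip step can be done with the standard gzip command-line tool or, as a minimal sketch, with Python's standard library. The function name gzip_seqspec and the file name in the comment are illustrative and not part of seqspec or the IGVF tooling.

```python
import gzip
import shutil
from pathlib import Path


def gzip_seqspec(yaml_path: str) -> Path:
    """Write a gzipped copy of a seqspec YAML file next to the original.

    The original file is left in place; the new path is returned.
    """
    src = Path(yaml_path)
    # Append ".gz" to the full name, e.g. my_seqspec.yaml -> my_seqspec.yaml.gz
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst
```

The resulting .yaml.gz file is the one to submit as the configuration file.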

If a seqspec has onlist file references in library_spec sections that have the property region_type:barcode (see section above), submitters must also add the same list of files to the onlist_files property of the linked Measurement Sets and set the corresponding onlist_method.

The schema for submitting seqspec YAML files for single-cell and non-single-cell assays is illustrated in the schema subsection above.

Key seqspec YAML file information

There are a few important style guidelines to follow when generating and submitting seqspec YAML files.

  1. The seqspec YAML file must be v0.3.0.
  2. All references to read_id, file_id, and url for FASTQ and onlist files, if used in a seqspec, must use IGVF data portal accessions.
  3. Labs are still expected to submit one seqspec per library structure regardless of whether their experiments are single cell or not. This means if the library_spec portion will be different, a new seqspec describing this new library structure is needed.
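As a quick pre-submission sanity check for guidelines 1 and 2, the sketch below scans the text of a seqspec for seqspec_version, read_id, and file_id lines. It is a plain-text scan, not a YAML parser and not a replacement for seqspec check. The accession pattern is an assumption inferred from the example accessions in this guide (such as IGVFFI0791WXDC), the version check accepts the value with or without a leading "v" (also an assumption), and the function name is illustrative.

```python
import re

# Assumed accession shape, inferred from examples such as IGVFFI0791WXDC.
ACCESSION_RE = re.compile(r"^IGVFFI\d{4}[A-Z]{4}$")
# Matches lines like "  read_id: VALUE" or "- file_id: VALUE".
FIELD_RE = re.compile(r"^\s*(?:-\s+)?(seqspec_version|read_id|file_id):\s*(.*?)\s*$")


def check_seqspec_text(text: str) -> list[str]:
    """Return human-readable problems found in the seqspec text."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = FIELD_RE.match(line)
        if not match:
            continue
        key = match.group(1)
        value = match.group(2).strip("'\"")  # drop optional YAML quotes
        if key == "seqspec_version":
            if value.lstrip("v") != "0.3.0":
                problems.append(
                    f"line {lineno}: seqspec_version is {value!r}, expected 0.3.0"
                )
        elif not ACCESSION_RE.match(value):
            problems.append(
                f"line {lineno}: {key} {value!r} does not look like an IGVF accession"
            )
    return problems
```

An empty list means nothing was flagged. Empty read_id or file_id values are also reported, which is only relevant for seqspecs where those fields are required.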

Submitting seqspecs for single cell uniform pipeline assays

  1. If a seqspec YAML file is used in single cell assays, it must follow the style of one seqspec YAML file per sequencing run. The following terms currently fall under single cell experiments and associated data will be analyzed using the single cell uniform pipeline.
    • /assay-terms/OBI_0002762/, single-nucleus ATAC-seq
    • /assay-terms/OBI_0003109/, single-nucleus RNA sequencing assay
    • /assay-terms/OBI_0002631/, single-cell RNA sequencing assay
    • /assay-terms/OBI_0002764/, single-cell ATAC-seq
    • /assay-terms/OBI_0003660/, in vitro CRISPR screen using single-cell RNA-seq
  2. Please make sure that the file_set of a seqspec YAML file matches the file_set of the sequence files in the seqspec_of content.
  3. The seqspec YAML files must be gzipped prior to submission.
  4. If barcode onlist files are used in seqspec YAMLs (see library_spec example below), please use the existing onlist files on the IGVF data portal, which can be found under Tabular Files with content_type: barcode onlist. If your onlist file(s) are not on the portal, please work with your wranglers to submit them to the portal first. Additionally, the following guidelines must be followed when referencing barcode onlist files.
    1. If a barcode onlist file is listed under a seqspec section annotated region_type:barcode, onlist:!Onlist, the onlist file is expected to follow these guidelines: 1) it is submitted to the IGVF portal, 2) it has no header, 3) it has only one column, and 4) it has one barcode per line (accepted onlist example: IGVFFI0791WXDC).

    seqspec barcode region onlist

    2. For other sections (e.g., sections that are region_type:index5, onlist:!Onlist), there is no strict format requirement as long as the referenced barcode onlist files are on the portal under Tabular Files with a content_type of barcode onlist (accepted onlist examples: IGVFFI0791WXDC and IGVFFI4565KANH).
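The format rules for onlists referenced from region_type:barcode sections (no header, a single column, one barcode per line) can be checked locally before contacting your wranglers. The sketch below is one possible check. It assumes barcodes are written with the letters A, C, G, T, and N only, which this guide does not state, and it flags any other line as a possible header. The function name is illustrative.

```python
def check_barcode_onlist(lines) -> list[str]:
    """Return format problems for an iterable of onlist lines."""
    allowed = set("ACGTN")  # assumption: nucleotide barcodes only
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            problems.append(f"line {lineno}: empty line")
        elif len(line.split()) != 1 or "," in line:
            problems.append(f"line {lineno}: more than one column")
        elif not set(line.upper()) <= allowed:
            problems.append(f"line {lineno}: not a nucleotide barcode (header?)")
    return problems


# Example usage with a gzipped onlist (hypothetical file name):
#   import gzip
#   with gzip.open("onlist.txt.gz", "rt") as handle:
#       print(check_barcode_onlist(handle))
```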

Submitting seqspecs for NON-single cell uniform pipeline assays

  1. Seqspec YAML files for non-single cell assays should be submitted as Documents. You may still generate the seqspec using the standard method. However, please change the extension of the resulting .yaml file to .txt before uploading it as a Document on the IGVF portal.
  2. The file needs to be uploaded as an attachment to the Document object. The document_type in this case is library structure seqspec. This Document can then be linked to individual Sequence Files using the seqspec_document property on the sequence file object. Seqspec files submitted via this option will not be thoroughly checked by the IGVF check files validation system. Nonetheless, submitters are encouraged to self-validate (refer to the last section) to ensure that there are no errors in the library_spec section of the seqspec YAML file.
  3. Submitters are encouraged to complete the !Read sections in the sequence_spec section of the seqspec YAML file as much as possible. The !File subsections under !Read sections and the Onlist info under library_spec are optional and can be left blank.
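The extension change in step 1 does not alter the file contents. A minimal sketch that writes a .txt copy of an uncompressed seqspec YAML file is shown below; the function name is illustrative, and renaming the file by hand works just as well.

```python
import shutil
from pathlib import Path


def copy_as_txt(yaml_path: str) -> Path:
    """Copy an uncompressed seqspec .yaml file to a sibling .txt file."""
    src = Path(yaml_path)
    dst = src.with_suffix(".txt")  # e.g. my_seqspec.yaml -> my_seqspec.txt
    shutil.copyfile(src, dst)
    return dst
```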

Example seqspec

The example seqspec used is IGVFFI1157AYPH.

Section 1: Assay information

This section describes the experimental setup of one or more sequencing runs. Some of the information listed in this section is also collected as metadata when submitting sequence files and the corresponding file sets to the data portal. Therefore, the seqspec validation check skips this section, except for seqspec_version.

Important info for the !Assay section

  1. The seqspec_version in this section is expected to be v0.3.0 if a seqspec YAML file will be submitted to the IGVF data portal.
  2. The -modality section should only contain one modality. Please do not combine multiple modalities (e.g., RNA and ATAC) into the same seqspec YAML file.
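The single-modality rule can be checked with a small text scan, sketched below. It assumes the modalities are written as a block-style YAML list under a top-level key named modalities, which is an assumption about the file layout rather than something this guide specifies. Flow-style lists such as [rna] are not handled, and the function name is illustrative.

```python
import re


def list_modalities(text: str) -> list[str]:
    """Return the entries listed under a top-level 'modalities:' key."""
    modalities = []
    in_block = False
    for line in text.splitlines():
        if re.match(r"^modalities:\s*$", line):
            in_block = True
            continue
        if in_block:
            item = re.match(r"^\s*-\s+(\S+)\s*$", line)
            if item is None:
                break  # first non-list line ends the block
            modalities.append(item.group(1))
    return modalities
```

A seqspec intended for the IGVF portal should yield exactly one entry here.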

Seqspec version

Section 2: sequence_spec

This section describes the relevant read FASTQ files generated by sequencing runs. When referencing sequencing FASTQ files in this section, the files and the associated read_id must 1) use IGVF accessions, and 2) have valid IGVF file download URLs.

Seqspec FASTQ file reference

Section 3: library_spec

The library_spec section describes in detail the structures and regions of sequencing libraries.

library_spec overview

Validating generated seqspec YAML files

It is recommended that submitters self-validate their seqspec YAML files before submitting them to the IGVF data portal. Two levels of validation are performed on the IGVF data portal.

Level 1 validation: It is applied to all submitted seqspec YAML files. The process includes seqspec schema check, content check, and read FASTQ file URLs validation.

Level 2 validation: It is currently only applicable to seqspec YAML files used for single cell assays (see the assay terms above). In addition to the Level 1 checks, the onlist file URLs are validated.

There are 2 options to validate seqspec files on your own before uploading them to the portal.

Option 1: Using the native seqspec tool

# To validate on Level 1 (schema, content, and fastq file references)
seqspec check -s igvf_onlist_skip yaml.gz

# To validate on Level 2 (schema, content, fastq, and onlist file references)
seqspec check -s igvf yaml.gz
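When many files need checking, the two commands above can be wrapped in a small script. The sketch below builds exactly those command lines and runs them with subprocess. The function names are illustrative, and treating a non-zero exit code as a failure is an assumption about how seqspec check reports errors.

```python
import subprocess


def build_check_command(seqspec_path: str, check_onlists: bool) -> list[str]:
    """Build the seqspec check command for Level 1 or Level 2 validation."""
    # "igvf" also validates onlist references (Level 2);
    # "igvf_onlist_skip" skips them (Level 1).
    spec = "igvf" if check_onlists else "igvf_onlist_skip"
    return ["seqspec", "check", "-s", spec, seqspec_path]


def run_check(seqspec_path: str, check_onlists: bool = False) -> int:
    """Run seqspec check and return its exit code (requires seqspec on PATH)."""
    command = build_check_command(seqspec_path, check_onlists)
    return subprocess.run(command, check=False).returncode
```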

Option 2: Using IGVF local checkfiles

Install IGVF-DACC checkfiles from https://github.com/IGVF-DACC/checkfiles.git and follow the instructions in how to run local checkfiles. This option runs the same file validation system as the IGVF data portal on your local computer. If you are using it only to validate seqspec YAML files, you may simply install the requirements using the command in Step 1 and skip installing additional dependencies. Then follow the commands for local checkfiles validation of seqspec YAML files listed under the section titled "Validate seqspec yaml file while skip onlist files check".

# Clone the repo and move into it (the paths below are relative to the repo root)
git clone https://github.com/IGVF-DACC/checkfiles.git
cd checkfiles

# Install requirements after creating a virtual environment
pip install -r src/checkfiles/requirements.txt

# Run seqspec yaml validation on Level 1 (schema, content, and fastq file references)
python src/checkfiles/checkfiles_local.py --input_file_path src/tests/data/seqspec_valid.yaml.gz --file_format yaml --content_type seqspec --onlist_skip --md5sum f1859dd9d60554a8f8ab63b65b458267

# Run seqspec yaml validation on Level 2 (schema, content, fastq, and onlist file references)
python src/checkfiles/checkfiles_local.py --input_file_path src/tests/data/seqspec_valid.yaml.gz --file_format yaml --content_type seqspec --md5sum f1859dd9d60554a8f8ab63b65b458267
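The commands above pass a checksum through --md5sum. To validate your own file you will need its checksum; the sketch below computes an MD5 hex digest with Python's standard library. That the flag expects the MD5 of the gzipped file exactly as it will be submitted is an assumption based on the example value, and the function name is illustrative.

```python
import hashlib


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be passed as the --md5sum value in place of the example digest.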