File Validation

Files submitted to the IGVF data portal are validated with Checkfiles, a tool that ensures the integrity and correctness of uploaded files. Checkfiles verifies new or updated files in the AWS S3 bucket by comparing their size and MD5 checksum (for both gzipped and uncompressed versions) with the metadata provided during submission. Additionally, it performs specific validations for certain file formats. For detailed information on these format-specific checks, refer to the Checkfiles GitHub README.
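The size and checksum comparison described above can be reproduced locally before submission. The following is a minimal sketch, not the Checkfiles implementation: the function names are hypothetical, and the keys file_size, md5sum, and content_md5sum are assumed names for the submitted metadata properties, so confirm them against the file schema you are submitting to.

```python
import gzip
import hashlib
import os


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def local_file_summary(path):
    """Collect the size and MD5 values for a local file.

    The dictionary keys are assumed property names, used here only
    for illustration.
    """
    summary = {"file_size": os.path.getsize(path)}
    # MD5 of the file exactly as stored on disk (gzipped if it is a .gz file).
    with open(path, "rb") as raw:
        summary["md5sum"] = md5_of_stream(raw)
    # For gzipped files, also hash the uncompressed content.
    if path.endswith(".gz"):
        with gzip.open(path, "rb") as unzipped:
            summary["content_md5sum"] = md5_of_stream(unzipped)
    return summary
```

Comparing the returned values with the metadata you plan to submit can catch a mismatch, for example a file that was re-compressed after its checksum was recorded, before the upload takes place.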

If a file fails validation, its upload_status is set to invalidated and the file object is patched with a validation_error_detail describing the reason for the failure. To avoid submission errors, validate files locally before uploading to confirm that they are properly formatted and meet the validation requirements.

For some tabular file content types, focus groups within the consortium have established standards that every file of that type is expected to meet. Checkfiles validates these files with frictionless against predefined schemas (linked below), checking that each submitted file with the corresponding content_type includes all required fields and that the data in each field matches the expected format. Most files are validated against a .json schema, but BED files use a separate schema format (.as, autoSql) standardized by the UCSC Genome Browser. Permitted BED file formats: bed3, bed3+, bed5, bed6, bed6+, bed9, bed9+, bed12, mpra_starr.
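As a rough local pre-check for the permitted BED formats listed above, the number of tab-separated columns on each line can be compared with the declared format. This is a minimal sketch under stated assumptions, not the validation Checkfiles performs: the function names are hypothetical, a bedN+ format is assumed to mean N or more columns, and mpra_starr is left out because its layout is defined by its own .as schema.

```python
import re

# Formats from the permitted list, excluding mpra_starr (see the note above).
PERMITTED = {"bed3", "bed3+", "bed5", "bed6", "bed6+", "bed9", "bed9+", "bed12"}


def column_count_ok(bed_format, n_columns):
    """Return True if n_columns is compatible with the declared BED format.

    Assumption: "bedN" means exactly N columns and "bedN+" means N or more.
    """
    match = re.fullmatch(r"bed(\d+)(\+?)", bed_format)
    if bed_format not in PERMITTED or match is None:
        raise ValueError(f"unsupported BED format: {bed_format}")
    standard = int(match.group(1))
    if match.group(2):
        return n_columns >= standard
    return n_columns == standard


def check_bed_columns(lines, bed_format):
    """Return (line_number, column_count) pairs for lines that do not fit."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if not text or text.startswith(("#", "track", "browser")):
            continue
        n_columns = len(text.split("\t"))
        if not column_count_ok(bed_format, n_columns):
            problems.append((number, n_columns))
    return problems
```

A check like this only catches structural problems such as a missing column; it does not inspect field types or values, which the schema-based validation covers.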

The focus-group standards are developed collaboratively and are still a work in progress. To view the list of standards and track their development, refer to this document.

Tabular File Format Standards

guide RNA sequences

prime editing guide RNA sequences

MPRA sequence designs

BED File Format Standards

mpra_starr