
Delivering Your Data

We define two tiers for delivering your NGS data: the Silver Tier and the Gold Tier.

1 Silver Tier

This is the default / standard tier. We provide your NGS data through download links. These links and the data will be available for 90 days. You need to store the data on your own data storage solution. After 90 days, the data will be deleted from the Genomics Core servers. The data storage costs are included in your sequencing project.

2 Gold Tier

In contrast with the Silver Tier, where the client is responsible for providing their own data storage solution, we can also host a storage service for your NGS data by means of a Google Cloud Bucket. The provisioned Bucket is under our administration, and we provide you the necessary access to it.

The principal investigator (PI) is responsible for requesting access to the bucket for their collaborators. You will be billed on a quarterly basis for the storage usage. We provide a cost simulator to support you in defining the optimal storage policy. The storage policy can be modified when your requirements change.

Disclaimer

Even though we provide a storage service for you, this does not include additional backups. If you accidentally delete your data, it will be lost beyond recovery and we cannot be held responsible.

Switching

If at any point you would like to switch from the Gold Tier to the Silver Tier, make sure to back up your data before making the switch. As soon as you switch, the files will be deleted according to the Silver Tier policy; data recovery is not possible by any means. At the end of the quarter, you will receive the bill for the usage period in the Gold Tier. Example: if you switch from Gold to Silver on May 19th, your bill in July will include the storage usage for all of April and for May 1st through 19th.


What is a Google Cloud Bucket?

A Bucket is what Google Cloud defines as a container that holds your data. Buckets live in Google Cloud's storage service, Google Cloud Storage. The files stored inside a Bucket are called Objects or Blobs. There is no limit on the size of a Bucket, nor on the number of Objects it can hold.

In a Bucket, four Storage Classes are available:

| STORAGE CLASS | STORAGE COST | RETRIEVAL COST(*) | RETENTION(**) | AVERAGE USAGE FREQUENCY | TYPE |
| ------------- | ------------ | ----------------- | ------------- | ----------------------- | ---- |
| STANDARD      | highest      | NA                | NA            | more than once per month | HOT |
| NEARLINE      | lower        | lowest            | 30 days       | once per month          | ARCHIVE |
| COLDLINE      | lower still  | higher            | 90 days       | about once every 3 months | ARCHIVE |
| ARCHIVE       | lowest       | highest           | 365 days      | up to once a year       | ARCHIVE |

(*) Archive storage classes charge a retrieval fee, as retrieving archived data carries an operational cost. The impact on latency, however, is negligible.

(**) Archive storage classes enforce a minimum retention period. Deleting objects before the retention period has elapsed incurs early deletion costs: you pay the remainder of the storage cost as if the object were still present. For example, if an object is deleted after 50 days in Coldline, 90 - 50 = 40 days remain, so an additional 40 days of Coldline storage cost is charged as an early deletion fee. Costs are expressed in EUR/xBytes or EUR/xBytes/duration.
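The minimum-retention rule can be sketched in a few lines of Python. The retention periods come from the table above; the billing function itself is an illustration, not the actual GCP billing logic:

```python
# Minimum retention periods per archive storage class (days), from the table above.
RETENTION_DAYS = {"NEARLINE": 30, "COLDLINE": 90, "ARCHIVE": 365}

def billable_days(storage_class: str, days_stored: int) -> int:
    """Days you pay for: at least the minimum retention period of the class."""
    return max(days_stored, RETENTION_DAYS.get(storage_class, 0))

# The Coldline example from above: deleted after 50 days, billed as 90,
# i.e. 40 days of early deletion fee.
print(billable_days("COLDLINE", 50))   # 90
print(billable_days("STANDARD", 5))    # 5 (no minimum retention)
```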

  • Download cost is a fee incurred for every download, expressed in EUR/xBytes, regardless of storage class.
  • Standard is the default / hot storage class, with the highest storage cost but no retrieval costs. This is recommended for frequently used data.
  • Nearline is an archive storage class with only a minor reduction in storage cost. We typically do not use it, as the cost benefits are very limited for our use cases.
  • Coldline is an archive storage class with a substantially lower storage cost and a 90-day minimum retention, suited for data accessed about once every 3 months.
  • Archive is the cheapest in terms of storage cost, but its retrieval costs are the highest. This is recommended for archival purposes with infrequent usage, typically once or twice a year at most.

Cloud Storage Policy

As there is no size limit on the storage container, the storage costs have no limit either. To control and reduce storage costs, we advise you to define a Storage Policy tailored to your requirements. Based on the table above, this boils down to keeping data in Standard while you access it frequently, and moving it to an archive storage class afterwards. Typically, data is accessed frequently in the first month or two, after which the frequency drops dramatically, with a potential (partial or full) retrieval of the dataset after 6 to 12 months. Therefore, the base storage policy we advise, based on the pricing model, is:

  • Month 0-3 in Standard
  • Month 3-12 in Coldline
  • Month 12-inf in Archive
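This policy can be sketched as a simple age-to-class mapping. The boundaries (3 and 12 months) come from the list above; the function is an illustration, not our actual lifecycle configuration:

```python
# Map an object's age (in months) to the advised storage class.
def storage_class_for_age(age_months: float) -> str:
    if age_months < 3:
        return "STANDARD"   # frequent access, no retrieval cost
    if age_months < 12:
        return "COLDLINE"   # occasional access, 90-day retention
    return "ARCHIVE"        # long-term archival, rare access

print(storage_class_for_age(1))   # STANDARD
print(storage_class_for_age(6))   # COLDLINE
print(storage_class_for_age(24))  # ARCHIVE
```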

This allows you to download multiple times without incurring retrieval costs in the first 3 months, while significantly reducing costs afterwards. If you need a dataset that is in Coldline or Archive, you will incur retrieval costs, but the reduction in monthly storage costs outweighs the retrieval cost. Please have a look at the Storage Policy Simulator (in the Project Manager under My Data > Cost Simulator) to get an idea of the costs and the impact of a storage policy on your bill.

Moving objects from one storage class to another does not entail a change in access. All data is still present in the same bucket, and can be accessed as if it were in Standard.

Good to know

Objects can only be downgraded in storage class, i.e. they can go from Standard > Nearline > Coldline > Archive, or skip classes along the way (e.g. Standard > Archive, or Standard > Coldline > Archive), but never back up. Once an object is in a given archive class, it cannot return to a higher class.
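A minimal sketch of this one-way rule, assuming the four classes are ordered as listed:

```python
# Storage classes from "hottest" to "coldest"; transitions may only move right.
ORDER = ["STANDARD", "NEARLINE", "COLDLINE", "ARCHIVE"]

def transition_allowed(current: str, target: str) -> bool:
    """An object may only move to a colder (later) class, never back up."""
    return ORDER.index(target) > ORDER.index(current)

print(transition_allowed("STANDARD", "ARCHIVE"))  # True  (skipping classes is fine)
print(transition_allowed("ARCHIVE", "STANDARD"))  # False (no way back up)
```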

As a final note, automatic deletion is also a possibility.

Storage Billing

The following costs can be listed on your invoice:

  • Storage Costs: the costs incurred for storing the data
  • Retrieval Costs: the costs incurred for retrieving data from an archive type of class
  • Download Costs: the cost incurred by downloading data

3 What to Expect

3.1 Illumina based FASTQ files

3.1.1 File Name Format

source: Illumina bcl2fastq Conversion Software Support

The file structure from Illumina's bcl2fastq (demultiplexing) is of the following format:

<SampleName>_S<SampleNumber>_L<LaneNumber>_<ReadType>_001.fastq.gz

<SampleName>: This is the Sample Name or Sample ID. We use GC codes to label the samples, e.g. GC111111_AGCTATCA-GTCGATGT

S<SampleNumber>: This is the Sample Number. bcl2fastq assigns a unique, sequential number (starting with S1) to each sample based on its order in the SampleSheet.csv. S1 means this was the first sample listed in the sheet. S12 would be the twelfth.

L<LaneNumber>: This is the Lane Number on the flow cell from which the data was generated. The L is followed by a zero-padded number (e.g., L001 for lane 1, L002 for lane 2).

<ReadType>: This indicates the Read Type. R1 indicates the Read 1 (the forward read). In a paired-end run, you will have a corresponding file with R2, which contains Read 2 (the reverse read). I1 / I2: You might also see files with I1 (Index 1) or I2 (Index 2).

_001: This is the Set Number. By default, bcl2fastq does not split files, so this number is almost always 001. It's designed to differentiate file chunks if a large FASTQ file were to be split, but in practice, you'll see 001.

.fastq.gz: This is the File Extension. .fastq indicates the file is in the standard FASTQ format (containing sequence and quality scores). .gz indicates the file has been compressed using gzip to save space.
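As a sketch, the name format described above can be parsed with a regular expression. The pattern below is an illustration derived from this description, not an official Illumina specification:

```python
import re

# One named group per component of
# <SampleName>_S<SampleNumber>_L<LaneNumber>_<ReadType>_001.fastq.gz
PATTERN = re.compile(
    r"(?P<sample>.+)_S(?P<number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>R1|R2|I1|I2)_(?P<set>\d{3})\.fastq\.gz$"
)

m = PATTERN.match("GC111111_AGCTATCA-GTCGATGT_S1_L001_R1_001.fastq.gz")
print(m.group("sample"))  # GC111111_AGCTATCA-GTCGATGT
print(m.group("lane"))    # 001
print(m.group("read"))    # R1
```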

As such, we deliver the data using the following file name structure:

GC111111_AGCTATCA-GTCGATGT_S1_L001_R1_001.fastq.gz
GC111111_AGCTATCA-GTCGATGT_S1_L001_R2_001.fastq.gz
GC111111_AGCTATCA-GTCGATGT_S1_L001_I1_001.fastq.gz
GC111111_AGCTATCA-GTCGATGT_S1_L001_I2_001.fastq.gz
...
GC111112_TATCGCAG-CGATACTG_S66_L002_R1_001.fastq.gz
GC111112_TATCGCAG-CGATACTG_S66_L002_R2_001.fastq.gz
GC111112_TATCGCAG-CGATACTG_S66_L002_I1_001.fastq.gz
GC111112_TATCGCAG-CGATACTG_S66_L002_I2_001.fastq.gz
...

3.1.2 Delivery

You receive a download link with a zip file that has the following structure:

.
├── pi_name/
│   └── project_title/
│       └── sequencing_run_name/
│           └── demultiplexing_group_id/
│               ├── GC111111_AGCTATCA-GTCGATGT_S1_L001_R1_001.fastq.gz
│               ├── GC111111_AGCTATCA-GTCGATGT_S1_L001_R2_001.fastq.gz
│               ├── GC111111_AGCTATCA-GTCGATGT_S1_L001_I1_001.fastq.gz
│               ├── GC111111_AGCTATCA-GTCGATGT_S1_L001_I2_001.fastq.gz
│               ├── GC111112_TATCGCAG-CGATACTG_S66_L002_R1_001.fastq.gz
│               ├── GC111112_TATCGCAG-CGATACTG_S66_L002_R2_001.fastq.gz
│               ├── GC111112_TATCGCAG-CGATACTG_S66_L002_I1_001.fastq.gz
│               ├── GC111112_TATCGCAG-CGATACTG_S66_L002_I2_001.fastq.gz
│               ├── ...
│               └── dxstats.csv
└── outline.csv

The dxstats.csv contains the demultiplexing statistics, essential for quality control.

The outline.csv contains an outline of the zip file, listing every file with its expected md5sum. These can be used to verify that the files you extracted were not corrupted during transfer or extraction (unzipping). It has the following header: md5sum, file.
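As a sketch, the verification can be scripted. This assumes the header columns are literally `md5sum` and `file`, and that `root` is the directory you extracted the zip into:

```python
import csv
import hashlib
from pathlib import Path

def md5_of(path: Path) -> str:
    """Compute the md5 hex digest of a file, reading in 1 MiB chunks."""
    h = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(outline: Path, root: Path) -> list:
    """Return the files whose md5sum does not match outline.csv."""
    bad = []
    with outline.open(newline="") as fh:
        for row in csv.DictReader(fh):
            if md5_of(root / row["file"]) != row["md5sum"]:
                bad.append(row["file"])
    return bad

# Usage (paths illustrative): mismatches = verify(Path("outline.csv"), Path("."))
```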

The demultiplexing_group_id is a unique identifier for a group of samples that were demultiplexed together.

Learn more about handling download links here

The data is delivered to your bucket in the following structure:

.
└── bucket/
    └── project_name/
        └── run_name/
            └── demultiplexing_group_id/
                ├── GC111111_AGCTATCA-GTCGATGT_S1_L001_R1_001.fastq.gz
                ├── GC111111_AGCTATCA-GTCGATGT_S1_L001_R2_001.fastq.gz
                ├── GC111111_AGCTATCA-GTCGATGT_S1_L001_I1_001.fastq.gz
                ├── GC111111_AGCTATCA-GTCGATGT_S1_L001_I2_001.fastq.gz
                ├── GC111112_TATCGCAG-CGATACTG_S66_L002_R1_001.fastq.gz
                ├── GC111112_TATCGCAG-CGATACTG_S66_L002_R2_001.fastq.gz
                ├── GC111112_TATCGCAG-CGATACTG_S66_L002_I1_001.fastq.gz
                ├── GC111112_TATCGCAG-CGATACTG_S66_L002_I2_001.fastq.gz
                ├── ...
                ├── dxstats.csv
                └── GC_GCP_COPY_COMPLETE.flag



The GC_GCP_COPY_COMPLETE.flag is a file that indicates (flags) that the data transfer to your bucket is finished.

Learn more about how to interact with a bucket here.

3.1.3 Demultiplexing Stats Explained

The dxstats.csv file has the following header:

run,lane,project,sample,barcode,cluster_count,cluster_count_0_mismatch,cluster_count_1_mismatch,pct_lane,yield,yield_above_q30,qscore_sum,yield_quality_avg,pct_above_q30_bases
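The file is plain CSV, so it can be loaded with any CSV reader. A minimal sketch (column names taken from the header above; the file path is illustrative):

```python
import csv

def read_dxstats(path):
    """Return a list of dicts, one per sample/lane row of dxstats.csv."""
    with open(path, newline="") as fh:
        return list(csv.DictReader(fh))

# Usage (path illustrative):
# for row in read_dxstats("dxstats.csv"):
#     print(row["sample"], row["cluster_count"], row["pct_above_q30_bases"])
```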


How Many Reads

cluster_count The total number of reads (or read pairs) that were successfully assigned to this specific sample. In Illumina sequencing, each "read" comes from a "cluster" of DNA on the flow cell. This is your main "how many" number for a sample.

cluster_count_0_mismatch The number of reads from the cluster_count where the sample's index (barcode) perfectly matched the index you provided in the sample sheet, with zero errors or differences. This shows you how many of your index reads were sequenced perfectly.

cluster_count_1_mismatch The number of reads from the cluster_count where the index matched only after allowing for one error (one base mismatch). bcl2fastq does this to "rescue" reads that had a minor sequencing error in the index read itself. A high number here (relative to the 0-mismatch count) can suggest that the sequencing quality of the index read was low, forcing the software to do more "rescuing."

Info

We sequence with mismatch 0 as standard. We only try with mismatch 1 after investigation and communication with our customers.
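The mismatch counting above boils down to comparing the sequenced index read against the barcode in the sample sheet, position by position. A simplified sketch (real demultiplexing also handles dual indexes and ambiguity between samples):

```python
def mismatches(index_read: str, barcode: str) -> int:
    """Count positions where the sequenced index differs from the barcode."""
    return sum(a != b for a, b in zip(index_read, barcode))

print(mismatches("AGCTATCA", "AGCTATCA"))  # 0 -> counted in cluster_count_0_mismatch
print(mismatches("AGCTATCC", "AGCTATCA"))  # 1 -> rescued only if mismatch 1 is allowed
```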


How Much Data

pct_lane The percentage of all data in the sequencing lane that belongs to this specific sample, i.e. this sample's cluster_count divided by the total cluster_count of all samples in the lane. This is the most critical metric for checking sample balancing. If you pooled 10 samples and aimed for 10% each, this number tells you how close you got. If one sample is at 50% (pct_lane = 50.0) and another at 1% (pct_lane = 1.0), this indicates an error during library preparation or pooling, poor sample QC, or it can be the result of clustering preferences.
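The calculation is straightforward; the counts below are made up for illustration:

```python
# cluster_count per sample in one lane (hypothetical numbers).
lane_counts = {"GC111111": 5_000_000, "GC111112": 4_800_000, "GC111113": 200_000}

total = sum(lane_counts.values())
pct_lane = {s: round(100 * c / total, 1) for s, c in lane_counts.items()}
print(pct_lane)  # {'GC111111': 50.0, 'GC111112': 48.0, 'GC111113': 2.0}
# GC111113 at 2% in a 3-sample pool would flag a balancing problem.
```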

yield The total number of bases (A, T, C, G) sequenced for this sample, regardless of quality. This is the raw volume of data you got, often reported in Megabases (Mb) or Gigabases (Gb). Whereas cluster_count is the number of reads, yield is the total amount of sequence. It is calculated as (this sample's cluster_count) x (total number of cycles/bases in the reads, e.g. 2x150 bp).


Quality Metrics

A Q-score (or Phred score) is a measure of base call accuracy. Q30 is the industry-standard benchmark for high-quality sequencing, and it signifies 1 in 1000 chance of error (99.9% accuracy).
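The relation between a Phred score Q and its error probability is 10^(-Q/10), which is easy to verify:

```python
def error_probability(q: int) -> float:
    """Error probability implied by a Phred quality score Q."""
    return 10 ** (-q / 10)

print(error_probability(30))  # 0.001 -> 1 in 1000, i.e. 99.9% accuracy
print(error_probability(20))  # 0.01  -> 1 in 100
```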

pct_above_q30_bases This is arguably the most important quality metric. It's the percentage of all bases in this sample that had a quality score of Q30 or higher, and the single best snapshot of your sample's overall data quality. You want this number to be at least the percentage stated for the kit/cycle length used: typically >= 85% or >= 90%, but for a MiSeq 600-cycle kit, for example, the threshold is >= 75%. If this number is lower than the threshold, your data is noisy and may be unusable for sensitive analyses.

yield_above_q30 The total number of bases in your sample that were at or above Q30. This is your "high-quality yield." It's the total amount of data you can actually trust. This is often more important than the total yield.

qscore_sum The sum of all the individual Q-scores for every single base in the sample. By itself, this number is not useful. It's a massive, uninterpretable number. It is only used as an intermediate step to calculate the average.

yield_quality_avg The average Q-score across all bases in your sample, calculated as qscore_sum / yield. This gives you a single-number average for quality. It's generally less preferred than % Q30 because an average can be misleading. For example, a sample with 50% Q40 bases and 50% Q20 bases might have an average of Q30, but so would a sample with 100% Q30 bases. The % Q30 metric is more direct and easier to interpret.
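The pitfall described above is easy to demonstrate with two hypothetical samples that share the same average Q but differ sharply in % Q30:

```python
# Per-base Q-scores of two hypothetical samples.
sample_a = [40] * 50 + [20] * 50   # half excellent, half poor
sample_b = [30] * 100              # uniformly Q30

def avg(qs):
    return sum(qs) / len(qs)

def pct_q30(qs):
    return 100 * sum(q >= 30 for q in qs) / len(qs)

print(avg(sample_a), pct_q30(sample_a))  # 30.0 50.0
print(avg(sample_b), pct_q30(sample_b))  # 30.0 100.0
# Same average quality, very different fractions of trustworthy bases.
```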

3.2 PacBio 🚧

WIP

3.3 Oxford Nanopore Technologies 🚧

WIP

3.4 Custom 🚧

WIP