Analyzing Gene Expression with STAR and Pysam

Introduction

Gene expression patterns are central to understanding how cells function. This blog post walks through an analysis pipeline built on STAR (Spliced Transcripts Alignment to a Reference), an RNA-seq read aligner, combined with the Python library pysam. The example dataset is the Sequence Read Archive (SRA) run “SRR639771.”

Setting Up the Environment

Before diving into the code, let’s set the stage. The reference data comes from Ensembl Genomes (release 39): the annotation in GTF (Gene Transfer Format) and the genome sequence in FASTA format for Cryptococcus neoformans var. grubii H99. Both files, and later the STAR index, are stored locally in the GENOME_DIR directory.

id="SRR639771" # Sequence Read Archive ID (NCBI)
GENOME_DIR="/content/genome/"
FASTQ_DIR="/content/fastq/"
GTF_URL="ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz"
FASTA_URL="ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/fasta/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/dna/Cryptococcus_neoformans_var_grubii_h99.CNA3.dna.toplevel.fa.gz"
FASTAGZ=GENOME_DIR+FASTA_URL.split("/")[-1]
GTFGZ=GENOME_DIR+GTF_URL.split("/")[-1]
FASTA=FASTAGZ.replace(".gz","")
GTF=GTFGZ.replace(".gz","")
STAR_PATH="/content/STAR/source/STAR" # path to the STAR binary once it has been built from source
STAR_OUT="/content/starout/"

Downloading and Preprocessing

The next step downloads the raw data and installs the tools. The SRA toolkit's prefetch command fetches the run, and fastq-dump converts it to gzipped FASTQ. The same step installs pysam and builds STAR from source.

!apt install -y sra-toolkit &>/dev/null

%%bash -s "$id" "$FASTQ_DIR"
prefetch "${1}" &>/dev/null
fastq-dump --outdir "${2}" --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip "/content/${1}/${1}.sra" &>/dev/null

%%bash -s "$id"
zcat /content/fastq/${1}_pass_1.fastq.gz | head
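
The same sanity check can be done from Python. The sketch below is only an illustration: read_fastq_records is a hypothetical helper name, and it assumes the usual four-line FASTQ record layout (header, sequence, separator, quality).

```python
import gzip

def read_fastq_records(path, limit=None):
    """Yield (header, sequence, quality) tuples from a gzipped FASTQ file.

    Hypothetical helper; assumes every record spans exactly four lines.
    """
    with gzip.open(path, "rt") as handle:
        count = 0
        while limit is None or count < limit:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file reached
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header, sequence, quality
            count += 1

# Example use in the notebook (paths follow the variables defined earlier):
# for header, seq, qual in read_fastq_records(FASTQ_DIR + id + "_pass_1.fastq.gz", limit=3):
#     print(header, len(seq))
```

Reading lazily with a generator keeps memory use flat, which matters because the full FASTQ file can be large.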

!pip install pysam &>/dev/null

%%bash
git clone https://github.com/alexdobin/STAR.git && make -C STAR/source &>/dev/null

Genome Indexing with STAR

STAR requires a pre-built genome index for efficient alignment. The cell below downloads and decompresses the GTF and FASTA files, then uses them to generate this index in GENOME_DIR.

%%bash -s "$GENOME_DIR" "$STAR_OUT" "$GTF_URL" "$FASTA_URL" "$GTFGZ" "$FASTAGZ" "$GTF" "$FASTA" "$STAR_PATH"

mkdir -p "${1}"
mkdir -p "${2}"

if wget --quiet --directory-prefix="${1}" "${3}"; then
    echo "GTF annotation downloaded successfully."
else
    echo "Error downloading the GTF annotation. Exiting."
    exit 1
fi

if wget --quiet --directory-prefix="${1}" "${4}"; then
    echo "Genome FASTA downloaded successfully."
else
    echo "Error downloading the genome FASTA. Exiting."
    exit 1
fi

gunzip "${5}"
gunzip "${6}"

"${9}" --runMode genomeGenerate --genomeDir "${1}" --genomeFastaFiles "${8}" --sjdbGTFfile "${7}" --outFileNamePrefix "${2}/genome_" --genomeSAindexNbases 11
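
The value passed to --genomeSAindexNbases depends on genome size. For small genomes, the STAR manual recommends scaling it down to min(14, log2(GenomeLength)/2 - 1). The sketch below shows how that number could be derived. Both function names are hypothetical, and the 19 Mb figure in the comment is only an approximate genome size used for illustration.

```python
import math

def suggested_sa_index_nbases(genome_length):
    """Return min(14, log2(genome_length) / 2 - 1), rounded down."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return min(14, math.floor(math.log2(genome_length) / 2 - 1))

def fasta_length(path):
    """Sum the sequence lengths in an uncompressed FASTA file."""
    total = 0
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):  # skip sequence header lines
                total += len(line.strip())
    return total

# Example use in the notebook, after the FASTA file has been decompressed:
# print(suggested_sa_index_nbases(fasta_length(FASTA)))
# A genome of roughly 19 Mb gives 11, the value used in the command above.
```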

Aligning Reads with STAR

With the genome indexed, the next step involves aligning the sequenced reads to the reference genome using STAR.

%%bash -s "$GENOME_DIR" "$id" "$FASTQ_DIR" "$STAR_PATH" "$STAR_OUT"

"${4}" \
--genomeDir "${1}" \
--readFilesIn "${3}${2}_pass_1.fastq.gz" \
--readFilesCommand zcat \
--outBAMsortingBinsN 200 \
--runThreadN 2 \
--limitBAMsortRAM 1795207491 \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix "${5}"
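
Before moving on, it is worth checking how many reads actually aligned. STAR writes a summary file named Log.final.out under the output prefix, with one "name | value" pair per line. The following is a minimal sketch of reading it. parse_star_final_log is a hypothetical helper, and it returns the values as strings rather than converting them.

```python
def parse_star_final_log(text):
    """Turn the 'name | value' lines of a STAR Log.final.out file into a dict.

    Hypothetical helper; lines without a '|' (section titles) are skipped.
    """
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        key, _, value = line.partition("|")
        stats[key.strip()] = value.strip()
    return stats

# Example use in the notebook (STAR_OUT is the output prefix defined earlier):
# with open(STAR_OUT + "Log.final.out") as handle:
#     stats = parse_star_final_log(handle.read())
# print(stats.get("Uniquely mapped reads %"))
```

A low uniquely mapped percentage usually points to a problem upstream, such as a mismatched reference, so it is a useful checkpoint before counting reads.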

Quantifying Gene Expression with Pysam

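Now we switch to Python and use pysam to quantify gene expression. The code below indexes the sorted BAM file, then walks through the GTF annotation and counts the reads overlapping each exon. These per-exon counts are later normalized to Reads Per Kilobase of exon per Million mapped reads (RPKM) and summarized per gene and per chromosome.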

import pysam

bam_file = STAR_OUT + "Aligned.sortedByCoord.out.bam"
bai_file = bam_file + ".bai"
gtf_file = GTF

# Build the .bai index next to the BAM file so reads can be fetched by region
pysam.index(bam_file)
# Create a pysam AlignmentFile object with the index file
samfile1 = pysam.AlignmentFile(bam_file, "rb", index_filename=bai_file)
gene_read_counts = []

with open(gtf_file, "r") as gtf:
    for line in gtf:
        if line.startswith("#"):
            continue  # Skip comments in GTF file

        #if 'exon_number "1"' in line and line[0:2]=="Mt":
        fields = line.strip().split("\t")
        feature = fields[2]
        if feature == 'exon':
            chromosome = fields[0]#.replace("chr","")

            start = int(fields[3])
            end = int(fields[4])
            gene_name_full = fields[8]

            nameparts=gene_name_full.split(";")
            gene_id=nameparts[0].split(" ")[1].replace("\"","")

            transcript_id=nameparts[1].split(" ")[2].replace("\"","")
            exon_number=nameparts[2].split(" ")[2].replace("\"","")

            #print(chromosome,feature,start,end,gene_name_full)
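            # Count reads overlapping the exon. GTF coordinates are 1-based and
            # inclusive, while pysam expects 0-based, half-open intervals.
            try:
              read_count = samfile1.count(chromosome, start - 1, end)
              gene_read_counts.append((feature,chromosome, gene_id,transcript_id,exon_number,end-start+1,read_count))
            except ValueError:
              # Skip exons on contigs that pysam cannot find in the BAM file
              continue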
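
The attribute parsing above relies on gene_id, transcript_id and exon_number appearing in a fixed order. A more order-independent alternative is to parse the whole attribute column into a dictionary. This is only a sketch: parse_gtf_attributes is a hypothetical helper, the IDs in the comment are made up, and it assumes every value is double-quoted.

```python
import re

# Matches pairs such as: gene_id "some_value"
_ATTRIBUTE_PATTERN = re.compile(r'(\w+)\s+"([^"]*)"')

def parse_gtf_attributes(attribute_field):
    """Parse the ninth GTF column into a {key: value} dict.

    Hypothetical helper; unquoted values are ignored, and if a key
    repeats, the last occurrence wins.
    """
    return dict(_ATTRIBUTE_PATTERN.findall(attribute_field))

# Example with made-up identifiers:
# attrs = parse_gtf_attributes('gene_id "G1"; transcript_id "T1"; exon_number "2";')
# attrs["gene_id"], attrs["transcript_id"], attrs["exon_number"]
```

Looking fields up by name means the loop keeps working if the annotation lists its attributes in a different order.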

Visualizing Results

The final section loads the per-exon counts into a pandas DataFrame, normalizes them to RPKM, and visualizes the results. We generate several plots, including bar charts of mean RPKM and mean raw read counts per chromosome.

import pandas as pd
df=pd.DataFrame(gene_read_counts,columns=['Feature','Chromosome','Gene ID','Transcript ID', 'Exon','Length','Counts'])
df.head()

# Reads Per Kilobase of exon per Million mapped reads:
# counts / (exon length in kb) / (mapped reads in millions)
total_mapped_reads = samfile1.mapped
df['RPKM'] = df.Counts * 1e9 / (df.Length * total_mapped_reads)
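
As a worked example of the normalization: RPKM divides the read count by the feature length in kilobases and by the number of mapped reads in millions. The function below is a small sketch with made-up numbers. rpkm is a hypothetical helper and is not part of pysam or pandas.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads.

    Equivalent to read_count / (length / 1e3) / (total_mapped_reads / 1e6).
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)

# Made-up numbers: 500 reads on a 2,000 bp exon, 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))  # 500 / 2 kb / 10 M = 25.0
```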

P=pd.pivot_table(data=df,values="RPKM",index=['Chromosome','Feature','Gene ID','Transcript ID'],aggfunc='mean')
P.sort_values(by='RPKM',ascending=False).head(20)

df.groupby(by = 'Gene ID').mean(numeric_only = True).sort_values(by='RPKM',ascending=False).head(20)
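
Averaging per-exon RPKM gives short and long exons the same weight. One alternative, sketched here in plain Python, is to sum counts and lengths per gene first and normalize once. gene_level_rpkm is a hypothetical helper and the input rows are made up. The sketch does not deduplicate exons that the GTF lists under several transcripts, so treat the result as a rough estimate.

```python
from collections import defaultdict

def gene_level_rpkm(exon_rows, total_mapped_reads):
    """Compute one RPKM value per gene from (gene_id, length, count) rows.

    Hypothetical helper: sums exon lengths and counts per gene, then
    applies count * 1e9 / (length * total_mapped_reads).
    """
    totals = defaultdict(lambda: [0, 0])  # gene_id -> [length, count]
    for gene_id, length, count in exon_rows:
        totals[gene_id][0] += length
        totals[gene_id][1] += count
    return {
        gene_id: count * 1e9 / (length * total_mapped_reads)
        for gene_id, (length, count) in totals.items()
        if length > 0
    }

# Example use with the DataFrame built earlier:
# rows = zip(df['Gene ID'], df['Length'], df['Counts'])
# per_gene = gene_level_rpkm(rows, total_mapped_reads)
```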

data=df.groupby(by = 'Chromosome').mean(numeric_only = True).sort_values(by='RPKM',ascending=False).reset_index()
data

data.plot.bar(y='RPKM',x='Chromosome');

data.plot.bar(y='Counts',x='Chromosome');

import matplotlib.pyplot as plt

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(data['Chromosome'], data['RPKM'], 'b-')
ax2.plot(data['Chromosome'], data['Counts'], 'r-')
ax1.set_ylabel('Reads Per kb per Million Mapped Reads', color='blue')
ax2.set_ylabel('Counts', color='red')
ax1.set_xlabel('Chromosome')
plt.show()

Conclusion

In conclusion, this pipeline combines command-line tools such as the SRA toolkit and STAR with the Python libraries pysam and pandas to go from raw sequencing reads to per-exon RPKM values. It can be adapted to other datasets by changing the SRA ID and the annotation and genome URLs, and by adjusting parameters such as --genomeSAindexNbases to the genome size.