MSCI Documentation
Functions
generate_variable_length_peptides(protein_sequence, min_length=8, max_length=11)
Generates all possible peptides of varying lengths from a given protein sequence.
- Parameters:
protein_sequence (str) – The protein sequence from which peptides are generated.
min_length (int, optional) – The minimum length of peptides to generate (default: 8).
max_length (int, optional) – The maximum length of peptides to generate (default: 11).
- Returns:
list – A list of generated peptides of varying lengths.
- Example:
peptides = generate_variable_length_peptides("ABCDEFG", min_length=3, max_length=5)
print(peptides)
# Output: ['ABC', 'BCD', 'CDE', 'DEF', 'EFG', 'ABCD', 'BCDE', 'CDEF', 'DEFG', 'ABCDE', 'BCDEF', 'CDEFG']
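The enumeration can be pictured as a sliding window moved across the sequence once per allowed length. The following is a minimal sketch of that idea, not the package's actual implementation; the name `sliding_window_peptides` is hypothetical and is used only for illustration.

```python
def sliding_window_peptides(sequence, min_length=8, max_length=11):
    """Return every substring whose length lies in [min_length, max_length]."""
    peptides = []
    for length in range(min_length, max_length + 1):
        # A window of this length fits at len(sequence) - length + 1 start positions;
        # if the window is longer than the sequence, the range is empty.
        for start in range(len(sequence) - length + 1):
            peptides.append(sequence[start:start + length])
    return peptides


print(sliding_window_peptides("ABCDEFG", min_length=3, max_length=5))
```

For a sequence of length 7 and lengths 3 to 5, the windows number 5 + 4 + 3 = 12, which matches the twelve peptides in the example output.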
extract_peptides_from_fasta(fasta_path, min_length=8, max_length=11)
Reads a FASTA file and extracts peptides from each protein sequence.
- Parameters:
fasta_path (str) – The path to the FASTA file.
min_length (int, optional) – The minimum length of peptides to extract (default: 8).
max_length (int, optional) – The maximum length of peptides to extract (default: 11).
- Returns:
list – A list of peptides extracted from the protein sequences in the FASTA file.
- Example:
peptides = extract_peptides_from_fasta("example.fasta", min_length=3, max_length=5)
print(peptides)
# Output: ['ABC', 'BCD', 'CDE', 'DEF', 'EFG', ...]
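To make the FASTA step concrete, here is a small standard-library sketch that joins the sequence lines of each record and then enumerates peptides per protein. It is an illustration under stated assumptions: the helper names `read_fasta_sequences` and `peptides_from_fasta` are hypothetical, and whether the real function removes duplicate peptides or uses a FASTA parsing library is not stated in this documentation.

```python
import os
import tempfile


def read_fasta_sequences(fasta_path):
    """Yield one concatenated sequence string per FASTA record."""
    chunks = []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A header line starts a new record; emit the previous one.
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)


def peptides_from_fasta(fasta_path, min_length=8, max_length=11):
    """Collect all peptides of the allowed lengths from every record (no deduplication)."""
    peptides = []
    for sequence in read_fasta_sequences(fasta_path):
        for length in range(min_length, max_length + 1):
            for start in range(len(sequence) - length + 1):
                peptides.append(sequence[start:start + length])
    return peptides


# Demo on a throwaway two-record file; the first sequence spans two lines.
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "example.fasta")
    with open(demo_path, "w") as handle:
        handle.write(">protein_1\nABCD\nEFG\n>protein_2\nXYZ\n")
    demo_peptides = peptides_from_fasta(demo_path, min_length=3, max_length=5)

print(len(demo_peptides), demo_peptides[:5], demo_peptides[-1])
```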
keep_top_n_peaks(spectrum, n)
Filters a spectrum to retain only the top n most intense peaks.
- Parameters:
spectrum (object) – A spectrum object containing mass-to-charge ratio (m/z) peaks.
n (int) – The number of top peaks to retain.
- Returns:
object – A spectrum object with only the top n peaks.
- Example:
filtered_spectrum = keep_top_n_peaks(spectrum, n=5)
print(filtered_spectrum)
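The spectrum object type is not specified here, so the sketch below stands in a plain list of (m/z, intensity) tuples to show the selection logic only. The name `top_n_peaks` and the choice to return the surviving peaks in m/z order are assumptions made for illustration.

```python
def top_n_peaks(peaks, n):
    """Keep the n most intense (mz, intensity) pairs, returned in m/z order."""
    # Rank by intensity, strongest first, and cut the list at n.
    strongest = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:n]
    # Restore m/z order so the result still reads like a spectrum.
    return sorted(strongest, key=lambda peak: peak[0])


peaks = [(100.1, 5.0), (150.2, 50.0), (200.3, 20.0), (250.4, 1.0), (300.5, 35.0)]
kept = top_n_peaks(peaks, n=3)
print(kept)  # [(150.2, 50.0), (200.3, 20.0), (300.5, 35.0)]
```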
filter_spectra_by_top_peaks(input_file_path, output_file_path, n_peaks)
Reads a pickled list of spectra, processes each spectrum to keep only the top n peaks, and saves the results.
- Parameters:
input_file_path (str) – Path to the input pickle file containing spectra.
output_file_path (str) – Path to save the processed spectra as a pickle file.
n_peaks (int) – The number of top peaks to retain in each spectrum.
- Returns:
list – A list of processed spectra with only the top n peaks.
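The read, filter, and save cycle described above can be sketched with the standard `pickle` module. This is a stand-in under assumptions: each spectrum is modelled as a list of (m/z, intensity) tuples rather than the package's spectrum object, and `filter_pickled_peak_lists` is a hypothetical name.

```python
import os
import pickle
import tempfile


def filter_pickled_peak_lists(input_file_path, output_file_path, n_peaks):
    """Load pickled peak lists, keep the n_peaks most intense peaks in each, save, and return."""
    with open(input_file_path, "rb") as handle:
        spectra = pickle.load(handle)
    filtered = [
        # Inner sort ranks by intensity; outer sort restores m/z order.
        sorted(sorted(peaks, key=lambda peak: peak[1], reverse=True)[:n_peaks])
        for peaks in spectra
    ]
    with open(output_file_path, "wb") as handle:
        pickle.dump(filtered, handle)
    return filtered


# Demo with two small spectra written to a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "spectra.pkl")
    target = os.path.join(workdir, "spectra_top2.pkl")
    with open(source, "wb") as handle:
        pickle.dump([[(100.0, 1.0), (200.0, 9.0), (300.0, 5.0)],
                     [(50.0, 2.0), (60.0, 3.0)]], handle)
    returned = filter_pickled_peak_lists(source, target, n_peaks=2)
    with open(target, "rb") as handle:
        reloaded = pickle.load(handle)

print(returned)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.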
Reading MS Spectra
This module provides functions to read and process mass spectrometry files in the MSP, MGF, and mzML formats.
read_msp_file
Reads an MSP file and returns a DataFrame containing the spectra information.
- Parameters:
filename (str) – The path to the MSP file.
- Returns:
pandas.DataFrame – A DataFrame with spectra information.
The returned DataFrame contains the following columns:
Name – The name of the spectrum
MW – Mass/charge of the spectrum
iRT – Indexed retention time
Download Example Data <https://github.com/proteomicsunitcrg/MSCI/tree/main/Example_data>
—
read_mgf_file
Reads an MGF file and returns a list of spectra data.
- Parameters:
filename (str) – The path to the MGF file.
- Returns:
list[dict] – A list of dictionaries containing spectra data.
Each dictionary contains:
mz_values
intensities
MW
RT
—
read_mzml_file
Reads an mzML file and returns a list of processed spectrum data.
- Parameters:
filename (str) – The path to the mzML file.
- Returns:
list[dict] – A list of processed spectrum data.
—
read_ms_file
Determines the file format and calls the appropriate function to read the mass spectrometry file.
- Parameters:
filename (str) – The path to the mass spectrometry file.
- Returns:
pandas.DataFrame | list – A DataFrame or a list, depending on the file format.
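How the format is determined is not described here. One plausible approach is to dispatch on the file extension, as in the sketch below; the mapping `READER_BY_EXTENSION` and the helper `pick_reader_name` are hypothetical and only return the name of the reader that would be called.

```python
import os

# Assumed mapping from lower-cased extension to the documented reader function name.
READER_BY_EXTENSION = {
    ".msp": "read_msp_file",
    ".mgf": "read_mgf_file",
    ".mzml": "read_mzml_file",
}


def pick_reader_name(filename):
    """Return the name of the reader matching the file extension, ignoring case."""
    extension = os.path.splitext(filename)[1].lower()
    try:
        return READER_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError(f"Unsupported file format: {extension!r}") from None


print(pick_reader_name("library.msp"))  # read_msp_file
print(pick_reader_name("run01.MGF"))    # read_mgf_file
```

Because the MSP reader returns a DataFrame while the MGF and mzML readers return lists, callers of a dispatcher like this need to check the type of the result before using it.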
Grouping MS1 Module
This module provides functions for grouping MS1 peptides based on mass-to-charge ratio (m/z) and indexed retention time (iRT) using k-d tree data structures and tolerance calculations.
Functions
make_data_compatible(index_df)
Converts a DataFrame into a list of tuples compatible with further processing.
- Parameters:
index_df (pandas.DataFrame) – DataFrame containing mass spectrometry data with the columns MW and iRT.
- Returns:
list – A list of tuples in the format (index, MW, iRT).
within_ppm(pair, ppm_tolerance1, ppm_tolerance2)
Checks whether the two peptides in a pair are within the specified tolerances: a ppm tolerance for m/z and an absolute tolerance for iRT.
- Parameters:
pair (tuple) – Two peptide tuples, ((index1, MW1, iRT1), (index2, MW2, iRT2)).
ppm_tolerance1 (float) – PPM tolerance for m/z values.
ppm_tolerance2 (float) – Absolute tolerance for iRT values.
- Returns:
bool – True if within tolerances, False otherwise
within_tolerance(pair, tolerance1, tolerance2)
Checks whether the two peptides in a pair are within absolute tolerances for both m/z and iRT.
- Parameters:
pair (tuple) – Two peptide tuples, ((index1, MW1, iRT1), (index2, MW2, iRT2)).
tolerance1 (float) – Absolute tolerance for m/z values.
tolerance2 (float) – Absolute tolerance for iRT values.
- Returns:
bool – True if within tolerances, False otherwise
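The two checks differ only in how the mass difference is measured. The sketch below illustrates both on (index, MW, iRT) tuples; the function names are hypothetical, and taking the first mass as the reference for the ppm calculation is an assumption, since the documentation does not say which mass is used.

```python
def pair_within_ppm(pair, ppm_tolerance, irt_tolerance):
    """Relative (ppm) check on mass, absolute check on iRT."""
    (_, mw1, irt1), (_, mw2, irt2) = pair
    # Assumption: the ppm difference is expressed relative to the first mass.
    ppm_difference = abs(mw1 - mw2) / mw1 * 1e6
    return ppm_difference <= ppm_tolerance and abs(irt1 - irt2) <= irt_tolerance


def pair_within_absolute(pair, mz_tolerance, irt_tolerance):
    """Absolute checks on both mass and iRT."""
    (_, mw1, irt1), (_, mw2, irt2) = pair
    return abs(mw1 - mw2) <= mz_tolerance and abs(irt1 - irt2) <= irt_tolerance


# The masses differ by 0.004, roughly 8 ppm at 500; the iRT values differ by 1.5.
pair = ((0, 500.000, 30.0), (1, 500.004, 31.5))
print(pair_within_ppm(pair, 10, 2.0))         # True
print(pair_within_ppm(pair, 5, 2.0))          # False: about 8 ppm exceeds 5 ppm
print(pair_within_absolute(pair, 0.01, 1.0))  # False: iRT difference 1.5 exceeds 1.0
```

A ppm tolerance scales with the mass, so the same 10 ppm setting allows 0.005 at mass 500 and 0.02 at mass 2000, whereas an absolute tolerance stays fixed across the mass range.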
find_combinations_kdtree(data, tolerance1, tolerance2, use_ppm=True)
Uses a k-d tree to efficiently find peptide pairs that fall within the given tolerances.
- Parameters:
data (list) – Peptide data tuples in the format (index, MW, iRT).
tolerance1 (float) – Tolerance for m/z values.
tolerance2 (float) – Tolerance for iRT values.
use_ppm (bool) – Use a PPM tolerance for m/z if True, an absolute tolerance if False.
- Returns:
list – A list of valid peptide pairs.
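As an illustration of the k-d tree idea, the sketch below handles the absolute-tolerance case with SciPy's `cKDTree`. It is not the package's implementation: the name `pairs_within_absolute_tolerances` is hypothetical, and rescaling each axis by its tolerance so that a single Chebyshev radius covers both limits is one possible design, assumed here.

```python
import numpy as np
from scipy.spatial import cKDTree


def pairs_within_absolute_tolerances(data, mz_tolerance, irt_tolerance):
    """data: list of (index, MW, iRT). Returns a sorted list of (index_a, index_b) pairs."""
    # Rescale each axis by its tolerance so that both limits become 1.0.
    points = np.array([[mw / mz_tolerance, irt / irt_tolerance] for _, mw, irt in data])
    tree = cKDTree(points)
    # With the Chebyshev metric (p=inf), distance <= 1 means that the mass
    # difference and the iRT difference are each within their own tolerance.
    row_pairs = tree.query_pairs(r=1.0, p=np.inf)
    # query_pairs reports row positions (i < j); map them back to peptide indices.
    return sorted((data[i][0], data[j][0]) for i, j in row_pairs)


data = [(0, 500.000, 30.0), (1, 500.004, 31.0), (2, 500.300, 30.5), (3, 742.100, 30.2)]
print(pairs_within_absolute_tolerances(data, mz_tolerance=0.01, irt_tolerance=2.0))
```

A ppm tolerance cannot be rescaled this way because the allowed mass difference grows with the mass. A common workaround, again an assumption rather than something stated here, is to query the tree with a radius derived from the largest mass and then filter the candidate pairs with an exact ppm check.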
Similarity Module
This module calculates similarity between mass spectrometry spectra using various methods.
Functions and Classes
ndotproduct(x, y, m=0, n=0.5, na_rm=True)
Calculates the normalized dot product between two spectra.
- Parameters:
x (pandas.DataFrame) – First spectrum (m/z and intensities)
y (pandas.DataFrame) – Second spectrum (m/z and intensities)
m (float) – Exponent applied to the m/z values (default: 0)
n (float) – Exponent applied to the intensity values (default: 0.5)
na_rm (bool) – Whether to remove missing values (default: True)
- Returns:
float – Normalized dot product
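The formula is not given in this documentation. The sketch below uses one common definition, in which each peak is weighted as (m/z)^m × intensity^n and the score is the squared dot product divided by the product of the squared norms. Both that definition and the function names are assumptions, and plain lists of already matched peaks stand in for the DataFrame inputs.

```python
def weighted_peaks(mz, intensity, m=0, n=0.5):
    """Weight each peak as (m/z)^m * intensity^n."""
    return [(mass ** m) * (value ** n) for mass, value in zip(mz, intensity)]


def normalized_dot_product(mz, intensity_x, intensity_y, m=0, n=0.5):
    """Squared dot product of the weighted peaks over the product of their squared norms."""
    wx = weighted_peaks(mz, intensity_x, m, n)
    wy = weighted_peaks(mz, intensity_y, m, n)
    numerator = sum(a * b for a, b in zip(wx, wy)) ** 2
    denominator = sum(a * a for a in wx) * sum(b * b for b in wy)
    return numerator / denominator


mz = [100.0, 200.0, 300.0]
same = normalized_dot_product(mz, [100.0, 400.0, 900.0], [100.0, 400.0, 900.0])
disjoint = normalized_dot_product(mz, [100.0, 0.0, 0.0], [0.0, 400.0, 0.0])
print(same, disjoint)  # close to 1 for identical spectra, 0 when no peaks are shared
```

With the default m = 0, the m/z values do not influence the score, and n = 0.5 takes the square root of the intensities, which reduces the dominance of the strongest peaks.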
nspectraangle(x, y, m=0, n=0.5, na_rm=True)
Calculates the normalized spectral angle between two spectra.
- Parameters:
x (pandas.DataFrame) – First spectrum (m/z and intensities)
y (pandas.DataFrame) – Second spectrum (m/z and intensities)
m (float) – Exponent applied to the m/z values (default: 0)
n (float) – Exponent applied to the intensity values (default: 0.5)
na_rm (bool) – Whether to remove missing values (default: True)
- Returns:
float – Normalized spectral angle
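A normalized spectral angle is usually obtained by converting a similarity score into an angle and rescaling it so that 1 means identical and 0 means orthogonal. The sketch below shows that conversion only; the name `normalized_spectral_angle` is hypothetical, and which similarity value the package passes into this step is not stated here.

```python
import math


def normalized_spectral_angle(similarity):
    """Map a similarity in [0, 1] to 1 - 2 * acos(similarity) / pi."""
    # Clip to guard against rounding that pushes the value slightly outside [-1, 1].
    clipped = max(-1.0, min(1.0, similarity))
    return 1 - 2 * math.acos(clipped) / math.pi


print(normalized_spectral_angle(1.0))  # 1.0
print(normalized_spectral_angle(0.0))  # 0.0
print(normalized_spectral_angle(math.sqrt(2) / 2))  # about 0.5, a 45 degree angle
```

Compared with the raw similarity, the angle spreads out scores near 1, which makes differences between highly similar spectra easier to see.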
joinPeaks(tolerance=0, ppm=0)
Class that joins peaks from two spectra based on m/z and intensity values.
- Parameters:
tolerance (float) – Absolute tolerance for m/z matching
ppm (float) – PPM tolerance for m/z matching
- Methods:
- match(x, y) – Matches peaks from two spectra
Parameters: Two DataFrames, each with m/z values and intensities
Returns: A tuple of DataFrames containing the matched peaks
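To show how an absolute tolerance and a ppm tolerance can be combined when matching peaks, the sketch below walks two m/z-sorted peak lists in a single pass. It is an illustration only: `match_peaks` is a hypothetical function rather than the `joinPeaks` class, lists of (m/z, intensity) tuples stand in for DataFrames, and adding the two tolerances into one acceptance window is an assumption.

```python
def match_peaks(x, y, tolerance=0.0, ppm=0.0):
    """x, y: lists of (mz, intensity) sorted by m/z. Returns a list of (peak_x, peak_y) matches."""
    matches = []
    j = 0
    for mz_x, intensity_x in x:
        # Assumed window: the absolute tolerance plus the ppm share of this m/z.
        allowed = tolerance + mz_x * ppm * 1e-6
        # Skip peaks in y that lie below the acceptance window.
        while j < len(y) and y[j][0] < mz_x - allowed:
            j += 1
        # Accept the first y peak inside the window; each y peak is used at most once.
        if j < len(y) and abs(y[j][0] - mz_x) <= allowed:
            matches.append(((mz_x, intensity_x), y[j]))
            j += 1
    return matches


x = [(100.0, 10.0), (200.0, 20.0), (300.0, 30.0)]
y = [(100.004, 11.0), (250.0, 5.0), (300.002, 29.0)]
for peak_x, peak_y in match_peaks(x, y, tolerance=0.01):
    print(peak_x, peak_y)
```

This version keeps only matched peaks and takes the first candidate in the window rather than the closest one; a fuller implementation might also report unmatched peaks from either spectrum.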
Mutation Module
Tools for processing proteins by simulating peptide digestion and introducing mutations.
ProteinMutator
- class ProteinMutator(proteome_file, mutations_file, output_dir, digestion_method)
Handles protein mutations and peptide generation.
- Parameters:
proteome_file (str) – Path to FASTA proteome file
mutations_file (str) – Path to TSV mutations file
output_dir (str) – Output directory path
digestion_method (callable) – Function returning peptide list
- Methods:
load_proteome() – Loads proteome sequences
load_mutations() – Loads mutation data
process_protein(target_protein_accession) – Processes a single protein
process_all_proteins() – Processes all proteins
tryptic_digest(sequence)
Simulates tryptic digestion of a protein sequence.
- Parameters:
sequence (str) – Protein sequence
- Returns:
list – Resulting peptides
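Trypsin is generally modelled as cleaving after lysine (K) or arginine (R), except when the next residue is proline (P). The sketch below applies that textbook rule with a regular expression. Whether `tryptic_digest` uses the proline exception or allows missed cleavages is not stated here, so both choices are assumptions, and `tryptic_peptides` is a hypothetical name.

```python
import re


def tryptic_peptides(sequence):
    """Split after every K or R that is not followed by P (no missed cleavages)."""
    # The pattern matches the empty position after K/R and before a non-P residue.
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    # A sequence ending in K or R yields a trailing empty string; drop it.
    return [piece for piece in pieces if piece]


print(tryptic_peptides("AGKLMRPSTKDE"))  # ['AGK', 'LMRPSTK', 'DE']
```

Splitting on an empty match requires Python 3.7 or later. A function with this shape, taking a sequence and returning a list of peptides, fits the description of the `digestion_method` argument of `ProteinMutator`.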