MSCI Documentation

Functions 

generate_variable_length_peptides(protein_sequence, min_length=8, max_length=11)

Generates all possible peptides of varying lengths from a given protein sequence.

Parameters:

protein_sequence (str) – The protein sequence from which peptides are generated.
min_length (int, optional) – The minimum length of peptides to generate (default: 8).
max_length (int, optional) – The maximum length of peptides to generate (default: 11).

Returns:

list – A list of generated peptides of varying lengths.

Example:

peptides = generate_variable_length_peptides("ABCDEFG", min_length=3, max_length=5)
print(peptides)
# Output: ['ABC', 'BCD', 'CDE', 'DEF', 'EFG', 'ABCD', 'BCDE', 'CDEF', 'DEFG', 'ABCDE', 'BCDEF', 'CDEFG']
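
The enumeration can be pictured as a sliding window applied once per peptide length. The following is a minimal sketch and not the MSCI implementation: the name `sliding_window_peptides` is hypothetical, and the ordering (shortest peptides first) is assumed from the example output above.

```python
def sliding_window_peptides(protein_sequence, min_length=8, max_length=11):
    """Hypothetical helper: enumerate every substring of each allowed length."""
    peptides = []
    for length in range(min_length, max_length + 1):
        # Slide a window of the current length along the sequence.
        for start in range(len(protein_sequence) - length + 1):
            peptides.append(protein_sequence[start:start + length])
    return peptides
```

A sequence of length L yields L - k + 1 peptides for each length k that does not exceed L, which is why the seven-residue example produces 5 + 4 + 3 = 12 peptides. Sequences shorter than min_length yield an empty list in this sketch.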

extract_peptides_from_fasta(fasta_path, min_length=8, max_length=11)

Reads a FASTA file and extracts peptides from each protein sequence.

Parameters:

fasta_path (str) – The path to the FASTA file.
min_length (int, optional) – The minimum length of peptides to extract (default: 8).
max_length (int, optional) – The maximum length of peptides to extract (default: 11).

Returns:

list – A list of peptides extracted from the protein sequences in the FASTA file.

Example:

peptides = extract_peptides_from_fasta("example.fasta", min_length=3, max_length=5)
print(peptides)
# Output: ['ABC', 'BCD', 'CDE', 'DEF', 'EFG', ...]
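
As a rough illustration of the two steps involved (parse the FASTA records, then enumerate peptides per sequence), here is a sketch that uses only the standard library. The helper names `read_fasta_sequences` and `peptides_from_fasta` are hypothetical, and whether MSCI removes duplicate peptides or uses a dedicated FASTA parser is not stated in this documentation, so this sketch keeps duplicates and parses the file by hand.

```python
def read_fasta_sequences(fasta_path):
    """Hypothetical helper: return the sequences in a FASTA file, in file order."""
    sequences, current = [], []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A header line closes the previous record, if there is one.
                if current:
                    sequences.append("".join(current))
                    current = []
            elif line:
                # Sequence lines of one record may be wrapped over several lines.
                current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


def peptides_from_fasta(fasta_path, min_length=8, max_length=11):
    """Hypothetical helper: pool the peptides of every sequence in the file."""
    peptides = []
    for sequence in read_fasta_sequences(fasta_path):
        for length in range(min_length, max_length + 1):
            for start in range(len(sequence) - length + 1):
                peptides.append(sequence[start:start + length])
    return peptides
```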

keep_top_n_peaks(spectrum, n)

Filters a spectrum to retain only the top n most intense peaks.

Parameters:

spectrum (object) – A spectrum object containing mass-to-charge ratio (m/z) peaks.
n (int) – The number of top peaks to retain.

Returns:

object – A spectrum object with only the top n peaks.

Example:

filtered_spectrum = keep_top_n_peaks(spectrum, n=5)
print(filtered_spectrum)
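
The type of the spectrum object is not specified here, so the selection logic is sketched below on plain lists of m/z values and intensities. The name `top_n_peaks` is hypothetical, and returning the surviving peaks in ascending m/z order is an assumption made for this sketch.

```python
def top_n_peaks(mz_values, intensities, n):
    """Hypothetical helper: keep the n most intense peaks, returned in m/z order."""
    # Rank (m/z, intensity) pairs by intensity, highest first, and keep n of them.
    peaks = sorted(zip(mz_values, intensities), key=lambda peak: peak[1], reverse=True)[:n]
    # Restore ascending m/z order for the surviving peaks.
    peaks.sort(key=lambda peak: peak[0])
    return [mz for mz, _ in peaks], [intensity for _, intensity in peaks]
```

If n is at least the number of peaks, every peak is kept.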

filter_spectra_by_top_peaks(input_file_path, output_file_path, n_peaks)

Reads a pickled list of spectra, processes each spectrum to keep only the top n peaks, and saves the results.

Parameters:

input_file_path (str) – Path to the input pickle file containing spectra.
output_file_path (str) – Path to save the processed spectra as a pickle file.
n_peaks (int) – The number of top peaks to retain in each spectrum.

Returns:

list – A list of processed spectra with only the top n peaks.
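
The load, filter, and save cycle described above might look like the sketch below. It is not the MSCI implementation: the name `filter_pickled_peak_lists` is hypothetical, and each spectrum is assumed to be a plain list of (m/z, intensity) pairs rather than the spectrum objects MSCI actually pickles.

```python
import pickle


def filter_pickled_peak_lists(input_file_path, output_file_path, n_peaks):
    """Hypothetical helper: each spectrum is assumed to be a list of (mz, intensity) pairs."""
    with open(input_file_path, "rb") as handle:
        spectra = pickle.load(handle)

    processed = []
    for peaks in spectra:
        # Keep the n_peaks most intense peaks of this spectrum.
        strongest = sorted(peaks, key=lambda peak: peak[1], reverse=True)[:n_peaks]
        # Restore ascending m/z order before storing.
        processed.append(sorted(strongest))

    with open(output_file_path, "wb") as handle:
        pickle.dump(processed, handle)
    return processed
```

Because unpickling can execute arbitrary code, files of this kind should only be loaded from trusted sources.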

Reading MS Spectra

This module provides functions to read and process mass spectrometry files in the MSP, MGF, and mzML formats.

read_msp_file 

Reads an MSP file and returns a DataFrame containing the spectra information.

Parameters:

filename (str) – The path to the MSP file.

Returns:

pandas.DataFrame – A DataFrame with spectra information.

The returned DataFrame contains the following columns:

Name – The name of the spectrum
MW – The mass-to-charge ratio (m/z) of the spectrum
iRT – Indexed retention time

Download Example Data <https://github.com/proteomicsunitcrg/MSCI/tree/main/Example_data>

—

read_mgf_file 

Reads an MGF file and returns a list of spectra data.

Parameters:

filename (str) – The path to the MGF file.

Returns:

list[dict] – A list of dictionaries containing spectra data.

Each dictionary contains:

mz_values
intensities
MW
RT

Download Example Data <https://github.com/proteomicsunitcrg/MSCI/tree/main/Example_data>

—

read_mzml_file 

Reads an mzML file and returns a list of processed spectrum data.

Parameters:

filename (str) – The path to the mzML file.

Returns:

list[dict] – A list of processed spectrum data.

Download Example Data <https://github.com/proteomicsunitcrg/MSCI/tree/main/Example_data>

—

read_ms_file 

Determines the file format and calls the appropriate function to read the mass spectrometry file.

Parameters:

filename (str) – The path to the mass spectrometry file.

Returns:

pandas.DataFrame | list – A DataFrame or a list, depending on the file format.
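
How read_ms_file detects the format is not stated here. One plausible approach, sketched below purely as an assumption, is to dispatch on the file extension; the name `detect_ms_format` and the error message are hypothetical.

```python
from pathlib import Path


def detect_ms_format(filename):
    """Hypothetical helper: map a file extension to a format label."""
    # Compare in lower case so that "sample.mzML" and "sample.MZML" both match.
    suffix = Path(filename).suffix.lower()
    formats = {".msp": "msp", ".mgf": "mgf", ".mzml": "mzml"}
    if suffix not in formats:
        raise ValueError(f"Unsupported file format: {suffix!r}")
    return formats[suffix]
```

A caller could then route "msp" to read_msp_file, "mgf" to read_mgf_file, and "mzml" to read_mzml_file, which would explain why the return type depends on the input format.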

Download Example Data <https://github.com/proteomicsunitcrg/MSCI/tree/main/Example_data>

Grouping MS1 Module 

This module provides functions for grouping MS1 peptides based on mass-to-charge ratio (m/z) and indexed retention time (iRT) using k-d tree data structures and tolerance calculations.

Functions 

make_data_compatible(index_df)

Converts a DataFrame into a list of tuples compatible with further processing.

Parameters:

index_df (pandas.DataFrame) – DataFrame containing mass spectrometry data with columns MW and iRT

Returns:

list – A list of tuples in the format (index, MW, iRT).

within_ppm(pair, ppm_tolerance1, ppm_tolerance2)

Checks whether the two peptides in a pair are within the specified tolerances: a ppm tolerance for m/z and an absolute tolerance for iRT.

Parameters:

pair (tuple) – Two peptide tuples ((index1, MW1, iRT1), (index2, MW2, iRT2))
ppm_tolerance1 (float) – PPM tolerance for m/z values
ppm_tolerance2 (float) – Absolute tolerance for iRT values

Returns:

bool – True if within tolerances, False otherwise
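
A ppm comparison scales the m/z difference by a reference mass, while the iRT comparison is a plain absolute difference. The sketch below illustrates this under stated assumptions: the name `pair_within_ppm` is hypothetical, and the ppm error is computed relative to the first peptide's m/z value, since this documentation does not say which mass serves as the reference.

```python
def pair_within_ppm(pair, ppm_tolerance1, ppm_tolerance2):
    """Hypothetical helper: ppm check on m/z, absolute check on iRT."""
    (_, mw1, irt1), (_, mw2, irt2) = pair
    # Relative m/z error in parts per million (reference mass: first peptide, an assumption).
    mz_error_ppm = abs(mw1 - mw2) / mw1 * 1e6
    # The iRT tolerance is absolute, in the same units as iRT itself.
    irt_error = abs(irt1 - irt2)
    return mz_error_ppm <= ppm_tolerance1 and irt_error <= ppm_tolerance2
```

For example, m/z values of 500.000 and 500.002 differ by about 4 ppm, so the pair passes a 5 ppm tolerance but fails a 3 ppm tolerance.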

within_tolerance(pair, tolerance1, tolerance2)

Checks whether the two peptides in a pair are within absolute tolerances for both m/z and iRT.

Parameters:

pair (tuple) – Two peptide tuples ((index1, MW1, iRT1), (index2, MW2, iRT2))
tolerance1 (float) – Absolute tolerance for m/z values
tolerance2 (float) – Absolute tolerance for iRT values

Returns: