Calibration and optimization of retention time, ion mobility and m/z
During search, AlphaDIA calibrates library properties such as retention time, ion mobility, precursor m/z, fragment m/z and search tolerances. Calibration removes the systematic deviation of observed and library values. Optimization reduces the search space to improve the confidence in identifications and to accelerate the search. Initial parameters are an MS1 tolerance of 30 ppm, MS2 tolerance of 30 ppm, 0.1 1/K0 ion mobility tolerance and 50% retention time tolerance.
AlphaDIA supports search space optimization with fixed target values, such as a mass tolerance of 7 ppm, and automatic optimization to give optimal search results. By default, mass tolerances are optimized with targeted optimization, while retention time and ion mobility tolerances undergo automatic optimization. First, all targeted optimizations are performed at the same time, followed by separate automated optimizations of the remaining properties.
For each optimization, the search is performed batch-wise, starting with the first 8,000 precursors and using an exponential batch plan (16,000, 32,000, 64,000, …) until 200 precursors are identified at 1% FDR. For targeted optimization, the search space of the property of interest is updated to the 95th percentile of the precursors identified at 1% FDR. For automated optimization, the search space of the property of interest is set to the 99th percentile of the precursors identified at 1% FDR and a figure of merit is logged. MS1 error optimization uses the correlation of the observed and predicted isotope intensity profiles as the figure of merit. MS2, retention time and ion mobility optimization use the proportion of library precursors detected at 1% FDR as the figure of merit. Optimization is stopped when the property of interest no longer changes substantially, and the value with the best figure of merit is used.
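The batch-wise schedule and the percentile update can be sketched as follows. This is a minimal illustration under stated assumptions, not AlphaDIA's code: the function names, the `search_batch` callback and the use of absolute deviations as the quantity whose percentile is taken are hypothetical choices made for the example.

```python
import math
from typing import Callable, Iterator, List, Optional, Tuple


def batch_plan(n_precursors: int, first_batch: int = 8000) -> Iterator[int]:
    """Yield batch sizes 8,000, 16,000, 32,000, ... capped at the library size."""
    size = first_batch
    while size < n_precursors:
        yield size
        size *= 2
    yield n_precursors


def percentile(values: List[float], q: float) -> float:
    """Percentile (q in 0..100) with linear interpolation between closest ranks."""
    if not values:
        raise ValueError("no values given")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def optimize_step(
    n_precursors: int,
    search_batch: Callable[[int], List[float]],
    min_identified: int = 200,
    q: float = 95.0,
) -> Optional[Tuple[int, float]]:
    """Grow the batch until enough precursors pass 1% FDR, then propose a tolerance.

    `search_batch(size)` is a hypothetical callback that returns the absolute
    deviations (for example in ppm) of the precursors identified at 1% FDR
    among the first `size` library entries.
    """
    for size in batch_plan(n_precursors):
        deviations = search_batch(size)
        if len(deviations) >= min_identified:
            return size, percentile(deviations, q)
    return None  # library exhausted before enough identifications were made
```

With `q=95.0` the function mirrors the targeted update; passing `q=99.0` corresponds to the automated variant, which would additionally log a figure of merit.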
Calibration of systematic deviations happens in parallel on the basis of the subset of confident precursors identified at 1% FDR. Library-encoded values are calibrated to match the dataset distribution using LOESS regression. For calibration of fragment m/z values, up to 5,000 (but at least 500) of the best fragments according to their extracted ion chromatogram (XIC) correlation are used.
LOESS regression with uniformly distributed kernels is used for each property to be calibrated (Supplementary Fig. 1). Regression is performed on first-degree and second-degree polynomial basis functions of the calibratable property. For m/z and ion mobility, two local estimators with tricubic kernels are used. For retention time prediction, six estimators with tricubic kernels are used. The architecture is built on the scikit-learn package and can be configured to use different hyperparameters and arbitrary predictors for calibration.
Scoring of precursors and decoys using convolution kernels and supervised classification
AlphaDIA uses a two-step machine learning scoring algorithm to identify the best potential peak group for every library entry. The first step builds on a collection of weighted convolution kernels, learned during optimization and calibration of the spectral library. For every precursor of interest, MS1 scans and MS2 scans contributing information toward the identification are identified from the DIA cycle pattern of the acquisition method. On the basis of a certain number of highest-intensity fragments in the library (default: 12), dense representations of the search space in the ion mobility and retention time dimensions are assembled. To identify putative peak groups for each precursor, a set of convolution kernels reflecting the expected distribution in retention time, ion mobility and fragment intensity is learned during calibration and optimization. The convolution of the search space is performed in Fourier space for fast processing and a single score is calculated as a log sum across kernels and fragments. Local maxima are identified using a simple peak-picking algorithm, and the retention time and ion mobility boundaries of the peak group of interest are defined from the joint scoring function. These candidates are subsequently rescored for FDR estimation.
As the second step, AlphaDIA uses target–decoy competition for scoring the quality of precursor spectrum matches. Upon library import, paired known false-positive decoy peptides are created for every target. By default, a mutation pattern GAVLIFMPWSCTYHKRQENDBJOUXZ>LLLVVLLLLTSSSSLLNDQEVVVVVV is used. For every library entry, target and decoy, the best high-scoring matches from the convolution kernel score are used for supervised classification. Up to 47 features are calculated for each peak group match, reflecting the merit of the identification. A multilayer perceptron (MLP) deep NN with layer sizes of 100, 50, 20 and 5 and a total of 47 input dimensions (10,810 parameters) is trained to predict the probability of being a false decoy identification. Training is performed with stochastic gradient descent for ten epochs with a batch size of 5,000 and learning rate of 0.001. While training on an 80% training set, a 20% test set is held out to mitigate overfitting. On the basis of the final score, the best (lowest) decoy probability peak group is retained for every library entry and a count-based FDR is calculated.
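The stated parameter count can be reproduced by simple arithmetic under one reading of the architecture. The sketch below assumes two output units and counts connection weights only (no bias terms); neither assumption is stated in the text, so this is an illustration rather than a description of the actual model.

```python
from typing import Sequence


def count_weights(layer_sizes: Sequence[int], include_bias: bool = False) -> int:
    """Count the parameters of a fully connected network with the given layer widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix between consecutive layers
        if include_bias:
            total += n_out     # one bias per unit of the receiving layer
    return total


# 47 input features, hidden layers as given in the text, 2 output units (assumed)
layers = [47, 100, 50, 20, 5, 2]
print(count_weights(layers))                     # 10810
print(count_weights(layers, include_bias=True))  # 10987
```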
FDR calculation
AlphaDIA uses a count-based FDR to assign confidence at the precursor, peptide, protein and channel levels. Identifications are given as a set of target and decoy identifications \(P=\{{p}_{0},{p}_{1},\ldots, {p}_{i}\}\), all associated with a ground-truth decoy status \({\rm{decoy}}:P\to \{{\rm{true}},{\rm{false}}\}\) and a deep-learning-derived decoy score \(\hat{y}:P\to {\mathbb{R}}\). For every precursor with index i, the number of targets with lower or equal decoy probability,
$${n}_{\rm{target}}=|\{p\,|\;\hat{y}\left(p\right)\le \hat{y}\left({p}_{i}\right),{\rm{decoy}}\left(p\right)={\rm{false}}\}|,$$
and the number of decoys with lower or equal decoy probability,
$${n}_{\rm{decoy}}=|\{p\,|\;\hat{y}\left(p\right)\le \hat{y}\left({p}_{i}\right),{\rm{decoy}}\left(p\right)={\rm{true}}\}|,$$
are calculated. Furthermore, the total numbers of targets and decoys in the set are calculated as follows:
$${N}_{\rm{target}}=|\{p\,|\;{\rm{decoy}}\left(p\right)={\rm{false}}\}|$$
$${N}_{\rm{decoy}}=|\{p\,|\;{\rm{decoy}}\left(p\right)={\rm{true}}\}|$$
The local count-based q value is given as follows:
$${q}_{i}=\frac{{n}_{\rm{decoy}}}{{n}_{\rm{target}}}\times \frac{{N}_{\rm{target}}}{{N}_{\rm{decoy}}}$$
This is converted to the FDR using the minimum q value where a precursor was accepted:
$${\rm{FDR}}_{i}=\min \left({q}_{i},\{{q}_{j}\,|\;\hat{y}\left({p}_{j}\right) > \hat{y}\left({p}_{i}\right)\}\right)$$
By default, all identifications are filtered on a run-level 1% FDR precursor threshold and global 1% protein group-level threshold.
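The count-based q value and its conversion to an FDR can be written out directly from the definitions in this section. The following is a minimal sketch: the function name is hypothetical, the quadratic loops are chosen for readability over speed, and returning infinity when no target has been counted is an assumption made to avoid division by zero.

```python
from typing import List, Sequence


def count_based_fdr(scores: Sequence[float], is_decoy: Sequence[bool]) -> List[float]:
    """Return the FDR of every identification from decoy scores (lower is better)."""
    if len(scores) != len(is_decoy):
        raise ValueError("scores and is_decoy must have the same length")
    n_total_decoy = sum(1 for d in is_decoy if d)
    n_total_target = len(is_decoy) - n_total_decoy
    if n_total_decoy == 0:
        raise ValueError("at least one decoy is required")
    scale = n_total_target / n_total_decoy  # N_target / N_decoy

    # q_i = n_decoy / n_target * N_target / N_decoy, counting scores <= score_i
    q_values = []
    for s_i in scores:
        n_target = sum(1 for s, d in zip(scores, is_decoy) if s <= s_i and not d)
        n_decoy = sum(1 for s, d in zip(scores, is_decoy) if s <= s_i and d)
        q_values.append(n_decoy / n_target * scale if n_target else float("inf"))

    # FDR_i = minimum over q_i and all q_j with a strictly higher decoy score
    fdr = []
    for s_i, q_i in zip(scores, q_values):
        worse = [q for s, q in zip(scores, q_values) if s > s_i]
        fdr.append(min([q_i] + worse))
    return fdr
```

For example, four identifications with scores 0.1, 0.2, 0.3 and 0.4, of which only the third is a decoy, give q values of 0, 0, 1.5 and 1.0; taking the minimum over worse-scoring entries lowers the third value to 1.0, which makes the FDR monotonic in the score.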
Spectrum-centric fragment competition
Competition of precursors for a fragment ion is used as a spectrum-centric element to mitigate double use of fragments for multiple identifications from the same spectra. Following initial FDR calculation, precursor candidates are filtered at 5% FDR and split into groups of precursors that could potentially share fragments, as determined by the quadrupole cycle pattern. Then, precursor candidates and their elution width at half maximum are compared so that precursors with overlapping elution width at half maximum have no more than \({k}_{\max }=1\) shared fragment masses within the chosen MS2 mass accuracy \({\delta }_{\rm{MS2}}\). If two or more precursor candidates share more fragments than permitted, the precursor candidate with the lowest decoy score is retained.
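One way to realize this competition is a greedy pass over the candidates of a group in order of increasing decoy score. The sketch below is an assumption about how the rule could be implemented, not AlphaDIA's code: the `Candidate` fields, the greedy ordering and the default tolerance of 10 ppm are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    decoy_score: float        # lower is better
    rt_start: float           # elution boundaries at half maximum
    rt_stop: float
    fragment_mz: List[float]


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True if the two elution intervals intersect."""
    return a.rt_start <= b.rt_stop and b.rt_start <= a.rt_stop


def shared_fragments(a: Candidate, b: Candidate, tol_ppm: float) -> int:
    """Number of fragments of `a` that match a fragment of `b` within tol_ppm."""
    return sum(
        1
        for mz_a in a.fragment_mz
        if any(abs(mz_a - mz_b) <= mz_a * tol_ppm * 1e-6 for mz_b in b.fragment_mz)
    )


def fragment_competition(group: List[Candidate], tol_ppm: float = 10.0,
                         k_max: int = 1) -> List[Candidate]:
    """Keep the best-scoring candidates so that co-eluting winners share <= k_max fragments."""
    winners: List[Candidate] = []
    for cand in sorted(group, key=lambda c: c.decoy_score):
        conflict = any(
            overlaps(cand, w) and shared_fragments(cand, w, tol_ppm) > k_max
            for w in winners
        )
        if not conflict:
            winners.append(cand)
    return winners
```

In this sketch a candidate is only compared against candidates that have already been accepted, so the best-scoring precursor of a conflicting pair always survives.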
Protein inference
Reporting all proteins whose sequence can be matched to any identified peptide can lead to inflation of false discoveries on the protein level51. Following the approach outlined by Nesvizhskii et al.52, we consider a precursor as a single piece of evidence and the task of protein inference is then to assemble these precursors into proteins while controlling the accumulation of spurious protein identifications. AlphaDIA aims to implement a simple and transparent inference approach, allowing for three inference modes: library, maximum parsimony and heuristic. Apart from the library mode, which uses the inference performed during empirical library creation, protein inference is based on an implementation of the ‘greedy set cover’ algorithm with grouping by default (heuristic) and without grouping for strict inference (maximum parsimony).
In brief, alphaDIA’s protein inference starts with a table of identified precursors. Each precursor is associated with a set of genes and proteins and, based on user choice, the inference is performed on the gene or protein level (default: gene). While a common peptide precursor may match many proteins, a proteotypic peptide will match one single protein. During grouping, the precursor and protein arrays are reshaped into a protein-centric view, where each protein is associated with one set of precursors. Then, proteins are sorted by the length of their precursor set in descending order, and the protein with the largest number of precursors is removed from the list as the first query. The query is compared to all remaining subject proteins. From each subject precursor set, all precursors matching the query set are removed. If a protein’s precursor set becomes empty, it is considered redundant and dropped. After all precursor sets have been compared, the process repeats by reordering the list and extracting the next query. After completion, retained queries are denoted master proteins, necessary to explain all discovered precursors. In strict maximum parsimony mode, all master proteins are simply reshaped to precursor-centric format, linking each precursor to one single protein ID. In the heuristic mode, the list of master proteins is used to remove all non-master proteins from the initial precursor table, effectively leaving each precursor with a set of associated proteins consisting solely of master proteins. Thereby, the same precursor can be claimed by different proteins, creating protein groups (see also the tutorial notebook in the GitHub repository).
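The procedure described above can be condensed into a short sketch of greedy set cover with and without grouping. The function names and the alphabetical tie-breaking between equally large precursor sets are assumptions made for the example and are not taken from the AlphaDIA code base.

```python
from typing import Dict, List, Set, Tuple


def greedy_set_cover(protein_to_precursors: Dict[str, Set[str]]) -> List[Tuple[str, Set[str]]]:
    """Return master proteins together with the precursors each one claimed."""
    remaining = {p: set(s) for p, s in protein_to_precursors.items() if s}
    masters: List[Tuple[str, Set[str]]] = []
    while remaining:
        # Query: protein with the largest remaining precursor set (ties by name)
        query = min(remaining, key=lambda p: (-len(remaining[p]), p))
        claimed = remaining.pop(query)
        masters.append((query, claimed))
        # Remove the claimed precursors from all subjects; drop emptied proteins
        for subject in list(remaining):
            remaining[subject] -= claimed
            if not remaining[subject]:
                del remaining[subject]
    return masters


def assign_proteins(protein_to_precursors: Dict[str, Set[str]],
                    grouping: bool = True) -> Dict[str, List[str]]:
    """Map each precursor to master proteins (heuristic) or to one master (parsimony)."""
    masters = greedy_set_cover(protein_to_precursors)
    assignment: Dict[str, List[str]] = {}
    if grouping:
        # Heuristic: keep every master protein originally linked to the precursor
        for protein, _ in masters:
            for precursor in protein_to_precursors[protein]:
                assignment.setdefault(precursor, []).append(protein)
    else:
        # Maximum parsimony: each precursor belongs to the master that claimed it
        for protein, claimed in masters:
            for precursor in claimed:
                assignment[precursor] = [protein]
    return assignment
```

For three proteins with precursor sets {a, b, c}, {b, c} and {c, d}, the first and third become master proteins; with grouping, precursor c is assigned to both of them, which is how protein groups arise in the heuristic mode.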
Protein FDR
Protein FDR is performed on the protein groups calculated during protein inference. For all target and decoy protein groups, seven features are calculated: the total number of precursors across runs for the protein group, the mean decoy score for precursors across runs for the protein group, the number of unique peptides for the protein group, the number of unique precursors for the protein group, the number of runs the protein group was found in, the lowest decoy score across precursors for the protein group and the highest decoy score across precursors for the protein group. We use an MLP to classify decoy protein groups from target protein groups. Correct training is ensured by a 20% held-out test set. Protein group FDRs are calculated on a global level using the FDR mechanism described above.
Library refinement for fully predicted libraries
AlphaDIA uses an established two-step search strategy for library refinement15. Following an initial search of all or a subset of raw files, protein inference and FDR are determined as configured by the user. All precursors are automatically filtered at 1% local precursor FDR and global 1% protein group FDR, accumulated into a spectral library and finally saved to the project folder. For each precursor, the identification with the best (lowest) decoy probability is used. By default, MS2 quantities are used as annotated in the original library. If transfer learning accumulation is used, custom user specified fragment types can be selected and observed MS2 intensities are extracted. This spectral library is then used for the second search with full MS2-based target–decoy scoring without any relaxed FDR parameters. For protein inference and FDR, library-annotated protein groups are used.
Transfer learning
To create transfer learning libraries, precursors identified at 1% precursor and protein FDR are selected for requantification. Precursors are requantified for user-defined fragment ion types (a, b, c, x, y, z, modification loss, etc.) and a user-defined maximum charge (default: 2). Extracted fragment quantities are accumulated across samples and ordered by their decoy probability. For each unique modified precursor, the observations with the three lowest decoy scores are selected. AlphaDIA also creates a high-quality subset where only precursors with a median fragment correlation greater than 0.5 are included. For these precursors, we only retain fragments whose correlation values exceed 75% of the median fragment correlation of the respective precursor. The transfer learning library is built sequentially across runs, with parallelization restricted to a limited number of processes at any given time. This approach allows the process to scale without storing all runs in memory.
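The two thresholds of the high-quality subset translate into a small filter. The following sketch assumes that each precursor comes with one correlation value per fragment; the function name and the return convention (`None` for excluded precursors) are hypothetical.

```python
from statistics import median
from typing import List, Optional


def high_quality_fragments(correlations: List[float],
                           min_median: float = 0.5,
                           relative_cutoff: float = 0.75) -> Optional[List[int]]:
    """Return the indices of retained fragments, or None if the precursor is excluded."""
    if not correlations:
        return None
    med = median(correlations)
    if med <= min_median:
        return None  # precursor does not enter the high-quality subset
    # Keep fragments whose correlation exceeds 75% of the precursor's median
    return [i for i, c in enumerate(correlations) if c > relative_cutoff * med]
```

For fragment correlations of 0.9, 0.8, 0.5 and 0.95, the median is 0.85, so the cutoff is 0.6375 and the third fragment is dropped.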
For transfer learning, we prioritized robustness to ensure performance instead of requiring users to define hyperparameters. The transfer learning dataset is split into training (70%), validation (20%) and test (10%) sets and trained for a maximum of 50 epochs. After each training epoch, we run a test epoch for assessing the test loss and data-specific test metrics. AlphaDIA uses a custom learning rate scheduler with two phases. The first phase is a warm-up period (default: five epochs) during which the learning rate gradually increases to a maximum value (default: 0.005). After this warm-up phase, the learning rate scheduler halves the learning rate if the training loss does not notably improve (default: >5% test loss) within a patience period (default: three epochs). Additionally, we use a simple early stopping mechanism that interrupts training if the validation loss starts to diverge or does not notably improve (default: 12 epochs).
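A scheduler with these two phases can be sketched in a few lines. This is an illustration under assumptions, not the AlphaDIA implementation: the class name is hypothetical, the warm-up is taken to be linear, improvement tracking is assumed to start after the warm-up, and the caller decides which loss is monitored.

```python
class WarmupPlateauScheduler:
    """Linear warm-up followed by halving the learning rate on a loss plateau."""

    def __init__(self, max_lr: float = 0.005, warmup_epochs: int = 5,
                 patience: int = 3, min_rel_improvement: float = 0.05):
        self.max_lr = max_lr
        self.warmup_epochs = warmup_epochs
        self.patience = patience
        self.min_rel_improvement = min_rel_improvement
        self.lr = 0.0
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, epoch: int, loss: float) -> float:
        """Call once per epoch (0-based) with the monitored loss; returns the updated rate."""
        if epoch < self.warmup_epochs:
            # Phase 1: ramp up linearly to max_lr over the warm-up epochs
            self.lr = self.max_lr * (epoch + 1) / self.warmup_epochs
            return self.lr
        # Phase 2: require a relative improvement of more than min_rel_improvement
        if loss < self.best_loss * (1.0 - self.min_rel_improvement):
            self.best_loss = loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
            if self.epochs_without_improvement >= self.patience:
                self.lr *= 0.5
                self.epochs_without_improvement = 0
        return self.lr
```

With a constant loss, the rate reaches 0.005 at the end of the warm-up and is halved after every three further epochs without improvement.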
After training, the deep learning model is stored on disk and can be loaded as necessary. Retention time and ion mobility fine-tuning are supervised by calculating the L1 loss, R2 and 95th percentile of the absolute error on the training data. MS2 fine-tuning is supervised by calculating the L1 loss, Pearson correlation coefficient, spectral angle and Spearman correlation on the test data. Charge fine-tuning is supervised by calculating the cross-entropy loss, accuracy, precision and recall on the test data. All training and test metrics are reported to the user. The specific implementation and details of the test metrics can be found in the open-source code on GitHub (www.github.com/MannLabs/alphadia).
Sample preparation of HeLa bulk digests
HeLa S3 cells (American Type Culture Collection) were cultured in DMEM (Life Technologies) supplemented with 20 mM glutamine, 10% FBS and 1% penicillin–streptomycin. After washing the cells in PBS and cell lysis, the proteins were reduced, alkylated and digested by trypsin (Sigma-Aldrich) and LysC (WAKO) (1:100 enzyme to protein, w/w) in one step. The peptides were dried and resuspended in 0.1% trifluoroacetic acid and 2% acetonitrile; then, 200 ng of digest was loaded onto Evotips (Evosep). The Evotips were prepared by activation with 1-propanol, washed with 0.1% formic acid (FA) and 99.9% acetonitrile and equilibrated with 0.1% FA. After loading the samples, tips were washed once with 0.1% FA.
Sample preparation of dimethylated peptides for transfer learning
HeLa cells were cultured as described above. A HeLa cell pellet was lysed by boiling for 10 min in 1% SDC in 60 mM TEAB pH 8.5, followed by sonication in a Branson type instrument, Heinemann Sonifier 250 (Schwäbisch Gmünd), operating at 20% duty cycle and 3–4 outputs for 1 min and boiling for 5 min again. After cooling to room temperature, the protein concentration was determined using the tryptophan fluorescence-based WF assay in the microtiter plate format using white Nunc 96-well plates with a flat bottom (Thermo Fisher Scientific, 136101). After diluting the lysate to 1 μg μl−1 in lysis buffer, disulfide bonds were reduced by adding TCEP to a final concentration of 10 mM and incubating for 10 min. Denatured protein lysate was digested by ArgC Ultra (Promega) and LysC (WAKO) at enzyme-to-protein ratios of 1:250 and 1:100, respectively, at 37 °C for 3 h. The peptides were labeled with a dimethyl group using 100 μl of 1 μg μl−1 digested peptides and adding 4 μl of 4% formaldehyde and 4 μl of 0.6 M NaBH3CN solution. The mixture was incubated at room temperature and, every 10 min, 2.8 μl (2 μg of peptides) was sampled until 60 min and added to 17.2 μl of a 1% solution of trifluoroacetic acid to quench the reaction.
Sample preparation for the mixed-species experiments
For the mixed-species experiment, three different mixtures with varying mixing ratios of HeLa tryptic digest (Pierce, 1862824), Saccharomyces cerevisiae tryptic digest (Promega, V746A) and Escherichia coli tryptic digest (Waters, 186003196) were prepared: sample A, 10:1:10 human, yeast and E. coli; sample B, 10:10:1 human, yeast and E. coli; sample C, 10:4:7 human, yeast and E. coli. Five replicates containing 210 ng were loaded per condition.
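The amounts per species and the fold changes expected between two mixtures follow from the mixing ratios by simple arithmetic. The sketch below assumes that the ratios are mass ratios and that the 210 ng refers to the total peptide amount loaded; the names are hypothetical.

```python
from math import log2
from typing import Dict

SPECIES = ("human", "yeast", "E. coli")
MIXTURES = {"A": (10, 1, 10), "B": (10, 10, 1), "C": (10, 4, 7)}
TOTAL_NG = 210.0  # total load per replicate, assumed to be split by mass


def species_amounts(mixture: str) -> Dict[str, float]:
    """Amount (ng) of each species in one replicate of the given mixture."""
    parts = MIXTURES[mixture]
    return {sp: TOTAL_NG * p / sum(parts) for sp, p in zip(SPECIES, parts)}


def expected_log2_ratio(numerator: str, denominator: str) -> Dict[str, float]:
    """Expected log2 fold change per species between two mixtures."""
    num, den = species_amounts(numerator), species_amounts(denominator)
    return {sp: log2(num[sp] / den[sp]) for sp in SPECIES}
```

Under these assumptions, sample A contains 100 ng human, 10 ng yeast and 100 ng E. coli peptides, and the expected A to B ratios are 1:1 for human, 1:10 for yeast and 10:1 for E. coli.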
Peptide loading onto C-18 tips
C-18 tips (Evotip Pure, Evosep) were loaded with the Bravo robot (Agilent), followed by activation with 1‐propanol, washing two times with 50 μl buffer B (99.9% acetonitrile and 0.1% FA), activation with 1‐propanol and two wash steps with 50 μl of buffer A (99.9% H2O and 0.1% FA). In between, Evotips were spun at 700g for 1 min. For sample loading, Evotips were prepared with 70 μl of buffer A and a short spin at 700g. Samples were loaded in 20 μl with the indicated concentration into the remaining buffer A and spun at 700g for 1 min, unless described otherwise. After sample loading, Evotips were washed with 50 μl of buffer A and stored with 150 μl of buffer A after a short spin at 700g at 4 °C until MS acquisition.
MS data acquisition of dia-PASEF and synchro-PASEF data
We used the Evosep One LC system to separate peptide mixtures at varying throughputs using standardized gradients. These gradients consisted of 0.1% FA and 99.9% water (v/v) and 0.1% FA with 99.9% acetonitrile (v/v) as the mobile phases. For the 60-SPD runs, peptides were separated on a Pepsep column (8 cm × 150 μm inner diameter, 1.5 μm C18; Bruker Daltonics) connected to a 10-μm (inner diameter) fused silica emitter (Bruker Daltonics). For the Whisper 40-SPD runs, we used an Aurora Elite nanoflow column (15 cm × 75 μm inner diameter, 1.7 μm C18; IonOpticks).
The system was coupled with a timsTOF MS instrument (Bruker Daltonics) to acquire data in dia-PASEF and synchro-PASEF modes. Sample loads above 25 ng were analyzed using a timsTOF Pro2 and those below 25 ng were analyzed using a timsTOF Ultra. The dia-PASEF and synchro-PASEF methods were optimized using our Python tool, py_diAID39. This tool maximizes precursor coverage by optimally positioning the acquisition scheme over the precursor cloud and enhances sampling efficiency by adjusting the isolation window widths according to precursor density.
The dia-PASEF method covers an m/z range from 300 to 1,200 with eight dia-PASEF scans and two isolation window positions per scan (cycle time: 0.98 s). The synchro-PASEF method covers an m/z range from 140 to 1,350 with four diagonal synchro scans (cycle time: 0.53 s). The method files are deposited in the data repository. In both modes, the fragment scans are acquired with an m/z range from 100 to 1,700. Furthermore, ions are accumulated and ejected at 100-ms intervals from the TIMS tunnel. The methods cover an ion mobility range from 1.3 to 0.7 V s cm−2, calibrated with Agilent ESI tuning mix ions (m/z, 1/K0: 622.02, 0.98 V s cm−2; 922.01, 1.19 V s cm−2; 1221.99, 1.38 V s cm−2). The collision energy was linearly decreased in relation to the ion mobility elution, from 59 eV at an ion mobility of 1.6 V s cm−2 to 20 eV at 0.6 V s cm−2.
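The linear collision energy ramp is fully defined by its two end points, so the energy at any ion mobility follows by interpolation. A minimal sketch (the function name is hypothetical):

```python
def collision_energy(inv_k0: float,
                     im_low: float = 0.6, ce_low: float = 20.0,
                     im_high: float = 1.6, ce_high: float = 59.0) -> float:
    """Collision energy (eV) at ion mobility 1/K0, interpolated between the end points."""
    slope = (ce_high - ce_low) / (im_high - im_low)  # 39 eV per unit of 1/K0
    return ce_low + slope * (inv_k0 - im_low)
```

At the upper and lower ends of the acquired ion mobility range (1.3 and 0.7), this gives 47.3 eV and 23.9 eV, respectively.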
MS data acquisition of SWATH data on the Sciex 7600
Triplicates of 200-ng HeLa bulk digest were loaded onto C-18 tips as described above and analyzed using an Evosep One system (Evosep) coupled to a 7600 ZenoTOF MS instrument (Sciex) using Sciex OS (version 3.3 or higher). Peptides were separated by the 60-SPD method gradient (Evosep) on a PepSep reverse-phase column (8 cm × 150 μm) packed with 1.5 μm of C18 beads (Bruker Daltonics) at 50 °C connected to the low micro electrode for 1–10 μl min−1. The mobile phases were 0.1% FA in LC–MS-grade water (buffer A) and 99.9% acetonitrile and 0.1% FA (buffer B). The ZenoTOF MS instrument was equipped with the Optiflow ion source using a spray voltage of 4.5 kV, ion source gas 1 of 15 psi, ion source gas 2 of 60 psi, curtain gas of 35 psi, collision-activated dissociation gas of 7 and a temperature of 200 °C. SWATH data were acquired using the following parameters: TOF MS start mass of 400 Da, stop mass of 1,500 Da, TOF MS accumulation time of 50 ms, TOF MSMS start mass of 140 Da, stop mass of 1,750 Da, accumulation time of 13 ms with dynamic collision energy turned on, a charge state of 2, Zeno pulsing enabled and 60 variable SWATH windows covering the mass range of 400–900 m/z.
MS data acquisition of mixed-species samples on the Orbitrap Astral
For mixed-species experiments, five replicates of samples A, B and C were loaded onto C-18 tips as described above. Samples were analyzed using an Evosep One system (Evosep) coupled to an Orbitrap Astral MS instrument (Thermo Scientific) using Thermo Tune software (version 1.0 or higher). Peptides were separated by the 60-SPD method gradient (Evosep) on a PepSep reverse-phase column (8 cm × 150 μm) packed with 1.5 μm of C18 beads (Bruker Daltonics) at 50 °C. The analytical column was connected to a stainless-steel emitter with inner diameter of 30 µm (EV1086). The mobile phases were 0.1% FA in LC–MS-grade water (buffer A) and 99.9% acetonitrile and 0.1% FA (buffer B). The Orbitrap Astral MS instrument was equipped with a FAIMS Pro interface and an EASY-Spray source (both Thermo Scientific). A compensation voltage of −40 V and a total carrier gas flow of 3.5 L min−1 were used and an electrospray voltage of 2.0 kV was applied for ionization. The MS1 spectra were recorded using the Orbitrap analyzer at 120,000 resolution from m/z 380 to 980 using an automatic gain control (AGC) target of 500% and a maximum injection time of 3 ms. The Astral analyzer was used for MS/MS scans in data-independent mode with 3-Th nonoverlapping isolation windows with a scan range of 150–2,000 m/z. The precursor accumulation time was 3 ms with an AGC target of 500%. The isolated ions were fragmented using higher-energy collision dissociation (HCD) with 25% normalized collision energy (NCE).
MS data acquisition of HeLa bulk data on the Orbitrap Astral
For analysis of HeLa bulk digest, 200 ng of lysate was loaded onto C-18 tips in six replicates as described above. Samples were analyzed using an Evosep One system (Evosep) coupled to an Orbitrap Astral MS instrument (Thermo Scientific) using Thermo Tune software (version 1.0 or higher). Peptides were separated by the 60-SPD method gradient (Evosep) on an Aurora Rapid reverse-phase column (80 mm × 0.15 mm) packed with 1.7 μm of C18 beads (IonOpticks) at 50 °C. The mobile phases were 0.1% FA in LC–MS-grade water (buffer A) and 99.9% acetonitrile and 0.1% FA (buffer B). The Orbitrap Astral MS instrument was equipped with a FAIMS Pro interface and an EASY-Spray source (both Thermo Scientific). A compensation voltage of −40 V and a total carrier gas flow of 3.5 L min−1 were used and an electrospray voltage of 1.9 kV was applied for ionization. The MS1 spectra were recorded using the Orbitrap analyzer at 120,000 resolution from m/z 380 to 980 using an AGC target of 500% and a maximum injection time of 3 ms. The Astral analyzer was used for MS/MS scans in data-independent mode with 2-Th nonoverlapping isolation windows with a scan range of 150–2,000 m/z. The precursor accumulation time was 3 ms with an AGC target of 500%. The isolated ions were fragmented using HCD with 25% NCE.
MS data acquisition of dimethylated peptides on the Orbitrap Astral
MS data acquisition was performed as described for mixed-species samples on the Orbitrap Astral, unless described otherwise. For each of the six timepoints, triplicates of 50 ng of labeled peptide were injected. Samples were separated by the Whisper 40-SPD method gradient (Evosep) on an Aurora Elite TS column (15 cm, 75 µm inner diameter; AUR3-15075C18-TS, IonOpticks) at 50 °C. An electrospray voltage of 1.9 kV was applied. The MS1 resolution was 240,000, with a maximum injection time of 100 ms for MS1 and 6 ms for MS/MS.
Search and analysis of dia-PASEF and synchro-PASEF data with alphaDIA
Data were searched with version 1.5.5 of alphaDIA using a previously published39 empirical HeLa library. A default single-step search was used with the following parameters: target MS1 tolerance, 15 ppm; target MS2 tolerance, 15 ppm; number of target candidates, 5. For synchro-PASEF, quant_all = true was set and a quant_window of six scans was used. All precursors with run-level FDR of 1% and protein groups with a global FDR of 1% were accepted. CVs were calculated on non-log-transformed directLFQ-normalized quantities.
Search and analysis of ZenoTOF data with alphaDIA
Data were searched with version 1.5.5 of alphaDIA using the HeLa library mentioned above. A default single-step search was used with the following parameters: target MS1 tolerance, 15 ppm; target MS2 tolerance, 15 ppm; number of target candidates, 3; target retention time tolerance, 300 s. All precursors with run-level FDR of 1% and protein groups with global FDR of 1% were accepted. CVs were calculated on non-log-transformed directLFQ-normalized quantities.
Search and analysis of empirical library data from Lou et al.
Raw files, libraries and FASTA files were used as provided in the original publication41. All data were searched with alphaDIA 1.5.5 using default parameters. For timsTOF data, the following parameters were changed: target MS1 tolerance, 15 ppm; target MS2 tolerance, 15 ppm; number of target candidates; quant_window, 6 scans; group level, genes; target retention time tolerance, 500 s. For QE-HF, the data search was performed with a target MS1 tolerance of 5 ppm, target MS2 tolerance of 10 ppm, five target candidates, a quant_window of six scans, a group level of genes and a target retention time tolerance of 600 s. Data for benchmarked tools were used as provided in the original publication. Analysis was performed as described in the original publication except for the reassignment of proteins; instead, search-engine-specific protein grouping was used. For alphaDIA, precursors passing a local 1% FDR and protein groups passing a global 1% FDR were accepted.
Search and analysis of HeLa bulk data with fully predicted spectral libraries
For fully predicted library benchmarking, Spectronaut version 18.6.231227.55695, DIA-NN version 2.1.0, CHIMERYS53 version 4.2.1 and alphaDIA version 1.10.2 were used. All analysis was performed using the same FASTA file of reviewed human proteins without isoforms (December 1, 2023). On all platforms, the search was performed for tryptic precursors with carbamidomethyl modification at cysteine as a fixed modification and variable methionine oxidation and protein N-terminal acetylation with a maximum of two occurrences. Charge states of 2–4 were included with sequence lengths between 7 and 35 aa with a single missed cleavage. For CHIMERYS, only peptides with up to 30 aa were used as the tool does not support 35 aa. For alphaDIA, automatic library prediction by alphaPeptDeep was used with the Lumos model for an NCE of 25. AlphaDIA used default parameters for a two-step search with the following changes: target MS1 tolerance, 4 ppm; target MS2 tolerance, 7 ppm. All data were analyzed at a 1% FDR threshold as enforced by the search engine. CVs were calculated on non-log-transformed intensities as provided by the search engine for all proteins.
For entrapment analysis, an Arabidopsis FASTA with reviewed sequences and no isoforms was downloaded from UniProt (February 2, 2024). The search was performed as described above with heuristic inference. After the search, all shared precursors including isoleucine–leucine pairs were identified. Protein groups with shared precursors were discarded.
Search and analysis of mixed-species data with fully predicted spectral libraries
For all three species, reviewed nonisoform proteomes were downloaded from UniProt (February 21, 2024). Proteins were in silico digested using tryptic cleavage with carbamidomethyl modification at cysteine as a fixed modification and variable methionine oxidation and protein N-terminal acetylation with a maximum of two occurrences. Charge states of 2–4 were included with sequence lengths between 7 and 35 aa with a single missed cleavage. The library was predicted using the alphaPeptDeep Lumos model at 25 NCE. AlphaDIA 1.5.4 was used with default parameters for a two-step search with the following changes: number of target candidates, 5; target MS1 tolerance, 5 ppm; target MS2 tolerance, 10 ppm; target retention time tolerance, 200 s for the first pass and 100 s for the second pass. Heuristic protein inference was used on the gene level. Proteins with shared sequences were removed as described above. For benchmarking accuracy, the median LFQ ratio was calculated for protein groups identified in at least three replicates.
Search and analysis of SILAC data with fully predicted spectral libraries
Data were searched with version 1.5.5 of alphaDIA. A fully predicted human library was generated with alphaPeptDeep as described above but for an NCE of 27. The library was multiplexed across the light channel without additional modifications and a heavy channel with isotopic labeling of arginine (+10.008269) and lysine (+8.014199). A single-step search was performed using alphaDIA with default parameters other than the following changes: target MS1 tolerance, 5 ppm; target MS2 tolerance, 20 ppm; target retention time tolerance, 600 s; channel_wise_fdr = true.
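The m/z offset between the light and heavy channel of a precursor follows from the number of labeled residues and the charge. A minimal sketch, assuming a plain one-letter amino acid sequence without modification annotations (the names are hypothetical):

```python
HEAVY_SHIFT = {"R": 10.008269, "K": 8.014199}  # mass shifts (Da) given in the text


def heavy_mz_shift(sequence: str, charge: int) -> float:
    """m/z difference between the heavy and the light channel of a precursor."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    mass_shift = sum(HEAVY_SHIFT.get(aa, 0.0) for aa in sequence)
    return mass_shift / charge
```

A doubly charged tryptic peptide ending in a single lysine is therefore shifted by about 4.007 m/z in the heavy channel.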
Search and analysis of dimethylated samples using transfer learning
A fully predicted human library was generated on the basis of a reviewed human UniProt library (December 1, 2023) with the general pretrained alphaPeptDeep model not trained on dimethylated peptides. The peptides were modified with methionine oxidation and protein N-terminal acetylation as variable modifications with a maximum of two. N-terminal and lysine dimethylation were set as fixed modifications. Transfer search was performed using alphaDIA 1.5.5 with default parameters other than the following changes: number of target candidates, 1; target MS1 tolerance, 4 ppm; target MS2 tolerance, 7 ppm; target retention time tolerance, 1,200 s. Transfer learning quantification was enabled and set to b and y ions with a maximum charge of 2 and the top three occurrences for every modified sequence. The generated transfer learning library was used for training with the default training scheme described above. For evaluation, the original pretrained model, the transfer learned retention time model, the transfer learned MS2 model and the fully transfer learned model were evaluated for search. All searches were performed with the same parameters as the transfer search apart from a target retention time tolerance of 100 s for searches with the updated model.
Search and analysis of transfer learning entrapments
For evaluation of transfer learning on FDRs, entrapment experiments with known false-positive Arabidopsis peptides were performed on the unmodified HeLa bulk samples acquired on the Orbitrap Astral. The entrapment library was generated as described above for the two-step search with N-terminal glutamate and glutamine to pyroglutamate conversion added as variable modifications. Raw files were searched with alphaDIA 1.5.5 using default parameters other than the following changes: number of target candidates, 1; target MS1 tolerance, 4 ppm; target MS2 tolerance, 7 ppm; target retention time tolerance, 1,200 s. Transfer learning quantification was enabled and set to b and y ions with a maximum charge of 2 and the top three occurrences for every modified sequence. Transfer learning was performed using all human and Arabidopsis precursors identified at the 1% FDR cutoff. The transfer learning model was then reused for a second search with an updated target retention time tolerance of 150 s. The process was repeated twice and the identifications after every search were analyzed for the number of false-positive Arabidopsis identifications as described above.
Data analysis and plotting
All analyses were performed using Python 3.11.11 on macOS 14.3.0. Data manipulation and analysis were conducted using pandas 2.2.3, NumPy 1.26.4 and SciPy 1.15.2. Statistical analysis and machine learning were performed using scikit-learn 1.6.1. Data visualizations were created using matplotlib 3.9.0 and seaborn 0.13.2. Unless specified otherwise, box plots extend from the first quartile (Q1) to the third quartile (Q3) with the median shown as a line. Whiskers extend from 1.5 times the interquartile range below Q1 to 1.5 times the interquartile range above Q3.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.