Retrieval of PiggyBac transposons
Complete PiggyBac transposon sequences were gathered from all available eukaryotic genomes in the NCBI database35 (31,565 genomes) and all PiggyBac elements in the Dfam database (20,638). Dfam sequences were directly downloaded by selecting entries labeled as PiggyBac. NCBI eukaryotic genome-derived transposase sequences were identified using Bath36,37, with a custom hidden Markov model constructed from all active PiggyBac sequences reported in the literature. For NCBI PiggyBac retrieval, flanking regions 4 kbp upstream and downstream were included to capture the complete transposon sequence, including the DNA TIRs. A filter was applied to retain PiggyBac transposases longer than 250 aa. After this filtering, a total of 273,643 PiggyBac elements were recovered, with a mean transposase length of 500 residues and a mean DNA transposon length of 3,298 bp.
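As a rough illustration of the retrieval step, the sketch below pads a transposase hit by 4 kbp on each side and applies the length filter. The function names, the 0-based half-open coordinate convention and the clamping of the window to the contig ends are assumptions made for this example and are not taken from the original pipeline.

```python
FLANK_BP = 4_000    # flank added on each side of the hit, as stated in the text
MIN_LEN_AA = 250    # transposases must be longer than this to be retained


def flanked_window(hit_start, hit_end, contig_len, flank=FLANK_BP):
    """Pad a hit by `flank` bp on both sides, clamped to the contig.

    Coordinates are assumed to be 0-based and half-open (an assumption).
    """
    return max(0, hit_start - flank), min(contig_len, hit_end + flank)


def passes_length_filter(protein_seq, min_len=MIN_LEN_AA):
    """Keep only transposases strictly longer than `min_len` residues."""
    return len(protein_seq) > min_len
```

For a hit close to a contig end, the window is simply truncated, so the recovered flank may be shorter than 4 kbp on that side.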
To refine the boundaries of each transposon in the NCBI dataset, clustering by RNase H-like domains of the PiggyBac hits at a 0.9 similarity threshold was performed with MMseqs2 (ref. 38), followed by multiple-sequence alignment (MSA) of the complete DNA sequences (including flanking regions) within clusters using MAFFT39. Transposon boundaries were then delimited on the basis of the MSA results.
Filtering for active PiggyBac elements
To identify active PiggyBac transposons from all the transposons identified in the previous step, we applied the following sequential filters:
1. RNase H-like domain identification: The presence of an RNase H-like domain was confirmed using RPS-BLAST40, with the Conserved Domain Database41 as the reference database and selecting only sequences with an RNase H-like domain longer than 250 aa.
2. CRD identification: A total of 50 representative CRDs were manually curated and structurally modeled using AlphaFold3 (ref. 24) to identify residues directly involved in zinc ion coordination. On the basis of this curated set, we derived a set of sequence motifs (Supplementary Table 2), revealing major CRD groups and their variants. CRDs were then identified using regular expressions matching these curated motifs.
3. TIR identification: TIRs were identified in the flanking DNA regions using the EMBOSS tool Palindrome42, focusing on pairs of palindromic sequences located on opposite flanks of the transposon in the first and last 200 bp. We retained only TIRs with at least two palindromic sequences of 10 bp or longer, allowing up to two mismatches. As an additional quality control step, only palindromes in which the two most common nucleotides account for less than 80% of the palindrome were selected.
4. TSD identification: TSDs were searched for using a regular expression within the first and last 50 bp of each transposon, using the motif TTAACC, with up to two allowed mismatches.
A total of 116,216 putatively active PiggyBac elements were recovered after applying the filtering process.
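Two of the filters above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original code: the TSD search is written as a Hamming-distance scan rather than a regular expression, a TSD is assumed to be required at both ends, and all function names are hypothetical.

```python
from collections import Counter


def is_low_complexity(palindrome, max_fraction=0.80):
    """True if the two most common nucleotides reach `max_fraction` of the palindrome."""
    counts = Counter(palindrome.upper())
    top_two = sum(n for _, n in counts.most_common(2))
    return top_two / len(palindrome) >= max_fraction


def find_fuzzy(window, motif="TTAACC", max_mismatches=2):
    """Start positions where `motif` matches with at most `max_mismatches` substitutions."""
    window = window.upper()
    k = len(motif)
    hits = []
    for i in range(len(window) - k + 1):
        mismatches = sum(a != b for a, b in zip(window[i:i + k], motif))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


def has_tsd(transposon, edge=50):
    """Assumption: a TSD-like motif must be found in both the first and last `edge` bp."""
    return bool(find_fuzzy(transposon[:edge])) and bool(find_fuzzy(transposon[-edge:]))
```

Palindromes for which `is_low_complexity` returns True would be discarded, which mirrors the requirement that the two most common nucleotides account for less than 80% of the palindrome.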
Dataset clustering
The filtered dataset was then clustered to reduce redundancy using the RNase H-like domain of the transposase. We performed two clusterings with MMseqs2, one at 0.8 identity and one at 0.6 identity. The 0.8 clustering followed the 80–80–80 transposon annotation rule (ref. 43), under which two transposon elements are considered to belong to the same family if they share 80% (or more) sequence identity in at least 80% of their coding or internal domain. This dataset was used for the fine-tuning of the pLLMs. The clustering at 0.6 was performed to obtain a broader classification of PiggyBac families and was used for the phylogenetic analysis. The clustering at 0.8 produced 13,693 clusters, while that at 0.6 produced 2,572 clusters.
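The family criterion described above can be written as a small predicate. This sketch only illustrates the identity and coverage thresholds as they are worded in the text; the clustering itself was done with MMseqs2, and the function and argument names here are hypothetical.

```python
def same_family(n_identical, n_aligned, domain_len,
                min_identity=0.80, min_coverage=0.80):
    """Check the identity/coverage criterion for two aligned domains.

    n_identical: identical positions in the alignment
    n_aligned:   aligned positions
    domain_len:  length of the coding or internal domain being compared
    """
    if n_aligned == 0 or domain_len == 0:
        return False
    identity = n_identical / n_aligned
    coverage = n_aligned / domain_len
    return identity >= min_identity and coverage >= min_coverage
```

Lowering `min_identity` to 0.60 gives the looser criterion that corresponds to the second clustering.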
Phylogenetic analysis of bioprospected sequences
The phylogenetic tree was built with IQ-TREE (version 1.6.12)44 on the basis of an MSA of the 2,572 centroids from the 0.6 clustering, generated with MUSCLE45. ModelFinder46 was used to select the optimal model for accurate phylogenetic estimation (LG + R10) and UFBoot47 was used for bootstrap approximation with 1,000 replicates. The resulting tree was visualized using iTOL48. Additional PiggyBac domains were identified with RPS-BLAST40. Molecular graphics were generated using UCSF Chimera49.
BLAST identification of Poetur orthologs
A BLASTn search of the core nucleotide database was conducted using Poetur as the query. The whole transposon, including the TIRs and TSDs, was used to find hits that also possessed these motifs. A total of four hits from four different species were manually selected on the basis of a coverage higher than 88%, a sequence identity higher than 83% and the presence of all functional domains necessary for transposition activity (RNase H-like domain, CRD, TIR and TSD).
Model fine-tuning
The ProGen2-base19 language model of 764 million parameters was fine-tuned on over 13,000 sequences from the PiggyBac orthologs clustered at 0.8. This fine-tuning was performed to give the ProGen2-base model a better understanding of PiggyBac sequences. In this process, the pretrained model was further trained on the PiggyBac orthologs and, as the model trained, the 764 million parameters were updated in a way that aimed to minimize the cross-entropy loss. We fine-tuned two separate models: one model to generate sequences from the N terminus to C terminus and the second to generate sequences from the C terminus to N terminus. Both models were fine-tuned using the full amino acid sequences excluding the N-terminal domain, which was excluded because it is an extremely variable domain. In the HyPB, the N terminus consists of the first 116 aa and, in general, the N terminus is a disordered region leading up to the first double DNA-binding domain region.
The sequences were split using an 80:20 train–test split. In addition to the set of orthologous sequences used in the training, additional wild-type (WT) HyPB sequences (5–10) were added to the training set to bias the model toward HyPB. This allowed us to generate sequences in a closer sequence identity range to HyPB than we were able to without biasing the dataset. Fine-tuning was performed using the Trainer module from Hugging Face over two epochs with a training batch size of 4 and an evaluation batch size of 8. A constant learning rate of 5.0 × 10⁻⁵ was used and the model was evaluated every 2,000 steps. Cross-entropy loss was used to evaluate every checkpoint and the checkpoint with the lowest validation loss was used for sequence generation. The remaining Trainer parameters were kept at their default values. A full exploration of the Trainer hyperparameters was not performed because, with these fairly standard parameters, we were able to generate convincing sequences with our desired properties.
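The data preparation and the stated hyperparameters can be summarized as follows. This is a sketch under several assumptions: the WT bias is modeled as repeated copies of a single WT sequence added to the training set only, the batch sizes are treated as per-device values, and best-checkpoint selection is expressed through `load_best_model_at_end`. The dictionary keys follow Hugging Face `TrainingArguments` field names, but the step-wise evaluation switch is left out because its argument name has changed between transformers releases.

```python
import random


def split_and_bias(orthologs, wt_seq, n_wt_copies=5, test_fraction=0.20, seed=0):
    """Shuffle, split 80:20, then append WT copies to the training set only.

    n_wt_copies=5 is the lower end of the 5-10 range given in the text.
    """
    seqs = list(orthologs)
    random.Random(seed).shuffle(seqs)
    n_test = int(round(len(seqs) * test_fraction))
    test, train = seqs[:n_test], seqs[n_test:]
    return train + [wt_seq] * n_wt_copies, test


# Values from the text; everything not listed stays at the Trainer defaults.
TRAINING_KWARGS = {
    "num_train_epochs": 2,
    "per_device_train_batch_size": 4,
    "per_device_eval_batch_size": 8,
    "learning_rate": 5e-5,
    "lr_scheduler_type": "constant",
    "eval_steps": 2000,
    "load_best_model_at_end": True,       # assumption: how the best checkpoint is picked
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,
}
```

Keeping the WT copies out of the held-out set means the validation loss still reflects the ortholog distribution rather than the biasing sequences.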
AI sequence generation
In both models, 50 aa from WT HyPB were used to prompt sequence generation. An initial prompt was used to give the model enough context to build a PiggyBac-like sequence. In preliminary testing, 50 aa seemed to provide a good balance of giving the models a good starting point without allowing them to replicate the HyPB sequence perfectly. For the N→C model, the first 50 aa after the N-terminal domain were used and, in the C→N model, the final 50 aa of the CRD were used to prompt sequence generation. For the C→N model, sequences were generated ‘backward’ and then reversed to have the standard directionality. The maximum sequence length for both models was set to 500 aa and a temperature of T = 0.5 and nucleus probability P = 0.95 were used.
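To make the sampling settings concrete, the sketch below applies temperature scaling and nucleus (top-p) truncation to a single next-token distribution, and shows the reversal applied to backward-generated sequences. It is a generic illustration of these two techniques, not the generation code used here; the function names and the dictionary representation of logits are assumptions.

```python
import math


def temperature_top_p(logits, temperature=0.5, top_p=0.95):
    """Return {token: prob} after temperature scaling and nucleus truncation."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())                      # subtract max for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    norm = sum(weights.values())
    ranked = sorted(((w / norm, tok) for tok, w in weights.items()), reverse=True)

    kept, cumulative = [], 0.0
    for prob, tok in ranked:                         # keep the smallest set reaching top_p
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def restore_orientation(generated_backward):
    """Reverse a sequence produced by the backward model to N-to-C order."""
    return generated_backward[::-1]
```

A temperature below 1 sharpens the distribution before truncation, so fewer tokens are needed to reach the 0.95 cumulative probability.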
AI sequence filtering
The generated sequences first went through a set of three basic filters. First, duplicated sequences were removed. Second, sequences with noncanonical amino acids were removed. Third, sequences were filtered using a k-mer repetition filter such that no amino acid motif of six, four, three or two residues was repeated two, three, six or eight times consecutively, respectively. The next set of filters was HyPB specific and included testing for a PiggyBac CRD (based on the presence of at least seven cysteine residues in the final 50 aa), sequence identity to WT (80–95% to the RNase H-like and CRD domains) and specific key residues including the catalytic site, α-bridge residues, hyperactive residues and another extensive set of key residues including DNA-interacting residues.
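The three basic filters and the cysteine-count check can be sketched with the standard library. The pairing of motif lengths with repeat counts (6×2, 4×3, 3×6, 2×8) is an assumption that reads the two lists in the text as corresponding in order, and the function names are hypothetical.

```python
import re

CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")

# (motif length, consecutive copies); pairing assumed from the order in the text.
REPEAT_RULES = [(6, 2), (4, 3), (3, 6), (2, 8)]
REPEAT_PATTERNS = [re.compile(r"(.{%d})\1{%d}" % (k, n - 1)) for k, n in REPEAT_RULES]


def has_kmer_repeat(seq):
    """True if any motif of length k occurs n times in a row for a (k, n) rule."""
    return any(p.search(seq) for p in REPEAT_PATTERNS)


def has_crd_cysteines(seq, tail=50, min_cys=7):
    """At least `min_cys` cysteines in the final `tail` residues."""
    return seq[-tail:].count("C") >= min_cys


def basic_filter(seqs):
    """Drop duplicates, noncanonical residues and repetitive sequences, in input order."""
    seen, kept = set(), []
    for s in seqs:
        if s in seen:
            continue
        seen.add(s)
        if set(s) <= CANONICAL and not has_kmer_repeat(s):
            kept.append(s)
    return kept
```

Note that a long homopolymer run also triggers the repeat patterns, because a run of one residue is a consecutive repeat of any shorter motif made of that residue.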
For all of these sequences, we calculated perplexity using the ProGen2-base model and the fine-tuned model responsible for generating a given sequence. For a subset of sequences that passed our filters, structures were predicted using ESMFold50. Structures were then compared to the experimentally available PiggyBac structure (PDB 6X67) to extract r.m.s.d. and TM-scores using PyMOL (Schrödinger) and TMAlign51, respectively. Finally, structures were aligned to the experimental PiggyBac structure and several surface properties were calculated using SURFMAP: a tool that projects surface residues from a protein structure into a two-dimensional space and can calculate different amino acid residue properties. The five metrics we calculated using SURFMAP were stickiness, circular variance, Wimley–White, Kyte–Doolittle and electrostatics. We then computed cosine similarities between each surface feature in the generated structures and the experimental structure. Lastly, ProteinMPNN28 and ESM1v29 scores were calculated. ProteinMPNN is a deep learning-based sequence design method that can decode amino acid sequences from structural representations of proteins. ProteinMPNN can also be used to generate a log-likelihood score for any given sequence. Wimley–White is a measure of residue hydrophobicity, which was applied to surface residues in this case using SURFMAP.
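Two of the quantities in this paragraph have simple closed forms, shown below as a sketch. Perplexity is computed as the exponential of the mean negative log-likelihood per token, which is the usual definition for autoregressive models. For the cosine similarity, treating each SURFMAP surface map as a flat vector of values is an assumption of this example.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

A sequence in which every residue is predicted with probability 0.25 has a perplexity of 4, so lower values indicate sequences that the model finds more predictable.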
An additional set of filters was created to narrow down the final set of sequences. Sequences were required to be in the top 75th percentile for both ProteinMPNN and ESM1v scores, sequences were filtered on length to exclude sequences that were too short, a conservative pLDDT filter of 90 was used and an acceptable range for net charge of the proteins was established. After this, sequences were selected manually in an attempt to cover sequence identities in the range of 90–97% to the entire HyPB sequence with high-quality sequences. During this manual selection process, sequences with a higher proportion of the key residues were selected for and any sequences that had particularly bad scores in any of the calculated metrics were avoided. A final selection of 22 sequences was made.
In silico deep mutational scan
ESM1v was used in zero-shot mode with the Poeciliopsis amino acid sequence as input. ESM1v computes a fitness score for every possible amino acid at each residue position as a log odds ratio, assuming an additive model when multiple substitutions exist: the sequence is masked at every substituted position and the log odds ratios are summed over the substituted positions29.
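The additive scoring scheme can be illustrated without the model itself. In the sketch below, `masked_log_probs` is a hypothetical precomputed table of the log-probabilities predicted at each position when the substituted positions are masked; the function name and the 0-based position convention are assumptions.

```python
def variant_score(masked_log_probs, wt_seq, substitutions):
    """Sum of log odds ratios (mutant vs. wild type) over substituted positions.

    masked_log_probs[i][aa]: log p(aa at position i | sequence masked at the
                             substituted positions), assumed precomputed.
    substitutions: iterable of (position, mutant_aa), 0-based positions.
    """
    score = 0.0
    for pos, mutant_aa in substitutions:
        wt_aa = wt_seq[pos]
        score += masked_log_probs[pos][mutant_aa] - masked_log_probs[pos][wt_aa]
    return score
```

A negative score means the model assigns the variant a lower likelihood than the wild-type residues at the same positions.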
Variant prediction was run in Google Colab Pro with one A100 GPU with 80 GB of RAM. The script used to run the variant prediction can be found on GitHub (https://github.com/Alejo945/IS-HyPB). The output is a TSV file with all possible variants and their scores.
Plasmid DNA sequences
Transposase ORF amino acid sequences were codon-optimized for Homo sapiens and ordered as synthesized gene fragments from Twist Bioscience. Gene fragments were cloned into a cytomegalovirus-based expression vector by Golden Gate assembly using the Esp3I restriction enzyme. Transposon (cargo vector) plasmid sequences were defined as the first 150 bp from the transposon ends from both 5′ and 3′ TIR sequences and synthesized as gene fragments by Twist Bioscience with added overhangs for Golden Gate assembly. An EF1α RFP poly(A) expression cassette was included between the TIRs. Triple mutant (×3, R372A;K375A;D450N in Trichoplusia ni) residue selection was performed by aligning the ortholog sequences to the T. ni PiggyBac mutated sequence. All plasmid sequences are available in Supplementary Table 1.
Cell culture
HEK293T cells (Invitrogen, R70007) were cultured in DMEM supplemented with high glucose (Gibco, Thermo Fisher), 10% FBS, 2 mM glutamine, 100 U per ml penicillin and 0.1 mg ml⁻¹ streptomycin at 37 °C in a 5% CO2 incubator.
PCR excision activity assay
To detect excision in bioprospected transposases, 120,000 cells were seeded per adherent p24 well 1 day before transfection. Plasmid DNA was mixed at a 1:3 ratio of transposase and RFP transposon, with 0.035 pmol of transposase used per p24 well plate. Then, 48 h after transfection, cells were collected and plasmid extraction was performed using an NZYMiniprep kit (NZYtech, MB01001). TIR-flanking primers (Supplementary Table 4) were used to detect transposon excision. The 2,900-bp and 1,200-bp bands indicated nonexcised and excised transposon, respectively.
Nontargeted transposon integration fluorescence assay
To evaluate stable transposon integration activity, 120,000 cells were seeded per adherent p24 well a day before transfection. Plasmid DNA was mixed with and RFP transposon at a ratio of 1:3:5, with 0.035 pmol of transposase used per p24 well plate. For transfection experiments, cells were transfected with polyethyleneimine (PEI, Thermo Fisher Scientific) at a 1:3 ratio of DNA and PEI in Opti-MEM. RFP expression of the transposon cargo vector was assessed 2 days and 20 days after transfection using cell cytometry with the Cytek Aurora CS system. The RFP signal at day 20 was considered indicative of stable transgene integration.
Transposon excision fluorescence assay
To quantify the excision activity of AI-generated transposases, a fluorescent excision reporter system was used. HEK293T cells were seeded in 24-well plates at a density of 120,000 cells per well 24 h before transfection to ensure approximately 70% confluency on the day of transfection. Transfections were performed in 24-well plates using PEI (Thermo Fisher Scientific) at a 1:3 ratio of DNA and PEI in Opti-MEM (Thermo Fisher). Transposase-expressing plasmid was cotransfected with plasmid containing a disrupted mCherry reporter sequence flanked by transposase recognition sites, leading to mCherry restoration upon excision (Supplementary Fig. 6). Transposase and transposon plasmids were mixed at a 1:3 ratio, with a total of 0.035 pmol of transposase. Then, 72 h after transfection, cells were collected and mCherry reporter expression was assessed by flow cytometry using the Cytek Aurora CS system.
Targeted transposon integration digital PCR assay
To quantify targeted integration of AI-generated transposases in the FiCAT system, C2C12 cells (American Type Culture Collection, CRL-1772) were cultured in DMEM (Gibco, Thermo Fisher) supplemented with 10% FBS, 2 mM L-glutamine, 100 U per ml penicillin and 0.1 mg ml⁻¹ streptomycin. Cells were maintained in a 37 °C incubator with 5% CO2. Electroporation was conducted using the E Cell Line 4D-Nucleofector X Kit S (Lonza). On the day of electroporation, cells were washed with PBS, detached using trypsin–EDTA (Gibco) and adjusted to a concentration of 2 × 10⁵ cells per condition. The cell suspension was prepared in 20 µl of nucleofection master mix buffer, consisting of 16.4 µl of Nucleofector solution and 3.6 µl of supplement 1 (Lonza). Subsequently, each condition was conucleofected with a DNA plasmid encoding the triple-mutant variants (PB×3), Cas9, different guide RNAs (gRNAs) and transposon plasmids in a 1:1:3:3 molar ratio, using a maximum of 10% of the final sample volume. Lastly, each condition was transferred into Nucleocuvette vessels and electroporation was carried out using the CD-137 program. After electroporation, 100 µl of prewarmed complete medium was added and cells were carefully resuspended and transferred into a 24-well plate containing 500 µl of complete medium for recovery and expansion. Then, 4 days after electroporation, the cells were processed as follows: (1) one third were collected for genomic extraction; (2) one third were analyzed for GFP reporter expression by flow cytometry using the Cytek Aurora CS system; and (3) one third were maintained in culture until episomal disappearance. Genomic extraction was performed using the Qiagen DNeasy blood and tissue kit. Primers and probes were obtained from PrimeTime qPCR probes (Integrated DNA Technologies). The assay was designed using an endogenous control and evaluating the junction PCR for both integration orientations.
Reaction mixtures (44 µl) were prepared containing QIAcuityDx Universal master mix (1×), MgCl2 (6.28 mM), primers (0.73 µM), probes (0.63 µM), a restriction enzyme (0.25 U per µl) and 12.5 ng of sample DNA. These mixtures were loaded onto a QIAcuityDx Nanoplate 26k 24-well (260001) for quantification, following the preparation protocol provided in the QIAcuityDx Universal master mix kit (260102). The thermal cycling protocol consisted of an initial enzyme activation step at 95 °C for 2 min, followed by 40 cycles of a two-step amplification: denaturation at 95 °C for 15 s and annealing and extension at 60 °C for 30 s. For digital PCR analysis, the absolute DNA quantification per sample (copies per genome) was determined using QIAcuity Software. Primer sequences are described in Supplementary Table 6.
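The quantification itself was done by the instrument software, whose internal calculation is not described here. As a generic illustration of how digital PCR counts are commonly converted to copy numbers, the sketch below applies the standard Poisson correction and normalizes the junction target to the endogenous control. The function names, the assumption that both targets share the same valid partitions, and the reference copy number per genome are all assumptions of this example.

```python
import math


def copies_per_partition(n_positive, n_valid):
    """Poisson-corrected mean copies per partition: lambda = -ln(1 - k/n)."""
    if n_positive >= n_valid:
        raise ValueError("all partitions positive: concentration cannot be estimated")
    return -math.log(1.0 - n_positive / n_valid)


def copies_per_genome(target_positive, reference_positive, n_valid,
                      reference_copies_per_genome=1):
    """Target copies per genome, normalized to an endogenous reference.

    reference_copies_per_genome is an assumed value that depends on the
    chosen control locus and the ploidy convention.
    """
    target = copies_per_partition(target_positive, n_valid)
    reference = copies_per_partition(reference_positive, n_valid)
    return reference_copies_per_genome * target / reference
```

The Poisson correction matters mainly at high occupancy, where a positive partition is increasingly likely to contain more than one template molecule.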
Targeted transposon integration fluorescence and qPCR assay
To quantify targeted integration of bioprospected transposases in the FiCAT system, plasmids encoding the triple-mutant variants (PB×3) were cotransfected with Cas9, gRNA AAVS1-3, transposase and transposon plasmids at a 1:1:3:5 molar ratio in 0.5 million HEK293T cells seeded in a p6 plate the day before transfection. Cells were analyzed for RFP expression 2 days after transfection to estimate transfection efficiency using cell cytometry with the Cytek Aurora CS system. Cells were maintained in culture to measure overall integration levels after 3 weeks. In parallel, to enrich cells for junction qPCR, two rounds of enrichment by GFP sorting were conducted with a FACSAria (BD Biosciences), 1 week and 2 weeks after transfection. Genomic DNA was extracted using a Qiagen DNeasy blood and tissue kit column 4 days after the second sorting. A 3′ junction PCR was performed and sequenced on an Illumina MiSeq using a Nano kit (v2, 500 cycles). A 3′ junction qPCR was performed to compare targeted integration across bioprospected transposases.
Targeted transposon integration GFP reconstitution assay
To quantify targeted integration of AI-generated PiggyBac transposases in the FiCAT system, a previously described GFP reconstitution assay52 was used. For GFP targeted integration assays, a reporter HEK293T cell line containing genomically integrated 2/2 GFP was transfected using a 1/2 GFP encoding transposon (Supplementary Fig. 6). A total of 240,000 2/2 GFP HEK293T reporter cells were seeded in a 12-well plate 1 day before transfection. Cells were transfected with Lipofectamine 3000 (Invitrogen, L3000001) using Cas9, 2/2 GFP-targeting gRNA, transposase and transposon plasmids at a 1:1:3:5 molar ratio. Cells were analyzed for GFP expression 5 days after transfection to estimate targeted integration efficiency using cell cytometry with the Cytek Aurora CS system. The 2/2 GFP was integrated using the Sleeping Beauty (SB100x) transposase system53. Reporter DNA sequences are available in Supplementary Table 3.
Nontargeted transposon integration fluorescence assay in T cells
To assess nontargeted integration of the PiggyBac and AI-generated orthologs in T cells, peripheral blood mononuclear cells from two different donors, isolated from buffy coats and cryopreserved, were thawed and seeded on p24-coated plates containing anti-CD3/CD28 (1:1,000; BD Biosciences) at a density of 1 × 10⁶ cells per ml in 3 ml of CTS OpTmizer T cell expansion SFM medium (Thermo Fisher), supplemented with interleukin (IL)-7 and IL-15 (10 ng ml⁻¹ each; Miltenyi Biotec). Buffy coats were obtained from the Barcelona Blood and Tissue Bank upon institutional review board approval.
For nontargeted integration of bioprospected orthologs, on the third day of culture, electroporation was conducted using the P3 primary cell 4D-Nucleofector X kit (Lonza). Cells were washed with PBS (Capricorn) and adjusted to a concentration of 7.5 × 10⁵ cells per condition. The cell suspension was prepared in 20 µl of nucleofection buffer, consisting of 16.4 µl of P3 primary cell Nucleofector solution and 3.6 µl of supplement 1 (Lonza). Subsequently, 1 µg of each DNA plasmid was added to the suspension and electroporation was carried out using the EO-115 nucleofection program. The minimal backbone GenCircle-TIR_CAR19-GFP transposon plasmid was used (GenCircle, manufactured by Genscript). For each evaluated transposase, conditions with transposase + transposon and transposon only were electroporated in duplicates to differentiate between episomal and integrated signals. Following electroporation, 80 µl of complete medium was added and cells were incubated at 37 °C for 20 min. The cells were then carefully resuspended and transferred to a fresh p24 plate containing 500 µl of medium for recovery and expansion. Approximately one third of the well volume was used for flow cytometric analysis using the Aurora system (Cytek) to assess RFP expression levels at 4 and 7 days after transfection.
For nontargeted integration of AI-generated orthologs, on the third day of culture, electroporation was conducted using the P3 primary cell 4D-Nucleofector X kit (Lonza). Cells were washed with PBS (Capricorn) and adjusted to a concentration of 1 × 10⁶ cells per condition. The cell suspension was prepared in 20 µl of nucleofection buffer, consisting of 16.4 µl of P3 primary cell Nucleofector solution and 3.6 µl of supplement 1 (Lonza). Subsequently, 1 µg of each DNA plasmid was added to the suspension and electroporation was carried out using the EH-115 nucleofection program. The minimal backbone GenCircle-TIR_CAR19-GFP transposon plasmid was used (GenCircle, manufactured by Genscript). For each evaluated transposase, conditions with transposase + transposon and transposon only were electroporated in duplicates to differentiate between episomal and integrated signals. Following electroporation, 80 µl of complete medium was added and cells were incubated at 37 °C for 20 min. The cells were then carefully resuspended and transferred to a fresh p24 plate containing 500 µl of medium for recovery and expansion. Medium supplemented with H-151 (MedChemExpress, HY-112693) STING inhibitor at 2 µM was added. Approximately one third of the well volume was used for flow cytometric analysis using the Aurora system (Cytek) to assess GFP expression levels at 4 and 7 days after transfection.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.