Site-specific DNA insertion into the human genome with engineered recombinases

Ethics statement

Our research complies with relevant ethical regulations. Experiments using hESC lines were performed under an allowance granted by the Arc Institute Stem Cell Research Oversight Committee.

Cell lines and culture

Experiments were conducted in HEK293FT cells (Thermo Fisher Scientific, R70007, female), H1 hESCs (WiCell Research Institute, male), WTC-11 iPSCs (Coriell Institute for Medical Research, GM25256, male) and primary human T cells (STEMCELL Technologies, 200-0092) from deidentified healthy donors. HEK293FT cells were cultured in DMEM with 10% FBS (Gibco) and 1× penicillin−streptomycin (Thermo Fisher Scientific) and dissociated using TrypLE Express (Gibco). H1 hESCs and WTC-11 iPSCs were maintained in mTeSR Plus (STEMCELL Technologies) supplemented with 1× antibiotic-antimycotic (Thermo Fisher Scientific) and cultured on Cultrex (Bio-Techne) or Matrigel (Corning) coated plates. For routine passaging, hESCs and iPSCs were dissociated with ReLeSR (STEMCELL Technologies). For 96-well plating prior to transfections, single-cell dissociation was performed using Accutase (STEMCELL Technologies). hESCs and iPSCs were supplemented with 10 µM ROCK inhibitor for 24 hours after dissociation. Primary human T cells were cultured in complete X-VIVO 15 (cXVIVO 15) (Lonza Bioscience, 04-418Q), which consists of 5% FCS (R&D Systems, M19187), 5 ng μl⁻¹ IL-7 and 5 ng μl⁻¹ IL-15. For the cell cycle arrest experiment, HEK293FT cells were treated with 5 μM aphidicolin at the time of transfection. HEK293FT, WTC-11 and H1 cells were tested monthly for mycoplasma and were negative.

Dn29 deep mutational scan library construction

An NNK deep mutational scanning library of the entire Dn29 coding sequence (CDS) was generated using NNK oligos and overlap extension PCRs. First, forward and reverse oligos with NNK mixed bases at each codon were designed with a melting temperature of 65 °C. Each NNK forward primer was paired with Dn29 DMS_universal_reverse that binds downstream of the CDS and each NNK reverse primer with DMS_universal_forward primer, generating amplicons flanking the mutated codon. PCR reactions contained 2.5 µl of Q5 Master Mix (New England Biolabs (NEB)), 0.01 µl of Dn29 plasmid template (100 ng µl⁻¹), 0.025 µl of universal primer (100 µM), 1.465 µl of water and 1 µl of unique NNK primer (2.5 µM). Cycling conditions were as follows: 98 °C for 30 seconds; 30 cycles of 98 °C for 10 seconds, 60 °C for 30 seconds and 72 °C for 1 minute; final extension of 72 °C for 2 minutes.
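
As a quick check on what an NNK codon can encode, the sketch below enumerates all NNK codons (N = A/C/G/T, K = G/T) and translates them with the standard genetic code. It is an illustration only and is not part of the library construction workflow; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): _AA[i] for i, c in enumerate(product("TCAG", repeat=3))}


def nnk_codons():
    """Return every codon matching the degenerate pattern NNK."""
    return [a + b + c for a, b, c in product("ACGT", "ACGT", "GT")]


def nnk_summary():
    """Return (number of codons, number of amino acids, number of stop codons)."""
    codons = nnk_codons()
    residues = [CODON_TABLE[c] for c in codons]
    amino_acids = set(residues) - {"*"}
    return len(codons), len(amino_acids), residues.count("*")


print(nnk_summary())  # (32, 20, 1)
```

Each NNK position therefore samples all 20 amino acids with 32 codons and a single stop codon (TAG).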

Upstream and downstream amplicons (2.5 µl each) were pooled and cleaned with 2 µl of ExoSAP-IT (Thermo Fisher Scientific) and 0.5 µl of DpnI (NEB), incubating at 37 °C for 30 minutes and then at 80 °C for 15 minutes. For the overlap extension PCR, 1 µl of cleaned PCR pool was mixed with 2.5 µl of Q5 2× Master Mix, 0.025 µl of each universal primer (100 µM) and 1.45 µl of water, using the same cycling conditions.

The full mutant pool was created by combining 2.5 µl of each overlap extension PCR. The full-length Dn29 fragment was gel extracted (Monarch DNA Gel Extraction Kit; NEB). The library and pEVO backbone were digested with XbaI and HindIII-HF (NEB). Ligation used 100 ng of total DNA (3:1 molar ratio of library to backbone), 2 µl of T4 ligase (NEB), 4 µl of 10× T4 ligase buffer (NEB) and water to 40 µl. The reaction was split into two 20-µl reactions, ligated for 30 minutes at room temperature, inactivated at 65 °C for 10 minutes and purified (Clean and Concentrator-5 Kit; Zymo Research).

The ligation product was electroporated into XL-1 Blue cells (Agilent Technologies) according to the manufacturer’s instructions, recovered for 1 hour at 37 °C in 1 ml of SOC medium and plated onto four 245-mm × 245-mm BioAssay dishes. Approximately 1 million colonies were obtained. Plasmids were purified using a NucleoBond Xtra Midi EF Kit (Macherey-Nagel) and sequenced with an Illumina NextSeq 2000 600-cycle P1 Kit (Supplementary Fig. 1c) using the NextSeq 1000/2000 Control Software Suite version 1.7.1.

Substrate-linked directed evolution

For library transformation, induction and growth: 4 µl of pEVO plasmid library was electroporated into 50 µl of XL-1 Blue competent cells (Agilent Technologies), recovered in 1 ml of SOC medium (37 °C, 1 hour) and then seeded into 100 ml of LB medium with carbenicillin and L-arabinose (10 µg ml⁻¹ or 0 µg ml⁻¹). Cultures were grown overnight at 37 °C. Library coverage (>1 million colonies) was confirmed by plating serial dilutions. Plasmids were extracted using a Qiagen Plasmid Midi Kit (0.3 g of wet bacteria pellet per column).

Selection of active variants: 500 ng of plasmid was digested with NdeI (NEB) to eliminate inactive variants. Active variants were amplified using 25 µl of 2× Platinum SuperFi II Master Mix (Thermo Fisher Scientific), 19 µl of water, 2 µl each of SLiDE_recovery_forward and SLiDE_recovery_reverse primers (10 µM) and 2 µl of NdeI-digested material. PCR conditions were as follows: 98 °C for 30 seconds; 30 cycles of 98 °C for 10 seconds, 52 °C for 10 seconds, 72 °C for 55 seconds; final extension at 72 °C for 5 minutes. The correct size band was gel extracted (Monarch DNA Gel Extraction Kit).

Cloning for next evolution cycle: Amplified active variants and pEVO backbone were digested with XbaI and HindIII-HF (NEB) at 37 °C for 30 minutes and then heat inactivated at 80 °C for 20 minutes. Digested variants were purified using DNA Clean and Concentrator-5 (Zymo Research) and backbone with DNA Clean and Concentrator-25 (Zymo Research). Five ligation reactions (20 µl each) were set up using 100 ng of DNA (3:1 ratio of library to backbone) and T4 ligase (NEB). Ligation occurred at room temperature for 30 minutes, followed by heat inactivation at 65 °C for 10 minutes. Pooled reactions were purified (DNA Clean and Concentrator-5 Kit), eluted in 6 µl of water and electroporated into XL-1 Blue cells to start the next evolution cycle.

DNA shuffling and fragment reassembly

Shuffling the active variants between rounds of cycling involved a uridine exchange PCR to partially exchange thymidines for uridine, USER Enzyme fragmentation at uridine sites, primerless PCR fragment reassembly and PCR for full-length gene recovery.

Uridine exchange PCR: Fragment size and yield were optimized by modifying the dUTP/dTTP ratio, with the optimal ratio being 3/7. PCR mixture: 5 µl of 10× Thermopol Buffer, 1 µl of 10 mM dNTPs, 1 µl each of SLiDE_recovery_forward and SLiDE_recovery_reverse primers (10 µM), 1 µl of plasmid library, 1 µl of Taq Polymerase and 40 µl of water. Cycling conditions were as follows: 95 °C for 30 seconds; 30 cycles of 95 °C for 20 seconds, 60 °C for 30 seconds, 68 °C for 1 minute per kilobase; final extension at 68 °C for 5 minutes. The full-length gene band was gel extracted (Monarch Gel Extraction Kit).

USER Enzyme digestion: 500-ng aliquots were digested with 2 µl of USER Enzyme (NEB) at 37 °C for 3 hours. Gel electrophoresis confirmed fragment distribution (100−1,000 bp).

Fragment reassembly: Fragments were purified (DNA Clean and Concentrator-5) and reassembled in a primerless PCR containing 25 µl of purified fragments and 25 µl of 2× Q5 High-Fidelity Master Mix (NEB). Cycling conditions were as follows: 98 °C for 30 seconds; 30−50 cycles of 98 °C for 10 seconds, 30 °C for 30 seconds (+1 °C per cycle), 72 °C for 1 minute (+4 seconds per cycle); final extension at 72 °C for 10 minutes. A final PCR was performed to recover only the full-length Dn29 CDS for further rounds of directed evolution. The PCR mixture for full-length gene recovery contained 25 µl of Platinum SuperFi II 2× Master Mix (Thermo Fisher Scientific), 10 µl of reassembled fragments, 2 µl each of DMS_universal_forward and DMS_universal_reverse primers (10 µM) and 11 µl of water. Cycling conditions were as follows: 98 °C for 30 seconds; 35 cycles of 98 °C for 10 seconds, 60 °C for 10 seconds, 72 °C for 55 seconds; final extension at 72 °C for 5 minutes.
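
The incremental reassembly cycling can be tabulated with a short script. The sketch below assumes that the first cycle uses the stated starting values (30 °C annealing, 60 seconds extension) and that the increments apply from the second cycle onward; this convention is an assumption, and the function names are hypothetical.

```python
def reassembly_schedule(n_cycles, anneal_start=30.0, anneal_step=1.0,
                        ext_start=60, ext_step=4):
    """Return a list of (annealing temperature in C, extension time in s) per cycle.

    Assumes cycle 1 uses the starting values and each later cycle adds one step.
    """
    return [(anneal_start + anneal_step * i, ext_start + ext_step * i)
            for i in range(n_cycles)]


def total_extension_minutes(n_cycles):
    """Total time spent in the 72 C extension step across all cycles."""
    return sum(ext for _, ext in reassembly_schedule(n_cycles)) / 60


schedule = reassembly_schedule(30)
print(schedule[0], schedule[-1])    # (30.0, 60) (59.0, 176)
print(total_extension_minutes(30))  # 59.0
```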

The gel-extracted, shuffled and reassembled genes were cloned into the plasmid backbone using XbaI and HindIII digest and T4 ligation as previously described.

Variant library NGS and analysis

Six primer sets (DMS_NGS primers; Supplementary Table 8) were designed to amplify approximately 260-bp segments of the Dn29 CDS with Illumina adapter overhangs. Two rounds of PCR were performed to add P5/P7 adapters and i5/i7 indexes (FLAP2 primers). Amplicons were cleaned with AMPure XP beads (Beckman Coulter) between PCR rounds and after the final PCR. Amplicons were pooled in equimolar ratios, quantified using Qubit dsDNA High Sensitivity Kit (Thermo Fisher Scientific) and sequenced on an Illumina NextSeq 2000 (600-cycle kit). Full overlap between read 1 and read 2 was ensured for higher confidence in mutation calling.

Paired-end reads were merged using BBMerge (version 39.06) and analyzed with a custom Python script. The script converted Phred quality scores to error probabilities using the formula \(P=10^{-Q/10}\), where P is the probability of error and Q is the Phred quality score. Reads with a summed error probability greater than 0.5 or containing frameshifts were filtered out. Nucleotide and amino acid mutations at each position were then counted and plotted. Enrichment for each amino acid (AA) between the input and output libraries was calculated using the following formula: ((%AA_output) / (1 − %AA_output)) / ((%AA_input) / (1 − %AA_input)). To distinguish library construction-based dropouts from selection-based dropouts in the enrichment heatmaps, any amino acids with zero reads in the output library were assigned a single read.
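
The quality filter and enrichment calculation can be sketched as follows. This is a minimal illustration and not the custom script used in this study: the function names are hypothetical, and the fractions are assumed to be proportions between 0 and 1 rather than percentages.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def passes_quality(quals, max_summed_error=0.5):
    """Keep a read only if its summed per-base error probability is at most 0.5."""
    return sum(phred_to_error_prob(q) for q in quals) <= max_summed_error


def odds_enrichment(frac_out, frac_in):
    """Odds ratio between output and input fractions (both strictly between 0 and 1)."""
    return (frac_out / (1 - frac_out)) / (frac_in / (1 - frac_in))


def aa_enrichment(count_out, total_out, count_in, total_in):
    """Enrichment of one amino acid at one position.

    Amino acids with zero output reads are assigned a single read, as in the text.
    """
    count_out = max(count_out, 1)
    return odds_enrichment(count_out / total_out, count_in / total_in)


print(passes_quality([30] * 260))  # True: summed error is 0.26
print(passes_quality([20] * 260))  # False: summed error is 2.6
```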

Nanopore sequencing and analysis

Variants were cloned into a vector containing a 100-nucleotide (nt) random unique molecular identifier (UMI) barcode with a BHVD repeat pattern. The plasmid library was linearized by Eco105I digestion. Nanopore libraries were prepared using a barcoded nanopore sequencing kit (SQK-NBD114.24) with 1 µg of linearized plasmid library and sequenced on a MinION flow cell (R10.4.1) for 72 hours using MinKnow UI control software version 6.5.15.
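
For illustration, a UMI can be checked against the BHVD repeat pattern with a regular expression built from the IUPAC codes (B = C/G/T, H = A/C/T, V = A/C/G, D = A/G/T). The sketch below is not part of the analysis pipeline; the function names are hypothetical. It also shows that a sequence following this pattern cannot contain a homopolymer longer than 3 nt, because each base is excluded at one of every four positions.

```python
import re

# IUPAC degenerate codes used in the UMI pattern.
IUPAC = {"B": "CGT", "H": "ACT", "V": "ACG", "D": "AGT"}


def umi_regex(pattern="BHVD", length=100):
    """Compile a regex that matches a UMI of the given length following the repeat pattern."""
    classes = ["[" + IUPAC[pattern[i % len(pattern)]] + "]" for i in range(length)]
    return re.compile("".join(classes))


def matches_umi(seq, pattern="BHVD", length=100):
    """True if seq is exactly `length` nt long and obeys the pattern at every position."""
    return umi_regex(pattern, length).fullmatch(seq) is not None


def longest_homopolymer(seq):
    """Length of the longest run of identical bases in seq."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


print(matches_umi("CCCA" * 25))          # True
print(longest_homopolymer("CCCA" * 25))  # 3
print(matches_umi("A" * 100))            # False: B excludes A
```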

Sequencing reads were filtered using nanoq (version 0.9.0) with settings --min-len 4500 --max-len 5500 --min-qual 10 (that is, minimum q score of 10, a minimum read length equivalent to 90% of the expected read length and a maximum read length equivalent to 110% of the expected read length). The UMI sequence was extracted using Cutadapt (version 1.18) with settings -g 'GGCGGTCACCATCACCACCACCACGCTACACG;max_error_rate=0.2...ACTGTAC;max_error_rate=0.2' --trimmed-only --revcomp --minimum-length 95. All UMI sequences were trimmed to 95 nt using seqkit (version 1.3-r106) with the command seqkit subseq -r 1:95.

Reads were clustered by UMI with mmseqs easy-linclust (version 14.7e284) with setting --min-seq-id 0.5. For each UMI cluster bin with at least 15 reads, a representative cluster sequence was generated by using usearch (version 11) with settings -cluster_fast -id 0.75 -strand both -sizeout -centroids and taking the first representative sequence of the output⁶⁶. A final consensus sequence was generated by one round of polishing with Medaka (version 1.9.1) with settings -m r1041_e82_260bps_hac_g632. Counts for each unique variant were determined by tallying the total consensus sequences.
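
The final tallying step can be sketched as below. The input dictionaries are hypothetical stand-ins for the clustering and polishing outputs (they are not the file formats produced by mmseqs, usearch or Medaka), and the function name is illustrative.

```python
from collections import Counter


def tally_variants(cluster_reads, cluster_consensus, min_reads=15):
    """Count how many UMI clusters support each consensus variant.

    cluster_reads: {cluster_id: number of reads in the UMI cluster}
    cluster_consensus: {cluster_id: polished consensus sequence or variant label}
    Clusters with fewer than min_reads reads are ignored.
    """
    counts = Counter()
    for cluster_id, n_reads in cluster_reads.items():
        if n_reads >= min_reads:
            counts[cluster_consensus[cluster_id]] += 1
    return counts


reads = {"umi1": 40, "umi2": 15, "umi3": 14, "umi4": 22}
consensus = {"umi1": "variantA", "umi2": "variantA", "umi3": "variantB", "umi4": "variantB"}
print(tally_variants(reads, consensus))  # Counter({'variantA': 2, 'variantB': 1})
```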

Cloning variant library into a mammalian expression vector

Primers (DE_mammalian_forward and DE_mammalian_reverse) were designed to amplify the Dn29 CDS from the active variant PCR library, adding overhangs for Esp3I-compatible Golden Gate cloning. PCR conditions were as follows: 25 µl of 2× Platinum SuperFi II Master Mix (Thermo Fisher Scientific), 19 µl of water, 2 µl of purified active variant library and 2 µl each of primer. Cycling conditions were as follows: 98 °C for 60 seconds; 30 cycles of 98 °C for 10 seconds, 60 °C for 10 seconds, 72 °C for 55 seconds; final extension at 72 °C for 5 minutes. The product was purified (DNA Clean and Concentrator-5) and quantified by NanoDrop (Thermo Fisher).

A mammalian expression vector was designed with the EF1α promoter upstream of an Esp3I Golden Gate landing pad, used as the destination for the protein variant library. The landing pad was followed by a T2A self-cleaving peptide sequence and an enhanced green fluorescent protein (EGFP) CDS.

Golden Gate reaction mixture: 75 ng of mammalian expression vector, amplified variant library (3:1 molar ratio to vector), 1 µl of T4 DNA Ligase Buffer (NEB), 0.5 µl of T4 DNA Ligase (NEB), 0.5 µl of Esp3I (Thermo Fisher Scientific) and up to 10 µl of nuclease-free water. Cycling conditions were as follows: 35 cycles of 37 °C for 1 minute, 16 °C for 1 minute; 37 °C for 30 minutes; 80 °C for 20 minutes. Five Golden Gate reactions were performed, pooled and purified (DNA Clean and Concentrator-5). The library was transformed into Mach1 Escherichia coli and plated for overnight growth. Random colonies were picked, grown in 4 ml of TB-Carbenicillin and miniprepped (NucleoSpin Plasmid Transfection Grade Mini Kit; Macherey-Nagel).

Transfection of HEK293FT cells for assessing genomic integration

One day before transfection, 12,000−18,000 HEK293FT cells were plated per well of a 96-well plate, aiming for 60–80% confluency at the time of transfection.

Standard LSR + donor transfection: for transfections containing an LSR effector plasmid and a donor plasmid, each well was transfected with 725 ng of DNA, containing a 5:1 molar ratio of donor plasmid to effector plasmid, using 0.5 µl of Lipofectamine 2000 (Thermo Fisher Scientific) per well.
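
Converting a molar ratio into plasmid masses requires the plasmid lengths, because the mass of each plasmid scales with its molar share multiplied by its size. The sketch below illustrates the arithmetic; the plasmid sizes are hypothetical placeholders (actual sizes are not given here), and the function name is illustrative.

```python
def masses_for_molar_ratio(total_ng, sizes_bp, molar_ratio):
    """Split total_ng of DNA among plasmids so that they meet the given molar ratio.

    sizes_bp: {plasmid name: length in bp}
    molar_ratio: {plasmid name: relative molar amount}
    Mass is taken as proportional to (molar amount x length).
    """
    weights = {name: molar_ratio[name] * sizes_bp[name] for name in sizes_bp}
    scale = total_ng / sum(weights.values())
    return {name: weight * scale for name, weight in weights.items()}


# Hypothetical sizes: a 5-kb donor and a 10-kb effector at a 5:1 molar ratio.
masses = masses_for_molar_ratio(
    725,
    sizes_bp={"donor": 5000, "effector": 10000},
    molar_ratio={"donor": 5, "effector": 1},
)
print({name: round(ng, 1) for name, ng in masses.items()})
# {'donor': 517.9, 'effector': 207.1}
```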

Standard LSR-dCas9 + donor + guide transfection: LSR−dCas9 effector plasmid, donor plasmid and guide plasmid were transfected with 725 ng of total DNA, containing a 5:1:1 molar ratio of donor:effector:guide plasmid with 0.5 µl of Lipofectamine 2000 per well, unless specified otherwise in the figure legends.

Modified transfection conditions: Experiments shown in Fig. 3d,e and Extended Data Fig. 5b were transfected with 375 ng of effector plasmid, 100 ng of sgRNA plasmid and 250 ng of donor plasmid using Lipofectamine 2000. Experiments shown in Figs. 3g, 5 and 6 used a consolidated plasmid expressing both the effector and guide RNA. In HEK293FT experiments, this consolidated plasmid was transfected at a 5:1 ratio of donor:effector/guide plasmid with 0.585 µl of Lipofectamine 2000 per well. For transfections containing two gRNA plasmids (Fig. 3k and Extended Data Fig. 5h), each well was transfected with 375 ng of effector plasmid, 75 ng each of gRNA plasmid and 250 ng of donor plasmid.

The cells were incubated and monitored for 3 days for mCherry (donor plasmid) and GFP (effector plasmid) expression. Cells were then harvested for flow cytometry (Attune NxT Flow Cytometer; Thermo Fisher Scientific) or genomic DNA extraction for downstream analyses.

Cell harvest, ddPCR, qPCR and flow cytometry

Three days after transfection, cells were trypsinized with 50 µl of TrypLE (Gibco) for 10 minutes and then quenched with 50 µl of Stain Buffer (BD Biosciences). The 100-µl cell suspension was split into two 50-µl aliquots in U-bottom 96-well plates and centrifuged (300g, 5 minutes), and the supernatant was aspirated. One plate was resuspended in 200 µl of Stain Buffer and analyzed with an Attune NxT Flow Cytometer with autosampler (Thermo Fisher Scientific).

The other plate was resuspended in 50 µl of QuickExtract DNA Solution (Biosearch Technologies), vortexed for 15 seconds and thermocycled: 65 °C for 15 minutes, 68 °C for 15 minutes, 98 °C for 10 minutes. DNA was cleaned with 0.9× AMPure XP (Beckman Coulter) beads.

To assess integration efficiency and specificity, qPCR/ddPCR primers and probes were designed to span the left integration junction of attH1 and attH3, using a constant primer that binds to the donor plasmid sequence (ddPCR_donor_reverse_1), a genome binding primer near the pseudosite (ddPCR_attH1_forward_1, ddPCR_attH3_forward) and a FAM probe within the amplicon (ddPCR_attH1_probe_1, ddPCR_attH3_probe). For attH1, a second set of primers/probes was designed to target the right junction to verify measurement accuracy (ddPCR_attH1_2 primers/probe). Genomic reference primers and probes located nearby each attachment site were designed to measure pseudosite copy number for efficiency percentage calculations.

ddPCR reaction mix (22 µl total): 11 µl of ddPCR Supermix for Probes (no dUTP) (Bio-Rad), 1.98 µl of each primer (10 µM), 0.55 µl of each probe (10 µM), 1.65 µl of cleaned gDNA, 0.22 µl of SacI-HF (NEB) and water to volume. Each reaction contained primers and probes for the target site (FAM probe) and a nearby reference locus (HEX probe). Reactions were run on a QX200 AutoDG Droplet Digital PCR System (Bio-Rad) using Bio-Rad QX Manager Software version 2.1.0, and data were analyzed and visualized using Microsoft Excel (version 16.89.1) and GraphPad Prism (version 10.3.0). For off-target detection or low-concentration samples, primers were increased to 20 µM and volume halved, and gDNA volume was increased to 4.95 µl.
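
For orientation, integration efficiency can be expressed as the ratio of junction copies to reference copies measured in the same reaction. The sketch below applies the standard Poisson correction to droplet counts; it assumes that both assays share the same accepted droplets and is not the QX Manager implementation. The function names and the example droplet counts are hypothetical.

```python
import math


def copies_per_droplet(positive, total):
    """Poisson-corrected mean copies per droplet from positive and total droplet counts."""
    if not 0 <= positive < total:
        raise ValueError("positive must be >= 0 and smaller than total")
    return -math.log(1 - positive / total)


def integration_efficiency(target_positive, reference_positive, total):
    """Percentage of reference (pseudosite) copies that carry the integration junction."""
    target = copies_per_droplet(target_positive, total)
    reference = copies_per_droplet(reference_positive, total)
    return 100 * target / reference


# Hypothetical example: 100 FAM-positive and 5,000 HEX-positive droplets out of 20,000.
print(round(integration_efficiency(100, 5000, 20000), 2))  # 1.74
```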

qPCR reaction mix (40 µl total): 1 µl of each primer, 0.8 µl of each probe, 20 µl of TaqMan Fast Advanced Master Mix (Thermo Fisher Scientific), 2.4 µl of genomic DNA and 12 µl of water. The master mix was split into three 10-µl technical replicates in a 384-well plate and run on a LightCycler 480 (Roche) using LightCycler 480 software version 1.5.1.62. Primer pairs for ddPCR and qPCR are provided in Supplementary Table 8.

Three-plasmid recombination assay in HEK293FT cells

A fluorescent reporter assay was used to assess episomal plasmid recombination in HEK293FT cells. One day before transfection, 12,000−18,000 HEK293FT cells were plated per well of a 96-well plate, aiming for 60−80% confluency at the time of transfection. Three plasmids at a 1:1:1 molar ratio were transfected into the cells using Lipofectamine 2000: (1) 200 ng of the effector plasmid expressing the Dn29 variants and GFP; (2) 50.5 ng of the donor plasmid containing the attP attachment sequence and mCherry; and (3) 70.6 ng of the acceptor plasmid containing an EF1α promoter and the cognate attB attachment sequence. Upon recombination of the two attachment sequences, the EF1α promoter drives expression of the mCherry CDS, which is read out by flow cytometry (Extended Data Fig. 4d). To assess the excision reaction, the attP in the donor plasmid was replaced with the left post-recombination attachment site (attB-L:attP-R), called attL, and the attB was replaced with the right post-recombination attachment site (attP-L:attB-R), called attR. To assess attP recombination with model organism pseudosites, the attB sequence was replaced with the pseudosite sequences. Mismatching LSR (Bxb1) controls with each donor and acceptor plasmid were used to correct for leaky mCherry background expression and to define the flow cytometry gating boundaries. Three days after transfection, the cells were trypsinized with 50 µl of TrypLE (Gibco) for 10 minutes, quenched with 50 µl of Stain Buffer, transferred to U-bottom 96-well plates and centrifuged (300g, 5 minutes), and then the supernatant was aspirated. Cells were resuspended in 200 µl of Stain Buffer and analyzed with an Attune NxT Flow Cytometer with autosampler.

Site-directed mutagenesis for combinatorial mutant cloning

Site-directed mutagenesis (SDM) primers were designed using the script from Bi et al.⁶⁷, selecting primers with melting temperature closest to 65 °C. For each mutation, a forward and reverse primer were generated, each containing the desired mutation at the center. PCR reactions were set up combining forward SDM primer with DMS_universal_reverse primer or reverse SDM primer with DMS_universal_forward primer. The PCR mixture (12.5 µl total) contained 6.25 µl of Platinum SuperFi II Master Mix, 0.5 µl each of primer (10 µM), 1 µl of plasmid template DNA (1 ng µl⁻¹) and water to volume. PCR was run using the standard Platinum SuperFi II Master Mix protocol with annealing temperature at 65 °C. Products were cleaned with 0.5× AMPure XP beads. For Gibson assembly, 1 µl each of cleaned PCR product, 5 µl of Gibson master mix and 3 µl of water were incubated at 50 °C for 15 minutes and then transformed into Mach1 E. coli and plated. For simultaneous cloning of two or more mutations, universal primers were replaced with other mutations' forward and reverse primers. Two mutations required a two-piece Gibson assembly, three mutations required a three-piece assembly and so forth.

Genome-wide integration site mapping

A Tn5 tagmentation and PCR amplification-based assay was used to unbiasedly measure the relative efficiency of all integration sites, as described in Durrant et al.¹². In brief, extracted genomic DNA is tagmented with Tn5 transposase to randomly add adaptors throughout the genome. Then, two nested PCRs are performed, with primers that bind the donor plasmid and the Tn5 adaptor to amplify the donor−genome junction and add Illumina sequencing adaptors. UMIs on the donor plasmid enable counting of the relative frequencies of integration events at each genomic locus.

HEK293FT cells were transfected as previously described, with a non-matching LSR (Bxb1) plasmid replacing the effector plasmid as a control for donor plasmid dilution. Cells were cultured for 2−3 weeks, passaging and analyzing by flow cytometry every 2−3 days at 80% confluency, until the non-matching LSR control was less than 1% mCherry⁺, indicating that the plasmid had nearly completely diluted out. Genomic DNA was extracted using a Quick-DNA Miniprep Plus Kit (Zymo Research), quantified by Qubit HS dsDNA Assay (Thermo Fisher Scientific), and 1 µg of gDNA per sample was DpnI digested (NEB) to remove residual donor plasmid.

Tn5 transposase was purified following the Picelli et al.⁶⁸ protocol. Tn5 adaptors were prepared by annealing top and bottom oligos (100 μM each) at 95 °C for 2 minutes, followed by slow cooling to 25 °C over 1 hour. The transpososome was assembled by combining 85.7 μl of purified Tn5 with 14.3 μl of pre-annealed oligos and incubating at room temperature for 1 hour. Tagmentation reactions contained 150 ng of gDNA, 4 μl of 5× TAPS-DMF, 1.5 μl of transpososome and water to 20 μl total volume. Samples were mixed thoroughly and incubated at 55 °C for 20 minutes. Reactions were placed on ice and purified with a DNA Clean and Concentrator Kit (Zymo Research) according to the manufacturer’s protocol, eluting in 11 μl of nuclease-free water. Sample quality was assessed by Bioanalyzer to verify a fragment size distribution of approximately 1.5−2.5 kb.

For round 1 PCR, each reaction contained 12.5 μl of 2× SuperFi II Master Mix, 1.5 μl of TMAC (0.5 M), 0.5 μl of outer nest donor-specific primer (PR_N165, 10 μM), 0.25 μl of outer nest i5 primer (PR_N163, 10 μM), 1.25 μl of DMSO and 9 μl of tagmented DNA. Cycling conditions were as follows: 98 °C for 2 minutes; 12 cycles of 98 °C for 10 seconds, 68 °C for 10 seconds, 72 °C for 90 seconds; followed by 72 °C for 5 minutes. Products were purified using 0.9× Agencourt AMPure XP SPRI beads and eluted in 11 μl of water.

For round 2 PCR, each reaction contained 25 μl of 2× SuperFi Master Mix, 3 μl of TMAC (0.5 M), 2.5 μl of DMSO, 2.5 μl of i5 primer (PR_N149, 10 μM), 5 μl of i7 donor-specific primer (PR_N184-PR_N204, 10 μM), 2 μl of water and 10 μl of purified round 1 PCR product. Cycling conditions were as follows: 98 °C for 2 minutes; 18−20 cycles of 98 °C for 10 seconds, 68 °C for 10 seconds, 72 °C for 90 seconds; followed by 72 °C for 5 minutes.

For size selection, approximately 40 μl of round 2 PCR product was loaded on a 2% agarose gel, and the smear between 300 bp and 800 bp was excised. DNA was extracted using the Monarch Gel Extraction Kit according to the manufacturer’s protocol. Purified libraries were quantified using a Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific) and pooled at equimolar ratios. Final library quality and molarity were assessed using a KAPA Library Quantification Kit (Roche). Pooled libraries were sequenced on an Illumina NextSeq 2000 or MiSeq with 2 × 300-bp paired-end reads, using NextSeq 1000/2000 Control Software Suite version 1.7.1 or MiSeq Control Software version 4.1.0 and Illumina BaseSpace Software version 7.38.0. Raw sequencing data were processed using a custom bioinformatics pipeline as described in Durrant et al.¹².

To reduce occurrence of index hopping, unique dual i7 and i5 barcodes were used for the attH1 targeted samples in Fig. 3. To directly compare specificity of samples with different numbers of measured integration events, samples were downsampled to the same total UMI count.
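
Downsampling to a common UMI total can be sketched as random sampling without replacement from the pool of observed UMIs. The sampling scheme, the fixed seed and the function name below are assumptions made for illustration; the pipeline described in Durrant et al.¹² may differ.

```python
import random
from collections import Counter


def downsample_umi_counts(site_counts, target_total, seed=0):
    """Randomly draw target_total UMIs without replacement from per-site UMI counts.

    site_counts: {integration site: number of unique UMIs observed}
    Returns a Counter of per-site UMI counts after downsampling.
    """
    pool = [site for site, n in site_counts.items() for _ in range(n)]
    if target_total > len(pool):
        raise ValueError("target_total exceeds the number of observed UMIs")
    return Counter(random.Random(seed).sample(pool, target_total))


# Hypothetical sample with 1,200 UMIs, downsampled to 300.
sample = {"attH1": 900, "attH3": 200, "other": 100}
down = downsample_umi_counts(sample, 300)
print(sum(down.values()))  # 300
```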

LSR−dCas9 and gRNA plasmid design and cloning

Fusion proteins consisting of a catalytically dead Cas9 fused to an LSR and a P2A−GFP were constructed by Gibson assembly into a pUC19-derived plasmid containing the EF1α promoter and an SV40 poly(A) tail. Variable flexible linkers, including (GGS)₈, (GGGGS)₆, XTEN16, XTEN32-(GGSS)₂ and XTEN48-(GGSS)₂, were used to link the dCas9 and LSR. Spacers targeting loci proximal to the LSR integration site and non-targeting controls were cloned into an sgRNA-expressing plasmid via oligo ligation and Golden Gate cloning. Spacer selection was based on PAM sequence and pseudosite proximity.

Designing and cloning the attP library

Two plasmid libraries (attP-L and attP-R) were constructed to determine nucleotide preference within the attP, with each 26-bp half-site mutagenized separately. Integrated DNA Technologies (IDT)-synthesized oligo pools contained 79% WT base and 7% each of the other bases at each position. Single-stranded oligo pools were subjected to second-strand synthesis. First, an oligo anneal reaction containing 2 µl of Library Oligo (100 µM), 4 µl of Klenow primer (100 µM), 3.4 µl of 10× STE Buffer and 24.6 µl of water was heated at 95 °C for 5 minutes and then cooled to room temperature. Next, a Klenow extension reaction containing 34 µl of annealed libraries, 8 µl of water, 5 µl of 10× NEBuffer2, 2 µl of 10 mM dNTPs (NEB) and 1 µl of DNA Polymerase I, Large (Klenow) Fragment (NEB, 5,000 U ml⁻¹) was incubated at 37 °C for 30 minutes, purified (DNA Clean and Concentrator-5) and eluted in 20 µl of nuclease-free water.
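
The doping level determines how many substitutions each library member is expected to carry. Assuming that each of the 26 positions is mutated independently with probability 0.21 (an idealization of the oligo synthesis), the number of substitutions per half-site follows a binomial distribution, as sketched below with hypothetical function names.

```python
from math import comb


def mutation_count_distribution(n_positions=26, wt_fraction=0.79):
    """Binomial probabilities of carrying k = 0..n substitutions in a doped half-site."""
    p_mut = 1 - wt_fraction
    return [comb(n_positions, k) * p_mut ** k * wt_fraction ** (n_positions - k)
            for k in range(n_positions + 1)]


dist = mutation_count_distribution()
mean_mutations = sum(k * p for k, p in enumerate(dist))
print(round(mean_mutations, 2))  # 5.46 expected substitutions per half-site
print(round(dist[0], 4))         # 0.0022, fraction of fully wild-type half-sites
```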

The purified product was cloned by Esp3I Golden Gate cloning into the donor plasmid backbone: 75 ng of pre-digested backbone, 3:1 molar ratio of attP library to backbone, 0.5 µl each of T4 DNA ligase (NEB) and Esp3I (Thermo Fisher Scientific), 1 µl of T4 DNA Ligase Buffer (NEB) and water to 10 µl were incubated at 37 °C for 1 hour, purified and eluted in 6 µl of nuclease-free water. Then, 1 µl of purified library was electroporated into Endura Electrocompetent Cells (Biosearch Technologies) at 10 μF, 600 Ω, 1,800 V, recovered in 2 ml of Lucigen Recovery Medium (37 °C, 1 hour), plated on 245-mm × 245-mm BioAssay dishes and incubated at 30 °C overnight. Final libraries were scraped, purified (NucleoBond Xtra Maxi EF Kit) and sequenced with an Illumina NextSeq 2000.

attP library transfection, harvest and library preparation

Next, 2.2 × 10⁶ HEK293FT cells were plated on 10-cm dishes 1 day before transfection to achieve 70% confluence at transfection. Then, 24 μg of total plasmid DNA was prepared at a 5:1:1 molar ratio (attP library:LSR effector:sgRNA). DNA and 72 μl of Lipofectamine 2000 were separately mixed with 1.5 ml of Opti-MEM, incubated for 5 minutes and then combined and incubated for 10 minutes before adding dropwise to cells. After 3 days, cells were harvested with TrypLE (Gibco), and genomic DNA was extracted using a Quick-DNA Midiprep Plus Kit (Zymo Research).

Integration events were amplified by single-step PCR with i5/i7 index-adding primers using all available genomic DNA. Biological replicates had 1-bp staggered amplicons to increase nucleotide diversity. PCR conditions were as follows: 25 μl of NEBNext High Fidelity PCR Master Mix, 2.5 μg of genomic DNA, 1.25 μl each of the attL or attR i5 or i7 primers (Supplementary Table 8) and water to 50 μl. Cycling conditions were as follows: 25 cycles of 98 °C for 10 seconds, 63 °C for 10 seconds, 72 °C for 25 seconds. PCR products were pooled and run on 2% agarose gel, and correct size bands were extracted (Monarch DNA Gel Extraction Kit). Libraries were quantified (Qubit 1× dsDNA High Sensitivity Assay; Thermo Fisher Scientific), pooled equimolar with 35% PhiX spike-in and sequenced on an Illumina NextSeq 2000 (150-bp paired-end reads).

attP library enrichment analysis

Libraries were demultiplexed using the Illumina BaseSpace automatic demultiplexing workflow. Paired-end reads were merged using BBMerge (version 39.06) and analyzed with a custom Python script. Reads were filtered for exact amplicon length and QScore ≥ 30. Next, percent abundance of each nucleotide at each attP position was calculated for input and output libraries. Enrichment scores were computed using the following equation: \(r=\frac{A/(1-A)}{B/(1-B)}\), where A and B represent the read counts for selected nucleotides in output and input libraries, respectively, normalized to the total number of reads. Enrichment scores were converted to sequence logos, generated using Logomaker⁶⁹ and matplotlib packages.
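
A minimal sketch of the per-position calculation is shown below, assuming that all reads have already been filtered to the same length. It is not the custom script used in this study; the function names are hypothetical, and entries for which the odds ratio is undefined (a frequency of 0 or 1 in either library) are returned as None rather than handled with a pseudocount.

```python
import math
from collections import Counter


def position_frequencies(reads):
    """Per-position nucleotide frequencies for equal-length reads."""
    n_reads = len(reads)
    freqs = []
    for i in range(len(reads[0])):
        counts = Counter(read[i] for read in reads)
        freqs.append({base: counts[base] / n_reads for base in "ACGT"})
    return freqs


def log2_enrichment(output_reads, input_reads):
    """log2 of r = (A/(1-A)) / (B/(1-B)) for every base at every position."""
    out_freqs = position_frequencies(output_reads)
    in_freqs = position_frequencies(input_reads)
    scores = []
    for out_pos, in_pos in zip(out_freqs, in_freqs):
        row = {}
        for base in "ACGT":
            a, b = out_pos[base], in_pos[base]
            if 0 < a < 1 and 0 < b < 1:
                row[base] = math.log2((a / (1 - a)) / (b / (1 - b)))
            else:
                row[base] = None  # odds ratio undefined
        scores.append(row)
    return scores


scores = log2_enrichment(["AC", "AC", "AG", "TC"], ["AC", "TC", "AG", "TG"])
print(round(scores[0]["A"], 3))  # 1.585, that is, log2(3)
```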

Unique library members recovered as integration events were assessed by generating the set of unique reads. The number of unique integration events from NGS analysis was compared to ddPCR analysis of bulk genomic DNA for validation.

Dinucleotide enrichment analysis was performed by first counting individual nucleotide frequencies at each position across all reads, followed by counting all possible dinucleotide combinations using a 2-bp sliding window at consecutive position pairs. Raw counts were normalized to total reads to calculate probabilities for both single nucleotides and dinucleotides at each position. To assess deviation from independence, observed dinucleotide probabilities were divided by the product of their constituent single-nucleotide probabilities: P(dinucleotide) / (P(nucleotide₁) × P(nucleotide₂)).

Enrichment scores were calculated by comparing output to input library frequencies using \(r=\frac{A/(1-A)}{B/(1-B)}\), where A represents output library frequency and B represents input library frequency for each dinucleotide. Final values were log₂ transformed and averaged across dinucleotide categories based on purine (R: A,G) and pyrimidine (Y: C,T) classification: RR (purine−purine), YY (pyrimidine−pyrimidine), RY (purine−pyrimidine) and YR (pyrimidine−purine).
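
The independence test and the purine/pyrimidine grouping can be illustrated as follows. This is a sketch with hypothetical function names that operates on one pair of consecutive positions; it is not the analysis code used in this study.

```python
from collections import Counter

PURINES = set("AG")


def dinucleotide_class(dinucleotide):
    """Classify a dinucleotide as RR, YY, RY or YR (R = purine, Y = pyrimidine)."""
    return "".join("R" if base in PURINES else "Y" for base in dinucleotide)


def observed_over_expected(reads, pos):
    """P(dinucleotide) / (P(nucleotide 1) x P(nucleotide 2)) at positions pos and pos + 1."""
    n_reads = len(reads)
    first = Counter(read[pos] for read in reads)
    second = Counter(read[pos + 1] for read in reads)
    pairs = Counter(read[pos:pos + 2] for read in reads)
    return {
        pair: (count / n_reads) / ((first[pair[0]] / n_reads) * (second[pair[1]] / n_reads))
        for pair, count in pairs.items()
    }


ratios = observed_over_expected(["AG", "AG", "CT", "CT"], 0)
print(ratios)                    # {'AG': 2.0, 'CT': 2.0}
print(dinucleotide_class("AG"))  # RR
```

A ratio above 1 indicates that the two bases co-occur more often than expected if the positions were independent.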

Stem cell transfection

H1 hESCs and WTC-11 iPSCs were cultured in mTeSR Plus medium (STEMCELL Technologies) on Cultrex-coated (Bio-Techne) or Matrigel-coated (Corning) six-well plates. Cells were routinely subcultured at a 1:12 ratio using ReLeSR Passaging Reagent (STEMCELL Technologies) every 4 days or at 70−80% confluency. Three days after splitting (60% confluency), the cells were dissociated for 10 minutes with Accutase (STEMCELL Technologies) and plated in Cultrex-coated 96-well plates at 25,000−30,000 cells per well with 10 µM ROCK inhibitor. The next day (at 70% confluency), media were changed to include 50 µM ROCK inhibitor 2 hours before transfection. Then, 3 µg of plasmid DNA containing a 1:1 molar ratio of combined effector/guide plasmid to donor plasmid in 10-µl volume was diluted in 81 µl of mTeSR Plus and thoroughly pipette mixed. Next, 9 µl of FuGENE HD Transfection Reagent (Promega) was added to the DNA/mTeSR mix, thoroughly mixed and incubated for 12 minutes. After another thorough pipette mix, 7 µl of the DNA was added dropwise to each well. The cells were incubated at 37 °C, splitting 1:2 if 90% confluency was reached. After 3 days, the cells were dissociated with Accutase and split into two V-bottom plates, one for flow cytometry and one for gDNA harvest with QuickExtract DNA Solution (Biosearch Technologies).

HPC differentiation and surface marker staining

hESCs were differentiated into HPCs using the STEMdiff Hematopoietic Kit (STEMCELL Technologies). On day 10 of differentiation, 250 µl of non-adherent cells were collected from the supernatant using wide-bore P1000 tips and transferred to a V-bottom 96-well plate. Next, the cells were pelleted at 400g for 5 minutes, the supernatant was discarded and the cells were resuspended with a wide-bore pipette in 95 µl of Stain Buffer containing 1 µl of each antibody. The following antibodies were used: APC CD81 (BD Biosciences, 551112), APC CD147 (Thermo Fisher Scientific, A15706), Alexa Fluor 647 CD63 (BD Biosciences, 561983), APC/Cyanine7 CD34 (BioLegend, 343514) and PE CD43 (BioLegend, 343204). The cells were incubated in the dark for 20 minutes to 1 hour, washed once with Stain Buffer and analyzed on the Attune Flow Cytometer (Thermo Fisher Scientific) using Attune Cytometric Software version 5.3.0 for collection.

hESC single-cell dilution and genotyping

hESCs were diluted to one cell per 100 µl in mTeSR Plus medium supplemented with 1× CloneR (STEMCELL Technologies) and plated into two 96-well plates per sample. Cells were maintained until colonies were visible, and then wells with multiple colonies were removed. Single colonies were expanded to 24-well dishes when they covered half the surface area of the 96-well plate. At the next split, one quarter of each well was pelleted for gDNA extraction using QuickExtract DNA Solution (Biosearch Technologies). The extracted gDNA was cleaned with 0.9× AMPure XP beads and genotyped by ddPCR. Primers and probes were designed to target the attH1 junction (ddPCR_attH1_1 set), the donor sequence (Amp_forward, Amp_reverse, Amp_probe) and a nearby genomic reference sequence. On-target zygosity was determined by the attH1/reference ratio, and total zygosity was measured by the donor/reference ratio.

Bulk RNA-seq—cell line generation, RNA isolation and sequencing

Stem cells were transfected as previously described. Two days after transfection, cells were selected using Geneticin (Gibco) at 100 μg ml⁻¹ and penicillin−streptomycin (Gibco) at 100 U ml⁻¹. Cells were maintained in culture for 3 weeks until selection was complete and sufficient cell expansion was achieved for downstream applications, including cryopreservation and RNA extraction. Throughout the culture period, cell quality was monitored daily via microscopy to assess morphology and identify spontaneous differentiation events. Culture medium consisting of mTeSR Plus supplemented with penicillin−streptomycin and Geneticin was replaced daily. Upon reaching 70−80% confluency, cells were clump passaged using ReLeSR according to the manufacturer’s instructions.

If spontaneous differentiation was observed, cells were subjected to a straining protocol to remove differentiated cells and maintain pluripotent populations. In brief, media were aspirated, and four drops of ReLeSR were added to each well and incubated for 10 minutes. Cells were gently dislodged by pipetting or tapping the side of the culture dish to release cell clumps. The cell suspension was passed through a 40-μm cell strainer placed on a 50-ml Falcon tube and rinsed with 6 ml of PBS. The strainer was then inverted onto a fresh Falcon tube, and clumps were collected with 3 ml of culture medium before replating into six-well plates.

For RNA-seq, cells from one well of a six-well plate were harvested by adding 1 ml of TRIzol reagent. The lysate was mixed by pipetting until a homogeneous viscosity was achieved and stored at −80 °C until RNA extraction.

For RNA extraction, 200 μl of chloroform was added, followed by vigorous shaking for 15 seconds and incubation at room temperature for 10 minutes. Samples were centrifuged at 12,000g for 15 minutes at 4 °C, resulting in phase separation. The upper aqueous phase containing RNA was carefully transferred to a fresh tube, and 0.5 ml of isopropanol was added and mixed. After a 5−10-minute incubation at room temperature, samples were centrifuged at 12,000g for 10 minutes at 4 °C to precipitate RNA. The supernatant was removed, and the RNA pellet was washed with 1 ml of 75% ethanol, mixed and centrifuged at 7,500g for 5 minutes at 4 °C. The RNA pellet was air dried for 5−10 minutes before resuspension. The extracted RNA was analyzed on a High Sensitivity RNA ScreenTape (Agilent Technologies) to measure the RNA integrity number (RIN) score, which was higher than 9 for all samples.

mRNA enrichment was performed using the Roche/KAPA mRNA HyperPrep Kit according to the manufacturer’s protocol. After mRNA enrichment, sequencing libraries were prepared using the HyperPrep Library Preparation Kit according to the manufacturer’s instructions. Final libraries were sequenced on an Illumina NovaSeq X at a depth of at least 20 million reads per sample.

RNA-seq data processing

Raw paired-end FASTQ files for each stem cell sample were first subjected to adapter and quality trimming using Trim Galore (version 0.6.7)⁷⁰ with default settings, retaining reads ≥20 nt. Quality of raw and trimmed reads was assessed with FastQC (version 0.11.9)⁷¹ and aggregated using MultiQC (version 1.15)⁷². Trimmed reads were aligned to the GRCh38.p13 reference genome (GENCODE version 46 primary assembly, FASTA and GTF obtained from GENCODE) using STAR (version 2.7.10a)⁷³ in two-pass mode. Alignment metrics and insert size distributions were evaluated with Picard (version 2.27.4)⁷⁴, RSeQC (version 4.0.0)⁷⁵, Qualimap (version 2.2.2)⁷⁶, dupRadar (version 3.21)⁷⁷ and Qualimap RNA-seq modules, with reports again aggregated by MultiQC. Concurrently, Salmon (version 1.10.0)⁷⁸ was used to quantify transcript abundances (--validateMappings), and transcript-to-gene summarization was performed to produce gene-level count and transcripts per million (TPM) matrices. All steps were orchestrated via the nf-core/rnaseq pipeline (version 3.12.0)⁷⁹ under Nextflow (version 24.10.0)⁸⁰ with the Docker (version 28) profile, specifying --strandedness reverse⁸¹.

Differential expression analysis

Gene-level count matrices (salmon.merged.gene_counts.tsv) were imported into R (version 4.3.1)⁸², and DESeq2 (version 1.38.1)⁸³ was used for normalization and differential expression. A sample metadata table containing sample_id, condition and group_id was preprocessed so that identifiers matched the column names of the count matrix. For each stem cell line (group_id), the wild-type (‘WT’) condition was identified, and pairwise comparisons were performed between each edited condition and the corresponding wild-type condition. Differential expression was modeled in DESeq2 with the formula ‘~ condition’ (R formula syntax), meaning that gene counts were fit as a function of the experimental condition (edited or wild-type). Wald tests were used to estimate log₂ fold changes, and P values were adjusted for multiple testing by the Benjamini–Hochberg (false discovery rate (FDR)) method. DEGs were defined as those with adjusted P < 0.05 and |log₂ fold change| > 1.
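The DEG thresholds can be illustrated with a short Python sketch of Benjamini–Hochberg adjustment followed by filtering. This is not the DESeq2 implementation (which was run in R and also applies independent filtering); the function names and the plain-list inputs are assumptions for illustration only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity of adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def call_degs(genes, log2fc, pvalues, alpha=0.05, lfc_cutoff=1.0):
    """Genes with adjusted P < alpha and |log2 fold change| > lfc_cutoff."""
    padj = benjamini_hochberg(pvalues)
    return [g for g, lfc, q in zip(genes, log2fc, padj)
            if q < alpha and abs(lfc) > lfc_cutoff]
```

With P values of 0.01, 0.04, 0.03 and 0.5, the adjusted values are 0.04, 0.053, 0.053 and 0.5, so only the first gene can pass the FDR threshold regardless of fold change.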

HEK293FT single-cell sorting and genotyping

HEK293FT cells were transfected as previously described. Eight days after transfection, cells were placed under puromycin selection (0.5 μg ml⁻¹) for 10 days. On day 18, cells were trypsinized and strained through a 35-µm filter, and single mCherry⁺ cells were sorted into four 96-well plates per sample using a FACSAria Fusion (BD Biosciences). Single-cell colonies were expanded for 2 weeks until more than 50% confluent, with visual inspection to ensure single colony growth. Wells with zero or multiple colonies were excluded from analysis.

Confluent colonies were harvested with QuickExtract DNA Solution (Biosearch Technologies) and amplified in two separate PCRs: PCR 1 using primers UMI_reverse and ddPCR_attH1_forward_1, flanking the UMI and attH1 donor/genome junction, and PCR 2 using primers UMI_reverse and UMI_forward, flanking the UMI on the donor plasmid. Amplicons were sequenced via Sanger and/or NGS to determine on-target UMI count (PCR 1) and total UMI count (PCR 2), allowing calculation of on-target and off-target insertion counts per colony.

Quantification of indels at attH1

HEK293FT cells were transfected with LSR and donor plasmids at a 1:5 ratio, as described above. After 3 days, cells were passaged into a 24-well dish for expansion. On day 5 after transfection, genomic DNA was harvested using the Zymo Quick-DNA Miniprep Plus Kit according to the manufacturer’s instructions. PCR primers with 0−5 stagger base pairs were designed to amplify the attH1 site. Each PCR reaction contained 1 μg of gDNA, 40 μl of Platinum SuperFi II Master Mix, 3.2 μl of forward primer (PR_N284−PR_N289, 10 μM), 3.2 μl of reverse primer (PR_N290−PR_N295, 10 μM) and water to 80 μl total volume. After 25 cycles under standard conditions, products were purified with 0.8× AMPure XP beads. A second PCR amplification added Illumina indexes using 1 μl of purified product, 12.5 μl of Platinum SuperFi II Master Mix, 1 μl each of uniquely indexed FLAP2 primers and 9.5 μl of water. After seven cycles, libraries were purified with 0.7× AMPure XP beads, quantified via Qubit, pooled and sequenced using Illumina chemistry.

Indel rates were calculated using CRISPResso2 with the following command: CRISPResso -a CATTGGTGAATGTCTCATGTGGGTTTGAAAAGAGTGTGTATTCTGCTGTTGTTGGGTAAAGTAGTCTATACATGTCAATGATATGCTGTTGATTGATGCTGGTGTTGAATTCAACTATGTCCTTGCTGATTTTCTGCCTGCTGGATCTGTCTGAC -g GTCTATACATGTCAATGATA -r1 Read_1.fastq.gz -r2 Read_2.fastq.gz --keep_intermediate -w 20 -q 30 --min_bp_quality_or_N 30 --exclude_bp_from_left 10 --exclude_bp_from_right 10 --plot_window_size 20 --ignore_substitutions. The Modified% output value represented the percentage of unintegrated cells containing indels. Background indel rates from untransfected cells were subtracted from each sample. The final percentage of cells with indels was calculated by multiplying the Modified% by the percentage of uninserted cells (1 minus the average insertion percentage determined by ddPCR).
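The background subtraction and scaling steps can be written as a small Python sketch. The function name is hypothetical, and clamping negative background-subtracted values to zero is an assumption that is not stated in the protocol.

```python
def corrected_indel_percent(modified_pct, background_pct, insertion_pct):
    """Percentage of all cells carrying an indel at attH1.

    modified_pct   -- CRISPResso2 Modified% for the sample
    background_pct -- Modified% of the untransfected control
    insertion_pct  -- average insertion percentage from ddPCR
    """
    # Assumption: values below background are treated as zero.
    net_modified = max(modified_pct - background_pct, 0.0)
    # Fraction of cells without an insertion (1 minus insertion fraction).
    uninserted_fraction = 1.0 - insertion_pct / 100.0
    return net_modified * uninserted_fraction
```

For example, a Modified% of 5, a background of 1 and 20% insertion give (5 − 1) × 0.8 = 3.2% of cells with indels.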

Cell viability assay

HEK293FT cells were plated in black-walled, clear-bottom optical plates, excluding edge wells and transfected with LSR and donor plasmids at a 1:5 ratio with four replicates per sample. Two days after transfection, cell viability was assessed using the CellTiter-Glo Assay (Promega). Cells were first refreshed with 100 μl of fresh D-10 medium and then treated with 100 μl of combined room temperature CellTiter-Glo Buffer and Substrate. Plates were orbitally shaken at 510 r.p.m. for 2 minutes on a Tecan Spark Microplate Reader and incubated for an additional 8 minutes, and then luminescence was measured with 1,000-ms integration time. Background luminescence from empty wells was subtracted from all measurements. Final viability values were normalized to control cells transfected with donor plasmid and pUC19 stuffer plasmid in place of the LSR effector plasmid.
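The normalization can be sketched in Python as follows. The function name and the use of replicate means are assumptions for illustration; the protocol does not specify whether replicates were averaged before or after normalization.

```python
from statistics import mean

def normalized_viability(sample_lum, control_lum, blank_lum):
    """Background-subtracted luminescence relative to the donor + pUC19 control.

    Each argument is a list of replicate luminescence readings.
    """
    background = mean(blank_lum)            # empty-well luminescence
    sample = mean(sample_lum) - background  # LSR + donor wells
    control = mean(control_lum) - background  # donor + pUC19 stuffer wells
    return sample / control
```

A returned value of 1.0 indicates viability equal to that of the stuffer-plasmid control.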

Phosphorylated H2AX staining and flow cytometry

HEK293FT cells were plated into 96-well plates and transfected as described above. Two days after transfection, cells were dissociated with TrypLE and transferred to a V-bottom plate. Cells were centrifuged at 300g for 5 minutes and washed with 200 μl of DPBS. Next, cells were centrifuged again and resuspended in 50 μl of 4% paraformaldehyde (diluted in DPBS) for fixation. Cells were incubated for 10 minutes at room temperature. After fixation, cells were washed three times and stored in PBS overnight. To permeabilize cells, samples were resuspended in 0.25% Triton-X (diluted in DPBS) and incubated for 15 minutes at room temperature in the dark. Next, cells were washed twice with DPBS and incubated in blocking buffer composed of the following: 10% goat serum (Sigma-Aldrich, G6767), 0.5% NP-40 (Sigma-Aldrich, I3021) and 5% w/v saponin (Sigma-Aldrich, 84510) diluted in DPBS. Samples were incubated in blocking buffer for 30 minutes at room temperature in the dark. After incubation, samples were centrifuged and resuspended in a 1:1,000 dilution of Alexa Fluor 647-conjugated anti-phospho histone H2A.X (Ser139) antibody (Sigma-Aldrich, cat. 05-636-AF647, clone JBW301, lot 4214083) diluted in blocking buffer. Samples were incubated for 2 hours, washed twice with DPBS and analyzed on the Attune flow cytometer.

Quantification of translocations and genomic rearrangements

HEK293FT cells were transfected with LSR and donor plasmids at a 1:5 ratio. After 3 days, cells were passaged into a 24-well dish for expansion. On day 5 after transfection, genomic DNA was harvested using the Quick-DNA Miniprep Plus Kit. Tn5 tagmentation was performed as described above, with two reactions performed per sample.

For enrichment of translocation junctions, tagmented DNA underwent a two-step nested PCR. The first PCR combined 10.5 μl of tagmented DNA with 12.5 μl of Platinum SuperFi II Master Mix, 1 μl of outer nest primer (PR_N296 for upstream or PR_N297 for downstream of attH1) and 1 μl of PR_N163 (Tn5 adaptor binding). Reactions were amplified for 12 cycles (standard three-step protocol, 60 °C annealing, 1-minute extension), purified with 0.9× AMPure XP beads and eluted in 11 μl of water. The second nested PCR added indexes and Illumina adaptors using 10 μl of the first PCR product, 25 μl of Platinum SuperFi II Master Mix, 2.5 μl of inner primer (PR_N298−PR_N327 for upstream samples or PR_N328−PR_N356 for downstream samples), 2.5 μl of PR_N149 and 10 μl of water. After 20 cycles, products were purified with 0.9× AMPure XP beads, quantified by Qubit and pooled equimolarly. Amplicons between 300 bp and 900 bp were selected by gel extraction, quantified using the KAPA Library Quantification Kit and sequenced with Illumina chemistry for 600 cycles.

Genomic rearrangements and translocations were identified using a custom pipeline. After merging paired-end reads, the sequence between the inner primer and the attH1 dinucleotide core (‘upstream sequence’) was searched for, allowing up to three mismatches to account for sequencing errors. Reads containing the upstream sequence were processed to extract downstream portions (minimum 20-bp length), which were then aligned to both WT and donor insertion references using BWA-MEM (-a -M -k 8 -T 20)⁷⁹. Reads were classified based on alignment quality (≥80% alignment and mapping quality (MAPQ) ≥ 20) into WT aligned, donor insertion aligned or potential translocations.

Potential translocation reads underwent further analysis by BWA alignment to the human reference genome (hg38). The resulting alignments were converted to sorted BAM files using SAMtools (version 1.22) for visualization and BED files using BEDTools (version 2.31.0) for genome browser compatibility.

Translocation events were classified into four categories: (1) close to target (within 2 kb of on-target site, reclassified as WT aligned); (2) EF1α promoter aligned (mapping to chr6 region 73,519,610−73,522,070, reclassified as donor insertion aligned); (3) non-chr10 translocations; and (4) chr10 rearrangements.
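The four-way classification can be expressed as a simple decision rule, sketched below in Python. The chr6 interval and the 2-kb window come from the description above; the function name, the category labels and the on-target coordinate passed in by the caller are hypothetical, because the attH1 coordinate is not given in this section.

```python
# EF1a promoter interval on chr6, as stated in the text.
EF1A_REGION = ("chr6", 73_519_610, 73_522_070)

def classify_translocation(chrom, pos, target_pos, target_chrom="chr10", window=2_000):
    """Assign an hg38 alignment (chrom, pos) to one of the four categories."""
    # (1) Within 2 kb of the on-target site: reclassified as WT aligned.
    if chrom == target_chrom and abs(pos - target_pos) <= window:
        return "WT aligned"
    # (2) Maps to the endogenous EF1a promoter: reclassified as donor insertion aligned.
    if chrom == EF1A_REGION[0] and EF1A_REGION[1] <= pos <= EF1A_REGION[2]:
        return "donor insertion aligned"
    # (3) Any other chromosome: interchromosomal translocation.
    if chrom != target_chrom:
        return "non-chr10 translocation"
    # (4) Same chromosome but outside the 2-kb window.
    return "chr10 rearrangement"
```

The order of the checks matters only for reads on the target chromosome, where the distance test must precede the fallback to the rearrangement category.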

To quantify the presence of ITRs at the attH1/donor junction of AAV integrations, the same protocol was followed with the upstream bait primers. All reads containing the upstream sequence were aligned to WT and donor insertion references using BWA (version 0.7.19). All reads that did not align to these references were then aligned to the human genome. Finally, all reads that did not align to the human genome were aligned to AAV ITR sequences using Geneious Prime (version 11.0.20.1+1).

Lentivirus production and HPC transduction

sgRNA spacers targeting cell surface markers CD81, CD147 and CD63 were cloned into the LentiGuide-Puro construct (Addgene, 52963). Lentivirus was generated using the LV-MAX Lentiviral Production Kit (Invitrogen) according to the manufacturer’s instructions and concentrated 100× with Lenti-X Concentrator (Takara). HPCs were diluted to 50,000 cells per well in 100 µl of Medium B (STEMCELL Technologies, STEMdiff Hematopoietic Kit) in a 96-well plate. Each well received 1 µl of LentiBOOST (SIRION Biotech) and 1 µl of lentivirus. Media were changed the next day. Four days after transduction, a subset of cells was stained for cell surface markers. Remaining cells were treated with 1 μg ml⁻¹ puromycin for 4 days to select for transduced cells, followed by cell surface marker staining. Antibodies used for cell surface staining were as follows: APC CD81 (BD Biosciences, cat. 551112, lot 2061009, clone JS-81, 1:100 dilution); APC CD147 (Thermo Fisher Scientific, cat. A15706, lot 540242, clone 8D12, 1:100 dilution); Alexa Fluor 647 CD63 (BD Biosciences, cat. 561983, lot 2112938, clone H5C6, 1:100 dilution); APC CD63 (BioLegend, cat. 353008, lot B373947, clone H5C6, 1:100 dilution); APC/Cyanine7 CD34 (BioLegend, cat. 343514, lot B413134, clone 581, 1:100 dilution); and PE CD43 (BioLegend, cat. 343204, lot B359578, clone CD43-10G7, 1:100 dilution). All antibodies chosen are validated for flow cytometric analysis of human cells according to the manufacturer’s website.

Generating AlphaFold3 models of Dn29 bound to attB

The full-length WT Dn29 protein sequence and minimal attB-L or attB-R sequence (attB-L: GTAGACAAGGAAGGTAATGA; attB-R: GAAATAAGTTTGATAGATAT) were input into the AlphaFold3 web server with the seed set to ‘auto’. Five models were generated for each query of Dn29 bound to an attB half-site. Outputs were manually inspected in PyMOL (version 3.0.2) to ensure correct orientation of Dn29 bound to the half-site, with the dinucleotide core of the DNA proximal to the NTD. One model (Dn29 × attB-R) out of the 10 generated models met this criterion and was selected for further analysis. The chosen model was compared to the Listeria integrase crystal structure of the LSR CTD and attP complex (PDB: 4KIS). Despite 4KIS being bound to attP instead of attB, domain-wise comparisons showed strong alignment: RMSDs were 1.341 and 1.707 for the zinc-ribbon domain and the recombinase domain, respectively (Extended Data Fig. 4b). Protein/DNA interface residues were identified with the InterfaceResidues PyMOL script using default settings.

Predicting combinatorial mutations and feature importance with machine learning

The efficiency and specificity data of all Dn29 variants were split into a training set and a test set according to the round of experimentation in which they were generated. The training set, called round 1, contained all variants from the two single-mutation validation experiments, where mutations were tested individually on top of variant 127 (Fig. 1g and Extended Data Fig. 2c−e) or variant 381 (Extended Data Fig. 2f). The testing set contained all higher-order combinations from the iterative rounds of driver mutation stacking (rounds 2−5). The efficiency (percent of integrations at attH1) was normalized to WT, and specificity (ratio of attH1/attH3 activity) was log transformed. The full amino acid sequences of the protein variants were one-hot encoded, a technique that transforms each amino acid in the sequence into a binary vector of length 21 (corresponding to the 20 standard amino acids plus a stop codon), where the position corresponding to that amino acid is set to 1 and all others are 0. This encoding is then flattened into a single vector representing the entire sequence. Activity in the training set was modeled using linear regression, ridge regression, XGBoost and CatBoost with the scikit-learn (version 1.0.2), xgboost (version 1.6.2) and catboost (version 1.2.5) Python libraries. Additional Python packages used include pandas (version 1.3.5), numpy (version 1.19.5), matplotlib (version 3.5.2), seaborn (version 1.73) and scipy (version 1.7.3).
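The encoding can be illustrated with a minimal Python sketch. The ordering of the 21-symbol alphabet and the use of ‘*’ for the stop codon are assumptions; any fixed ordering yields an equivalent representation.

```python
# 20 standard amino acids plus '*' for a stop codon (ordering is an assumption).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY*"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot_flat(sequence):
    """Flattened one-hot vector of length 21 * len(sequence)."""
    width = len(ALPHABET)
    vector = [0] * (len(sequence) * width)
    for position, aa in enumerate(sequence):
        # Set the single entry for this residue within its 21-wide block.
        vector[position * width + INDEX[aa]] = 1
    return vector
```

A protein of length L therefore becomes a vector of 21 × L entries with exactly L ones, which can be passed directly to a regression model as a feature row.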

For the ridge regression, optimal α was identified through maximization of the testing set R² (α = 0.8 for efficiency model, α = 1.3 for specificity model). Hyperparameter optimizations were conducted for XGBoost and CatBoost by performing a randomized search, evaluating on negative mean squared error, using the following parameters: XGBoost: ‘n_estimators’: [100, 500, 1000], ‘learning_rate’: [0.01, 0.05, 0.1], ‘max_depth’: [3, 5, 7], ‘subsample’: [0.5, 0.6, 0.7, 0.8, 1.0], ‘colsample_bytree’: [0.7, 0.8, 1.0]; CatBoost: ‘iterations’: [100, 200, 500, 1000], ‘learning_rate’: [0.01, 0.05, 0.1, 0.2], ‘depth’: [4, 6, 8, 10], ‘l2_leaf_reg’: [1, 3, 5, 7, 9], ‘bagging_temperature’: [0, 1, 2, 3], ‘random_strength’: [1, 1.5, 2, 3], ‘border_count’: [32, 64, 128], ‘grow_policy’: [‘SymmetricTree’, ‘Depthwise’, ‘Lossguide’].
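The randomized search over the XGBoost grid can be sketched generically in Python. This is not the scikit-learn routine used in the study: the `randomized_search` function, the number of iterations, the seed and the caller-supplied `score_fn` (which would return, for example, a cross-validated negative mean squared error) are all hypothetical, and this sketch samples settings with replacement.

```python
import random

# XGBoost search space as listed in the text.
XGB_GRID = {
    "n_estimators": [100, 500, 1000],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.5, 0.6, 0.7, 0.8, 1.0],
    "colsample_bytree": [0.7, 0.8, 1.0],
}

def randomized_search(grid, score_fn, n_iter=20, seed=0):
    """Sample n_iter random settings from grid and keep the highest-scoring one.

    score_fn takes a parameter dict and returns a score where higher is better
    (e.g. negative mean squared error).
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and uniformly.
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Because only a random subset of the 405 possible XGBoost combinations is evaluated, the selected setting depends on the seed and the number of iterations.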

The following parameters were chosen for each model: XGBoost, specificity: ‘subsample’ = 0.5, ‘n_estimators’ = 1000, ‘max_depth’ = 7, ‘learning_rate’ = 0.1, ‘colsample_bytree’ = 0.8; XGBoost, efficiency: ‘subsample’ = 0.7, ‘n_estimators’ = 100, ‘max_depth’ = 7, ‘learning_rate’ = 0.05, ‘colsample_bytree’ = 0.7; CatBoost, specificity: ‘random_strength’ = 1.5, ‘learning_rate’ = 0.1, ‘l2_leaf_reg’ = 1, ‘iterations’ = 1000, ‘grow_policy’ = ‘Depthwise’, ‘depth’ = 4, ‘border_count’ = 128, ‘bagging_temperature’ = 1; CatBoost, efficiency: ‘random_strength’ = 1.5, ‘learning_rate’ = 0.1, ‘l2_leaf_reg’ = 7, ‘iterations’ = 500, ‘grow_policy’ = ‘Lossguide’, ‘depth’ = 4, ‘border_count’ = 32, ‘bagging_temperature’ = 2.

In vitro transcription and purification of mRNA

Effector constructs were cloned into an in vitro transcription (IVT) plasmid as previously described⁸⁴. This plasmid contained a mutated T7 promoter, 5’ untranslated region (UTR), P2A EGFP and 3’ UTR followed by a 145-bp poly(A) sequence. IVT templates were generated by PCR using primers oGX006 and oLGR009, which incorporate a poly(A) tail and correct the T7 promoter mutation. PCR reactions were performed using KAPA HiFi HotStart 2× (Roche) Master Mix with 6.25 ng of plasmid template per 25-µl reaction. The PCR protocol involved annealing at 63 °C, extending for 45 seconds per kilobase and running for 18 cycles. The reactions were purified using 0.8× volume of SPRI beads and eluted into water. The purified PCRs were analyzed by gel electrophoresis and NanoDrop to ensure correct size and determine concentration.

The IVT reactions were set up using the HiScribe T7 High-Yield RNA Synthesis Kit (NEB, E2040S), modified with full pseudo-UTP substitution using N1-Methyl-Pseudo-U (TriLink Biotechnologies, N-1081) and co-transcriptionally capped with CleanCap AG (TriLink Biotechnologies, N-7113). Each IVT reaction contained 5 mM ATP, CTP, GTP and pseudo-UTP, 4 mM CleanCap AG, 1× Transcription Buffer, 3.75 ng µl⁻¹ DNA template, 1 U µl⁻¹ Murine RNase Inhibitor (NEB, M0314L), 0.002 U µl⁻¹ yeast inorganic pyrophosphatase (NEB, M2403L) and 5 U µl⁻¹ T7 RNA polymerase. Reactions were incubated for 2.5 hours at 37 °C.

Next, the mRNA was purified using lithium chloride. To each reaction, 1.5× volume of water and 1.25× volume of 7.5 M LiCl were added. The solution was chilled at −20 °C for 30 minutes and then spun at maximum speed (16,000g) for 15 minutes at 4 °C. The supernatant was discarded, and the pellet was rinsed with 70% ice-cold ethanol to remove residual salts. After another maximum speed spin for 10 minutes at 4 °C, the mRNA was resuspended in water and stored at −80 °C. The mRNA was analyzed on an Agilent TapeStation and by Qubit RNA High Sensitivity (Thermo Fisher Scientific) to ensure correct size and determine concentration.

RNA electroporation and AAV transduction of primary human T cells

Two days before electroporation, T cells were seeded at 1 × 10⁶ fresh cells per milliliter and activated with a 1:1 bead-to-cell ratio with anti-CD3/CD28 Dynabeads (Life Technologies, 40203D). On the day of electroporation, the beads were magnetically removed, and the T cells were electroporated with 2 µg of LSR−dCas9−P2A−EGFP mRNA and 2 µg of sgRNA (Synthego) for LSR−dCas9 samples or 1 µg of LSR−P2A−EGFP mRNA for LSR samples using the Lonza P3 Primary Cell Kit. Each electroporation contained between 0.5 × 10⁶ and 1 × 10⁶ cells in 20 µl total volume and was electroporated using the 4D Nucleofector system and the DS-137 pulse code. Immediately after electroporation, 80 µl of pre-warmed culture media was added to the Nucleocuvette strip, which was then incubated at 37 °C for 15−30 minutes. Next, 2 × 10⁵ cells per condition were split into 96-well U-bottom plates in 100 µl of serum-free medium (TheraPEAK X-VIVO-15 Serum-free Hematopoietic Cell Medium, BEBP04-744Q) supplemented with 5 ng μl⁻¹ IL-7 and 5 ng μl⁻¹ IL-15. Cells were then transduced at an MOI of 1 × 10⁵ genome copies per cell with ssAAV or scAAV vectors of serotype 6 (AAV6) containing the e-attP sequence, attH1 sgRNA target sequence and an mCherry expression cassette, which were ordered from VectorBuilder. The next morning, cells were spun down at 300g for 5 minutes; the serum-free medium was removed; and cells were resuspended in 200 µl of fresh cX-VIVO. Cells were maintained and passaged as needed by the addition of cX-VIVO every 2−3 days.

Plasmid and mRNA electroporation of primary human T cells

Peripheral blood mononuclear cells (PBMCs) from healthy human blood donors were collected under an approved institutional review board protocol by the Stanford Blood Center and used to isolate human T cells. In brief, leukoreduction chambers from processing of platelet donations were used to isolate PBMCs using density centrifugation with Ficoll (Lymphoprep; STEMCELL Technologies) within SepMate tubes (STEMCELL Technologies) according to the manufacturer’s instructions. Next, primary human CD3⁺ T cells were isolated by negative selection using a Human CD3 T Cell Enrichment Kit (STEMCELL Technologies) according to the manufacturer’s instructions. Isolated primary human CD3⁺ T cells were counted using an automated cell counter (Countess; Thermo Fisher Scientific) and activated using anti-human CD3/CD28 Dynabeads (Cell Therapy Systems; Thermo Fisher Scientific) at a 1:1 cell-to-bead ratio in X-VIVO 15 medium (Lonza) supplemented with 5% FBS (MilliporeSigma) and 50 IU ml⁻¹ human IL-2 (PeproTech). T cells were initially cultured in standard tissue culture incubators at approximately 1 × 10⁶ cells per milliliter of medium. After gene editing/electroporations, T cells were counted and reseeded at approximately 1 × 10⁶ cells per milliliter, with additional IL-2 and X-VIVO 15 complete media added every 2–3 days to maintain a culture density of approximately 1 × 10⁶ cells per milliliter.

Forty-eight hours after activation, Dynabeads were magnetically removed from activated T cell cultures by incubating for 2 minutes at room temperature on a magnet (EasySep Magnet; STEMCELL Technologies), and cells were counted using an automated cell counter (Countess; Thermo Fisher Scientific). For electroporations, 1−2 million T cells per editing condition were gently pelleted by centrifugation at 90g for 10 minutes, followed by careful aspiration of the supernatant. T cell pellets were resuspended in 20 μl per editing condition in P3 buffer (Lonza) and then mixed with prepared LSR mRNA and DNA templates. Then, 1.5 µg of LSR mRNA, 2 µg of donor plasmid, 1.5 µg of sgRNA plasmid and 20 µl of T cell suspension were mixed and aliquoted into a 96-well Nucleocuvette plate (Lonza). All plasmids were purified using the ZymoPure II Plasmid Midiprep Kit (Zymo Research). The 5.8-kb CD19 CAR-expressing plasmid contains the EF1α promoter, tNGFR EC domain (cell surface reporter), T2A, 1928z CAR and bGH poly(A). Total nucleic acid volume was limited to 5 µl. Electroporation occurred on a Gen2 Lonza 4D instrument with a 96-well plate attachment using pulse code EO-151. Immediately after electroporation, 80 μl of pre-warmed X-VIVO 15 media was added to each cuvette, and cells were rested within the cuvettes for 15 minutes in a standard 37 °C tissue culture incubator. The cells were then gently resuspended and transferred to standard 96-well round-bottom plates with 300 µl of total X-VIVO 15 complete medium with 50 IU ml⁻¹ human IL-2. T cells were maintained at 0.5 × 10⁶ to 1 × 10⁶ cells per milliliter, and X-VIVO 15 complete medium with 50 IU ml⁻¹ human IL-2 was refreshed every 2−3 days.

T cell staining, flow cytometry and genomic harvesting

Three days after electroporation, 50 µl of T cells was collected for staining and flow cytometry. In brief, cells were centrifuged, washed once with 200 μl of cell staining buffer and stained with Ghost Dye Red 780 at a 1:1,000 dilution (Tonbo, 13-0865-T500) for 20 minutes in the dark at 4 °C. The cells were measured using an Attune NxT Cytometer with a 96-well autosampler (Invitrogen) and analyzed using FlowJo software (version 10.10.0) for viability, mCherry fluorescence (expressed on the AAV) and GFP fluorescence (effector expression). The remaining 150 µl of T cells in culture was centrifuged at 300g for 5 minutes, and the gDNA was harvested using QuickExtract DNA Solution (Biosearch Technologies) and analyzed by ddPCR as described above.

In vitro cancer target cell-killing assays

At 13 days after non-viral gene editing, T cell editing was assessed by flow cytometry, and cells edited with goldDn29, goldDn29−dCas9 and Dn29−dCas9 were selected for the killing assay. T cells were mixed at indicated effector:target (E:T) ratios with target Nalm6 leukemia cells in 96-well plates, with four different Nalm6 conditions (16,000, 8,000, 4,000 or 2,000 cells per well) and 4,000 T cells per well. Cell killing was assessed by flow cytometry at 48 hours, and the percentage of Nalm6 tumor cell killing was calculated by taking 1 − (no. of Nalm6 cells alive in experimental condition / no. of Nalm6 cells alive in no-T-cell conditions). Effector cells were stained with human NGFR-APC (clone ME20.4, BioLegend, 345108), and target cells were stained with human CD19-PE (clone HIB19, BioLegend, 982402), for flow cytometric analysis.
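The killing calculation can be written as a short Python sketch. The function names are hypothetical; the formula and the cell numbers follow the description above.

```python
def percent_killing(alive_experimental, alive_no_t_cell):
    """Percent Nalm6 killing: 100 * (1 - alive with T cells / alive without T cells)."""
    return 100.0 * (1.0 - alive_experimental / alive_no_t_cell)

def effector_target_ratio(t_cells, target_cells):
    """E:T ratio expressed as a single number (effectors per target cell)."""
    return t_cells / target_cells

# E:T ratios for 4,000 T cells per well against the four Nalm6 seeding densities.
ET_RATIOS = {n: effector_target_ratio(4_000, n) for n in (16_000, 8_000, 4_000, 2_000)}
```

With 4,000 T cells per well, the four Nalm6 densities correspond to E:T ratios of 1:4, 1:2, 1:1 and 2:1.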

Generative artificial intelligence

Artificial intelligence language models (ChatGPT and Claude) were used for generating custom Python scripts for data analysis and visualization, assistance with copyediting and infilling preliminary drafts of some sections based on an author-provided outline. All content generated by artificial intelligence was thoroughly reviewed, edited and verified by the authors.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.