Identification of potential TALEN and CRISPR/Cas9 targets of selected genes of some human pathogens which cause persistent infections

The human pathogens Epstein Barr virus, human papilloma virus, herpes simplex virus-2, hepatitis B virus and Leishmania species can cause persistent infections, which cannot be cured with currently available treatments. The modern gene editing techniques, transcription activator like effector nuclease (TALEN) and clustered regularly interspaced short palindromic repeat / CRISPR associated protein 9 (CRISPR/Cas9), are potential candidates for their treatment. In this study, target sites for TALEN and CRISPR/Cas9 were identified in silico on selected essential and indispensable genes of the above pathogens, with the aim of abolishing the essential functions and curing the infection. The gene sequences of the pathogens were obtained from public databases and conserved sequences were identified. Then potential TALEN target sites were identified. For some selected targets, the off-target effects on the genomes of human, mouse, the same pathogen and other organisms were tested and the putative functions of the mutated proteins were predicted. TALEN targets that had no potential off-target effect and did not lead to mutated proteins with undesirable functions were selected for each gene. The potential CRISPR/Cas9 targets without off-target effect on the human and murine genomes were identified and other off-target effects were evaluated. Results showed that potential TALEN and/or CRISPR/Cas9 targets with higher binding specificity and efficiency were available for the selected genes. It can be concluded that the selected targets can potentially be used to produce the respective proteins, and that in vitro and in vivo applications are potentially possible.


INTRODUCTION
Genome editing is a technique that is used to alter the genomes of organisms. There are mainly three types of genome editing techniques, known as zinc finger nuclease (ZFN), transcription activator like effector nuclease (TALEN) and clustered regularly interspaced short palindromic repeat / CRISPR associated protein 9 (CRISPR/Cas9). All of these are site-specific nucleases (Ochiai & Yamamoto, 2015). Among them, ZFN and TALEN are protein based chimeric nucleases while CRISPR/Cas is RNA based (Nemudryi et al., 2014). TALEN and CRISPR/Cas9 are of greater interest due to their high specificity of binding (Yeadon, 2014). These techniques have recently been used in the gene editing of organisms such as zebrafish (Gonzales & Yeh, 2014), Arabidopsis thaliana (Feng et al., 2013), Zea mays (Char et al., 2015; Kelliher et al., 2017), Oryza sativa (Shen et al., 2017; Han et al., 2019; Usman et al., 2021) and of human cell lines (Yuen et al., 2015).
Since these techniques are based on site-specific nucleases, the specific sites to which the nucleases bind can be determined in advance. The genes that are essential or indispensable for infection, pathogenesis or persistence of the pathogens can be targeted, and these genes could be mutated to cure the infection or to diminish the pathogens' activities. The techniques could be particularly useful against persistent incurable infections and against infections that have developed resistance to currently available drugs. Some examples of causative agents of these persistent infections are herpes simplex virus (HSV), hepatitis B virus (HBV), HPV, HIV, Epstein Barr virus (EBV), cytomegalovirus (CMV), some bacteria, fungi and protists (Boldogh et al., 1996).
The site where the TALEN or CRISPR/Cas9 binds in a specific gene can be predicted, and the TALEN or CRISPR/Cas9 proteins can be designed and synthesised accordingly. The main criterion in designing a TALEN is the positioning of a thymine (T) nucleotide at the 5′ end of the target. In CRISPR/Cas9, the presence of a protospacer adjacent motif (PAM) is the essential criterion (Nemudryi et al., 2014).
The objective of this study was to identify potential TALEN and CRISPR/Cas9 targets and to predict their off-target effects in order to select the best target sites for some selected genes of EBV, HBV, HPV, HSV-2, Leishmania donovani and Leishmania infantum, pathogens which cause persistent infections and for which treatment is not currently available. The genes of these pathogens were selected with the aim of stopping pathogen persistence and replication. In EBV, the gene LMP2A is essential for the persistence of the virus (Longnecker, 2000), and the EBNA1 gene is essential for replication and transcription (Sivachandran et al., 2012). It can be postulated that the introduction of a site-specific nuclease for the LMP2A gene followed by one for EBNA1 would diminish the viral content in the host, leading to eradication of the virus with continual application. In the same way, UL21 and UL30 of HSV-2, which are essential in viral propagation (Le Sage et al., 2013) and replication (Liu et al., 2006), respectively, could be mutated. The replication of HPV can be inhibited by mutating the E2 gene (Sanders & Stenlund, 2000; McBride, 2013), and the oncogenic effect (Leykauf et al., 2008) could also be diminished. The HBx gene of HBV is indispensable for the development of viremia and persistency (Tsuge et al., 2010) and its mutation might stop the viral infection in the host. Leishmania sp., a pathogen that causes persistent infection, could be controlled by mutating the tryR gene, which protects against oxidative stress (Paul et al., 2014).
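The two design criteria described above can be illustrated with a minimal Python sketch. It scans only the forward strand for protospacers followed by an NGG PAM, and checks whether a T immediately precedes a TALE binding site. The function names, the 20-nucleotide protospacer length and the reading of the 5′ T as the base directly before the binding site are assumptions made for illustration; this is not the algorithm of any of the tools used in this study.

```python
import re

def find_ngg_protospacers(seq, length=20):
    """Return (start, protospacer, pam) for each forward-strand site followed by NGG."""
    seq = seq.upper()
    hits = []
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % length)
    for match in pattern.finditer(seq):
        hits.append((match.start(), match.group(1), match.group(2)))
    return hits

def has_five_prime_t(seq, site_start):
    """True when the base immediately 5' of the binding site (0-based start) is T."""
    return site_start > 0 and seq[site_start - 1].upper() == "T"
```

A complete search would also scan the reverse complement, which is omitted here for brevity.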

Selection of pathogen specific genes
The genes LMP2A and EBNA1 of EBV, E2 of HPV type 16, UL21 and UL30 of HSV-2, HBx of HBV and tryR of L. donovani and L. infantum were selected and subjected to TALEN and/or CRISPR/Cas9 target identification. Randomly selected entries were obtained from the databases 'GenBank' (Clark et al., 2016) and 'RefSeq' (O'Leary et al., 2015). These sequences were tested using the tool 'NCBI conserved domains' (Marchler-Bauer et al., 2014) and, from the results, the sequences in which the presence of the gene was confirmed were selected. Then the open reading frame (ORF) responsible for the gene in each sequence was identified using the tool 'ORF Finder' (Rombel et al., 2002). The maximum length ORF of each sequence was used for the analysis.
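The idea of taking the maximum length ORF can be sketched in Python as follows. This is a simplified illustration, not the ORF Finder algorithm: it assumes an ATG start codon, the standard stop codons, and searches the three forward-strand frames only. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Return the longest ATG-to-stop ORF on the forward strand (stop included), or ''."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best
```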

Identification of conserved residues of the gene
Conserved residues of each selected gene were identified using the software 'Unipro UGENE' (Okonechnikov et al., 2012) by aligning the maximum length ORFs of the selected sequences. Then some conserved regions longer than 60 nucleotides were selected for the identification of potential TALEN and CRISPR/Cas9 targets.
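As a hedged illustration of this selection step, the sketch below reports the column ranges of a multiple sequence alignment in which every sequence carries the same nucleotide and no gap. It assumes the alignment is supplied as equal-length strings with '-' for gaps; the function name and the inclusive 60-column default are illustrative choices, and the alignment itself was produced with Unipro UGENE in this study.

```python
def conserved_runs(aligned, min_len=60):
    """Return (start, end) column ranges, end exclusive, that are fully conserved."""
    if not aligned:
        return []
    width = min(len(s) for s in aligned)
    runs, run_start = [], None
    for col in range(width):
        column = {s[col].upper() for s in aligned}
        conserved = len(column) == 1 and "-" not in column
        if conserved and run_start is None:
            run_start = col
        elif not conserved and run_start is not None:
            if col - run_start >= min_len:
                runs.append((run_start, col))
            run_start = None
    # Close a run that extends to the last alignment column.
    if run_start is not None and width - run_start >= min_len:
        runs.append((run_start, width))
    return runs
```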

Identification of TALEN target sites
The tool 'TALEN Targeter' was used to identify the TALEN target sites. The selected conserved sequences of a selected gene were used as the input data. The predesigned TALEN architecture of Miller et al. (2011) was used to design TALEN targets and 'NH' was selected as the G substitute repeat variable diresidue (RVD). Then the parameters were adjusted to hide redundant TALENs in the output. In addition, the guidelines of Streubel et al. (2012) were applied in the analysis. From the output, several TALEN targets having the highest percentage of 'HD or NH' RVDs in the respective TALENs and having at least one unique restriction site in the spacer region were selected for further analyses. This procedure was followed for all selected conserved sequences in the selected genes, LMP2A and EBNA1 of EBV, E2 of HPV type 16, HBx of HBV and tryR of L. donovani and L. infantum.
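The 'HD or NH' percentage used for ranking can be computed as in the following minimal sketch. The base-to-RVD mapping (A to NI, C to HD, G to NH, T to NG) is the commonly used TALE code with NH chosen for G as stated above; treating it as the exact mapping applied by TALEN Targeter is an assumption, and the function names are hypothetical.

```python
# Assumed one-to-one TALE code, with NH as the G-specific RVD.
RVD_FOR_BASE = {"A": "NI", "C": "HD", "G": "NH", "T": "NG"}

def rvds_for_site(site):
    """Translate a TALE binding site (DNA) into its list of RVDs."""
    return [RVD_FOR_BASE[base] for base in site.upper()]

def percent_strong_rvds(rvds):
    """Percentage of RVDs in the array that are HD or NH."""
    strong = sum(1 for rvd in rvds if rvd in ("HD", "NH"))
    return 100.0 * strong / len(rvds)
```

Because HD and NH pair with C and G, this percentage equals the GC content of the binding site under the assumed mapping.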
Journal of the National Science Foundation of Sri Lanka 49 (3) September 2021

Identification of the potential off-target effect of the respective TALENs of the selected TALEN targets
The potential off-target effect, or target specificity, of the selected TALENs was identified for the human genome and the murine genome using two bioinformatic tools, TAL Effector Nucleotide Target 2.0 (Doyle et al., 2012) and PROGNOS (Fine et al., 2013). The RVD sequences of the TALEN targets selected above were used as the input data for both tools.
Apart from these, the probable unnecessary bindings on the genome of the selected pathogen were also determined using the tool Paired Target Finder. There, a genome sequence of the pathogen in the 'RefSeq' database was used as the target sequence and the RVD sequences of the selected TALENs as the query sequence.

Identification of potential off-target effect of TALENs on other organisms
Basic local alignment search tool (BLAST) (Altschul et al., 1990) was used to identify the off-target effect of the TALENs corresponding to the selected TALEN targets in the genomes of other organisms. Three methods were followed in the procedure. In the first method, the TALEN target sequence, with the spacer region in lowercase letters, was entered as the query sequence in the nucleotide BLAST tool. Filters were selected to mask lowercase letters and the search was carried out keeping all the other parameters at their defaults. In the second method, instead of entering the whole TALEN target sequence as the query, only the TALE regions were entered as a continuous sequence by removing the spacer region. The same procedure as mentioned above was then carried out, except for the masking of lowercase letters. In the third method, all the nucleotides of the spacer region were replaced by the letter 'N' and the result was used as the query sequence. The rest of this procedure was the same as that of the second method.
The sequences suspected of having an off-target effect from the BLAST results were tested again using the tool 'Paired Target Finder', entering the NCBI accessions of the suspected sequences as the target and the RVD sequence of the respective TALEN target as the query.
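The three query formats described above can be assembled as in the following sketch. The function and key names are hypothetical, and the example only builds the query strings; submitting them to BLAST and choosing the lowercase-masking filter is done separately.

```python
def blast_queries(left_site, spacer, right_site):
    """Build the three query strings for a TALEN target (left TALE, spacer, right TALE)."""
    left, right = left_site.upper(), right_site.upper()
    return {
        # Method 1: whole target, spacer in lowercase so it can be masked.
        "method_1": left + spacer.lower() + right,
        # Method 2: the two TALE regions joined, spacer removed.
        "method_2": left + right,
        # Method 3: spacer replaced by the same number of N characters.
        "method_3": left + "N" * len(spacer) + right,
    }
```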

Identification of putative functions of the mutated protein
A sequence of a selected gene (Supplementary Table 1) was obtained and first, the nucleotides of the sequence were numbered from the 5′ to the 3′ end using the Group DNA option of the tool Sequence Manipulation Suite (Stothard, 2000).
Then the probable cut site of the first TALEN target was marked in the sequence and one nucleotide adjacent to the cut site was deleted. The resulting sequence was filtered to remove unnecessary numbering and spaces using the Filter DNA option of the tool Sequence Manipulation Suite. Then the ORFs in the sequence were identified using the tool 'ORF Finder'. After that, the ORFs that encoded amino acid sequences longer than 75 amino acids and that passed through the cut site were obtained. Each of these sequences was searched with the Protein BLAST tool (Altschul et al., 1990). The same procedure was followed after deleting two nucleotides adjacent to the cut site. Then the same procedures were followed for all the other potential TALEN targets of the same gene and of the other selected genes.
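The effect of deleting one, two or three nucleotides at a cut site can be illustrated with a short Python sketch. It assumes the standard genetic code and a reading frame starting at position 0 of the supplied sequence; the function names and the toy sequence in the usage note are illustrative, and the study itself used Sequence Manipulation Suite and ORF Finder for these steps.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def delete_at_cut(seq, cut_index, n_deleted):
    """Remove n_deleted bases starting at the 0-based cut_index."""
    return seq[:cut_index] + seq[cut_index + n_deleted:]

def translate(seq):
    """Translate from position 0 until the first stop codon or the sequence end."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For the toy sequence ATGGCTGAAACCTGA (MAET), deleting one base at index 4 shifts the frame and gives MVKP, whereas deleting three bases at index 3 keeps the frame and only removes one residue (MET).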

Identification of potential CRISPR/Cas9 target sites and their potential off-target effects
Potential CRISPR/Cas9 target sites were identified and the off-target effect was predicted using the tool CCTop (Stemmer et al., 2015). First, a conserved residue of a selected gene of the selected pathogen was entered into the tool as plain text. Then the maximum number of mismatches that an off-target could possess was set to four and the human genome (Homo sapiens GRCh38/hg38) was selected to identify off-targets. The other settings were kept at their defaults and the query was submitted for analysis. In the same way, off-targets in the murine genome were identified by selecting the mouse genome (Mus musculus GRCm38/mm10). Then the above procedures were carried out for all the selected conserved residues of the genes LMP2A and EBNA1 of EBV, E2 of HPV type 16, UL21 and UL30 of HSV-2, HBx of HBV and tryR of L. donovani and L. infantum.
The tool CCTop displays the CRISPR/Cas9 target sequences of the query sequence in the order of off-targets in the selected genome, from targets with null off-target effect to the targets with the highest off-target effect. Among them, the targets with null off-target effect on the human genome and on the murine genome were selected and overlaid to identify the targets with null off-target effect common to both genomes. The binding efficacies of the designed CRISPR/Cas9 nucleases were determined using the tool CRISPRater (Labuhn et al., 2018).
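The mismatch threshold described above can be expressed as a simple position-by-position comparison. The sketch below is only an illustration of counting mismatches between a guide and a candidate genomic site of equal length; it is not CCTop's scoring, which also considers where the mismatches lie, and the function names are hypothetical.

```python
def mismatches(guide, site):
    """Count positions at which the guide and a candidate site differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have equal length")
    return sum(1 for g, s in zip(guide.upper(), site.upper()) if g != s)

def within_mismatch_limit(guide, site, max_mismatches=4):
    """True when the site differs from the guide at no more than max_mismatches positions."""
    return mismatches(guide, site) <= max_mismatches
```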
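The overlay of the human and murine results amounts to a set intersection, as in the following minimal sketch. It assumes that each result set has been reduced to a mapping from target sequence to the number of predicted off-targets; this data layout and the function name are assumptions, not the CCTop output format.

```python
def common_null_off_targets(human_counts, mouse_counts):
    """Return targets with zero predicted off-targets in both genomes, sorted."""
    human_clean = {t for t, n in human_counts.items() if n == 0}
    mouse_clean = {t for t, n in mouse_counts.items() if n == 0}
    return sorted(human_clean & mouse_clean)
```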


Identification of the potential off-target effect of selected CRISPR/Cas9 targets on genomes of other organisms
The potential CRISPR/Cas9 target sites identified above were used as the query and BLASTN searches were carried out. From the results, the targets showing an off-target effect on other genomes were identified.

Obtaining the ORF of confirmed gene sequence
The presence of the selected genes, LMP2A and EBNA1 of EBV, E2 of HPV type 16, UL21 and UL30 of HSV-2, HBx of HBV and tryR of Leishmania species, in the obtained sequences was confirmed from the results of the tool NCBI Conserved Domains. The ORF responsible for the gene in each sequence was identified from the results of the tool ORF Finder (Supplementary Table 2). The confirmation of a gene sequence is important because in some instances annotation errors are present in the sequences available in the NCBI database. The identification of the ORF of the selected gene sequence is also important because a sequence obtained from the databases may contain regions that do not belong to the ORF of the gene. If such regions are present in the sequence, TALEN and/or CRISPR/Cas9 target sites may be identified in those regions too.

Selection of conserved residues to identify TALEN and CRISPR/Cas9 target sites
According to the selected TALEN architecture, the maximum length of a TALEN target site is 60 nucleotides. Therefore, from the gene sequence alignment results (Supplementary Table 3), the conserved sequences with a minimum length of 60 nucleotides were selected for each gene (Supplementary Table 4). The same conserved sequences were used for the identification of CRISPR/Cas9 target sites. HPV types 16, 18, 31, 33, 34, 35, 39, 45, 51, 52, 56, 58, 59, 66, 68 and 70 were initially selected because these are the high-risk types for cancer (Burd, 2003). However, no conserved domain longer than 60 nucleotides was observed among them. HPV type 16 was then considered because it is the type responsible for the highest percentage of cancers among the HPV types (National Cancer Institute of USA, 2017). For Leishmania sp. also, no conserved sequence of sufficient length for a TALEN target was identified. Therefore, the Leishmania donovani complex, which includes the species L. donovani, L. infantum and L. chagasi, was considered because it is the cause of visceral leishmaniasis (Sundar & Rai, 2002). Visceral leishmaniasis is the most severe form of leishmaniasis (Das et al., 2016). However, the tryR gene sequence of L. chagasi was not available in the databases GenBank, RefSeq, EMBL-EBI or DDBJ, and therefore the sequences of the other two species were used. For HBV, a fully conserved residue longer than 60 nucleotides common to all genotypes was not identified, but a partially conserved sequence was selected (Supplementary Table 4).

Designing of potential TALEN target sites
Potential TALEN target sites were identified in each of the above selected conserved residues of the LMP2A and EBNA1 genes of EBV, the E2 gene of HPV type 16, and the tryR gene of L. donovani and L. infantum. The tool 'TALEN Targeter' provides both the TALEN targets and the RVD sequences for the targets as the output. Apart from this, it shows the unique restriction sites in the spacer region. Supplementary Table 5 contains the TALEN target results obtained for each sequence. For the HBx gene of HBV, the targets were identified for each genotype separately using the whole HBx gene as the query, and targets were also identified for a selected, partially conserved sequence, considering all genotypes. Then for each gene, several TALEN targets with a higher percentage of HD or NH in their RVDs and having at least one unique restriction site in the spacer region were selected. TALENs having a high percentage of HD or NH were selected because the binding specificity and the efficiency are higher when the percentage of HD or NH is high (Streubel et al., 2012). The selection of target sites with unique restriction sites is beneficial in the experimental identification of TALEN activity (Doyle et al., 2012).

Potential off-target effect of selected TALENs on human genome, murine genome and unnecessary areas of the pathogen genome
The off-target effect of the TALENs to their targets on human genome, murine genome and unnecessary loci of same pathogen genome were identified for LMP2A and EBNA1 gene of EBV, E2 gene of HPV type 16 and tryR gene of L. donovani and L. infantum, using the tools PROGNOS and Paired Target Finder. The off-target effect of the TALENs separately selected for the HBx gene of each genotype and the TALENs common to HBx gene of all the HBV genotypes were also identified.

It is essential to identify the off-target effect on the human genome because, if any off-target is present for a TALEN corresponding to an identified target site, it may cause mutations in the human genome. The off-target effect on the mouse genome was also identified because preliminary toxicity tests are mostly carried out in vivo using the mouse as the model organism, so that unnecessary mutations in the mouse genome could be avoided. Two tools, Paired Target Finder and PROGNOS, were used in order to minimise the errors in the identification of the off-target effect. The off-target effects on the human and murine genomes were identified with respect to the genomes already available in the tools. These genomes are consensus sequences and therefore, it cannot be concluded that the output given by the tools is valid for every human and mouse, but it would be valid for most. The use of the two tools minimises this effect because the genome entries are different in the two tools.
In addition, unnecessary bindings on the same pathogen genome were identified in order to minimise the effect of mutations in other genes of the pathogen. If any undesired mutation occurs, it might not be suitable for the host or might elevate the pathogenic effect. For this purpose, the tool Paired Target Finder was used because the tool PROGNOS offers no option to check for off-targets in NCBI sequences other than the genomes already entered into the tool. Here a selected representative genome was used for each pathogen for convenience, and no conclusion can be drawn about the off-target effect in every isolate and strain of the pathogen.
The score given by the tool Paired Target Finder is the key by which the tool differentiates the off-targets from the on-targets. The score is given to the TALEN target based on the types of the RVDs present and the matching percentage of the RVDs. A perfectly matching off-target gives the same score as the on-target of the query TALEN sequence, and the score increases as the number of mismatches increases. Doyle et al. (2012) suggested that the maximum score that an off-target would have is four times the score of the on-target. In the tool, only the off-targets below the maximum value are displayed. In the tool PROGNOS, the score has been adjusted to decrease as the off-target deviates from the on-target. Therefore, a perfect off-target has the same score as the on-target. The results of the tool display the off-targets up to a selected number of mismatches, and the off-targets with a higher potential of binding with the TALEN are indicated. The maximum number of mismatches that an off-target could contain was set to five for all the TALEN target sites of the selected genes, so that the off-target effects of those TALENs could be compared.

Identification of off-target effect of the TALENs on genomes of other organisms
The identification of the off-target effect of the TALENs on the genomes of other organisms is necessary. If the designed TALEN proteins are released to the environment, there is a chance of mutating the genomes of other organisms in the environment. The selection of target sites that lack off-targets in the genomes of other organisms prevents this undesirable effect. Furthermore, humans and mice are inhabited by numerous species of commensals, and the induction of mutations in them can also be predictively prevented with this step. The three methods of BLAST search gave the desired results with comparative merits and demerits, and the results were in different formats (Figure 1). In this way the off-target effects of the selected TALENs of LMP2A and EBNA1 of EBV, the E2 gene of HPV type 16, the tryR gene of L. donovani and L. infantum, the HBx gene of each genotype of HBV and the TALENs common to the HBx gene of all the genotypes of HBV were identified. The 'nucleotide' search page of the tool BLAST was used because the tools Paired Target Finder and PROGNOS identify only the off-targets in selected genomes. However, in the BLAST tool it was a challenge to identify the targets/off-targets because the query (the TALEN target site) contained the spacer region, which is not involved in the specific binding of a TALEN, and therefore three BLAST search methods were used. Method one of the BLAST search displayed the highest number of probable targets/off-targets when compared with the other two methods. However, the identity and the query coverage in the output were calculated including the spacer region, although it was masked in the BLAST search. This interferes with the differentiation of off-targets from on-targets. In result type 1 of method two, only the targets in range one and range two that lay at a distance of not more than 30 nucleotides from each other were selected, because the maximum spacer length across which a TALEN can bind is thirty (30) nucleotides (Doyle et al., 2012).
In result type 2 of method two, the results are more effective because the spacer region is not considered in calculating the percent identity and query coverage. In method three, the results for the on-targets/off-targets are mostly similar to those of result type 2 of method two. In these two results (result type 2 of method two and method three), the identification of the on-targets/off-targets is comparatively easier than in the other two, by referring to the graphical alignment. The sequences suspected of having off-target effects were further tested with the tool Paired Target Finder, because the BLAST tool is not specific for the purpose and a TALEN binding score is unavailable in the BLAST tool. Only the sequences similar to the query sequence (TALEN target sequence) can be identified with a BLAST search, and that does not show that the TALEN of that query sequence binds to the off-target sequence.

Putative functions of the mutated protein by TALEN
The putative functions of the mutated protein resulting from the double stranded break made by each TALEN of the selected genes were identified. The TALENs that were shown to produce proteins with unnecessary functions were avoided. The identification of the putative function of the mutated protein is important because the mutated proteins might have undesirable functions. The proteins formed due to a frame shift of one base pair and a frame shift of two base pairs were considered, but a deletion of three base pairs was not considered because it does not change the reading frame and only leads to alterations of a few amino acids, resulting in slight deviations from the initial function of the protein. The ORFs of the mutated protein that passed through the cut site of the TALEN were selected because the other ORFs are not mutated. In addition, only the ORFs leading to proteins with more than 75 amino acids were considered, because smaller proteins might not possess specific functions. However, there are small proteins with key functions in the cell (Reichman-Fried & Raz, 2014), and mutations may change the function of such proteins; this is a limitation of the present study. The putative functions of the proteins of the selected ORFs were identified with the protein BLAST tool, where the functions of proteins similar to the query were considered as the function of the query protein. Most of the ORFs returned no BLAST hits and hence their putative functions could not be determined. This might be a potential source of error because these proteins may also have some undesirable functions. This can be identified to some extent by de novo protein function analysis. The partial proteins of the original were not further considered because they might not have novel functions. However, some partial proteins could have undesirable functions in the cell.

Selected TALEN target sites
The target sites were selected for each gene that did not show an off-target effect in the human genome and murine genome in both results of the tools Paired Target Finder and PROGNOS, which did not show an unnecessary binding effect in the same pathogen genome and did not produce mutated proteins with undesirable function. Table 1 shows the target sites selected for the LMP2A and EBNA1 genes of EBV, E2 gene of HPV type 16, tryR gene of L. donovani and L. infantum and HBx gene of HBV.

Identification of potential CRISPR/Cas9 targets
The results obtained for the LMP2A gene of EBV are shown in Figure 2. Among them, T1, T2, T3, T4, T5, T6 and T7 were observed to lack off-targets in the human genome (Supplementary Table 6). The same target sites, when numbered in the order of off-target effect on the murine genome, were T65, T5, T21, T9, T30, T46 and T60, respectively. Among them, T5 and T9 did not show an off-target effect on the murine genome. These targets are the same as T2 and T4 in the arrangement of the off-targets in the human genome.
The selected targets were searched for the presence of an off-target effect on the genomes of other organisms using the 'nucleotide BLAST' tool. From the results it was identified that the target CRLMP2A01 possesses an off-target effect on the Schistosoma rodhaini genome. Therefore, target CRLMP2A02 was selected as the potential CRISPR/Cas9 target in the LMP2A gene. No genome with an off-target effect with respect to the target CRLMP2A02 was identified. In the same way, CRISPR/Cas9 targets were selected and their potential of binding to other genomes was observed. However, targets without an off-target effect on the human and murine genomes were not observed for the E2 gene of HPV type 16 and the HBx gene of HBV. Table 2 shows the CRISPR/Cas9 targets selected for the EBNA1 gene of EBV, the UL21 and UL30 genes of HSV-2 and the tryR gene of L. donovani and L. infantum, respectively. The efficacy of the guide RNA of each selected target was analysed with the tool CRISPRater (Labuhn et al., 2018). A score of 0.56 or below shows low efficacy, a score between 0.56 and 0.74 shows medium efficacy and high efficacy is shown by scores equal to or above 0.74. This scoring system is easily applicable for selecting efficient targets due to its ability to select efficient guide RNAs. Supplementary Figure 1 represents the structure and action of the CRISPR nuclease.
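The efficacy classes stated above can be written as a small helper. This sketch only encodes the thresholds given in the text (0.56 and 0.74); it does not compute the CRISPRater score itself, and the function name is hypothetical.

```python
def efficacy_class(score):
    """Classify a CRISPRater score using the thresholds stated in the text."""
    if score <= 0.56:
        return "low"
    if score >= 0.74:
        return "high"
    return "medium"
```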
The protospacer adjacent motif (PAM) selected for the CRISPR/Cas9 targets in this study was NGG of Streptococcus pyogenes. However, other PAM motifs can be substituted for NGG, which might change the off-target effect. The off-targets of the CRISPR/Cas9 nucleases might differ by two base pairs from an on-target (Cho et al., 2014). Therefore, in our study we considered the targets with fewer than three base pair mismatches as off-targets. Targets with more than four base pair mismatches mostly prevent the double-strand break and were therefore omitted. The cleavage efficiency of the CRISPR system varies greatly between target sites and between cell types or organisms. The efficacy of binding and cleavage of the CRISPR system depends on several features: features of the guide RNA, genetic features including epigenetics, and energetic properties have been identified through various studies as factors involved in determining the efficacy of the guide RNA (Cong et al., 2013; Fu et al., 2013; Wang et al., 2014; Chari et al., 2015; Liu et al., 2020). The putative function of the protein mutated by CRISPR/Cas9 was not identified because the cut site cannot be exactly determined.
In this research, only the pathogens with double stranded DNA genomes were considered because the TALEN and CRISPR/Cas9 nucleases cannot function on single stranded genomes or on RNA genomes. However, Abudayyeh et al. (2017) have recently identified the CRISPR/Cas13 system, which can be applied to RNA genomes.
Delivery of the CRISPR/Cas9 nucleases or the TALENs to the required specific cell type is a problem that has not yet been solved. Specific strategies should be used for the successful transportation of the mRNAs of the nucleases to the cytoplasm and of the resulting nucleases to the nucleus of the cell. Viral vectors, microinjection, electroporation and chemical methods are a few of the currently used methods (Glass et al., 2018). The selection of a specific cell type by the nuclease is also important, and Cheng et al. (2020) have described a tissue specific nanoparticle based method to deliver CRISPR mRNA.
The development of site-specific nucleases raises concerns from an ethical perspective (Rodriguez, 2016). The main ethical concern is the balance between risks and benefits. The loss of ecological equilibrium could occur. Apart from that, the regulation of the product for consumers is also an essential criterion to be evaluated. These ethical questions should be addressed in the development of TALEN and CRISPR/Cas nucleases as treatment strategies.

CONCLUSION
The identified potential TALEN and CRISPR/Cas9 targets may be applicable for designing specific mutagenic agents against EBV, HPV type 16, HSV-2, HBV, L. donovani and L. infantum, and can be further developed as a treatment strategy. Furthermore, fully conserved residues of sufficient length for a TALEN target site are absent in the HBx gene when all the genotypes of HBV are considered, and in the E2 gene when all high-risk types of HPV are considered.