Macrel is a computational pipeline that can:
- classify peptides into antimicrobial/non-antimicrobial,
- classify peptides into hemolytic/non-hemolytic,
- predict peptides from genomes (provided as contigs) or metagenomes (provided as short reads) and output all the predicted antimicrobial peptides found.
See the usage section for more information.
If you use this software in a publication please cite
MACREL: antimicrobial peptide screening in genomes and metagenomes Celio Dias Santos-Junior, Shaojun Pan, Xing-Ming Zhao, Luis Pedro Coelho bioRxiv 2019.12.17.880385; doi: https://doi.org/10.1101/2019.12.17.880385
NOTE: This is still a work in progress and, while the results of the tool should be correct, we are still working on making Macrel easier to install and use.
IMPORTANT: Macrel is also available as a webserver; please pay us a visit.
Macrel represents a joint effort of Celio Dias Santos Jr., Shaojun Pan, Xing-Ming Zhao, and Luis Pedro Coelho from the Institute of Science and Technology for Brain-Inspired Intelligence (ISTBI) at Fudan University (Shanghai, China).
Antimicrobial peptides (AMPs) are peptides with a wide variety of biological activities (such as anticancer, antibacterial, antifungal, and insecticidal), and their sequences are key to that activity. Microbes that produce AMPs can limit the growth of other microorganisms and should be regarded as a regular source of these peptides. Microbial AMPs are quite distinct from eukaryotic ones because they can also be produced by nonribosomal synthesis. Nonribosomal peptides can therefore adopt different structures, such as cyclic or branched ones, and carry modifications such as N-methyl and N-formyl groups, glycosylation, acylation, halogenation, or hydroxylation. Examples of commercial microbial AMPs include polymyxin B and vancomycin, both FDA-approved antibiotics (Zhang and Gallo, 2016).
Most AMPs are peptides 10-50 residues long (some reach 100 amino acids), with charges between 2 and 11, and consist of approximately 50% hydrophobic residues (Zhang and Gallo, 2016). The formation of ordered amphiphilic structures drives membrane binding and disruption, a key AMP feature. Helix destabilization can often reduce the cytotoxicity of AMPs, although it may also reduce their antimicrobial effect (Malmsten, 2014; Brogden, 2005; Pasupuleti et al., 2012; Hancock and Sahl, 2006; Shai, 2002; Stromstedt et al., 2006). The structure and topology of AMPs change dynamically during their interaction with microbial cell membranes (Samson, 1998). Electrostatic interaction between AMPs and the negatively charged outer membrane surface of prokaryotic cells is the primary mechanism of antimicrobial activity. Most AMP activity is associated with rupture of the cell membrane, which causes leakage of the cell contents. In other cases, the AMP translocates across the cell membrane and inhibits essential cellular processes (e.g. protein synthesis, nucleic acid synthesis, enzymatic activities) (Brogden, 2005). Based on their mechanism of action, AMPs are categorized into membrane-acting and non-membrane-acting peptides.
The promise of the genomic era contrasts with the reality that hundreds of available bacterial genomes have so far failed to deliver the hoped-for new molecular targets for antibiotics. However, these efforts have focused on active molecules produced by metabolism rather than on active peptides or proteins. The best reason to bet on host defense antimicrobial peptides is that they have remained potent for millions of years, which makes them a useful basis for a new generation of antimicrobials to address the growing worldwide problem of antibiotic resistance. The main obstacles to mining AMPs from genomic and metagenomic data sets are the prediction of small genes from (meta)genomic or (meta)transcriptomic sequences and the prediction of which peptides are active AMPs.
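The typical AMP properties listed above (length, charge, and fraction of hydrophobic residues) can be checked on a candidate sequence with a few lines of Python. This is a minimal sketch, not Macrel's implementation: the function name `summarize_peptide`, the choice of hydrophobic residues, and the crude integer charge (K and R counted as +1, D and E as -1) are assumptions made for illustration, and the example sequence is hypothetical.

```python
# Assumption: a simple hydrophobic residue set chosen for illustration only.
HYDROPHOBIC = set("AILMFVWC")


def summarize_peptide(seq):
    """Return length, a crude net charge, and the hydrophobic fraction."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    # Crude charge estimate: +1 per K/R, -1 per D/E (ignores pH and termini).
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    hydrophobic = sum(1 for aa in seq if aa in HYDROPHOBIC)
    return {
        "length": len(seq),
        "net_charge": positive - negative,
        "hydrophobic_fraction": hydrophobic / len(seq),
    }


# Hypothetical toy peptide, not a real AMP.
summary = summarize_peptide("KKLLKKLLKKLL")
print(summary)
```

A real charge calculation depends on pH and a pK scale, so the integer count here only gives a rough indication of whether a peptide falls in the range described above.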
Current methods for small gene prediction typically lead to high rates of false positives (Hyatt et al., 2010). Recent smORF surveys demonstrated that these methods, when followed by filtering of false positives, can recover biologically active smORFs (Miravet-Verde et al., 2019; Sberro et al., 2019). Furthermore, predicting AMP activity demands techniques other than homology-based methods, because homology searches degrade on short sequences. Several machine learning-based methods have demonstrated high accuracy in predicting antimicrobial activity in peptides (Xiao et al., 2013; Meher et al., 2017; Bhadra et al., 2018), although none of them provides a full pipeline to extract AMPs from genomic data and filter out mispredictions. Our main goal with Macrel is a high-throughput, machine learning-based screening system for AMPs that can retrieve AMP sequences with high confidence from (meta)genomic reads.
Macrel can be used in a wide range of scenarios, such as screening for novel AMPs, generating candidates for further testing and patenting, and determining microbiome quorum sensing mechanisms that link AMPs to health conditions or the presence of disease.
The Macrel pipeline performs:
- quality trimming of single- and paired-end reads,
- assembly of reads into contigs,
- small gene prediction,
- clustering of peptides at 100% similarity and 100% coverage,
- calculation of the features of the predicted peptides,
- classification of peptides into AMPs using Random Forests,
- classification of AMPs according to their hemolytic activity, also using Random Forests,
- calculation of AMP abundance in (meta)genomic samples by read mapping.
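Clustering at 100% similarity and 100% coverage amounts to collapsing identical peptide sequences into one representative. The sketch below illustrates that idea only; the function name `cluster_identical`, the input format, and the stripping of a trailing `*` stop marker are assumptions, not a description of how Macrel implements this step.

```python
def cluster_identical(peptides):
    """Group peptide names by exact sequence.

    `peptides` is an iterable of (name, sequence) pairs. Sequences are
    compared case-insensitively after removing a trailing '*' (assumed
    here to mark a translated stop codon).
    """
    clusters = {}
    for name, seq in peptides:
        key = seq.upper().rstrip("*")
        clusters.setdefault(key, []).append(name)
    return clusters


# Hypothetical predicted peptides from two contigs.
predicted = [
    ("contig1_1", "MKKLLA*"),
    ("contig2_7", "MKKLLA"),
    ("contig2_9", "MKRLLA"),
]
for sequence, names in cluster_identical(predicted).items():
    print(sequence, names)
```

Keeping the list of member names for each unique sequence preserves the link back to the contigs each peptide came from.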
Macrel is fast and works by coordinating NGLess, megahit, prodigal and PALADIN. It is implemented in Python and R. Its models were trained with the scikit-learn Python module, and the descriptors are calculated with the Peptides R package.
The 22 descriptors adopted by Macrel are hybrid: they combine local and global context to encode each sequence. Macrel first performs a distribution analysis (Figure 1) of three classes of residues for two features (solvent accessibility and free energy of transfer from water to a lipophilic phase), as shown in Table 1. The novelty of this method is the use of the free energy of transfer from water to a lipophilic phase (FT), first described by Von Heijne and Blomberg (1979), to capture the spontaneity of the conformational change that AMPs undergo when they move from water into the membrane. For more information about the other descriptors used in Macrel and the algorithms used to train the classifiers, please refer to the Macrel preprint.
Figure 1. Method of sequence encoding using CTD (Composition, Distribution and Transition). (Source: Dubchak et al., 1995)
Table 1. Classes adopted for the sequence encoding of the distribution at the first residue of each class. Solvent accessibility was adopted as in previous studies (Dubchak et al., 1995, 1999); the new feature FT was adapted from Von Heijne and Blomberg (1979).
| Properties | Class I | Class II | Class III |
| Solvent accessibility | A, L, F, C, G, I, V, W | R, K, Q, E, N, D | M, S, P, T, H, Y |
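The distribution at the first residue of each class can be sketched in Python using the solvent accessibility classes from Table 1. The function name `first_residue_distribution`, the normalization (1-based position of the first residue of a class as a percentage of the peptide length), and the value 0 for an absent class are assumptions made for illustration; the exact encoding used by Macrel is described in the preprint.

```python
# Solvent accessibility classes as listed in Table 1.
SOLVENT_ACCESSIBILITY = {
    "I": set("ALFCGIVW"),
    "II": set("RKQEND"),
    "III": set("MSPTHY"),
}


def first_residue_distribution(seq, classes=SOLVENT_ACCESSIBILITY):
    """Position of the first residue of each class, as % of sequence length.

    Assumption: positions are 1-based and a class that never occurs
    is reported as 0.0.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    result = {}
    for label, residues in classes.items():
        position = next(
            (i for i, aa in enumerate(seq, start=1) if aa in residues), 0
        )
        result[label] = 100.0 * position / len(seq)
    return result


# Hypothetical four-residue peptide: K is class II, A is class I, M is class III.
print(first_residue_distribution("KAMG"))
```

The same function can encode another property by passing a different `classes` mapping.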
Despite the high accuracy and sensitivity of sequence encoding, other works suggest that methods independent of sequence order, mostly based on cheminformatics, achieve comparable statistics (Boone et al., 2018). Fjell et al. (2009) showed, using a combination of 77 QSAR (quantitative structure-activity relationship) descriptors, that artificial neural network models can predict the extent of peptide activity and not only classify peptides. These approaches can therefore be combined to improve performance and offset each other's weaknesses: sequence-order-independent methods are weak at classification but describe activity well, whereas sequence encoding is essential for good classification but fails to predict the extent of activity.
The other descriptors (independent of sequence order) used in the Macrel classifiers are widely used to describe AMPs, as follows:
- tinyAA (A + C + G + S + T)
- smallAA (A + B + C + D + G + N + P + S + T + V)
- aliphaticAA (A + I + L + V)
- aromaticAA (F + H + W + Y)
- nonpolarAA (A + C + F + G + I + L + M + P + V + W + Y)
- polarAA (D + E + H + K + N + Q + R + S + T + Z)
- chargedAA (B + D + E + H + K + R + Z)
- basicAA (H + K + R)
- acidicAA (B + D + E + Z)
- charge (pH = 7, pKscale = "EMBOSS")
- pI (pKscale = "EMBOSS")
- aindex (relative volume occupied by aliphatic side chains - A, V, I, and L)
- instaindex -> stability of a protein based on its amino acid composition
- boman -> overall estimate of the potential of a peptide to bind to membranes or other proteins as receptor
- hydrophobicity (scale = "KyteDoolittle") -> GRAVY index
- hmoment (angle = 100, window = 11) -> quantitative measure of the amphiphilicity perpendicular to the axis of any periodic peptide structure, such as the alpha-helix or beta-sheet
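The residue-group descriptors in this list are simple composition measures. Macrel computes them with the Peptides R package; the Python sketch below only illustrates the idea. The function name `composition_fractions` and the choice to report each group as a fraction of the peptide length are assumptions.

```python
# Residue groups exactly as listed above.
GROUPS = {
    "tinyAA": "ACGST",
    "smallAA": "ABCDGNPSTV",
    "aliphaticAA": "AILV",
    "aromaticAA": "FHWY",
    "nonpolarAA": "ACFGILMPVWY",
    "polarAA": "DEHKNQRSTZ",
    "chargedAA": "BDEHKRZ",
    "basicAA": "HKR",
    "acidicAA": "BDEZ",
}


def composition_fractions(seq):
    """Fraction of residues in each group (assumption: normalized by length)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return {
        name: sum(seq.count(aa) for aa in residues) / len(seq)
        for name, residues in GROUPS.items()
    }


# Hypothetical toy peptide made of lysine and leucine only.
for name, value in composition_fractions("KKLLKKLLKKLL").items():
    print(f"{name}: {value:.2f}")
```

The remaining descriptors (charge, pI, aindex, instaindex, boman, hydrophobicity, hmoment) rely on published scales and are not reproduced here.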
We opted to use random forests after testing alternative algorithms. The AMP classifier was trained with the same parameters and data sets used by Bhadra et al. (2018), while the hemolytic peptide classifier was trained and tested with the data sets previously established by Chaudhary et al. (2016).
The models mentioned here were implemented to filter out non-AMP peptides and to classify AMPs as hemolytic or non-hemolytic. After that, Macrel also submits the predicted AMPs to a decision tree (Figure 2), which classifies them into 4 families according to their nature (cationic or anionic) and structure (linear or disulfide bond forming). These classifications are reported in an output table together with the sequence, a random identifier, the hemolytic nature, and the associated probabilities.
Figure 2. Decision tree for the classification of peptides into different classes according to their composition and capacity to form disulfide bonds (Legend: AcidicAA - acidic amino acids: B + D + E + Z; BasicAA - basic amino acids: H + K + R).
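A decision tree of this kind can be sketched in Python. The actual split rules are those of Figure 2, which is not reproduced here, so the thresholds below are assumptions: a peptide is called cationic when it has more BasicAA than AcidicAA residues (ties go to anionic), and disulfide bond forming when it contains at least two cysteines. The function name `classify_family` is hypothetical.

```python
def classify_family(seq):
    """Assign (nature, structure) to a peptide.

    Assumed rules, for illustration only:
    - cationic if BasicAA (H, K, R) outnumber AcidicAA (B, D, E, Z),
      otherwise anionic;
    - disulfide bond forming if at least two cysteines are present,
      otherwise linear.
    """
    seq = seq.upper()
    acidic = sum(seq.count(aa) for aa in "BDEZ")
    basic = sum(seq.count(aa) for aa in "HKR")
    nature = "cationic" if basic > acidic else "anionic"
    structure = "disulfide bond forming" if seq.count("C") >= 2 else "linear"
    return nature, structure


# Hypothetical toy peptides.
print(classify_family("KKLLKKLL"))
print(classify_family("DCEECD"))
```

The two binary decisions combine into the four families mentioned in the text.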
Benchmarks showed that the Macrel models retrieve AMPs with statistics similar to those of the top state-of-the-art methods (Table 2). The AMP prediction model was evaluated in two configurations: trained with the small training data set (1:3 positives to negatives) and trained with the unbalanced data set from AMPep (1:50 AMPs to non-AMPs). The results (Table 2) show that the Macrel models are comparable at retrieving AMPs from the testing data set, with accuracies very close to those of the best systems. Moreover, the Macrel AMP classifier trained with the unbalanced data set outperforms other methods in precision, which is beneficial when working with (meta)genomic samples, which usually contain few AMPs per sample.
Table 2. Comparison of Macrel and other state-of-the-art AMP prediction systems. All systems were tested with the benchmark data set from Xiao et al. (2013).
| MACREL 1:3 | 0.953 | 0.972 | 0.935 | 0.971 | 0.91 | This study |
| MACREL 1:50 | 0.946 | 0.998 | 0.895 | 0.998 | 0.90 | This study |
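Classifier comparisons like this one rest on standard binary classification statistics such as accuracy, sensitivity, specificity, precision, and the Matthews correlation coefficient (MCC). The sketch below shows how these are derived from a confusion matrix; the function name `binary_metrics` and the counts in the example are hypothetical and are not taken from the Macrel benchmark.

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Compute common binary classification statistics from raw counts."""
    total = tp + tn + fp + fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        # MCC is conventionally set to 0 when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Hypothetical counts: 100 AMPs and 100 non-AMPs in a test set.
for name, value in binary_metrics(tp=90, tn=95, fp=5, fn=10).items():
    print(f"{name}: {value:.3f}")
```

Precision is the statistic most affected by false positives, which is why it matters when true AMPs are rare in a sample.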
Meanwhile, the hemolytic prediction model implemented in Macrel performs comparably to the state-of-the-art methods previously tested by Chaudhary et al. (2016), as shown in Table 3. The MCC also shows that our model performs similarly to the models implemented by Chaudhary et al. (2016).
Table 3. Comparison of the performance of different hemolytic activity prediction systems. All the systems were trained and benchmarked with HemoPI-1 data sets used by Chaudhary et al., 2016.
Our classifiers are therefore well suited to filtering out non-AMP peptides and classifying the remaining AMPs as hemolytic or non-hemolytic. The models implemented in Macrel use the same set of descriptors and, although they are not the most accurate, they deliver highly precise results, which is important when working with (meta)genomic samples.