Introduction

Abstract

The RPCT toolkit is a dedicated toolkit based on the RAAC-PSSM protein classification prediction method, which developed by Zuo's Lab. It uses 7 feature extraction methods and SVM (Support Vector Machines) for protein classification prediction. You can find almost all common functions which used in protein classification in the RPCT toolkit, and get better classification models and more diverse results analysis.

1. RPCT Model Structure

RPCT uses 7 feature extraction methods to extract effective features ( raaPSSM, OAAC, SAAC, raaKmer, raaKPSSM, raaSW, raaDTPSSM ) from amino acid sequences, selects key features through three feature selection methods ( F-score, Relief, PCA ) and trains classification models through SVM(Fig1.1).

RPCT Classification Process

Fig1.1. RPCT Classification Process

2. Advantages Of RPCT Model

Why we choose machine learning?

Traditional protein analysis methods are mostly physical and chemical methods. Such as X-ray crystal diffraction and nuclear magnetic resonance technology. It not only waste a lot of time, but also consumes a lot of manpower and material resources. Mining the features in the protein sequence and predicting protein through machine learning can not only greatly improve the prediction efficiency, but also obtain higher-accuracy results compare with experimental analysis.

What is the RAAC?

The concept of Reduced Amino Acids was proposed in 1960. It has great potential in sequence alignment and structure prediction(Fig1.2). In Zuo's article, the PseKRAAC method has been proposed and a web server based on PseKRAAC has been built. In Zheng's web server, 74 types of Reduce Amino Acid Codes and literature sources have been listed in detail.

rpct_view

Fig1.2. One of the RAAC Maps from Zheng's Web server

What are the advantages of RAAC?

As the complexity of prediction models and the diversity of feature extraction continue to increase, the cost of protein prediction based on machine learning is also increasing quickly. Amino acid reduction can not only simplify the protein sequence, make the sequence more concise and clear, but also reduce the dimension of the feature space and reduce the input cost of machine learning. Moreover, through different reduction schemes, we can get more accurate prediction results and richer biological significance(Fig1.3).

IFS-both

Fig1.3. Two methods of feature selection (Left: IFS-RF; Right: IFS-PCASVD)

20aa

Fig1.4. The classification weight of different Amino Acids

heatmap

Fig1.5. The classification weight of different Amino Acids pairs

pie

Fig1.6. Distribution of the first 200 dimensional features

weblogo

Fig1.7. Sequence Reduce Weblogo

How many platforms does the RPCT toolkit support?

We have built a convenient program on Windows and Linux platforms respectively. Users can run the Windows version of the program in CMD to open the GUI, set parameters and submit tasks through the interactive window.

gui
Fig1.8. RPCT toolkit GUI

Users can also run Linux version programs in CMD to submit tasks and parameters directly through Linux commands.

cmd
Fig1.9. RPCT toolkit CMD

In addition, we developed pyrpct (a python package based on RPCT method) for those who need further development, and we described the API of all functions in detail. Users can not only download and install it from GitHub, but also use the pip command to call pyrpct easily.

pip install git@github.com:KingoftheNight/RPCT.git -U

Tips

You should read the following process to learn how to use the RPCT toolkit: