- Quick start
- Introduction
- Installation
- Subcommands
- Other functions
- Phylogenomics pipeline
- Grid Computing
- Input formats
- Output formats
- Citation
git clone https://github.com/zhangrengang/orthoindex.git
cd orthoindex
# install
conda env create -f OrthoIndex.yaml
conda activate OrthoIndex
pip3 install .
# test
cd example_data/
sh example.sh
# example.sh:
# dot plots
# A
soi dotplot -s Populus_trichocarpa-Salix_dunnii.collinearity.gz \
-g Populus_trichocarpa-Salix_dunnii.gff.gz -c Populus_trichocarpa-Salix_dunnii.ctl \
--kaks Populus_trichocarpa-Salix_dunnii.collinearity.ks.gz \
--xlabel 'Populus trichocarpa' --ylabel 'Salix dunnii' \
--ks-hist --max-ks 1.5 -o Populus_trichocarpa-Salix_dunnii \
--plot-ploidy --gene-axis --number-plots
# B
soi dotplot -s Populus_trichocarpa-Salix_dunnii.orthologs.gz \
-g Populus_trichocarpa-Salix_dunnii.gff.gz -c Populus_trichocarpa-Salix_dunnii.ctl \
--kaks Populus_trichocarpa-Salix_dunnii.collinearity.ks.gz \
--xlabel 'Populus trichocarpa' --ylabel 'Salix dunnii' \
--ks-hist --max-ks 1.5 -o Populus_trichocarpa-Salix_dunnii.o \
--plot-ploidy --gene-axis --number-plots \
# homology input
# C
soi dotplot -s Populus_trichocarpa-Salix_dunnii.collinearity.gz \
-g Populus_trichocarpa-Salix_dunnii.gff.gz -c Populus_trichocarpa-Salix_dunnii.ctl \
--xlabel 'Populus trichocarpa' --ylabel 'Salix dunnii' \
--ks-hist -o Populus_trichocarpa-Salix_dunnii.io \
--plot-ploidy --gene-axis --number-plots \
--ofdir OrthoFinder/OrthoFinder/Results_*/ --of-color # coloring by Orthology Index
# D
soi dotplot -s Populus_trichocarpa-Salix_dunnii.collinearity.gz \
-g Populus_trichocarpa-Salix_dunnii.gff.gz -c Populus_trichocarpa-Salix_dunnii.ctl \
--kaks Populus_trichocarpa-Salix_dunnii.collinearity.ks.gz \
--xlabel 'Populus trichocarpa' --ylabel 'Salix dunnii' \
--ks-hist --max-ks 1.5 -o Populus_trichocarpa-Salix_dunnii.io \
--plot-ploidy --gene-axis --number-plots \
--ofdir OrthoFinder/OrthoFinder/Results_*/ --of-ratio 0.6 # filtering by Orthology Index
# filter orthologous synteny
soi filter -s Populus_trichocarpa-Salix_dunnii.collinearity.gz -o OrthoFinder/OrthoFinder/Results_*/ \
-c 0.6 > Populus_trichocarpa-Salix_dunnii.collinearity.ortho.test
# or (alternative input format)
soi filter -s Populus_trichocarpa-Salix_dunnii.collinearity.gz -o Populus_trichocarpa-Salix_dunnii.orthologs.gz \
-c 0.6 > Populus_trichocarpa-Salix_dunnii.collinearity.ortho.test
# compare with the expected output: no output via `diff`
diff Populus_trichocarpa-Salix_dunnii.collinearity.ortho Populus_trichocarpa-Salix_dunnii.collinearity.ortho.test
Note: To run the full phylogenomics pipeline of SOI,
each GENE ID must be labeled with its SPECIES ID (e.g. Angelica_sinensis|AS01G00001) for compatibility.
See details on how to prepare the data.
In any case, the GENE/CHROMOSOME IDs in the input files must at least be consistent and unique.
Figure. The Orthology Index in identifying orthologous synteny: a typical case.
A) Ks-colored dot plots showing synteny detected by WGDI, with an observable distinction of three categories of syntenic blocks derived from three evolutionary events (three peaks: Ks ≈ 1.5, Ks ≈ 0.27, and Ks ≈ 0.13).
B) Ks-colored dot plots illustrating orthology inferred by OrthoFinder2, with an observable high proportion of hidden out-paralogs (Ks ≈ 0.27).
C) Orthology Index (OI)-colored dot plots: integrating synteny (A) and orthology (B), with polarized and scalable distinction of three categories of syntenic blocks (three peaks: OI ≈ 0, OI ≈ 0.1, and OI ≈ 0.9).
D) Ks-colored dot plots of synteny after applying an OI cutoff of 0.6, with clean 1:1 orthology as expected from the evolutionary history.
A-D are plotted using the dotplot subcommand with four subplots:
a) dot plots colored by Ks or OI (x-axis and y-axis, chromosomes of the two genomes; a dot indicates a homologous gene pair between the two genomes),
b) histogram (with the same color map as the dot plots) of Ks or OI (x-axis, Ks or OI; y-axis, number of homologous gene pairs),
c-d) synteny depth (orthologous synteny depth indicating relative ploidy) derived from 50-gene windows (x-axis, synteny depth; y-axis, number of windows).
Orthology Index (OrthoIndex or OI) incorporates algorithmic advances of two methods (orthology inference and synteny detection), to determine the orthology of a syntenic block. It is straightforward, representing the proportion of orthologous gene pairs within a syntenic block.
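The definition above can be illustrated with a minimal Python sketch. It is not the code used by soi: the function name and the gene IDs are hypothetical, and it only assumes that a syntenic block is a list of gene pairs and that orthologs are available as unordered pairs.

```python
def orthology_index(block_pairs, ortholog_pairs):
    """Return the fraction of gene pairs in one syntenic block that are also orthologs."""
    if not block_pairs:
        return 0.0
    # Store orthologs as unordered pairs so that (a, b) and (b, a) both match.
    orthologs = {frozenset(pair) for pair in ortholog_pairs}
    hits = sum(1 for pair in block_pairs if frozenset(pair) in orthologs)
    return hits / len(block_pairs)


# Hypothetical block of four syntenic gene pairs, three of which are orthologs.
block = [("Sp1|g1", "Sp2|g1"), ("Sp1|g2", "Sp2|g2"),
         ("Sp1|g3", "Sp2|g3"), ("Sp1|g4", "Sp2|g4")]
orthologs = [("Sp2|g1", "Sp1|g1"), ("Sp1|g2", "Sp2|g2"), ("Sp1|g3", "Sp2|g3")]
print(orthology_index(block, orthologs))  # 0.75
```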
You can install the environment and the latest version using conda or mamba:
git clone https://github.com/zhangrengang/orthoindex.git
cd orthoindex
mamba env create -f OrthoIndex.yaml
mamba activate OrthoIndex
pip3 install .
soi -h
Sometimes, creating the environment from OrthoIndex.yaml may fail due to conflicts. In that case, you can install the dependencies as below:
mamba create -y -n orthoindex python=3.8.8
mamba install -y -n orthoindex biopython networkx lazy-property drmaa psutil matplotlib \
mafft trimal 'iqtree>=2' newick_utils pal2nal mcl muscle \
wgdi orthofinder aster
mamba activate orthoindex
pip3 install .
soi -h
Alternatively, the released version can be installed through conda or mamba:
mamba create -n OrthoIndex
mamba install -n OrthoIndex -c conda-forge -c bioconda soi
mamba activate OrthoIndex
soi -h
To use the container, you need to have installed Apptainer or Singularity. Then you can download the container image and run:
apptainer remote add --no-login SylabsCloud cloud.sylabs.io
apptainer remote use SylabsCloud
apptainer pull orthoindex.sif library://shang-hongyun/collection/orthoindex:1.2.0
./orthoindex.sif soi -h
The image can be found here.
The subcommand filter filters orthologous blocks with a default minimum index of 0.6:
Usage examples:
# from outputs of WGDI and OrthoFinder
soi filter -s wgdi/*.collinearity -o OrthoFinder/OrthoFinder/Result*/ > collinearity.ortho
# from outputs of MCscanX and OrthoMCL
soi filter -s mcscanx/*.collinearity -o pairs/orthologs.txt > collinearity.ortho
# from a list file and decrease the cutoff
ls wgdi/*.collinearity > collinearity.list
soi filter -s collinearity.list -o OrthoFinder/OrthoFinder/Result*/ -c 0.5 > collinearity.ortho
# filter an out-paralogous peak
soi filter -s wgdi/*.collinearity -o OrthoFinder/OrthoFinder/Result*/ -c 0.05 -upper 0.4 > collinearity.para
# remove intra-species, tandem repeat-derived synteny (in-paralogous)
soi filter -s wgdi/*.collinearity -o OrthoFinder/OrthoFinder/Result*/ -gff all_species_gene.gff -d 200 > collinearity.homo
# It can also distinguish in-paralogous from out-paralogous synteny split by a given speciation event,
# if we provide in-paralogs instead of orthologs
soi filter -s wgdi/SP1-SP1.collinearity -o inparalogs.pairs > collinearity.inpara
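The cutoff logic behind these examples can be sketched in a few lines of Python. This is an illustration only: the function name and block IDs are hypothetical, the index values are made up, and treating both bounds as inclusive is an assumption rather than documented soi behavior.

```python
def select_blocks(block_oi, lower=0.6, upper=1.0):
    """Keep block IDs whose index lies in [lower, upper] (inclusive bounds are assumed)."""
    return [block_id for block_id, oi in block_oi.items() if lower <= oi <= upper]


# Hypothetical per-block Orthology Index values.
block_oi = {"block1": 0.92, "block2": 0.11, "block3": 0.0, "block4": 0.63}

orthologous = select_blocks(block_oi)                # like the default cutoff of 0.6
out_paralogous = select_blocks(block_oi, 0.05, 0.4)  # like -c 0.05 -upper 0.4
print(orthologous)     # ['block1', 'block4']
print(out_paralogous)  # ['block2']
```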
The subcommand 'cluster' groups orthologous syntenic genes into syntenic orthogroups (SOGs), through constructing an orthologous syntenic graph and applying the Markov Cluster (MCL) algorithm to perform graph clustering and break weak links.
Usage examples:
# all species to include
soi cluster -s collinearity.ortho -prefix cluster
# exclude outgroup species that do not share the INGROUP-specific WGD event
soi cluster -s collinearity.ortho -outgroup XXX YYY
The default output file is cluster.mcl, in the legacy OrthoMCL orthogroup format.
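To give a feel for the graph step, the sketch below groups genes that are linked by orthologous syntenic pairs. It is a deliberately simplified stand-in: it uses plain connected components (union-find) instead of MCL, so it does not break weak links as the cluster subcommand does. All names and gene IDs are hypothetical.

```python
def connected_groups(edges):
    """Group genes linked by orthologous syntenic pairs (connected components, not MCL)."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving keeps trees shallow
            gene = parent[gene]
        return gene

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), []).append(gene)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical orthologous syntenic gene pairs across three species.
edges = [("Sp1|g1", "Sp2|g1"), ("Sp2|g1", "Sp3|g1"), ("Sp1|g2", "Sp2|g2")]
print(connected_groups(edges))
# [['Sp1|g1', 'Sp2|g1', 'Sp3|g1'], ['Sp1|g2', 'Sp2|g2']]
```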
The subcommand 'outgroup' retrieves syntenic orthologs from outgroups that lack WGDs shared with ingroups.
Usage examples:
# If outgroups are excluded in the last `cluster` step:
soi outgroup -s collinearity.ortho -og cluster.mcl -outgroup XXX YYY > cluster.mcl.plus
The subcommand ‘phylo’ reconstructs multi-copy or single-copy gene trees, by aligning protein sequences with MAFFT v7.481 (Standley and Katoh 2013), converting protein alignment to codon alignment with PAL2NAL v14 (Suyama et al. 2006), trimming alignments with trimAl v1.2 (Capella-Gutierrez et al. 2009) (parameter: -automated1) and reconstructing maximum-likelihood trees with IQ-TREE v2.2.0.3 (Minh et al. 2020).
Usage examples:
# output multi-copy gene trees of both protein and CDS (-both); rooted with grape (-root)
soi phylo -og cluster.mcl.plus -pep pep.faa -cds cds.fa -both -root Vitis_vinifera -pre mc-sog -p 80
# output single-copy gene trees (-sc) and concatenated alignments (-concat) of both protein and CDS (-both); rooted with grape (-root)
soi phylo -og cluster.mcl.plus -pep pep.faa -cds cds.fa -both -root Vitis_vinifera -pre sc-sog -sc -concat -p 80
# output multi-copy gene trees of protein, allowing up to 50% taxa missing
soi phylo -og cluster.mcl.plus -pep pep.faa -mm 0.5
The subcommand dotplot enables visualization and evaluation of synteny,
colored by the Orthology Index, Ks values, or subgenome/ancestor assignments.
Usage examples:
# basic dotplot with Ks coloring (gene-axis is default)
soi dotplot -s collinearity.ortho -g all.gff -c xy_chrs.ctl --kaks wgdi_ks.tsv -o dot_ks
# use base-pair coordinates instead
soi dotplot -s collinearity.ortho -g all.gff -c xy_chrs.ctl --kaks wgdi_ks.tsv --bp-axis -o dot_bp
# color by Orthology Index (takes priority over Ks)
soi dotplot -s collinearity.ortho -g all.gff -c xy_chrs.ctl -orth orthologs.txt --of-color --of-ratio 0.6 -o dot_oi
# with ploidy subplots (a-d panels)
soi dotplot -s collinearity.ortho -g all.gff -c xy_chrs.ctl --kaks wgdi_ks.tsv --ks-hist --plot-ploidy --number-plots -o dot_full
# specify chromosomes directly (no ctl file)
soi dotplot -s collinearity.ortho -g all.gff --xchrs Pt1 Pt2 --ychrs Sd3 Sd4 -o dot_chrs
# auto-detect chromosomes by species name from GFF
soi dotplot -s collinearity.ortho -g all.gff --xsp Vitis_vinifera --ysp Daucus_carota -o dot_sp
# ancestor chromosome bars (--xbars triggers bar, data from --xanc)
soi dotplot -s collinearity.ortho -g all.gff -c xy_chrs.ctl --xanc anc_x.txt --xbars --xbarlab -o dot_bar
# bars with explicit file (no need for --xanc when not coloring dots)
soi dotplot -s collinearity.ortho -g all.gff -c xy_chrs.ctl --xbars alt_anc.txt -o dot_bar_file
# color bars by subgenome; dots by subgenome (y axis)
soi dotplot -s collinearity.ortho -g all.gff -c xy_chrs.ctl --yanc anc_y.txt --colorby-sg y --ybars --bar-colorby-sg -o dot_sg
# color dots by ancestor colors (from ancestor file color column)
soi dotplot -s collinearity.ortho -g all.gff -c xy_chrs.ctl --xanc anc_x.txt --colorby-anc x -o dot_anc_color
# custom subgenome colors for bars and dots
soi dotplot -s collinearity.ortho -g all.gff -c xy_chrs.ctl --xanc anc_x.txt --xbars --bar-colorby-sg --sg-colors '#FF0000' '#00FF00' '#0000FF' '#FFFF00' -o dot_custom
The subcommand depth enables visualization of synteny depth (window-based),
with one genome as the reference.
Usage examples:
# specify multiple queries:
soi depth -s collinearity.ortho -g ../all_species_gene.gff -r Vitis_vinifera -q Daucus_carota Angelica_sinensis Apium_graveolens
# specify window size and step:
soi depth -s collinearity.ortho -g ../all_species_gene.gff -r Vitis_vinifera -q Daucus_carota Angelica_sinensis Apium_graveolens --window_size 60 --window_step 1
The subcommand ksplot plots Ks distributions with three visualization types:
histogram, density curve, and ridge plot.
Usage examples:
# histogram + density + ridge plots (default)
soi ksplot --kaks wgdi_ks.tsv -o ks_plot
# histogram only, with custom max Ks
soi ksplot --kaks wgdi_ks.tsv -o ks_hist -p hist --max-ks 1.5
The subcommand evaluate generates multi-panel diagnostic plots to evaluate and compare synteny patterns,
including fractionation rate, block size decay, Orthology Index, and copy-number statistics.
Usage examples:
# basic evaluation
soi evaluate -s collinearity.ortho -o orthologs.txt -g all.gff -r Vitis_vinifera -pre eval
# filter by query species
soi evaluate -s collinearity.ortho -o orthologs.txt -g all.gff -r Vitis_vinifera -q Daucus_carota Angelica_sinensis
The subcommand detandem removes tandem duplicate genes from orthogroups.
Tandem duplicates are defined as genes on the same chromosome whose index difference
is below a threshold (default -d 200). When synteny files (-s) are provided,
the gene with the highest degree in the synteny graph is retained;
otherwise, one gene is chosen arbitrarily.
Usage examples:
# basic tandem removal
soi detandem -og cluster.mcl -g all_species_gene.gff > cluster.mcl.detandem
# with custom distance threshold
soi detandem -og cluster.mcl -g all_species_gene.gff -d 100 > cluster.mcl.detandem
# with synteny files for smarter gene retention
soi detandem -og cluster.mcl -g all_species_gene.gff -s collinearity.ortho > cluster.mcl.detandem
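The tandem definition above (same chromosome, index difference below the threshold) can be sketched as follows. The function name and gene IDs are hypothetical, and chaining neighbors into one cluster (single linkage) is an assumption about how consecutive tandem copies are grouped.

```python
def tandem_clusters(genes, max_dist=200):
    """Cluster genes on the same chromosome whose index gap is below max_dist.

    genes: mapping of gene ID -> (chromosome, gene order index).
    """
    by_chrom = {}
    for gene, (chrom, idx) in genes.items():
        by_chrom.setdefault(chrom, []).append((idx, gene))

    clusters = []
    for chrom in sorted(by_chrom):
        ordered = sorted(by_chrom[chrom])
        current = [ordered[0][1]]
        for (prev_idx, _), (idx, gene) in zip(ordered, ordered[1:]):
            if idx - prev_idx < max_dist:
                current.append(gene)      # close enough: same tandem cluster
            else:
                clusters.append(current)  # gap too large: start a new cluster
                current = [gene]
        clusters.append(current)
    return clusters


# Hypothetical members of one orthogroup from a single species.
genes = {"Sp1|g10": ("Chr1", 10), "Sp1|g12": ("Chr1", 12),
         "Sp1|g900": ("Chr1", 900), "Sp1|g5": ("Chr2", 5)}
print(tandem_clusters(genes))
# [['Sp1|g10', 'Sp1|g12'], ['Sp1|g900'], ['Sp1|g5']]
```

Each cluster with more than one member would then be reduced to a single representative gene.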
The subcommand hog splits orthogroups into Hierarchical Orthologous Groups (HOGs)
based on orthologous synteny, and can output copy-number statistics for detecting
whole-genome duplications (WGD) across the species tree.
Usage examples:
# basic HOG splitting
soi hog -og cluster.mcl -s collinearity.ortho -t species.tree -prefix HOGs
# include paralogs
soi hog -og cluster.mcl -s collinearity.ortho -t species.tree -prefix HOGs -paralog
# output copy-number statistics table (per-node distribution)
soi hog -og cluster.mcl -s collinearity.ortho -t species.tree -prefix HOGs --out-stats
# bar chart of copy-number distribution per node
soi hog -og cluster.mcl -s collinearity.ortho -t species.tree -prefix HOGs --bar-plot
# species tree with pie charts at nodes (gray=1 copy, blue=2, green=3, orange=4, red=5+)
soi hog -og cluster.mcl -s collinearity.ortho -t species.tree -prefix HOGs --tree-plot
# all outputs combined
soi hog -og cluster.mcl -s collinearity.ortho -t species.tree -prefix HOGs --out-stats --bar-plot --tree-plot --max-copies 10
Output files:
- HOGs.tsv: HOG table (HOG / OG / Tree_Node / Parent / Genes)
- HOGs.stats.tsv: per-node copy-number distribution (columns: 1, 2, 3, ..., N+, Multi%)
- HOGs.bar.pdf/.png: multi-panel bar chart of the distribution
- HOGs.tree.pdf/.png: species tree with pie charts at each node
The subcommand prune purifies orthogroups (OGs) to single-copy per species,
guided by Hierarchical Orthologous Group (HOG) information.
For each SOG, it traverses the species tree and keeps only the gene copy
from the HOG with the broadest species representation, removing the rest.
Usage examples:
# basic pruning
soi prune -og cluster.mcl -s collinearity.ortho -t species.tree -o cluster.sc.mcl
Other functions can be found in SOI-tools. Related functions can be requested by users via issues.
See the function.
See evolution_example for a pipeline of phylogenomics analyses based on Orthology Index.
The phylo subcommand supports SGE (Sun Grid Engine) and SLURM for parallelizing gene tree reconstruction tasks.
However, it requires DRMAA to be properly configured (see DRMAA Python).
After libdrmaa is installed, set the DRMAA_LIBRARY_PATH environment variable, for example:
export DRMAA_LIBRARY_PATH=/opt/gridengine/lib/lx-amd64/libdrmaa.so.1.0
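Before submitting jobs, it can help to confirm that the variable is visible and points to an existing file. The check below is a small sketch using only the Python standard library; the function name is hypothetical and it does not load or test the library itself.

```python
import os


def drmaa_library_status(env=os.environ):
    """Report whether DRMAA_LIBRARY_PATH is unset, points to a missing file, or looks fine."""
    path = env.get("DRMAA_LIBRARY_PATH")
    if not path:
        return "unset"
    return "ok" if os.path.isfile(path) else "missing"


print(drmaa_library_status())
```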
The output formats of state-of-the-art synteny detectors, including JCVI, MCscanX and WGDI, are all supported:
# WGDI -icl (*.collinearity):
# Alignment 1: score=3194 pvalue=0.0265 N=80 Dc1&Lj1 plus
Daucus_carota|DCAR_003996 3506 Lonicera_japonica|Lj1C1189G6 4566 -1
Daucus_carota|DCAR_004004 3514 Lonicera_japonica|Lj1P1192T21 4580 1
Daucus_carota|DCAR_004005 3515 Lonicera_japonica|Lj1P1192T26 4581 1
.....
# MCscanX (*.collinearity):
############### Parameters ###############
# MATCH_SCORE: 50
# ....
## Alignment 28: score=500.0 e_value=1.3e-24 N=11 Ac1&Ah1 plus
28- 0: Ananas_comosus|Aco009515.1 Arabidopsis_thaliana|AT1G72340 0
28- 1: Ananas_comosus|Aco009511.1 Arabidopsis_thaliana|AT1G72360 0
28- 2: Ananas_comosus|Aco009507.1 Arabidopsis_thaliana|AT1G72370 0
28- 3: Ananas_comosus|Aco009502.1 Arabidopsis_thaliana|AT1G72410 0
28- 4: Ananas_comosus|Aco009492.1 Arabidopsis_thaliana|AT1G72480 0
.....
# JCVI (*.anchors):
###
Tetracendron_sinense|Tesin01G0059600 Trochodendron_aralioides|evm.TU.group9.733 1780
Tetracendron_sinense|Tesin01G0060100 Trochodendron_aralioides|evm.TU.group9.725 334
Tetracendron_sinense|Tesin01G0060800 Trochodendron_aralioides|evm.TU.group9.710 868
Tetracendron_sinense|Tesin01G0061600 Trochodendron_aralioides|evm.TU.group9.757 294
Tetracendron_sinense|Tesin01G0062600 Trochodendron_aralioides|evm.TU.group9.777 1400
....
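For readers who want to inspect such files themselves, here is a minimal sketch of a reader for the WGDI layout shown above (a `# Alignment` header followed by `gene1 index1 gene2 index2 strand` rows). The function name is hypothetical, it handles only this one layout, and it is not the parser used by soi.

```python
def parse_wgdi_collinearity(lines):
    """Split WGDI -icl output into blocks, each with its header and list of gene pairs."""
    blocks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # A header line opens a new syntenic block.
            blocks.append({"header": line.lstrip("# "), "pairs": []})
            continue
        fields = line.split()
        if blocks and len(fields) >= 3:
            blocks[-1]["pairs"].append((fields[0], fields[2]))
    return blocks


sample = """# Alignment 1: score=3194 pvalue=0.0265 N=80 Dc1&Lj1 plus
Daucus_carota|DCAR_003996 3506 Lonicera_japonica|Lj1C1189G6 4566 -1
Daucus_carota|DCAR_004004 3514 Lonicera_japonica|Lj1P1192T21 4580 1
Daucus_carota|DCAR_004005 3515 Lonicera_japonica|Lj1P1192T26 4581 1
"""
blocks = parse_wgdi_collinearity(sample.splitlines())
print(len(blocks), len(blocks[0]["pairs"]))  # 1 3
```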
The outputs from OrthoFinder2, OrthoMCL, Proteinortho6, Broccoli, InParanoid, SonicParanoid2 are supported:
# OrthoFinder2 output directory like:
OrthoFinder/OrthoFinder/Results_Jun25/
# OrthoMCL:
Tetracendron_sinense|Tesin01G0059600 Trochodendron_aralioides|evm.TU.group9.733 ...
Tetracendron_sinense|Tesin01G0060100 Trochodendron_aralioides|evm.TU.group9.725
Tetracendron_sinense|Tesin01G0060800 Trochodendron_aralioides|evm.TU.group9.710
...
# SonicParanoid2 output directory via `-o sonicparanoid_output --project-id my_run`:
sonicparanoid_output/runs/my_run/
# InParanoid output directory via `-out-dir ip-output -out-table`:
ip-output/
# or InParanoid output ortholog file via `-out-dir ip-output -out-allPairs`:
ip-output/allPairs
# Proteinortho6 output ortholog file:
*.proteinortho-graph
# Broccoli output ortholog file:
dir_step4/orthologous_pairs.txt
The gff/bed formats for JCVI, MCscanX and WGDI are supported:
# gff for WGDI:
Dc1 Daucus_carota|DCAR_000504 20809 26333 + 1 Daucus_carota|DCAR_000504
Dc1 Daucus_carota|DCAR_000505 30205 39120 + 2 Daucus_carota|DCAR_000505
Dc1 Daucus_carota|DCAR_000506 53069 54763 + 3 Daucus_carota|DCAR_000506
Dc1 Daucus_carota|DCAR_000507 56557 60502 - 4 Daucus_carota|DCAR_000507
....
# gff for MCscanX:
Dc1 Daucus_carota|DCAR_000504 20809 26333
Dc1 Daucus_carota|DCAR_000505 30205 39120
Dc1 Daucus_carota|DCAR_000506 53069 54763
Dc1 Daucus_carota|DCAR_000507 56557 60502
....
# bed for JCVI:
Dc1 20809 26333 Daucus_carota|DCAR_000504 0 +
Dc1 30205 39120 Daucus_carota|DCAR_000505 0 +
Dc1 53069 54763 Daucus_carota|DCAR_000506 0 +
....
The outputs from KaKsCalculator and WGDI are supported:
# KaKsCalculator:
Sequence Method Ka Ks Ka/Ks P-Value(Fisher) Length S-Sites N-Sites Fold-Sites(0:2:4) Substitutions S-Substitutions N-Substitutions Fold-S-Substitutions(0:2:4) Fold-N-Substitutions(0:2:4) Divergence-Time Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA) GC(1:2:3) ML-Score AICc Akaike-Weight Model
Arabidopsis_thaliana|AT1G29430-Oryza_sativa|LOC_Os09g37470 YN 0.650185 3.49784 0.185882 3.0186e-10 420 106.037 313.963 NA 218 82.1279 135.872 NA NA 1.36913 2.07932:2.07932:1:1:1:1 0.477381(0.467857:0.428571:0.535714) NA NA NA NA
Arabidopsis_thaliana|AT1G29440-Oryza_sativa|LOC_Os09g37420 YN 0.541299 3.27405 0.16533 3.75171e-12 330 78.6813 251.319 NA 161 64.364 96.636 NA NA 1.19287 1.62285:1.62285:1:1:1:1 0.468182(0.431818:0.459091:0.513636) NA NA NA NA
Arabidopsis_thaliana|AT1G29460-Oryza_sativa|LOC_Os09g37410 YN 0.60446 3.48369 0.173511 1.7716e-11 408 104.055 303.945 NA 207 81.5585 125.441 NA NA 1.33877 2.28955:2.28955:1:1:1:1 0.470588(0.452206:0.419118:0.540441) NA NA NA NA
....
# WGDI -ks:
id1 id2 ka_NG86 ks_NG86 ka_YN00 ks_YN00
Angelica_sinensis|AS08G00315 Daucus_carota|DCAR_007041 0.0685 0.3645 0.0717 0.328
Angelica_sinensis|AS01G00334 Daucus_carota|DCAR_027727 0.0871 0.4938 0.0815 0.8313
Angelica_sinensis|ASUnG00186 Daucus_carota|DCAR_004673 0.1858 0.5447 0.1871 0.5516
....
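The WGDI -ks table above is whitespace-delimited with a header row, so it is easy to load for quick checks. The reader below is a sketch with a hypothetical function name; which Ks column soi itself uses is not stated here, so the default column is an assumption.

```python
def read_wgdi_ks(lines, column="ks_NG86"):
    """Map (id1, id2) -> Ks from a WGDI -ks table; the chosen column is an assumption."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)
    col = header.index(column)
    ks = {}
    for fields in rows:
        if len(fields) != len(header):
            continue  # skip truncated or placeholder lines
        ks[(fields[0], fields[1])] = float(fields[col])
    return ks


sample = """id1 id2 ka_NG86 ks_NG86 ka_YN00 ks_YN00
Angelica_sinensis|AS08G00315 Daucus_carota|DCAR_007041 0.0685 0.3645 0.0717 0.328
Angelica_sinensis|AS01G00334 Daucus_carota|DCAR_027727 0.0871 0.4938 0.0815 0.8313
"""
ks = read_wgdi_ks(sample.splitlines())
print(ks[("Angelica_sinensis|AS08G00315", "Daucus_carota|DCAR_007041")])  # 0.3645
```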
The ctl format for MCscanX (dot_plotter) is supported:
1500
1500
As1,As2,As3,As4,As5,As6,As7,As8,As9,As10,As11 // y axis
Dc1,Dc2,Dc3,Dc4,Dc5,Dc6,Dc7,Dc8,Dc9 // x axis
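The ctl layout above can be read with a few lines of Python. This sketch follows the axis comments in the example (third line for the y axis, fourth line for the x axis); the function name is hypothetical and the two leading numbers are returned as plain plot sizes without assigning them to an axis.

```python
def parse_ctl(text):
    """Read a dot_plotter-style ctl file: two sizes, then y-axis and x-axis chromosomes."""
    # Drop '//' comments and blank lines.
    rows = [line.split("//")[0].strip() for line in text.splitlines()]
    rows = [row for row in rows if row]
    return {
        "sizes": (int(rows[0]), int(rows[1])),
        "ychrs": rows[2].split(","),
        "xchrs": rows[3].split(","),
    }


ctl = """1500
1500
As1,As2,As3,As4,As5,As6,As7,As8,As9,As10,As11 // y axis
Dc1,Dc2,Dc3,Dc4,Dc5,Dc6,Dc7,Dc8,Dc9 // x axis
"""
parsed = parse_ctl(ctl)
print(len(parsed["ychrs"]), len(parsed["xchrs"]))  # 11 9
```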
The ancestor file (WGDI ancestor.txt format) maps modern chromosomes to ancestral chromosome blocks.
Used by --xanc, --yanc, --xbars, --ybars, --colorby-sg, --colorby-anc, and --bar-colorby-sg.
Format: chrom start end color subgenome [label]
# basic format (5 columns):
Pt1 0 100 #FF0000 1
Pt1 100 250 #00B9F1 2
Pt2 0 80 #FF0000 1
Pt2 80 200 #7200DA 3
# extended format (6 columns, label for --xbarlab / --ybarlab):
Pt1 0 100 #FF0000 1 Anc1A
Pt1 100 250 #00B9F1 2 Anc1B
Pt2 0 80 #FF0000 1 Anc2A
Pt2 80 200 #7200DA 3 Anc2B
Columns: chrom (modern chromosome), start/end (gene-order coordinates), color (hex color for ancestor chromosome), subgenome (integer subgenome ID), label (optional, shown when --xbarlab/--ybarlab is set).
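A quick way to sanity-check an ancestor file before plotting is to parse it into typed records. The sketch below follows the column description above; the names `Segment` and `parse_ancestor` are hypothetical, and lines starting with `#` are treated as comments.

```python
from collections import namedtuple

Segment = namedtuple("Segment", "chrom start end color subgenome label")


def parse_ancestor(lines):
    """Parse 5- or 6-column ancestor lines into Segment records."""
    segments = []
    for line in lines:
        # Check for comments first: a data line also contains '#', but only in the color column.
        if line.lstrip().startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 5:
            continue
        chrom, start, end, color, subgenome = fields[:5]
        label = fields[5] if len(fields) > 5 else None
        segments.append(Segment(chrom, int(start), int(end), color, int(subgenome), label))
    return segments


sample = """# extended format (6 columns)
Pt1 0 100 #FF0000 1 Anc1A
Pt1 100 250 #00B9F1 2 Anc1B
Pt2 0 80 #FF0000 1
"""
segments = parse_ancestor(sample.splitlines())
print(segments[0].label, segments[2].label)  # Anc1A None
```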
The Newick format (with or without branch lengths and support values) is supported for hog and rak subcommands:
# simple Newick with species names only:
(Vitis_vinifera,(Daucus_carota,(Angelica_sinensis,Apium_graveolens)));
# with branch lengths:
(Vitis_vinifera:0.1,(Daucus_carota:0.05,(Angelica_sinensis:0.02,Apium_graveolens:0.02):0.03):0.05);
# with WGD indicator (p=2: tetraploidy, p=3: hexaploidy, etc.):
(Vitis_vinifera:0.1,(Daucus_carota:0.05,(Angelica_sinensis:0.02,Apium_graveolens:0.02):0.03)[p=2]:0.05);
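To check that the species names in a tree match the species labels in the gene IDs, the leaf names can be pulled out with a regular expression. This is a rough sketch, not a full Newick parser: the function names are hypothetical, and it assumes names contain no whitespace, parentheses, commas, colons, or brackets, and that bracketed annotations contain no commas.

```python
import re


def newick_leaves(newick):
    """Return leaf names: tokens that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^\s(),:;\[\]]+)", newick)


def wgd_annotations(newick):
    """Return the values of [p=N] annotations as integers."""
    return [int(p) for p in re.findall(r"\[p=(\d+)\]", newick)]


tree = ("(Vitis_vinifera:0.1,(Daucus_carota:0.05,"
        "(Angelica_sinensis:0.02,Apium_graveolens:0.02):0.03)[p=2]:0.05);")
print(newick_leaves(tree))
# ['Vitis_vinifera', 'Daucus_carota', 'Angelica_sinensis', 'Apium_graveolens']
print(wgd_annotations(tree))  # [2]
```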
In summary, users usually do not need to prepare additional files for this tool. Other popular formats can be supported upon request.
However, it is important to label each GENE ID with its SPECIES ID (e.g. Angelica_sinensis|AS01G00001) (see details in evolution_example).
Unique CHROMOSOME IDs are also required.
It is best to make all IDs unique when preparing your data,
so that many issues will be avoided, not only in soi but also in other tools.
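The labeling convention can be applied and checked with small helpers like the ones below. They are illustrative only: the function names are hypothetical and the `|` separator is taken from the example IDs in this document.

```python
from collections import Counter


def label_gene(species, gene_id, sep="|"):
    """Prefix a gene ID with its species ID, e.g. Angelica_sinensis|AS01G00001."""
    if sep in species:
        raise ValueError(f"species ID must not contain {sep!r}: {species}")
    return f"{species}{sep}{gene_id}"


def split_label(labeled_id, sep="|"):
    """Split a labeled ID back into (species, gene ID) at the first separator."""
    species, gene_id = labeled_id.split(sep, 1)
    return species, gene_id


def duplicated_ids(ids):
    """Return IDs that occur more than once, to catch non-unique names early."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


print(label_gene("Angelica_sinensis", "AS01G00001"))  # Angelica_sinensis|AS01G00001
print(split_label("Daucus_carota|DCAR_000504"))       # ('Daucus_carota', 'DCAR_000504')
print(duplicated_ids(["Dc1", "Dc2", "Dc1"]))          # ['Dc1']
```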
The output format is the same as the input format.
The output format of cluster and outgroup subcommands is the same as the OrthoMCL output format (legacy format):
SOG3000: Angelica_sinensis|AS08G01493 Angelica_sinensis|AS09G02085 Apium_graveolens|Ag7G00949 Aralia_elata|AE10G00968 Centella_asiatica|evm.TU.Scaffold_1.3269 Coriandrum_sativum|Cs09G02292 Coriandrum_sativum|Cs09G02294 Daucus_carota|DCAR_003880 Daucus_carota|DCAR_007867 Eleutherococcus_senticosus|Ese12G000172 Panax_ginseng|GWHGBEIL036624.1 Panax_ginseng|GWHGBEIL065014.1 Panax_notoginseng|PN028370
SOG3001: Angelica_sinensis|AS08G03434 Angelica_sinensis|AS08G03435 Apium_graveolens|Ag6G02640 Aralia_elata|AE12G02374 Centella_asiatica|evm.TU.Scaffold_7.2677 Coriandrum_sativum|Cs09G00377 Daucus_carota|DCAR_001654 Eleutherococcus_senticosus|Ese19G002246 Panax_ginseng|GWHGBEIL023685.1 Panax_ginseng|GWHGBEIL043114.1 Panax_ginseng|GWHGBEIL043118.1 Panax_ginseng|GWHGBEIL043125.1 Panax_notoginseng|PN013450
SOG3002: Angelica_sinensis|AS10G01791 Apium_graveolens|Ag1G00857 Apium_graveolens|Ag6G02641 Aralia_elata|AE12G02379 Centella_asiatica|evm.TU.Scaffold_7.2680 Coriandrum_sativum|Cs06G01941 Coriandrum_sativum|Cs06G01943 Coriandrum_sativum|Cs09G00381 Daucus_carota|DCAR_001660 Daucus_carota|DCAR_029095 Panax_ginseng|GWHGBEIL023683.1 Panax_ginseng|GWHGBEIL043112.1 Panax_notoginseng|PN013453
...
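Because each line is simply a group name, a colon, and space-separated gene IDs, the file is easy to post-process. The sketch below reads this layout and flags groups with at most one gene per species; the function names and the short example groups are hypothetical.

```python
def parse_orthomcl(lines):
    """Read 'NAME: gene1 gene2 ...' lines into a dict of group name -> gene list."""
    groups = {}
    for line in lines:
        if ":" not in line:
            continue  # skip blank or placeholder lines
        name, members = line.split(":", 1)
        groups[name.strip()] = members.split()
    return groups


def is_single_copy(genes, sep="|"):
    """True if every species (the part before the separator) appears only once."""
    species = [gene.split(sep, 1)[0] for gene in genes]
    return len(species) == len(set(species))


sample = """SOG1: Sp1|g1 Sp2|g1 Sp3|g1
SOG2: Sp1|g2 Sp1|g3 Sp2|g2
"""
groups = parse_orthomcl(sample.splitlines())
print({name: is_single_copy(genes) for name, genes in groups.items()})
# {'SOG1': True, 'SOG2': False}
```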
The hog subcommand outputs a TSV file with hierarchical orthologous group information:
HOG OG Tree_Node Parent Genes
SOG100.N5.hog0 SOG100 N5 Root Sp1|G001 Sp1|G002 Sp2|G003
SOG100.N3.hog0 SOG100 N3 SOG100.N5.hog0 Sp1|G001 Sp2|G003
SOG100.N3.hog1 SOG100 N3 SOG100.N5.hog0 Sp1|G002
Columns: HOG (unique HOG ID), OG (source orthogroup), Tree_Node (species tree node), Parent (parent HOG or "Root" if at root), Genes (space-separated gene IDs).
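The HOG table can be loaded with the standard csv module to walk the hierarchy. This is a sketch under the column description above (tab-separated columns, space-separated genes); the function names are hypothetical and the rows are the example rows from this section.

```python
import csv
import io


def read_hogs(handle):
    """Load the HOG table into a dict keyed by HOG ID."""
    hogs = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        hogs[row["HOG"]] = {
            "og": row["OG"],
            "node": row["Tree_Node"],
            "parent": row["Parent"],
            "genes": row["Genes"].split(),
        }
    return hogs


def children_of(hogs, parent_id):
    """List the HOG IDs whose Parent column points to parent_id."""
    return [hog_id for hog_id, rec in hogs.items() if rec["parent"] == parent_id]


table = (
    "HOG\tOG\tTree_Node\tParent\tGenes\n"
    "SOG100.N5.hog0\tSOG100\tN5\tRoot\tSp1|G001 Sp1|G002 Sp2|G003\n"
    "SOG100.N3.hog0\tSOG100\tN3\tSOG100.N5.hog0\tSp1|G001 Sp2|G003\n"
    "SOG100.N3.hog1\tSOG100\tN3\tSOG100.N5.hog0\tSp1|G002\n"
)
hogs = read_hogs(io.StringIO(table))
print(children_of(hogs, "SOG100.N5.hog0"))  # ['SOG100.N3.hog0', 'SOG100.N3.hog1']
```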
Zhang RG, Shang HY, Milne RI et al. SOI: robust identification of orthologous synteny with the Orthology Index and broad applications in evolutionary genomics [J]. Nucleic Acids Res., 2025, 53 (7): gkaf320 [https://doi.org/10.1093/nar/gkaf320]