# Evolution and adaptations of the seminal proteome in an insect with traumatic insemination

## Description of the data and file structure

### Data archive

**File:** `data.zip`

This archive contains the data files used in the analyses described in
the associated manuscript.

The archive contains multiple `.csv`, `.tsv`, and `.rds` files, along with the `.txt`, `.xlsx`, `.fasta`, `.nwk`, and `.cif` files described below.

Missing values are represented either as blank cells or `"NA"`.
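A minimal sketch of how both missing-value encodings could be normalised on import, assuming a delimited text file. The helper names `clean` and `read_rows` and the demo table are hypothetical and are not part of the archive or the analysis code.

```python
import csv
import io

# Tokens treated as missing: blank cells and the literal string NA.
MISSING_TOKENS = {"", "NA"}


def clean(value):
    """Return None for a missing-value token, otherwise the value unchanged."""
    if isinstance(value, str) and value.strip() in MISSING_TOKENS:
        return None
    return value


def read_rows(handle, delimiter=","):
    """Read a delimited table into a list of dicts with missing values as None."""
    return [
        {column: clean(value) for column, value in row.items()}
        for row in csv.DictReader(handle, delimiter=delimiter)
    ]


# Hypothetical demo table (values are invented, not taken from the archive).
demo = io.StringIO('gene_id,chm\n101,NA\n102,\n103,"NA"\n104,chr1\n')
rows = read_rows(demo)
print(rows)
```

Passing `delimiter="\t"` would apply the same handling to the `.tsv` files.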

------------------------------------------------------------------------

# File descriptions

## Shared reference files

The following files are used in multiple analysis scripts.

------------------------------------------------------------------------

## `DAVID2entrez_geneID.txt`

Mapping file used to convert **UniProt protein IDs to Entrez gene IDs**
using the DAVID Bioinformatics Resources.

Source: https://davidbioinformatics.nih.gov/

**Columns**

  Column        Description
  ------------- ---------------------------
  `From`        UniProt protein accession
  `To`          Entrez Gene ID
  `Gene.Name`   Gene name annotation

**Species:** *Cimex lectularius*

------------------------------------------------------------------------

## `cimex_signal_peps_combined.rds`

List of *Cimex lectularius* proteins predicted to contain **signal
peptides**, generated by combining predictions from:

-   Phobius
-   SignalP 6.0

The file contains UniProt protein IDs predicted to encode secreted
proteins.

------------------------------------------------------------------------

## `chrom_locs.csv`

Chromosomal location of each gene inferred from **BLASTn searches
against the *Cimex lectularius* chromosome-level genome assembly (Miles et al. 2025)**. 

**Columns**

  Column      Description
  ----------- -----------------------------------------
  `gene_id`   Entrez gene ID
  `chm`       Chromosome on which the gene is located
  `accession` UniProt ID

The *Cimex lectularius* karyotype is:

-   **Males:** 2n = 26 + X₁X₂Y
-   **Females:** 2n = 26 + X₁X₁X₂X₂

------------------------------------------------------------------------

## `codeml_model0_full_results.csv`

Evolutionary rate estimates calculated using **PAML codeml (model 0)**.

**Columns**

  Column     Description
  ---------- ------------------------------------------
  `GeneID`   Entrez gene ID
  `t`        Total branch length across the phylogeny
  `S`        Number of synonymous sites
  `N`        Number of nonsynonymous sites
  `dN`       Nonsynonymous substitution rate
  `dS`       Synonymous substitution rate
  `omega`    dN/dS ratio
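As a hedged illustration of how the `dN`, `dS`, and `omega` columns relate, the sketch below recomputes the ratio from the two rates and leaves it undefined when `dS` is zero or missing. The function name `dn_ds_ratio` and the example values are hypothetical; the `omega` values in the file are those reported by codeml and may differ slightly from a ratio recomputed from rounded rates.

```python
def dn_ds_ratio(dn, ds):
    """Return dN/dS, or None when the ratio is undefined (dS zero or missing)."""
    if dn is None or ds is None or ds == 0:
        return None
    return dn / ds


# Hypothetical rates, not taken from the results file.
example = dn_ds_ratio(0.5, 2.0)
undefined = dn_ds_ratio(0.5, 0.0)
print(example, undefined)
```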

------------------------------------------------------------------------

## `N0.tsv`

Hierarchical Orthogroup assignments generated using **OrthoFinder**.

Input consisted of protein FASTA files for each species included in the
analysis.

**Columns**

  -----------------------------------------------------------------------
  Column                              Description
  ----------------------------------- -----------------------------------
  `HOG`                               Hierarchical Orthogroup identifier

  `OG`                                Orthogroup identifier (deprecated)

  `Gene Tree Parent Clade`            Phylogenetic clade defining the
                                      orthogroup
  -----------------------------------------------------------------------

Species included in the analysis:

-   Acyrthosiphon
-   Aedes
-   Callosobruchus
-   Chemipterus
-   Cimex
-   Drosophila
-   Gryllus
-   Rhodnius
-   Triatoma
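The sketch below shows one way the table could be summarised, assuming the usual OrthoFinder layout in which the three columns above are followed by one column per species, each cell holding a comma-separated list of gene identifiers. The function names, species column names, and gene identifiers in the demo are hypothetical.

```python
import csv
import io

# Non-species columns described in the table above.
META_COLUMNS = ("HOG", "OG", "Gene Tree Parent Clade")


def gene_counts(handle):
    """Map each HOG to {species column: number of genes listed in that cell}."""
    counts = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["HOG"]] = {
            species: len([g for g in (cell or "").split(",") if g.strip()])
            for species, cell in row.items()
            if species is not None and species not in META_COLUMNS
        }
    return counts


def single_copy(counts):
    """Return HOGs that have exactly one gene in every species column."""
    return [
        hog
        for hog, per_species in counts.items()
        if all(n == 1 for n in per_species.values())
    ]


# Hypothetical two-species demo table (identifiers are invented).
demo = io.StringIO(
    "HOG\tOG\tGene Tree Parent Clade\tCimex\tRhodnius\n"
    "N0.HOG0000001\tOG0000001\tn0\tgeneA\tgeneB\n"
    "N0.HOG0000002\tOG0000002\tn0\tgeneC, geneD\t\n"
)
counts = gene_counts(demo)
print(single_copy(counts))
```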

------------------------------------------------------------------------

## `FlyBaseAllGenes_2_Uniprot_2025_05_07.tsv`

Gene ID conversion table downloaded from **FlyBase** for *Drosophila
melanogaster*.

Source: https://flybase.org/

**Columns**

  Column                Description
  --------------------- ---------------------------
  `From`                FlyBase gene identifier
  `Entry`               UniProt protein accession
  `Protein names`       Protein name annotation
  `Gene Names`          Gene symbol
  `Length`              Protein length
  `Gene Ontology IDs`   Associated GO terms

------------------------------------------------------------------------

## `cimex_go_ids.rds`

Gene Ontology (GO) annotations for *Cimex lectularius* proteins
downloaded from **UniProt**.

**Columns**

  Column        Description
  ------------- --------------------------------------------
  `Accession`   UniProt protein ID
  `GOID`        Semicolon-separated list of GO identifiers
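Because `GOID` packs several terms into one cell, many analyses need one row per protein and term. The sketch below shows that reshaping on plain Python tuples, assuming the table has first been exported from the `.rds` file; the function name `explode_go` and the accession labels are hypothetical.

```python
def explode_go(records):
    """Expand (accession, 'GO:a; GO:b') pairs into one (accession, GO ID) pair per term."""
    pairs = []
    for accession, goid in records:
        for term in (goid or "").split(";"):
            term = term.strip()
            if term:  # skip empty fragments and missing annotations
                pairs.append((accession, term))
    return pairs


# Hypothetical records; accession labels are placeholders.
records = [
    ("ACC1", "GO:0005576; GO:0008270"),
    ("ACC2", None),
    ("ACC3", "GO:0005576"),
]
long_format = explode_go(records)
print(long_format)
```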

------------------------------------------------------------------------

# Reproductive biology datasets

# `LFQ_v2.0.xlsx`

Protein-level identification and label-free quantification (LFQ) results for four sample groups: AGS, Mesa, Sperm, and Testis. Each group has multiple biological replicates, giving 17 sample files in total (F1–F17).

## Protein Identification & Quality

  Column                                        Description
  --------------------------------------------- ---------------------------------------------------------------------------------------------------------------------------------------
  `Checked`                                     Boolean flag indicating whether the protein entry has been manually selected/verified in Proteome Discoverer (PD).
  `Protein FDR Confidence: Combined`            Confidence level assigned based on the combined false discovery rate (FDR) across search engines. Values: High, Medium, Low.
  `Master`                                      Indicates whether the protein is a Master Protein — the representative entry for a protein group.
  `Accession`                                   UniProt or database accession number for the identified protein.
  `Description`                                 Full protein name including organism, taxonomy ID (OX), gene name (GN), protein existence level (PE), and sequence version (SV).
  `Exp. q-value: Combined`                      Experimental q-value (FDR threshold) from the combined scoring across search engines; lower = higher confidence.
  `Sum PEP Score`                               Score summed across all PSMs, derived from the negative log-transformed Posterior Error Probability (PEP) of each PSM; higher = better identification confidence.
  `Coverage [%]`                                Percentage of the protein sequence covered by identified peptides.
  `# Peptides`                                  Total number of distinct peptides identified for this protein.
  `# PSMs`                                      Total number of peptide-spectrum matches assigned to this protein.
  `# Unique Peptides`                           Number of peptides that uniquely map to this protein and no other.
  `# AAs`                                       Length of the protein sequence in amino acids.
  `MW [kDa]`                                    Theoretical molecular weight in kilodaltons.
  `calc. pI`                                    Calculated isoelectric point of the protein.
  `Score Sequest HT: Sequest HT`                Total Sequest HT search engine score for this protein.
  `# Peptides (by Search Engine): Sequest HT`   Number of peptides identified specifically by the Sequest HT search engine.
  `# Razor Peptides`                            Peptides shared between protein groups but assigned to the highest-scoring protein (razor rule).

---

## Abundance Ratios (Linear Scale)

Pairwise ratios of grouped abundance values between the four sample groups. A value >1 indicates higher abundance in the numerator group.

  Column                                  Description
  --------------------------------------- ---------------------------------------------------
  `Abundance Ratio: (Mesa) / (AGS)`       Linear fold-change of Mesa relative to AGS.
  `Abundance Ratio: (Sperm) / (AGS)`      Linear fold-change of Sperm relative to AGS.
  `Abundance Ratio: (Testis) / (AGS)`     Linear fold-change of Testis relative to AGS.
  `Abundance Ratio: (Sperm) / (Mesa)`     Linear fold-change of Sperm relative to Mesa.
  `Abundance Ratio: (Testis) / (Mesa)`    Linear fold-change of Testis relative to Mesa.
  `Abundance Ratio: (Testis) / (Sperm)`   Linear fold-change of Testis relative to Sperm.

---

## Abundance Ratios (Log₂ Scale)

Log₂-transformed versions of the above ratios. Positive values indicate higher abundance in the numerator group; negative values indicate higher abundance in the denominator group.

  Column                                        Description
  --------------------------------------------- ---------------------------------------------------
  `Abundance Ratio (log2): (Mesa) / (AGS)`      Log₂ fold-change of Mesa vs AGS.
  `Abundance Ratio (log2): (Sperm) / (AGS)`     Log₂ fold-change of Sperm vs AGS.
  `Abundance Ratio (log2): (Testis) / (AGS)`    Log₂ fold-change of Testis vs AGS.
  `Abundance Ratio (log2): (Sperm) / (Mesa)`    Log₂ fold-change of Sperm vs Mesa.
  `Abundance Ratio (log2): (Testis) / (Mesa)`   Log₂ fold-change of Testis vs Mesa.
  `Abundance Ratio (log2): (Testis) / (Sperm)`  Log₂ fold-change of Testis vs Sperm.
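The relationship between the linear and log₂ columns can be sketched as follows. The function name `log2_ratio` is hypothetical, and the treatment of missing or non-positive ratios as undefined is an assumption, not a description of how the software produced the file.

```python
import math


def log2_ratio(linear_ratio):
    """Convert a linear abundance ratio to log2 scale; None if undefined."""
    if linear_ratio is None or linear_ratio <= 0:
        return None
    return math.log2(linear_ratio)


# A ratio of 4 (numerator group four-fold higher) and its reciprocal.
up = log2_ratio(4.0)
down = log2_ratio(0.25)
equal = log2_ratio(1.0)
print(up, down, equal)
```

Reciprocal ratios are symmetric around zero on the log₂ scale, which is why this scale is convenient for comparing increases and decreases.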

---

## Statistical Testing of Abundance Ratios

Six pairwise comparisons are reported for each statistic: Mesa/AGS, Sperm/AGS, Testis/AGS, Sperm/Mesa, Testis/Mesa, and Testis/Sperm.

  Column                                          Description
  ----------------------------------------------- -------------------------------------------------------------------------------------------------------------------------------------------------
  `Abundance Ratio P-Value: (X) / (Y)`            Raw p-value from the statistical test comparing abundance between the two groups. Blank cells indicate insufficient data for testing.
  `Abundance Ratio Adj. P-Value: (X) / (Y)`       Benjamini-Hochberg adjusted p-value (FDR-corrected) for the pairwise comparison. Used for significance filtering.
  `Abundance Ratio Variability [%]: (X) / (Y)`    Coefficient of variation (%) of the ratio across replicates; reflects measurement consistency between groups.
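For readers who want to see how a Benjamini-Hochberg adjustment works, the sketch below implements the textbook step-up procedure. It is an illustration only: the function name is hypothetical, the example p-values are invented, and the exact procedure applied by the software that produced the file (for example, how blank p-values are handled) is not confirmed here.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in input order; None entries stay None."""
    indexed = [(p, i) for i, p in enumerate(pvalues) if p is not None]
    m = len(indexed)  # number of tests actually performed
    adjusted = [None] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1),
    # keeping a running minimum so adjusted values are monotone.
    for rank, (p, i) in zip(range(m, 0, -1), sorted(indexed, reverse=True)):
        running_min = min(running_min, p * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, including one untested comparison.
raw = [0.01, 0.04, 0.03, 0.005, None]
adj = benjamini_hochberg(raw)
print(adj)
```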

---

## Grouped Abundances

Summarised abundance values per sample group, derived by averaging or summing replicate values after normalisation.

  Column                                Description
  ------------------------------------- ------------------------------------------------------------------------------------------
  `Abundances (Grouped): AGS`           Mean/summed abundance across all AGS replicates.
  `Abundances (Grouped): Mesa`          Mean/summed abundance across all Mesa replicates.
  `Abundances (Grouped): Sperm`         Mean/summed abundance across all Sperm replicates.
  `Abundances (Grouped): Testis`        Mean/summed abundance across all Testis replicates.
  `Abundances (Grouped) CV [%]: AGS`    Coefficient of variation (%) across AGS replicates; reflects within-group reproducibility.
  `Abundances (Grouped) CV [%]: Mesa`   Coefficient of variation (%) across Mesa replicates.
  `Abundances (Grouped) CV [%]: Sperm`  Coefficient of variation (%) across Sperm replicates.
  `Abundances (Grouped) CV [%]: Testis` Coefficient of variation (%) across Testis replicates.
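A coefficient of variation expresses the spread of replicate values as a percentage of their mean. The sketch below uses the sample standard deviation, which is an assumption; the function name `cv_percent` and the replicate values are hypothetical, and the software that produced the file may compute the CV differently.

```python
import statistics


def cv_percent(values):
    """Return 100 * sample SD / mean for the non-missing values, or None."""
    observed = [v for v in values if v is not None]
    if len(observed) < 2:
        return None  # the sample SD needs at least two observations
    mean = statistics.mean(observed)
    if mean == 0:
        return None  # CV is undefined for a zero mean
    return 100 * statistics.stdev(observed) / mean


# Hypothetical replicate abundances with one missing value.
replicates = [10.0, 20.0, 30.0, None]
cv = cv_percent(replicates)
print(cv)
```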

---

## Per-Sample Raw Abundances (F1–F17)

Individual raw abundance values for each sample file before normalisation. Files are grouped by condition:
AGS = F1–F5 (5 replicates), Mesa = F6–F9 (4 replicates), Sperm = F10–F13 (4 replicates), Testis = F14–F17 (4 replicates).

  Column                               Description
  ------------------------------------ ---------------------------------------------------------------------------------------------------------------
  `Abundance: F[n]: Sample, [Group]`   Raw MS1 intensity-based abundance value for that specific sample file. Missing values indicate the protein was not quantified in that replicate.
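The file-to-group assignment above can be encoded and checked against the column headers. The sketch assumes headers follow the template literally (for example `Abundance: F7: Sample, Mesa`); the pattern, function names, and example headers are hypothetical and should be checked against the actual spreadsheet.

```python
import re

# File-number ranges per group, as stated in the text above.
GROUP_RANGES = {
    "AGS": range(1, 6),      # F1-F5
    "Mesa": range(6, 10),    # F6-F9
    "Sperm": range(10, 14),  # F10-F13
    "Testis": range(14, 18), # F14-F17
}

# Assumed header layout for the raw abundance columns.
COLUMN_PATTERN = re.compile(r"^Abundance: F(\d+): Sample, (.+)$")


def group_for_file(n):
    """Return the sample group for file number n (1-17)."""
    for group, numbers in GROUP_RANGES.items():
        if n in numbers:
            return group
    raise ValueError(f"file number out of range: {n}")


def parse_column(name):
    """Return (file number, group label) for a raw abundance header, else None."""
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


parsed = parse_column("Abundance: F7: Sample, Mesa")
print(parsed, group_for_file(parsed[0]))
```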

---

## Per-Sample Normalised Abundances (F1–F17)

Normalised counterparts of the raw abundance columns, adjusted to correct for sample loading differences or systematic variation between runs.

  Column                                           Description
  ------------------------------------------------ ---------------------------------------------------------------------------------------------------------------
  `Abundances (Normalized): F[n]: Sample, [Group]` Normalised abundance value for each sample/fraction, used for all ratio and statistical calculations.

---

## Detection Status per Sample

Boolean/categorical indicator of whether the protein was detected in each individual sample.

  Column                                          Description
  ----------------------------------------------- ---------------------------------------------------------------------------------------------------------------
  `Found in Sample: [Sn] F[n]: Sample, [Group]`   Detection status for each sample. Values: **High** (confidently identified), **Peak Found** (signal detected but below high-confidence threshold), or **Not Found** (absent).

---

## Protein Grouping & Modifications

  Column               Description
  -------------------- ---------------------------------------------------------------------------------------------------------------
  `# Protein Groups`   Number of protein groups this entry belongs to (typically 1 for a unique master protein).
  `Modifications`      Post-translational modifications (PTMs) identified on peptides assigned to this protein (e.g., oxidation, phosphorylation). Blank if none detected.


# Sperm leucylaminopeptidase (S-Lap) datasets

## `M17LAP3_SLAP12_GrSm_orthologs.fasta`

UniProt protein sequences for S-Lap orthologs and cytosol aminopeptidase (LAP3; P00727).


## `NCBI_BLASTp_SLAPs.csv`

BLASTp results from NCBI (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins). For each *Cimex* protein (`Protein` column), the `first`, `second`, and `third` columns give the short names of the three top-ranked *Drosophila* S-Laps. Searches were restricted to *Drosophila melanogaster*.

## `reciprocal_best_hits.tsv`

Results of local BLASTp searches of the 13 *Cimex* S-Lap orthologs against the 6 *Drosophila* S-Laps.


## `slap_trim.nwk`

Newick-format tree of *Cimex* S-Lap orthologs. Protein sequences, including Dmel and LAP3 sequences, were aligned using MAFFT, and the resulting tree was manually pruned to show only *Cimex* proteins.


# Protein-ligand complex predictions

## `region_constrained_details.tsv`

Results of the analysis characterising the M17 aminopeptidase metal-binding and catalytic residue motifs of S-Lap and granny-smith orthologs.

-----------------------------------------------------------------------
Column                              Description
----------------------------------- -----------------------------------
`protein_id`                        UniProt identifier as provided
                                    in the input FASTA file

`site_name`                         Motif site identifier (`metal_1`
                                    through `metal_5` = divalent cation
                                    binding residues; `cat_1`, `cat_2`
                                    = catalytic residues)

`position`                          Position of the best-matching
                                    residue in the query sequence
                                    (1-based)

`residue`                           Single-letter amino acid code of
                                    the residue found at this position

`canonical_residues`                Residue(s) matching the M17 LAP
                                    consensus exactly, separated by `/`

`compatible_residues`               Residues with structural or
                                    mutagenesis evidence for metal
                                    coordination in M17 LAPs, separated
                                    by `/`; `none` for catalytic sites
                                    which have no compatible class

`offset`                            Positional offset of the found
                                    residue relative to the exact
                                    expected position (0 = exact match;
                                    negative = upstream; positive =
                                    downstream)

`tier`                              Classification of the found
                                    residue: `CONSERVED` = exact match
                                    to consensus; `COMPATIBLE` =
                                    structurally validated alternative;
                                    `NON_FUNCTIONAL` = neither

`score`                             Numeric score for this site:
                                    `1.0` = CONSERVED; `0.5` =
                                    COMPATIBLE; `0.0` = NON_FUNCTIONAL
-----------------------------------------------------------------------
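The `tier` and `score` columns follow directly from comparing `residue` with the `canonical_residues` and `compatible_residues` lists. The sketch below reproduces that mapping as described in the table; the function name `classify_site` and the residue letters in the demo are hypothetical and do not state which residues are canonical at any real site.

```python
# Score assigned to each tier, as listed in the table above.
TIER_SCORES = {"CONSERVED": 1.0, "COMPATIBLE": 0.5, "NON_FUNCTIONAL": 0.0}


def classify_site(residue, canonical_residues, compatible_residues):
    """Return (tier, score) for one motif site from its '/'-separated residue lists."""
    canonical = set(canonical_residues.split("/"))
    # Catalytic sites carry the literal string 'none' instead of a list.
    if compatible_residues == "none":
        compatible = set()
    else:
        compatible = set(compatible_residues.split("/"))
    if residue in canonical:
        tier = "CONSERVED"
    elif residue in compatible:
        tier = "COMPATIBLE"
    else:
        tier = "NON_FUNCTIONAL"
    return tier, TIER_SCORES[tier]


# Hypothetical sites; residue letters are placeholders for illustration.
print(classify_site("D", "D", "E/N"))
print(classify_site("N", "D", "E/N"))
print(classify_site("A", "K", "none"))
```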

## AlphaFold3 results

Predicted protein--ligand interactions generated using **AlphaFold3
complex structure predictions** (Abramson et al. 2024 Nature). Each protein was modelled in complex with two zinc ions. Signal peptides were trimmed prior to structure prediction.

### `zn_cifs`

Directory containing the AlphaFold3 protein--ligand models in `mmCIF` format, plus `key_file.tsv`, which maps the arbitrary file names (e.g., `af_protein_ligand_X_model.cif`) to UniProt protein IDs.

### `zinc_summary_chimerax.csv`

Results from measuring zinc ion coordination from `mmCIF` files using ChimeraX v.1.11.1 (Meng et al. 2023; Prot. Sci.).

**Columns**

-----------------------------------------------------------------------
Column                              Description
----------------------------------- -----------------------------------
`protein`                           Protein identifier derived from
                                    input CIF filename
                                    
`zinc_found`                        Whether zinc ions were detected
                                    in the structure (True/False)
                                    
`zn_index`                          Index of the zinc ion within the
                                    structure (1 = first zinc, 2 =
                                    second zinc)
                                    
`num_zn_total`                      Total number of zinc ions detected
                                    in the structure
                                    
`site_type`                         Classified zinc binding site:
                                    `Zn1_exchangeable` (lysine-
                                    coordinated) or `Zn2_tight`
                                    (two aspartate contacts, no
                                    lysine)
                                    
`num_canonical_contacts`            Number of canonical M17
                                    coordinating residues after
                                    bidentate deduplication
                                    
`coordinating_residues_raw`         All atoms within 3.5 Å of the
                                    zinc ion. Format:
                                    chain:RES###(atom)=dist Å.
                                    Asterisk (*) denotes canonical
                                    M17 sidechain contacts
                                    
`coordinating_residues_deduped`     Canonical contacts after
                                    bidentate deduplication (closest
                                    oxygen retained per Asp/Glu
                                    residue). Asterisk (*) denotes
                                    canonical M17 sidechain contacts
                                    
`canonical_residue_types`           Residue types of canonical
                                    coordinating contacts (LYS, ASP,
                                    GLU)
                                    
`lys_present`                       Whether lysine coordinates the
                                    zinc ion; diagnostic for Zn1
                                    of M17 LAP (True/False)
                                    
`mean_distance`                     Mean Zn–ligand distance (Å)
                                    across all contacts in the
                                    deduplicated set
                                    
`mean_plddt_at_site`                Mean AlphaFold3 pLDDT confidence
                                    score across atoms coordinating
                                    the zinc ion
                                    
`score`                             M17 coordination score (0–10).
                                    Incorporates canonical residue
                                    presence, coordination number,
                                    bond distance quality, and
                                    pLDDT confidence
                                    
`call`                              Activity classification:
                                    `LIKELY_ACTIVE` (score ≥ 6),
                                    `UNCERTAIN` (score 3–5),
                                    `LIKELY_INACTIVE` (score < 3)
                                    
`flags`                             Diagnostic notes from scoring,
                                    including missing canonical
                                    residues, unexpected coordination
                                    numbers, weak or bad distances,
                                    low pLDDT, non-canonical
                                    contacts, and bidentate
                                    deduplication events
-----------------------------------------------------------------------
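The `call` column is a thresholded version of `score`. A minimal sketch of that mapping follows; the function name is hypothetical, and assigning any score at or above 3 but below 6 to `UNCERTAIN` (including non-integer values between 5 and 6) is an assumption, since the table lists that class as "score 3–5".

```python
def activity_call(score):
    """Map an M17 coordination score (0-10) to an activity class."""
    if score >= 6:
        return "LIKELY_ACTIVE"
    if score < 3:
        return "LIKELY_INACTIVE"
    # Assumption: everything from 3 up to (but not including) 6 is uncertain.
    return "UNCERTAIN"


# Hypothetical scores spanning the three classes.
calls = [activity_call(s) for s in (8, 6, 5, 3, 2.5, 0)]
print(calls)
```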

### `key_file.tsv`

Maps the arbitrary filename IDs used for the AlphaFold3 predictions to UniProt protein IDs. It also contains metadata on the prediction inputs.

**Columns**

-----------------------------------------------------------------------
Column                              Description
----------------------------------- -----------------------------------
`filename`                          Arbitrary ID; matches `protein` in
                                    `zinc_summary_chimerax.csv`

`protein_id`                        UniProt protein ID

`num_ligands`                       Total number of ligands in the
                                    input `.json`

`ligand_type`                       Ligand type in the input `.json`

`seed1/2/3/4/5`                     AlphaFold3 model seed for each
                                    model
-----------------------------------------------------------------------
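Since `filename` matches the `protein` column of `zinc_summary_chimerax.csv`, the two tables can be joined to attach UniProt IDs to the coordination results. The sketch below does this on rows already loaded as dictionaries; the function name and all identifier values are hypothetical placeholders.

```python
def attach_protein_ids(summary_rows, key_rows):
    """Add a 'protein_id' field to each summary row via the key file lookup."""
    lookup = {row["filename"]: row["protein_id"] for row in key_rows}
    # Rows without a match get protein_id=None rather than being dropped.
    return [dict(row, protein_id=lookup.get(row["protein"])) for row in summary_rows]


# Hypothetical rows; identifiers are placeholders, not real entries.
key_rows = [
    {"filename": "id_1", "protein_id": "UNIPROT_A"},
    {"filename": "id_2", "protein_id": "UNIPROT_B"},
]
summary_rows = [
    {"protein": "id_2", "zn_index": "1"},
    {"protein": "id_9", "zn_index": "1"},
]
joined = attach_protein_ids(summary_rows, key_rows)
print(joined)
```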

# Code and software

All analysis code is available on GitHub:

https://github.com/MartinGarlovsky/Bedbug_proteomics

------------------------------------------------------------------------

# Data access

Proteomic data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the identifier PXD075584. 
