# Novel female reproductive organ differentiates postmating transcriptional response to insemination versus arrival of sperm in bedbugs

## Description of the data and file structure

### Data archive

**File:** `data.zip`

This archive contains the data files used in the analyses described in
the associated manuscript. Files are provided in `.csv`, `.tsv`, `.txt`,
and `.rds` formats.

Missing values are represented either as blank cells or `"NA"`.
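
As a minimal sketch of how the two missing-value encodings could be handled when reading the text files outside R, the Python snippet below maps both blank cells and the string `NA` to `None`. The helper name `read_rows` and the three example rows are hypothetical and are not taken from the archive.

```python
import csv
import io

# Made-up rows for illustration only; the real files are in data.zip.
SAMPLE = "ID,sperm_count\nF01,12\nF02,NA\nF03,\n"

def read_rows(handle):
    """Yield dict rows, mapping blank cells and the string 'NA' to None."""
    for row in csv.DictReader(handle):
        yield {key: (None if value in ("", "NA") else value)
               for key, value in row.items()}

rows = list(read_rows(io.StringIO(SAMPLE)))
```

To apply this to a real file, pass an open file handle instead of the in-memory sample.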

------------------------------------------------------------------------

# File descriptions

## Shared reference files

The following files are used in multiple analysis scripts.

------------------------------------------------------------------------

## `DAVID2entrez_geneID.txt`

Mapping file used to convert **UniProt protein IDs to Entrez gene IDs**
using the DAVID Bioinformatics Resources.

Source: https://davidbioinformatics.nih.gov/

**Columns**

  Column        Description
  ------------- ---------------------------
  `From`        UniProt protein accession
  `To`          Entrez Gene ID
  `Gene.Name`   Gene name annotation

**Species:** *Cimex lectularius*
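
The following is a minimal sketch of turning the mapping into a lookup table in Python. It assumes the file is tab-delimited with a header row, which is not stated above, and it stores a set of Entrez IDs per accession in case an accession maps to more than one gene. The function name `load_mapping` and the example accessions are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; the delimiter is an assumption.
SAMPLE = (
    "From\tTo\tGene.Name\n"
    "A0A000\t100001\texample gene\n"
    "A0A001\t100001\texample gene\n"
)

def load_mapping(handle):
    """Return {UniProt accession: set of Entrez gene IDs}."""
    mapping = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["To"] not in ("", "NA"):  # skip unmapped accessions
            mapping[row["From"]].add(row["To"])
    return dict(mapping)

mapping = load_mapping(io.StringIO(SAMPLE))
```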

------------------------------------------------------------------------

## `cimex_signal_peps_combined.rds`

List of *Cimex lectularius* proteins predicted to contain **signal
peptides**, generated by combining predictions from:

-   Phobius
-   SignalP 6.0

The file contains UniProt protein IDs predicted to encode secreted
proteins.

------------------------------------------------------------------------

## `blast_chm_locations.csv`

Chromosomal location of each gene inferred from **BLASTn searches
against the *Cimex lectularius* chromosome-level genome assembly (Miles et al. 2025)**. 

**Columns**

  Column      Description
  ----------- -----------------------------------------
  `gene_id`   Entrez gene ID
  `chm`       Chromosome on which the gene is located

The *Cimex lectularius* karyotype is:

-   **Males:** 2n = 26 + X₁X₂Y
-   **Females:** 2n = 26 + X₁X₁X₂X₂

------------------------------------------------------------------------

## `codeml_model0_full_results.csv`

Evolutionary rate estimates calculated using **PAML codeml (model 0)**.

**Columns**

  Column     Description
  ---------- ------------------------------------------
  `GeneID`   Entrez gene ID
  `t`        Total branch length across the phylogeny
  `S`        Number of synonymous sites
  `N`        Number of nonsynonymous sites
  `dN`       Nonsynonymous substitution rate
  `dS`       Synonymous substitution rate
  `omega`    dN/dS ratio
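
Because `omega` is a ratio, it is undefined when `dS` is zero. As a hedged sketch (not the authors' filtering procedure), a Python helper for recomputing the ratio from the `dN` and `dS` columns might guard against that case as follows. The function name `omega_ratio` is hypothetical.

```python
def omega_ratio(dn, ds):
    """Return dN/dS, or None when dS is zero and the ratio is undefined."""
    if ds == 0:
        return None
    return dn / ds
```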

------------------------------------------------------------------------

## `N0.tsv`

Hierarchical Orthogroup assignments generated using **OrthoFinder**.

Input consisted of protein FASTA files for each species included in the
analysis.

**Columns**

  -----------------------------------------------------------------------
  Column                              Description
  ----------------------------------- -----------------------------------
  `HOG`                               Hierarchical Orthogroup identifier

  `OG`                                Orthogroup identifier (deprecated)

  `Gene Tree Parent Clade`            Phylogenetic clade defining the
                                      orthogroup
  -----------------------------------------------------------------------

Species included in the analysis:

-   Acyrthosiphon
-   Aedes
-   Callosobruchus
-   Chemipterus
-   Cimex
-   Drosophila
-   Gryllus
-   Rhodnius
-   Triatoma

------------------------------------------------------------------------

## `FlyBaseAllGenes_2_Uniprot_2025_05_07.tsv`

Gene ID conversion table downloaded from **FlyBase** for *Drosophila
melanogaster*.

Source: https://flybase.org/

**Columns**

  Column                Description
  --------------------- ---------------------------
  `From`                FlyBase gene identifier
  `Entry`               UniProt protein accession
  `Protein names`       Protein name annotation
  `Gene Names`          Gene symbol
  `Length`              Protein length
  `Gene Ontology IDs`   Associated GO terms

------------------------------------------------------------------------

## `cimex_go_ids.rds`

Gene Ontology (GO) annotations for *Cimex lectularius* proteins
downloaded from **UniProt**.

**Columns**

  Column        Description
  ------------- --------------------------------------------
  `Accession`   UniProt protein ID
  `GOID`        Semicolon-separated list of GO identifiers
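
A minimal Python sketch of splitting a `GOID` value into individual terms is shown below. It assumes the `.rds` object has first been exported from R to text (or read with a third-party reader), and that terms may or may not be followed by a space after the semicolon. The function name `split_go_ids` is hypothetical.

```python
def split_go_ids(goid_field):
    """Split a semicolon-separated GOID string into a list of clean IDs."""
    if not goid_field or goid_field == "NA":
        return []  # missing annotation
    return [term.strip() for term in goid_field.split(";") if term.strip()]
```

For example, `split_go_ids("GO:0005576; GO:0006508")` returns the two identifiers as separate list elements.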

------------------------------------------------------------------------

# Reproductive biology datasets

## `cimex_secretome.csv`

List of candidate **seminal fluid proteins (SFPs)** identified through
proteomic analysis of male reproductive tissues (Garlovsky et al., in
preparation).

**Columns**

  Column      Description
  ----------- ---------------------------------------------
  `Protein`   UniProt protein ID
  `only_in`   Tissue(s) in which the protein was detected
  `bias`      Tissue expression bias

------------------------------------------------------------------------

## `cimex_Matings_BM_2024.csv`

Copulation duration data for individual females.

**Columns**

  Column           Description
  ---------------- -------------------------------------------
  `ID`             Unique female identifier
  `Date`           Date of mating observation
  `Time_in`        Time the mating pair was introduced
  `Cop_start`      Start time of copulation
  `Cop_end`        End time of copulation
  `Cop_duration`   Copulation duration
  `Treatment`      Experimental treatment assigned to female
  `Notes`          Additional observations
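
As an illustration of how `Cop_duration` relates to `Cop_start` and `Cop_end`, the Python sketch below computes the elapsed time between two clock times. The `HH:MM:SS` time format and the output unit (seconds) are assumptions; neither is specified above, so adjust `TIME_FORMAT` to match the actual file.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S"  # assumption: the real file may use another format

def duration_seconds(cop_start, cop_end):
    """Return the seconds elapsed between two same-day clock times."""
    start = datetime.strptime(cop_start, TIME_FORMAT)
    end = datetime.strptime(cop_end, TIME_FORMAT)
    return (end - start).total_seconds()
```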

------------------------------------------------------------------------

## `sperm_pres.csv`

Presence/absence observations of sperm in female reproductive tissues.

**Columns**

  -----------------------------------------------------------------------
  Column                              Description
  ----------------------------------- -----------------------------------
  `Ind_ID`                            Individual female identifier

  `Timepoint`                         Dissection timepoint (`unmated`,
                                      `0h`, `1h`, `3h`, `6h`, `24h`)

  `tissue`                            Tissue examined (`Mesospermalege`
                                      or `Lower reproductive tract`)

  `value`                             1 = sperm present, 0 = sperm absent
  -----------------------------------------------------------------------
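
One way to summarise this file is the proportion of females with sperm present per timepoint and tissue. The Python sketch below is an illustration under that assumption, not the analysis used in the manuscript. The function name `proportion_with_sperm` and the example rows are made up.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; column names follow the table above.
SAMPLE = """Ind_ID,Timepoint,tissue,value
A1,0h,Mesospermalege,1
A2,0h,Mesospermalege,0
A1,0h,Lower reproductive tract,0
"""

def proportion_with_sperm(handle):
    """Return {(timepoint, tissue): fraction of females with sperm present}."""
    present = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(handle):
        if row["value"] in ("", "NA"):
            continue  # skip missing observations
        key = (row["Timepoint"], row["tissue"])
        present[key] += int(row["value"])
        total[key] += 1
    return {key: present[key] / total[key] for key in total}

props = proportion_with_sperm(io.StringIO(SAMPLE))
```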

------------------------------------------------------------------------

## `repeated_counts.csv`

Blind replicate sperm counts performed on microscopy images.

**Columns**

  Column        Description
  ------------- -------------------------
  `photo_id`    Unique image identifier
  `count_one`   First sperm count
  `count_two`   Second sperm count
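
Agreement between the two counts can be checked in several ways. The Python sketch below computes a Pearson correlation and the mean absolute difference between `count_one` and `count_two`. This is one possible check and is not necessarily the repeatability measure used in the manuscript. The function name `count_agreement` and the example counts are hypothetical.

```python
from math import sqrt

def count_agreement(pairs):
    """Return (Pearson r, mean absolute difference) for paired counts."""
    first = [a for a, _ in pairs]
    second = [b for _, b in pairs]
    n = len(pairs)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    r = cov / sqrt(var_a * var_b)
    mad = sum(abs(a - b) for a, b in pairs) / n
    return r, mad

# Made-up counts for illustration only.
r, mad = count_agreement([(10, 12), (20, 19), (30, 33)])
```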

------------------------------------------------------------------------

## `sperm_counts_noduplicates.csv`

Sperm counts in the mesospermalege at timepoints after mating.

**Columns**

  Column          Description
  --------------- ------------------------------------------------------
  `ID`            Unique female identifier
  `time_point`    Dissection timepoint (`0h`, `1h`, `3h`, `6h`, `24h`)
  `sperm_count`   Number of sperm counted in image
  `time_hours`    Timepoint in hours
  `sperm_ml`      Estimated total sperm in the mesospermalege

------------------------------------------------------------------------

# RNA-seq expression datasets

## `sb_count_matrix.csv`

Gene count matrix generated from RNA‑seq data using the **nf-core/rnaseq
pipeline** and imported into R using **tximport**.

Samples include unmated male and female tissues.

**Columns**

  Column                   Description
  ------------------------ --------------------------------
  `gene_id`                Entrez gene ID
  `SEX_TISSUE-REPLICATE`   Gene counts for each condition

Tissue codes:

  Code   Tissue
  ------ --------------------------
  H      Head
  L      Lower reproductive tract
  M      Mesospermalege
  O      Ovaries
  T      Testes
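
The sample column names can be split into their components with a regular expression. The Python sketch below assumes names such as `F_M-1`, where the sex code precedes the underscore; the sex codes themselves are not listed above, so `F` is a hypothetical example. Columns that do not match the pattern, such as `gene_id`, return `None`.

```python
import re

TISSUES = {
    "H": "Head",
    "L": "Lower reproductive tract",
    "M": "Mesospermalege",
    "O": "Ovaries",
    "T": "Testes",
}

# SEX_TISSUE-REPLICATE, e.g. the hypothetical name "F_M-1".
PATTERN = re.compile(r"^(?P<sex>[^_]+)_(?P<tissue>[HLMOT])-(?P<replicate>.+)$")

def parse_sample(column):
    """Return sex, tissue name, and replicate for a sample column, else None."""
    match = PATTERN.match(column)
    if match is None:
        return None
    info = match.groupdict()
    info["tissue"] = TISSUES[info["tissue"]]
    return info
```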

------------------------------------------------------------------------

## `salmon.merged.gene_tpm.tsv`

Gene expression values in **Transcripts Per Million (TPM)**.

**Columns**

  Column                   Description
  ------------------------ ----------------------------
  `gene_id`                Entrez gene ID
  `gene_name`              Gene name
  `SEX_TISSUE_REPLICATE`   TPM values for each sample

------------------------------------------------------------------------

## `txi_pmr.rds`

Time‑course RNA‑seq dataset for mesospermalege and lower reproductive
tract samples.

Generated from **nf-core/rnaseq** and imported into R using
**tximport**.

The R object contains:

-   abundances: estimated gene abundances
-   counts: estimated gene counts
-   lengths: effective transcript lengths

Sample names follow the format:

`sampleID-timepoint-organ-replicate`
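
A minimal Python sketch for splitting sample names into their four fields is given below. It assumes none of the fields contains a hyphen. The example name `S01-3h-msl-2` and the helper `parse_pmr_name` are hypothetical.

```python
from typing import NamedTuple

class Sample(NamedTuple):
    sample_id: str
    timepoint: str
    organ: str
    replicate: str

def parse_pmr_name(name):
    """Split 'sampleID-timepoint-organ-replicate' into a Sample tuple."""
    parts = name.split("-")
    if len(parts) != 4:
        raise ValueError(f"unexpected sample name: {name!r}")
    return Sample(*parts)

example = parse_pmr_name("S01-3h-msl-2")  # hypothetical sample name
```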

------------------------------------------------------------------------

## `metadata_pmr.rds`

Metadata describing the experimental design for the RNA‑seq time‑course
experiment.

------------------------------------------------------------------------

# Protein interaction predictions

## `sfp_msl_results.csv`

## `lrt_sfp_results.csv`

Predicted protein-protein interactions generated using **AlphaFold3
complex structure predictions**.

Interactions were predicted between:

-   female reproductive tract proteins with predicted signal peptides
-   candidate seminal fluid proteins (SFPs)

Signal peptides were trimmed prior to structure prediction.

Reference: Abramson et al. (2024), *Nature*.

**Columns**

  -----------------------------------------------------------------------
  Column                              Description
  ----------------------------------- -----------------------------------
  `sfp_prot`                          SFP protein identifier

  `xxx_prot`                          Female protein identifier, where
                                      `xxx` is `msl` (mesospermalege) or
                                      `lrt` (lower reproductive tract)

  `iptm`                              Interface predicted TM-score

  `ptm`                               Predicted TM-score

  `ranking_score`                     AlphaFold model ranking score

  `fraction_disordered`               Fraction of predicted disordered
                                      residues

  `has_clash`                         Whether steric clashes were
                                      detected

  `chain_iptm`                        Chain-specific ipTM score

  `chain_ptm`                         Chain-specific pTM score

  `chain_pair_iptm`                   Pairwise interface score

  `chain_pair_pae_min`                Minimum predicted aligned error
                                      between chains

  `pcomb`                             Concatenated protein pair
                                      identifier
  -----------------------------------------------------------------------
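
As an illustration of how these columns might be used, the Python sketch below keeps protein pairs whose `iptm` meets a caller-supplied cutoff and that have no reported clash. The cutoff of 0.8 in the example is arbitrary and is not the threshold used in the manuscript. The encoding of `has_clash` as `1`, `1.0`, or `true` is an assumption, as are the function name `confident_pairs` and the example rows.

```python
import csv
import io

# Made-up rows for illustration only; column names follow the table above.
SAMPLE = """sfp_prot,msl_prot,iptm,has_clash
P1,Q1,0.85,0.0
P2,Q2,0.40,0.0
P3,Q3,0.90,1.0
"""

def confident_pairs(handle, min_iptm):
    """Return (SFP, female protein) pairs passing the ipTM and clash filters."""
    reader = csv.DictReader(handle)
    # The female column is msl_prot or lrt_prot, depending on the file.
    female_col = next(
        col for col in reader.fieldnames
        if col.endswith("_prot") and col != "sfp_prot"
    )
    kept = []
    for row in reader:
        if row["iptm"] in ("", "NA"):
            continue  # skip rows with a missing score
        clash = row["has_clash"].strip().lower() in ("1", "1.0", "true")
        if float(row["iptm"]) >= min_iptm and not clash:
            kept.append((row["sfp_prot"], row[female_col]))
    return kept

pairs = confident_pairs(io.StringIO(SAMPLE), min_iptm=0.8)
```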

------------------------------------------------------------------------

# Code and software

All analysis code is available on GitHub:

https://github.com/MartinGarlovsky/Bedbug_RNAseq

------------------------------------------------------------------------

# Data access

Raw RNA‑seq reads are available from the **Sequence Read Archive
(SRA)**.

**BioProject accession:** PRJNA1427960
