MongoDB Schema Design: GDC, HGNC, UniProt Collections

by Alex Johnson

This article outlines the design of a MongoDB schema for integrating data from the Genomic Data Commons (GDC), HGNC (HUGO Gene Nomenclature Committee), and UniProt databases. This schema will serve as the foundation for importing, querying, and analyzing complex biological data, enabling researchers to gain deeper insights into genomic and proteomic landscapes. Let's dive into the design considerations and the structure of the collections.

Introduction

The goal is to design a conceptual and logical MongoDB database model for the T1 delivery: the collections, fields, relationships, and nesting levels for the data already downloaded from GDC, HGNC, and UniProt. This design will be the reference for all import scripts and for the LaTeX report, so the schema has to support efficient data storage, retrieval, and analysis.

Branch Strategy

  • Create the branch from dev: feature/t1-mongo-schema-design

Input Files (Read-Only)

  • config/data_config.yaml
  • GDC:
    • data/gdc/gdc_manifest_tcga_lgg.tsv
    • data/gdc/gdc_file_metadata_tcga_lgg.tsv
    • data/gdc/gdc_genes_tcga_lgg.tsv
    • A sample of the files matching data/gdc/star_counts/*.rna_seq.augmented_star_gene_counts.tsv
  • HGNC:
    • data/hgnc/hgnc_complete_set.tsv
  • UniProt:
    • data/uniprot/uniprot_mapping_tcga_lgg.tsv
    • data/uniprot/uniprot_metadata_tcga_lgg.tsv
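
Before settling on field names, it helps to see which columns each TSV actually provides and how often they are filled. The following is a minimal sketch, assuming the files are tab-separated with a header row; the function name summarize_tsv and the in-memory sample are illustrative and not part of the project. If a file starts with comment lines before the header, those would need to be skipped first.

```python
import csv
import io
from typing import Dict, TextIO


def summarize_tsv(handle: TextIO, max_rows: int = 1000) -> Dict[str, int]:
    """Map each column name to the number of non-empty values seen.

    Reads at most `max_rows` data rows so that large files are only sampled.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row_number, row in enumerate(reader):
        if row_number >= max_rows:
            break
        for name, value in row.items():
            # Extra cells beyond the header land under the key None; skip them.
            if name in counts and value not in (None, ""):
                counts[name] += 1
    return counts


# Illustrative in-memory data. A real run would pass an open file instead, e.g.
# open("data/hgnc/hgnc_complete_set.tsv", newline="", encoding="utf-8").
sample = io.StringIO(
    "hgnc_id\tsymbol\talias_symbol\n"
    "X:1\tAAA\t\n"
    "X:2\tBBB\tB1|B2\n"
)
summary = summarize_tsv(sample)
```

Columns that are rarely filled are candidates for optional fields, while columns that are always present are candidates for link keys.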

Files to Create/Modify

  • docs/t1_mongo_schema_overview.md (main schema design document; add diagrams and JSON examples).
  • Drafting file in Overleaf (?)
  • Optional: auxiliary schema files in JSON or YAML, e.g.:
    • docs/t1_mongo_schema_gdc.json
    • docs/t1_mongo_schema_hgnc.json
    • docs/t1_mongo_schema_uniprot.json

Objectives

  • Clearly define:
    • What collections MongoDB will have (minimum GDC, genes/HGNC, proteins/UniProt).
    • What fields and types each collection has.
    • What keys are used to link collections (project_id, case_id, file_id, ensembl_gene_id, hgnc_id, uniprot_id).
    • How to guarantee ≥3 levels of nesting per collection, for example:
      • GDC: project → case → sample/file → expression_summary
      • HGNC: gene → identifiers → external_links → projects
      • UniProt: protein → genes → go_annotations → comments
  • Justify the decisions of embedding vs. references and possible indexes, thinking about the queries planned for T2/T3.
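
The link keys listed above can be written down as index specifications early, so that the T2/T3 queries have something concrete to be checked against. This is a sketch under assumptions: it uses the collection names proposed later in this article, the dotted field paths assume one possible nesting layout, and apply_indexes is a hypothetical helper. It relies only on the pymongo convention that db[name] returns a collection whose create_index method takes a list of (field, direction) pairs plus keyword options such as unique.

```python
from typing import Any, Dict, List, Tuple

ASCENDING = 1  # same value as pymongo.ASCENDING

# (collection name, key list, options). The dotted paths are assumptions
# and must be adjusted to the final document layout.
INDEX_SPECS: List[Tuple[str, List[Tuple[str, int]], Dict[str, Any]]] = [
    ("gdc_data", [("project_id", ASCENDING), ("case.case_id", ASCENDING)], {}),
    ("gdc_data", [("case.samples.files.file_id", ASCENDING)], {}),
    ("hgnc_genes", [("hgnc_id", ASCENDING)], {"unique": True}),
    ("hgnc_genes", [("identifiers.ensembl_gene_id", ASCENDING)], {}),
    ("uniprot_proteins", [("uniprot_id", ASCENDING)], {"unique": True}),
]


def apply_indexes(db: Any) -> List[str]:
    """Create every index in INDEX_SPECS on a pymongo-style database object.

    Returns whatever each create_index call returns (pymongo returns the
    index name as a string).
    """
    created = []
    for collection_name, keys, options in INDEX_SPECS:
        created.append(db[collection_name].create_index(keys, **options))
    return created
```

With a live connection, db would be a database obtained from pymongo's MongoClient. The unique option is only appropriate if each import guarantees one document per hgnc_id or uniprot_id.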

Detailed Design Considerations for MongoDB Collections

When designing a MongoDB schema for GDC, HGNC, and UniProt data, several key considerations come into play. These include the structure of each collection, the relationships between them, and the optimization strategies for querying and indexing. A well-thought-out schema is critical for ensuring data integrity, efficient data retrieval, and scalability of the database. Let's delve into the specifics of each collection and how they interact.

1. GDC Collection Design

The GDC collection will store data related to cancer genomics from the Genomic Data Commons. This includes information about projects, cases, samples, files, and gene expression summaries. The design must accommodate the hierarchical nature of GDC data, ensuring that relationships between projects, cases, and files are clearly defined. Proper indexing and embedding strategies are crucial for optimizing query performance, particularly when analyzing large datasets.

  • Collection Name: gdc_data
  • Fields: Includes project_id, case_id, sample_id, file_id, and expression_summary.
  • Data Types: Each field should have an appropriate data type, such as strings for IDs, numbers for expression values, and nested objects for summaries.
  • Nesting Levels: project → case → sample/file → expression_summary
  • Relationships: Linked to other collections via ensembl_gene_id.
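
One possible shape for a gdc_data document is sketched below. It assumes one document per case, which keeps individual documents small; that choice, the project value (inferred from the input file names), the tpm field, and every identifier value are placeholders rather than decisions taken by the project. The helper linked_gene_ids is hypothetical and only shows where the ensembl_gene_id link key lives.

```python
import json
from typing import Any, Dict, List

# All identifier values are placeholders, not real GDC records.
gdc_case_doc: Dict[str, Any] = {
    "project_id": "TCGA-LGG",
    "case": {
        "case_id": "CASE-PLACEHOLDER-1",
        "samples": [
            {
                "sample_id": "SAMPLE-PLACEHOLDER-1",
                "files": [
                    {
                        "file_id": "FILE-PLACEHOLDER-1",
                        "expression_summary": {
                            "n_genes": 2,
                            "genes": [
                                {"ensembl_gene_id": "ENSG-PLACEHOLDER-1", "tpm": 12.5},
                                {"ensembl_gene_id": "ENSG-PLACEHOLDER-2", "tpm": 0.0},
                            ],
                        },
                    }
                ],
            }
        ],
    },
}


def linked_gene_ids(doc: Dict[str, Any]) -> List[str]:
    """Collect every ensembl_gene_id in the document, in order, without duplicates."""
    seen: List[str] = []
    for sample in doc["case"]["samples"]:
        for file_entry in sample["files"]:
            for gene in file_entry["expression_summary"]["genes"]:
                if gene["ensembl_gene_id"] not in seen:
                    seen.append(gene["ensembl_gene_id"])
    return seen


# The document is plain JSON, so it can be pasted into the schema overview.
print(json.dumps(gdc_case_doc, indent=2))
```

Embedding full per-gene counts for every file would make documents very large, which is why only a summary is nested here.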

2. HGNC Collection Design

The HGNC collection will focus on gene nomenclature data from the HUGO Gene Nomenclature Committee. This includes gene symbols, names, identifiers, and links to external databases. The design must support efficient querying of gene-related information, such as retrieving all aliases for a given gene or finding genes associated with specific projects. Embedding and referencing strategies need careful consideration to balance data redundancy and query performance.

  • Collection Name: hgnc_genes
  • Fields: Includes hgnc_id, symbol, name, ensembl_gene_id, and alias_symbols.
  • Data Types: Strings for symbols and names, arrays for aliases, and an object for external links.
  • Nesting Levels: gene → identifiers → external_links → projects
  • Relationships: Linked to other collections via hgnc_id and ensembl_gene_id.
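
The nesting chain above can be illustrated by turning one TSV row into a document. This is a sketch under assumptions: the column names alias_symbol and uniprot_ids and the pipe-delimited, optionally quoted cell format are assumptions about hgnc_complete_set.tsv that should be checked against the file header. Placing ensembl_gene_id inside identifiers is one possible reading of the chain, and both helper functions are hypothetical.

```python
from typing import Any, Dict, List, Optional


def split_multi(value: Optional[str]) -> List[str]:
    """Split a pipe-delimited TSV cell into a list, dropping quotes and blanks."""
    cleaned = (value or "").strip().strip('"')
    return [part.strip() for part in cleaned.split("|") if part.strip()]


def hgnc_row_to_doc(row: Dict[str, str], project_ids: List[str]) -> Dict[str, Any]:
    """Build a nested gene document: gene -> identifiers -> external_links -> projects."""
    return {
        "hgnc_id": row["hgnc_id"],
        "symbol": row["symbol"],
        "name": row["name"],
        "alias_symbols": split_multi(row.get("alias_symbol")),
        "identifiers": {
            # Empty cells become None so that missing IDs are explicit.
            "ensembl_gene_id": row.get("ensembl_gene_id") or None,
            "external_links": {
                "uniprot_ids": split_multi(row.get("uniprot_ids")),
                "projects": [{"project_id": pid} for pid in project_ids],
            },
        },
    }


# Placeholder row, not a real HGNC record.
example_row = {
    "hgnc_id": "HGNC:PLACEHOLDER-1",
    "symbol": "GENE1",
    "name": "placeholder gene 1",
    "alias_symbol": '"G1|G1A"',
    "ensembl_gene_id": "ENSG-PLACEHOLDER-1",
    "uniprot_ids": "",
}
hgnc_doc = hgnc_row_to_doc(example_row, ["TCGA-LGG"])
```

Storing aliases as an array rather than a delimited string lets a single equality query on alias_symbols match any alias.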

3. UniProt Collection Design

The UniProt collection will contain protein data from the Universal Protein Resource. This includes protein sequences, names, functions, and annotations. The design must accommodate the rich and diverse nature of UniProt data, supporting efficient retrieval of protein-related information such as GO annotations, comments, and associated genes. Indexing strategies should optimize queries for protein function and annotation analysis.

  • Collection Name: uniprot_proteins
  • Fields: Includes uniprot_id, protein_name, gene_names, go_annotations, and comments.
  • Data Types: Strings for names and sequences, arrays for gene names and annotations, and nested objects for detailed information.
  • Nesting Levels: protein → genes → go_annotations → comments
  • Relationships: Linked to other collections via uniprot_id.
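
A uniprot_proteins document following the chain above literally might look like the sketch below. Nesting comments under each GO annotation is just one reading of that chain, and a flatter layout with comments at the protein level is equally plausible. All values and the sub-field names (gene_name, go_id, aspect, type, text) are placeholders, and proteins_with_go is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List

# Placeholder values only, not a real UniProt entry.
uniprot_doc: Dict[str, Any] = {
    "uniprot_id": "P-PLACEHOLDER-1",
    "protein_name": "Placeholder protein 1",
    "genes": [
        {
            "gene_name": "GENE1",
            "go_annotations": [
                {
                    "go_id": "GO:PLACEHOLDER-1",
                    "aspect": "molecular_function",
                    "comments": [
                        {"type": "FUNCTION", "text": "Placeholder comment."}
                    ],
                }
            ],
        }
    ],
}


def proteins_with_go(docs: Iterable[Dict[str, Any]], go_id: str) -> List[str]:
    """Return the uniprot_id of every document annotated with `go_id`.

    In MongoDB, a filter on the dotted path "genes.go_annotations.go_id"
    would express the same condition, since dot notation looks inside arrays.
    """
    matches: List[str] = []
    for doc in docs:
        for gene in doc.get("genes", []):
            annotations = gene.get("go_annotations", [])
            if any(item.get("go_id") == go_id for item in annotations):
                matches.append(doc["uniprot_id"])
                break  # one match per protein is enough
    return matches


hits = proteins_with_go([uniprot_doc], "GO:PLACEHOLDER-1")
```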

Tasks

  • [ ] Read the TSV files and understand the minimum content that should appear in each collection.
  • [ ] Propose names and structure of the collections.
  • [ ] Define examples of JSON documents for each collection (with ≥3 nesting levels).
  • [ ] Document in docs/t1_mongo_schema_overview.md:
    • Description of each collection.
    • Relationships between collections (schemas/diagrams).
    • Justification of modeling decisions.
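
The ≥3 nesting levels requirement for the example documents can be checked mechanically. The sketch below assumes a counting convention that the task description does not spell out: the root document is level 0, every embedded object adds one level, and arrays are transparent, so an array of sub-documents counts once. Both function names are illustrative.

```python
from typing import Any


def nesting_depth(value: Any) -> int:
    """Count nested objects along the deepest path, with the root counted as 1."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(child) for child in value.values()), default=0)
    if isinstance(value, list):
        # Arrays do not add a level by themselves.
        return max((nesting_depth(child) for child in value), default=0)
    return 0


def meets_nesting_requirement(doc: dict, min_levels: int = 3) -> bool:
    """True if the document has at least `min_levels` levels below the root."""
    return nesting_depth(doc) - 1 >= min_levels


# Skeleton of the GDC chain: case -> sample -> file -> expression_summary.
example = {
    "project_id": "P",
    "case": {"samples": [{"files": [{"expression_summary": {"n_genes": 0}}]}]},
}
depth = nesting_depth(example)
ok = meets_nesting_requirement(example)
```

Running such a check over every JSON example in docs/ would catch a document that falls short of the requirement before it reaches the report.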

Embedding vs. Referencing: A Critical Decision

The decision between embedding and referencing is fundamental to the design of the MongoDB schema. Embedding involves including related data within a single document, while referencing involves storing related data in separate documents and using links to connect them. Each approach has its trade-offs:

  • Embedding: This approach can improve read performance by reducing the number of queries needed to retrieve related data. However, it duplicates data and enlarges documents, and MongoDB caps a single BSON document at 16 MB. Embedding is best suited for one-to-one or one-to-many relationships where the embedded data is read together with its parent and the "many" side stays bounded.
  • Referencing: This approach avoids data redundancy and allows for more flexible data modeling. However, resolving a reference takes an additional query or a $lookup aggregation stage, which can slow reads. Referencing is best suited for many-to-many relationships or when the related data is not always needed.

For the GDC, HGNC, and UniProt schema, a combination of embedding and referencing may be the most effective approach. For example, gene symbols and names from HGNC could be embedded within the GDC and UniProt collections to improve query performance, while more detailed gene information could be referenced in the HGNC collection.
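
The combined approach can be sketched in plain Python: a small summary (symbol and name) is copied into the expression entry, while hgnc_id is kept as the reference to the full record in the HGNC collection. The lookup table, the locus_group field, and the function name with_gene_summary are illustrative assumptions, and all values are placeholders.

```python
from typing import Any, Dict

# Stand-in for the HGNC collection, keyed by ensembl_gene_id (placeholder data).
hgnc_by_ensembl: Dict[str, Dict[str, Any]] = {
    "ENSG-PLACEHOLDER-1": {
        "hgnc_id": "HGNC:PLACEHOLDER-1",
        "symbol": "GENE1",
        "name": "placeholder gene 1",
        "locus_group": "placeholder",  # detail that stays in HGNC only
    }
}


def with_gene_summary(
    entry: Dict[str, Any], genes_by_ensembl: Dict[str, Dict[str, Any]]
) -> Dict[str, Any]:
    """Return a copy of an expression entry with a small embedded gene summary.

    Only hgnc_id (the reference), symbol, and name are copied; everything
    else remains in the HGNC record and is fetched on demand.
    """
    enriched = dict(entry)
    gene = genes_by_ensembl.get(entry["ensembl_gene_id"])
    if gene is not None:
        enriched["gene"] = {
            "hgnc_id": gene["hgnc_id"],
            "symbol": gene["symbol"],
            "name": gene["name"],
        }
    return enriched


entry = {"ensembl_gene_id": "ENSG-PLACEHOLDER-1", "tpm": 12.5}
enriched = with_gene_summary(entry, hgnc_by_ensembl)
```

The cost of this pattern is that the copied symbol and name must be refreshed whenever the HGNC data is re-imported, which is worth documenting as a design decision.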

Expected Outcome

A document in docs/ that can be used almost directly as a T1 section in the LaTeX report, serving as a specification for the import scripts of Issues 2 and 3.

Additional Notes

  • Always work in the feature/t1-mongo-schema-design branch.
  • All commits must include the issue number in the message (e.g., [#XX] Define genes collection structure).
  • Clearly document any non-trivial design decisions.

Conclusion

Designing an effective MongoDB schema for GDC, HGNC, and UniProt data requires careful consideration of data structure, relationships, and query patterns. By defining clear collection structures, utilizing appropriate indexing strategies, and making informed decisions about embedding vs. referencing, it is possible to create a database that supports efficient data storage, retrieval, and analysis. This well-designed schema will empower researchers to explore complex biological data and gain valuable insights into cancer genomics and proteomic landscapes. Remember to document all decisions and continue refining the schema as new requirements and data emerge.

For further reading on MongoDB schema design, see the official MongoDB documentation, "MongoDB Schema Design Best Practices".