Cancer Mutation Data for AI & Pharma for Model Training

Are you building LLMs for healthcare, drug discovery, or cancer diagnostics?
Looking for curated, high-impact datasets from top oncology research?

We’re offering structured, ready-to-train JSON datasets built from clinical cancer genomics covering:

✅ Rate-limiting mutations & somatic mutational pathways
✅ Tumor progression maps and genetic bottlenecks
✅ Inherited cancer syndromes (BRCA1/2, APC, MSH2, MLH1, TP53)
✅ Minimal residual disease (MRD) detection markers
✅ Mutation–drug response mapping & therapeutic targets
✅ Prognostic markers & actionable gene pathways

Who Is This For?

🔹 AI startups building healthcare foundation models
🔹 LLM developers fine-tuning for biomedical Q&A
🔹 Biotech companies working on diagnostics or companion therapeutics
🔹 Pharma teams targeting precision oncology
🔹 Academic labs and bioinformatics platforms

What’s Inside?

Over 150 curated prompt–completion pairs based on clinical oncology & genetic literature
Structured insights on gene–mutation–disease–therapy links
Format: JSON
Ready for integration into LLMs, clinical decision tools, or drug-target databases

Prompt Sample

Rate-Limiting Mutations & Cancer Genetics Prompts

[
  {
    "prompt": "What are rate-limiting mutations in cancer development?",
    "completion": "Rate-limiting mutations are rare genetic alterations that confer a significant proliferative and survival advantage to cells, enabling clonal expansion and increasing the risk of malignant transformation."
  },
  {
    "prompt": "Why are most mutations in somatic cells not considered rate-limiting?",
    "completion": "Most somatic mutations have little or no positive effect on cell growth and do not significantly contribute to cancer progression."
  },
  {
    "prompt": "What happens once a rate-limiting mutation occurs in a tissue?",
    "completion": "It allows the expansion of a precancerous cell population, increasing the chance of additional somatic mutations that can lead to full malignancy."
  },
  {
    "prompt": "How does the two-hit hypothesis explain inherited cancer syndromes?",
    "completion": "The two-hit hypothesis states that individuals with an inherited mutation in one allele of a tumor suppressor gene require only a second somatic mutation in the remaining allele to initiate cancer, making cancer more likely and earlier in onset."
  },
  {
    "prompt": "What gene is involved in retinoblastoma and how does it demonstrate the two-hit hypothesis?",
    "completion": "The RB1 gene is involved; inherited inactivation of one RB1 allele followed by somatic inactivation of the second allele initiates retinoblastoma, especially in familial cases where it often presents bilaterally and at an earlier age."
  },
  {
    "prompt": "Why does sporadic retinoblastoma typically present later and as a single tumor compared with inherited forms?",
    "completion": "In sporadic cases, both RB1 alleles must be randomly inactivated in the same somatic cell, a statistically rare event, leading to later onset and typically unifocal tumors."
  },
  {
    "prompt": "Which cancers are also explained by the two-hit model beyond retinoblastoma?",
    "completion": "The model also explains hereditary non-polyposis colorectal cancer, hereditary breast and ovarian cancer, neurofibromatosis types 1 and 2, and familial adenomatous polyposis coli."
  },
  {
    "prompt": "Who proposed the two-hit hypothesis and for what purpose?",
    "completion": "Alfred Knudson proposed the two-hit hypothesis to explain the age of onset and bilateral presentation in familial pediatric cancers like retinoblastoma."
  }
]
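
For teams that want to feed these pairs into a fine-tuning pipeline, here is a minimal Python sketch of one way to validate the records and convert them to JSON Lines. It assumes the full dataset is a JSON array using the same "prompt" and "completion" keys as the sample above. The helper names (load_pairs, to_jsonl) and the choice of JSONL as the target format are illustrative assumptions, not part of the dataset's documentation.

```python
import json

# Two records copied from the sample above. The full file is assumed
# (not confirmed) to use the same array-of-objects shape.
SAMPLE = '''[
  {
    "prompt": "Why are most mutations in somatic cells not considered rate-limiting?",
    "completion": "Most somatic mutations have little or no positive effect on cell growth and do not significantly contribute to cancer progression."
  },
  {
    "prompt": "Who proposed the two-hit hypothesis and for what purpose?",
    "completion": "Alfred Knudson proposed the two-hit hypothesis to explain the age of onset and bilateral presentation in familial pediatric cancers like retinoblastoma."
  }
]'''


def load_pairs(text):
    """Parse a JSON array of records and keep only clean prompt-completion pairs.

    Hypothetical helper: raises ValueError if a record lacks either field.
    """
    records = json.loads(text)
    pairs = []
    for i, rec in enumerate(records):
        prompt = str(rec.get("prompt", "")).strip()
        completion = str(rec.get("completion", "")).strip()
        if not prompt or not completion:
            raise ValueError(f"record {i} is missing a prompt or completion")
        pairs.append({"prompt": prompt, "completion": completion})
    return pairs


def to_jsonl(pairs):
    """Serialize each pair as one JSON object per line (JSON Lines)."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)


pairs = load_pairs(SAMPLE)
jsonl = to_jsonl(pairs)
print(jsonl)
```

Validating before conversion catches empty or malformed records early, which is usually cheaper than discovering them midway through a training run.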

Selected Oncogene Mutations in Human Cancer JSON Data

{
  "note": "GTPase = guanosine triphosphatase, EGF = epidermal growth factor, FGF = fibroblast growth factor, SCLC = small-cell carcinoma of the lung, ALL = acute lymphocytic leukaemia, CML = chronic myelogenous leukaemia, APL = acute promyelocytic leukaemia, GDNF = glial-derived neurotrophic factor, HGF = hepatocyte growth factor, ND = not determined.",
  "oncogenes": [
    {
      "gene": "K-RAS",
      "activation_mechanism": "Point mutation",
      "protein_properties": "p21 GTPase",
      "cancer_types": ["Pancreatic", "Colorectal", "Lung (adenocarcinoma)", "Endometrial", "Other carcinomas"],
      "germline_mutations": "ND"
    },
    {
      "gene": "N-RAS",
      "activation_mechanism": "Point mutation",
      "protein_properties": "p21 GTPase",
      "cancer_types": ["Myeloid leukaemia"],
      "germline_mutations": "ND"
    },
    {
      "gene": "H-RAS",
      "activation_mechanism": "Point mutation",
      "protein_properties": "p21 GTPase",
      "cancer_types": ["Bladder"],
      "germline_mutations": "ND"
    },
    {
      "gene": "EGFR (ERB-B)",
      "activation_mechanism": "Amplification",
      "protein_properties": "Growth-factor (EGF) receptor",
      "cancer_types": ["Gliomas", "Squamous and other carcinomas"],
      "germline_mutations": "ND"
    }
  ]
}
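
As one illustration of how the structured records can be used in a drug-target or lookup tool, the Python sketch below builds a cancer-type-to-gene index. It embeds the four records shown above as a self-contained string and assumes the full file uses the same "oncogenes", "gene", and "cancer_types" keys. The function name genes_by_cancer_type is a hypothetical helper, not something shipped with the dataset.

```python
import json
from collections import defaultdict

# The four oncogene records from the excerpt above, trimmed to the two
# fields this example needs. Field names match the excerpt; the layout of
# the full file is an assumption.
EXCERPT = '''{
  "oncogenes": [
    {"gene": "K-RAS", "cancer_types": ["Pancreatic", "Colorectal", "Lung (adenocarcinoma)", "Endometrial", "Other carcinomas"]},
    {"gene": "N-RAS", "cancer_types": ["Myeloid leukaemia"]},
    {"gene": "H-RAS", "cancer_types": ["Bladder"]},
    {"gene": "EGFR (ERB-B)", "cancer_types": ["Gliomas", "Squamous and other carcinomas"]}
  ]
}'''


def genes_by_cancer_type(doc):
    """Invert the gene -> cancer_types records into cancer type -> [genes]."""
    index = defaultdict(list)
    for entry in doc["oncogenes"]:
        for cancer in entry["cancer_types"]:
            index[cancer].append(entry["gene"])
    return dict(index)


doc = json.loads(EXCERPT)
index = genes_by_cancer_type(doc)
for cancer, genes in sorted(index.items()):
    print(f"{cancer}: {', '.join(genes)}")
```

The same inversion pattern would apply to other fields, for example grouping genes by activation_mechanism.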

Why Use Our Dataset?

🧬 Derived from top-tier biomedical research (peer-reviewed & reference-tracked)
📉 Reduces time and cost of data curation for precision medicine tools
🧠 Structured for natural language model training (GPT-compatible)
🧪 Enables deeper insights in drug discovery, gene therapy, and personalized medicine

Let’s Collaborate

Whether you’re training an LLM, validating gene targets, or building the next AI-powered diagnostic tool — this dataset gives you the molecular and clinical depth you need.

Ready to Access the Data?

We provide datasets in JSON format, complete with documentation and prompt samples.

📩 Contact us for samples: contact@ieearc.com or ieearctechnologies@gmail.com

Let’s unlock insights together — one dataset at a time.

Partner with us for exclusive licensing