Cancer Mutation Data for AI & Pharma Model Training
Are you building LLMs for healthcare, drug discovery, or cancer diagnostics?
Looking for curated, high-impact datasets from top oncology research?
This cancer genomics dataset combines mutation profiles with drug discovery intelligence.
We’re offering structured, ready-to-train JSON datasets built from the clinical cancer genomics literature, covering:
✅ Rate-limiting mutations & somatic mutational pathways
✅ Tumor progression maps and genetic bottlenecks
✅ Inherited cancer syndromes (BRCA1/2, APC, MSH2, MLH1, TP53)
✅ Minimal residual disease (MRD) detection markers
✅ Mutation–drug response mapping & therapeutic targets
✅ Prognostic markers & actionable gene pathways
Who Is This For?
🔹 AI startups building healthcare foundation models
🔹 LLM developers fine-tuning for biomedical Q&A
🔹 Biotech companies working on diagnostics or companion therapeutics
🔹 Pharma teams targeting precision oncology
🔹 Academic labs and bioinformatics platforms
What’s Inside?
- Over 150 curated prompt–completion pairs based on clinical oncology & genetic literature
- Structured insights on gene–mutation–disease–therapy links
- Format: JSON
- Ready for integration into LLMs, clinical decision tools, or drug-target databases
Prompt Sample
Rate-Limiting Mutations & Cancer Genetics prompts
[
{
"prompt": "What are rate-limiting mutations in cancer development?",
"completion": "Rate-limiting mutations are rare genetic alterations that confer a significant proliferative and survival advantage to cells, enabling clonal expansion and increasing the risk of malignant transformation."
},
{
"prompt": "Why are most mutations in somatic cells not considered rate-limiting?",
"completion": "Most somatic mutations have little or no positive effect on cell growth and do not significantly contribute to cancer progression."
},
{
"prompt": "What happens once a rate-limiting mutation occurs in a tissue?",
"completion": "It allows the expansion of a precancerous cell population, increasing the chance of additional somatic mutations that can lead to full malignancy."
},
{
"prompt": "How does the two-hit hypothesis explain inherited cancer syndromes?",
"completion": "The two-hit hypothesis states that individuals with an inherited mutation in one allele of a tumor suppressor gene require only a second somatic mutation in the remaining allele to initiate cancer, making cancer more likely and earlier in onset."
},
{
"prompt": "What gene is involved in retinoblastoma and how does it demonstrate the two-hit hypothesis?",
"completion": "The RB1 gene is involved; inherited inactivation of one RB1 allele followed by somatic inactivation of the second allele initiates retinoblastoma, especially in familial cases where it often presents bilaterally and at an earlier age."
},
{
"prompt": "Why does sporadic retinoblastoma typically arise later, and as a single tumor, compared with inherited forms?",
"completion": "In sporadic cases, both RB1 alleles must be inactivated independently in the same somatic cell, a statistically rare event, leading to later onset and typically unilateral, unifocal tumors."
},
{
"prompt": "Which cancers are also explained by the two-hit model beyond retinoblastoma?",
"completion": "The model also explains hereditary non-polyposis colorectal cancer, hereditary breast and ovarian cancer, neurofibromatosis types 1 and 2, and familial adenomatous polyposis coli."
},
{
"prompt": "Who proposed the two-hit hypothesis and for what purpose?",
"completion": "Alfred Knudson proposed the two-hit hypothesis to explain the age of onset and bilateral presentation in familial pediatric cancers like retinoblastoma."
}
]
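Before fine-tuning, records like the ones above can be checked and flattened to one JSON object per line. The following is a minimal sketch, not part of the dataset's own tooling: the helper names `validate_pairs` and `to_jsonl` are hypothetical, the inline text stands in for a delivered file, and the third record is deliberately malformed to show the filter at work.

```python
import json

REQUIRED_KEYS = ("prompt", "completion")

# Two records copied from the sample above, plus one deliberately
# incomplete record (assumption: added here only to exercise the check).
sample_text = """
[
  {"prompt": "Why are most mutations in somatic cells not considered rate-limiting?",
   "completion": "Most somatic mutations have little or no positive effect on cell growth and do not significantly contribute to cancer progression."},
  {"prompt": "Who proposed the two-hit hypothesis and for what purpose?",
   "completion": "Alfred Knudson proposed the two-hit hypothesis to explain the age of onset and bilateral presentation in familial pediatric cancers like retinoblastoma."},
  {"prompt": "Record with no completion"}
]
"""


def validate_pairs(records):
    """Keep only dict records with non-empty string values for both keys."""
    valid = []
    for record in records:
        if not isinstance(record, dict):
            continue
        if all(isinstance(record.get(k), str) and record[k].strip()
               for k in REQUIRED_KEYS):
            valid.append({k: record[k].strip() for k in REQUIRED_KEYS})
    return valid


def to_jsonl(records):
    """Serialize one JSON object per line (JSON Lines)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


pairs = validate_pairs(json.loads(sample_text))
jsonl = to_jsonl(pairs)
print(f"{len(pairs)} valid pairs")
```

Whether a given fine-tuning tool expects exactly the `prompt` and `completion` field names is an assumption to verify against that tool's documentation.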
Oncogene Activation Mechanisms & Cancer Types sample
{
"note": "GTPase = guanosine triphosphatase, EGF = epidermal growth factor, FGF = fibroblast growth factor, SCLC = small-cell carcinoma of the lung, ALL = acute lymphocytic leukaemia, CML = chronic myelogenous leukaemia, APL = acute promyelocytic leukaemia, GDNF = glial-derived neurotrophic factor, HGF = hepatocyte growth factor, ND = not determined.",
"oncogenes": [
{
"gene": "K-RAS",
"activation_mechanism": "Point mutation",
"protein_properties": "p21 GTPase",
"cancer_types": ["Pancreatic", "Colorectal", "Lung (adenocarcinoma)", "Endometrial", "Other carcinomas"],
"germline_mutations": "ND"
},
{
"gene": "N-RAS",
"activation_mechanism": "Point mutation",
"protein_properties": "p21 GTPase",
"cancer_types": ["Myeloid leukaemia"],
"germline_mutations": "ND"
},
{
"gene": "H-RAS",
"activation_mechanism": "Point mutation",
"protein_properties": "p21 GTPase",
"cancer_types": ["Bladder"],
"germline_mutations": "ND"
},
{
"gene": "EGFR (ERB-B)",
"activation_mechanism": "Amplification",
"protein_properties": "Growth-factor (EGF) receptor",
"cancer_types": ["Gliomas", "Squamous and other carcinomas"],
"germline_mutations": "ND"
}
]
}
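Oncogene records of this shape lend themselves to simple lookups, for example by cancer type or by activation mechanism. The sketch below assumes the field names shown in the sample (`gene`, `activation_mechanism`, `cancer_types`) and uses only the four records listed above; the function names `index_by_cancer_type` and `genes_by_mechanism` are hypothetical and not part of any shipped tooling.

```python
import json
from collections import defaultdict

# The four oncogene records from the sample above, trimmed to the
# fields this sketch uses.
table_text = """
{
  "oncogenes": [
    {"gene": "K-RAS", "activation_mechanism": "Point mutation",
     "cancer_types": ["Pancreatic", "Colorectal", "Lung (adenocarcinoma)",
                      "Endometrial", "Other carcinomas"]},
    {"gene": "N-RAS", "activation_mechanism": "Point mutation",
     "cancer_types": ["Myeloid leukaemia"]},
    {"gene": "H-RAS", "activation_mechanism": "Point mutation",
     "cancer_types": ["Bladder"]},
    {"gene": "EGFR (ERB-B)", "activation_mechanism": "Amplification",
     "cancer_types": ["Gliomas", "Squamous and other carcinomas"]}
  ]
}
"""


def index_by_cancer_type(table):
    """Map each lower-cased cancer type to the genes listed for it."""
    index = defaultdict(list)
    for entry in table["oncogenes"]:
        for cancer in entry["cancer_types"]:
            index[cancer.lower()].append(entry["gene"])
    return dict(index)


def genes_by_mechanism(table, mechanism):
    """Return genes whose activation mechanism matches, ignoring case."""
    wanted = mechanism.lower()
    return [entry["gene"] for entry in table["oncogenes"]
            if entry["activation_mechanism"].lower() == wanted]


table = json.loads(table_text)
by_cancer = index_by_cancer_type(table)
print(by_cancer["bladder"])                       # ['H-RAS']
print(genes_by_mechanism(table, "point mutation"))  # ['K-RAS', 'N-RAS', 'H-RAS']
```

Lower-casing the keys is a design choice made here for forgiving lookups; an exact-match index would work equally well if the cancer-type labels are standardized.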
Why Use Our Dataset?
🧬 Derived from top-tier biomedical research (peer-reviewed & reference-tracked)
📉 Reduces time and cost of data curation for precision medicine tools
🧠 Structured as prompt–completion pairs for language model fine-tuning (GPT-compatible)
🧪 Enables deeper insights in drug discovery, gene therapy, and personalized medicine
Let’s Collaborate
Whether you’re training an LLM, validating gene targets, or building the next AI-powered diagnostic tool — this dataset gives you the molecular and clinical depth you need.
Ready to Access the Data?
We provide datasets in JSON format, complete with documentation and prompt samples.
📩 Contact us for samples: contact@ieearc.com or ieearctechnologies@gmail.com
Let’s unlock insights together — one dataset at a time.
Partner with us for exclusive licensing
