GDPR and AI: The "Right to Be Forgotten" Now Means "Unlearning"

Let me tell you about a problem that kept me up at night while building [GrowCentric.ai](https://growcentric.ai). A user signs up. They use the platform for six months. Their campaign data, audience behaviour patterns, conversion metrics, and engagement signals all flow into the system. Some of that data trains the machine learning models that power the platform's predictions and recommendations. Then the user sends an email: "I want all my data deleted. Article 17, GDPR." No problem. I delete their account. I delete their campaigns. I delete their reports. I purge their records from every database table.

But here's the question that makes every AI developer uncomfortable: what about the models? Their data influenced the model weights. It shaped the patterns the system learned. It affected predictions being made for other users right now. Deleting the database rows doesn't undo what the model learned from that data. The influence lives on in the parameters. This is the "machine unlearning" problem, and it's the collision point between GDPR's right to erasure and the fundamental way machine learning works.

In 2018, when GDPR came into force, most of us interpreted Article 17 as "delete the rows." In 2026, with the EDPB making the right to erasure its coordinated enforcement priority and 30 European data protection authorities actively investigating how organisations handle deletion requests, the interpretation is expanding. If your AI model contains traces of personal data, regulators want to know what you're doing about it.

I work on this problem across multiple products. [GrowCentric.ai](https://growcentric.ai) uses ML models for marketing optimisation. [Stint.co](https://stint.co), where I built a marketing dashboard with real-time reporting, insight generation, and email campaigns serving large audiences, processes significant volumes of user data that feed into analytical models. [Regios.at](https://regios.at), a regional Austrian platform I work with, handles community and local business data. And [Auto-Prammer.at](https://auto-prammer.at), my automotive marketplace on Solidus, uses AI for recommendations and pricing. All built on Ruby on Rails. All subject to GDPR. All using AI in ways that make Article 17 genuinely complicated.

Let me explain why this matters, what the regulators are actually saying, and what you can build right now to protect yourself.

Why "Delete the Row" Is No Longer Enough

Let me make this concrete with an example from each product I work on.

GrowCentric.ai: The marketing optimisation platform learns from campaign performance data. When a client runs campaigns through the system, their conversion rates, audience engagement patterns, cost-per-acquisition trends, and seasonal patterns all feed into predictive models that improve budget allocation and audience targeting. If Client A's data shows that automotive audiences in Lower Austria convert better on Thursday evenings, that pattern might influence recommendations for Client B. Deleting Client A's account doesn't delete that learned pattern from the model.

Stint.co: The marketing dashboard I built generates real-time reporting and insights for digital marketing campaigns. It sends emails to large audiences, manages application forms, and provides insight generation. When the system learns that emails sent at 10am on Tuesdays get 23% higher open rates for a particular audience segment, that insight was derived from the engagement behaviour of real people. Each individual's open, click, and conversion data contributed to that aggregate learning. Delete one person's data, and the insight remains.

Regios.at: The regional platform handles local business listings, community data, and user interactions. If AI features (search ranking, content recommendations, local business matching) were trained on user behaviour patterns, those patterns persist in the model even after a specific user requests deletion.

Auto-Prammer.at: The automotive marketplace uses AI for vehicle recommendations, price predictions, and user matching. If the recommendation engine learned from User X's browsing behaviour that people who look at BMW 3 Series listings also tend to look at Audi A4 listings, deleting User X doesn't un-learn that association.

In every case, the problem is the same. Database deletion is straightforward. Model un-training is not.

What the Regulators Are Actually Saying

This isn't a theoretical concern. European regulators are actively pursuing this issue.

In March 2025, the EDPB launched its Coordinated Enforcement Framework action for 2025 focused specifically on the right to erasure. Thirty data protection authorities across Europe, plus the European Data Protection Supervisor, are participating. They're contacting controllers across sectors, opening investigations, and conducting fact-finding exercises on how organisations handle deletion requests.

This follows the EDPB's landmark Opinion 28/2024 from December 2024, which addressed personal data processing in the context of AI models. The key findings that matter for developers:

AI models trained on personal data are not automatically anonymous. The EDPB rejected the Hamburg DPA's earlier position that large language models don't store personal data. Instead, the Board concluded that whether a model is anonymous requires a case-by-case assessment. If there's a non-negligible likelihood that personal data could be extracted from the model (either directly or through targeted prompting), the model contains personal data and GDPR applies to it.

Right to erasure applies to models, not just datasets. If a data subject requests erasure and their data influenced a model, the obligation extends beyond deleting the training dataset. The EDPB acknowledged that this is "technically complex" but didn't grant a blanket exemption.

Worst case: erasure of the entire model. The EDPB's opinion explicitly mentions that a DPA could order "erasing part of the dataset that was processed unlawfully or, where this is not possible, the erasure of the whole dataset used to develop the AI model and/or the AI model itself." This is the nuclear option, and while the EDPB says it must be proportionate, the fact that it's on the table at all should concern every AI developer.

Unlawful training taints downstream use. If a model was trained on unlawfully processed personal data, even a different controller deploying that model may face compliance issues. This matters if you use third-party models or pre-trained components.

And then there's the practical precedent. In December 2024, the Italian DPA fined OpenAI 15 million euros for GDPR violations including failure to establish an appropriate lawful basis for processing personal data used to train ChatGPT, transparency failures, and inadequate risk assessments. Italy had previously temporarily banned ChatGPT in 2023 over these concerns.

The message is clear: regulators expect AI developers to have answers for how they handle personal data in the context of model training, and "it's technically difficult" is not an acceptable response.

The Computer Science Problem: Why Machine Unlearning Is So Hard

To understand why this is a nightmare, you need to understand how machine learning models actually work.

When you train a model, data flows through the system and adjusts internal parameters (weights and biases) through a process called gradient descent. Each training example nudges thousands or millions of parameters by tiny amounts. After training on thousands or millions of examples, the model's behaviour is the cumulative result of all those tiny nudges.

The problem: there's no undo button. You can't look at a trained model and say "these specific weights were influenced by User X's data." The influence is distributed across the entire parameter space, entangled with every other training example.
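
If that sounds abstract, here's a deliberately tiny sketch in plain Ruby (no ML library, nothing from my products) showing why: every training example nudges every weight, and the nudges simply accumulate.

# Toy linear model: each example nudges every weight a little.
# After many examples the weights are a cumulative blend of all of them,
# with no record of which user contributed which nudge.
weights = Array.new(3, 0.0)
learning_rate = 0.01

training_examples = [
  { features: [1.0, 0.2, 0.7], label: 1.0, source_user_id: 42 },
  { features: [0.3, 0.9, 0.1], label: 0.0, source_user_id: 77 }
  # ...thousands more
]

training_examples.each do |example|
  prediction = weights.zip(example[:features]).sum { |w, x| w * x }
  error = prediction - example[:label]

  # Gradient step: every weight moves, regardless of which user the example came from
  weights = weights.each_with_index.map do |w, i|
    w - learning_rate * error * example[:features][i]
  end
end

# There is no inverse operation that removes only user 42's nudges:
# their influence is entangled with every update that came after it.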

Let me give you an analogy. Imagine you make a soup by adding ingredients one at a time. After simmering for hours, someone asks you to remove the salt you added in step three. You can't. The salt has dissolved and distributed throughout. The only way to get a salt-free soup is to make a new soup from scratch without salt.

That's essentially what full retraining means for AI models: start from scratch with a dataset that excludes the requested data. It guarantees the influence is gone, but for large models it can take days or weeks and consume significant compute. Research suggests retraining GPT-4-scale models would cost millions of dollars. Even for smaller, domain-specific models like the ones I train for GrowCentric.ai or Stint.co, full retraining isn't something you want to do every time someone clicks "delete my account."

Researchers have proposed several alternatives:

Approximate unlearning adjusts model parameters to reduce the influence of specific training data without full retraining. It's faster but provides no formal guarantees that the data's influence has been completely removed. Testing relies on membership inference attacks (trying to detect whether specific data was in the training set), but these tests aren't reliable for well-trained models.

SISA training (Sharded, Isolated, Sliced, and Aggregated) divides training data into shards, trains sub-models on each shard, and aggregates results. When someone requests deletion, only the shard containing their data needs retraining. It's clever, but it changes your entire training architecture and may reduce model performance. A short sketch of the shard-retraining idea follows these alternatives.

Influence functions estimate how much a specific training example affected model parameters, then apply the inverse update. Good in theory, computationally expensive in practice, and increasingly unreliable for large models.

Intentional misclassification deliberately mislabels the data points to be forgotten and fine-tunes the model, reducing their influence. It's computationally cheap but doesn't truly remove the influence.
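
Here's what that SISA-style shard retraining could look like from a Rails perspective. This is a minimal sketch under stated assumptions: TrainingExample, ShardTrainer, and EnsembleAggregator are hypothetical names, not components of any of my products, and a real SISA setup also has to handle slicing and checkpointing within each shard.

# Hypothetical SISA-style handler: each user's data lives in exactly one shard,
# so erasure only requires retraining that shard's sub-model.
class ShardedUnlearningHandler
  def initialize(shard_count:)
    @shard_count = shard_count
  end

  def shard_for(user_id)
    # Deterministic assignment: the same user always lands in the same shard
    Digest::SHA256.hexdigest(user_id.to_s).to_i(16) % @shard_count
  end

  def handle_erasure(user_id)
    shard_id = shard_for(user_id)

    # Rebuild only the affected shard's training set, without this user
    shard_data = TrainingExample
                   .where(shard_id: shard_id)
                   .where.not(source_user_id: user_id)

    sub_model = ShardTrainer.train(shard_id: shard_id, data: shard_data)

    # Swap the retrained sub-model into the ensemble; the other shards are untouched
    EnsembleAggregator.replace_sub_model(shard_id: shard_id, model: sub_model)
  end
end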

None of these solve the fundamental problem. As the EDPB's own technical report acknowledged, machine unlearning methods are "the result of early-stage research", and improvements and alternative approaches are still needed.

The Practical Solution: Don't Train on Personal Data in the First Place

Here's where I stop describing the problem and start describing what I actually do about it.

The best solution to machine unlearning is to never need it. If personal data never enters your model in an identifiable form, there's nothing to unlearn. This doesn't mean you can't use data from your users. It means you need an architecture that separates personal data from model training data through aggregation, anonymisation, and careful data governance.

Here's how I approach this across my Rails applications.

Layer 1: Data Provenance Tracking

Before anything else, you need to know exactly what data went into which model. If you can't answer "was User X's data used to train Model Y?" you can't respond to a deletion request meaningfully.

class DataProvenanceTracker
  def record_training_usage(model_version:, data_sources:)
    data_sources.each do |source|
      TrainingDataRecord.create!(
        model_name: model_version.model_name,
        model_version: model_version.version_string,
        data_source_type: source[:type],
        data_source_id: source[:id],
        contains_personal_data: source[:personal_data],
        anonymisation_method: source[:anonymisation_method],
        aggregation_level: source[:aggregation_level],
        included_at: Time.current,
        data_hash: compute_data_hash(source)
      )
    end
  end

  def find_models_using_data_from(user:)
    # Which model versions were trained with this user's data?
    # model_name and model_version are plain string columns here,
    # so there's no association to eager-load.
    TrainingDataRecord.where(
      data_source_type: 'user_derived',
      data_source_id: user.id,
      contains_personal_data: true
    )
  end

  def erasure_impact_assessment(user:)
    affected_models = find_models_using_data_from(user: user)
    {
      user_id: user.id,
      affected_model_count: affected_models.size,
      models: affected_models.map do |record|
        {
          model: record.model_name,
          version: record.model_version,
          anonymisation_method: record.anonymisation_method,
          aggregation_level: record.aggregation_level,
          personal_data_present: record.contains_personal_data,
          retraining_required: record.contains_personal_data && record.aggregation_level == 'individual'
        }
      end,
      assessed_at: Time.current
    }
  end

  private

  # Content hash of the source so later audits can verify which data snapshot was used
  def compute_data_hash(source)
    Digest::SHA256.hexdigest(source.to_json)
  end
end

This is the first thing I implemented across GrowCentric.ai, the Stint.co dashboard, and Auto-Prammer.at. Every training run records exactly what data was used, at what aggregation level, and whether personal data was involved.

Layer 2: The Anonymisation Pipeline

The key architectural decision: personal data gets anonymised or aggregated before it ever reaches a training pipeline. Raw user data stays in the application database (where GDPR deletion is straightforward). Derived, anonymised features flow to the training system.

class TrainingDataAnonymiser
  # Transform raw user data into anonymised training features
  def prepare_training_features(user_data, purpose:)
    case purpose
    when :campaign_optimisation
      # GrowCentric: aggregate to campaign level, strip user identity
      anonymise_campaign_data(user_data)
    when :email_engagement
      # Stint.co: aggregate to segment level, no individual tracking
      anonymise_engagement_data(user_data)
    when :recommendation_engine
      # Auto-Prammer.at / Regios.at: anonymise browsing patterns
      anonymise_browsing_data(user_data)
    when :pricing_model
      # Auto-Prammer.at: aggregate market data, no individual prices
      anonymise_pricing_data(user_data)
    end
  end

  private

  def anonymise_campaign_data(data)
    # Strip all personal identifiers
    # Aggregate to cohort level (minimum 50 users per cohort)
    # Replace exact timestamps with time buckets
    # Replace exact locations with region codes
    data.group_by { |d| cohort_key(d) }
        .select { |_key, members| members.size >= 50 }
        .transform_values { |members| aggregate_metrics(members) }
  end

  def anonymise_engagement_data(data)
    # Stint.co email engagement: aggregate to segment/time bucket
    # No individual open/click tracking in training data
    # Only segment-level patterns: "tech segment, Tuesday 10am, 23% open rate"
    data.group_by { |d| [d[:segment], time_bucket(d[:timestamp])] }
        .select { |_key, members| members.size >= 100 }
        .transform_values do |members|
          {
            open_rate: members.count { |m| m[:opened] }.to_f / members.size,
            click_rate: members.count { |m| m[:clicked] }.to_f / members.size,
            segment_size: members.size
          }
        end
  end

  def anonymise_browsing_data(data)
    # Replace user IDs with anonymous session hashes
    # Aggregate product views to category level
    # Apply k-anonymity (minimum k=10 per group)
    data.map { |d| strip_identity(d) }
        .group_by { |d| generalise_attributes(d) }
        .select { |_key, group| group.size >= 10 }
  end

  def cohort_key(datum)
    [
      datum[:region_code],
      datum[:industry_vertical],
      time_bucket(datum[:timestamp]),
      device_category(datum[:device])
    ].join(':')
  end

  # Remaining helpers (anonymise_pricing_data, time_bucket, device_category,
  # aggregate_metrics, strip_identity, generalise_attributes) are elided here
end

The critical design principle: minimum cohort sizes. For GrowCentric.ai campaign data, no training cohort smaller than 50 users. For Stint.co email engagement, no segment smaller than 100 recipients. For Auto-Prammer.at and Regios.at browsing data, k-anonymity with k=10 minimum. This means no individual user's behaviour is distinguishable in the training data.
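
One way to stop those thresholds from silently eroding as pipelines change is to enforce them with a guard before any training run. This is a minimal sketch, assuming each prepared cohort exposes a segment_size count like the anonymiser's engagement output; the class and error names are illustrative, not existing code.

# Guard that refuses to hand a dataset to the trainer if any cohort
# falls below the agreed minimum size for its purpose.
class CohortSizeGuard
  CohortSizeViolation = Class.new(StandardError)

  MINIMUM_COHORT_SIZES = {
    campaign_optimisation: 50,   # GrowCentric.ai campaign cohorts
    email_engagement: 100,       # Stint.co segment/time buckets
    recommendation_engine: 10,   # Auto-Prammer.at / Regios.at k-anonymity
    pricing_model: 10
  }.freeze

  def verify!(prepared_cohorts, purpose:)
    minimum = MINIMUM_COHORT_SIZES.fetch(purpose)

    undersized = prepared_cohorts.select do |_cohort_key, metrics|
      metrics[:segment_size].to_i < minimum
    end

    return prepared_cohorts if undersized.empty?

    raise CohortSizeViolation,
          "#{undersized.size} cohort(s) below minimum of #{minimum} for #{purpose}"
  end
end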

When a user requests deletion under Article 17, I delete their personal data from the application database. The training data was never personal to begin with; it was aggregated and anonymised before entering the training pipeline. There's nothing to unlearn because no individual's data was ever learned.

Layer 3: The Erasure Handler

When a deletion request comes in, the system needs to handle it comprehensively:

class GDPRErasureHandler
  def process_erasure_request(user:, requested_at: Time.current)
    request = ErasureRequest.create!(
      user: user,
      requested_at: requested_at,
      status: 'processing'
    )

    ActiveRecord::Base.transaction do
      # Layer 1: Delete all personal data from application databases
      delete_personal_data(user)

      # Layer 2: Assess model impact
      impact = DataProvenanceTracker.new.erasure_impact_assessment(user: user)

      # Layer 3: Handle any models that used non-anonymised data
      impact[:models].each do |model_info|
        if model_info[:retraining_required]
          schedule_model_remediation(model_info, user)
        else
          # Data was anonymised/aggregated before training
          # No model action needed, document why
          log_no_action_required(model_info, reason: 'training_data_anonymised')
        end
      end

      # Layer 4: Generate compliance documentation
      request.update!(
        status: 'completed',
        completed_at: Time.current,
        impact_assessment: impact,
        personal_data_deleted: true,
        model_remediation_required: impact[:models].any? { |m| m[:retraining_required] },
        documentation: generate_erasure_documentation(user, impact)
      )
    end

    # Layer 5: Propagate to third-party processors outside the DB transaction,
    # so a slow or failing external call cannot roll back the local deletion
    propagate_to_processors(user)

    # Notify the user without undue delay (Article 12(3): at the latest within one month)
    ErasureConfirmationMailer.send_confirmation(user, request).deliver_later
    request
  end

  private

  def delete_personal_data(user)
    # Systematic deletion across all data stores
    [
      UserProfileService,
      CampaignDataService,
      EmailEngagementService,
      BrowsingHistoryService,
      SupportTicketService,
      AuditLogService  # Anonymise, don't delete (legal requirement)
    ].each { |service| service.erase_for(user) }
  end

  def schedule_model_remediation(model_info, user)
    # If somehow non-anonymised data reached a model,
    # schedule retraining with the user's data excluded
    ModelRetrainingJob.perform_later(
      model_name: model_info[:model],
      exclude_user_id: user.id,
      reason: 'gdpr_erasure_request',
      priority: 'high'
    )
  end

  def propagate_to_processors(user)
    # Notify all third-party data processors
    DataProcessorRegistry.all.each do |processor|
      processor.request_erasure(user_identifier: user.anonymised_id)
    end
  end
end

The key insight: if your anonymisation pipeline works correctly, the model remediation path should almost never be triggered. The system handles it just in case, but the architecture is designed so that personal data never reaches the training pipeline in identifiable form.
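
One way to verify that on an ongoing basis, rather than assuming it, is a periodic audit over the Layer 1 provenance records. A minimal sketch, assuming the TrainingDataRecord schema above and a hypothetical ComplianceMailer for alerting:

# Scheduled job (e.g. nightly via a recurring-job scheduler) that flags any
# training run where identifiable, individual-level personal data slipped through.
class TrainingDataComplianceAuditJob < ApplicationJob
  queue_as :low_priority

  def perform
    violations = TrainingDataRecord.where(
      contains_personal_data: true,
      aggregation_level: 'individual'
    )

    return if violations.none?

    # Any hit here means the anonymisation pipeline was bypassed and the
    # affected model versions need remediation, not just documentation.
    ComplianceMailer.training_data_violation(
      record_ids: violations.pluck(:id),
      model_versions: violations.distinct.pluck(:model_name, :model_version)
    ).deliver_later
  end
end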

Layer 4: Designing for Retrainability

Even with good anonymisation, you should design your training pipeline so that retraining is feasible if needed:

class RetrainableModelPipeline
  def train(model_config:, training_data:, excluded_users: [])
    # Always maintain the ability to retrain from scratch
    versioned_data = prepare_versioned_dataset(
      training_data: training_data,
      excluded_users: excluded_users
    )

    model = ModelTrainer.train(
      config: model_config,
      data: versioned_data[:features],
      validation: versioned_data[:validation]
    )

    # Record full provenance
    ModelVersion.create!(
      model_name: model_config[:name],
      version: next_version(model_config[:name]),
      trained_at: Time.current,
      training_data_hash: versioned_data[:hash],
      excluded_users: excluded_users,
      data_provenance: versioned_data[:provenance],
      performance_metrics: evaluate_model(model, versioned_data[:test]),
      anonymisation_verified: true
    )

    model
  end

  # SISA-inspired: maintain training data in shards
  # so partial retraining is possible if needed.
  # (shard_size, next_version and evaluate_model are configuration/helpers elided here.)
  def prepare_versioned_dataset(training_data:, excluded_users: [])
    shards = training_data.each_slice(shard_size).map do |shard|
      kept = shard.reject { |d| excluded_users.include?(d[:source_user_id]) }
      {
        data: kept,
        # Hash the data actually used (after exclusions) so provenance matches reality
        hash: Digest::SHA256.hexdigest(kept.to_json)
      }
    end

    # Deterministic holdout split so the validation/test keys used in #train exist
    examples = shards.flat_map { |s| s[:data] }.shuffle(random: Random.new(42))
    holdout_size = [examples.size / 10, 1].max

    {
      features: examples.drop(holdout_size * 2),
      validation: examples.first(holdout_size),
      test: examples[holdout_size, holdout_size],
      hash: Digest::SHA256.hexdigest(shards.map { |s| s[:hash] }.join),
      provenance: shards.map { |s| s[:hash] }
    }
  end
end

What This Looks Like Across Different Products

Let me map this architecture to the specific products I work on, because different products have different data flows and different risk profiles.

GrowCentric.ai (marketing optimisation SaaS): The highest risk because it processes campaign data from multiple clients. My approach: all client data is aggregated to campaign-level metrics before entering any training pipeline. Individual user behaviour (clicks, conversions, browsing) is never used at the individual level. The models learn patterns like "automotive campaigns in the DACH region with budget above 5,000 euros perform best with Thursday/Friday evening scheduling" not "User ID 47832 converts on Thursdays." This means deletion requests affect only the application database, not the models.

Stint.co (marketing dashboard with email campaigns): This handles large-volume email sends, real-time reporting, and insight generation. The insight generation component is the one that matters for unlearning. When the system notices that a particular email subject line pattern gets higher engagement, that insight was derived from individual engagement data. My approach: engagement metrics are aggregated to segment/time-bucket level before any analytical model touches them. The dashboard shows individual-level reporting to the account owner (who has the data processing relationship), but the learning systems only see "segment X, time Y, aggregate rate Z." When a recipient exercises their right to erasure, their engagement records are deleted from the reporting database. The aggregate insights remain because they were never tied to an individual.

Regios.at (regional platform): Handles local business listings, user interactions with regional content, and community features. If AI features like search ranking or content recommendations are trained on user behaviour, my approach is the same: browsing patterns are anonymised to session-level interactions with k-anonymity guarantees before entering any recommendation training pipeline. User deletion cleans the application database. Recommendations are based on anonymous behavioural patterns, not individual profiles.

Auto-Prammer.at (automotive marketplace on Solidus): Uses AI for vehicle recommendations, price predictions, and buyer-seller matching. As I described in my recommendation engine post, the system uses collaborative filtering, content-based signals, and contextual features. My approach: collaborative filtering features are computed from anonymised interaction matrices (user IDs replaced with random session hashes, interactions aggregated across sessions). Price prediction models use only market-level features (make, model, year, mileage, region, season) not individual buyer/seller data. Deletion of a user account cleans all personal records. The models were never trained on personally identifiable information.

The Article 17 Response Playbook

When someone exercises their right to erasure, here's the systematic response:

Step 1: Acknowledge within 72 hours. Not a requirement but best practice. Let the user know you've received their request and are processing it.

Step 2: Map the data. Use your provenance tracking to identify every location where this user's personal data exists: application databases, analytics systems, backup systems, third-party processors, and training datasets.

Step 3: Assess model impact. If your anonymisation architecture is working correctly, the assessment should show that training data was derived from the user's data but anonymised before reaching any model. Document this clearly.

Step 4: Delete personal data. Remove all identifiable personal data from all systems without undue delay and at the latest within one month of the request; Article 12(3) allows an extension of up to two further months for complex requests, provided you inform the data subject.

Step 5: If model remediation is needed, schedule it. If, despite your architecture, personal data reached a model in identifiable form, schedule retraining with that data excluded. Document the timeline and complete it as quickly as feasible.

Step 6: Propagate to processors. Notify all third-party data processors of the erasure request.

Step 7: Document everything. The EDPB's enforcement actions focus on how organisations handle deletion requests. Having comprehensive documentation of your process, your provenance tracking, your anonymisation methods, and your assessment of model impact is your best protection.

Step 8: Confirm to the user. Within one month, confirm that erasure has been completed. If model remediation is still in progress, explain the timeline. A sketch of how I track these deadlines follows below.
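
Here's that sketch, assuming the ErasureRequest model from the erasure handler above and a hypothetical ComplianceMailer for alerting; the one-month baseline and the possible two-month extension come from Article 12(3).

# Daily job that flags erasure requests approaching or past the Article 12(3)
# deadline (one month, extendable by two further months for complex requests,
# with the data subject informed of the extension).
class ErasureDeadlineCheckJob < ApplicationJob
  WARNING_WINDOW = 5.days

  def perform
    open_requests = ErasureRequest.where.not(status: 'completed')

    open_requests.find_each do |request|
      deadline = request.requested_at + 1.month

      if Time.current > deadline
        ComplianceMailer.erasure_deadline_breached(request.id).deliver_later
      elsif Time.current > deadline - WARNING_WINDOW
        ComplianceMailer.erasure_deadline_approaching(request.id).deliver_later
      end
    end
  end
end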

How This Connects to the Broader Regulatory Picture

This doesn't exist in isolation. As I covered in my EU AI Act post, the AI Act's transparency requirements and the GDPR's data subject rights create overlapping obligations. The AI Act requires technical documentation including training data information. The GDPR requires you to respond to erasure requests. Together, they mean you need to know exactly what data trained your models and be able to demonstrate that personal data was handled lawfully throughout the training process.
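
As a sketch of what that looks like in practice, the Layer 1 provenance records can be exported into a per-model-version training data summary. The exporter below is illustrative rather than a complete AI Act technical documentation package, and it assumes the TrainingDataRecord schema from earlier:

# Sketch: export the training-data documentation for one model version,
# reusing the TrainingDataRecord provenance from Layer 1.
class ModelDocumentationExporter
  def export(model_name:, version:)
    records = TrainingDataRecord.where(model_name: model_name, model_version: version)

    {
      model: model_name,
      version: version,
      exported_at: Time.current.iso8601,
      data_sources: records.group(:data_source_type).count,
      personal_data_sources: records.where(contains_personal_data: true).count,
      anonymisation_methods: records.distinct.pluck(:anonymisation_method).compact,
      aggregation_levels: records.distinct.pluck(:aggregation_level).compact
    }
  end
end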

The Cyber Resilience Act adds security requirements for the data pipelines themselves. The NIS2 Directive adds incident reporting obligations if those pipelines are breached.

The Digital Omnibus proposed in November 2025 includes an interesting wrinkle: it proposes allowing "legitimate interest" as a legal basis for processing personal data to train AI models, and it would permit processing of special-category data (Article 9 GDPR) for de-biasing AI systems. If adopted, this would create clearer legal grounds for training on personal data, but it wouldn't remove the right to erasure. You'd still need to handle deletion requests.

And the EDPB's coordinated enforcement action on right to erasure? That's running throughout 2025 and into 2026. Thirty DPAs are actively investigating. The results will likely shape enforcement guidance for years to come.

The Bottom Line for Developers

The right to be forgotten is no longer just about deleting database rows. In the age of AI, it extends to the models those rows helped train. The EDPB has said so. The Italian DPA has enforced it. Thirty European data protection authorities are actively investigating it.

But here's the thing: if you architect your systems correctly from the start, this problem largely solves itself.

Separate personal data from training data through aggregation and anonymisation. Maintain complete data provenance so you know exactly what went into every model. Design your training pipelines for retrainability, even if you rarely need it. Build comprehensive erasure handling that assesses model impact, not just database deletion. And document everything.

This is what I do across every Rails application I build, whether it's GrowCentric.ai processing marketing campaign data, Stint.co generating email engagement insights, Regios.at powering regional search and recommendations, or Auto-Prammer.at running vehicle recommendations on Solidus.

The patterns are the same: data provenance tracking, anonymisation pipelines, minimum cohort sizes, SISA-inspired sharded training, comprehensive erasure handling, and audit logging. Build these once, apply them everywhere.

Because the question isn't whether a regulator will ask how you handle deletion requests for AI training data. The question is when.

Building AI features into your Rails application and worried about GDPR's right to erasure? Whether you need a data provenance system, an anonymisation pipeline for training data, a comprehensive erasure handler, or a complete compliance architecture for your SaaS product, I build these on Ruby on Rails with European regulation baked in from day one. Let's talk about getting your AI data architecture right before the regulators come asking.