Bulk Translation with ChatGPT: Scaling AI-Powered Translations

December 17, 2024, by Moritz Thomas

While individual AI-powered translations are useful for targeted updates, modern translation management systems often need to handle large-scale translation tasks efficiently. This article explores how we implemented bulk translation capabilities in our Translation Management System, leveraging ChatGPT's structured output capabilities and parallel processing for optimal performance.

Overview

The bulk translation system combines several key components:

  1. A simple user interface in the translation dashboard

  2. A robust backend endpoint for handling translation requests

  3. Parallel processing for improved performance

  4. Structured response handling using Zod schemas

  5. Flexible output handling with draft support

User Interface Integration

The bulk translation feature is integrated into the translation dashboard through a simple, focused interface:

const AddMissingTranslationsButton = () => {
  const [isLoading, setIsLoading] = useState(false);

  const handleAddMissingTranslations = async () => {
    setIsLoading(true);
    try {
      const response = await fetch("/api/ui-strings/translate/missing", {
        method: "GET",
      });

      if (!response.ok) {
        throw new Error("Failed to add missing translations");
      }
    } catch (error) {
      console.error("Error adding missing translations:", error);
    } finally {
      setIsLoading(false);
      // Reload so the dashboard picks up the newly written translations
      window.location.reload();
    }
  };

  return (
    <button onClick={handleAddMissingTranslations} disabled={isLoading}>
      {isLoading
        ? "Working hard at translating everything..."
        : "Add All Missing Translations"}
    </button>
  );
};

This minimalist interface belies the sophisticated processing happening behind the scenes. The single button triggers a comprehensive workflow that:

  1. Identifies missing translations across all content

  2. Processes them in efficient batches

  3. Handles the results appropriately based on system settings

Backend Implementation

The bulk translation endpoint handles the complex task of coordinating multiple translations:

export const bulkTranslateEndpoint: Endpoint = {
  path: "/translate/missing",
  method: "get",
  handler: async (req, res) => {
    // Only editors may trigger a bulk translation run
    if (!rbacHas(ROLE_EDITOR)({ req })) {
      return res.status(401).send("Unauthorized");
    }

    // Fetch all UI strings (every locale, no pagination) and system settings
    const result = await req.payload.find({
      collection: "ui-strings",
      locale: "all",
      limit: 0,
    });

    const settings = await req.payload.findGlobal({
      slug: "payload-settings",
    });

    // With locale: "all", `text` is keyed by locale; keep only the docs
    // that are missing at least one locale
    const cleanedUp = result.docs
      .map(({ id, description, text }) => ({ id, description, text }))
      .filter(({ text }) => Object.keys(text).length !== locales.length);

    // ... translation processing
  },
};
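
For context, the dashboard fetches /api/ui-strings/translate/missing, which implies the endpoint is registered on the ui-strings collection. In Payload CMS that registration might look like the following sketch (the surrounding collection config is an assumption, not code from the article):

import { CollectionConfig } from "payload/types";
import { bulkTranslateEndpoint } from "./bulkTranslateEndpoint";

export const UiStrings: CollectionConfig = {
  slug: "ui-strings",
  fields: [
    // ... text, description, drafts, etc.
  ],
  // Collection endpoints are mounted under /api/<slug>, so this handler
  // answers GET /api/ui-strings/translate/missing
  endpoints: [bulkTranslateEndpoint],
};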

Structured Response Handling

A key feature of our implementation is the use of Zod schemas to ensure properly structured responses:

// Build one Zod entry per UI string, so the model must return exactly the
// locales that are missing for that string; a single run is capped at 100
const zodTerms = {};
const missingTranslations = cleanedUp.slice(0, 100).map(({ id, text, description }) => {
  const missing = locales.filter((locale) => !text[locale]);

  const translations = {};
  missing.forEach((locale) => {
    translations[locale] = z.string();
  });

  zodTerms[id] = z.object({
    id: z.string(),
    translations: z.object(translations),
  });

  return {
    id,
    text: text[defaultLocale],
    description,
    missing,
  };
});

const Translations = z.object({
  translations: z.object(zodTerms),
});
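
Given this schema, a response for a single string with id greeting that is missing its German and French translations would have to take the following shape (the id and values are illustrative):

{
  "translations": {
    "greeting": {
      "id": "greeting",
      "translations": {
        "de": "Hallo",
        "fr": "Bonjour"
      }
    }
  }
}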

This schema-based approach ensures:

  1. Type safety throughout the translation process

  2. Properly structured responses from the AI

  3. Easy validation of translation results
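
The translate helper that consumes this schema is not shown above; as a minimal sketch, it could pass the schema to the OpenAI Node SDK's structured-output helper (the model name and prompt wording here are assumptions):

import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";

const client = new OpenAI();

async function translate(batch, Translations, settings) {
  // settings could supply the model or a glossary; unused in this sketch.
  // zodResponseFormat turns the Zod schema into a JSON schema that the
  // model is constrained to follow, so the reply always matches Translations.
  const completion = await client.beta.chat.completions.parse({
    model: "gpt-4o-2024-08-06", // any model with structured-output support
    messages: [
      {
        role: "system",
        content:
          "Translate each UI string into every locale listed in its `missing` array.",
      },
      { role: "user", content: JSON.stringify(batch) },
    ],
    response_format: zodResponseFormat(Translations, "translations"),
  });

  // `parsed` has already been validated against the Translations schema
  return completion.choices[0].message.parsed;
}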

Parallel Processing

To handle large numbers of translations efficiently, we implemented a parallel processing system:

async function parallelTranslate(
  missingTranslations,
  Translations,
  settings,
  batchSize = 5
) {
  // Short random id (Node's built-in crypto module) to correlate log
  // entries from one run
  const batchId = crypto.randomBytes(4).toString("hex");

  // Split the work into batches of `batchSize` strings
  const batches = [];
  for (let i = 0; i < missingTranslations.length; i += batchSize) {
    batches.push(missingTranslations.slice(i, i + batchSize));
  }

  // Process all batches in parallel
  const batchResults = await Promise.all(
    batches.map((batch) => translate(batch, Translations, settings))
  );

  // Merge the per-batch results back into a single translations map
  const combinedTranslations = batchResults.reduce((acc, result) => {
    Object.entries(result.translations).forEach(([key, value]) => {
      acc[key] = value;
    });
    return acc;
  }, {});

  return {
    batchId,
    result: { translations: combinedTranslations },
    performance: {
      totalItems: missingTranslations.length,
      batchCount: batches.length,
    },
  };
}

This parallel processing approach provides:

  1. Improved throughput for large translation sets

  2. Better error isolation, provided each batch catches its own failures (with plain Promise.all, a single rejected batch rejects the whole run; see the sketch after this list)

  3. Progress tracking through batch identifiers

  4. Performance metrics for system monitoring
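
A minimal sketch of that per-batch isolation, using Promise.allSettled so that one failed batch no longer rejects the entire run:

const settled = await Promise.allSettled(
  batches.map((batch) => translate(batch, Translations, settings))
);

// Keep whatever succeeded; failed batches are logged and their strings
// simply remain untranslated until the next run
const batchResults = settled
  .filter((entry) => entry.status === "fulfilled")
  .map((entry) => entry.value);

settled
  .filter((entry) => entry.status === "rejected")
  .forEach((entry) => console.error("Batch failed:", entry.reason));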

Flexible Output Handling

The system supports two modes of operation, controlled through settings:

  1. Direct publication of translations

  2. Creation of translation drafts for review

In draft mode, the addAsDrafts helper appends each new translation as a draft on the existing document instead of overwriting the published text:

async function addAsDrafts(translations, existingDoc, req) {
  // Group the new translations by locale, skipping any that match the
  // text already published for that locale
  const newDrafts = Object.entries(translations).reduce((acc, [locale, text]) => {
    if (!acc[locale]) acc[locale] = [];

    if (existingDoc.text[locale] !== text) {
      acc[locale].push({
        text,
        id: uuidv4(), // from the "uuid" package
        lastModifiedBy: null,
      });
    }
    return acc;
  }, {});

  // Append the new drafts to whatever drafts already exist per locale
  const updatedDrafts = Object.keys(newDrafts).reduce((acc, locale) => {
    acc[locale] = [
      ...(existingDoc.drafts?.[locale] || []),
      ...newDrafts[locale]
    ];
    return acc;
  }, {});

  // Persist each locale's drafts, but only when the combined list contains
  // no duplicate texts (a duplicate means the draft was already queued)
  for (const [locale, drafts] of Object.entries(updatedDrafts)) {
    if (drafts.length === new Set(drafts.map((d) => d.text)).size) {
      await req.payload.update({
        collection: slug, // the collection's slug, here "ui-strings"
        id: existingDoc.id,
        data: { drafts },
        locale,
        user: req.user,
      });
    }
  }
}
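
The branch between the two modes is not shown in the article; conceptually, the handler could select a path per document based on the settings global (the autoPublishTranslations flag name here is a placeholder, not a real field):

const { result: bulkResult } = await parallelTranslate(
  missingTranslations,
  Translations,
  settings
);

for (const doc of cleanedUp) {
  const item = bulkResult.translations[doc.id];
  if (!item) continue; // string was skipped or its batch failed

  if (settings.autoPublishTranslations) {
    // Direct publication: write each translated locale straight to the doc
    for (const [locale, text] of Object.entries(item.translations)) {
      await req.payload.update({
        collection: "ui-strings",
        id: doc.id,
        data: { text },
        locale,
        user: req.user,
      });
    }
  } else {
    // Draft mode: queue the translations for editorial review
    await addAsDrafts(item.translations, doc, req);
  }
}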

This flexibility allows organizations to:

  1. Fast-track translations in development environments

  2. Implement review processes in production

  3. Maintain quality control through draft reviews

  4. Track translation changes over time

Future Enhancements

Several areas have been identified for future improvement:

  1. Enhanced Error Recovery

    • Retry logic for failed batches

    • Partial success handling

    • Detailed error reporting

  2. Performance Optimization

    • Dynamic batch sizing

    • Priority queue implementation

    • Caching of common translations

  3. Quality Assurance

    • Automated quality metrics

    • Consistency checking

    • Context-aware validation

  4. User Interface Improvements

    • Progress indicators

    • Batch-level control

    • Result preview

Conclusion

The bulk translation system demonstrates how modern AI capabilities can be effectively scaled for production use. By combining parallel processing, structured responses, and flexible output handling, we've created a system that can efficiently handle large-scale translation tasks while maintaining quality control and system stability.

Key benefits include:

  • Efficient handling of large translation volumes

  • Robust error handling and validation

  • Flexible deployment options

  • Performance monitoring and optimization

  • Integration with existing workflow tools

This implementation provides a foundation for automated translation management that can evolve with changing requirements and technological capabilities.


© 2024 by Moritz Thomas