A purpose-built OCR model that decodes CAPTCHA images with high accuracy. Pass an image, get the predicted text. Ready for Selenium, automation pipelines, and any workflow that needs CAPTCHA solving.
A streamlined pipeline that takes a CAPTCHA image and returns the decoded text. No preprocessing required.
Provide a CAPTCHA image via file upload, base64 string, or URL. The model accepts common image formats including PNG, JPG, and BMP.
The custom-trained model processes the image through a convolutional neural network optimized for distorted, noisy, and obfuscated character recognition.
Receive the decoded CAPTCHA text as a plain string. Average prediction time is under 100ms, making it suitable for real-time automation pipelines.
Upload any CAPTCHA image to see the model in action. Drag and drop or click to select a file. This demo calls the live API.
Integrate the CAPTCHA OCR model into your project with just a few lines of Python. Full API reference and integration examples below.
# Install the package
pip install captcha-ocr
# Import and use
from captcha_ocr import solve
result = solve("captcha.png")
print(result)  # → "X7K9M2"

Technical deep-dives on model training, CAPTCHA analysis, and production deployment strategies. Written by the team behind CAPTCHA OCR.
Building a custom OCR model from scratch is a rewarding exercise that forces you to understand every layer of the character recognition pipeline. Unlike fine-tuning a pretrained model, starting from zero means you control the data, the architecture, and the training loop. This article documents the exact process we used to build the CAPTCHA OCR engine.
The foundation of any OCR model is the training data. For CAPTCHA recognition, we generated a synthetic dataset of 500,000 images using a custom renderer. Each image contains 4 to 6 alphanumeric characters with varying distortions: rotation between -15 and +15 degrees, random noise injection, color variation, and overlapping lines. The key insight is that your synthetic data must match the distribution of real-world CAPTCHAs as closely as possible. We used Python's Pillow library combined with random affine transformations to achieve this. Each image is stored as a 128x48 grayscale PNG, paired with a label file containing the ground truth text.
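The generation step can be sketched with Pillow. This is a minimal illustration of the approach, not our production renderer: the fonts, spacing, and noise parameters here are assumptions, and a real pipeline would add affine transforms, color variation, and overlapping lines as described above.

```python
import random
import string
from PIL import Image, ImageDraw, ImageFont

CHARSET = string.ascii_lowercase + string.digits

def make_captcha(width=128, height=48):
    """Render one synthetic CAPTCHA image plus its ground-truth label."""
    label = "".join(random.choices(CHARSET, k=random.randint(4, 6)))
    img = Image.new("L", (width, height), color=255)  # 128x48 grayscale, as above
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    x = 8
    for ch in label:
        # render each character on its own tile so it can be rotated independently
        glyph = Image.new("L", (20, 28), color=255)
        ImageDraw.Draw(glyph).text((4, 4), ch, fill=0, font=font)
        glyph = glyph.rotate(random.uniform(-15, 15), fillcolor=255)
        img.paste(glyph, (x, random.randint(2, 14)))
        x += random.randint(16, 22)  # randomized inter-character spacing
    # distractor line plus scattered noise pixels
    draw.line([(0, random.randint(0, height)), (width, random.randint(0, height))], fill=0)
    for _ in range(150):
        draw.point((random.randrange(width), random.randrange(height)), fill=random.randint(0, 255))
    return img, label
```

Each generated image would be saved as PNG alongside its label, forming one (image, ground truth) training pair.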
The architecture follows a CRNN (Convolutional Recurrent Neural Network) pattern, which has become the standard for sequence-based text recognition. The convolutional backbone consists of 5 convolutional layers with batch normalization and ReLU activations, progressively reducing the spatial dimensions while increasing channel depth (1 -> 32 -> 64 -> 128 -> 256 -> 512). After the CNN, we reshape the feature maps into a sequence and feed them into a 2-layer bidirectional LSTM with 256 hidden units. The LSTM captures contextual dependencies between characters. Finally, a fully connected layer maps the LSTM output to the character vocabulary (a-z and 0-9 give 36 character classes, plus a CTC blank token, for 37 outputs).
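In PyTorch, the architecture can be sketched as follows. The channel progression, LSTM size, and class count match the description above; the kernel sizes and the choice to downsample only in the first three blocks are assumptions made so the shapes work out for a 128x48 input.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Sketch of the CRNN described above (exact kernels/strides are assumptions)."""
    def __init__(self, num_classes: int = 37):  # 36 characters + 1 CTC blank
        super().__init__()
        chans = [1, 32, 64, 128, 256, 512]
        layers = []
        for i in range(5):
            layers += [
                nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, padding=1),
                nn.BatchNorm2d(chans[i + 1]),
                nn.ReLU(inplace=True),
            ]
            if i < 3:  # downsample in the first three blocks only (assumed)
                layers.append(nn.MaxPool2d(2))
        self.cnn = nn.Sequential(*layers)
        self.rnn = nn.LSTM(512 * 6, 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)  # 2 directions x 256 hidden units

    def forward(self, x):                         # x: (N, 1, 48, 128)
        f = self.cnn(x)                           # (N, 512, 6, 16) after three 2x pools
        f = f.permute(0, 3, 1, 2).flatten(2)      # (N, 16, 3072): width is the sequence axis
        out, _ = self.rnn(f)                      # (N, 16, 512)
        return self.fc(out)                       # (N, 16, num_classes)
```

The width dimension becomes the sequence axis, so a 128-pixel-wide image yields 16 timesteps for the CTC head.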
Connectionist Temporal Classification (CTC) is the critical component that allows us to train without needing exact character-level alignment. CTC works by summing over all possible alignments between the input sequence and the target label. During training, we use PyTorch's built-in `nn.CTCLoss` with zero infinity enabled for numerical stability. For decoding, we implement a greedy decoder that collapses repeated characters and removes blanks. For example, the raw output `--hh-ee-ll-ll-oo--` becomes `hello` after CTC decoding. We also experimented with beam search decoding but found greedy decoding sufficient for CAPTCHA text, since the character sequences are short and independent.
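The greedy decoding rule (collapse repeats, then drop blanks) fits in a few lines. This version operates on the already-argmaxed character string, with '-' standing in for the blank token:

```python
def ctc_greedy_decode(raw: str, blank: str = "-") -> str:
    """Collapse consecutive repeats, then remove CTC blank tokens."""
    out = []
    prev = None
    for ch in raw:
        # a character is emitted only when it differs from the previous
        # timestep, so "hh" collapses to "h" but "h-h" stays "hh"
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_greedy_decode("--hh-ee-ll-ll-oo--"))  # → "hello"
```

In practice the raw string comes from taking the argmax over the class dimension at each timestep; the blank between the two "ll" runs is what lets CTC represent genuine double letters.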
We train for 50 epochs using the AdamW optimizer with an initial learning rate of 1e-3 and a cosine annealing schedule with warm restarts. Batch size is 128, distributed across 2 GPUs using PyTorch's DataParallel. Data augmentation during training includes random brightness and contrast adjustment, elastic deformation, and Gaussian blur. We monitor both the CTC loss and a character-level accuracy metric computed on a held-out validation set of 50,000 images. Training converges around epoch 35, reaching 96.2% full-sequence accuracy on the validation set.
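The core of the training step looks roughly like this. The optimizer, scheduler, and loss settings match the description above; the tiny linear model is a hypothetical stand-in for the CRNN so the sketch stays self-contained.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the CRNN described above: maps a
# (N, 1, 48, 128) image to 32 timesteps over 37 classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(1 * 48 * 128, 32 * 37))

criterion = nn.CTCLoss(blank=36, zero_infinity=True)  # blank is class index 36
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)

def train_step(images, targets, target_lengths):
    """One optimization step with CTC loss; step the scheduler once per epoch."""
    logits = model(images).view(-1, 32, 37).permute(1, 0, 2)  # (T, N, C), as CTCLoss expects
    log_probs = logits.log_softmax(2)
    input_lengths = torch.full((images.size(0),), 32, dtype=torch.long)
    loss = criterion(log_probs, targets, input_lengths, target_lengths)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

`targets` is the concatenation of all label index sequences in the batch, with `target_lengths` recording each label's length, which is the 1D form `nn.CTCLoss` accepts.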
Three lessons from this process stand out. First, synthetic data quality matters more than quantity: 500K well-crafted samples outperformed 2M poorly generated ones. Second, batch normalization after every convolutional layer significantly stabilizes training with CTC loss. Third, the bidirectional LSTM is essential because characters in CAPTCHAs often overlap, and backward context helps resolve ambiguity. The final model weighs 12MB and runs inference in under 50ms on a CPU, making it practical for deployment in automation pipelines.
CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) have been the default defense against automated abuse since 2000. But the arms race between CAPTCHA generators and solvers has driven both sides to remarkable sophistication. Understanding modern CAPTCHA techniques is the first step toward building OCR systems that can handle them.
Classic text CAPTCHAs rely on a combination of distortion techniques. Character-level distortions include rotation (typically 10 to 30 degrees per character), scaling (80% to 120%), and skewing along one or both axes. Inter-character spacing is randomized to prevent simple segmentation. Background noise patterns include grid lines, random arcs, color gradients, and scattered dots. Some implementations overlay the text on photographic backgrounds to further confuse edge detection. The most challenging variants use overlapping characters where adjacent letters share pixel space, making segmentation-based approaches fail entirely.
Traditional OCR engines like Tesseract are designed for clean, well-formatted text in documents. They rely on connected component analysis to segment individual characters, then classify each one independently. This approach breaks down against CAPTCHAs for several reasons: the background noise creates false connected components, overlapping characters cannot be cleanly segmented, and the distortions push character appearances far outside the distribution of standard fonts. Attempting to preprocess CAPTCHA images with binarization and denoising often destroys the character strokes along with the noise.
Modern CAPTCHA OCR avoids segmentation entirely by treating the problem as sequence prediction. The input image is processed holistically: a CNN extracts features from the full image, and a recurrent layer (LSTM or Transformer) reads across the feature sequence to predict characters. CTC loss handles the alignment, so the model never needs to know where one character ends and the next begins. This approach is inherently robust to overlap, variable spacing, and background clutter because the model learns to focus on character-relevant features while ignoring noise patterns.
Many modern CAPTCHAs use multi-colored characters on complex backgrounds. Our approach converts all inputs to grayscale during preprocessing, which collapses color-based obfuscation into luminance differences. We then apply adaptive histogram equalization (CLAHE) to normalize contrast. This simple two-step preprocessing makes the model agnostic to color schemes while preserving character edge information. For CAPTCHAs with photographic backgrounds, we add an attention mechanism after the CNN that learns to focus on text regions.
Some CAPTCHA generators use adversarial techniques specifically designed to fool neural networks. These include imperceptible perturbations added to the image, font styles that are ambiguous even to humans (like confusing '1', 'l', and 'I'), and dynamic rendering that changes the distortion parameters per request. Our defense is a diverse training set: we generate CAPTCHAs using over 50 different font families, each with randomized weight and style. We also apply adversarial training during the last 10 epochs, adding FGSM perturbations to training images to harden the model.
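The FGSM step used for adversarial training is a single gradient-sign perturbation. The epsilon value here is illustrative, not the one used in our training run:

```python
import torch

def fgsm_perturb(model, criterion, images, targets, epsilon=0.03):
    """One FGSM step: nudge each pixel along the sign of the loss gradient."""
    images = images.clone().detach().requires_grad_(True)
    loss = criterion(model(images), targets)
    loss.backward()
    adv = images + epsilon * images.grad.sign()  # epsilon is an assumed value
    return adv.clamp(0.0, 1.0).detach()
```

During the final epochs, a fraction of each batch would be replaced by `fgsm_perturb(...)` outputs before the normal training step.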
We evaluated our model against five common CAPTCHA generators. Simple distorted text (like basic PHP CAPTCHAs): 98.1% accuracy. Medium complexity with line noise: 96.4%. Heavy overlap with background patterns: 93.7%. Multi-colored with gradients: 95.2%. Adversarially generated: 89.3%. The main failure modes are extreme character overlap, where even human readers struggle, and adversarial variants specifically engineered against CNN architectures. For production use, we recommend confidence thresholding: reject predictions below 0.85 confidence and retry.
Having a trained CAPTCHA OCR model is only half the battle. The real value comes from integrating it into automated workflows where CAPTCHAs block programmatic access. This guide covers the most common integration patterns, from simple Selenium scripts to production-grade API deployments.
The most common use case is solving CAPTCHAs during web scraping or automated form submission with Selenium. The workflow is straightforward: navigate to the page, locate the CAPTCHA image element, take a screenshot of just that element, pass it to the OCR model, and type the predicted text into the input field. In Python with Selenium, the critical code looks like this:

captcha_element = driver.find_element(By.ID, 'captcha-image')
captcha_element.screenshot('captcha.png')
result = model.predict('captcha.png')
driver.find_element(By.ID, 'captcha-input').send_keys(result)

The key detail is using element-level screenshots instead of full-page screenshots, which avoids scaling issues and captures the CAPTCHA at its native resolution.
For team-wide or service-to-service use, wrapping the model in a REST API is the standard approach. We use FastAPI for its async support and automatic OpenAPI documentation. The endpoint accepts a base64-encoded image in the request body and returns the prediction with confidence scores. The API runs the model on a single thread with a request queue to prevent memory issues from concurrent inference. A minimal deployment serves around 100 requests per second on a 4-core CPU instance, since each inference takes under 50ms.
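The request-handling logic can be sketched framework-agnostically. The FastAPI routing and queueing are omitted here; `predict_stub` and the `{"image": <base64>}` body shape are assumptions standing in for the real model and schema:

```python
import base64
import binascii
import json

def predict_stub(image_bytes: bytes) -> tuple:
    """Hypothetical stand-in for the OCR model; returns (text, confidence)."""
    return "X7K9M2", 0.97

def handle_solve(request_body: bytes) -> dict:
    """Decode a {"image": <base64>} JSON body and return the prediction payload."""
    try:
        payload = json.loads(request_body)
        image_bytes = base64.b64decode(payload["image"], validate=True)
    except (json.JSONDecodeError, KeyError, binascii.Error) as exc:
        return {"error": f"bad request: {exc}"}
    text, confidence = predict_stub(image_bytes)
    return {"text": text, "confidence": confidence}
```

In the real service this function body would sit inside the FastAPI endpoint, with inference dispatched to the single-threaded worker queue.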
CAPTCHA solving is inherently probabilistic, so your integration must handle failures gracefully. We recommend a three-tier strategy: First, check the model's confidence score. If it falls below 0.85, immediately retry by requesting a new CAPTCHA (most sites have a refresh button). Second, implement exponential backoff with a maximum of 3 attempts per CAPTCHA. Third, log failed predictions with the original image for later analysis and model improvement. In production, we see a first-attempt success rate of approximately 92%, rising to 99.1% with up to 3 retries.
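The three-tier strategy fits in one small helper. `fetch_captcha` and `solve` are hypothetical callables standing in for your refresh and inference code:

```python
import time

def solve_with_retry(fetch_captcha, solve, threshold=0.85,
                     max_attempts=3, base_delay=1.0):
    """Retry low-confidence predictions with exponential backoff."""
    for attempt in range(max_attempts):
        image = fetch_captcha()           # request a fresh CAPTCHA each attempt
        text, confidence = solve(image)   # model returns (text, confidence)
        if confidence >= threshold:
            return text
        time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s between attempts
    return None  # caller logs the failure with the original image
```

A `None` return is the signal to log the image for later labeling and model improvement.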
When processing thousands of CAPTCHAs per hour, you need to think about infrastructure. The model is CPU-bound, not GPU-bound, since inference is fast and the bottleneck is usually network latency to the target site. We recommend horizontal scaling with a load balancer: each worker process loads its own copy of the model (12MB memory footprint) and handles requests independently. Docker containers make this trivial. For Kubernetes deployments, set resource limits to 0.5 CPU and 256MB RAM per pod, and use a Horizontal Pod Autoscaler triggered at 70% CPU utilization.
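The resource limits and autoscaler described above translate into manifests along these lines. The names and replica bounds are placeholders; only the CPU/memory limits and the 70% trigger come from the text:

```yaml
# Container resources for each worker pod (names are placeholders)
resources:
  limits:
    cpu: "500m"
    memory: "256Mi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: captcha-ocr
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: captcha-ocr
  minReplicas: 2        # assumed bounds
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```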
Production OCR systems need monitoring. Track three metrics: prediction confidence distribution (a shift toward lower confidence indicates the target site changed their CAPTCHA generator), end-to-end success rate (including the downstream form submission), and inference latency percentiles. When accuracy degrades, collect the failing images, add them to your training set, and retrain. We maintain a CI/CD pipeline that automatically retrains the model weekly if more than 100 new labeled samples have been added to the dataset. Model versioning with a simple A/B deployment strategy allows zero-downtime updates.
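The first of those metrics, confidence-distribution drift, can be tracked with a rolling window. The window size and alert threshold below are illustrative assumptions:

```python
from collections import deque
from statistics import mean

class ConfidenceMonitor:
    """Rolling-window confidence tracker; a falling mean suggests the
    target site changed its CAPTCHA generator."""

    def __init__(self, window=1000, alert_below=0.90, min_samples=100):
        self.scores = deque(maxlen=window)
        self.alert_below = alert_below   # assumed threshold
        self.min_samples = min_samples   # warm-up before alerting

    def record(self, confidence: float) -> bool:
        """Record one prediction; return True if the rolling mean has drifted low."""
        self.scores.append(confidence)
        return (len(self.scores) >= self.min_samples
                and mean(self.scores) < self.alert_below)
```

An alert from `record` would trigger the collect-label-retrain loop described above.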
Before deploying CAPTCHA solving in production, understand the legal context. CAPTCHA bypass may violate the Terms of Service of the target website. It may also implicate the Computer Fraud and Abuse Act (CFAA) in the US or equivalent legislation in other jurisdictions, depending on the nature of the access. This tool is intended for legitimate use cases: automated testing of your own applications, accessibility tooling for users who cannot solve visual CAPTCHAs, and research purposes. Always ensure your use case is lawful and ethical before deployment.