How to Design Mock Production Data Pipelines Using Polyfactory with Dataclasses, Pydantic, Attrs, and Nested Models

In this tutorial, we walk through an advanced, end-to-end use of Polyfactory, focusing on how to generate rich, realistic mock data directly from Python type hints. We begin by setting up the environment and then build factories for dataclasses, Pydantic models, and attrs-based classes, demonstrating customization, overrides, calculated fields, and nested object generation along the way. As we work through each snippet, we show how to control randomness, enforce constraints, and model real-world structures, making this tutorial directly applicable to testing, prototyping, and data-driven development workflows.

import subprocess
import sys


def install_package(package):
    # Install quietly with the same interpreter that runs this script.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])


packages = [
    "polyfactory",
    "pydantic",
    "email-validator",
    "faker",
    "msgspec",
    "attrs"
]

for package in packages:
    try:
        install_package(package)
        print(f"✓ Installed {package}")
    except Exception as e:
        print(f"✗ Failed to install {package}: {e}")

print("\n")
print("=" * 80)
print("SECTION 2: Basic Dataclass Factories")
print("=" * 80)

from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime, date
from uuid import UUID
from polyfactory.factories import DataclassFactory


@dataclass
class Address:
    street: str
    city: str
    country: str
    zip_code: str


@dataclass
class Person:
    id: UUID
    name: str
    email: str
    age: int
    birth_date: date
    is_active: bool
    address: Address
    phone_numbers: List[str]
    bio: Optional[str] = None


# No overrides: every field is generated from its type hint.
class PersonFactory(DataclassFactory[Person]):
    pass


person = PersonFactory.build()
print("Generated Person:")
print(f" ID: {person.id}")
print(f" Name: {person.name}")
print(f" Email: {person.email}")
print(f" Age: {person.age}")
print(f" Address: {person.address.city}, {person.address.country}")
print(f" Phone Numbers: {person.phone_numbers[:2]}")
print()

people = PersonFactory.batch(5)
print(f"Generated {len(people)} people:")
for i, p in enumerate(people, 1):
    print(f" {i}. {p.name} - {p.email}")
print("\n")

We set up the environment and make sure all required dependencies are installed. We also introduce the core idea of using Polyfactory to generate mock data from type hints. By building basic dataclass factories, we establish the foundation for all the examples that follow.

print("=" * 80)
print("SECTION 3: Customizing Factory Behavior")
print("=" * 80)

from faker import Faker
from polyfactory.fields import Use, Ignore


@dataclass
class Employee:
    employee_id: str
    full_name: str
    department: str
    salary: float
    hire_date: date
    is_manager: bool
    email: str
    internal_notes: Optional[str] = None


class EmployeeFactory(DataclassFactory[Employee]):
    __faker__ = Faker(locale="en_US")
    __random_seed__ = 42  # fixed seed for reproducible output

    # A classmethod named after a field supplies that field's value.
    @classmethod
    def employee_id(cls) -> str:
        return f"EMP-{cls.__random__.randint(10000, 99999)}"

    @classmethod
    def full_name(cls) -> str:
        return cls.__faker__.name()

    @classmethod
    def department(cls) -> str:
        departments = ["Engineering", "Marketing", "Sales", "HR", "Finance"]
        return cls.__random__.choice(departments)

    @classmethod
    def salary(cls) -> float:
        return round(cls.__random__.uniform(50000, 150000), 2)

    @classmethod
    def email(cls) -> str:
        return cls.__faker__.company_email()


employees = EmployeeFactory.batch(3)
print("Generated Employees:")
for emp in employees:
    print(f" {emp.employee_id}: {emp.full_name}")
    print(f" Department: {emp.department}")
    print(f" Salary: ${emp.salary:,.2f}")
    print(f" Email: {emp.email}")
    print()
print()
print("=" * 80)
print("SECTION 4: Field Constraints and Calculated Fields")
print("=" * 80)


@dataclass
class Product:
    product_id: str
    name: str
    description: str
    price: float
    discount_percentage: float
    stock_quantity: int
    final_price: Optional[float] = None
    sku: Optional[str] = None


class ProductFactory(DataclassFactory[Product]):
    @classmethod
    def product_id(cls) -> str:
        return f"PROD-{cls.__random__.randint(1000, 9999)}"

    @classmethod
    def name(cls) -> str:
        adjectives = ["Premium", "Deluxe", "Classic", "Modern", "Eco"]
        nouns = ["Widget", "Gadget", "Device", "Tool", "Appliance"]
        return f"{cls.__random__.choice(adjectives)} {cls.__random__.choice(nouns)}"

    @classmethod
    def price(cls) -> float:
        return round(cls.__random__.uniform(10.0, 1000.0), 2)

    @classmethod
    def discount_percentage(cls) -> float:
        return round(cls.__random__.uniform(0, 30), 2)

    @classmethod
    def stock_quantity(cls) -> int:
        return cls.__random__.randint(0, 500)

    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        # Optional fields may be filled with random values, so derive the
        # calculated fields unless the caller passed them explicitly.
        if "final_price" not in kwargs:
            instance.final_price = round(
                instance.price * (1 - instance.discount_percentage / 100), 2
            )
        if "sku" not in kwargs:
            name_part = instance.name.replace(" ", "-").upper()[:10]
            instance.sku = f"{instance.product_id}-{name_part}"
        return instance


products = ProductFactory.batch(3)
print("Generated Products:")
for prod in products:
    print(f" {prod.sku}")
    print(f" Name: {prod.name}")
    print(f" Price: ${prod.price:.2f}")
    print(f" Discount: {prod.discount_percentage}%")
    print(f" Final Price: ${prod.final_price:.2f}")
    print(f" Stock: {prod.stock_quantity} units")
    print()
print()

We focus on generating simple yet realistic mock data using dataclasses and Polyfactory's default behavior. We show how to quickly create single instances and batches without writing any custom logic. This helps us verify how Polyfactory interprets type hints to populate nested structures automatically.

print("=" * 80)
print("SECTION 6: Complex Nested Structures")
print("=" * 80)

from enum import Enum


class OrderStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"


@dataclass
class OrderItem:
    product_name: str
    quantity: int
    unit_price: float
    total_price: Optional[float] = None


@dataclass
class ShippingInfo:
    carrier: str
    tracking_number: str
    estimated_delivery: date


@dataclass
class Order:
    order_id: str
    customer_name: str
    customer_email: str
    status: OrderStatus
    items: List[OrderItem]
    order_date: datetime
    shipping_info: Optional[ShippingInfo] = None
    total_amount: Optional[float] = None
    notes: Optional[str] = None


class OrderItemFactory(DataclassFactory[OrderItem]):
    @classmethod
    def product_name(cls) -> str:
        products = ["Laptop", "Mouse", "Keyboard", "Monitor", "Headphones",
                    "Webcam", "USB Cable", "Phone Case", "Charger", "Tablet"]
        return cls.__random__.choice(products)

    @classmethod
    def quantity(cls) -> int:
        return cls.__random__.randint(1, 5)

    @classmethod
    def unit_price(cls) -> float:
        return round(cls.__random__.uniform(5.0, 500.0), 2)

    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        # Derive the line total unless the caller supplied one explicitly.
        if "total_price" not in kwargs:
            instance.total_price = round(instance.quantity * instance.unit_price, 2)
        return instance


class ShippingInfoFactory(DataclassFactory[ShippingInfo]):
    @classmethod
    def carrier(cls) -> str:
        carriers = ["FedEx", "UPS", "DHL", "USPS"]
        return cls.__random__.choice(carriers)

    @classmethod
    def tracking_number(cls) -> str:
        return ''.join(cls.__random__.choices('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ', k=12))


class OrderFactory(DataclassFactory[Order]):
    @classmethod
    def order_id(cls) -> str:
        return f"ORD-{datetime.now().year}-{cls.__random__.randint(100000, 999999)}"

    @classmethod
    def items(cls) -> List[OrderItem]:
        return OrderItemFactory.batch(cls.__random__.randint(1, 5))

    @classmethod
    def build(cls, **kwargs):
        instance = super().build(**kwargs)
        # The order total always follows from its items unless overridden.
        if "total_amount" not in kwargs:
            instance.total_amount = round(sum(item.total_price for item in instance.items), 2)
        # Shipped or delivered orders must carry shipping details.
        if instance.shipping_info is None and instance.status in [OrderStatus.SHIPPED, OrderStatus.DELIVERED]:
            instance.shipping_info = ShippingInfoFactory.build()
        return instance


orders = OrderFactory.batch(2)
print("Generated Orders:")
for order in orders:
    print(f"\n Order {order.order_id}")
    print(f" Customer: {order.customer_name} ({order.customer_email})")
    print(f" Status: {order.status.value}")
    print(f" Items ({len(order.items)}):")
    for item in order.items:
        print(f" - {item.quantity}x {item.product_name} @ ${item.unit_price:.2f} = ${item.total_price:.2f}")
    print(f" Total: ${order.total_amount:.2f}")
    if order.shipping_info:
        print(f" Shipping: {order.shipping_info.carrier} - {order.shipping_info.tracking_number}")
print("\n")

We build more complex domain logic by introducing calculated and dependent fields within factories. We show how to derive values such as final prices, order totals, and shipping details after an object is created. This allows us to model realistic business rules directly within our test data generators.

print("=" * 80)
print("SECTION 7: Attrs Integration")
print("=" * 80)

import attrs
from polyfactory.factories.attrs_factory import AttrsFactory


@attrs.define
class BlogPost:
    title: str
    author: str
    content: str
    views: int = 0
    likes: int = 0
    published: bool = False
    published_at: Optional[datetime] = None
    tags: List[str] = attrs.field(factory=list)


class BlogPostFactory(AttrsFactory[BlogPost]):
    @classmethod
    def title(cls) -> str:
        templates = [
            "10 Tips for {}",
            "Understanding {}",
            "The Complete Guide to {}",
            "Why {} Matters",
            "Getting Started with {}"
        ]
        topics = ["Python", "Data Science", "Machine Learning", "Web Development", "DevOps"]
        template = cls.__random__.choice(templates)
        topic = cls.__random__.choice(topics)
        return template.format(topic)

    @classmethod
    def content(cls) -> str:
        # Reuse the factory's Faker instance instead of creating one per call.
        return " ".join(cls.__faker__.sentences(nb=cls.__random__.randint(3, 8)))

    @classmethod
    def views(cls) -> int:
        return cls.__random__.randint(0, 10000)

    @classmethod
    def likes(cls) -> int:
        return cls.__random__.randint(0, 1000)

    @classmethod
    def tags(cls) -> List[str]:
        all_tags = ["python", "tutorial", "beginner", "advanced", "guide",
                    "tips", "best-practices", "2024"]
        return cls.__random__.sample(all_tags, k=cls.__random__.randint(2, 5))


posts = BlogPostFactory.batch(3)
print("Generated Blog Posts:")
for post in posts:
    print(f"\n '{post.title}'")
    print(f" Author: {post.author}")
    print(f" Views: {post.views:,} | Likes: {post.likes:,}")
    print(f" Published: {post.published}")
    print(f" Tags: {', '.join(post.tags)}")
    print(f" Preview: {post.content[:100]}...")
print("\n")
print("=" * 80)
print("SECTION 8: Building with Specific Overrides")
print("=" * 80)

# Keyword arguments pin specific fields; the rest stay randomly generated.
custom_person = PersonFactory.build(
    name="Alice Johnson",
    age=30,
    email="[email protected]"
)
print("Custom Person:")
print(f" Name: {custom_person.name}")
print(f" Age: {custom_person.age}")
print(f" Email: {custom_person.email}")
print(f" ID (auto-generated): {custom_person.id}")
print()

vip_customers = PersonFactory.batch(
    3,
    bio="VIP Customer"
)
print("VIP Customers:")
for customer in vip_customers:
    print(f" {customer.name}: {customer.bio}")
print("\n")

We extend Polyfactory to validated Pydantic models and attrs-based classes. We show how to respect field constraints, validators, and default behaviors while generating valid data at scale. This ensures that our mock data always matches the actual application schemas.

print("=" * 80)
print("SECTION 9: Field-Level Control with Use and Ignore")
print("=" * 80)

from polyfactory.fields import Use, Ignore


@dataclass
class Configuration:
    app_name: str
    version: str
    debug: bool
    created_at: datetime
    api_key: str
    secret_key: str


class ConfigFactory(DataclassFactory[Configuration]):
    # Use wraps a callable whose return value becomes the field value.
    app_name = Use(lambda: "MyAwesomeApp")
    version = Use(lambda: "1.0.0")
    debug = Use(lambda: False)

    @classmethod
    def api_key(cls) -> str:
        return f"api_key_{''.join(cls.__random__.choices('0123456789abcdef', k=32))}"

    @classmethod
    def secret_key(cls) -> str:
        return f"secret_{''.join(cls.__random__.choices('0123456789abcdef', k=64))}"


configs = ConfigFactory.batch(2)
print("Generated Configurations:")
for config in configs:
    print(f" App: {config.app_name} v{config.version}")
    print(f" Debug: {config.debug}")
    print(f" API Key: {config.api_key[:20]}...")
    print(f" Created: {config.created_at}")
    print()
print()
print("=" * 80)
print("SECTION 10: Model Coverage Testing")
print("=" * 80)

from pydantic import BaseModel, ConfigDict
# ModelFactory is required below but was never imported in this listing.
from polyfactory.factories.pydantic_factory import ModelFactory


class PaymentMethod(BaseModel):
    model_config = ConfigDict(use_enum_values=True)
    type: str
    card_number: Optional[str] = None
    bank_name: Optional[str] = None
    verified: bool = False


class PaymentMethodFactory(ModelFactory[PaymentMethod]):
    __model__ = PaymentMethod


# Build one instance per scenario we want the tests to cover.
payment_methods = [
    PaymentMethodFactory.build(type="card", card_number="4111111111111111"),
    PaymentMethodFactory.build(type="bank", bank_name="Chase Bank"),
    PaymentMethodFactory.build(verified=True),
]
print("Payment Method Coverage:")
for i, pm in enumerate(payment_methods, 1):
    print(f" {i}. Type: {pm.type}")
    if pm.card_number:
        print(f" Card: {pm.card_number}")
    if pm.bank_name:
        print(f" Bank: {pm.bank_name}")
    print(f" Verified: {pm.verified}")
print("\n")
print("=" * 80)
print("TUTORIAL SUMMARY")
print("=" * 80)
print("""
This tutorial covered:
1. ✓ Basic Dataclass Factories - Simple mock data generation
2. ✓ Custom Field Generators - Controlling individual field values
3. ✓ Field Constraints - Using PostGenerated for calculated fields
4. ✓ Pydantic Integration - Working with validated models
5. ✓ Complex Nested Structures - Building related objects
6. ✓ Attrs Support - Alternative to dataclasses
7. ✓ Build Overrides - Customizing specific instances
8. ✓ Use and Ignore - Explicit field control
9. ✓ Coverage Testing - Ensuring comprehensive test data
Key Takeaways:
- Polyfactory automatically generates mock data from type hints
- Customize generation with classmethods and decorators
- Supports multiple libraries: dataclasses, Pydantic, attrs, msgspec
- Use PostGenerated for calculated/dependent fields
- Override specific values while keeping others random
- Perfect for testing, development, and prototyping
For more information:
- Documentation:
- GitHub:
""")
print("=" * 80)
We cover advanced usage patterns such as explicit overrides, fixed field values, and coverage-testing scenarios. We show how to deliberately build edge cases and variant-specific instances for robust testing. This final step ties everything together by showing how Polyfactory supports comprehensive, production-grade test data strategies.

In conclusion, we showed how Polyfactory lets us create extensive, flexible test data with minimal boilerplate while keeping fine-grained control over every field. We covered simple entities, complex nested structures, Pydantic model validation, and field-level overrides, all within a single, consistent factory-based approach. Overall, Polyfactory lets us move faster and test with greater confidence, because it reliably generates realistic datasets that mimic production-like scenarios without sacrificing clarity or maintainability.



