Computer Vision

Deep learning is the dominant approach in modern computer vision. It uses artificial neural networks with multiple layers to automatically learn complex features from visual data.

Convolutional Neural Networks (CNNs)

The cornerstone of deep learning for computer vision, CNNs are designed to process pixel data effectively. They use convolutional layers to apply filters that detect low-level features like edges and textures, and progressively build up to high-level features like shapes and object parts.
How they work: The network applies small, learnable filters that slide over the input image. Each filter creates a "feature map" that responds to a specific visual feature. Pooling layers are then used to downsample the feature maps, reducing the amount of data while preserving important information.
Notable architectures: Popular CNN architectures include LeNet, AlexNet, VGG, ResNet, and EfficientNet, each introducing innovations to improve performance or efficiency.

Vision Transformers

Inspired by the success of transformer models in natural language processing, ViTs process images by dividing them into a sequence of patches, similar to how text is broken into tokens.
How they work: Unlike CNNs, which focus on local pixel relationships, ViTs use self-attention mechanisms to weigh the relationships between all patches of an image, allowing them to capture global contextual information.
Performance: While requiring large datasets for high accuracy, ViTs have shown impressive performance and computational efficiency on many vision tasks when trained properly.

Generative AI Models (Gen AI)

These models create new visual content by learning from existing data.
Generative Adversarial Networks (GANs): A GAN consists of a generator network that creates new images and a discriminator network that determines if the image is real or fake. This adversarial process refines the generator to produce highly realistic images.
Diffusion Models: These models work by learning to reverse a process of progressively adding noise to an image. By starting with random noise and denoising it over several steps, they can create high-quality, photorealistic images.