A PyTorch Image Model Descriptive Predictions

In today’s world, we generate and produce enormous amounts of data every day. It would not be wrong to describe these times as an era of big data, in which all areas of science and industry thrive on masses of data and the associated technologies. This, however, also presents unprecedented challenges for analysis and interpretation, which is why there is an urgent need for machine learning and artificial intelligence methods that can make good use of such data. Deep learning (DL) is one such method that is currently receiving a great deal of attention. DL can be described as a family of learning algorithms used to train complex prediction models, and it has been applied successfully to a number of application problems.

Deep learning models represent a new learning paradigm in artificial intelligence (AI) and machine learning (ML). The recent revolutionary results in image analysis and speech recognition have generated massive interest in the field, as applications in many other domains involving big data seem possible. The mathematical and computational methodology underlying deep learning models can be difficult, especially for interdisciplinary scientists, yet the basic architectures of the models currently in use should belong in every data scientist’s toolbox: these architectural building blocks can be flexibly composed to create new application-specific network architectures.
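As a minimal sketch of how such building blocks compose in PyTorch (the library used later in this article; the layer choices here are illustrative, not from any specific model):

```python
import torch
from torch import nn

# Standard building blocks composed into a small application-specific
# network: a convolutional feature extractor followed by a classifier head.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # RGB in, 16 feature maps out
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                     # global average pooling
    nn.Flatten(),
    nn.Linear(16, 10),                           # 10-class output
)

x = torch.randn(1, 3, 32, 32)   # one RGB image, 32x32
print(model(x).shape)           # torch.Size([1, 10])
```

Swapping, adding or removing layers in the `Sequential` container is all it takes to adapt the architecture to a new task.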


Data analysis varies from company to company, so the data model should always be designed to meet the specific requirements. Predictive modeling, a major sub-field of data analysis, uses data mining and probabilistic methods to predict outcomes. Each model is built from many predictors, which makes it useful for informing future decisions. Once data is received for a specific type of prediction, an analytical model is formulated; anything from simple linear equations to a complex neural network can then be applied, implemented in the relevant software. If additional data becomes available, the analytical model is revised. Predictive modeling draws on various regression algorithms and statistical analyses to estimate the probability of an event, and it is widely employed in fields related to artificial intelligence (AI).
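The workflow just described can be illustrated with a toy example (the data and feature names here are hypothetical, and scikit-learn stands in for the "relevant software"):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical predictors (hours studied, prior score) and outcomes (pass=1)
X = np.array([[1, 40], [2, 45], [3, 60], [5, 70], [6, 80], [8, 90]])
y = np.array([0, 0, 0, 1, 1, 1])

# Formulate an analytical model from the received data ...
model = LogisticRegression().fit(X, y)

# ... then estimate the probability of the event for a new case
print(model.predict_proba([[4, 65]])[0, 1])  # probability of passing
```

If more labeled cases arrive later, refitting the model on the extended data is exactly the "revision" step described above.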

PyTorch is an optimized tensor library primarily used for deep learning applications, exploiting GPUs as well as CPUs for processing power. It is an open-source machine learning library for Python developed by the Facebook AI Research team, and is one of the most widely used machine learning libraries alongside TensorFlow and Keras.
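A minimal taste of the tensor API, assuming PyTorch is installed (this snippet is illustrative and not part of the model built later):

```python
import torch

# Tensors look like NumPy arrays but can live on a GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], device=device)

# Automatic differentiation is built in: mark a tensor as trainable,
# compute a scalar, and backward() fills in the gradients.
w = torch.ones(3, requires_grad=True, device=device)
y = (x @ w).sum()
y.backward()
print(w.grad)  # each weight's gradient is the sum of its input column: 5, 7, 9
```

This autograd machinery is what PyTorch-based libraries such as timm (used below) build on.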

What are image transformers?

Image Transformer is a model based entirely on the self-attention mechanism, in which the encoder generates a per-channel representation of each pixel of the source image. Despite the relatively low resources required for training, Image Transformer models are typically trained on images from the standard ImageNet dataset. Many applications of image models require conditioning on additional information of various kinds: from images in enhancement or reconstruction tasks such as super-resolution, inpainting and denoising, to text when synthesizing images from natural-language descriptions.

In vision tasks, Transformer-based image generation models can predict future frames of a video from previous frames and the actions taken. Image Transformers treat pixel intensities either as discrete categories or as ordinal values; the choice is subjective and depends on the distribution of the image data. For both the image encoder and decoder, the Image Transformer uses multiple stacks of self-attention and position-wise feed-forward layers. The decoder additionally uses an attention mechanism to take the encoder’s representation as input.

For unconditional and class-conditional image generation with the Image Transformer, a decoder-only configuration is used. Each self-attention layer computes a D-dimensional representation for each position, i.e. for each channel of each image pixel. To recompute the representation of a given position, the layer first compares the position’s current representation with the representations of the other positions, yielding an attention distribution over those positions. This distribution is then used to weight the contributions of the other positions’ representations to the next representation of the position in question. Vision Transformers (ViT) have been shown to achieve very competitive performance on a wide range of computer vision tasks, such as image classification, object detection and semantic segmentation. When training on smaller datasets, ViT is generally seen to rely more on model regularization or data augmentation, “AugReg” for short.
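The per-position computation described above can be sketched in a few lines of plain NumPy. This is not the library's implementation; it is a simplified scaled dot-product self-attention with no learned query/key/value projections:

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over (num_positions, D) inputs."""
    d = x.shape[-1]
    # Compare every position's representation to every other position's
    scores = x @ x.T / np.sqrt(d)
    # Softmax turns the scores into an attention distribution per position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # New representation: attention-weighted sum of the other representations
    return weights @ x

x = np.random.default_rng(0).normal(size=(4, 8))  # 4 positions, D = 8
out = self_attention(x)
print(out.shape)  # (4, 8): one new D-dimensional representation per position
```

A real Transformer layer adds learned projections, multiple heads and a position-wise feed-forward sublayer on top of this core operation.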

Image source: Original paper

Getting started with the code for ViT-AugReg

This article will attempt to generate a descriptive prediction from a dataset of images using the ViT library: we will predict the dog breed in a picture and tag it using a Vision Transformer. The following code is inspired by the creators of the library, whose GitHub repository is accessible here.

Library installation

To create this prediction model, we will first install the Vision Transformer. The following code can be used for this,

# Install the vision_transformer Library.
![ -d vision_transformer ] || git clone --depth=1 https://github.com/google-research/vision_transformer

Install the other dependencies,

# Install dependencies.
!pip install -qr vision_transformer/vit_jax/requirements.txt

Importing the AugReg model,

import sys
if './vision_transformer' not in sys.path:
  sys.path.append('./vision_transformer')
 
%load_ext autoreload
%autoreload 2
 
from vit_jax import checkpoint
from vit_jax import models
from vit_jax import train
from vit_jax.configs import augreg as augreg_config
from vit_jax.configs import models as models_config

Importing dependencies for analysis,

import glob
import os
import random
import shutil
import time
 
from absl import logging
import pandas as pd
import seaborn as sns
import tensorflow as tf
import tensorflow_datasets as tfds
from matplotlib import pyplot as plt
 
pd.options.display.max_colwidth = None
logging.set_verbosity(logging.INFO)

Loading data

We will now load the master table of AugReg checkpoints from cloud storage.

# Load master table from Cloud.
with tf.io.gfile.GFile('gs://vit_models/augreg/index.csv') as f:
  df = pd.read_csv(f)


# List the rows and columns

print(f'loaded {len(df):,} rows')
df.columns

#print length of dataset
len(set(df.filename)), len(set(df.adapt_filename))

# loading the dataset checkpoint
best_filenames = set(
    df.query('ds=="i21k"')
    .groupby('name')
    .apply(lambda df: df.sort_values('final_val').iloc[-1])
    .filename
)
 
# Fine Tuning these models.
best_df = df.loc[df.filename.apply(lambda filename: filename in best_filenames)]

Now that all of the essential model and model data checkpoints are loaded, we can build the prediction model.

Creation of the predictor model

We’ll start building the prediction model by first loading our pet image dataset. The following code can be used for this,

# Loading the image dataset; tfds_name comes from the selected row of
# the checkpoints table (for the pet images, 'oxford_iiit_pet')
ds, ds_info = tfds.load(tfds_name, with_info=True)
ds_info

# Get model instance; model_config is the architecture config matching
# the selected checkpoint (from vit_jax.configs.models)
model = models.VisionTransformer(
    num_classes=ds_info.features['label'].num_classes, **model_config)


Now we will take a single random image from the pet dataset to make our prediction on,

d = next(iter(ds['test']))

# Display a random image; `resolution` is the model's input size (e.g. 384)
def pp(img, sz):
  # Scale pixel values to [0, 1] and resize to sz x sz
  img = tf.cast(img, float) / 255.0
  img = tf.image.resize(img, [sz, sz])
  return img

plt.imshow(pp(d['image'], resolution));

Output:

# Applying the VIT-AugReg model on image
logits, = model.apply({'params': params}, [pp(d['image'], resolution)], train=False)

# Plotting the label probabilities.
plt.figure(figsize=(10, 4))
plt.bar(list(map(ds_info.features['label'].int2str, range(len(logits)))), logits)
plt.xticks(rotation=90);


Output:

As we can see, the model predicted the dog breed to be Leonberger. Now let’s compare the result with an image of a Leonberger dog.

Image source

As we can see, our predictive model seems to have correctly predicted the dog breed tag for our sample image!

The created Vision Transformer can also be incorporated into other PyTorch image models. Let’s try it with the timm library,

# Installing the timm model library
!pip install timm
import timm
import torch


# Loading the model into timm
timm_model = timm.create_model(
    'vit_small_r26_s32_384', num_classes=ds_info.features['label'].num_classes)

# `filename` is the name of the checkpoint selected from the master table
if not tf.io.gfile.exists(f'{filename}.npz'):
  tf.io.gfile.copy(f'gs://vit_models/augreg/{filename}.npz', f'{filename}.npz')
timm.models.load_checkpoint(timm_model, f'{filename}.npz')

Image processing in the model,

# Loading the image into the timm model: convert the HWC TensorFlow
# image to the NCHW torch tensor that timm expects
def pp_torch(img, sz):
  img = pp(img, sz)
  img = img.numpy().transpose([2, 0, 1])  # HWC -> CHW
  return torch.tensor(img[None])          # add batch dimension
 
with torch.no_grad():
  logits, = timm_model(pp_torch(d['image'], resolution)).detach().numpy()

# Visualizing results for Timm
plt.figure(figsize=(10, 4))
plt.bar(list(map(ds_info.features['label'].int2str, range(len(logits)))), logits)
plt.xticks(rotation=90);

We can observe that our model again gives us the correctly predicted label.

End Notes

This article has tried to explore and understand image models and how they work. We also looked at a descriptive PyTorch image model known as ViT, where we implemented AugReg to create an image tag predictor. The above implementation is available as a Colab notebook accessible here.

Happy learning!
