Deep learning job interviews. A necessary evil. Most beginners in the industry break out in a cold sweat at the mere thought of a machine learning or deep learning job interview. How do I prepare for my upcoming deep learning job interview? What kind of deep learning interview questions are they going to ask me? What questions should I ask them? These are just a few of the thoughts that run through the mind of any interviewee. The problem with most machine learning or deep learning interviews is that you never know whether to bring your lucky whiteboard marker or your lucky keyboard. Not to mention that the deep learning questions you will be asked in your next job interview are hardly predictable.
The good news? We’ve collated 100 deep learning technical interview questions from the insights of our industry experts on what kind of questions they ask most often. So, keep calm and read on to see what kind of questions you can expect in the hot seat in your next deep learning job interview. Ready to dive in? Then let’s get started!
The first step in choosing a neural network model is to understand the data well and then decide which model suits it best. Whether the problem is linearly separable also matters. So, the task at hand and the data play a vital role in choosing the best neural network model for a given problem. However, it is usually better to start with a simple model such as a multi-layer perceptron (MLP) with just one hidden layer, unlike a CNN, LSTM, or RNN, which require more decisions about nodes and layers. An MLP is often considered the simplest neural network because it is less sensitive to weight initialization and needs little architectural design up front.
The curse of dimensionality (the problems that arise when working with high-dimensional data) is a common problem in machine learning and deep learning projects. It makes training difficult because a large number of parameters must be fitted on a scarce dataset, leading to issues like overfitting, long training times, and poor generalization. PCA and autoencoders are used to tackle these issues. PCA is an unsupervised technique in which the data is projected onto the directions of highest variance, while autoencoders are neural networks that compress the data into a low-dimensional latent space and then try to reconstruct the original high-dimensional data.
PCA and autoencoders are effective only when the features have some relationship with each other. A general rule of thumb for choosing between PCA and autoencoders is the size of the data: autoencoders work well for larger datasets and PCA works well for smaller datasets. Autoencoders are usually preferred when there is a need to model non-linearities and relatively complex relationships. Autoencoders can encode a lot of information in fewer dimensions when the low-dimensional structure is curved or non-linear, making them a better choice than PCA in such scenarios.
Autoencoders are usually preferred for identifying data anomalies rather than for reducing data. Anomalous data points can be identified using the reconstruction error. PCA is not good at reconstructing data, particularly when there are non-linear relationships.
Given a business problem, there is no hard and fast rule to determine the exact number of neurons and hidden layers required to build a neural network architecture. A common rule of thumb is that the optimal size of a hidden layer lies between the size of the output layer and the size of the input layer. However, here are some common approaches that give a good starting point for building a neural network architecture:
One common problem with using ANNs for image classification is that ANNs react differently to an input image and its shifted versions. Let’s consider a simple example where one image has a dog in the top left and another image has a dog in the bottom right. An ANN will assume that a dog always appears in the same section of an image, which is not the case. ANNs require concrete data points: if you are building a deep learning model to distinguish between cats and dogs, the length of the ears, the width of the nose, and other features have to be provided as data points, whereas a CNN used for image classification extracts spatial features from the input images itself. When there are thousands of features to be extracted, a CNN is a better choice because it gathers features on its own, unlike an ANN where each individual feature needs to be measured.
Training a neural network model becomes computationally heavy (requiring additional storage and processing capability) as the number of layers and parameters increases. Tuning the increased number of parameters can be a tedious task with ANN, unlike CNN where the time for tuning parameters is reduced making it an ideal choice for image classification problems.
A common problem with Tanh or Sigmoid functions is that they saturate. Once they are saturated, their gradients are close to zero, so the learning algorithm can no longer adapt the weights and improve the performance of the model. Thus, Sigmoid or Tanh activation functions can prevent the neural network from learning effectively, which is known as the vanishing gradient problem. The vanishing gradient problem can be addressed by using the Rectified Linear Unit (ReLU) activation function instead of sigmoid and by using Xavier initialization.
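The saturation effect can be illustrated with a few lines of plain Python. This is a minimal sketch, not a training loop: it assumes a chain of 10 layers whose weights are all fixed at 1 and whose pre-activation is the same value at every layer, so the backpropagated gradient is just the local derivative raised to the 10th power. The helper `chained_gradient` is hypothetical and exists only for this illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    # Derivative of the sigmoid; its maximum is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0


def chained_gradient(local_grad, pre_activation: float, depth: int) -> float:
    # Assumption: every layer sees the same pre-activation and has weight 1,
    # so the chain rule reduces to a product of identical local derivatives.
    return local_grad(pre_activation) ** depth


sig_chain = chained_gradient(sigmoid_grad, 4.0, 10)   # saturated sigmoid
relu_chain = chained_gradient(relu_grad, 4.0, 10)     # active ReLU

print(f"sigmoid chain: {sig_chain:.3e}")
print(f"relu chain:    {relu_chain:.3e}")
```

Under these assumptions the sigmoid chain shrinks to a vanishingly small number while the ReLU chain stays at 1, which is the intuition behind preferring ReLU in deep stacks.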
The exploding gradient problem occurs when the gradients, and with them the model weights, grow exponentially and become unexpectedly large during training. In a neural network with n hidden layers, n derivatives are multiplied together. If the weights being multiplied are greater than 1, the gradient grows exponentially as it is propagated through the model and eventually explodes. The resulting huge updates hinder training and hurt the overall accuracy of the model. Exploding gradients are a serious problem because the model cannot learn from its training data, resulting in a poor loss. One can deal with the exploding gradient problem by gradient clipping, weight regularization, or the use of LSTMs.
Constant validation accuracy is a common problem when training a neural network because the network simply memorizes the training samples, which results in overfitting. Overfitting means that the neural network model performs very well on the training sample but its performance drops on the validation set. Here are some tips to try to fix constant validation accuracy in a CNN:
The learning rate is one of the most important configurable hyperparameters used in the training of a neural network. Its value is typically a small positive number between 0 and 1. Choosing the learning rate is one of the most challenging aspects of training a neural network because it controls how quickly or slowly a neural network model adapts to a given problem and learns. A higher learning rate means that the model needs fewer training epochs and makes rapid changes, but it can overshoot and settle on a suboptimal solution or fail to converge, while a smaller learning rate means that the model will take a long time to converge or might get stuck. Thus, it is advisable not to use a learning rate that is too low or too high; instead, a good learning rate value should be discovered through trial and error.
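A toy example makes the trade-off concrete. The sketch below assumes the simplest possible problem, minimising f(x) = x**2 with plain gradient descent, and the three learning rates are illustrative values chosen for this function only, not recommendations for real networks.

```python
def gradient_descent(lr: float, steps: int = 50, x0: float = 5.0) -> float:
    """Minimise f(x) = x**2 (gradient 2x) and return the final x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x
    return x


too_small = gradient_descent(lr=0.001)   # barely moves away from x0
reasonable = gradient_descent(lr=0.1)    # ends close to the minimum at 0
too_large = gradient_descent(lr=1.1)     # each step overshoots and diverges

print(too_small, reasonable, too_large)
```

With the same number of steps, the small rate is still far from the minimum, the moderate rate has essentially converged, and the large rate has blown up, which is why the value is usually found by experiment.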
A typical neural network has one or more hidden layers along with input and output layers. Neural networks that use a single hidden layer are known as shallow neural networks, while those that use multiple hidden layers are referred to as deep neural networks. Both shallow and deep networks are capable of fitting any function, but shallow networks require a lot of parameters, unlike deep networks, which can fit functions with a limited number of parameters because of their several layers. Deep networks are preferred today over shallow networks because at every layer the model learns a new and more abstract representation of the input. They are also much more efficient in terms of the number of parameters and computations compared to shallow networks.
Yes, there is a possibility that the neural network model will learn even if all the biases are initialized to 0.
No, it is not possible to train a model by initializing all the weights to 0 because the neural network will never learn to perform the given task. Initializing all weights to zero causes the derivatives to be the same for every w in W, so the neurons learn the same features in every iteration. Not just 0, but any kind of constant initialization of weights is likely to produce a poor result.
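The symmetry argument can be checked by hand on a tiny network. The following sketch assumes a hypothetical architecture (two inputs, two tanh hidden units, one linear output, squared-error loss) and computes the gradient for each hidden unit's weights directly; the function name and the sample values are made up for illustration.

```python
import math


def hidden_weight_grads(w_hidden, w_out, x, y):
    """Gradients of 0.5 * (y_hat - y)**2 w.r.t. each hidden unit's weights.

    Network: h_j = tanh(w_hidden[j] . x), y_hat = sum_j w_out[j] * h_j
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y_hat = sum(wo * hj for wo, hj in zip(w_out, h))
    err = y_hat - y
    grads = []
    for wo, hj in zip(w_out, h):
        delta = err * wo * (1.0 - hj * hj)  # backprop through tanh
        grads.append([delta * xi for xi in x])
    return grads


x, y = [1.0, 2.0], 1.0

# Constant initialization: both hidden units receive identical gradients.
const_grads = hidden_weight_grads([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5], x, y)
# Zero initialization: the hidden-layer gradients are all zero.
zero_grads = hidden_weight_grads([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], x, y)
# Distinct initial weights: the units receive different gradients.
rand_grads = hidden_weight_grads([[0.3, -0.2], [-0.1, 0.4]], [0.5, -0.7], x, y)

print(const_grads, zero_grads, rand_grads, sep="\n")
```

Because identical units get identical updates, they stay identical after every step, so the layer behaves as if it had a single neuron. Random initialization breaks this symmetry.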
Without non-linearities, a neural network behaves like a single linear model regardless of how many layers it has, making the output linearly dependent on the input. In other words, a neural network with n layers and m hidden units with linear activation functions is equivalent to a linear neural network without hidden layers, which can only find linear separation boundaries. A neural network without non-linearities cannot find appropriate solutions or classify the data correctly for complex problems.
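The collapse of stacked linear layers into one can be verified numerically. This is a small sketch with made-up weight matrices `W1` and `W2` and no bias terms (an assumption made to keep the arithmetic short); it compares a two-layer linear network with the single layer obtained by multiplying the two matrices.

```python
def matmul(a, b):
    # Plain nested-list matrix product: (n x k) @ (k x m) -> (n x m).
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]


W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 2 inputs -> 3 hidden units
W2 = [[2.0, 0.0, 1.0]]                       # 3 hidden units -> 1 output
x = [4.0, -2.0]

two_layer = matvec(W2, matvec(W1, x))   # layer by layer, no activation
collapsed = matvec(matmul(W2, W1), x)   # one equivalent linear layer

print(two_layer, collapsed)  # both are [10.0]
```

Whatever the depth, the product of the weight matrices is itself a single matrix, so the extra layers add no expressive power until a non-linear activation is placed between them.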
The problem with deep neural networks is that they are very likely to overfit training data when there are few examples. Overfitting can be reduced by using ensembles of networks with different model configurations, but this requires the additional effort of maintaining multiple models and is also computationally expensive. Dropout is one of the easiest and most successful methods to reduce co-dependencies in deep neural networks and overcome overfitting. With dropout regularization, a single neural network model is used to simulate many different network architectures by randomly dropping out nodes during training. It is considered an effective method of regularization because it reduces generalization error and is also computationally cheap.
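As a rough illustration of the mechanics, here is a sketch of dropout applied to a list of activations. It assumes the "inverted dropout" variant, in which surviving activations are scaled up during training so that nothing needs to change at inference time; the article does not say which variant is meant, and the function below is illustrative rather than any framework's API.

```python
import random


def dropout(activations, drop_prob, training=True, rng=random):
    """Inverted dropout: zero each activation with probability drop_prob
    and rescale the survivors so the expected value is unchanged."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


rng = random.Random(0)          # fixed seed so the run is repeatable
layer_out = [1.0] * 10000       # stand-in for a layer's activations
dropped = dropout(layer_out, drop_prob=0.5, rng=rng)

mean_after = sum(dropped) / len(dropped)
print(f"mean activation after dropout: {mean_after:.3f}")
```

Each training step samples a different mask, so the network effectively trains a different thinned sub-network every time, while the rescaling keeps the average activation roughly the same.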
You will need to know about one-shot learning for face recognition, a classification task in which one or a few examples (faces in this case) are used to classify new examples (faces) in the future. You also need to know how to index the data so that a new face can be retrieved quickly. A new face can be recognized by finding the vectors that are closest (most similar) to the input face, but the system would become extremely slow if it had to calculate the distance to all 12 million vectors. A convenient approach is to index the data in real vector space by dividing it into structures that are easy to query (much like a tree data structure), so that the vectors in close proximity to a new face can be found very quickly whenever new data is available. Techniques like Annoy indexing, Locality Sensitive Hashing, and other Approximate Nearest Neighbour methods can be used for this purpose.
Flexibility makes deep learning powerful. Neural networks are universal function approximators, so even for a complex problem (where the formula relating input and output is not known), a neural network can approximate the underlying function. Also, transfer learning (where the trained weights of an existing neural network are used to initialize the weights of another network that performs a similar task) makes deep learning much easier to apply when training a neural network from scratch is costly or almost impossible because data is scarce.
Faster and more powerful computational resources are also a prime reason for the adoption of neural network architectures. With GPU acceleration, a neural network that would otherwise take days to train can be trained in minutes.
Yes, it is definitely possible to build deep networks using a linear function as the activation function for each layer if the problem is represented by a linear equation. However, a composition of linear functions is itself a linear function, so nothing extra is achieved by implementing a deep network: adding more nodes to the network will not increase the predictive power of the machine learning model.
A decrease in the accuracy of a deep learning model after a few epochs implies that the model is memorizing the peculiarities of the training dataset rather than learning features that generalize. This is referred to as overfitting. You can use either dropout regularization or early stopping to fix this issue. Early stopping, as the phrase implies, stops training the deep learning model the moment you notice a drop in the accuracy of the model on the validation set. Dropout regularization is a technique wherein a few nodes are randomly dropped during training so that the network does not rely too heavily on any single node.
With images as inputs, an improperly set learning rate can cause noisy features. An ill-chosen learning rate degrades the prediction quality of a model and can result in an unconverged neural network.
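Early stopping is easy to sketch in isolation. The function below assumes validation accuracy is recorded once per epoch and adds a `patience` parameter (a common refinement that is not mentioned above) so that a single noisy epoch does not end training. The accuracy values are made up for illustration.

```python
def early_stopping_epoch(val_accuracies, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once validation accuracy has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("-inf")
    bad_epochs = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best = acc
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_accuracies) - 1  # never triggered: ran to the end


history = [0.70, 0.78, 0.82, 0.81, 0.80, 0.79]  # hypothetical accuracies
stop_at = early_stopping_epoch(history, patience=2)
print(f"stop after epoch index {stop_at}")
```

In practice the weights from the best epoch (index 2 in this made-up history) are usually the ones that are kept, rather than the weights at the moment training stops.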
19) What do you understand by the terms Batch, Iterations, and Epoch in training a neural network model?
20) Is it possible to calculate the learning rate for a model a priori?
For simple models, it may be possible to set the best learning rate value a priori. However, for complex models, it is not possible to derive a learning rate that yields accurate predictions through theoretical deduction alone. Observation and experience play a vital role in defining the optimal learning rate.
21) What is the theoretical foundation of neural networks?
To answer this question one needs to explain the universal approximation theorem that forms the base on why neural networks work.
Introducing non-linearity via an activation function allows us to approximate any function. It’s quite simple, really. — Elon Musk
According to the Universal Approximation Theorem, a neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function to a reasonable accuracy for inputs in a specific range. However, the guarantee covers only that range: where the training inputs leave large gaps, the network cannot be expected to approximate the function. Meaning, if a neural network is trained with inputs between 20 and 30, we cannot be assured that it will work well for inputs between 60 and 70.
22) What are the commonly used approaches to set the learning rate?
23) Is there any difference between neural networks and deep learning?
In principle, there is no significant difference between deep learning networks and neural networks. Deep learning networks are neural networks, but with a somewhat more complex architecture than they had in the 1990s. It is the availability of hardware and computational resources that has made it feasible to implement them now.
24) You want to train a deep learning model on a 10GB dataset but your machine has 4GB RAM. How will you go about implementing a solution to this deep learning problem?
One possible way to answer this question is to say that a neural network can be trained by loading the data as a memory-mapped NumPy array and defining a small batch size. A memory-mapped array does not load the complete dataset into memory but creates a mapping to the dataset on disk, so only the slices that are accessed are read. NumPy also offers tools for storing large datasets in compressed form, and its arrays can be integrated with other NN packages like PyTorch, TensorFlow, or Keras.
25) How will the predictions of a neural network be affected if you use a ReLU activation function and then use the Sigmoid function in the final layer of the network?
The neural network will predict only one class for all types of inputs because the output of a ReLU activation function is always non-negative. The sigmoid of a non-negative value is at least 0.5, so with the usual 0.5 decision threshold every input is assigned to the positive class.
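The idea can be sketched with `numpy.memmap`, which maps a file on disk to an array without reading it all into RAM. The example below is a minimal sketch under stated assumptions: it writes a tiny temporary file as a stand-in for the 10GB dataset, and the file name, shape, dtype, and the `iter_batches` helper are all hypothetical.

```python
import os
import tempfile

import numpy as np


def iter_batches(path, shape, batch_size, dtype=np.float32):
    """Yield mini-batches from a raw binary file without loading it fully."""
    data = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    for start in range(0, shape[0], batch_size):
        # Only this slice is copied into RAM.
        yield np.array(data[start:start + batch_size])


# Tiny stand-in for a dataset that would not fit in memory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "features.dat")
shape = (10, 4)
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
writer[:] = np.arange(40, dtype=np.float32).reshape(shape)
writer.flush()
del writer

first_batch = next(iter_batches(path, shape, batch_size=4))
batch_shapes = [b.shape for b in iter_batches(path, shape, batch_size=4)]
print(batch_shapes)
```

Each yielded batch can then be handed to whichever framework is being used for training. The peak memory use is governed by the batch size rather than by the size of the file.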
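A quick numerical check shows why. The sketch assumes the sigmoid is applied directly to the ReLU output and that the class decision uses a 0.5 threshold; the pre-activation values are arbitrary.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


pre_activations = [-5.0, -0.5, 0.0, 0.5, 5.0]

# ReLU clamps negatives to 0, so the sigmoid never sees a negative input.
probs = [sigmoid(relu(z)) for z in pre_activations]
predicted = [1 if p >= 0.5 else 0 for p in probs]

print(probs)
print(predicted)
```

Every probability lands in the range 0.5 to 1, so the negative class is never predicted, no matter what the input is.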
26) What are the limitations of using a perceptron?
A major drawback of a perceptron is that it can only learn linearly separable functions and cannot handle non-linear inputs.
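The classic demonstration is AND versus XOR. The sketch below trains a single perceptron with the standard perceptron update rule on both truth tables. The epoch count and learning rate are arbitrary choices for this toy case.

```python
def predict(w, b, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            err = target - predict(w, b, x1, x2)
            # Perceptron rule: nudge the boundary toward misclassified points.
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b


def accuracy(samples, w, b):
    correct = sum(1 for (x1, x2), t in samples if predict(w, b, x1, x2) == t)
    return correct / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = accuracy(AND, *train_perceptron(AND))
xor_acc = accuracy(XOR, *train_perceptron(XOR))
print(f"AND accuracy: {and_acc}, XOR accuracy: {xor_acc}")
```

AND is linearly separable, so the perceptron classifies all four points correctly. No single straight line separates the XOR points, so the XOR accuracy stays below 100% however long training runs.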
27) How will you differentiate between a multi-class and multi-label classification problem?
In a multi-class classification problem, the classification task has more than two mutually exclusive classes, whereas in a multi-label problem each label represents a separate classification task, although the tasks are related. For example, classifying a set of images of animals which may be cats, dogs, or bears is a multi-class classification problem that assumes each sample has only one label, meaning an image can be classified as either a cat or a dog but not both at the same time. Now imagine an image that shows both a cat and a dog. That image needs to be classified as both cat and dog because it shows both animals. In a multi-label classification problem, a set of labels is assigned to each sample and the classes are not mutually exclusive. So, a pattern can belong to one or more classes in a multi-label classification problem.
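In practice the difference often shows up in the output layer. The sketch below assumes the common convention of a softmax output with an argmax decision for multi-class problems, and independent sigmoid outputs with a 0.5 threshold for multi-label problems. The logits are invented values standing in for a model's raw outputs on an image containing both a cat and a dog.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


labels = ["cat", "dog", "bear"]
logits = [2.0, 1.5, -3.0]  # hypothetical raw model outputs

# Multi-class: probabilities compete and exactly one label wins.
class_probs = softmax(logits)
multi_class = labels[class_probs.index(max(class_probs))]

# Multi-label: each label is an independent yes/no decision.
multi_label = [lab for lab, z in zip(labels, logits) if sigmoid(z) >= 0.5]

print(multi_class)   # cat
print(multi_label)   # ['cat', 'dog']
```

The softmax probabilities sum to 1, which forces a single winner, while the sigmoid outputs are scored independently and so can flag several labels at once.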
28) What do you understand by transfer learning?
You know how to ride a bicycle, so it will be easy for you to learn to ride a motorbike. This is transfer learning. You have some skill and you can learn a new skill that relates to it without having to learn it from scratch. Transfer learning is a process in which what one model has learned is transferred to another model, so that the new model does not have to learn everything from scratch. The learned features and weights can be reused for training the new model. Transfer learning makes it easier to train a model when there is limited data.
29) What is fine-tuning and how is it different from transfer learning?
In transfer learning, the feature extraction part remains untouched and only the prediction layer is retrained by changing the weights based on the application. On the contrary in fine-tuning, the prediction layer along with the feature extraction stage can be retrained making the process flexible.
30) Why do we use convolutions for images instead of using fully connected layers?
Each convolution kernel in a CNN acts like its own feature detector and has partially built-in translation invariance. Using convolutions lets one preserve, encode, and make use of the spatial information in the image, unlike fully connected layers, which do not have any relative spatial information.
31) What do you understand by Gradient Clipping?
Gradient Clipping is used to deal with the exploding gradient problem that occurs during the backpropagation. The gradient values are forced element-wise to a particular minimum or maximum value if the gradient has crossed the expected range. Gradient clipping provides numerical stability while training a neural network but does not provide any performance improvements.
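A small sketch shows the element-wise clipping described above. For comparison it also includes clipping by the overall norm, a common variant that is not described in this article and is added here only as an illustration. The threshold values and the sample gradient are arbitrary.

```python
import math


def clip_by_value(grads, limit):
    """Force every gradient component into the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm.

    Unlike element-wise clipping, this keeps the gradient's direction.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


grads = [3.0, -4.0, 0.5]
value_clipped = clip_by_value(grads, limit=1.0)
norm_clipped = clip_by_norm([3.0, -4.0], max_norm=1.0)

print(value_clipped)
print(norm_clipped)
```

Element-wise clipping can change the direction of the update because large components are cut more than small ones, whereas norm clipping only shortens the step.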
32) What do you understand by end-to-end learning?
It is a deep learning process where a model gets raw data as the input and all of its parts are trained simultaneously to produce the desired outcome with no intermediate tasks. The advantage of end-to-end learning is that there is no need for explicit feature engineering, which usually leads to a lower bias. A good example that you can quote in the context of end-to-end learning is driverless cars. They use human-provided input as guidance and are trained to automatically learn and process the information using a CNN to complete tasks.
33) Are convolutional neural networks translation-invariant?
Convolutional neural networks are translation invariant only to a certain extent but pooling can make them translation invariant. Making a CNN completely translation-invariant might not be possible. However, by feeding the right kind of data this can be achieved although this might not be a feasible solution.
34) What is the advantage of using small kernels like 3x3 over using a few large ones?
Stacking small kernels lets you use more layers, and therefore more activation functions, so the CNN can learn a more discriminative mapping function. Also, a stack of small kernels covers the same spatial context as a single large kernel while using fewer computations and parameters, making small kernels a better choice than large ones.
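The parameter saving is easy to work out. The sketch below compares two stacked 3x3 convolution layers with a single 5x5 layer, assuming 64 input and output channels, stride 1, no dilation, and one bias per output channel. The channel count is an arbitrary illustrative choice.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=True):
    """Number of trainable parameters in one 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def stacked_receptive_field(kernel_size, num_layers):
    # Assumes stride 1 and no dilation: each layer adds (kernel_size - 1).
    return 1 + num_layers * (kernel_size - 1)


channels = 64
two_3x3 = 2 * conv_params(3, channels, channels)
one_5x5 = conv_params(5, channels, channels)

print(two_3x3, stacked_receptive_field(3, 2))   # 73856 parameters, field 5
print(one_5x5, stacked_receptive_field(5, 1))   # 102464 parameters, field 5
```

Both options see a 5x5 window of the input, but the stacked 3x3 layers need roughly 28% fewer parameters and get an extra non-linearity between the two layers.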
35) How can you generate a dataset on multiple cores in real-time that can be fed to the deep learning model?
One of the major challenges in computer vision today is the need to load large datasets of videos and images when there is not enough memory on the machine. In such situations, data generators act as a magic wand for loading a memory-hungry dataset. You can talk about the various data generators that the Keras model class supports. When working with big data, in most cases it is not necessary to load all the data into RAM, as doing so wastes memory, could lead to memory overflow, and takes longer to process. Generator functions are highly beneficial here because they produce the data for each batch on demand and feed it directly into the model for training.
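The core idea can be shown with a plain Python generator. This is only a single-process sketch: it does not use Keras or multiple cores, and `fake_load` is a hypothetical stand-in for whatever reads and decodes one image or video clip from disk.

```python
def batch_generator(num_samples, batch_size, load_sample):
    """Yield batches lazily so only one batch is held in memory at a time."""
    for start in range(0, num_samples, batch_size):
        stop = min(start + batch_size, num_samples)
        yield [load_sample(i) for i in range(start, stop)]


def fake_load(index):
    # Hypothetical loader: a real one would read a file from disk here.
    return {"id": index, "pixels": [index] * 4}


batches = list(batch_generator(num_samples=10, batch_size=4, load_sample=fake_load))
sizes = [len(batch) for batch in batches]
print(sizes)  # [4, 4, 2]
```

Nothing is loaded until the training loop asks for the next batch. Framework-level data loaders build on the same pattern and add worker processes so that loading can run in parallel with training.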
36) How do you bring balance to the force when handling imbalanced datasets in deep learning?
It is next to impossible to have a perfectly balanced real-world dataset when working on deep learning problems so there will be some level of class imbalance within the data that can be tackled either by –
37) What are the benefits of using batch normalization when training a neural network?
38) Which is better LSTM or GRU?
LSTM works well for problems where accuracy is critical and the sequences are long, whereas if you want lower memory consumption and faster operations, opt for GRU. For a detailed answer, see: https://www.dezyre.com/recipes/what-is-difference-between-gru-and-lstm-explain-with-example
39) The RMSProp and Adam optimizers adjust gradients. Does this mean that they perform gradient clipping?
This does not inherently mean that they perform gradient clipping, because gradient clipping involves setting predetermined values beyond which the gradients cannot go, unlike Adam and RMSProp, which make multiplicative adjustments to gradients.
40) Can you name a few hyperparameters used for training a neural network?
When training any neural network there are two types of hyperparameters: those that define the structure of the neural network and those that determine how the neural network is trained. Listed are a few hyperparameters that are set before training any neural network:
41) When is multi-task learning usually preferred?
Multi-task learning with deep neural networks is a subfield wherein several tasks are learned by a shared model. This reduces overfitting, enhances data efficiency, and speeds up the learning process with the use of auxiliary information. Multi-task learning is useful when there is a small amount of data for any given task and we can benefit from training a deep learning model on a large dataset.
42) Explain the Adam Optimizer in one minute.
Adaptive Moment Estimation, or the Adam optimizer, is an optimization algorithm designed to handle sparse gradients on noisy problems. Adam improves convergence through momentum, which helps the model avoid getting stuck at saddle points, and it also provides per-parameter learning-rate updates for faster convergence.
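For readers who want to see the update rule, here is a sketch of a single Adam step for one scalar parameter, following the commonly published form of the algorithm with its usual default constants. It is applied to the toy problem of minimising f(x) = x**2, and the learning rate of 0.01 and the step count are arbitrary choices for this example.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad          # momentum: running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter scaled step
    return param, m, v


# Minimise f(x) = x**2, whose gradient is 2x.
first_x, _, _ = adam_step(5.0, 10.0, 0.0, 0.0, t=1, lr=0.01)

x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=0.01)

print(first_x, x)
```

The division by the square root of `v_hat` is what gives each parameter its own effective step size. On the very first step it makes the update almost exactly equal to the learning rate, regardless of how large the gradient is.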
43) Which loss function is preferred for multi-category classification?
Cross-Entropy loss function
44) To what kind of problems can the cross-entropy loss function be applied?
45) List the steps to implement a gradient descent algorithm.
46) How important is it to shuffle the training data when using batch gradient descent?
Shuffling the training dataset will not make much of a difference because the gradient is calculated at every epoch using the complete training dataset.
47) What is the benefit of using max-pooling in classification convolutional neural networks?
The feature maps become smaller after max-pooling in a CNN, which helps reduce the computation and also gives more translation invariance. Also, we don’t lose much semantic information because we’re taking the maximum activation.
48) Can you name a few data structures that are commonly used in deep learning?
You can talk about computational graphs, tensors, matrices, data frames, and lists.
49) Can you add an L2 regularization to a recurrent neural network to overcome the vanishing gradient problem?
This can actually worsen the vanishing gradient problem because the L2 regularization will shrink weights towards zero.
50) How will you implement Batch Normalization in RNN?
Standard batch normalization cannot be applied directly to an RNN because the statistics are computed per batch, so batch normalization does not take the recurrent part of the neural network into account. An alternative is to use layer normalization in the RNN or to reparameterize the LSTM layer in a way that allows the use of batch normalization.
What is Deep Learning?
Which deep learning framework do you prefer to work with – PyTorch or TensorFlow and why?
Talk about a deep learning project you’ve worked on and the tools you used?
Have you used the ReLU activation function in your neural network? Can you explain how the ReLU activation function works?
How often do you use pre-trained models for your neural network?
What does the future of video analysis look like with the use of deep learning solutions? How effective/good is video analysis currently?
Tell us about your passion for deep learning. Do you like to participate in deep learning/machine learning hackathons, write blogs about novel deep learning tools, or attend local meetups?
Describe the last time you felt frustrated solving a deep learning challenge, and how did you overcome it?
What is more important to you: the performance of your deep learning model or its accuracy?
Given the dataset, how will you decide which deep learning model to use and how to implement it?
What is the last deep learning research paper you’ve read?
What are the most commonly used neural network paradigms? (Hint: Talk about Encoder-Decoder Structures, LSTM, GAN, and CNN)
Is it possible to use a neural network as a tool of dimensionality reduction?
How do deep learning models tackle the curse of dimensionality?
What are the pros and cons of using neural networks?
How is a Capsule Neural Network different from a Convolutional Neural Network?
What is a GAN and what are the different types of GAN you’ve worked with?
For any given problem, how do you decide if you have to use transfer learning or fine-tuning?
Can you share some tricks or techniques that you use to fight overfitting in a deep learning model and get better generalization?
Explain the difference between Gradient Descent and Stochastic Gradient Descent.
Which one do you think is more powerful – a two-layer NN without any activation function or a two-layer decision tree?
Can you name the breakthrough project that garnered the popularity and adoption of deep learning?
Differentiate between bias and variance with respect to deep learning models and how can you achieve a balance between the two?
What are your thoughts about using GPT3 for our business?
Can you train a neural network without using back-propagation? If yes, what technique will you use to accomplish this?
Describe your research experience in the field of deep learning?
Explain the working of a perceptron.
Differentiate between a feed-forward neural network and a recurrent neural network.
Why don’t we see the exploding or vanishing gradient problem in feed-forward neural networks?
How do you decide the size of the filter when performing a convolution operation in a CNN?
When designing a CNN, can we find out how many convolutional layers we should use?
What do you understand by a computational graph?
Differentiate between PCA and Autoencoders.
Which one is better for reconstruction: a linear autoencoder or PCA?
How is deep learning related to representation learning?
Explain the Borel Measurable function.
How are Gradient Boosting and Gradient Descent different from each other?
In a logistic regression model, will all the gradient descent algorithms lead to the same model if run for a long time?
What is the benefit of shuffling a training dataset when using batch gradient descent?
Explain the cross-entropy loss function.
Why is cross-entropy preferred as the cost function for multi-class classification problems?
What happens if you do not use any activation functions in a neural network?
What is the importance of having residual neural networks?
There is a neuron in the hidden layer which always results in a large error in backpropagation. What could be the reason for this?
Explain the working of forward propagation and backpropagation in deep learning.
Is there any difference between feature learning and feature extraction?
Do you know the difference between the "valid" and "same" padding parameters in a CNN?
How does deep learning outperform traditional machine learning models in time series analysis?
Can you explain the parameter sharing concept in deep learning?
How many trainable parameters are there in a Gated Recurrent Unit cell and in a Long Short Term Memory cell?
So that pretty much wraps it up for this post: the most common deep learning engineer interview questions and answers. Whether you’re a beginner or a seasoned professional, hopefully these deep learning job interview questions and answers have been useful and have boosted your confidence for your next deep learning engineer job interview.
Congrats! You now know the kind of deep learning interview questions you can expect in your next job interview. However, there is still a lot to learn to solidify your deep learning knowledge and get hands-on experience working with diverse deep learning projects and deep learning frameworks like PyTorch, TensorFlow, and Keras. ProjectPro helps you move right into practice with 60+ end-to-end solved data science and machine learning projects where you will learn how to develop machine learning/deep learning models from scratch and develop a high-level ability to think about productionized machine learning systems. Get started today to take your deep learning skills to the next level and build a fantastic job-winning portfolio of projects.
We would love to hear your own machine learning or deep learning interview experiences. If you have any other interesting deep learning interview questions to share that can be helpful, please send an email with the questions and answers to firstname.lastname@example.org to make the learning experience for the community enriching and valuable. All the questions and answers shared would be posted on the blog with due credit to the author.