Top new questions this week:
|
By substituting the optimal policy $\pi_{\star}$ into the Bellman equation, we get the Bellman equation for $v_{\pi_{\star}}(s)=v_{\star}(s)$: $$ v_{\star}(s) = \sum\limits_a \pi_{\star}(a|s) \sum\…
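For context, the full Bellman equation for $v_{\star}$ in standard MDP notation (as in Sutton & Barto; $p$, $r$, and $\gamma$ are the transition distribution, reward, and discount factor) expands to:

```latex
v_{\star}(s) = \sum_a \pi_{\star}(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_{\star}(s') \,\bigr]
```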
|
From Deep Learning (Goodfellow, Bengio, and Courville), a ReLU activation often “dies” because, as the book puts it, “one drawback to rectified linear units is that they cannot learn via gradient-based methods on …”
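A minimal sketch (not from the book) of why this happens: the ReLU gradient is exactly zero for any negative pre-activation, so a unit stuck in that region receives no gradient signal and cannot recover.

```python
import numpy as np

def relu(x):
    # ReLU clamps negative pre-activations to zero
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 1 for positive inputs, exactly 0 for negative ones
    return (x > 0).astype(float)

pre_activations = np.array([-2.0, -0.5, 0.3, 1.7])
print(relu(pre_activations))       # negative inputs are clamped to 0
print(relu_grad(pre_activations))  # gradient is 0 wherever input < 0
```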
|
I am training a ResNet model on the CIFAR10 dataset. For the training subset, I selected a random 1% of the train data from the default train/test split. For the test subset, I used the whole default test …
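One way such a 1% split could be drawn (a hypothetical sketch; the asker's actual code is not shown) is to sample indices without replacement from CIFAR10's 50,000-image default train split:

```python
import random

# Hypothetical sketch: CIFAR10's default train split has 50,000 images;
# a random 1% subset is 500 index positions sampled without replacement.
NUM_TRAIN = 50_000
FRACTION = 0.01

random.seed(0)  # fix the seed so the subset is reproducible across runs
subset_indices = random.sample(range(NUM_TRAIN), int(NUM_TRAIN * FRACTION))
print(len(subset_indices))  # → 500
```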
|
The maximum derivative of most currently existing activation functions is around 1. Can an activation function with derivatives higher than 1, say 1000, cause the exploding gradient problem? …
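A back-of-the-envelope sketch of the worst case: backpropagation multiplies one activation-derivative factor per layer, so a derivative bound of $k$ lets gradients scale like $k^{\text{depth}}$.

```python
# Worst-case gradient magnitude if every layer contributes the maximum
# activation derivative (a simplification: weight matrices are ignored).
def worst_case_gradient_scale(max_derivative, depth):
    return max_derivative ** depth

print(worst_case_gradient_scale(1.0, 50))     # → 1.0 (stays bounded)
print(worst_case_gradient_scale(1000.0, 50))  # astronomically large: gradients explode
```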
|
Greatest hits from previous weeks:
|
In semi-supervised learning, there are hard labels and soft labels. Could someone tell me the meaning and definition of the two things?
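A small illustration of the distinction (the class names are invented): a hard label commits to exactly one class, while a soft label is a probability distribution over classes.

```python
# Hard label: one-hot, full confidence in a single class.
# Soft label: a distribution over classes, encoding uncertainty.
classes = ["cat", "dog", "bird"]

hard_label = [0.0, 1.0, 0.0]    # definitely "dog"
soft_label = [0.25, 0.50, 0.25] # probably "dog", with some uncertainty

# Both are valid distributions (non-negative, summing to 1)
assert sum(hard_label) == 1.0 and sum(soft_label) == 1.0
print(classes[hard_label.index(max(hard_label))])  # → dog
```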
|
What is the difference between artificial intelligence and robots?
|
This question is about Reinforcement Learning and variable action spaces for every/some states. Variable action space Let’s say you have an MDP, where the number of actions varies between states (for …
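One common way to handle a state-dependent action set (a hedged sketch, not necessarily what the asker has in mind) is to keep a fixed global action space and mask out invalid actions per state. The state names and action sets below are made up for illustration.

```python
# Fixed global action space shared by all states.
ALL_ACTIONS = ["up", "down", "left", "right"]

# Hypothetical per-state valid-action sets.
VALID_ACTIONS = {
    "corridor": {"up", "down"},                    # only 2 actions here
    "open_room": {"up", "down", "left", "right"},  # all 4 actions here
}

def action_mask(state):
    # True where the global action is valid in this state; an agent's
    # policy can zero out (mask) the invalid entries before sampling.
    return [a in VALID_ACTIONS[state] for a in ALL_ACTIONS]

print(action_mask("corridor"))  # → [True, True, False, False]
```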
|
In hill climbing methods, at each step, the current solution is replaced with the best neighbour (that is, the neighbour with the highest/lowest value). In simulated annealing, “downhill” moves are …
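The acceptance rule that distinguishes the two can be sketched as the Metropolis criterion (for minimization): improving moves are always taken, while worsening moves are accepted with probability $e^{-\Delta / T}$, which shrinks as the temperature $T$ cools.

```python
import math
import random

def accept(current_cost, candidate_cost, temperature, rng=random.random):
    # Metropolis acceptance rule for simulated annealing (minimization).
    delta = candidate_cost - current_cost
    if delta <= 0:
        return True  # improving move: always accept, as in hill climbing
    # Worsening move: accept with probability exp(-delta / T);
    # at high T this is near 1, and as T -> 0 it approaches pure hill climbing.
    return rng() < math.exp(-delta / temperature)

print(accept(10.0, 8.0, temperature=1.0))  # → True (improvement)
```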
|
As far as I can tell, BERT is a type of Transformer architecture. What I do not understand is: how is BERT different from the original Transformer architecture? What tasks are better suited for BERT,…
|
The following paragraph is from page 331 of the textbook Natural Language Processing by Jacob Eisenstein. It mentions a certain type of task called downstream tasks, but it provides no …
|
What are the differences between the A* algorithm and the greedy best-first search algorithm? Which one should I use? Which algorithm is the better one, and why?
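The core difference can be sketched in code: both are best-first searches that differ only in the node priority. Greedy best-first expands by $h(n)$ alone, while A* expands by $f(n) = g(n) + h(n)$, which with an admissible heuristic finds an optimal path. The toy graph and heuristic below are invented for illustration.

```python
import heapq

GRAPH = {  # node -> [(neighbor, edge_cost)]; a made-up example graph
    "S": [("A", 1), ("B", 4)],
    "A": [("G", 10)],
    "B": [("G", 1)],
    "G": [],
}
H = {"S": 3, "A": 1, "B": 2, "G": 0}  # admissible heuristic estimates to G

def best_first(start, goal, use_g):
    # use_g=False: greedy best-first (priority = h only)
    # use_g=True:  A* (priority = g + h)
    frontier = [(H[start], 0, start, [start])]  # (priority, g, node, path)
    visited = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in visited:
            continue
        visited.add(node)
        for nxt, cost in GRAPH[node]:
            g2 = g + cost
            prio = (g2 if use_g else 0) + H[nxt]
            heapq.heappush(frontier, (prio, g2, nxt, path + [nxt]))
    return None, float("inf")

print(best_first("S", "G", use_g=False))  # greedy chases low h through A (cost 11)
print(best_first("S", "G", use_g=True))   # A* finds the cheaper S-B-G path (cost 5)
```

Here greedy best-first is lured toward A by its low heuristic value and returns a cost-11 path, while A* accounts for the path cost already paid and returns the optimal cost-5 route, which is why A* is usually preferred when solution quality matters.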
|