Procs
proc softmax_cross_entropy[T](input, target: Tensor[T]): T
Softmax function + Cross-Entropy loss fused in one layer.
Input:
- A Tensor of shape [batch_size, predicted_labels_probabilities]
- The target values of shape [batch_size, truth_labels_probability]
Returns:
- Applies a softmax activation and returns the cross-entropy loss.
Softmax_cross_entropy measures the cross-entropy error for multiclass classification. Classes are mutually exclusive (only one label is true), but the truth labels (target) need not be: they may be soft probability distributions.
Note: with one-hot-encoded labels it is more efficient to use sparse_softmax_cross_entropy than to feed the one-hot vectors to softmax_cross_entropy.
For example, if your true probabilities are (car: 0.10, airplane: 0.60, bike: 0.05, bus: 0.25), you have to use softmax_cross_entropy.
However, if your true probabilities are (car: 0, airplane: 1, bike: 0, bus: 0) (a one-hot-encoded vector), you should prefer sparse_softmax_cross_entropy.
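A minimal usage sketch, assuming these procs come from Arraymancer and are available via import arraymancer; the tensor values are purely illustrative:

```nim
import arraymancer

# Logits for a batch of 2 samples over 4 classes (car, airplane, bike, bus)
let predicted = [[1.0, 3.0, 0.5, 2.0],
                 [0.2, 4.0, 0.1, 0.3]].toTensor

# Soft truth labels: each row is a probability distribution over the classes
let target = [[0.10, 0.60, 0.05, 0.25],
              [0.25, 0.25, 0.25, 0.25]].toTensor

# Fused softmax + cross-entropy: returns a single scalar loss for the batch
let loss = softmax_cross_entropy(predicted, target)
echo loss
```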
proc softmax_cross_entropy_backward[T](gradient: Tensor[T] or T; cached_tensor: Tensor[T]; target: Tensor[T]): Tensor[T] {.noinit.}
Derivatives of softmax_cross_entropy
Input:
- The input gradient as a scalar or a Tensor
- A cache tensor that contains data from the forward pass
- The target values
Shape:
- Both the cache and target shapes should be [batch_size, features], i.e. number of samples as the first dimension
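For reference, the backward pass of the fused layer is cheap because the derivative of softmax followed by cross-entropy collapses to a subtraction. This is the standard textbook identity, not quoted from the doc above; the proc additionally applies the incoming gradient:

$$\frac{\partial L}{\partial x_i} = \mathrm{softmax}(x)_i - t_i$$

where $x$ are the logits and $t$ is the target probability distribution.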
proc sparse_softmax_cross_entropy[T; Idx: SomeNumber or byte or char or enum](input: Tensor[T]; target: Tensor[Idx]): T
Softmax function + Cross-Entropy loss fused in one layer.
Input:
- A Tensor of shape [batch_size, predicted_labels_probabilities]
- The target values of shape [batch_size] containing the truth label ids
Returns:
- Applies a softmax activation and returns the cross-entropy loss.
sparse_softmax_cross_entropy measures the cross-entropy error for multiclass classification. Classes are mutually exclusive (only 1 label is true).
Important: a one-hot vector [0, 0, 1] means label 2 is true, i.e. label ids start at 0.
Note: with one-hot-encoded labels it is more efficient to use sparse_softmax_cross_entropy than to feed the one-hot vectors to softmax_cross_entropy.
For example, if your true probabilities are (car: 0.10, airplane: 0.60, bike: 0.05, bus: 0.25), you have to use softmax_cross_entropy.
However, if your true probabilities are (car: 0, airplane: 1, bike: 0, bus: 0) (a one-hot-encoded vector), you should prefer sparse_softmax_cross_entropy.
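A matching sketch for the sparse variant, under the same assumptions as above (Arraymancer, illustrative values); note the target is now a rank-1 tensor of label ids rather than a probability matrix:

```nim
import arraymancer

# Same logits as before: 2 samples, 4 classes (car, airplane, bike, bus)
let predicted = [[1.0, 3.0, 0.5, 2.0],
                 [0.2, 4.0, 0.1, 0.3]].toTensor

# One true label id per sample; ids start at 0, so 1 = airplane, 2 = bike
let target = [1, 2].toTensor

let loss = sparse_softmax_cross_entropy(predicted, target)
echo loss
```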
proc sparse_softmax_cross_entropy_backward[T; Idx: SomeNumber or byte or char or enum](gradient: Tensor[T] or T; cached_tensor: Tensor[T]; target: Tensor[Idx]): Tensor[T] {.noinit.}
Derivatives of sparse_softmax_cross_entropy
Input:
- The input gradient as a scalar or a Tensor
- A cache tensor that contains data from the forward pass
- The target values
Shape:
- The cache shape should be [features, batch_size], i.e. number of samples as the last dimension
- The target shape should be [batch_size]
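The gradient identity here is the same as for the dense backward pass, with the target distribution replaced by an indicator on the true label id $y$ (again the standard result, not quoted from the doc above):

$$\frac{\partial L}{\partial x_i} = \mathrm{softmax}(x)_i - \mathbb{1}[i = y]$$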