To train ResNet-50 on the CIFAR-100 dataset, I wrote a program using TensorFlow. But when I ran it, the loss stayed around 4.5~4.6 forever:
step: 199, loss: 4.61291, accuracy: 0
step: 200, loss: 4.60952, accuracy: 0
step: 201, loss: 4.60763, accuracy: 0
step: 202, loss: 4.62495, accuracy: 0
step: 203, loss: 4.62312, accuracy: 0
step: 204, loss: 4.60703, accuracy: 0
step: 205, loss: 4.60947, accuracy: 0
step: 206, loss: 4.59816, accuracy: 0
step: 207, loss: 4.62643, accuracy: 0
step: 208, loss: 4.59422, accuracy: 0
...
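In hindsight, that number itself was a clue. A 100-way classifier that predicts a uniform distribution has a cross-entropy loss of -log(1/100) = ln(100) ≈ 4.605, which is exactly where the loss was stuck, i.e. the model was doing no better than random guessing:

    import math
    # Cross-entropy of a uniform prediction over 100 classes:
    print(math.log(100))  # 4.605170185988092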
After changing models (from ResNet to a fully connected network), optimizers (from AdamOptimizer to AdagradOptimizer), and even the learning rate (from 1e-3 all the way down to 1e-7), the behavior didn't change at all.
Finally, I checked the loss and the output vectors step by step, and found that the problem was not in the model but in the dataset code:
    def next_batch(self, batch_size=64):
        images = []
        labels = []
        for i in range(self.pos, self.pos + batch_size):
            # BUG: indexes with self.pos, which never changes inside the loop,
            # so the same sample is appended batch_size times.
            image = self.data['data'][self.pos]
            image = image.reshape(3, 32, 32)       # CIFAR stores flat CHW arrays
            image = image.transpose(1, 2, 0)       # CHW -> HWC
            image = image.astype(np.float32) / 255.0
            images.append(image)
            label = self.data['fine_labels'][self.pos]
            labels.append(label)
        if (self.pos + batch_size) >= CIFAR100_TRAIN_SAMPLES:
            self.pos = 0
        else:
            self.pos = self.pos + batch_size
        return [images, labels]
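Drawing a single batch makes the bug easy to confirm (a quick check; Cifar100Dataset is a hypothetical name for the class this method belongs to, assumed to be constructible with no arguments):

    import numpy as np

    dataset = Cifar100Dataset()            # hypothetical class name
    images, labels = dataset.next_batch(64)
    images = np.asarray(images)            # shape (64, 32, 32, 3)
    # With the bug, every image in the batch is identical:
    print(np.all(images == images[0]))     # True
    print(set(labels))                     # a single label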
Every batch consisted of the same picture and the same label repeated batch_size times! That's why the model didn't converge. I should have used 'i' instead of 'self.pos' as the index when fetching the data and labels.
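The fix is just to index with the loop variable; here is a corrected sketch of the loop body:

    for i in range(self.pos, self.pos + batch_size):
        image = self.data['data'][i]           # index with i, not self.pos
        image = image.reshape(3, 32, 32)
        image = image.transpose(1, 2, 0)       # CHW -> HWC
        image = image.astype(np.float32) / 255.0
        images.append(image)
        label = self.data['fine_labels'][i]    # same fix here
        labels.append(label)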
So in deep learning, problems come not only from models and hyper-parameters, but also from the dataset, or simply from faulty code…