Assuming you're not allowed to use transfer learning (e.g. taking an already existing elephant object detector), my recommendation is to train a CNN classifier with binary labels (elephant present, elephant absent) and then use a technique like Grad-CAM to localize the elephant. Note that Grad-CAM++ also exists, but since you can guarantee there's only one instance, it isn't necessary and is just more complicated.
Since you only need the location and not pixel-level detail, you don't even need the guided backprop step; the gradient with respect to the last convolutional feature map is enough.
A quick description: Grad-CAM uses the gradient of the class score w.r.t. the last convolutional feature map to see which locations contributed to the classification, and from there you can upscale to the receptive field that those neurons cover.
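To make that concrete, here is a rough NumPy sketch of the heatmap step and a crude way to turn it into a box. It assumes you have already pulled two arrays out of your framework for one image: the last conv layer's activations and the gradient of the elephant score with respect to them, both shaped (channels, height, width). The function names `grad_cam` and `cam_to_box`, and the `threshold` parameter, are made up for illustration, and scaling each heatmap cell by the stride is only a coarse stand-in for the true receptive field.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Coarse class heatmap from arrays of shape (K, H, W)."""
    # One weight per channel: the spatial average of its gradient.
    weights = gradients.mean(axis=(1, 2))  # shape (K,)
    # Weighted sum over channels, then ReLU to keep only positive evidence.
    cam = np.tensordot(weights, activations, axes=1)  # shape (H, W)
    return np.maximum(cam, 0.0)

def cam_to_box(cam, image_hw, threshold=0.5):
    """Bounding box (x0, y0, x1, y1) in image pixels, or None if the map is empty.

    Assumption: each heatmap cell maps to an equal-sized tile of the image.
    """
    peak = cam.max()
    if peak <= 0:
        return None
    # Keep cells at or above a fraction of the peak response.
    ys, xs = np.nonzero(cam >= threshold * peak)
    sy = image_hw[0] / cam.shape[0]
    sx = image_hw[1] / cam.shape[1]
    return (int(xs.min() * sx), int(ys.min() * sy),
            int((xs.max() + 1) * sx), int((ys.max() + 1) * sy))
```

You would call `grad_cam` on the arrays captured during a forward and backward pass, then `cam_to_box(cam, (image_height, image_width))` to get a rough location. The 0.5 threshold is arbitrary and worth tuning on a few images.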
Hope this helped!