The minimal algorithm for convolution in $\mathbb{R}^2$ is a four-dimensional iteration.

    for all vertical output positions
        for all horizontal output positions
            initialize the value at the output position to the bias
            for all vertical positions in the kernel
                for all horizontal positions in the kernel
                    add the product of the input value and the kernel weight
                        to the value at the output position

In $\mathbb{R}^n$ it is a $2n$-dimensional iteration following the same pattern.
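As a concrete sketch, the four loops above translate directly into Python. This is the naive "valid"-padding form (no padding or stride handling), using cross-correlation as most deep-learning frameworks do; it is illustrative, not an optimized implementation.

```python
import numpy as np

def conv2d(image, kernel, bias=0.0):
    """Naive 2D convolution: a direct translation of the four-loop
    pseudo-code, with 'valid' output size and no padding or stride."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.empty((oh, ow))
    for oy in range(oh):                  # all vertical output positions
        for ox in range(ow):              # all horizontal output positions
            acc = bias                    # initialize to the bias
            for ky in range(kh):          # all vertical kernel positions
                for kx in range(kw):      # all horizontal kernel positions
                    acc += image[oy + ky, ox + kx] * kernel[ky, kx]
            out[oy, ox] = acc
    return out
```

The $\mathbb{R}^n$ case simply nests $n$ output loops around $n$ kernel loops in the same way.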
The minimal algorithm for regression of bounding boxes orthogonal to the image grid (no tilting) is this.

    until the number of boxes reaches the max
        make a first guess of the two corner coordinates
        until the number of guesses reaches the max or the matching criterion is met
            evaluate the guess
            remember the guess and the evaluation results
            improve on the guess based on the evaluation results and
                possibly injected randomness,
                excluding locations already covered
            if some intermediate criterion is met
                change the nature of the guessing, evaluation, and improvement
                    as is appropriate for the criterion matched
                    (this covers approaches that have multiple phases)
        if no guess matched the criterion
            break
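The loop above can be sketched in a few lines of Python. Everything concrete here is illustrative: `evaluate` stands in for whatever scoring expression an approach defines, and the step sizes, thresholds, and Gaussian perturbation are arbitrary choices, not taken from any paper. Multi-phase switching and the exclusion of already-covered locations are omitted for brevity.

```python
import random

def regress_boxes(evaluate, max_boxes=3, max_guesses=50, threshold=0.9, seed=0):
    """Sketch of the guess/evaluate/improve loop for axis-aligned boxes.
    `evaluate` is a hypothetical callable scoring a box [x, y, w, h] in [0, 1]."""
    rng = random.Random(seed)
    boxes = []
    while len(boxes) < max_boxes:          # until the number of boxes reaches the max
        best = [rng.random(), rng.random(), 0.2, 0.2]  # first guess of coordinates
        best_score = evaluate(best)
        matched = False
        for _ in range(max_guesses):       # until guesses reach max or criterion met
            if best_score >= threshold:    # the matching criterion
                matched = True
                break
            # improve on the guess with injected randomness
            cand = [v + rng.gauss(0, 0.05) for v in best]
            score = evaluate(cand)
            if score > best_score:         # remember the better guess
                best, best_score = cand, score
        if not matched:                    # if no guess matched the criterion
            break
        boxes.append(best)
    return boxes
```

The substance of any particular approach lives almost entirely inside `evaluate` and the improvement step, which is the point made below.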
That's approaching the concepts from the top down. When approaching from the other direction, reverse engineer the best code. In the case of R-CNN, it is inadvisable to seek out implementations following the first paper expressing the approach. Reading the first paper may be helpful to get the gist of the approach, but reverse engineer the best one, which, in this case, may be Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, 2016. Study the implementation they published at https://github.com/rbgirshick/py-faster-rcnn/tree/master/. The algorithm is in lib/fast_rcnn.
The reason this algorithm isn't spelled out in their paper, or in any paper in the lineage from the first one down to theirs, is simple.
- The pseudo-code above is universal across all convolutions and all bounding box regressions, so that doesn't need to be restated with each approach.
- The main features of an approach like RCNN, SSD, or YOLO are not algorithmic. They are algebraic expressions of the guess, the evaluation, the improvement upon the guess, and the test for the criteria.
- The use of objects and functional programming makes the implementation more readable, so it can be easier to read the implementation than to read a huge chunk of the above pseudo-code with all the algebra and test branches plugged in.
- For the above reasons, it is rare that pseudo-code would be used prior to the implementation when the paper is written.
- The return on investment of reverse engineering from code to pseudo-code is sufficient motivation only if one intends to improve the algorithm and write another paper, and by the time the prior paper's pseudo-code would be finished, the new paper and the new code are finished first.
Since the author of this question seems interested in writing their own code, it may be reasonable to assume they are also interested in thinking their own thoughts, so I'll add this.
None of these algorithms are object recognition. Recognition has to do with cognition, and these approaches do not even touch upon cognitive processing, another branch of AI not closely related to convolution and probably not closely related to formal regression either. Additionally, bounding boxes are not the way animal vision systems work. Early gestalt experiments in vision indicate a complete independence of human vision from rectilinear formalities. In lay terms, humans and other organisms with vision systems don't have any conception of Cartesian coordinates. We can still read books that are tilted slightly relative to the plane passing through our eyes. We don't zoom or tilt in Cartesian coordinates.
One may not need to grasp these facts to create an automated vehicle driving system with a better safety record than the average human driver, but that is only because humans don't set that bar very high and because cars roll in the plane of the road. These facts do matter in aeronautical systems used in military applications, where nothing is particularly Cartesian and the meaning of horizontal and vertical is ambiguous. For that reason, it is unlikely that bounding boxes will remain the edge of vision technology for very long.
If one wishes to transcend the current mediocrity, consider bounding circles with fuzzy boundaries, which would be more like the systems that evolved over millions of biological iterations. If the computer hardware is poorly suited to radial processing, design new hardware in which radial processing is native and in which Cartesian coordinates may be foreign and cumbersome.
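To make the fuzzy-boundary idea concrete, here is one possible membership function for a bounding circle: 1 near the center, smoothly falling toward 0 past the radius. The logistic falloff and the `softness` parameter are arbitrary illustrative choices, not a reference to any published formulation.

```python
import math

def fuzzy_circle_membership(px, py, cx, cy, radius, softness=0.1):
    """Degree to which point (px, py) belongs to a fuzzy circle centered
    at (cx, cy): ~1 inside, 0.5 exactly on the radius, ~0 well outside."""
    dist = math.hypot(px - cx, py - cy)
    return 1.0 / (1.0 + math.exp((dist - radius) / (softness * radius)))
```

A detector built on such a representation would regress centers and radii rather than corner coordinates, and object boundaries would be soft rather than crisp.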
Regarding the classifier: classifier papers generally do include the algorithm, so it can be found with an academic search for the original paper describing the classifier in use.