Various examples of hybrid metric-topological camera-based localization are described. A single image sensor captures an input image of an environment. The input image is localized to one of a plurality of topological nodes of a hybrid simultaneous localization and mapping (SLAM) metric-topological map, which describes the environment as the plurality of topological nodes at a plurality of discrete locations in the environment. A metric pose of the image sensor can then be determined using a Perspective-n-Point (PnP) projection algorithm. A convolutional neural network (CNN) can be trained to localize the input image both to one of the plurality of topological nodes and to a direction of traversal through the environment.
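
The topological step described above can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes the CNN emits one score per (node, direction) pair and that the flat class label is laid out as label = 2 * node index + direction. That layout, the two-valued `Direction` enum, and the names `TopologicalNode`, `decode_class`, and `localize` are hypothetical and do not come from the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence, Tuple


class Direction(Enum):
    """Direction of traversal through the environment (assumed to be two-valued)."""
    FORWARD = 0
    REVERSE = 1


@dataclass(frozen=True)
class TopologicalNode:
    """One node of the hybrid map, tied to a discrete location in the environment."""
    node_id: int
    location: Tuple[float, float]  # hypothetical 2D map coordinates


def decode_class(class_index: int,
                 nodes: Sequence[TopologicalNode]) -> Tuple[TopologicalNode, Direction]:
    """Map a flat classifier label to (node, direction).

    Assumes label = len(Direction) * node_position + direction_value.
    """
    if not 0 <= class_index < len(Direction) * len(nodes):
        raise ValueError(f"class index {class_index} is outside the label space")
    node_position, direction_value = divmod(class_index, len(Direction))
    return nodes[node_position], Direction(direction_value)


def localize(scores: Sequence[float],
             nodes: Sequence[TopologicalNode]) -> Tuple[TopologicalNode, Direction]:
    """Pick the highest-scoring class and decode it to a node and a direction."""
    if len(scores) != len(Direction) * len(nodes):
        raise ValueError("expected one score per (node, direction) pair")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return decode_class(best_index, nodes)


# Hypothetical three-node map and a made-up score vector standing in for CNN output.
nodes: List[TopologicalNode] = [
    TopologicalNode(0, (0.0, 0.0)),
    TopologicalNode(1, (5.0, 0.0)),
    TopologicalNode(2, (5.0, 5.0)),
]
scores = [0.01, 0.02, 0.05, 0.80, 0.07, 0.05]
node, direction = localize(scores, nodes)
```

With these made-up scores the highest entry is at index 3, which decodes to node 1 traversed in the reverse direction. The metric step is not shown; one plausible arrangement would pass 2D-3D correspondences associated with the selected node to a PnP solver such as OpenCV's `cv2.solvePnP` to recover the sensor pose.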