When an image is shot with a camera, the camera is the observer.
In the actual setup the camera is placed slightly above the box, so the box appears slightly below the horizon line (V1,V2) of the image.
In this setup the shaded triangles are similar, hence a/d=A/D. If we know the dimensions A and D of the box and can measure the size a=(V1,V2) in the image, we can determine the distance d from the observer to the image plane, which is the focal length (in pixels) in the camera scenario.
For my camera and a box with dimensions A=73mm, D=175mm I got d=a*(D/A)=363pix*(175mm/73mm)=870pix.
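The arithmetic above can be written as a small helper. This is only a sketch of the similar-triangle relation a/d=A/D; the function name `focal_length_px` and its argument names are mine, not part of OpenCV or any other library.

```python
def focal_length_px(a_px, A_mm, D_mm):
    """Estimate d (in pixels) from the similar-triangle relation a/d = A/D.

    a_px : measured (V1,V2) distance in the image, in pixels
    A_mm : box dimension A, in millimetres
    D_mm : box dimension D, in millimetres
    """
    if A_mm <= 0 or D_mm <= 0:
        raise ValueError("box dimensions must be positive")
    # The millimetres cancel in D/A, so the result is in pixels.
    return a_px * (D_mm / A_mm)


# Values quoted in the text: a=363pix, A=73mm, D=175mm
d = focal_length_px(a_px=363.0, A_mm=73.0, D_mm=175.0)
print("d = %.1f pix" % d)
```

Since only the ratio D/A enters, the box can be measured in any unit as long as A and D use the same one.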
The OpenCV camera calibration routines give a focal length of fx=981.73pix for my camera at resolution (1280,720). I think the error of 981-870=111pix (about 11%) is due to inaccurate placement of the camera, whose image plane has to be parallel to the front of the box.
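To put the mismatch in proportion, here is a short sketch that compares the box-based estimate with the calibrated value. The numbers are the ones quoted above; the variable names are only illustrative.

```python
fx_calibrated = 981.73            # fx from the OpenCV calibration, in pixels
d_box = 363.0 * (175.0 / 73.0)    # box-based estimate d = a*(D/A), in pixels

abs_error = fx_calibrated - d_box
rel_error = abs_error / fx_calibrated

print("absolute error: %.1f pix" % abs_error)
print("relative error: %.1f %%" % (100.0 * rel_error))
```

The relative error comes out at roughly 11% of the calibrated focal length.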
The other scenario is when we record a video of a box that moves parallel to the OX or OZ direction, without rotation. Then d is constant because the camera is the same, and since A and D do not change either, a=d*(A/D) means that (V1,V2) must be constant too. This is somewhat counterintuitive, but it turns out to be true. In the video below, the box moves parallel to OZ and (V1,V2) stays more or less fixed.
The (V1,V2) distance oscillates around a value of 360 pixels.
To make the measurement more precise, one needs to locate the corners of the box more accurately. Second, one has to place the camera carefully, so that the image plane is parallel to the front of the box.
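To see why corner accuracy matters, here is a minimal sketch of how a point such as V1 could be computed, assuming it is taken as the intersection of two image lines drawn through detected box corners. This is not necessarily how the measurements above were made. The corner coordinates are hypothetical, and the helper names (`cross`, `line_through`, `intersect`) are mine.

```python
def cross(p, q):
    """Cross product of two 3-vectors given as tuples."""
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersect(l1, l2):
    """Intersection point (x, y) of two homogeneous lines."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        raise ValueError("lines are parallel in the image")
    return (x / w, y / w)


# Hypothetical pixel coordinates of corners lying on two receding box edges.
edge1 = line_through((400.0, 600.0), (520.0, 480.0))
edge2 = line_through((880.0, 600.0), (760.0, 480.0))

v = intersect(edge1, edge2)
print("vanishing point: (%.1f, %.1f)" % v)
```

The edges are short compared with the distance to the intersection, so an error of a pixel or two in a corner can move the point by many pixels, and that error carries over to a=(V1,V2) and then to d.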
 "Viewpoints. Mathematical Perspective and Fractal Geometry in Art", Marc Frantz, 2011.
(Dev env: Win8 x64, Python 2.7.8)