See if this reasoning helps. In 3D computer graphics, everything is rendered from the viewpoint of an imaginary camera, or viewer. The 3D content being viewed is represented by coordinates in a world coordinate system, and the camera's location and orientation are also specified in world coordinates.
The viewing transform (the guts of all 3D graphics) consists of two 3D transforms, followed by a 3D --> 2D projection.
Coordinate translate: World to Eye. First, all world points must be put into local viewing coordinates where the camera defines the origin. This is done as a coordinate translation, where the camera's position coordinates are subtracted from each world point's coordinates. The result is that all points are now expressed relative to an origin at the camera. This is how you want it, since if the camera (viewer) pans or tilts, points stay the same distance away and only change direction.
Coordinate rotate: Local world to Eye gaze rotator. In the most general formulation, the camera can adopt any possible 3D orientation. Not only can the camera's center axis point out in any 3D direction; held in that pointing direction, the camera can also be spun around that axis through a full 360 degrees. (I'm gonna hurl!)
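The translation step can be sketched in a few lines of plain Python. The function name and the sample coordinates are made up for illustration; the only thing taken from the description above is "subtract the camera position from each world point".

```python
def world_to_camera_origin(point, camera_pos):
    """Translate a world point so that the camera sits at the origin."""
    return tuple(p - c for p, c in zip(point, camera_pos))

camera_pos = (2.0, 1.0, 0.5)   # hypothetical camera position (world coordinates)
world_pt = (5.0, 5.0, 0.5)     # hypothetical world point

local_pt = world_to_camera_origin(world_pt, camera_pos)
print(local_pt)  # (3.0, 4.0, 0.0)
```

Note that the orientation of the camera plays no role yet; only its position is used here.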
Therefore, a general purpose way to represent the orientation of the camera (viewer) is to use a 3x3 rotational matrix (avoiding the use of angles). The camera has its own local x-y-z axes. Some camera defs have them looking out the local x-axis (and some look out the local y-axis), and z points out the top of the camera. In any event, the orientation of the camera in world coordinates is represented by 3 direction vectors comprising the columns of the 3x3:
3x3 matrix R =
[ camera_x_axis_dir
camera_y_axis_dir
camera_z_axis_dir ]
Visually, you can imagine what these numbers mean. Imagine the camera coming with its 3 axes as 1-meter long sticks that point out from the camera, forming perfect right angles: x sticks out to the right, y sticks out forward, and z sticks out upward. Wherever the camera is set up, at whatever orientation, slide the camera over to the world origin without upsetting its orientation. Construct a unit sphere about the origin. Where does the cam's x-axis stick touch the unit sphere? These [ x y z ] world coordinates are the numbers in camera_x_axis_dir. Same for the y and z sticks. All camera orientations can be represented numerically in this simple manner. This is a 1:1 representation: there is only one set of 9 numbers to represent a unique camera orientation. This is better for software than using angles to represent orientation, since angles are not 1:1. (The information about 3D orientation is said to be overcompressed in roll-pitch-yaw angle form).
Once you know the camera orientation, the next step in rendering a 3D point is to coordinate-rotate the point so that the apparent location of the point is expressed in camera-oriented coordinates. If I'm looking directly at a 3D point, after coordinate rotation the point will fall on the camera's local y-axis (x and z coords will be zero). If the point is directly behind the camera, its transformed coordinates put it on the negative y-axis. If the camera is looking at the world upside-down, the coordinate rotate will flip the world upside down.
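As a rough sketch of the coordinate rotate: if the three camera axes are stored as unit direction vectors in world coordinates, the camera-oriented coordinate along each axis is the dot product of the (already translated) point with that axis. The helper names and the sample camera below are assumptions for illustration; the convention (x right, y forward, z up) follows the description above.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rotate_into_camera(local_pt, x_axis, y_axis, z_axis):
    """Express a camera-centred point in the camera's own x-y-z axes."""
    return (dot(x_axis, local_pt), dot(y_axis, local_pt), dot(z_axis, local_pt))

# Hypothetical camera that looks out along world +x with its top along world +z.
x_axis = (0.0, -1.0, 0.0)   # camera "right"
y_axis = (1.0, 0.0, 0.0)    # camera "forward" (optical axis)
z_axis = (0.0, 0.0, 1.0)    # camera "up"

front = rotate_into_camera((3.0, 0.0, 0.0), x_axis, y_axis, z_axis)
behind = rotate_into_camera((-2.0, 0.0, 0.0), x_axis, y_axis, z_axis)
print(front)   # (0.0, 3.0, 0.0): on the camera's +y axis
print(behind)  # y-coordinate is negative: behind the camera
```

A point the camera looks at directly ends up with zero x and z, and a point behind the camera ends up with negative y, as described.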
The last step (assuming the optical axis is +y) is to place the viewing screen directly in front of the viewer, perpendicular to the optical axis, and project points onto it. The ray going out from the camera to the point is intersected (conceptually) with the view screen, and the intersection coordinates (x, z) say where to draw the 3D point on the 2D view screen. (Da Vinci was the first tech-artist to popularize this model of the projective transform).
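A minimal sketch of that ray/screen intersection, assuming the screen is the plane y = `screen_dist` in camera coordinates. The parameter `screen_dist` is not named in the text above and is introduced here only for illustration. By similar triangles, the ray from the origin through (x, y, z) hits that plane at (x * d / y, z * d / y).

```python
def project(cam_pt, screen_dist=1.0):
    """Project a camera-space point (optical axis = +y) onto the view screen.

    Returns the (x, z) screen coordinates. screen_dist is a hypothetical
    parameter: the distance from the camera to the screen plane.
    """
    x, y, z = cam_pt
    if y <= 0:
        raise ValueError("point is not in front of the camera")
    scale = screen_dist / y
    return (x * scale, z * scale)

print(project((1.0, 2.0, 0.5)))  # (0.5, 0.25)
```

Points farther away (larger y) land closer to the screen centre, which is what produces perspective foreshortening.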
Not mentioned, but there is also some backspace clipping of 3D content unviewable because it is located behind the camera. The rotate transform clarifies this decision, as all 3D content with negative y-coords (in local, rotated camera coordinates) is in the viewer's backspace.
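The backspace test can be sketched as a simple filter on the rotated coordinates. The function name and the `near` threshold are assumptions; the text above only says that negative y is behind the camera, and this sketch additionally drops points with y exactly 0, since they cannot be projected.

```python
def in_front_of_camera(cam_pt, near=0.0):
    """True if the camera-space point lies in front of the camera (+y side)."""
    return cam_pt[1] > near

# Hypothetical points, already in rotated camera coordinates.
points = [(0.0, 3.0, 1.0), (1.0, -2.0, 0.0), (0.5, 0.0, 0.5)]
visible = [p for p in points if in_front_of_camera(p)]
print(visible)  # [(0.0, 3.0, 1.0)]
```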
I hope this clarifies what the 3D viewing transform steps are, why they are needed, and the order undertaken.
To sum up the discussion in the notes: The OP knows that the cuboid shares the same center as the cube and has six points of contact with the cube, and also knows the orientation of the cuboid. Also, the OP knows the equations of the planes of the faces of the cube. Therefore, the solution method is to let $l,w,h$ be the unknown dimensions of the gray cuboid. Then based on its orientation, you can write an expression in $l,w,h$ for the "top" vertex of the cuboid. Since that touches the top face of the cube, you know it must satisfy the equation of that plane, which yields one equation in $l,w,h$. Do the same for the two other independent contact points (i.e., not the "bottom" vertex, which by symmetry will give you an equation equivalent to the first), and you will have three equations in $l,w,h$ which you solve, and then you know everything about the cuboid.
Best Answer
It looks like you're looking for an affine transformation, i.e. a function $\varphi: \Bbb R^3 \to \Bbb R^3$ such that
$$\varphi(x) = Ax + b = \begin{pmatrix} a_{11} &a_{12} &a_{13}\\a_{21}&a_{22}&a_{23}\\a_{31}&a_{32}&a_{33} \end{pmatrix}\begin{pmatrix}x_1\\x_2\\x_3\end{pmatrix} + \begin{pmatrix}b_1\\b_2\\b_3\end{pmatrix} $$
that maps every point $(x_1,x_2,x_3)$ of your shape to its new position. Roughly speaking, matrix $A$ is responsible for rotations and scaling, and the column vector $b$ is responsible for translations. For instance, the affine transformation that would move your shape up one unit along the $z$ axis corresponds to
$$\varphi: x \to x + \begin{pmatrix}0\\0\\1\end{pmatrix}$$
so in this case $A = I_3$ and $b = (0,0,1)^T$. The affine transformation that rotates about the $z$ axis corresponds to
$$ \varphi: x \to \begin{pmatrix}\cos(\theta)&-\sin(\theta)&0\\\sin(\theta)&\cos(\theta)&0\\0&0&1 \end{pmatrix}x. $$
Here, $b = (0,0,0)^T$.
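To make the two examples concrete, here is a small sketch that evaluates $\varphi(x) = Ax + b$ in plain Python. The helper names `apply_affine` and `rot_z` are made up for illustration.

```python
import math

def apply_affine(A, b, x):
    """Return A x + b for a 3x3 matrix A (list of rows) and 3-vectors b, x."""
    return tuple(sum(A[i][j] * x[j] for j in range(3)) + b[i] for i in range(3))

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Translation up one unit along z: A = I_3, b = (0, 0, 1).
moved = apply_affine(I3, (0, 0, 1), (2, 3, 4))
print(moved)  # (2, 3, 5)

# Quarter turn about z: A = rot_z(pi/2), b = 0. Maps (1, 0, 0) to about (0, 1, 0).
turned = apply_affine(rot_z(math.pi / 2), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
print(turned)
```

The rotated result is only approximately $(0, 1, 0)$ because $\cos(\pi/2)$ is not exactly zero in floating point.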
In your case, you are given a number of inputs $(x_1,x_2,x_3)^T$ and outputs $\varphi((x_1,x_2,x_3)^T)$. You actually don't have to worry about figuring out what component of the transformation is rotation, translation or scaling. You just have to have enough given points on your shape so that you can find unique values of $A$ and $b$ that solve your problem.
Assuming a solution exists (which it should, given the nature of your problem), you will need $4$ given points to solve the problem, and they must not all lie in one plane, since otherwise the equations below do not determine a unique solution. Call these points $w = (w_1,w_2,w_3)$, $x = (x_1,x_2,x_3)$, $y = (y_1,y_2,y_3)$ and $z = (z_1,z_2,z_3)$. Similarly, let $\varphi(w) = (w_1',w_2',w_3')$ et cetera. Multiplying the equation $\varphi(x) = Ax+b$ out, we see that
$$\tag{1} \begin{pmatrix} w_1'\\w_2'\\w_3' \end{pmatrix} = \begin{pmatrix} a_{11} &a_{12} &a_{13}\\a_{21}&a_{22}&a_{23}\\a_{31}&a_{32}&a_{33} \end{pmatrix}\begin{pmatrix}w_1\\w_2\\w_3\end{pmatrix} + \begin{pmatrix}b_1\\b_2\\b_3\end{pmatrix} $$
This gives a system of three simultaneous equations, one in each row. The equation given by just looking at the first row looks like this:
$$ w_1' = a_{11}w_1 + a_{12}w_2 + a_{13}w_3 + b_1$$
When we plug the other three points $x$, $y$, and $z$ into the equation $\varphi(x) = Ax + b$ and just look at the first row, we get similar equations:
$$ x_1' = a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + b_1$$ $$ y_1' = a_{11}y_1 + a_{12}y_2 + a_{13}y_3 + b_1$$ $$ z_1' = a_{11}z_1 + a_{12}z_2 + a_{13}z_3 + b_1$$
Now it should be clear why we needed $4$ points. There are four unknowns: $a_{11}, a_{12}, a_{13}$ and $b_1$. The values of $x$, $y$, $z$ and $w$ (and their images $x'$, $y'$, $z'$ and $w'$) are given. So we need four linear equations in these four unknowns to find a solution. To actually find the solution, notice that this is a system of equations in four variables of the form
$$ \begin{pmatrix}w_1'\\x_1'\\y_1'\\z_1'\end{pmatrix} = \begin{pmatrix} w_1&w_2&w_3&1\\x_1&x_2&x_3&1\\y_1&y_2&y_3&1\\z_1&z_2&z_3&1\end{pmatrix}\begin{pmatrix}a_{11}\\a_{12}\\a_{13}\\b_1 \end{pmatrix} $$
which you should be able to solve for $(a_{11},a_{12},a_{13},b_1)^T$ using standard methods such as Gaussian elimination. If you do the same thing for the other two rows of the original matrix equation $(1)$, you can similarly find $(a_{21},a_{22},a_{23},b_2)^T$ and $(a_{31},a_{32},a_{33},b_3)^T$. That gives you the entire affine transformation $Ax + b$.
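Here is one possible sketch of the whole procedure in plain Python: build the $4\times 4$ matrix whose rows are $(p_1, p_2, p_3, 1)$ for the four given points, then solve it three times, once per output coordinate. The function names and the test data are invented for illustration, and the small Gauss-Jordan solver stands in for whatever linear solver you prefer.

```python
def solve_linear(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [list(map(float, row)) + [float(r)] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: are the four points coplanar?")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * c for a, c in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_affine(points, images):
    """Recover (A, b) with images[i] = A points[i] + b from four point pairs."""
    M = [list(p) + [1.0] for p in points]
    A, b = [], []
    for k in range(3):  # one 4x4 solve per row of A
        *a_row, b_k = solve_linear(M, [q[k] for q in images])
        A.append(a_row)
        b.append(b_k)
    return A, b

# Hypothetical data: quarter turn about z followed by a shift of +1 along z.
points = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
images = [(0, 0, 1), (0, 1, 1), (-1, 0, 1), (0, 0, 2)]
A, b = fit_affine(points, images)
print(A)  # close to [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
print(b)  # close to [0, 0, 1]
```

The same $4\times 4$ matrix is reused for all three solves; only the right-hand side changes, so in practice you could factor it once.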