If it were indeed $PA = PLU$, then $P$ would be useless: just premultiply by $P^{-1} = P^T$ and you get $A = LU$. No, it has to be $PA = LU$.
You do it in steps:

1. Find the pivot row and move it to the first position, say by swapping it with the first row. This gives you a permutation $P_1$.
2. Do the first-column elimination of $P_1 A$ to obtain
$$P_1 A = L_1 A_1,$$
where $L_1$ is lower triangular with a possibly non-zero first column and $1$s on its diagonal, while $A_1$ has zeros below the diagonal in the first column.
3. Repeat on $A_1[2:n,2:n]$ (i.e., $A_1$ without its first row and column) to obtain $P_2$, $L_2$, $A_2$, and so on.
4. When done, multiply (or smartly combine) all the $P_k$ and $L_k$ (extended to dimension $n$ by the identity) to get $P$ and $L$, respectively. You get $U$ by combining the first rows of all the $A_k$.
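The steps above can be sketched in Python. This is a toy implementation for clarity (the function name `plu` and the dense list-of-lists representation are my own choices, not from the text), not an efficient or robust one:

```python
def plu(A):
    """Factor PA = LU with partial pivoting, following the steps above.

    A is a square matrix as a list of lists; returns (P, L, U).
    A toy sketch for clarity, not production code.
    """
    n = len(A)
    U = [row[:] for row in A]                     # becomes upper triangular
    L = [[0.0] * n for _ in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n):
        # find the pivot row: largest |entry| in column k among rows k..n-1
        p = max(range(k, n), key=lambda i: abs(U[i][k]))
        # swap rows k and p in U and P, and in the already-built part of L
        U[k], U[p] = U[p], U[k]
        P[k], P[p] = P[p], P[k]
        L[k][:k], L[p][:k] = L[p][:k], L[k][:k]
        L[k][k] = 1.0
        # eliminate below the pivot, storing the multipliers in L
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            U[i] = [U[i][j] - m * U[k][j] for j in range(n)]
    return P, L, U

P, L, U = plu([[0.0, 1.0], [2.0, 3.0]])
```

Note that for this example with $A_{1,1} = 0$ the row swap is what makes the factorization possible at all.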
You should probably check out the Crout and LUP algorithms, where a similar (but, I think, somewhat better) approach is described.
A bit on the stability of LU
A question in the comments:
But you see, $PA = LU$ is the $LU$ decomposition of $PA$. I want the $LU$ decomposition of $A$. But since $A$ is not stable, from what I understood, I need to find $LU$ such that $A = LU$ by doing an LU factorization on $PA$. As in: find the $LU$ factorization of $A$ by running the $LU$ factorization of the stable $PA$. Does that not make sense?
Sensible as it may seem, it is not.
Take any $A$ such that $A_{1,1} = 0$ and you will not be able to do LU at all.
Now, take $A_{1,1}>0$, but very small, say $A_{1,1}=10^{-15}$, while all other elements of $A$ are non-zero integers. See what happens.
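To see it concretely, here is that experiment in floating-point arithmetic on a hypothetical $2\times 2$ system (only $A_{1,1} = 10^{-15}$ comes from the text; the second row $(1, 1)$ and right-hand side $(1, 2)$ are my own choice, picked so that the exact solution is $x_1 \approx 1$, $x_2 \approx 1$):

```python
# System: eps*x1 + x2 = 1,  x1 + x2 = 2, with a tiny pivot eps = 1e-15.
eps, rhs = 1e-15, [1.0, 2.0]

# Gaussian elimination WITHOUT pivoting: the multiplier 1/eps is huge.
m = 1.0 / eps
x2_bad = (rhs[1] - m * rhs[0]) / (1.0 - m * 1.0)
x1_bad = (rhs[0] - x2_bad) / eps      # cancellation: digits of x1 are lost

# WITH partial pivoting: swap the two equations first; the multiplier is eps.
x2_good = (rhs[0] - eps * rhs[1]) / (1.0 - eps * 1.0)
x1_good = rhs[1] - x2_good

print(x1_bad, x1_good)   # x1_bad is visibly off; x1_good is accurate
```

The eliminations are algebraically identical; only the order of the equations differs, yet one loses many digits of $x_1$.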
This is where the instability comes from, and the result itself is unstable, so no algorithm can fix it: any algorithm would have to arrive at that same final solution (the $LU$ factorization with unit-diagonal $L$, when it exists, is unique for nonsingular $A$).
By introducing partial pivoting, i.e., the row permutation applied on the left of $A$, you avoid this effect.
Effectively, these permutations only permute the equations of your system, nothing more. So your $LU$ of $PA$ is really the Gaussian elimination algorithm performed on the same system of equations, only with the equations in a different order.
Similarly, partial pivoting of columns would be the same as permuting the variables (which introduces the need to be careful about which variable gets which value in the end). Full pivoting permutes both the equations and the variables, for the most stable way to solve the system.
Variable-wise, all of these (when they can be performed) give you the same solution to a system of linear equations. But, in finite arithmetic, some solutions produce more errors than others. So, you'd rather get $x_1 = x_2$ than $x_1 = 10^{17}(1 + 10^{-17}x_2 - 1)$. Notice that the latter would actually give you $x_1 = 0$ if $x_2 \approx 1$, even though it is (in exact arithmetic) the same solution as $x_1 = x_2$.
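You can check this cancellation directly (the concrete value $x_2 = 1$ is just for illustration):

```python
x2 = 1.0
x1_direct = x2                                 # the benign form x1 = x2
x1_cancel = 1e17 * (1.0 + 1e-17 * x2 - 1.0)    # algebraically the same

# In double precision, 1.0 + 1e-17 rounds back to exactly 1.0
# (1e-17 is below half an ulp of 1.0), so x1_cancel comes out as 0.0.
print(x1_direct, x1_cancel)   # prints 1.0 0.0
```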
That's why $A = LU$ can be a problem in itself, not just the way you computed it.
Best Answer
The $LU$ factorization of a matrix has other advantages than just speeding up Gaussian elimination. It is also really efficient from a storage point of view. Example: take the matrix $$A = \begin{pmatrix} \phantom{-}2 & \phantom{-}1 & \phantom{-}3 \\ \phantom{-}4 & -1 & \phantom{-}3 \\ -2 & \phantom{-}5 & \phantom{-}5 \end{pmatrix}.$$ Applying $LU$ factorization we get the identity $$A = \begin{pmatrix} \phantom{-}1 & \phantom{-}0 & \phantom{-}0 \\ \phantom{-}2 & \phantom{-}1 & \phantom{-}0 \\ -1 & -2 & \phantom{-}1 \end{pmatrix} \begin{pmatrix} \phantom{-}2 & \phantom{-}1 & \phantom{-}3 \\ \phantom{-}0 & -3 & -3 \\ \phantom{-}0& \phantom{-}0 & \phantom{-}2 \end{pmatrix} = LU.$$

Notice how we can cleverly store the $LU$ factorization in the computer using the matrix $B$ with elements $$b_{ij} = \begin{cases} l_{ij} & \text{if } i > j \\ u_{ij} & \text{otherwise,} \end{cases}$$ that is, in this particular example, the matrix $$B = \begin{pmatrix} \phantom{-}2 & \phantom{-}1 & \phantom{-}3 \\ \phantom{-}2 & -3 & -3 \\ -1 & -2 & \phantom{-}2 \end{pmatrix}.$$ ($L$'s unit diagonal need not be stored, so a single $n \times n$ array holds both factors.)

So you can see that the $LU$ factorization of a matrix has some really convenient storage applications. Think about calculating the determinant of $A$: all you need is the product $$2 \times (-3) \times 2 = -12$$ from the diagonal of $B$.
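This packed storage can be reproduced with a short in-place Doolittle elimination. A sketch without pivoting (which happens to be fine for this particular $A$); the function name `lu_inplace` is my own:

```python
def lu_inplace(A):
    """Overwrite A with its packed LU factorization (Doolittle, no pivoting):
    the upper triangle holds U, the strict lower triangle holds L's
    multipliers, and L's unit diagonal is implicit."""
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                # multiplier l_ik, stored in place
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]  # update the trailing submatrix
    return A

B = lu_inplace([[2.0, 1.0, 3.0],
                [4.0, -1.0, 3.0],
                [-2.0, 5.0, 5.0]])
# B == [[2, 1, 3], [2, -3, -3], [-1, -2, 2]], matching the example above.
det = B[0][0] * B[1][1] * B[2][2]   # det(A) = product of U's diagonal = -12
```

No extra memory beyond the original array is needed, which is exactly the storage advantage described above.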