MATLAB: Reshape a column vector containing several stocks returns into a matrix format

column to matrixcrspreshapetransformation

Hello everyone,
I have a column vector containing dates and the respective stock prices for a large set of stocks with different starting and ending dates (as some stocks were only listed after the start of the sample or were delisted during the sample period). Hence, the first 100 rows of the vector might containg the dates and stock returns of stock 1, rows 101 to 350 the dates and prices of stock 2, rows 351 to 400 the same data of stock 3 and so on. I obtained the data set from the Wharton CRSP database.
Now, i would like to transform this column vector into a matrix which contains the dates for the largest available time period in the a first column and the respective stock prices in the subsequent columns such that column 2 contains the stock prices of stock 1, column 3 the stock prices of stock 2 and so on.
I already tried several merging or reshaping commands without any success. Therefore, I would be very thankful for any advice.
Kind regards, Maximilian

Best Answer

Here is one way to process your data. You'll have to understand and check that it is correct (at least, check a dozen numbers in the array of prices/returns). I attached the M file, but I copy/pasted the code below as well, to facilitate the discussion. The output of block 3 is the plot below
Note 1 - you may want to update the terminology. What I named "prices" initially is the "returns" that you mention in your last comment.
Note 2 - base functions in MATLAB don't compute with NaNs:
>> sum(prices(1,:))
ans =
NaN
If you have the Financial toolbox, you can use special function for that purpose, e.g.
>> nansum(prices(1,:))
ans =
326.5000
If not, you can extract non-nan elements and aply base operations on these
>> id = ~isnan( prices(1,:) ) ;
>> sum( prices(1,id) )
ans =
326.5000
----[ CODE ]---------------------------------------------------------------------------------------------------
%%--- 1. Read data.
fprintf( 'Reading data.. ' ) ;
tic ;
data = xlsread( 'demo_data.xlsx' ) ;
fprintf( ': done in %.2es\n', toc ) ;
%%--- 2. Process data.
fprintf( 'Processing data.. ' ) ;
tic ;
% - 2a. Build list of unique PERMCOs. This corresponds to the horizontal
% header of the table of prices defined below.

permcos = unique( data(:,1) ).' ;
% - 2b. Build list of unique dates. This corresponds the the vertical
% header of the table of prices defined below.
dates = unique( data(:,2) ) ;
minDate = min( dates ) ;
% - 2c. This is the complicated part. Let's start with an example. We have
% a vector of dates
% 1] 19260102
% 2] 19260104
% 3] 19260105
% 4] 19260106
% .. ...

% 23283] 20131231
%



% where the number behind ] is the index of the element that follows. We
% will need a table which makes the inverse correspondance, which means
% making the association 19260102 -> 1, 19260104 -> 2, etc. We could build
% a giant array where element 19260102 is 1, etc.
%
% 19260102] 1
% 19260104] 2
% 19260105] 3
% 19260106] 4
% ... ..
% 20131231] 23283
%
% but it would be quite large in memory (~162MB). Instead, we build a
% mechanism which makes the correspondance between (date - min(dates) + 1)
% and indices
% 1] 1
% 3] 2 % Observe the discont. (*)
% 4] 3
% 5] 4
% .. ...
% 871129] 23283
%
% (*) which corresponds to the gap between 19260102 and 19260104. This
% array takes less than 7MB RAM, which is reasonable. We call that a lookup
% table; it associates a series of numbers/indices to another series of
% numbers.
shiftedDates = dates - minDate + 1 ;
dates_LT = zeros( max(shiftedDates), 1 ) ;
dates_LT(shiftedDates) = 1 : numel( shiftedDates ) ;
% If you display dates_LT(1:10), you'll see that it worked: element 1 is 1,
% element 3 is 2, etc.
% - 2c. Build array of prices per date/PERMCO.
% Prealloc an array of NaNs.
prices = nan( numel(dates), numel(permcos) ) ;
% Iterate through PERMCOs and define corresponding columns.
for pId = 1 : numel(permcos)
% Identify (build a logical index) rows of the data table associated
% with current PERMCO. The logical index will then be used to extract
% relevant dates/prices.
lId = data(:,1) == permcos(pId) ;
% Extract relevant dates.
currentDates = data(lId,2) ;
% Build the index, in the table of prices, of entries corresponding to
% these dates: shift them and then use lookup table.
dId = dates_LT(currentDates - minDate + 1) ;
% Store current prices at relevant "places" in the array of prices. The
% column is defined by the PERMCO ID (pId) and rows are defined by date
% IDs that we just computed.
prices(dId,pId) = data(lId,3) ;
end
fprintf( ': done in %.2es\n', toc ) ;
%%--- 3. Plot time series.
% - 3a. Plot, grid, ylabel.
plot( dates, nansum(prices, 2), 'Color', 0.85*[1,1,1], 'LineWidth', 2 ) ;
hold on ; grid on ;
plot( dates, prices ) ;
ylabel( 'Return [$]' ) ;
% - 3b. Legend.
legenItems = arrayfun( @(pId) sprintf( 'PERMCO %d', pId), permcos, ...
'UniformOutput', false ) ;
legend( [{'Sum portfolio (?)'}, legenItems{:}], 2 ) ;