Getting familiar with some of the pandas methods will make your life easier when it comes to handling data. We will discuss a few commonly used pandas methods below.
dropna()
dropna()
>>> prices = yfinance.download([ "AAPL", "BTC-USD", "TSLA" ], start="2020-01-02", end="2020-12-31")["Adj Close"]
[*********************100%***********************] 3 of 3 completed
>>>
>>> prices
AAPL BTC-USD TSLA
Date
2020-01-02 73.683571 6985.470215 28.684000
2020-01-03 72.967201 7344.884277 29.534000
2020-01-04 NaN 7410.656738 NaN
2020-01-05 NaN 7411.317383 NaN
2020-01-06 73.548637 7769.219238 30.102667
... ... ... ...
2020-12-26 NaN 26437.037109 NaN
2020-12-27 NaN 26272.294922 NaN
2020-12-28 135.292664 27084.808594 221.229996
2020-12-29 133.491257 27362.437500 221.996674
2020-12-30 132.353012 28840.953125 231.593338
[364 rows x 3 columns]
get rid of NaN
>>> prices = prices.dropna()
>>> prices
AAPL BTC-USD TSLA
Date
2020-01-02 73.683571 6985.470215 28.684000
2020-01-03 72.967201 7344.884277 29.534000
2020-01-06 73.548637 7769.219238 30.102667
2020-01-07 73.202728 8163.692383 31.270666
2020-01-08 74.380287 8079.862793 32.809334
... ... ... ...
2020-12-23 129.621231 23241.345703 215.326660
2020-12-24 130.620895 23735.949219 220.589996
2020-12-28 135.292664 27084.808594 221.229996
2020-12-29 133.491257 27362.437500 221.996674
2020-12-30 132.353012 28840.953125 231.593338
[252 rows x 3 columns]
shift(1)
>>> prices
AAPL BTC-USD TSLA
Date
2020-01-02 73.683571 6985.470215 28.684000
2020-01-03 72.967201 7344.884277 29.534000
2020-01-06 73.548637 7769.219238 30.102667
2020-01-07 73.202728 8163.692383 31.270666
2020-01-08 74.380287 8079.862793 32.809334
... ... ... ...
2020-12-23 129.621231 23241.345703 215.326660
2020-12-24 130.620895 23735.949219 220.589996
2020-12-28 135.292664 27084.808594 221.229996
2020-12-29 133.491257 27362.437500 221.996674
2020-12-30 132.353012 28840.953125 231.593338
[252 rows x 3 columns]
>>> prices.shift(1)
AAPL BTC-USD TSLA
Date
2020-01-02 NaN NaN NaN
2020-01-03 73.683571 6985.470215 28.684000
2020-01-06 72.967201 7344.884277 29.534000
2020-01-07 73.548637 7769.219238 30.102667
2020-01-08 73.202728 8163.692383 31.270666
... ... ... ...
2020-12-23 130.531815 23783.029297 213.446671
2020-12-24 129.621231 23241.345703 215.326660
2020-12-28 130.620895 23735.949219 220.589996
2020-12-29 135.292664 27084.808594 221.229996
2020-12-30 133.491257 27362.437500 221.996674
[252 rows x 3 columns]
Calculate returns
>>> returns = prices / prices.shift(1) - 1
>>> returns
AAPL BTC-USD TSLA
Date
2020-01-02 NaN NaN NaN
2020-01-03 -0.009722 0.051452 0.029633
2020-01-06 0.007968 0.057773 0.019255
2020-01-07 -0.004703 0.050774 0.038801
2020-01-08 0.016086 -0.010269 0.049205
... ... ... ...
2020-12-23 -0.006976 -0.022776 0.008808
2020-12-24 0.007712 0.021281 0.024443
2020-12-28 0.035766 0.141088 0.002901
2020-12-29 -0.013315 0.010250 0.003466
2020-12-30 -0.008527 0.054034 0.043229
[252 rows x 3 columns]
Get rid of NaN in the result
>>> returns = returns.dropna()
>>> returns
AAPL BTC-USD TSLA
Date
2020-01-03 -0.009722 0.051452 0.029633
2020-01-06 0.007968 0.057773 0.019255
2020-01-07 -0.004703 0.050774 0.038801
2020-01-08 0.016086 -0.010269 0.049205
2020-01-09 0.021241 -0.024851 -0.021945
... ... ... ...
2020-12-23 -0.006976 -0.022776 0.008808
2020-12-24 0.007712 0.021281 0.024443
2020-12-28 0.035766 0.141088 0.002901
2020-12-29 -0.013315 0.010250 0.003466
2020-12-30 -0.008527 0.054034 0.043229
[251 rows x 3 columns]
fillna()
Replace NaN with whatever reasonable value you want. Using the returns example above:
>>> returns = prices / prices.shift(1) - 1
>>> returns.head()
AAPL BTC-USD TSLA
Date
2020-01-02 NaN NaN NaN
2020-01-03 -0.009722 0.051452 0.029633
2020-01-06 0.007968 0.057773 0.019255
2020-01-07 -0.004703 0.050774 0.038801
2020-01-08 0.016086 -0.010269 0.049205
>>>
>>> returns = returns.fillna(0)
>>> returns.head()
AAPL BTC-USD TSLA
Date
2020-01-02 0.000000 0.000000 0.000000
2020-01-03 -0.009722 0.051452 0.029633
2020-01-06 0.007968 0.057773 0.019255
2020-01-07 -0.004703 0.050774 0.038801
2020-01-08 0.016086 -0.010269 0.049205
iloc(): position based selection
Select rows
>>> prices.iloc[0:8]
AAPL BTC-USD TSLA
Date
2020-01-02 73.683571 6985.470215 28.684000
2020-01-03 72.967201 7344.884277 29.534000
2020-01-06 73.548637 7769.219238 30.102667
2020-01-07 73.202728 8163.692383 31.270666
2020-01-08 74.380287 8079.862793 32.809334
2020-01-09 75.960182 7879.071289 32.089333
2020-01-10 76.131920 8166.554199 31.876667
2020-01-13 77.758423 8144.194336 34.990665
select columns
>>> prices.iloc[:,[0,2]]
AAPL TSLA
Date
2020-01-02 73.683571 28.684000
2020-01-03 72.967201 29.534000
2020-01-06 73.548637 30.102667
2020-01-07 73.202728 31.270666
2020-01-08 74.380287 32.809334
... ... ...
2020-12-23 129.621231 215.326660
2020-12-24 130.620895 220.589996
2020-12-28 135.292664 221.229996
2020-12-29 133.491257 221.996674
2020-12-30 132.353012 231.593338
[252 rows x 2 columns]
Mixed rows and columns selection
>>> prices.iloc[0:8,[0,2]]
AAPL TSLA
Date
2020-01-02 73.683571 28.684000
2020-01-03 72.967201 29.534000
2020-01-06 73.548637 30.102667
2020-01-07 73.202728 31.270666
2020-01-08 74.380287 32.809334
2020-01-09 75.960182 32.089333
2020-01-10 76.131920 31.876667
2020-01-13 77.758423 34.990665
loc(): Label based selection
Select rows with row indexes
>>> prices.loc[["2020-01-02","2020-01-08", "2020-12-30"]]
AAPL BTC-USD TSLA
Date
2020-01-02 73.683571 6985.470215 28.684000
2020-01-08 74.380287 8079.862793 32.809334
2020-12-30 132.353012 28840.953125 231.593338
Select columns with column names
>>> prices.loc[:, ["AAPL", "TSLA"]]
AAPL TSLA
Date
2020-01-02 73.683571 28.684000
2020-01-03 72.967201 29.534000
2020-01-06 73.548637 30.102667
2020-01-07 73.202728 31.270666
2020-01-08 74.380287 32.809334
... ... ...
2020-12-23 129.621231 215.326660
2020-12-24 130.620895 220.589996
2020-12-28 135.292664 221.229996
2020-12-29 133.491257 221.996674
2020-12-30 132.353012 231.593338
[252 rows x 2 columns]
Mixed row and column selection
>>> prices.loc[["2020-01-02","2020-01-08", "2020-12-30"], ["AAPL", "TSLA"]]
AAPL TSLA
Date
2020-01-02 73.683571 28.684000
2020-01-08 74.380287 32.809334
2020-12-30 132.353012 231.593338
Logical selection for rows
>>> prices.loc[prices.index == "2020-12-30"]
AAPL BTC-USD TSLA
Date
2020-12-30 132.353012 28840.953125 231.593338
rows with columns
>>> prices.loc[prices.index == "2020-12-30", ["AAPL","TSLA"] ]
AAPL TSLA
Date
2020-12-30 132.353012 231.593338
Select a particular month
>>> prices.loc[prices.index.month == 12]
AAPL BTC-USD TSLA
Date
2020-12-01 121.465462 18802.998047 194.919998
2020-12-02 121.821777 19201.091797 189.606674
2020-12-03 121.683220 19445.398438 197.793335
2020-12-04 121.000267 18699.765625 199.679993
2020-12-07 122.484932 19191.630859 213.919998
2020-12-08 123.108482 18321.144531 216.626663
2020-12-09 120.535065 18553.916016 201.493332
2020-12-10 121.980141 18264.992188 209.023331
2020-12-11 121.158630 18058.904297 203.330002
2020-12-14 120.535065 19246.644531 213.276672
2020-12-15 126.572701 19417.076172 211.083328
2020-12-16 126.503426 21310.597656 207.589996
2020-12-17 127.384331 22805.162109 218.633331
2020-12-18 125.365181 23137.960938 231.666672
2020-12-21 126.919128 22803.082031 216.619995
2020-12-22 130.531815 23783.029297 213.446671
2020-12-23 129.621231 23241.345703 215.326660
2020-12-24 130.620895 23735.949219 220.589996
2020-12-28 135.292664 27084.808594 221.229996
2020-12-29 133.491257 27362.437500 221.996674
2020-12-30 132.353012 28840.953125 231.593338
>>> prices.loc[prices.index == pd.to_datetime("2020-12-30")]
AAPL BTC-USD TSLA
Date
2020-12-30 132.353012 28840.953125 231.593338