NumerAI: Week 3

My third week included getting my results back (bottom 15%) and looking at different ways that I can use the features.

Round 256 Results

While I won’t know my real results for a while (how my model’s predictions stack up against actual market movement), I did get the results back on the “test” data. It wasn’t great: I was in the bottom 15%. I expected this, so it wasn’t a big deal.

CORR Rank: 5096/6025
CORR Reputation: -0.0977
MMC Reputation: -0.0984
FNC Reputation: -0.0978
Current Reputation (built over the last 20 rounds)

Validation Sharpe: 0.6363 (14.97%)
Validation Corr: 0.0160 (20.05%)
Validation FNC: 0.0112 (32.74%)
Corr + MMC Sharpe: 0.5124 (11.01%)
MMC Mean: 0.0041 (75.73%)
Corr With Example Preds: 0.4026
Diagnostic Results

Data Analysis

The first thing I am looking at is trying to figure out what each column/section means. As a start, I checked the average of each column, and it turns out to be 0.5.
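That per-column average check is a one-liner in pandas. A minimal sketch, using a made-up stand-in frame (the column names and binned values are assumptions, not the real NumerAI schema):

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Hypothetical stand-in for the training data: feature columns holding
# binned values like 0, 0.25, 0.5, 0.75, 1.
df = pd.DataFrame(
    np.random.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=(1000, 4)),
    columns=[f"feature_{i}" for i in range(4)],
)

# Average each feature column -- on this synthetic frame (and, per the
# observation above, on the real data) these land near 0.5.
means = df.filter(like="feature").mean()
print(means.round(2))
```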

NumerAI: Week 2

My second week had me actually submit my solution. It is a straightforward neural network, so I didn’t expect anything great. That said, they do give you 0.01 NMR (~$4), so it feels like you have some skin in the game. Once I get a model a bit higher (>0.02 Spearman correlation) I will put in more money. I added $100 that I can stake at some point.

Google Colab

It appears that adding [“”] to the Colab settings gives you 2x the memory. It is probably best that I didn’t find this sooner, since it forced me to clean up my code.

Large CSV Files

After looking around at some documentation, I saw that converting each of the numeric fields to float16 cut my memory usage by over 74%. It still takes a while to load the CSV, but after that I can do quite a bit with the DataFrame pretty quickly.
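The downcast can be done at read time by passing a dtype map to `pd.read_csv`. A sketch on a tiny made-up CSV (the column names are assumptions; the exact savings depend on how many numeric columns the file has):

```python
import io
import numpy as np
import pandas as pd

# Tiny stand-in CSV; the real NumerAI files have hundreds of feature columns.
raw = "id,feature_a,feature_b,target\n1,0.25,0.5,0.75\n2,0.5,1.0,0.0\n"

# Default load: numeric columns come back as float64 (8 bytes each).
df64 = pd.read_csv(io.StringIO(raw))

# Downcast the numeric columns to float16 (2 bytes each) at read time.
dtypes = {"feature_a": np.float16, "feature_b": np.float16, "target": np.float16}
df16 = pd.read_csv(io.StringIO(raw), dtype=dtypes)

saved = 1 - df16.memory_usage(deep=True).sum() / df64.memory_usage(deep=True).sum()
print(f"memory saved on this toy frame: {saved:.0%}")
```

On the real files, where nearly every column is numeric, the per-column 4x shrink dominates and the overall savings approach the 74% figure above.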

Custom Loss/Metric Method

I am still trying to find a TF version of the Spearman loss function. There is an implementation in PyTorch that I might be able to adapt.

NumerAI: Week 1

My first week was really just getting everything set up. My biggest issue was figuring out the best way to handle over 3GB of data in a CSV file.

Overall, I think it was a good start but I will have to keep touching up my model to see if I can squeeze out some performance. Below are some things I discovered.

Google Colab

I have never stored data in the local session before and was surprised when all of my work was gone when I came back to it. Note to self: store it in Google Drive.

I also ran into issues while trying to build all the datasets. Since they are so large, that is a lot to hold in memory.

I really wish I had more knowledge and control within Python/Colab to release some of that memory.
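The usual lever here is dropping the last reference to a big object and forcing a garbage-collection pass. A minimal sketch (the array is just a stand-in for a large intermediate DataFrame):

```python
import gc
import numpy as np

# A large throwaway array standing in for an intermediate DataFrame.
big = np.zeros((1_000, 1_000))

# Drop the last reference, then force a collection pass so the interpreter
# can hand the memory back during a long-running Colab session.
del big
gc.collect()
```

Note this only helps if nothing else still references the object; a slice or view kept elsewhere will keep the underlying buffer alive.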

Large CSV Files

I have either used a relational database for large datasets or CSVs for smaller ones. This is the first time I am getting large datasets in CSV format: the training CSV is ~750 MB and the validation dataset is >2 GB.

I need to be aware of the sizes and not load too much into memory.

Custom Loss/Metric Method

I am trying to implement their scoring metric (Spearman correlation, i.e. Pearson correlation computed on ranks) and it is a giant pain in the butt. I have seen a few implementations floating around, but they don’t work with TFv2. Fun times.
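The metric itself is simple outside of TF: rank both series, then take the Pearson correlation of the ranks. A plain-NumPy sketch (no tie handling, and this is only the evaluation metric, not a differentiable TF loss):

```python
import numpy as np

def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation computed on ranks.

    Minimal sketch -- double argsort gives each element its rank,
    ignoring ties for brevity.
    """
    rank_true = np.argsort(np.argsort(y_true)).astype(float)
    rank_pred = np.argsort(np.argsort(y_pred)).astype(float)
    return np.corrcoef(rank_true, rank_pred)[0, 1]

# A monotonic but non-linear relationship still scores a perfect 1.0,
# which is exactly why ranks are used instead of raw values.
y = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([1.0, 4.0, 9.0, 16.0])
print(spearman(y, p))  # -> 1.0
```

The TFv2 pain comes from ranking not being differentiable, which is why a drop-in Spearman loss is hard to find.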

NumerAI Tournament

I was sent a link about NumerAI’s tournament. The idea is that the knowledge of the crowd is better than the knowledge of a few. A bunch of different models (2,424 as of right now) are built on the same dataset, and then the company uses all of them to predict price movement. My assumption is that something like a random forest is used to combine them.

I am going to use this opportunity to create a web series about my experiences. This should keep me busy and my data science skills sharp until next basketball season, when I release my soon-to-be-built models.