cache files not being used on a different machine

Hi!

I have a virtual server that I want to use with a JupyterLab install so it can be accessible from anywhere, but the server only has 6 GB of RAM, so I cannot compute the signals for the daily variants.

So I downloaded the notebook to my Mac, which has plenty of RAM, ran the computations, verified that on the second run it reads the cache, and then uploaded all the files (the entire folder, in fact) to my virtual server. Everything seems to work fine: it finds the cache file on disk thanks to the hash in the name, but it then reruns the calculations instead of loading the data from the pickle, which gives an out-of-memory error.

Examples below:
  • Virtual Server

    %%time
    df_val_signals_d = hub.val_signals(variant='daily')

    Dataset "us-shareprices-daily" on disk (0 days old).
    - Loading from disk ... Done!
    Cache-file 'val_signals-77ca1f39.pickle' on disk (0 days old).
    - Running function val_signals() ...
    ---------------------------------------------------------------------------
    MemoryError Traceback (most recent call last)
    <ipython-input-...> in <module>
  • MacBook

    %%time
    df_val_signals_d = hub.val_signals(variant='daily')

    Dataset "us-shareprices-daily" on disk (0 days old).
    - Loading from disk ... Done!
    Cache-file 'val_signals-77ca1f39.pickle' on disk (0 days old).
    - Loading from disk ... Done!
    CPU times: user 8.26 s, sys: 2.07 s, total: 10.3 s
    Wall time: 10.5 s
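
To rule out the pickle itself, one thing I could try on the server is loading the cache file directly with Pandas; if even that runs out of memory, the problem is the size of the DataFrame rather than the cache logic. (This assumes the cache files live under ~/simfin_data/cache, i.e. wherever sf.set_data_dir() points; adjust the path if yours differs.)

    import glob
    import os

    import pandas as pd

    # Locate the hash-stamped cache file (the directory is an assumption,
    # adjust it to your simfin data directory).
    cache_dir = os.path.expanduser('~/simfin_data/cache')
    path = glob.glob(os.path.join(cache_dir, 'val_signals-*.pickle'))[0]
    print('Cache file:', path, '-', os.path.getsize(path) // 2**20, 'MB on disk')

    # If this line alone exhausts the server's RAM, it is a memory problem,
    # not a caching problem.
    df = pd.read_pickle(path)
    df.info(memory_usage='deep')
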
Any ideas how to fix this?

Thanks!

Comments

  • Hmm, not sure about this. Maybe the pickled DataFrame just doesn't fit in memory on your server?
  • But we'll double-check whether the calculations are really performed twice, even after loading the file from disk.
  • “Hmm, not sure about this. Maybe the pickled DataFrame just doesn't fit in memory on your server?”

    There is no check in your code regarding available memory, though (rough sketch of what I mean below).
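
    Something like the sketch below could at least warn before the heavy load is attempted. It is only an illustration: psutil is an extra dependency, and the factor of three for the in-memory size of a pickled DataFrame is just a rough guess of mine.

        import os

        import psutil  # extra dependency, not something simfin uses as far as I know

        def warn_if_pickle_too_big(path, safety_factor=3):
            """Warn when the unpickled DataFrame probably won't fit in RAM.

            safety_factor is a rough guess: a DataFrame in memory is often
            a few times larger than its pickle on disk.
            """
            size_on_disk = os.path.getsize(path)
            available = psutil.virtual_memory().available
            if size_on_disk * safety_factor > available:
                print(f'Warning: pickle is {size_on_disk / 2**20:.0f} MB on disk, '
                      f'but only {available / 2**20:.0f} MB of RAM is available.')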
  • “But we'll double-check whether the calculations are really performed twice, even after loading the file from disk.”

    I suspect it may come from the section before line 259 here; it depends on the value of compute. I was thinking it might be due to the comparison of days instead of a pure timestamp (i.e. days > 0 being false), but that would also have applied to the previous dataset that was cached… (I have sketched what I mean at the end of this comment.)

    I need to check some more; maybe play with the timestamps of the files, or maybe it is something in the OS-related functions (Ubuntu vs. OS X).

    I am not sure how I can “debug” the code of the simfin package while it is being used from the notebook…

    Any suggestions welcome :)

    PS: By the way, you may want to ensure the code works on machines with less RAM, or provide a way to compute the data on one machine and then move it to others. I believe you use a powerhouse machine for dev/test, from what I understand of your day-to-day hobbies/activities :)
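
    To illustrate the kind of refresh logic I have in mind, here is a rough sketch (a hypothetical helper, not the actual simfin code) of how a days-based check could decide to rerun the function even though the pickle is sitting right there:

        import os
        import time

        import pandas as pd

        def load_or_compute(cache_path, func, refresh_days=1):
            """Hypothetical days-based cache decision, not the real simfin code."""
            if os.path.exists(cache_path):
                # Age of the cache file in days, derived from its mtime.
                age_days = (time.time() - os.path.getmtime(cache_path)) / (24 * 60 * 60)
                compute = age_days > refresh_days
            else:
                compute = True

            if compute:
                # This looks like the branch I hit on the server: the file is
                # found, but the flag still says "recompute".
                df = func()
                df.to_pickle(cache_path)
            else:
                df = pd.read_pickle(cache_path)
            return df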
  • “PS: By the way, you may want to ensure the code works on machines with less RAM, or provide a way to compute the data on one machine and then move it to others. I believe you use a powerhouse machine for dev/test, from what I understand of your day-to-day hobbies/activities :)”

    Yeah, this is something we didn't fully consider when building the Python API more than a year ago. The share-price files, which were already quite big back then, grow by around 3k data points every day, so they are considerably larger now than they used to be. That is also why we introduced the pre-calculated ratios some time ago.

    The more general problem is that Pandas requires all the data to be loaded into memory (AFAIK), and we use Pandas DataFrames for everything. I guess we could do something like splitting the data files into several pieces, or filtering (by starting year etc.) before loading everything into the DataFrames (rough sketch of that below).
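
    For instance, something along these lines might help; it is only a sketch, and the file name, separator and column name are assumptions about the bulk CSV layout rather than tested code:

        import pandas as pd

        # Read the big share-price file in chunks so the whole thing never
        # has to sit in memory at once (file name, separator and column
        # name are assumed here).
        chunks = pd.read_csv('us-shareprices-daily.csv', sep=';',
                             parse_dates=['Date'], chunksize=500_000)

        # Keep only rows from 2018 onwards, chunk by chunk, then combine.
        df = pd.concat(chunk[chunk['Date'] >= '2018-01-01'] for chunk in chunks)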