Bulk Data Format to Aid Data Science Related Work

edited December 2018 in Feature suggestions
Hi all. I suggest changing the format of the bulk download data (or adding an additionally formatted dataset) to facilitate data science work, which is presumably what most people want to do with the bulk data. The new format would have columns: Quarter (YYYYQ), Ticker, SimFinID, IndustryCode, -IndicatorNames-. Each indicator name would be its own column, holding the average value over that quarter. Is there any appetite for this? I expect that's the layout most people look for when they come from a data science perspective. I tried to create this myself from the narrow format data, but it's hard due to the enormous number of NAs on each date.
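To make the proposed layout concrete, here is a minimal pandas sketch of going from narrow to wide. The narrow column names (`Indicator`, `Date`, `Value`), the indicator names, and the sample rows are assumptions for illustration only and may not match the actual bulk download.

```python
import pandas as pd

# Hypothetical narrow-format sample: one row per (company, indicator, date).
# Column names, indicator names and values are made up for illustration.
narrow = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
    "SimFinID": [1, 1, 1, 1, 2, 2],
    "IndustryCode": [100, 100, 100, 100, 200, 200],
    "Indicator": ["Share Price", "Share Price", "Revenues",
                  "Share Price", "Share Price", "Revenues"],
    "Date": pd.to_datetime(["2018-01-02", "2018-02-01", "2018-03-31",
                            "2018-04-02", "2018-01-02", "2018-03-31"]),
    "Value": [10.0, 12.0, 500.0, 13.0, 20.0, 900.0],
})


def to_wide(df):
    """Pivot narrow data into one row per company and quarter."""
    df = df.copy()
    # Build a YYYYQ label, e.g. "20181" for the first quarter of 2018.
    df["Quarter"] = (df["Date"].dt.year.astype(str)
                     + df["Date"].dt.quarter.astype(str))
    wide = df.pivot_table(
        index=["Quarter", "Ticker", "SimFinID", "IndustryCode"],
        columns="Indicator",   # each indicator becomes its own column
        values="Value",
        aggfunc="mean",        # average value over the quarter
    )
    wide.columns.name = None
    return wide.reset_index()


wide = to_wide(narrow)
print(wide)
```

Indicators with no value in a given quarter come out as NaN, so the averaging only hides the NAs within a quarter, not across quarters.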


  • Hi ste,

    What do you mean by "enormous number of NAs on each date"? Are you referring to the stock prices, which are updated daily, while the fundamentals are only updated quarterly? So would you want some stock price average for the period, or something else?
  • Hi tflassbeck, thanks for your reply and Merry Christmas.

    Yeah, I realised recently that that was the issue. I think what I need to do is split the narrow bulk download data into rows where the Indicator is stock price, common shares outstanding, etc. (the more densely populated ones) and the remaining rows, apply separate treatments to each (such as averaging stock prices over the period), and then join the two again.

    I don't know if something like that sounds reasonable or useful to others, it's just a brief suggestion. I'll continue to try to preprocess the data myself in the meantime anyway.

  • Hi, I am currently trying something similar. Did you succeed?
  • Something that transforms data of different densities into something that can, e.g., be easily digested by the default sklearn interface would be great. Happy to share my thoughts if I make progress.
  • I created a tool that converts the bulk data set into table format for analysis: https://github.com/jmrichardson/simfin

    Hope it helps,
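
For anyone who wants to try the split-then-join idea from earlier in the thread by hand, a rough pandas sketch follows. It is not how the linked tool works. The column names, the `DAILY` list of densely populated indicators, and the choice to keep the last reported value for the sparse indicators are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical list of densely populated (daily) indicators; adjust to the real data.
DAILY = ["Share Price"]

# Made-up narrow-format sample; column names are assumptions.
sample = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
    "Indicator": ["Share Price", "Share Price", "Revenues",
                  "Share Price", "Share Price", "Revenues"],
    "Date": pd.to_datetime(["2018-01-02", "2018-02-01", "2018-03-31",
                            "2018-04-02", "2018-01-02", "2018-03-31"]),
    "Value": [10.0, 12.0, 500.0, 13.0, 20.0, 900.0],
})


def split_and_join(df, daily=DAILY):
    """Treat dense and sparse indicators separately, then join per quarter."""
    df = df.copy()
    df["Quarter"] = (df["Date"].dt.year.astype(str)
                     + df["Date"].dt.quarter.astype(str))
    keys = ["Quarter", "Ticker"]
    is_daily = df["Indicator"].isin(daily)

    # Dense indicators: average the daily values over each quarter.
    dense = df[is_daily].pivot_table(
        index=keys, columns="Indicator", values="Value", aggfunc="mean")

    # Sparse indicators: keep the last value reported within each quarter.
    sparse = df[~is_daily].sort_values("Date").pivot_table(
        index=keys, columns="Indicator", values="Value", aggfunc="last")

    # Outer join keeps quarters that appear in only one of the two parts.
    joined = dense.join(sparse, how="outer")
    joined.columns.name = None
    return joined.reset_index()


result = split_and_join(sample)
print(result)
```

The outer join leaves NaN where a quarter has prices but no fundamentals yet, so any forward filling or dropping of incomplete quarters is still a separate decision.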