Subject: Re: ML for MI question
Hello there tedthedog ..... Great thoughts and questions!
Also lizgdal ..... Thanks for the list ... it's interesting.
A couple of comments: QUOTE: "because LLMs aren't great with numbers/arithmetic" ENDQUOTE
This statement is misguided - in the end, all "AI/ML" tools operate on numerically encoded data (i.e. binary/continuous). LLMs and Generative AI are built on Transformer networks (not CNNs - Convolutional Neural Networks - which dominate in vision), and neural nets can't handle anything other than numeric data at the final processing end. What happens in the intermediate layers is that the text/image components get tokenized and embedded as numeric vectors - which might lead you to believe that categorization is the way to go.
However, this is the fundamental truth across all machine learning algorithms: converting Numeric to Categorical = Loss of Fidelity/Info. You are ALWAYS better off feeding in the original numerical values (except in some extreme cases where the data is dominated by noise and a human expert can bin it more sensibly).
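A quick stdlib-only sketch of that fidelity loss: bin a numeric factor into coarse buckets and note that distinct values collapse into the same bucket, so no downstream model can ever recover the difference (the bucket edges below are purely illustrative).

```python
# Binning a numeric factor (e.g. P/E ratios) into coarse categories
# destroys within-bucket differences: 11.0 and 19.9 become identical.
def to_bucket(pe, edges=(10, 20, 30)):
    """Map a numeric value to a category index (illustrative edges)."""
    for i, edge in enumerate(edges):
        if pe < edge:
            return i
    return len(edges)

pes = [8.5, 11.0, 19.9, 25.0, 42.0]
buckets = [to_bucket(pe) for pe in pes]
print(buckets)  # [0, 1, 1, 2, 3] -- 11.0 and 19.9 are now indistinguishable
```

Five distinct numeric values have become four categories; the information separating 11.0 from 19.9 is gone for good.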
NET: There's no need to recode numeric data into arbitrary categories before feeding it to ML models. The main thing to remember about numeric data is that algorithms will naturally attribute a distance metric to it - so if for some reason the data is actually ordinal (i.e. the values are ordered but the differences between them are not meaningful), you are better off encoding those as Factors (e.g. Zacks Rank or Value Line score).
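A stdlib sketch of the ordinal point: fed in as a raw number, Zacks Rank 4 sits "twice as far" from zero as Rank 2 - a distance the scale never promised. One-hot (factor) encoding lets a model learn each level's effect independently instead.

```python
# Ordinal data like Zacks Rank (1 = strong buy ... 5 = strong sell) is
# ordered, but the gaps between ranks are not meaningful distances.
# Factor (one-hot) encoding avoids imposing a fake distance metric.
def one_hot(rank, levels=(1, 2, 3, 4, 5)):
    """Encode an ordinal level as indicator columns (factor encoding)."""
    return [1 if rank == lv else 0 for lv in levels]

print(one_hot(2))  # [0, 1, 0, 0, 0]
print(one_hot(4))  # [0, 0, 0, 1, 0] -- no implied "twice rank 2"
```

Each rank now contributes its own coefficient/split, rather than being forced onto a linear scale.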
Where LLMs can potentially be useful - since they are basically "complete the sentence/thought" engines - is on text/language inputs like Twitter/Reddit posts or filings commentary, to see whether they produce better extracted features, which can then be processed as factors.
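To make the shape of that pipeline concrete (text in, numeric factor out), here is a minimal sketch. The keyword scorer below is only a placeholder standing in for a real LLM sentiment/embedding call - the word lists and the [-1, 1] scale are my assumptions, not anything from the post.

```python
# Placeholder for an LLM feature extractor: in practice you'd call a
# sentiment or embedding model; a keyword count stands in here so the
# pipeline shape (raw text -> numeric factor) is clear.
POSITIVE = {"beat", "growth", "record", "strong"}   # illustrative lists
NEGATIVE = {"miss", "decline", "impairment", "weak"}

def text_factor(commentary: str) -> float:
    """Turn filing/Reddit commentary into a numeric factor in [-1, 1]."""
    words = commentary.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(text_factor("record growth and strong demand"))  # 1.0
print(text_factor("guidance miss and weak margins"))   # -1.0
```

However the score is produced, the output is an ordinary numeric factor that slots into the same model as the fundamental variables.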
If you read some of the good books on Deep Learning - you will see that the authors (definitely Hinton's school of thought/disciples) self-admit that it's only in the "interpretative class" of problems like Image/Text/Thought etc. that Deep Learners are useful. Numerical/tabular problems are typically better handled through what I tend to refer to as the "Stat-MLs", i.e. RandomForest and Boosters (XGB, LGBM, CatBoost etc.) or SVMs (if you have a 2-class problem), etc.
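A hedged sketch of the Stat-ML route on tabular numeric factors, assuming scikit-learn is installed; the factor matrix and labels below are toy values invented for illustration, not real data.

```python
# "Stat-ML" on tabular numeric factors: raw values in, no binning.
# Assumes scikit-learn is available; all numbers below are toy data.
from sklearn.ensemble import RandomForestClassifier

# Rows = stocks, columns = numeric factors (e.g. value, momentum, quality).
X = [[0.8,  1.2, 0.3],
     [0.1, -0.5, 0.9],
     [0.9,  1.1, 0.2],
     [0.0, -0.4, 1.0]]
y = [1, 0, 1, 0]  # toy labels: outperform / underperform

model = RandomForestClassifier(n_estimators=25, random_state=0)
model.fit(X, y)
print(model.predict([[0.85, 1.15, 0.25]]))  # resembles the "1" rows
```

The same `fit`/`predict` shape applies to XGB/LGBM/CatBoost; the point is that the raw numeric factors go in untouched.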
One novel use of CNNs could be to feed in, say, StockCharts images of price history and try to predict forward - i.e. total pattern classification. [Basically what all the CandleStick approaches try to do with numerical rules.] Very likely it will only help prove the futility of the exercise.
But there have been a lot of attempts to use LSTMs (a specific class of recurrent neural networks built for sequence data) to do this with numerical inputs directly.
The list of the variables tried tells the story by itself:
(a) The majority of the variables are ANNUAL - which almost guarantees the stock-performance outcome being predicted is 1-2 years out
(b) One of the key attributes of MLs is the "learning" capability - if the data is not getting updated, the model goes stale
(c) Otherwise, this sort of exercise boils down to:
"Let me see if there's some SUPER COMPLEX non-linear representation which can fit the data I am feeding and beat a simpler model." The outcome becomes obvious - these models typically won't generalize going forward; it's just an extreme case of Data Mining.
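A tiny stdlib illustration of that data-mining point: the most "SUPER COMPLEX" model possible is one that memorizes the training sample outright. On pure noise it scores a perfect in-sample fit and is useless out of sample - exactly the failure mode of an over-fit non-linear representation.

```python
import random

random.seed(0)
# "Returns" here are pure noise: there is nothing real to learn.
train = [(random.random(), random.choice([-1, 1])) for _ in range(50)]
test  = [(random.random(), random.choice([-1, 1])) for _ in range(50)]

# The "SUPER COMPLEX" model: a lookup table that memorizes training data.
lookup = dict(train)
def complex_model(x):
    return lookup.get(x, 0)   # no opinion on anything it hasn't seen

train_hits = sum(complex_model(x) == y for x, y in train)
test_hits  = sum(complex_model(x) == y for x, y in test)
print(train_hits, test_hits)  # perfect in-sample, nothing out-of-sample
```

A real over-parameterized fit fails less theatrically, but the mechanism is the same: in-sample fit quality says nothing about forward generalization.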
Hope this helps!
Best