Trend Engine: AI-Powered News & Trends
Where AI Meets What's Trending
-
# The Situation
I’ve been wrestling with a messy freeform text dataset using BERTopic for the past few weeks, and I’m at the point of crowdsourcing solutions.
The core issue is a pretty classic garbage-in, garbage-out situation: the input set consists of only 12.5k records of loosely structured, freeform comments, usually from internal company agents or reviewers. Around 40% of the records include copy/pasted questionnaires, which vary by department and are inconsistently pasted into the text field by the agent. The questionnaires are prevalent enough, however, to strongly dominate the embedding space due to repeated word structures and identical phrasing.
This leads to severe collinearity, reinforcing patterns that aren’t semantically meaningful. BERTopic naturally treats these recurring forms as important features, which muddies topic resolution.
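One generic pre-pass for this kind of questionnaire dominance (not from the original post; the `min_doc_frac` threshold is purely illustrative) is to drop any sentence that recurs across a large fraction of records before embedding:

```python
import re
from collections import Counter

def strip_boilerplate(docs, min_doc_frac=0.2):
    """Remove sentences that appear in >= min_doc_frac of documents.

    Whole repeated sentences (questionnaire prompts, form headers) are
    dropped; single words are never touched, mirroring the
    statements-only cleanup described above.
    """
    sent_docs = Counter()
    split_docs = []
    for d in docs:
        sents = [s.strip() for s in re.split(r"[.\n?]+", d) if s.strip()]
        split_docs.append(sents)
        for s in set(sents):          # count each sentence once per doc
            sent_docs[s] += 1
    threshold = max(2, int(len(docs) * min_doc_frac))
    boiler = {s for s, n in sent_docs.items() if n >= threshold}
    cleaned = [". ".join(s for s in sents if s not in boiler)
               for sents in split_docs]
    return cleaned, boiler
```

Unlike keyword-driven breadcrumbing, this catches boilerplate by repetition alone, so it also finds questionnaire fragments that never surface as topic keywords.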
## Issues & Desired Outcomes
### Symptoms
* Extremely mixed topic signals.
* Number of topics per run ranges wildly (anywhere from 2 to 115).
* Approx. 50–60% of records are consistently flagged as outliers.

Topic signal coherence is issue #1; I feel like I’ll be able to explain the outliers if I can just get clearer, more consistent signals.
There is categorical data available, but it is inconsistently correct. The only way I can think to include this information during topic analysis is through concatenation, which introduces its own set of problems (ironically related to what I’m trying to fix). The result is that emergent topics are subdued and noise is added due to the inconsistency of correct entries.
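An alternative to concatenation (not tried in the original post) is BERTopic’s semi-supervised mode: `fit_transform` accepts a `y` array of integer labels, with `-1` for unlabeled documents, so untrusted category entries can simply be left unlabeled rather than injected into the text. A minimal sketch of the label-encoding step, with a hypothetical canonical mapping standing in for real department names:

```python
import re

# Hypothetical canonicalization table; a real one would be built from the data.
CANON = {
    "cust svc": "customer_service",
    "customer service": "customer_service",
    "it": "it",
    "i.t.": "it",
}

def encode_departments(raw_labels):
    """Map noisy department strings to int codes; -1 marks untrusted/unknown.

    The resulting list can be passed to BERTopic as
    fit_transform(docs, y=codes) for semi-supervised topic modeling,
    so inconsistent entries (-1) fall back to unsupervised behavior
    instead of adding noise via text concatenation.
    """
    codes, vocab = [], {}
    for raw in raw_labels:
        key = re.sub(r"\s+", " ", (raw or "").strip().lower())
        canon = CANON.get(key)
        if canon is None:
            codes.append(-1)  # unknown or inconsistent entry
        else:
            codes.append(vocab.setdefault(canon, len(vocab)))
    return codes
```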
### Things I’ve Tried
* Stopword tuning: Both manual and through vectorizer\_model. Minor improvements.
* “Breadcrumbing” cleanup: Identified boilerplate/questionnaire language by comparing nonsensical topic keywords to source records, then removed entire boilerplate statements (statements only; no single words removed).
* N-gram adjustment via CountVectorizer: No significant difference.
* Text normalization: Lowercasing and converting to simple ASCII to clean up formatting inconsistencies. Helped enforce stopwords and improved model performance in conjunction with breadcrumbing.
* Outlier reduction via BERTopic’s built-in method.
* Multiple embedding models: “all-mpnet-base-v2”, “all-MiniLM-L6-v2”, and some custom GPT embeddings.

### HDBSCAN Tuning
I attempted to tune HDBSCAN through two primary means.
1. Manual tuning via Topic Tuner – Tried a range of min\_cluster\_size and min\_samples combinations, using sparse, dense, and random search patterns. No stable or interpretable pattern emerged; results were all over the place.
2. Brute-force Monte Carlo – Ran simulations across a broad grid of HDBSCAN parameters and measured the number of topics and outlier counts. This confirmed that the distribution of topic outputs is highly multimodal. It also gave me rough expectations for topic and outlier counts, which at least told me what to expect on any given run.

### A Few Other Failures
* Attempted to stratify the data by department and model each subset, letting BERTopic omit the problem words based on their prevalence – the resultant sets were too small to model on.
* Attempted to segment the data by department and scrub out the messy freeform text, with the intent of re-combining and then modeling – this was unsuccessful as well.

## Next Steps?
At this point, I’m leaning toward preprocessing the entire dataset through an LLM before modeling, to summarize or at least normalize the input records and reduce variance. But I’m curious:
Is there anything else I could try before handing the problem off to an LLM?
EDIT – A SOLUTION:
We eventually got approval to move forward with an LLM pre-processing step, which worked very well. We used 4o-mini with a prompt instructing it to gather only the facts and intent of each record. My colleague suggested adding the instruction (paraphrasing) “If any question/answer pairs exist, include information from the answers to support your response,” which worked exceptionally well.
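A minimal sketch of such a pre-processing pass, assuming the standard OpenAI Python client; the system prompt paraphrases the instructions described above and is illustrative, not the original team’s exact wording:

```python
SYSTEM_PROMPT = (
    "Summarize the record into only the facts and the author's intent. "
    "If any question/answer pairs exist, include information from the "
    "answers to support your response."
)

def build_messages(record: str) -> list:
    # One chat request per record; the system prompt carries the rules.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": record},
    ]

def summarize_records(records, model="gpt-4o-mini"):
    """Send each record through the extraction prompt; returns summaries.

    Requires the openai package and OPENAI_API_KEY; imported lazily so
    the prompt-building above stays testable offline.
    """
    from openai import OpenAI
    client = OpenAI()
    out = []
    for record in records:
        resp = client.chat.completions.create(
            model=model,
            messages=build_messages(record),
            temperature=0,  # favor deterministic, extractive output
        )
        out.append(resp.choices[0].message.content)
    return out
```

The normalized summaries then feed into BERTopic in place of the raw records, collapsing the questionnaire variance before embedding.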
We wrote an evaluation prompt to help assess if any egregious factual errors existed across a random sample of 1k records – none were indicated. We then went through these by hand to verify, and none were found.
Of note: I believe this may be a strong case for the use of 4o-mini. We sampled the results in 4o with the same prompt and saw very little difference; given the nature of the prompt, I think this is very expected. Latency and cost were much lower with 4o-mini – an added bonus. We saw far more variation in the evaluation prompt between 4o and 4o-mini: 4o was more succinct and able to reason “no significant problems” more easily. This was helpful in the final evaluation, but for the full pipeline, 4o-mini is a great fit for this use case.
-
* A lot of potential, but ultimately disappointing right now
* It completed the first task I gave it decently (taking a list of 200 companies I found on a Forbes link spread across five pages and putting them into a spreadsheet), especially compared with Deep Research, which I tried to get to do the same task yesterday and which failed miserably. However, even though the agent ultimately completed the task, it stopped working several times due to context limits and confusion, and had to be re-prompted.
* Continuing on from the above task, I then asked it to find the LinkedIn links for every company and put them in a new column in the spreadsheet. ~~Again, it achieved this pretty admirably but it stopped several times and needed to be told to “continue”.~~ **EDIT -** I just looked at the spreadsheet and it didn’t actually complete the task. It stopped halfway through, leaving half of the spreadsheet entries without a LinkedIn link.
* It appears that Agent can’t open and read PDF documents when linked on a webpage. It will click the link, but the tab it opens up in its browser is blank.
* I tried to ask it to complete several steps on a website that involved clicking on different links and putting some documents into different “stages”. It followed the first part of my instructions, but completely ignored the second part. **I try to prompt it very explicitly, just like I’m explaining to a person. Maybe this is not the right approach?**
* The “browsing context” limit appears to be really short. Maybe that’s common knowledge for everyone else; I’m not a power user, so I haven’t come up against this problem before. I tried an experiment where I asked the agent to log into my grocery store account, look at all my purchases from 2025, dedupe them, and put them into a spreadsheet. It did decently from a technical standpoint (clicking around on the right things, putting data into a spreadsheet in the correct format, etc.), but it gave up far before completing the task due to running out of browser context.

I haven’t found any task yet that I could just “set and forget” like in the OpenAI videos. Every task needed to be babysat from afar just in case it stopped halfway through (which each one did).
As I said at the beginning, there is a ton of potential here, and I’m going to keep testing. It was exciting to see it complete the one task successfully, and attempt to complete the others.
**Is anyone else coming up against the browser context limit?**
**Has anyone else been able to get it to open and read PDFs by clicking on a link in a browser?**
-
[Million-unit AI robot army no longer a dream: Analyzing Foxconn’s three-pronged strategy](https://www.digitimes.com/news/a20250721PD203/foxconn-ai-robot-robotics-production.html?)
[TSMC Reportedly Eyes 10-Year Boom from Humanoids, Backed by NVIDIA Jetson and Tesla’s Chips](https://www.trendforce.com/news/2025/06/27/news-tsmc-reportedly-eyes-10-year-boom-from-humanoids-backed-by-nvidia-jetson-and-teslas-ai-chips/)
-
Hello,
I day traded successfully in 2024 (always cashing out before market close). I was making 2k USD+ per trading day for about seven months consistently, causing my ego to balloon – I’d finally figured it out after years of learning the stock market. It didn’t matter if the market went up or down; I was just green by end of day. Hence, I felt invincible and untouchable. I even sent a nice resignation letter to my previous job.
Until…
I tilted one day, lost to my emotions, broke pretty much all my rules, and did the unspeakable, forbidden no-no. I went… yolo. I was simply like Icarus.
Good thing I’m always on cash accounts. In a nutshell, my finances basically ended up like your average Joe Schmuck’s.
Unfortunately, I couldn’t trade for a while after that blow because my strategy requires significant capital to safely execute.
But after a year, I’m closer to my ideal capital again to execute my strategy.
But this time.
I’m trying to take the emotion out of the equation. Hence, algo trading. What I learned from that experience is that my worst enemy is myself.
I have fullstack knowledge in web dev. Enough to build my own web apps and launch them.
Here’s the setup I’m thinking of. Forgive me, as I’ve never done algo trading before – only manual day trading (specifically scalp trades, 250+ trades per day).
– I’m thinking of building my own private web app that communicates with a broker over a REST API. The broker would send market data on a specific stock (ideally in JSON), especially option ask/bid prices, and my web app would respond (also ideally in JSON) to execute trades.
So I’m looking for a broker that accommodates that kind of trading, even if there are monthly or data fees involved. A Canadian or US broker is preferred. I’ve been a user of Questrade. I just need broker names, and I will start from that direction.
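For what it’s worth, the loop described above is sketched below against a hypothetical REST API: the base URL, endpoint paths, and JSON field names are all invented for illustration, and every broker that offers an API (e.g. Alpaca, Interactive Brokers, Tradier) shapes these differently, so they must be replaced with the real broker’s documented endpoints:

```python
import json
import urllib.request

BASE = "https://api.example-broker.com/v1"  # hypothetical base URL

def build_order(symbol, qty, side, limit_price):
    # Payload shape is hypothetical; match your broker's order schema.
    return {"symbol": symbol, "qty": qty, "side": side,
            "type": "limit", "limit_price": limit_price}

def get_quote(symbol, token):
    """Fetch bid/ask for a symbol as JSON (hypothetical endpoint)."""
    req = urllib.request.Request(
        f"{BASE}/quotes/{symbol}",
        headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def place_order(order, token):
    """POST a limit order as JSON (hypothetical endpoint)."""
    req = urllib.request.Request(
        f"{BASE}/orders",
        data=json.dumps(order).encode(),
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For scalping at 250+ trades/day, REST polling may be too slow; most API brokers also offer a streaming (WebSocket) feed for quotes, which is worth asking about alongside the fees.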
Thanks in advance.