It’s a 23M parameter model based on the Llama 3 architecture and plays at around 1400 Elo.
-
Hey guys, my husband is turning 40 next year and I would like to organize us a nice trip to Portugal in spring as a special treat. Nothing too long, 4-5 nights tops. Any recommendations outside of Lisbon and Porto? We visited both together already (personally loved Porto, Lisbon not so much, liked Cascais or Sintra way more) and also he has been to Lisbon many times before. No preferences other than the fact that the place(s) would need to be available via public transport, as we won’t be renting a car. Any nice ideas? 🙂
-

Creating an environment that values learning and growth is beneficial. Encouraging continual education through workshops, online courses, and mentorship opportunities can help cultivate a team that’s enthusiastic and committed. Engage in regular feedback sessions with new graduates, focusing on strengths and areas for growth. This not only helps them develop professionally but also strengthens your relationship as a manager, reinforcing their role as a valued team member.
By implementing these strategies, you not only enhance your team’s function but also promote a positive, productive work environment. Remember, the goal is to guide new graduates, unleashing their potential while ensuring your team remains cohesive and effective. Whether it’s through structured guidance or fostering a culture of openness, managing new graduates can lead to rewarding team dynamics and success.
-
The core issue is a pretty classic garbage-in, garbage-out situation: the input set consists of only 12.5k records of loosely structured, freeform comments, usually from internal company agents or reviewers. Around 40% of the records include copy/pasted questionnaires, which vary by department and are inconsistently pasted into the text field by the agent. The questionnaires are prevalent enough, however, to strongly dominate the embedding space due to repeated word structures and identical phrasing.
This leads to severe collinearity, reinforcing patterns that aren’t semantically meaningful. BERTopic naturally treats these recurring forms as important features, which muddies topic resolution.
## Issues & Desired Outcomes
### Symptoms
* Extremely mixed topic signals.
* Number of topics per run ranges wildly (anywhere from 2 to 115).
* Approx. 50–60% of records are consistently flagged as outliers.

Topic signal coherence is issue #1; I feel like I’ll be able to explain the outliers if I can just get clearer, more consistent signals.
There is categorical data available, but it is inconsistently correct. The only way I can think of to include this information during topic analysis is through concatenation, which just introduces its own set of problems (ironically related to what I’m trying to fix). The result is that emergent topics are subdued and noise gets added due to the inconsistency of correct entries.
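For clarity, the concatenation approach mentioned above amounts to something like this (the field names are hypothetical): prepend the categorical label to the freeform text so it participates in the embedding — which is also why a wrong label injects noise directly into the document.

```python
# Minimal sketch of concatenating a (possibly wrong) categorical field onto
# the freeform comment before embedding. Field names are hypothetical.
def with_department(record: dict) -> str:
    """Join the department tag onto the comment text as a single document."""
    dept = (record.get("department") or "unknown").strip().lower()
    return f"department: {dept}. {record['comment']}"

row = {"department": "Billing", "comment": "Customer disputes a duplicate charge."}
print(with_department(row))
# department: billing. Customer disputes a duplicate charge.
```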
### Things I’ve Tried
* Stopword tuning: Both manual and through `vectorizer_model`. Minor improvements.
* “Breadcrumbing” cleanup: Identified boilerplate/questionnaire language by comparing nonsensical topic keywords to source records, then removed entire boilerplate statements (statements only; no single words removed).
* N-gram adjustment via CountVectorizer: No significant difference.
* Text normalization: Lowercasing and converting to simple ASCII to clean up formatting inconsistencies. Helped enforce stopwords and improved model performance in conjunction with breadcrumbing.
* Outlier reduction via BERTopic’s built-in method.
* Multiple embedding models: “all-mpnet-base-v2”, “all-MiniLM-L6-v2”, and some custom GPT embeddings.

### HDBSCAN Tuning
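Both tuning routes described below amount to sweeping HDBSCAN’s two main knobs and watching topic/outlier counts. A minimal sketch of such a sweep (the ranges are placeholders, not values I used, and the actual BERTopic/HDBSCAN fit is stubbed out as comments):

```python
# Sketch of a random parameter sweep over HDBSCAN's two main knobs.
# Ranges are illustrative placeholders.
import random

def sample_params(n_runs=50, seed=0):
    """Yield random (min_cluster_size, min_samples) draws for a sweep."""
    rng = random.Random(seed)
    for _ in range(n_runs):
        min_cluster_size = rng.randint(5, 100)
        yield {
            "min_cluster_size": min_cluster_size,
            # hdbscan defaults min_samples to min_cluster_size; draws here
            # are kept at or below it.
            "min_samples": rng.randint(1, min_cluster_size),
        }

# Each draw would then drive one fit, roughly:
# model = BERTopic(hdbscan_model=HDBSCAN(**params, prediction_data=True))
# topics, _ = model.fit_transform(docs)
# n_topics = len(set(topics)) - (1 if -1 in topics else 0)
# n_outliers = sum(t == -1 for t in topics)
```

Plotting topic and outlier counts per draw is what exposed the multimodal distribution mentioned in attempt 2.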
I attempted tuning HDBSCAN through two primary means.
1. Manual tuning via Topic Tuner – Tried a range of `min_cluster_size` and `min_samples` combinations, using sparse, dense, and random search patterns. No stable or interpretable pattern emerged; results were all over the place.
2. Brute-force Monte Carlo – Ran simulations across a broad grid of HDBSCAN parameters and measured the number of topics and outlier counts. This confirmed that the distribution of topic outputs is highly multimodal. I was able to garner some expectations of topic and outlier counts out of this method, which at least told me what to expect on any given run.

### A Few Other Failures
* Attempted to stratify the data by department and model each subset, letting BERTopic omit the problem words based on their prevalence – the resultant sets were too small to model on.
* Attempted to segment the data by department and scrub out the messy freeform text, with the intent of re-combining and then modeling – this was unsuccessful as well.

## Next Steps?
At this point, I’m leaning toward preprocessing the entire dataset through an LLM before modeling, to summarize or at least normalize the input records and reduce variance. But I’m curious:
Is there anything else I could try before handing the problem off to an LLM?
EDIT – A SOLUTION:
We eventually got approval to move forward with an LLM preprocessing step, which worked very well. We used 4o-mini and instructed it via the prompt to gather only the facts and intent of each record. My colleague suggested adding the instruction (paraphrasing) “If any question/answer pairs exist, include information from the answers to support your response,” which worked exceptionally well.
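Roughly, the preprocessing step looked like the sketch below. The prompt wording and helper names are my paraphrase, not the exact production prompt:

```python
# Sketch of the LLM preprocessing step. Prompt text and names are
# paraphrased assumptions, not the exact prompt we ran.
SYSTEM_PROMPT = (
    "Summarize the record, keeping only the facts and the intent. "
    "If any question/answer pairs exist, include information from the "
    "answers to support your response."
)

def build_messages(record: str) -> list:
    """Build the chat payload for one freeform record."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": record},
    ]

# With the official OpenAI client, each record would then be summarized
# roughly like this:
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini", messages=build_messages(record)
# )
# summary = resp.choices[0].message.content
```

The normalized summaries, rather than the raw comments, then became the input documents for BERTopic.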
We wrote an evaluation prompt to help assess if any egregious factual errors existed across a random sample of 1k records – none were indicated. We then went through these by hand to verify, and none were found.
Of note: I believe this may be a strong case for the use of 4o-mini. We sampled the results in 4o with the same prompt and saw very little difference; given the nature of the prompt, I think this is expected. 4o-mini was also faster and much cheaper – an added bonus. We saw far more variation between 4o and 4o-mini on the evaluation prompt: 4o was more succinct and able to reason its way to “no significant problems” more easily. This was helpful in the final evaluation, but for the full pipeline 4o-mini is a great fit for this use case.
-
I’m also an accent coach and a speechwork professional working with actors, so I’m well versed in phonetics, prosody, and speech in general. Is there any good master’s degree in Europe where I can study this?
Also, what kinds of jobs would be suitable for this specialty within speech technology? Is there work in this field nowadays? I would love to work on something related to accents or dialects (maybe identifying different accents or building accent models for AI). Is that realistic?
Thanks!
-
**Final Thoughts**
Selling a large portfolio on Coinbase doesn’t have to be stressful. By understanding the process, preparing your account, and staying mindful of fees, you can ensure a smooth transition from crypto to dollars. Remember, informed decisions will save you time and money in the end. Whether you’re a seasoned trader or a crypto newbie, being strategic about your transactions will always pay off. Happy trading!