• I am Australian and applied to all of the major quant firms in OCE for their summer internships (Dec to Feb). I was wondering if I could (or if anyone has tried to) apply to the same firms again but for their Amsterdam/US/UK summer internship cycle (June to August)? Specifically looking at IMC, Optiver, SIG here.

    Also, in case anyone asks, yes, firms in Amsterdam, UK and (maybe but not sure yet) US hire from AU.

  • # The Situation

    I’ve been wrestling with a messy freeform text dataset using BERTopic for the past few weeks, and I’m at the point of crowdsourcing solutions.

    The core issue is a pretty classic garbage-in, garbage-out situation: the input set consists of only 12.5k records of loosely structured, freeform comments, usually from internal company agents or reviewers. Around 40% of the records include copy/pasted questionnaires, which vary by department and are inconsistently pasted into the text field by the agent. The questionnaires are prevalent enough, however, to strongly dominate the embedding space due to repeated word structures and identical phrasing.

    This leads to severe collinearity, reinforcing patterns that aren’t semantically meaningful. BERTopic naturally treats these recurring forms as important features, which muddies topic resolution.
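
    For context on what “dominating” looks like in practice, here’s the kind of line-frequency check that can surface the pasted questionnaire text before modeling. This is a minimal sketch, not my actual pipeline; `records` (the raw comment strings) and the 5% cutoff are illustrative:

    ```python
    from collections import Counter

    def find_boilerplate_lines(records, min_share=0.05):
        """Return lines that recur near-verbatim in at least `min_share` of records."""
        counts = Counter()
        for record in records:
            # Count each distinct line once per record so a long questionnaire
            # cannot inflate its own frequency.
            counts.update({ln.strip().lower() for ln in record.splitlines() if ln.strip()})
        cutoff = len(records) * min_share
        return {line for line, n in counts.items() if n >= cutoff}
    ```

    Anything this flags is a candidate for removal as boilerplate rather than genuine freeform content.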

    ## Issues & Desired Outcomes

    ### Symptoms

    * Extremely mixed topic signals.
    * Number of topics per run ranges wildly (anywhere from 2 to 115).
    * Approx. 50–60% of records are consistently flagged as outliers.

    Topic signal coherence is issue #1; I feel like I’ll be able to explain the outliers if I can just get clearer, more consistent signals.

    There is categorical data available, but it is inconsistently correct. The only way I can think to include this information during topic analysis is through concatenation, which just introduces its own set of problems (ironically related to what I’m trying to fix): emergent topics get subdued, and the inconsistently correct entries add noise.
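
    For clarity, the concatenation I mean is just prepending the category onto each record before embedding; `categories` and `comments` are hypothetical names for the label column and the freeform text:

    ```python
    # Hypothetical names: `categories` holds the (inconsistently correct) labels,
    # `comments` the freeform text. Prepending the label lets it influence the
    # embedding, for better or worse.
    docs = [f"{category}. {comment}" for category, comment in zip(categories, comments)]
    ```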

    ### Things I’ve Tried

    * Stopword tuning: Both manual and through `vectorizer_model`. Minor improvements. (See the sketch after this list for how these vectorizer settings fit together.)
    * “Breadcrumbing” cleanup: Identified boilerplate/questionnaire language by comparing nonsensical topic keywords to source records, then removed entire boilerplate statements (statements only; no single words removed).
    * N-gram adjustment via CountVectorizer: No significant difference.
    * Text normalization: Lowercasing and converting to simple ASCII to clean up formatting inconsistencies. Helped enforce stopwords and improved model performance in conjunction with breadcrumbing.
    * Outlier reduction via BERTopic’s built-in `reduce_outliers` method.
    * Multiple embedding models: “all-mpnet-base-v2”, “all-MiniLM-L6-v2”, and some custom GPT embeddings.
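
    For reference, here’s roughly how the normalization, vectorizer, and outlier-reduction items above slot into BERTopic. The extra stopwords, parameter values, and variable names are illustrative placeholders, not my exact settings:

    ```python
    import unicodedata

    from bertopic import BERTopic
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

    def normalize(text: str) -> str:
        """Lowercase and fold to plain ASCII, per the normalization step above."""
        return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode().lower()

    docs = [normalize(r) for r in records]  # records: boilerplate-scrubbed comments

    # Base English stopwords plus manually identified boilerplate terms (placeholders).
    extra_stopwords = ["questionnaire", "please", "describe"]
    vectorizer = CountVectorizer(
        stop_words=list(ENGLISH_STOP_WORDS) + extra_stopwords,
        ngram_range=(1, 2),  # unigrams and bigrams
        min_df=5,            # drop tokens appearing in fewer than 5 docs
    )

    topic_model = BERTopic(
        embedding_model="all-mpnet-base-v2",  # any sentence-transformers model name
        vectorizer_model=vectorizer,
    )
    topics, probs = topic_model.fit_transform(docs)

    # Built-in outlier reduction: reassigns -1 documents to their nearest topic.
    topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
    ```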

    ### HDBSCAN Tuning

    I attempted to tune HDBSCAN through two primary means.

    1. Manual tuning via Topic Tuner – Tried a range of `min_cluster_size` and `min_samples` combinations, using sparse, dense, and random search patterns. No stable or interpretable pattern emerged; results were all over the place.
    2. Brute-force Monte Carlo – Ran simulations across a broad grid of HDBSCAN parameters and measured the number of topics and outlier counts. This confirmed that the distribution of topic outputs is highly multimodal, but it did give me baseline expectations for topic and outlier counts, so I at least know what a typical run looks like. (A stripped-down version of the sweep follows below.)
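
    Here is a sketch of that sweep, assuming `embeddings` are already computed (e.g. UMAP-reduced) so each HDBSCAN fit is cheap; the ranges and draw count are illustrative, not the grid I actually ran:

    ```python
    import random

    import numpy as np
    from hdbscan import HDBSCAN

    results = []
    for _ in range(200):  # number of random draws; the real sweep was broader
        mcs = random.randint(5, 200)  # min_cluster_size
        ms = random.randint(1, mcs)   # min_samples, capped at min_cluster_size
        labels = HDBSCAN(min_cluster_size=mcs, min_samples=ms).fit_predict(embeddings)
        n_topics = len(set(labels)) - (1 if -1 in labels else 0)  # -1 = outliers
        n_outliers = int(np.sum(labels == -1))
        results.append({"mcs": mcs, "ms": ms, "topics": n_topics, "outliers": n_outliers})

    # A histogram of the "topics" column across runs makes the multimodality obvious.
    ```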

    ### A Few Other Failures

    * Attempted to stratify the data by department and model each subset, letting BERTopic omit the problem words based on their prevalence – the resultant sets were too small to model on.
    * Attempted to segment the data via department and scrub out the messy freeform text, with the intent of re-combining and then modeling – this was unsuccessful as well.

    ## Next Steps?

    At this point, I’m leaning toward preprocessing the entire dataset through an LLM before modeling, to summarize or at least normalize the input records and reduce variance. But I’m curious:

    Is there anything else I could try before handing the problem off to an LLM?

    EDIT – A SOLUTION:

    We eventually got approval to move forward with an LLM pre-processing step, which worked very well. We used 4o-mini and wrote the prompt to gather only the facts and intent of each record. My colleague suggested adding the instruction (paraphrasing) “If any question/answer pairs exist, include information from the answers to support your response,” which worked exceptionally well.
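
    For reference, a minimal sketch of the pre-processing pass, assuming the standard OpenAI Python client and that `records` holds the raw comment strings; the prompt text paraphrases what we actually ran:

    ```python
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = (
        "Summarize the record, gathering only the facts and intent. "
        "If any question/answer pairs exist, include information from "
        "the answers to support your response."
    )

    def summarize(record: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": record},
            ],
        )
        return response.choices[0].message.content

    cleaned = [summarize(r) for r in records]
    ```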

    We wrote an evaluation prompt to help assess whether any egregious factual errors existed across a random sample of 1k records – none were indicated. We then went through these by hand to verify, and none were found.
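
    The spot-check looked roughly like this, continuing the hypothetical variable names from the sketch above (`records`, `cleaned`); again, the evaluation prompt is a paraphrase:

    ```python
    import random

    from openai import OpenAI

    client = OpenAI()

    EVAL_PROMPT = (
        "Compare the ORIGINAL record with its SUMMARY. If the summary contains "
        "an egregious factual error, reply FAIL and explain; otherwise reply PASS."
    )

    indices = random.sample(range(len(records)), k=1000)
    flagged = []
    for i in indices:
        response = client.chat.completions.create(
            model="gpt-4o",  # the larger model handled the eval reasoning better
            messages=[
                {"role": "system", "content": EVAL_PROMPT},
                {"role": "user",
                 "content": f"ORIGINAL:\n{records[i]}\n\nSUMMARY:\n{cleaned[i]}"},
            ],
        )
        if response.choices[0].message.content.strip().startswith("FAIL"):
            flagged.append(i)
    ```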

    Of note: I believe this may be a strong case for using 4o-mini. We sampled the results in 4o with the same prompt and saw very little difference; given the nature of the prompt, I think that’s to be expected. Cost and latency were much lower with 4o-mini – an added bonus. We saw far more variation between 4o and 4o-mini on the evaluation prompt: 4o was more succinct and could reason its way to “no significant problems” more easily. That was helpful in the final evaluation, but for the full pipeline 4o-mini is a great fit for this use case.

  • I just launched a new B2B website (pharma niche), and I’m starting SEO from scratch. No backlinks or traffic yet.

    Right now, I’m:

    * Writing content for long-tail keywords
    * Keeping posts clear and helpful
    * Skipping backlinks for now — just focusing on content

    What helped you most in the first 1–2 months of starting SEO?
    Would love to learn from your early experiences.

  • I should begin by disclosing that I hold a very respectable position in ETH and BTC and started stacking ETH at around $2,200. I’ve always been a momentum investor, and right now, after a decent due-diligence dive, I’m confident that both BTC and ETH are going to make some exceptional gains.

    Yeah, the tech is cool, but if you step back for a few minutes you might be able to see just how dangerous it is for our personal freedoms – ETH more so than BTC. Over the past six months, an unprecedented amount of our personal data that has always been segmented into individual silos has been correlated into single, very broad-reaching individual dossiers. I’m talking about everything: biometrics, spending, online conduct including social media, employment, voting, civil conduct… absolutely EVERYTHING. The uprising of the “Tech Bros” is a very real thing. For those who aren’t aware, I’d invite you to do some research on Peter Thiel and Palantir. This is the company whose stock shot up from $12 to $145 in just months – all because they are doing, under Federal contracts, exactly what I’ve described.

    Now, not only are we about to surrender our money to the “Tech Bros,” they also have an extremely friendly government behind them. I’ve made my entire career in tech since the age of 17; my philosophy has always been “what’s next,” and I’ve been very successful doing that, mostly hooking up with specific management teams doing start-ups. And what I’m absolutely 100% confident about is that these technologies will indeed replace our fiat currency, especially ETH and stablecoins. BTW, I’m buying as much as I possibly can, and I know you’ve watched the insane amount of corporate and institutional inflows – well, that’s for a reason. How many of you have read the GENIUS Act? That is some scary legislation. In summary, we’re handing over the reins to these companies while saying out loud, “Nah, not only will we not monitor you, we won’t protect the consumer at all. You’re free to do whatever you wish for the next 10 years.”

    I know that many will disagree but I see this as one of the most dangerous events/periods in America’s history with regards to our personal freedoms. I’m very open to hearing your opinions.

  • Easy to put together, not super expensive. Options for the Moon phase, sunrise/sunset time and UV Index, so you would be able to see if the Moon influences the timechain and when to sun your balls 😀
    Short video demo: [https://www.youtube.com/watch?v=7DtQNCBLffI](https://www.youtube.com/watch?v=7DtQNCBLffI)
    Github: [https://github.com/kovrom/circle](https://github.com/kovrom/circle)