
Generative Gear Design


This is an ongoing personal project. The goal is to fine-tune an open-source large language model to automatically generate DXF files of gears.

After scraping 11,000 data samples from KHK Gears, I am now in the fine-tuning stage.

Dataset


GitHub 🤗 Prompted 🤗 Unstructured
To the best of my knowledge, this is the first DXF dataset on Hugging Face.

Prompted: Structured input-output data designed to train a language model.

Unstructured: Dictionary-DXF pairs where the dictionary items quantify the gear's geometric construction. An illustrative record is sketched below.
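
For a concrete picture, here is what one unstructured record might look like. The field names and values are hypothetical, chosen to mirror typical gear specifications rather than the dataset's actual schema:

```python
# Hypothetical example of one unstructured record: a dictionary of gear
# specifications paired with the text of its DXF drawing. Field names are
# illustrative, not the dataset's real schema.
record = {
    "specs": {
        "module": 1.5,           # tooth size (mm)
        "num_teeth": 40,         # number of teeth
        "pressure_angle": 20.0,  # degrees
        "bore_diameter": 12.0,   # mm
    },
    "dxf": "0\nSECTION\n2\nENTITIES\n...",  # truncated DXF text
}
```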

Compression of DXF Files

Each gear in the KHK collection contains quantitative specifications about its geometric construction—information that nearly uniquely determines each gear. Along with this information is a downloadable DXF document of the associated gear. I scraped all of this data using the BeautifulSoup and Selenium libraries.
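
As a rough sketch of that pipeline: Selenium renders each product page, and BeautifulSoup pulls out the spec table and the DXF download link. The CSS selectors below are placeholders, not KHK's actual markup:

```python
# Sketch of the scraping loop: Selenium renders each product page,
# BeautifulSoup extracts the spec table and the DXF download link.
# The selectors here are placeholders, not KHK's real markup.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()

def scrape_gear(url: str) -> dict:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # Collect spec-name/value pairs from the (hypothetical) spec table.
    specs = {}
    for row in soup.select("table.specs tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) == 2:
            specs[cells[0]] = cells[1]

    # Grab the DXF download link (hypothetical selector).
    link = soup.select_one("a[href$='.dxf']")
    return {"specs": specs, "dxf_url": link["href"] if link else None}
```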

One massive (literally) issue: the DXF files contain an enormous amount of text, a corpus long enough to exceed the context window of the open-source models that I (and my hardware) can run, let alone train.

Upshot: Most of the information in the DXF files is common across gears and/or could be easily reconstructed algorithmically after generating the unique gear-specific details.

Solution: I automated the pruning of every DXF file in my dataset, retaining only the essential information.
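
Because DXF is a line-oriented format of alternating group codes and values, this pruning can be done with plain text processing. A rough sketch, assuming the gear-specific geometry lives in the ENTITIES section (the real script keeps whatever subset proved essential):

```python
# Rough sketch of DXF pruning: keep only the ENTITIES section, which holds
# the gear-specific geometry, and drop headers, tables, and other sections
# that are shared across gears or reconstructible algorithmically.
def prune_dxf(text: str) -> str:
    lines = text.splitlines()
    kept, inside = [], False
    for i, line in enumerate(lines):
        # A section opens with "0 / SECTION" followed by "2 / <NAME>".
        if line.strip() == "SECTION" and i + 2 < len(lines):
            inside = lines[i + 2].strip() == "ENTITIES"
        if inside:
            kept.append(line)
        if line.strip() == "ENDSEC":
            inside = False
    return "\n".join(kept)
```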

Each DXF file in the original dataset contained, on average, 2665 lines. This was reduced to an average of 965 lines, a reduction of roughly 64%.

Model Training


GitHub
There are several training methodologies that I have attempted:

Fine-tuning: Training an existing language model to produce DXF code.

Training from Scratch: Exploiting a truncated vocabulary and structured outputs to train a small autoregressive transformer (a sketch follows below).
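
A minimal sketch of the from-scratch idea: since pruned DXF text draws on a small character set, a character-level vocabulary stays tiny, and a small autoregressive transformer can model it. All sizes below are illustrative, not the actual configuration:

```python
# Minimal sketch of a small autoregressive transformer over a truncated,
# character-level vocabulary, e.g. vocab = sorted(set(corpus_text)).
# Dimensions are illustrative, not the actual configuration.
import torch
import torch.nn as nn

class TinyDXFTransformer(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 128,
                 n_heads: int = 4, n_layers: int = 4, max_len: int = 1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        seq_len = ids.size(1)
        pos = torch.arange(seq_len, device=ids.device)
        x = self.tok(ids) + self.pos(pos)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(ids.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # next-token logits
```

With a vocabulary of a few dozen characters, the embedding and output layers stay small, which is the main appeal under tight compute.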

Unfortunately, I am GPU poor, which is to say severely computationally limited. Sacrificing context window, model size, and batch size is necessary for back-propagation to be feasible, and these sacrifices result in truly abysmal test performance.

I am continuing to explore methods for training a generative model that can generalize beyond the gear catalog. Seq2Seq models are one feasible option, as sketched below.
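
One way the Seq2Seq route could look: an encoder-decoder model mapping the serialized spec dictionary to pruned DXF text. The checkpoint below is only an example, not a model I have committed to, and untuned it would of course emit nonsense; it would first be fine-tuned on the prompted pairs:

```python
# Sketch of the seq2seq idea: encode the serialized spec dictionary,
# decode pruned DXF text. "t5-small" is just an example checkpoint;
# the input string is illustrative.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

spec = "module: 1.5, teeth: 40, pressure angle: 20"
inputs = tokenizer(spec, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```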

Check back soon for more progress!