The rapid advancements in large language model (LLM) coding assistants have sparked debates about the future of coding, with some even predicting doomsday for coders. But at Heroku, we see a different trend: LLMs are democratizing coding, so much so that we are beginning to see a wave of novice software developers jumping into the business.
This influx of programmers presents some unique challenges for AI assistants. Experienced developers, for example, are better equipped to notice when an assistant is leading them down the wrong coding path, whereas junior developers might fail to notice that something is amiss.
Additionally, an LLM’s code quality can vary greatly depending on language choice and application area. Using an assistant to write a Python or Java backend is usually not a problem; asking it to write Zig or a newer language like Mojo, however, can end in disappointment. For example, when asked how to install Python packages, an assistant will direct the user to pip or conda, since those are the most common tools in its training data. Meanwhile, developers have of late been preferring Poetry for dependency management and packaging in Python, and AI tools rarely use or recommend it unless a developer asks for it specifically.
Outdated information can also be a problem, especially when languages and frameworks phase out old features, or when once-recommended libraries turn out to contain security vulnerabilities that have not yet been fixed.
Harnessing Data to Supercharge LLMs for Developers
So, how do we help assistants (and by extension, developers) avoid such snags? First, it is critical to have current data informing LLMs. While larger organizations can use their existing codebases and operational data to partially address this issue, even they rarely have complete access to comprehensive, up-to-date datasets. In fact, 62% of global workers say training AI on out-of-date data breaks their trust in the tool. Similarly, companies lack enough relevant data when introducing new languages or trying to change the status quo.
The necessary data curation can be addressed in a collaborative, community-focused manner, following patterns established by open-source projects. This does not mean we must agree on a single set of best practices. On the contrary, it makes sense for existing language, tooling and framework communities to each produce a dataset, or Framework Knowledge Base (FKB), curated specifically for their area of interest.
The optimal method for augmenting existing models with these FKBs is not yet clear, but that should not prevent anyone from producing them. Whether an FKB is pushed wholly into the context window, accessed via retrieval-augmented generation (RAG) or used for fine-tuning, having timely, accurate and relevant data that remains consistent is the best first step.
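To make the RAG route concrete, here is a minimal sketch of how an assistant might fold FKB material into its prompt, assuming the FKB is simply a directory of Markdown guides checked out locally. The directory layout and helper names are illustrative assumptions, and the keyword-overlap scoring stands in for the embedding-based retrieval a production system would use.

```python
from pathlib import Path

def load_fkb(fkb_dir: str) -> list[tuple[str, str]]:
    """Read every Markdown guide in a hypothetical FKB checkout."""
    return [(p.name, p.read_text(encoding="utf-8"))
            for p in Path(fkb_dir).glob("**/*.md")]

def retrieve(query: str, docs: list[tuple[str, str]], k: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval; a stand-in for embedding search."""
    terms = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(terms & set(d[1].lower().split())),
                    reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(query: str, fkb_dir: str) -> str:
    """Prepend the most relevant FKB excerpts to the user's question."""
    context = "\n\n---\n\n".join(retrieve(query, load_fkb(fkb_dir)))
    return f"Use the following framework guidance:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    print(build_prompt("How do I add a dependency with Poetry?", "./python-fkb"))
```

The retrieval strategy is beside the point; what matters is that once an FKB exists as plain, versioned files, wiring it into any of the three approaches becomes a straightforward engineering task.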
Framework Knowledge Bases for LLMs
How does a user interact with these FKBs? Imagine a menu for users to select from when they activate their coding assistant. The user would choose one or more FKBs tailored to a particular goal. Each FKB would include a getting-started template for the assistant to work from. In addition to the template, the FKB would include code samples, best-practice guides and recommended libraries.
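As a rough illustration of what one of those menu entries might contain, here is a hypothetical FKB manifest sketched as a Python dataclass. The field names and example values are assumptions based on the components listed above, not a published schema.

```python
from dataclasses import dataclass, field

@dataclass
class FrameworkKnowledgeBase:
    """Hypothetical manifest for one FKB a user could pick from the menu."""
    name: str                          # e.g. "Python web backend"
    getting_started_template: str      # path or URL of a starter project
    code_samples: list[str] = field(default_factory=list)
    best_practice_guides: list[str] = field(default_factory=list)
    recommended_libraries: dict[str, str] = field(default_factory=dict)  # name -> version

# A user activating their assistant might select one or more of these:
python_web_fkb = FrameworkKnowledgeBase(
    name="Python web backend",
    getting_started_template="templates/web-starter/",
    code_samples=["samples/pagination.py", "samples/auth_middleware.py"],
    best_practice_guides=["guides/dependency-management-with-poetry.md"],
    recommended_libraries={"fastapi": "0.115", "pydantic": "2.9"},
)
```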
To maximize success, the datasets would be created and curated by language and framework experts. However, every journey starts with a single step; in this case, that step is a person with the capacity and willingness to forge ahead and make the first FKB. There are a few questions this pioneer would need to answer:
1. What license should the FKB use? Ideally, these datasets would be available to everyone. To ensure the broadest usage, pick something extremely permissive, such as the Unlicense or CC0.
2. Where should the FKB be stored? Ideally, this is a place where people can collaborate on datasets and easily modify a dataset if they disagree with some choices. For example, a GitHub repository is a good fit because the template can be used directly and it allows for stars, pull requests and forks. A Hugging Face dataset could also work. Another idea is to include an ai-hints.json file directly in an existing repository; a sketch of what such a file might contain follows this list.
3. What test data should be included for LLMs? In the future, one of the challenges will be evaluating an LLM’s performance against a given dataset. These evaluations would require test data as well as training data. To address this, the FKB should contain domain-specific examples that can be used as a test set to evaluate the performance of LLMs; a minimal sketch of such an evaluation loop also follows this list.
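For the ai-hints.json idea in question 2, here is one possible shape for such a file, written out from Python. No such format has been standardized, so every key below is an illustrative assumption.

```python
import json

# Purely illustrative: what an ai-hints.json dropped into a repository might hold.
ai_hints = {
    "language": "python",
    "package_manager": "poetry",          # tell assistants not to default to pip
    "preferred_libraries": ["httpx", "pydantic"],
    "deprecated": ["imp", "distutils"],   # features the ecosystem is phasing out
    "style_guides": ["docs/contributing.md"],
}

with open("ai-hints.json", "w", encoding="utf-8") as f:
    json.dump(ai_hints, f, indent=2)
```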
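For question 3, the sketch below shows how domain-specific FKB examples could double as a test set. The example format, the exact-substring grading and the generate callable are all assumptions; real evaluations would need richer scoring.

```python
# Hypothetical FKB test examples: a prompt plus something the answer must contain.
test_examples = [
    {"prompt": "Add the httpx package to this project.",
     "expected_substring": "poetry add httpx"},
    {"prompt": "Pin this project's dependencies.",
     "expected_substring": "poetry lock"},
]

def evaluate(generate, examples) -> float:
    """Score an assistant: the fraction of answers containing the expected snippet.

    `generate` is any callable that maps a prompt string to the model's reply.
    """
    passed = sum(
        ex["expected_substring"] in generate(ex["prompt"])
        for ex in examples
    )
    return passed / len(examples)

if __name__ == "__main__":
    # Stand-in model that always recommends pip; a Poetry-aware FKB should raise this score.
    baseline = lambda prompt: "Run `pip install httpx` to add the package."
    print(f"baseline accuracy: {evaluate(baseline, test_examples):.2f}")
```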
Coding Assistants Become Extraordinary Tools
Fostering a collaborative, community-driven approach to curating and sharing knowledge can help ensure that coding assistants become extraordinary tools that empower developers of all skill levels. This journey begins with the collective effort of the tech community to create and refine datasets that keep pace with the ever-changing landscape of programming languages and tools. Together we can build a future where coding assistants simplify coding and inspire innovation across the global developer ecosystem.