AI to grade notebooks for initial check

Notebooks go through multiple checks. As a student, would you be comfortable with AI giving your notebook the initial check to sort out the notebooks?

Some teams will already be doing stuff like this to perfect their notebooks. If I had this I would grade my students each week based on what the AI told me… to an extent.

While I think using AI to grade notebooks to find ways to improve might be good, I don’t think it should be used for the actual judging process.

I wouldn’t trust an AI to understand the notebook judging process properly because LLMs are known to just make stuff up. As an example, I just asked ChatGPT to “explan the vex robotics competition notebook judging process” and it outputted some stuff that is just blatantly wrong (all judging interviews are before matches start, students use slideshows and walk through their notebook during the interview, etc.). I also had it judge a random entry from my team’s notebook last year, and it deducted points for not doing something we explained we didn’t have time for, and for not coming up with more alternative solutions even though we already had three. Of course you would do much more prompt engineering than I did to get it to understand the judging process, but it would still likely get things wrong.

Also, the AI would probably grade very inconsistently, because it will only ever see one notebook before having its memory erased, and LLMs have randomness designed into them to make responses more varied. There’s no “bar” set for the AI on what makes a good notebook, so it could arbitrarily decide good notebooks are worse than they are and vice versa. I had ChatGPT grade the same notebook entry again with some integral parts of the engineering design process removed, like identifying the problem and developing multiple solutions, and instead of marking the new entry down for missing these steps, it marked the notebook down because “some technical terms might not be immediately clear to those unfamiliar with Vex Robotics” and “it could be improved with the inclusion of diagrams or sketches to visually represent the design and functioning of the vertical wings.” This is despite both versions of the entry having the same terminology, and the pictures from the meeting not being given to the AI either time.
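To illustrate the randomness point, here is a minimal sketch (not any real grading system; the verdicts and scores are made up) of how sampling with a nonzero temperature makes repeated runs on the same input land on different answers:

```python
# Minimal sketch of temperature sampling: with temperature > 0 the model
# samples from a probability distribution instead of always taking the top
# choice, so the same prompt can produce different verdicts on each run.
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores a model might assign to possible verdicts for one entry.
verdicts = ["developed", "developing", "needs more diagrams"]
logits = [2.0, 1.6, 1.2]

probs = softmax_with_temperature(logits, temperature=1.0)
for trial in range(5):
    # random.choices samples according to the weights, so identical inputs
    # can come back with different verdicts.
    print(trial, random.choices(verdicts, weights=probs, k=1)[0])
```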

I also don’t think it would work well for any parts of the notebook other than text, like CAD, pictures, and general layout/organization. Assuming digital submission, you can copy-paste the text to give to the AI, but your only way to convey anything else is to take a picture or screenshot of the notebook and send that, which it is less likely to interpret properly (not to mention that AIs don’t always support image input). Copy-pasting wouldn’t even work for a scanned physical notebook, so you could only rely on pictures. Both copy-pasting and pictures would also take a long time for a whole notebook, meaning you would need to handpick a few random meetings to give to the AI, giving it less insight into the notebook as a whole than a human reader who can simply turn pages.

All in all, I think it would probably take more effort to wrangle an AI into judging properly than to just judge manually. Even after the prompt engineering and feeding in many pages to get a good response, you still have to interpret the AI’s output and transfer it into the actual notebook rubric, when you could have just skimmed the original notebook with much less hassle.

11 Likes

This is an interesting idea. I have done a bit of research on LLMs, specifically in the field of reducing the size of a model while maintaining accuracy. Especially with large LLMs like ChatGPT, it would be very difficult to get something small enough to run on a laptop.

However, with a bit of work, I do think it would be possible (and not impossibly difficult) to train a model to detect the difference between a developing notebook and a developed notebook. The notebook could first go through a character-recognition model to convert a scanned handwritten notebook into text and images, which would then be fed in as a series of tokens as the input. This would still be quite difficult.
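For what it’s worth, here is a very rough sketch of that pipeline under some big assumptions: pytesseract standing in for the character-recognition step, and a simple bag-of-words classifier standing in for a purpose-trained model. The labeled_notebooks training data and the page filenames are hypothetical.

```python
# Rough sketch, not a real judging tool: OCR scanned pages to text, then feed
# the text to a simple developing/developed classifier. pytesseract and the
# tiny labeled dataset below are stand-ins for a purpose-trained model.
import pytesseract
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def scanned_page_to_text(image_path: str) -> str:
    """OCR one scanned notebook page into plain text."""
    return pytesseract.image_to_string(Image.open(image_path))

# Hypothetical training data: notebook text plus a human-assigned label.
labeled_notebooks = [
    ("identified the problem, brainstormed three solutions, tested each ...", "developed"),
    ("built the robot at practice today, it drives", "developing"),
]
texts = [text for text, _ in labeled_notebooks]
labels = [label for _, label in labeled_notebooks]

vectorizer = TfidfVectorizer()
classifier = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Classify a new scanned notebook (pages concatenated into one document).
pages = [scanned_page_to_text(p) for p in ["page1.png", "page2.png"]]
print(classifier.predict(vectorizer.transform([" ".join(pages)])))
```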

Additionally, I would not actually recommend using a preexisting LLM for this process without significant retraining as they are general purpose and their scope is too large to adequately address this problem.

I think that making an AI to judge notebooks accurately would be possible, but it would require several very complex stages and you would need a lot of notebooks to use as training data. I don’t think any publicly available models allow users to upload hundreds of pages as an input.
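On the input-size limit, the usual workaround (sketched below; the word budget is a crude stand-in for a real token limit, and the file name is hypothetical) is to split the notebook into chunks and feed them to the model one at a time, which loses the whole-notebook view.

```python
# Sketch of chunking a long notebook so each piece fits under a rough size
# budget; "notebook.txt" is a hypothetical text export of the notebook.
def chunk_by_word_budget(text: str, max_words: int = 2000):
    """Yield pieces of the notebook no longer than max_words words each."""
    words = text.split()
    for start in range(0, len(words), max_words):
        yield " ".join(words[start:start + max_words])

notebook_text = open("notebook.txt").read()
chunks = list(chunk_by_word_budget(notebook_text))
print(f"{len(chunks)} chunks to send to the model one at a time")
```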

6 Likes

Speaking as a participant in VIQRC, I wouldn’t mind if AI did the initial check, as that could ease the pressure on the judges. Where I would take issue is if the AI went unchecked; I’m not sure AI could quite understand the design process. What I mean is, a judge can understand that one team has said one thing that shows they used the design process, while another team has said something different that also shows the design process. I don’t think an AI would be able to accurately read and understand a notebook. It could also have a hard time reading images and some handwriting.
I think it could definitely work as a first check.

1 Like

I’m going to be brutally honest here…

My Past

I was a teaching assistant for 2 years in the College of Engineering at Texas A&M, and I graded thousands of college student papers for $11.50/hour. I'm going to be honest with you: the money is NOT worth the effort in a realistic sense, but as a student it at least put food on my plate while in college. Additionally, I have a feeling that teachers nowadays are quite underpaid for the work they do, and the shortcomings are getting worse as many cities pull the rug out from under teachers financially.

However,

The State of AI

I have used numerous AI chat systems, including NVIDIA Chat with RTX (which runs a chat system locally on your PC), ChatGPT, and Meta's Llama. And consistently I have concluded that AI is not at any satisfactory level to be relied on in academia. To put it bluntly, AI is really helpful as a starter template, but I have never had AI code a solution that worked the first time. This is because the state of AI right now, as we know it, is that systems like ChatGPT hallucinate. That is, they mix the rules and standardized systems of one thing with the rules and standardized systems of other things they learned. For example, if you ask ChatGPT to write an essay with a Works Cited and proper citations and sources, it won't even ask what format but may instead write an essay with APA citations, an MLA works cited, and paragraphs in Chicago style. Even if you ask it to write an essay in MLA, it may occasionally deviate and use rules from APA. Additionally, the sources and links it generates often do not exist as valid sources. And even if you point this out, the AI is still likely to spit out the same problem or just make up a random, irrelevant solution. These sorts of responses are known as "hallucinations," where the AI makes up its own rules and gaslights itself into thinking it is correct. That being said, a system like ChatGPT is incapable of being fully relied upon to grade papers and therefore would not be reliable.

The Truth about AI Grading Solutions

The only solution I know of is Packback, and my oh my, I really dislike that system. First of all, I know of peers who have circumvented the AI, tricking it into giving full marks by using special characters, formatting, etc. That being said, if you create an AI grading system, there is a high probability it is flawed, and over time the student body may figure out the AI's patterns and easily trick it into giving full marks, completely discrediting the students who genuinely put in the effort.

The Verdict

Although I understand the situation of professors/teachers being underpaid, I feel like I would rather bite the bullet and make sure that the student body is graded according to the effort they put into their notebooks, without relying on AI.

Alternative Solutions

Because Engineering Notebooks are writing-intensive, I would highly suggest finding alternatives. For example, students could take AP English to further improve their comprehension and writing skills, or the department could consider labeling your class as "Writing Intensive" and making AP English a corequisite, with the added bonus that students could potentially earn college credit as well. Ultimately, success with Engineering Notebooks comes down to
  1. How well can you write?
  2. How well can you follow directions of a notebook template?
  3. How organized and structured can you make the notebook?

Homing in on the obvious weaknesses and working to fix the shortfalls may help students improve their notebooks naturally, without needing to rely on external factors like a judge to tell them what is wrong. Ultimately, by doing so, the notebook will surpass the state level and be able to compete at the world level through simple self-assessment and by following the notebook guidelines with academic-level writing skills and practices.

9 Likes

This is a mildly inflammatory like-farming comment for all the mentors who think it would be reprehensible to abdicate anything of the assessment and guidance of young engineers to a piece of software that explicitly only knows what other children have written, and what teachers wrote about them - often just as a number or grade - and does not understand why. Run a spelling and grammar checker. Style is there to make for more ergonomic reading, and is independent of the content we’re supposed to be interacting over. Young engineers: don’t hit the like button. You don’t know me, so you should look to someone with more experience whom you do know. They might write something in response. And your opinion is absolutely crucial in all these discussions because if you’re wrong so far, we should be working to teach you, and if you’re right already, we absolutely must listen carefully and recognize that you are.

1 Like

Because of its inconsistencies (and stubbornness), I do not think using AI to grade your notebooks is a wise idea. However, AI could be used to check the spelling, conciseness, and readability of your notebook. Other than this, listening to AI’s comments on your notebook might not be the best solution.

(Foster, feel free to correct my spelling, grammer and word choice!)

2 Likes

I am going to pipe in only once on this topic: this past season my judging team reported the use of AI-generated passages in engineering notebooks and code documentation, not the checking of spelling, grammar, or quality of the engineering design process.

In my book, AI-generated content without context for how the tools were used does not represent the skill level of the team.

End of PSA.

I do not know which teams nor impact on judging process.

8 Likes

:face_with_raised_eyebrow: :thinking: :face_with_open_eyes_and_hand_over_mouth:
VexForum conspiracies…

2 Likes

My mildly informed two cents on the topic.

If RECF were to use award-winning notebooks from the last few Worlds to train the model, then I think it could be quite useful as a means for teams to gather feedback on what their notebook might be lacking, or other feedback like how clear their wording is. This could allow teams to get feedback and make incremental progress that they usually cannot get from the regular judging process.

At this point, I would not trust it with even the first pass at judging without a lot of parallel trials where the AI judging wouldn’t count but could be compared against human judgment. The other thing I think could cause a lot of difficulty in the end is the variety of pictures, schematics, and diagrams that could be present across notebooks. Not being an expert in the technology, I still suspect that humans are far better equipped to judge what is in a diagram or picture and how well it relates to the text. I know the diagrams that my elementary kids try to draw up are frequently crude and not to a consistent scale. I think AI would have a very difficult time judging something like that, even though there may be cases where the diagram tells a very good story even without a good artist behind it.
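If parallel trials ever happened, scoring the comparison itself would be simple; here is a hedged sketch (the verdict lists are made up) using Cohen’s kappa to measure how often the AI’s developing/developed call agrees with the human judges beyond chance.

```python
# Sketch of scoring a parallel trial: compare the AI's verdicts against the
# human judges' verdicts on the same notebooks. The lists below are made up.
from sklearn.metrics import cohen_kappa_score

human_verdicts = ["developed", "developing", "developed", "developing", "developed"]
ai_verdicts    = ["developed", "developed",  "developed", "developing", "developing"]

# kappa near 1.0 means strong agreement, near 0 means no better than chance;
# you'd want the AI-vs-human number close to human-vs-human before trusting it.
print(cohen_kappa_score(human_verdicts, ai_verdicts))
```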

Now, how likely is any of the above to actually happen? I’m not holding my breath, but maybe this is some of the impetus behind requiring digital notebooks in the future, should that come to pass. That would mean lots of potential training material that could more quickly get to a model that would start being usable with enough training and correction.

2 Likes

I would definitely agree with the parallel trials. I don’t know how familiar you are with the judging process, but from my experience, the first pass through a notebook generally takes less than 5 minutes for a human to determine whether the notebook is “developing” or “developed.” I haven’t judged in other regions, but I suspect that more competitive regions have slightly higher standards for what counts as developed, and vice versa for less competitive regions. This would be tricky to train an AI on.
This would also be the simplest AI. The most complex AI, one that graded the entire notebook according to the rubric and explained why it graded it that way… let’s just say I know professors and graduate students working on the cutting edge of explainable AI who would publish several papers off something that complex.

2 Likes

This might be a good experiment for the RECF to conduct on notebook grading. The notebooks are a representation of the team’s skill, so the AI might be comparing apples to oranges.

I blame (Poll) What shooting mechanism will you use for rapid relay - VEX IQ General Discussion - VEX Forum.

3 Likes

I think you significantly overestimate the amount of time it takes to sort a notebook into developing or developed, and underestimate the amount of time it would take to train an AI to sort them. As a judge with a fair amount of experience, I can sort a notebook into developing or developed in about 10 seconds on average, maybe 30 for ones closer to the edge. AI really isn’t going to help much.

2 Likes
