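In this article, we explore techniques for fine-tuning a large language model, using Google's pretrained T5 model as the example.
Translator | Zhu Xianzhong
Reviewer | Sun Shujuan
Remember when you first started writing SQL queries to analyze data? Most of the time you just wanted to see the "best-selling products" or the "weekly product visits". So why write SQL queries at all, rather than simply asking for what you want in natural language?
Thanks to recent advances in NLP (Natural Language Processing), this is now possible. We can not only use LLMs (Large Language Models) but also teach them new skills, which is known as transfer learning. With this approach, a pretrained model serves as the starting point, and even with a small labeled dataset you can still get strong performance compared with training on that data alone.
In this tutorial, we use Google's text-to-text generation model T5 and apply transfer learning with custom data so that it can convert basic questions into SQL queries. We add a new task to T5 called "translate English to SQL". By the end of this tutorial, you will have a trained model that can translate the following sample query: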
Cars built after 2020 and manufactured in Italy
into the following SQL query:
SELECT name FROM cars WHERE location = 'Italy' AND date > 2020
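To make the target of the translation concrete, here is a minimal sketch that runs the example SQL against an in-memory SQLite database. The table layout and the three rows are hypothetical, invented only for this illustration; the article does not define a real schema.

```python
import sqlite3

# Hypothetical table and rows, invented only to exercise the example query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (name TEXT, location TEXT, date INTEGER)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [
        ("Alpha", "Italy", 2021),
        ("Beta", "Italy", 2019),
        ("Gamma", "Germany", 2022),
    ],
)

# The SQL that the fine-tuned model is expected to produce for
# "Cars built after 2020 and manufactured in Italy".
sql = "SELECT name FROM cars WHERE location = 'Italy' AND date > 2020"
names = [row[0] for row in conn.execute(sql).fetchall()]
print(names)
```

Only the row that is both in Italy and dated after 2020 comes back, which is the behavior the English question asks for.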
Note: the complete Gradio demo is available at https://huggingface.co/spaces/mecevit/english-to-sql, and the full source code of the Layer project is at https://app.layer.ai/layer/t5-fine-tuning-with-layer.
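1. Building the training data
Unlike ordinary language-to-language translation datasets, English-to-SQL translation pairs can be built programmatically from templates. Here are some of these templates: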
import random

# Each entry pairs an English pattern with its SQL pattern.
# Placeholders in square brackets are filled in later.
templates = [
    ["[prop1] of [nns]", "SELECT [prop1] FROM [nns]"],
    ["[agg] [prop1] for each [breakdown]", "SELECT [agg]([prop1]) , [breakdown] FROM [prop1] GROUP BY [breakdown]"],
    ["[prop1] of [nns] by [breakdown]", "SELECT [prop1] , [breakdown] FROM [nns] GROUP BY [breakdown]"],
    ["[prop1] of [nns] in [location] by [breakdown]", "SELECT [prop1] , [breakdown] FROM [nns] WHERE location = '[location]' GROUP BY [breakdown]"],
    ["[nns] having [prop1] between [number1] and [number2]", "SELECT name FROM [nns] WHERE [prop1] > [number1] and [prop1] < [number2]"],
    ["[prop] by [breakdown]", "SELECT name , [breakdown] FROM [prop] GROUP BY [breakdown]"],
    ["[agg] of [prop1] of [nn]", "SELECT [agg]([prop1]) FROM [nn]"],
    ["[prop1] of [nns] before [year]", "SELECT [prop1] FROM [nns] WHERE date < [year]"],
    ["[prop1] of [nns] after [year] in [location]", "SELECT [prop1] FROM [nns] WHERE date > [year] AND location='[location]'"],
    ["[nns] [verb] after [year] in [location]", "SELECT name FROM [nns] WHERE location = '[location]' AND date > [year]"],
    # Bounds ordered to match the other "between" template above.
    ["[nns] having [prop1] between [number1] and [number2] by [breakdown]", "SELECT name , [breakdown] FROM [nns] WHERE [prop1] > [number1] AND [prop1] < [number2] GROUP BY [breakdown]"],
    ["[nns] with a [prop1] of maximum [number1] by their [breakdown]", "SELECT name , [breakdown] FROM [nns] WHERE [prop1] <= [number1] GROUP BY [breakdown]"],
    ["[prop1] and [prop2] of [nns] since [year]", "SELECT [prop1] , [prop2] FROM [nns] WHERE date > [year]"],
    ["[nns] which have both [prop1] and [prop2]", "SELECT name FROM [nns] WHERE [prop1] IS true AND [prop2] IS true"],
    ["Top [number1] [nns] by [prop1]", "SELECT name FROM [nns] ORDER BY [prop1] DESC LIMIT [number1]"],
]

# Pick one template at random and show both sides of the pair.
template = random.choice(templates)
print("Sample Query Template :", template[0])
print("SQL Translation :", template[1])
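The templates only become training pairs once the bracketed placeholders are filled in. The following is a minimal sketch of one way to do that, not the project's actual generator: the `VOCAB` word lists and the `fill_template` helper are hypothetical names introduced for illustration. The key point is that each placeholder receives the same value on the English side and the SQL side.

```python
import random
import re

# Hypothetical vocabulary; the real project may draw on different word lists.
VOCAB = {
    "nns": ["cars", "wines", "cities"],
    "prop1": ["price", "population", "rating"],
    "breakdown": ["country", "province", "year"],
    "location": ["Italy", "France", "Japan"],
}


def fill_template(english, sql, rng):
    """Replace each [slot] with one value shared by both sides of the pair."""
    slots = set(re.findall(r"\[(\w+)\]", english + " " + sql))
    for slot in sorted(slots):  # sorted so a fixed seed gives a fixed result
        value = rng.choice(VOCAB[slot])
        english = english.replace(f"[{slot}]", value)
        sql = sql.replace(f"[{slot}]", value)
    return english, sql


rng = random.Random(42)
template = [
    "[prop1] of [nns] in [location] by [breakdown]",
    "SELECT [prop1] , [breakdown] FROM [nns] WHERE location = '[location]' GROUP BY [breakdown]",
]
query, sql = fill_template(template[0], template[1], rng)
print(query)
print(sql)
```

Repeating this over all templates with many random draws would yield a dataset of English/SQL pairs of whatever size is needed.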
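In the full project, the pairs are generated inside a build_dataset function decorated with Layer's @dataset decorator (that function is not reproduced here). We can pass the function to Layer as follows: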
layer.run([build_dataset])
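Once this run completes, we can start building a custom dataset loader for fine-tuning the T5 model.
2. Creating the data loader
In this project, the dataset is essentially a custom implementation of a PyTorch Dataset. See the following code: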
import torch
from torch.utils.data import Dataset


class EnglishToSQLDataSet(Dataset):
    """PyTorch dataset that tokenizes English/SQL pairs for T5."""

    def __init__(self, dataframe, tokenizer, source_len, target_len, source_text, target_text):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.source_len = source_len
        self.target_len = target_len
        # Add the task prefix and the decoder start/end tokens when the columns
        # are stored, so that __getitem__ sees the modified text.
        self.source_text = "translate English to SQL: " + self.data[source_text]
        self.target_text = "<pad>" + self.data[target_text] + "</s>"

    def __len__(self):
        return len(self.target_text)

    def __getitem__(self, index):
        source_text = str(self.source_text[index])
        target_text = str(self.target_text[index])

        # Collapse runs of whitespace into single spaces.
        source_text = ' '.join(source_text.split())
        target_text = ' '.join(target_text.split())

        # Pad or truncate both sides to fixed lengths.
        source = self.tokenizer.batch_encode_plus(
            [source_text], max_length=self.source_len,
            truncation=True, padding="max_length", return_tensors='pt')
        target = self.tokenizer.batch_encode_plus(
            [target_text], max_length=self.target_len,
            truncation=True, padding="max_length", return_tensors='pt')

        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()
        target_ids = target['input_ids'].squeeze()
        target_mask = target['attention_mask'].squeeze()

        return {
            'source_ids': source_ids.to(dtype=torch.long),
            'source_mask': source_mask.to(dtype=torch.long),
            'target_ids': target_ids.to(dtype=torch.long),
            'target_ids_y': target_ids.to(dtype=torch.long)
        }
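The dataset above asks the tokenizer to pad or truncate every sequence to a fixed length and to return an attention mask. As a rough illustration of what that produces, here is a plain-Python sketch. The `pad_or_truncate` helper and the token ids are hypothetical, and the sketch ignores how a real tokenizer treats special tokens during truncation; the only value taken from T5 is its pad token id of 0.

```python
PAD_ID = 0  # T5's pad token id


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (ids, mask) of exactly max_length items.

    The mask holds 1 for real tokens and 0 for padding, so the model
    can ignore padded positions.
    """
    ids = token_ids[:max_length]            # truncate if too long
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)           # pad if too short
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Hypothetical token ids for a short sentence ending in </s> (id 1).
short_ids, short_mask = pad_or_truncate([37, 512, 9, 1], max_length=6)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8], max_length=3)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Fixed-length sequences are what allow the DataLoader to stack individual examples into batch tensors of a single shape.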
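3. Fine-tuning the T5 model
Our dataset is now ready and registered with Layer. Next, we develop the fine-tuning logic. We decorate the training function with @model and pass it to Layer, which trains the model on Layer's infrastructure and registers it in our project.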
import layer


def train(epoch, tokenizer, model, device, loader, optimizer):
    import torch
    model.train()
    for batch_idx, data in enumerate(loader):
        y = data['target_ids'].to(device, dtype=torch.long)
        # Decoder inputs drop the last token; labels drop the first one.
        y_ids = y[:, :-1].contiguous()
        lm_labels = y[:, 1:].clone().detach()
        # Padding positions are set to -100 so the loss ignores them.
        lm_labels[y[:, 1:] == tokenizer.pad_token_id] = -100
        ids = data['source_ids'].to(device, dtype=torch.long)
        mask = data['source_mask'].to(device, dtype=torch.long)

        outputs = model(input_ids=ids, attention_mask=mask,
                        decoder_input_ids=y_ids, labels=lm_labels)
        loss = outputs[0]

        # Global step across epochs, used as the x-axis of the loss curve.
        step = (epoch * len(loader)) + batch_idx
        layer.log({"loss": float(loss)}, step)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
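The slicing in the training loop is easier to follow on a single sequence. The sketch below reproduces the same shift with plain Python lists instead of tensors. The `shift_targets` helper and the token ids are hypothetical; -100 is the default index that PyTorch's cross-entropy loss ignores, and 0 and 1 are T5's pad and end-of-sequence ids.

```python
PAD_ID = 0           # T5's pad token id, also used as the decoder start token
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def shift_targets(target_ids, pad_id=PAD_ID):
    """Split one target sequence into decoder inputs and labels.

    The decoder input at position i is the token before the label at
    position i, so the model learns to predict the next token.
    """
    decoder_input_ids = target_ids[:-1]
    labels = [IGNORE_INDEX if t == pad_id else t for t in target_ids[1:]]
    return decoder_input_ids, labels


# Hypothetical ids: <pad> start token, three SQL tokens, </s>, two pads.
target = [0, 101, 102, 103, 1, 0, 0]
decoder_input_ids, labels = shift_targets(target)
print(decoder_input_ids)
print(labels)
```

The leading pad token only ever appears as a decoder input, while trailing pads in the labels become -100 and contribute nothing to the loss.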
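The functions that follow use three separate Layer decorators:
@model: tells Layer that the function trains an ML model.
@fabric: tells Layer which compute resources (CPU, GPU, and so on) are needed to train the model. Because T5 is a large model, we need a GPU to fine-tune it. Layer documents the list of fabrics you can use.
@pip_requirements: specifies the Python packages needed to fine-tune our model.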
import layer
# Import path for the decorators is assumed from the Layer SDK.
from layer.decorators import model, fabric, pip_requirements


@model("t5-tokenizer")
@fabric("f-medium")
@pip_requirements(packages=["torch", "transformers", "sentencepiece"])
def build_tokenizer():
    from transformers import T5Tokenizer
    # Load the tokenizer from Hugging Face
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    return tokenizer


@model("t5-english-to-sql")
@fabric("f-gpu-small")
@pip_requirements(packages=["torch", "transformers", "sentencepiece"])
def build_model():
    from torch.utils.data import DataLoader
    from transformers import T5ForConditionalGeneration
    from torch import cuda
    import numpy as np
    import torch

    parameters = {
        "BATCH_SIZE": 8,
        "EPOCHS": 3,
        "LEARNING_RATE": 2e-05,
        "MAX_SOURCE_TEXT_LENGTH": 75,
        "MAX_TARGET_TEXT_LENGTH": 75,
        "SEED": 42
    }

    # Log the parameters to Layer
    layer.log(parameters)

    # Set seeds for reproducibility
    torch.manual_seed(parameters["SEED"])
    np.random.seed(parameters["SEED"])
    torch.backends.cudnn.deterministic = True

    # Load the tokenizer from Layer
    tokenizer = layer.get_model("t5-tokenizer").get_train()

    # Load the pretrained model from Hugging Face
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    device = 'cuda' if cuda.is_available() else 'cpu'
    model.to(device)

    dataframe = layer.get_dataset("english_sql_translations").to_pandas()
    source_text = "query"
    target_text = "sql"
    dataframe = dataframe[[source_text, target_text]]

    # Train on a reproducible 80% sample of the data
    train_dataset = dataframe.sample(frac=0.8, random_state=parameters["SEED"])
    train_dataset = train_dataset.reset_index(drop=True)

    layer.log({"FULL Dataset": str(dataframe.shape),
               "TRAIN Dataset": str(train_dataset.shape)})

    training_set = EnglishToSQLDataSet(
        train_dataset, tokenizer,
        parameters["MAX_SOURCE_TEXT_LENGTH"],
        parameters["MAX_TARGET_TEXT_LENGTH"],
        source_text, target_text)

    dataloader_parameters = {
        'batch_size': parameters["BATCH_SIZE"],
        'shuffle': True,
        'num_workers': 0
    }
    training_loader = DataLoader(training_set, **dataloader_parameters)

    optimizer = torch.optim.Adam(params=model.parameters(),
                                 lr=parameters["LEARNING_RATE"])

    for epoch in range(parameters["EPOCHS"]):
        train(epoch, tokenizer, model, device, training_loader, optimizer)

    return model
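The training function samples 80% of the rows with a fixed seed so that the same rows are selected on every run. The sketch below illustrates the same idea using only the standard library rather than pandas. The `split_indices` helper is hypothetical, and the held-out rows it returns are not used by the training code in this article; they could serve as a validation set.

```python
import random


def split_indices(n_rows, frac, seed):
    """Return (train, held_out) row indices, reproducible for a given seed."""
    rng = random.Random(seed)
    n_train = round(n_rows * frac)
    train = sorted(rng.sample(range(n_rows), n_train))
    chosen = set(train)
    held_out = [i for i in range(n_rows) if i not in chosen]
    return train, held_out


train_rows, held_out_rows = split_indices(n_rows=10, frac=0.8, seed=42)
print(len(train_rows), len(held_out_rows))
```

Calling the function again with the same seed returns the identical split, which is what makes runs comparable.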
Now we can pass the tokenizer and model training functions to Layer to train our model on a remote GPU instance.
layer.run([build_tokenizer, build_model], debug=True)
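After training completes, the model and its metrics, including the loss curve logged during training, can be found in the Layer UI.
4. Building the Gradio demo
Gradio (https://gradio.app/) is a fast way to demo a machine learning model with a friendly web interface that anyone can use, anywhere.
We will build an interactive demo with Gradio to give readers who want to try the model a user interface.
Create a Python file named app.py with the following code: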
import gradio as gr
import layer

# Fetch the fine-tuned model and its tokenizer from Layer
model = layer.get_model('layer/t5-fine-tuning-with-layer/models/t5-english-to-sql').get_train()
tokenizer = layer.get_model('layer/t5-fine-tuning-with-layer/models/t5-tokenizer').get_train()


def greet(query):
    # Use the same task prefix that was used during fine-tuning
    input_ids = tokenizer.encode(f"translate English to SQL: {query}", return_tensors="pt")
    outputs = model.generate(input_ids, max_length=1024)
    sql = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return sql


iface = gr.Interface(fn=greet, inputs="text", outputs="text", examples=[
    "Show me the average price of wines in Italy by provinces",
    "Cars built after 2020 and manufactured in Italy",
    "Top 10 cities by their population"
])
iface.launch()
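In the code above:
We fetch the fine-tuned model and its tokenizer from Layer.
We create a simple Gradio user interface with one input text field for the query and one output text field that displays the predicted SQL query.
To run this small Python application, we also need a few extra libraries, so we create a requirements.txt file with the following content: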
layer-sdk==0.9.350435
torch==1.11.0
sentencepiece==0.1.96
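Now we are ready to publish the Gradio application:
(1) Go to the Hugging Face website and create a Space. (Translator's note: Hugging Face is a US open-source startup whose business has expanded from chatbots into machine learning and related fields.)
(2) Don't forget to select Gradio as the Space SDK.
Next, clone your Space repository to a local directory with the following command: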
$ git clone [YOUR_HUGGINGFACE_SPACE_URL]
Then put the requirements.txt and app.py files into the cloned directory and run the following commands in a terminal:
$ git add app.py
$ git add requirements.txt
$ git commit -m "Add application files"
$ git push
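Now switch to the Space you created. Once deployment finishes, you will see the interface of your sample application.
5. Summary
In this article, we learned how to fine-tune a large language model, using Google's T5 as the example. After working through it, you should be able to design your own task and fine-tune T5 for your own application.
Finally, you can also explore the fine-tuning T5 project (https://app.layer.ai/layer/t5-fine-tuning-with-layer) and adapt it to your own task.
Original article and references: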
https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
https://app.layer.ai/layer/t5-fine-tuning-with-layer
https://huggingface.co/spaces
https://www.kdnuggets.com/2022/05/query-table-t5.html
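About the translator
Zhu Xianzhong is a 51CTO community editor, a 51CTO expert blogger and lecturer, a computer science teacher at a university in Weifang, and a veteran freelance programmer. He initially focused on Microsoft technologies (authoring three technical books on ASP.NET AJAX and Cocos 2d-X) and for more than a decade has worked in the open-source world, where he is familiar with popular full-stack web development technologies. He also knows IoT development based on OneNet/AliOS + Arduino/ESP32/Raspberry Pi and big data development with Scala + Hadoop + Spark + Flink.
Editor: Wu Xiaoyan. Source: 51CTO Tech Stack (51CTO技术栈)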